Article

The WaCky wide web: a collection of very large linguistically processed web-crawled corpora

LANGUAGE RESOURCES AND EVALUATION (2009)

Journal

LANGUAGE RESOURCES AND EVALUATION

Volume 43, Issue 3, Pages 209-226

Publisher

SPRINGER

DOI: 10.1007/s10579-009-9081-4

Keywords

Annotated corpora; Corpus construction; General-purpose linguistic resources; English; German; Italian; Web as corpus; WaCky

Abstract

This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
