Learning multilingual named entity recognition from Wikipedia

ARTIFICIAL INTELLIGENCE (2013)

Journal

ARTIFICIAL INTELLIGENCE

Volume 194, Issue -, Pages 151-175

Publisher

ELSEVIER SCIENCE BV

DOI: 10.1016/j.artint.2012.03.006

Keywords

Named entity recognition; Information extraction; Wikipedia; Semi-structured resources; Annotated corpora; Semi-supervised learning

Funding

Computable News project at the Capital Markets CRC
University of Sydney Honours Scholarship
Vice-Chancellor's Research Scholarship
Australian Postgraduate Award
Capital Markets CRC PhD top-up scholarship
Australian Research Council [DP0665973, DP1097291]

Abstract

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7200 manually labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into NE annotations by projecting the target article's classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman and Schone, 2008 [61]; Mika et al., 2008 [46]); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually annotated Wikipedia text. © 2012 Elsevier B.V. All rights reserved.
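
The link-projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_links`, the `ARTICLE_TYPES` table, the `NON` label for non-entity articles, the IOB-style tag scheme and the example sentence are all assumptions made for illustration. In the paper the article classifications come from a trained classifier, not a hand-written table.

```python
from typing import Dict, List, Tuple

# Hypothetical article classifications. In the paper these are predicted by a
# classifier trained on manually labelled articles; here they are hard-coded.
ARTICLE_TYPES: Dict[str, str] = {
    "Holden": "ORG",
    "Port Melbourne, Victoria": "LOC",
    "Car": "NON",  # assumed label for articles that are not named entities
}

# A link is (start token index, end token index exclusive, target article title).
Link = Tuple[int, int, str]


def project_links(
    tokens: List[str], links: List[Link], article_types: Dict[str, str]
) -> List[str]:
    """Project each link target's NE type onto the tokens of its anchor text."""
    tags = ["O"] * len(tokens)
    for start, end, target in links:
        ne_type = article_types.get(target)
        # Skip links whose target is unclassified or is not a named entity.
        if ne_type is None or ne_type == "NON":
            continue
        tags[start] = "B-" + ne_type
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type
    return tags


tokens = ["Holden", "is", "based", "in", "Port", "Melbourne", "."]
links = [(0, 1, "Holden"), (4, 6, "Port Melbourne, Victoria")]
for token, tag in zip(tokens, project_links(tokens, links, ARTICLE_TYPES)):
    print(token, tag)
```

Tokens outside any link stay tagged `O` in this sketch, which is one reason the abstract mentions inferring additional links: unlinked mentions of an entity would otherwise be left unannotated.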
