Article

A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION (2015)

Journal

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION

Volume 22, Issue 5, Pages 948-956

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/jamia/ocv037

Keywords

gold-standard corpus; multilinguality; inter-annotator agreement; concept identification; semantic enrichment

Categories

Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Health Care Sciences & Services; Information Science & Library Science; Medical Informatics

Funding

Mantra project (STREP) under EU [296410, FP7 ICT-2011.4.1]

Abstract

Objective: To create a multilingual gold-standard corpus for biomedical concept recognition.

Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.

Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language.

Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres that were annotated.
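The abstract reports inter-annotator agreement as a median F-score but does not say here how annotations were matched. As a rough illustration only, the sketch below computes a pairwise F-score between annotators and takes the median over all pairs. It assumes an annotation is a (start offset, end offset, concept identifier) triple and that two annotations agree only on an exact match of all three; the function names, the matching rule, and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import median
from typing import Dict, Set, Tuple

# Assumed representation: (start_offset, end_offset, concept_id).
Annotation = Tuple[int, int, str]


def f_score(reference: Set[Annotation], candidate: Set[Annotation]) -> float:
    """F-score of `candidate` against `reference`, counting exact matches only."""
    if not reference and not candidate:
        return 1.0  # two empty annotation sets agree trivially
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def median_pairwise_f_score(annotators: Dict[str, Set[Annotation]]) -> float:
    """Median F-score over all unordered pairs of annotators."""
    scores = [
        f_score(annotators[a], annotators[b])
        for a, b in combinations(sorted(annotators), 2)
    ]
    return median(scores)


# Hypothetical toy data: three annotators for one text unit.
annotators = {
    "A1": {(0, 8, "CUI_A"), (20, 27, "CUI_B")},
    "A2": {(0, 8, "CUI_A"), (20, 27, "CUI_B")},
    "A3": {(0, 8, "CUI_A")},
}
print(median_pairwise_f_score(annotators))
```

Because precision and recall swap when the two annotators swap roles, the F-score is symmetric, so each unordered pair needs to be scored only once. The same `f_score` function could also score an individual annotator against a gold standard, which is the other comparison the abstract mentions.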
