Article

Self-training in significance space of support vectors for imbalanced biomedical event data

BMC BIOINFORMATICS (2015)

Journal

BMC BIOINFORMATICS

Volume 16, Issue -, Pages -

Publisher

BMC

DOI: 10.1186/1471-2105-16-S7-S6

Categories

Biochemical Research Methods; Biotechnology & Applied Microbiology; Mathematical & Computational Biology

Funding

Basic Science Research Program through the National Research Foundation of Korea (NRF) - Ministry of Science, ICT and Future Planning [2013R1A2A2A01068923]
National Research Foundation of Korea (NRF) grant - Korea government (MSIP) [2008-0062611]
National Research Foundation of Korea [2008-0062611] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

Abstract

Background: Pairwise relationships extracted from biomedical literature are insufficient for formulating biomolecular interactions. Extraction of complex relations (namely, biomedical events) has become the main focus of the text-mining community. However, two critical issues are seldom addressed by existing systems. First, the annotated corpus used to train a prediction model is highly imbalanced. Second, supervised models trained on only a single annotated corpus can limit system performance. Fortunately, there is a large pool of unlabeled data, containing much of the domain background, that one can exploit.

Results: In this study, we develop a new semi-supervised learning method to address the issues outlined above. The proposed algorithm efficiently exploits the unlabeled data to improve system performance. We further extend the algorithm to a two-phase learning framework: the first phase balances the training data for initial model induction, and the second phase incorporates domain knowledge into the event extraction model. The effectiveness of the method is evaluated on the Genia event extraction corpus and a PubMed document pool. The method can identify a small subset of the majority class that is sufficient for building a well-generalized prediction model, and it outperforms the traditional self-training algorithm in terms of F-measure. Our model, based on the training data and the unlabeled data pool, achieves performance comparable to state-of-the-art systems that are trained on a larger annotated set consisting of training and evaluation data.
