Article

Combining natural language processing and metabarcoding to reveal pathogen-environment associations

Journal

PLOS NEGLECTED TROPICAL DISEASES
Volume 15, Issue 4

Publisher

PUBLIC LIBRARY SCIENCE
DOI: 10.1371/journal.pntd.0008755

This study utilized Natural Language Processing to better understand the ecological niches of Cryptococcus neoformans. By analyzing metagenetic research articles through a topic modeling approach, a potential association between C. neoformans and soils associated with decomposing wood was identified. Through the use of machine learning and metagenetic data, the study highlights the importance of utilizing large-scale datasets to better understand environmental associations of rare pathogens.
Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to topic model, de novo, sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach (a search of single-locus metagenetic data, gathering of the papers connected to the datasets, de novo determination of the topics and of their number, and assignment of articles to the topics) illustrates how such an analysis pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.

Author summary

We expand the utility of Natural Language Processing (NLP), backtracking through metabarcodes and utilizing papers that may not mention our subject of interest, C. neoformans, in a departure from usual text analysis methods. We confirm that C. neoformans is associated with decomposing wood, which is reinforced by the inferred literature studied here on C. neoformans and its close congeneric relatives. This work demonstrates the potential utility of pairing NLP with single-locus metagenetic data for the study of Neglected Tropical Diseases.

While the results of this article are largely confirmatory, we present a novel method to study the ecological niches of rare pathogens that leverages the immense amount of data available to researchers in the NCBI Sequence Read Archive (SRA), combined with a text-mining analysis based on Natural Language Processing. We demonstrate that text processing, noun identification, and verb identification can play an important role in analyzing a large corpus of documents together with metagenetic data. Forging this connection requires access to all of the available ecological 18S ribosomal RNA and Internal Transcribed Spacer NCBI SRA datasets. These datasets use metabarcoding to query taxonomic diversity in eukaryotic organisms, and in the case of the Internal Transcribed Spacer, they specifically target Fungi. The presence of specific species is inferred when diagnostic 18S or ITS gene region sequences are found in the SRA data. We searched for C. neoformans in all 18S and ITS datasets available and gathered all associated journal articles that either cite the SRA data accessions or are cited in the SRA data accessions.

Published metagenetic data often have associated metadata, including latitude and longitude, temperature, and other physical characteristics describing the conditions in which the metagenetic sample was collected. These metadata are not always presented in consistent formats, so harmonizing study methods may be needed to appropriately compare metagenetic data, as is commonly required in meta-analysis studies. We present an analysis which takes as input articles associated with SRA datasets that were found to contain evidence of C. neoformans. We apply NLP methods to this corpus of articles to describe the niche of C. neoformans. Our results reinforce the current understanding of C. neoformans's niche, indicating the pertinence of employing an NLP analysis to identify the niche of an organism. This approach could further the description of virtually any other organism that routinely appears in metagenetic surveys, especially pathogens, whose ecological niches are unknown or poorly understood.
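
The abstract states that the number of topics was chosen from the model coherence score but does not say which coherence measure was used. As a rough illustration only, the sketch below assumes the UMass coherence measure (averaged over word pairs) and a toy tokenized corpus. The names `umass_coherence`, `model_coherence`, `docs`, and `candidates` are hypothetical, and the candidate topic word lists stand in for the top words that fitted Latent Dirichlet Allocation models with different topic counts might return.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass coherence of one topic.

    top_words: the topic's words, ordered from most to least probable.
    documents: tokenized corpus (a list of token lists).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]  # higher-ranked word first
        denom = doc_freq(higher)
        if denom == 0:
            continue  # word absent from the corpus; skip the pair
        scores.append(math.log((doc_freq(higher, lower) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


def model_coherence(topics, documents):
    """Mean coherence across all topics of one candidate model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Toy corpus (assumption): each document is an already tokenized article.
docs = [
    ["soil", "wood", "decay", "fungi"],
    ["soil", "wood", "fungi", "forest"],
    ["marine", "plankton", "water", "salinity"],
    ["marine", "water", "plankton", "ocean"],
    ["soil", "decay", "wood", "forest"],
]

# Hypothetical top words from models fitted with 2 and 3 topics.
candidates = {
    2: [["soil", "wood", "decay"], ["marine", "water", "plankton"]],
    3: [["soil", "water", "decay"], ["marine", "wood", "plankton"],
        ["forest", "ocean", "fungi"]],
}

scores = {k: model_coherence(topics, docs) for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)  # topic count with the highest coherence
print(best_k, scores)
```

In this toy setup the two-topic model wins because its top words co-occur within the same documents, whereas the three-topic model mixes soil and marine vocabulary. A real analysis would score models fitted on the full article corpus, and might use a different coherence measure.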
