Article

LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data

Journal

BMC BIOINFORMATICS
Volume 23, Issue 1, Pages -

Publisher

BMC
DOI: 10.1186/s12859-022-04631-z

Keywords

Biomarker selection; Metagenomics; Metabarcoding; Biomonitoring; Ecological assessment; Machine learning

Funding

  1. Food from Thought project as part of Canada First Research Excellence Fund
  2. Government of Canada through the Genomics Research and Development Initiative (GRDI) Ecobiomics project
  3. Government of Canada through Genome Canada
  4. Ontario Genomics
  5. Natural Sciences and Engineering Research Council of Canada (NSERC) [RGPIN-2020-05733]

Our work introduces LANDMark, a meta-classifier that combines characteristics of several machine learning models into a decision tree and ensemble learning framework. On amplicon sequencing data, LANDMark produces highly predictive and consistent models; it often outperformed Random Forest in synthetic and toy tests and, on data from the BE amplicon, outperformed Linear Support Vector Machine, Logistic Regression, and Stochastic Gradient Descent models.

Background: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery.

Results: We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada's Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark's generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries.

Conclusions: Our work introduces LANDMark, a meta-classifier which blends the characteristics of several machine learning models into a decision tree and ensemble learning framework. To our knowledge, this is the first study to apply this type of ensemble approach to amplicon sequencing data and we have shown that analyzing these datasets using LANDMark can produce highly predictive and consistent models.
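
Two terms in the abstract can be made concrete with a toy example: an oblique cut (a split on a weighted combination of features, as opposed to a threshold on one feature) and balanced accuracy (the mean of per-class recall). The sketch below is illustrative only and is not the LANDMark implementation. The function names, the eight toy samples, and the hand-picked weights are all assumptions made for this example; in a real ensemble the split parameters would be learned at each node.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def axis_aligned_cut(x, feature, threshold):
    """Classic decision-tree split: threshold on a single feature."""
    return int(x[feature] > threshold)


def oblique_cut(x, weights, bias):
    """Oblique split: threshold on a weighted sum of all features."""
    return int(sum(w * v for w, v in zip(weights, x)) + bias > 0)


# Hypothetical toy data: the class is 1 when the two features sum to more than 1,
# so the true boundary is a diagonal line that no single-feature cut can follow.
samples = [
    (0.1, 0.2), (0.9, 0.3), (0.2, 0.9), (0.4, 0.4),
    (0.8, 0.8), (0.6, 0.7), (0.3, 0.6), (0.7, 0.1),
]
labels = [int(a + b > 1.0) for a, b in samples]

# Hand-picked parameters (assumptions for illustration, not learned values).
axis_preds = [axis_aligned_cut(x, feature=0, threshold=0.5) for x in samples]
oblique_preds = [oblique_cut(x, weights=(1.0, 1.0), bias=-1.0) for x in samples]

print(balanced_accuracy(labels, axis_preds))     # 0.75
print(balanced_accuracy(labels, oblique_preds))  # 1.0
```

On this toy set the single-feature cut misclassifies one sample in each class, while the oblique cut follows the diagonal boundary exactly. Balanced accuracy is used here rather than plain accuracy because it does not reward a model for favoring the majority class.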
