Article

Automatic specimen identification of Harpacticoids (Crustacea: Copepoda) using Random Forest and MALDI-TOF mass spectra, including a post hoc test for false positive discovery

Journal

METHODS IN ECOLOGY AND EVOLUTION
Volume 9, Issue 6, Pages 1421-1434

Publisher

WILEY
DOI: 10.1111/2041-210X.13000

Keywords

false positive; machine learning tools; MALDI-TOF MS; Meiobenthos; proteomic fingerprint; random forest; species identification

Funding

  1. Land Niedersachsen [IBR B7]

1. Ecological studies require accurate identification of specimens. This is very time-consuming when processing plankton, meiobenthos or soil biota samples, because they contain large numbers of minute specimens. MALDI-TOF MS, an emerging technique for the identification of metazoan species, may be a solution to this problem. As an alternative to factory-delivered software or clustering approaches, Random Forest (RF) models can be trained to identify species from MALDI-TOF data. However, in a real-world scenario, RF models cannot detect species that were not included in the training dataset and assign such specimens to one of the known classes, thus producing false positives (misidentifications).

2. We produced MALDI-TOF MS spectra for meiofauna species and trained RF models, using MALDI-TOF bins as predictors and species as the multi-level target class. We used the empirical beta distribution of the probability of class assignment in the model to design a post hoc test for false positive discovery. Two strategies increase the final accuracy of species identification: (1) class smoothing, which consists of in silico observations of classes created by bootstrapping the value of each predictor within each class; and (2) adding a null class to the training dataset by bootstrapping predictor values and shuffling predictor labels, creating a class without multivariate signal.

3. We prove that RF is an excellent method for species identification using MALDI-TOF MS data. The models are flexible enough to correctly classify observations created in silico by smoothing the classes. Our post hoc test unmasks false positive classifications successfully. Smoothing the classes and adding a null class to the training dataset attracts the assignment of false positives to the null class. In our example, 100% false positive discovery could be achieved while maintaining very high overall prediction accuracy.

4. Combining MALDI-TOF MS and RF models is a step towards a fully automatic species identification workflow, which is particularly necessary for species-rich communities of small organisms, both in ecological studies and in routine monitoring. The post hoc test for false positive discovery can be applied to any RF multilevel classification model, not only in a biological context.
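
The two augmentation strategies and the post hoc test described in point 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact procedure, so everything below is an assumption made for illustration. The function names, the toy data, the method-of-moments beta fit, the Monte Carlo quantile and the 5% threshold are all hypothetical. It uses only the Python standard library and leaves out the RF model itself. The `reference` values stand in for class-assignment probabilities that a trained RF would report for correctly identified specimens of one species.

```python
import random
import statistics


def smooth_class(rows, n_new, rng):
    """Class smoothing (assumed reading): build n_new in silico observations
    for ONE class by resampling every predictor independently, with
    replacement, from the values observed within that class."""
    columns = list(zip(*rows))
    return [[rng.choice(col) for col in columns] for _ in range(n_new)]


def null_class(rows, n_new, rng):
    """Null class (assumed reading): resample each predictor from the pooled
    data of all classes, then shuffle which predictor each value belongs to,
    so the observation carries no multivariate signal."""
    columns = list(zip(*rows))
    observations = []
    for _ in range(n_new):
        obs = [rng.choice(col) for col in columns]
        rng.shuffle(obs)  # break the link between value and predictor label
        observations.append(obs)
    return observations


def fit_beta_moments(probs):
    """Method-of-moments estimate of beta shape parameters (a, b).
    Requires values strictly inside (0, 1) with non-zero variance."""
    mean = statistics.fmean(probs)
    var = statistics.variance(probs)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def beta_lower_quantile(a, b, q, rng, n_draws=20000):
    """Approximate the q-quantile of Beta(a, b) by Monte Carlo sampling."""
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    return draws[int(q * n_draws)]


def is_suspected_false_positive(p_assigned, reference_probs, rng, alpha=0.05):
    """Flag an assignment whose probability falls below the lower alpha
    quantile of the beta distribution fitted to the reference probabilities."""
    a, b = fit_beta_moments(reference_probs)
    return p_assigned < beta_lower_quantile(a, b, alpha, rng)


rng = random.Random(0)

# Toy "spectra": three bins per specimen, two species (made-up numbers).
species_a = [[0.90, 0.10, 0.40], [0.80, 0.20, 0.50], [0.85, 0.15, 0.45]]
species_b = [[0.10, 0.90, 0.70], [0.20, 0.80, 0.60], [0.15, 0.85, 0.65]]

augmented_a = smooth_class(species_a, 5, rng)
null_rows = null_class(species_a + species_b, 5, rng)

# Hypothetical assignment probabilities of correctly classified specimens.
reference = [0.92, 0.88, 0.95, 0.90, 0.85, 0.93, 0.97, 0.89]
low_flag = is_suspected_false_positive(0.40, reference, rng)
high_flag = is_suspected_false_positive(0.93, reference, rng)
```

In this toy setting an assignment probability of 0.40 lies far below the fitted distribution and is flagged, whereas 0.93 is not. In a real workflow the smoothed and null observations would be appended to the training data before fitting the RF, and the reference probabilities would come from the fitted model.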
