Article

Embracing imperfection: Machine-assisted invertebrate classification in real-world datasets

Journal

ECOLOGICAL INFORMATICS
Volume 72

Publisher

ELSEVIER
DOI: 10.1016/j.ecoinf.2022.101896

Keywords

Machine learning; Computer vision; Image classification; Macroecology; Terrestrial invertebrates

Funding

  1. NSERC Discovery grant
  2. NSF REU program
  3. [DEB 1702426]

This study presents a practical methodology for using machine learning in ecological data acquisition pipelines, training and testing algorithms to classify a large number of terrestrial invertebrate specimens. It addresses two issues: inconsistent taxonomic label specificity and the classification of unknown taxa. The results show that complex machine learning methods are not necessarily more accurate than traditional methods, and that including contextual metadata improves accuracy.
Despite growing concerns over the health of global invertebrate diversity, terrestrial invertebrate monitoring efforts remain poorly distributed geographically. Machine-assisted classification has been proposed as a potential solution for quickly gathering large amounts of data; however, previous studies have often used unrealistic or idealized datasets to train and test their models. In this study, we describe a practical methodology for including machine learning in ecological data acquisition pipelines. Here we train and test machine learning algorithms to classify over 72,000 terrestrial invertebrate specimens from morphometric data and contextual metadata. All vouchered specimens were collected in pitfall traps by the National Ecological Observatory Network (NEON) at 45 locations across the United States from 2016 to 2019. Specimens were photographed, and two separate machine learning paradigms were used to classify them. In the first, we used a convolutional neural network (ResNet-50), and in the second, we extracted morphometric data as feature vectors using ImageJ and used traditional machine learning methods to classify specimens. Issues stemming from inconsistent taxonomic label specificity were resolved by making classifications at the lowest identified taxonomic level (LITL). Taxa with too few specimens to be included in the training dataset were classified by the model using zero-shot classification. When classifying specimens that were known and seen by our models, we reached a maximum accuracy of 72.7% using eXtreme Gradient Boosting (XGBoost) at the LITL. This nearly matched the maximum accuracy achieved by the CNN of 72.8% at the LITL. Models that were trained without contextual metadata underperformed models with contextual metadata. We also classified invertebrate taxa that were unknown to the model using zero-shot classification, reaching a maximum accuracy of 65.5% when using the ResNet-50, compared to 39.4% when using XGBoost.
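
To make the idea of labelling at the lowest identified taxonomic level (LITL) concrete, the following is a minimal sketch and not the authors' code. It assumes each specimen record is a dictionary keyed by taxonomic rank, with missing ranks left empty; the rank list, the function name `lowest_identified_label`, and the example records are all hypothetical.

```python
from typing import Optional, Tuple

# Ranks ordered from coarsest to finest. This particular list is an assumption
# made for illustration; the abstract does not state which ranks were used.
RANKS = ["class", "order", "family", "genus", "species"]


def lowest_identified_label(record: dict) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the finest rank that has a non-empty value."""
    for rank in reversed(RANKS):
        name = record.get(rank)
        if name:
            return rank, name
    return None  # nothing was identified for this specimen


# Hypothetical specimen records with differing label specificity.
specimens = [
    {"class": "Insecta", "order": "Coleoptera", "family": "Carabidae",
     "genus": "Pterostichus", "species": None},
    {"class": "Arachnida", "order": "Araneae"},
    {},
]

labels = [lowest_identified_label(s) for s in specimens]
```

Under these assumptions, a specimen identified only to order receives an order-level training label, while one identified to genus receives a genus-level label, so no specimen has to be discarded for lacking a fixed rank.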
The general methodology outlined here represents a realistic application of machine learning as a tool for ecological studies. We found that more advanced and complex machine learning methods such as convolutional neural networks are not necessarily more accurate than traditional machine learning methods. Hierarchical and LITL classifications allow for flexible taxonomic specificity at the input and output layers. These methods also help address the 'long tail' problem of underrepresented taxa missed by machine learning models. Finally, we encourage researchers to consider more than just morphometric data when training their models, as we have shown that the inclusion of contextual metadata can provide significant improvements to accuracy.
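
As a rough illustration of combining morphometric data with contextual metadata, the sketch below appends one-hot encoded metadata to a numeric feature vector before it would be passed to a classifier. This is an assumption about one common way to do it, not the pipeline described in the paper; the metadata fields (collection site and month), the function names, and the example values are hypothetical.

```python
from typing import List, Sequence


def one_hot(value, vocabulary: Sequence) -> List[float]:
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_feature_vector(morphometrics: Sequence[float], site: str, month: int,
                         sites: Sequence[str], months: Sequence[int]) -> List[float]:
    """Concatenate morphometric measurements with encoded contextual metadata."""
    return list(morphometrics) + one_hot(site, sites) + one_hot(month, months)


# Hypothetical vocabularies and a single hypothetical specimen.
SITES = ["A", "B"]
MONTHS = [6, 7, 8]

features = build_feature_vector([2.5, 0.8], site="B", month=7,
                                sites=SITES, months=MONTHS)
```

A model trained on `features` sees both shape measurements and where and when the specimen was collected; dropping the encoded columns gives the morphometrics-only baseline for comparison.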
