4.7 Article

Embracing imperfection: Machine-assisted invertebrate classification in real-world datasets

Journal

ECOLOGICAL INFORMATICS
Volume 72

Publisher

ELSEVIER
DOI: 10.1016/j.ecoinf.2022.101896

Keywords

Machine learning; Computer vision; Image classification; Macroecology; Terrestrial invertebrates

Funding

  1. NSERC Discovery grant
  2. NSF REU program
  3. [DEB 1702426]

This study presents a practical methodology for using machine learning in ecological data acquisition pipelines, training and testing algorithms to classify a large number of terrestrial invertebrate specimens. It addresses inconsistent taxonomic label specificity and the classification of unknown taxa. The results show that complex machine learning methods are not necessarily more accurate than traditional methods, and that including contextual metadata improves accuracy.
Despite growing concerns over the health of global invertebrate diversity, terrestrial invertebrate monitoring efforts remain poorly geographically distributed. Machine-assisted classification has been proposed as a potential solution to quickly gather large amounts of data; however, previous studies have often used unrealistic or idealized datasets to train and test their models. In this study, we describe a practical methodology for including machine learning in ecological data acquisition pipelines. Here we train and test machine learning algorithms to classify over 72,000 terrestrial invertebrate specimens from morphometric data and contextual metadata. All vouchered specimens were collected in pitfall traps by the National Ecological Observatory Network (NEON) at 45 locations across the United States from 2016 to 2019. Specimens were photographed, and two separate machine learning paradigms were used to classify them. In the first, we used a convolutional neural network (ResNet-50), and in the second, we extracted morphometric data as feature vectors using ImageJ and used traditional machine learning methods to classify specimens. Issues stemming from inconsistent taxonomic label specificity were resolved by making classifications at the lowest identified taxonomic level (LITL). Taxa with too few specimens to be included in the training dataset were classified by the model using zero-shot classification. When classifying specimens that were known and seen by our models, we reached a maximum accuracy of 72.7% using eXtreme Gradient Boosting (XGBoost) at the LITL. This nearly matched the maximum accuracy achieved by the CNN of 72.8% at the LITL. Models that were trained without contextual metadata underperformed models with contextual metadata. We also classified invertebrate taxa that were unknown to the model using zero-shot classification, reaching a maximum accuracy of 65.5% when using the ResNet-50, compared to 39.4% when using XGBoost.
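As a rough illustration of labelling at the lowest identified taxonomic level (LITL), the sketch below picks, for each specimen, the finest rank that carries a name. It is not the authors' implementation: the rank list, the record layout, the function name `lowest_identified_level`, and the example specimens are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Tuple

# Ranks ordered from coarsest to finest. This particular list is an assumption;
# the abstract does not say which ranks the study used.
RANKS = ("class", "order", "family", "genus", "species")


def lowest_identified_level(record: Mapping[str, Optional[str]]) -> Tuple[str, str]:
    """Return (rank, name) for the finest rank that has a non-empty name."""
    for rank in reversed(RANKS):
        name = record.get(rank)
        if name and name.strip():
            return rank, name.strip()
    raise ValueError("record has no identified taxonomic rank")


# Hypothetical records whose identifications stop at different ranks.
specimens = [
    {"class": "Insecta", "order": "Coleoptera", "family": "Carabidae",
     "genus": None, "species": None},
    {"class": "Arachnida", "order": "Araneae", "family": None},
    {"class": "Insecta", "order": "Hymenoptera", "family": "Formicidae",
     "genus": "Formica", "species": ""},
]

labels = [lowest_identified_level(s) for s in specimens]
for rank, name in labels:
    print(f"{rank}: {name}")
```

Using the finest available rank as the training label means a specimen identified only to order is not discarded, and a specimen identified to genus is not forced up to a coarser label.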
The general methodology outlined here represents a realistic application of machine learning as a tool for ecological studies. We found that more advanced and complex machine learning methods such as convolutional neural networks are not necessarily more accurate than traditional machine learning methods. Hierarchical and LITL classifications allow for flexible taxonomic specificity at the input and output layers. These methods also help address the 'long tail' problem of underrepresented taxa missed by machine learning models. Finally, we encourage researchers to consider more than just morphometric data when training their models, as we have shown that the inclusion of contextual metadata can provide significant improvements to accuracy.
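One way to picture combining morphometric data with contextual metadata is a single feature vector that concatenates both. The sketch below is a minimal example under stated assumptions: the abstract does not list which metadata fields were used, so collection site and day of year are stand-ins, and the morphometric values, site names, and helper functions are hypothetical.

```python
import math
from typing import List, Sequence


def encode_day_of_year(day: int, period: float = 365.25) -> List[float]:
    """Encode a day of year as a point on the unit circle so that late
    December and early January end up close together."""
    angle = 2.0 * math.pi * day / period
    return [math.sin(angle), math.cos(angle)]


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """One-hot encode a categorical value against a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_feature_vector(
    morphometrics: Sequence[float],
    site: str,
    day_of_year: int,
    site_vocabulary: Sequence[str],
) -> List[float]:
    """Concatenate morphometric measurements with encoded metadata."""
    return (
        list(morphometrics)
        + one_hot(site, site_vocabulary)
        + encode_day_of_year(day_of_year)
    )


# Hypothetical site identifiers and measurements (e.g. length, width, circularity).
SITES = ("site_A", "site_B", "site_C")
vec = build_feature_vector([12.5, 3.1, 0.82], "site_B", 180, SITES)
print(len(vec), vec[:6])
```

A vector built this way can be passed to any tabular classifier; dropping the metadata columns gives the morphometrics-only baseline for comparison.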
