Article

Potential of natural language processing for metadata extraction from environmental scientific publications

Journal

SOIL
Volume 9, Issue 1, Pages 155-168

Publisher

COPERNICUS GESELLSCHAFT MBH
DOI: 10.5194/soil-9-155-2023

Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This study explores three NLP techniques (topic modeling, tailored regular expressions, and the shortest dependency path method) to support evidence synthesis tasks. The results show that all three tested NLP techniques are able to support this task and have the potential for automated updating as new publications become available.
Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This is especially true in environmental studies where multiple factors (e.g., soil, climate, vegetation) can contribute to the effects observed. Meta-analyses, studies that quantitatively summarize findings of a large body of literature, rely on manually curated databases built upon primary publications. However, given the increasing amount of literature, this manual work is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we explore three NLP techniques that can help support this task: topic modeling, tailored regular expressions and the shortest dependency path method. We apply these techniques in a practical and reproducible workflow on two corpora of documents: the Open Tension-disk Infiltrometer Meta-database (OTIM) and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. As a first step of our practical workflow, we identified different topics from the individual source publications of the Meta corpus using topic modeling. This enabled us to distinguish well-researched topics (e.g., conventional tillage, cover crops), where meta-analysis would be useful, from neglected topics (e.g., effect of irrigation on soil properties), showing potential knowledge gaps. Then, we used tailored regular expressions to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus to build a quantitative database. We were able to retrieve between 56 % and 100 % of all relevant information (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of drivers corresponding to different soil management practices or amendments (e.g., biochar, zero tillage) and target variables (e.g., soil aggregate, hydraulic conductivity, crop yield) from the source publications' abstracts of the Meta corpus using the shortest dependency path between them. These relationships were further classified according to positive, negative or absent correlations between the driver and the target variable. This quickly provided an overview of the different driver-variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks. While human supervision remains essential, NLP methods have the potential to support automated evidence synthesis which can be continuously updated as new publications become available.
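To make the regular-expression step and its evaluation concrete, the following is a minimal Python sketch, not the authors' implementation. The pattern `RAINFALL_RE`, the helper names `extract_rainfall` and `precision_recall`, and the sample sentence are all assumptions made for illustration; the patterns actually used in the study may differ substantially.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper): a rainfall or
# precipitation keyword, followed by up to 40 non-digit characters, then a
# number and the unit "mm".
RAINFALL_RE = re.compile(
    r"(?:rainfall|precipitation)\D{0,40}?(\d+(?:\.\d+)?)\s*mm",
    re.IGNORECASE,
)


def extract_rainfall(text):
    """Return every rainfall value (in mm) matched in the text, as floats."""
    return [float(m.group(1)) for m in RAINFALL_RE.finditer(text)]


def precision_recall(extracted, relevant):
    """Score extracted values against manually curated values (set-based)."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: share of extracted values that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of relevant values that were actually extracted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented sample text and invented "manually curated" reference values.
sample = (
    "The site has a mean annual rainfall of 650 mm and a clay loam soil. "
    "Precipitation in 2010 was 712.5 mm."
)
found = extract_rainfall(sample)
print(found, precision_recall(found, [650.0, 712.5]))
```

On this invented sample the pattern finds 650 mm but misses 712.5 mm, because the year "2010" sits between the keyword and the value. The extracted value is correct (precision 1.0), but only half of the relevant values are recovered (recall 0.5). This is the kind of gap that the recall figure in the abstract measures.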
