Article

A text mining framework for screening catalysts and critical process parameters from scientific literature - A study on Hydrogen production from alcohol

Journal

CHEMICAL ENGINEERING RESEARCH & DESIGN
Volume 184, Pages 90-102

Publisher

ELSEVIER
DOI: 10.1016/j.cherd.2022.05.018

Keywords

Catalyst; Process parameter; LDA; Hydrogen; Alcohol; NLP; SciBERT; Classification; NER; Ex-SciBERT

Funding

  1. BRNS [51/14/11/2019BRNS]
  2. SERB India [CRG/2018/001555]

This work aims to develop a recommendation system using Natural Language Processing (NLP) tools to identify optimal process conditions and catalyst information in hydrogen production. The study utilizes full-text articles, applies Latent Dirichlet allocation (LDA) for topic clustering, and develops a dedicated NLP model called Ex-SciBERT for classification and Named Entity Recognition (NER). The Ex-SciBERT model achieves high accuracy scores and automates the screening of relevant information from literature.
Hydrogen production is an active area of research with a vast body of scientific literature. However, this data is unstructured and scattered, which makes it difficult to use in both academic and industrial settings. This work aims to develop a recommendation system that identifies optimal process conditions and catalyst information using Natural Language Processing (NLP) tools. To this end, full-text articles were retrieved through the Elsevier API and processed with a custom XML parser. Latent Dirichlet allocation (LDA) was applied to this dataset to form clusters of topics. The experimental section of each article was annotated using state-of-the-art sentiment analysis techniques and divided into four categories based on the presence of catalyst and process information. This dataset was used to develop a dedicated NLP model, Ex-SciBERT, by performing transfer learning on the SciBERT model. The model performs classification followed by Named Entity Recognition (NER) to extract catalyst and process parameters. Ex-SciBERT achieves an accuracy of 0.915 (train dataset) and 0.890 (test dataset) on the sentence classification task, and 0.998 (train dataset) and 0.997 (test dataset) on the NER task. Deploying this model will automate and accelerate the screening of relevant information from the literature by reducing manual effort. © 2022 Institution of Chemical Engineers. Published by Elsevier Ltd. All rights reserved.
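
To make the four-way sentence labelling and the accuracy metric concrete, here is a minimal sketch in Python. It is not the Ex-SciBERT model: it swaps in a simple keyword and regex rule as a stand-in classifier. The category names, the patterns, and the example sentences are all assumptions made for illustration; the abstract states only that sentences fall into four categories based on the presence of catalyst and process information.

```python
import re

# Hypothetical patterns, chosen only for illustration. The paper's model
# learns these distinctions from annotated sentences rather than using rules.
CATALYST_PATTERN = re.compile(r"\b(?:[Cc]atalysts?|Ni|Pt|Al2O3|CeO2)\b")
PROCESS_PATTERN = re.compile(
    r"\b(?:temperature|pressure|flow rate)\b"   # named process parameters
    r"|\b\d+(?:\.\d+)?\s?(?:K|bar|h)\b"         # a number followed by a unit
)


def categorize(sentence: str) -> str:
    """Assign one of four assumed labels based on which kinds of
    information the sentence appears to contain."""
    has_catalyst = bool(CATALYST_PATTERN.search(sentence))
    has_process = bool(PROCESS_PATTERN.search(sentence))
    if has_catalyst and has_process:
        return "catalyst+process"
    if has_catalyst:
        return "catalyst"
    if has_process:
        return "process"
    return "neither"


def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Invented example sentences with hand-assigned reference labels.
EXAMPLES = [
    ("The Ni/Al2O3 catalyst was reduced at 773 K for 2 h.", "catalyst+process"),
    ("Ethanol was fed at a flow rate of 0.1 mL/min.", "process"),
    ("The Pt catalyst was prepared by impregnation.", "catalyst"),
    ("Hydrogen is a promising energy carrier.", "neither"),
]

if __name__ == "__main__":
    predictions = [categorize(text) for text, _ in EXAMPLES]
    references = [label for _, label in EXAMPLES]
    print(accuracy(predictions, references))
```

A learned classifier such as the one described above would replace `categorize`, while the accuracy calculation would stay the same. The reported scores of 0.915 and 0.890 are this kind of ratio computed over the train and test sentences.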
