4.7 Article

DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain

Journal

COMPUTERS & GEOSCIENCES
Volume 121, Issue -, Pages 1-11

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.cageo.2018.08.006

Keywords

Chinese word segmentation; Geoscience reports; Unigram language model; Natural language processing

Funding

  1. National Key Research and Development Program [2017YFB0503600, 2017YFC0602204]
  2. National Natural Science Foundation of China [41671400]
  3. National Science and Technology Major Project of China [2018YFB0505500]

The growing number of geoscience reports creates both challenges and opportunities for data analysis and knowledge discovery. Because written Chinese has no spaces between words, texts must first be segmented into semantically and syntactically meaningful words, a task known as the Chinese word segmentation (CWS) problem. CWS is a crucial first step in natural language processing (NLP). Although the available generic segmenters can process geoscience reports, their performance degrades dramatically without sufficient domain knowledge. Developing effective segmenters therefore remains a challenge, which inspired us to build a segmenter for the geoscience domain. By integrating the unigram language model and deep learning, we propose a weakly supervised model, DGeoSegmenter, which is trained with words and their corresponding frequencies. We built DGeoSegmenter on the bi-directional long short-term memory (Bi-LSTM) model, with words randomly extracted and combined into sentences. Evaluation results on geoscience reports and benchmark datasets demonstrate the effectiveness of our method: DGeoSegmenter can segment both geoscience terms and general terms. Since the proposed algorithm requires neither manually labeled datasets nor hand-crafted rules, it can easily be applied to various domains, including information retrieval and text mining.
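
The abstract describes training on words and their frequencies, with words randomly extracted and combined into sentences. The following is a minimal sketch of how such weakly supervised training data might be generated, not the authors' implementation. The function names, the toy dictionary and its counts, and the B/M/E/S character tagging scheme are all assumptions made for illustration; the abstract does not specify them.

```python
import random

# Hypothetical toy dictionary: word -> frequency count (illustrative values only).
TOY_DICT = {"花岗岩": 50, "矿床": 30, "的": 200, "分布": 40}


def tag_word(word):
    """Return one tag per character: S for a single-character word,
    otherwise B (begin), M (middle, zero or more), E (end).
    The B/M/E/S scheme is an assumption, not taken from the abstract."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def make_pseudo_sentence(word_freq, n_words, rng):
    """Sample n_words words in proportion to their frequency (a unigram
    model), concatenate them without spaces, and label each character."""
    words = list(word_freq)
    weights = [word_freq[w] for w in words]
    sampled = rng.choices(words, weights=weights, k=n_words)
    sentence = "".join(sampled)
    tags = [tag for w in sampled for tag in tag_word(w)]
    return sentence, tags, sampled


rng = random.Random(0)  # fixed seed so the sample is reproducible
sentence, tags, words = make_pseudo_sentence(TOY_DICT, 5, rng)
print(sentence)
print(tags)
```

Each generated character sequence and its tag sequence could then serve as one training pair for a sequence labeling model such as a Bi-LSTM. Because the labels come directly from the dictionary words that were sampled, no manual annotation is involved.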
