Article

DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain

Journal

COMPUTERS & GEOSCIENCES
Volume 121, Pages 1-11

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.cageo.2018.08.006

Keywords

Chinese word segmentation; Geoscience reports; Unigram language model; Natural language processing

Funding

  1. National Key Research and Development Program [2017YFB0503600, 2017YFC0602204]
  2. National Natural Science Foundation of China [41671400]
  3. National Science and Technology Major Project of China [2018YFB0505500]


The growing number of geoscience reports creates both challenges and opportunities for data analysis and knowledge discovery. Because written Chinese has no spaces between words, text must first be segmented into semantically and syntactically meaningful words, a task known as Chinese word segmentation (CWS). CWS is a crucial first step in natural language processing (NLP). Although the available generic segmenters can process geoscience reports, their performance degrades dramatically without sufficient domain knowledge, so developing effective domain segmenters remains a challenge. This inspired us to build a segmenter for the geoscience subject domain. By integrating the unigram language model and deep learning, we propose a weakly supervised model: DGeoSegmenter. DGeoSegmenter is trained with words and their corresponding frequencies. We built it on a bi-directional long short-term memory (Bi-LSTM) model, with training sentences formed by randomly extracting words and combining them. Our evaluation on geoscience reports and benchmark datasets demonstrates the effectiveness of the method: DGeoSegmenter can segment both geoscience terms and general terms. Since the proposed algorithm requires neither manually labeled datasets nor hand-crafted rules, it can easily be applied to other domains and to tasks such as information retrieval and text mining.
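
The training idea in the abstract (words and frequencies in, randomly combined sentences out) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that words are sampled in proportion to their frequency, as a unigram model would suggest, and that each character is labeled with a B/M/E/S position tag, a common CWS convention that the abstract does not mention. The toy lexicon and the names `bmes_tags` and `make_pseudo_sentence` are hypothetical.

```python
import random


def bmes_tags(word):
    """Return one position tag per character: S for a single-character word,
    otherwise B (begin), M (middle, zero or more), E (end)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def make_pseudo_sentence(lexicon, n_words, rng):
    """Sample n_words words in proportion to their frequency (assumption)
    and flatten them into parallel character and tag sequences."""
    vocab = list(lexicon)
    weights = [lexicon[w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=n_words)
    chars = [ch for w in words for ch in w]
    tags = [t for w in words for t in bmes_tags(w)]
    return words, chars, tags


# Hypothetical toy dictionary: word -> frequency (not from the paper).
lexicon = {"花岗岩": 50, "矿床": 30, "的": 200, "地质": 80, "构造": 40}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
words, chars, tags = make_pseudo_sentence(lexicon, 5, rng)
print("".join(chars))
print(" ".join(tags))
```

Pairs of character and tag sequences like these could then serve as input and target for a sequence-labeling model such as a Bi-LSTM, which is why no manually labeled corpus is needed in this setup.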
