Article

NLP-Based Approach to Semantic Classification of Heterogeneous Transportation Asset Data Terminology

Journal

Publisher

American Society of Civil Engineers (ASCE)
DOI: 10.1061/(ASCE)CP.1943-5487.0000701

Keywords

Heterogeneous data terminology; Data sharing; Semantic interoperability; Semantic relation; Natural language processing; Vector space model; Transportation data

Funding

  1. National Science Foundation (NSF) [NSF-CIS 420-60-83]
  2. NSF


Inconsistent data terminology poses major challenges for integrating transportation project data from distinct sources. Differences in the meaning of data elements may lead to miscommunication between data senders and receivers. Semantic relations between terms in digital dictionaries, such as ontologies, can make the semantics of a data element transparent and unambiguous to computer systems. However, because effective automated methods are lacking, identifying these relations is labor intensive and time consuming. This paper presents a novel integrated methodology that leverages multiple computational techniques to extract, from design manuals and other technical specifications, the heterogeneous American-English data terms used by different highway agencies and the semantic relations among them. The proposed method applies natural language processing (NLP) to detect data elements in text documents and uses machine learning to determine the semantic relatedness among terms from their occurrence statistics in a corpus. The study also develops an algorithm that classifies semantically related terms into three lexical groups: synonymy, hyponymy, and meronymy. The key merit of this technique is that the detection of semantic relations uses only linguistic information in texts and does not depend on existing hand-coded semantic resources. In a case study, the proposed method was applied to a 16-million-word corpus of roadway design manuals to extract and classify roadway data items. The developed classifier was evaluated using a human-encoded test set, and the results show an overall precision of 92.76% and recall of 81.02%. © 2017 American Society of Civil Engineers.
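As a rough illustration of how a vector space model can estimate semantic relatedness from occurrence statistics, the sketch below builds co-occurrence count vectors over a fixed context window and compares terms by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the window size, and the three-sentence toy corpus are all invented for this example, and the abstract does not specify the paper's actual weighting scheme or learning method.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each token to a Counter of tokens seen within `window` positions of it.

    `window` is an assumed parameter chosen for illustration only.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real corpus would be tokenized design manuals.
corpus = [
    "the guardrail is installed along the embankment".split(),
    "the guiderail is installed along the shoulder".split(),
    "the culvert carries water under the roadway".split(),
]

vectors = cooccurrence_vectors(corpus)

# Terms that appear in similar contexts get similar vectors.
sim_related = cosine(vectors["guardrail"], vectors["guiderail"])
sim_unrelated = cosine(vectors["guardrail"], vectors["culvert"])
print(sim_related > sim_unrelated)
```

In this toy setting, the two terms that share the same neighboring words score higher than the pair that does not. Raw similarity of this kind signals only that terms are related; deciding whether a related pair is a synonym, hyponym, or meronym requires a further classification step, as the abstract describes.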

