Article

SEthesaurus: WordNet in Software Engineering

Journal

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
Volume 47, Issue 9, Pages 1960-1979

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TSE.2019.2940439

Keywords

Thesauri; Software engineering; Encyclopedias; Electronic publishing; Internet; Natural language processing; Software-specific thesaurus; morphological form; word embedding

Funding

  1. Monash University
  2. National Natural Science Foundation of China [61702041, 61602267, 61872263, 61202006]
  3. Jiangsu Government Scholarship for Overseas Studies


This paper proposes an automatic unsupervised approach to build a thesaurus for software engineering text, utilizing software-specific and general corpora to identify terms, infer morphological forms, and perform graph analysis. Experimental results show high coverage and accuracy of the approach, confirmed through manual verification of abbreviations and synonyms in the thesaurus.
Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.
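The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the frequency-ratio test, the thresholds, the subsequence-based abbreviation rule, and all function names (`domain_specific_terms`, `is_abbreviation`, `group_forms`) are assumptions made for illustration, and the word-embedding step ("distributed word semantics") is omitted entirely.

```python
from collections import Counter, defaultdict


def domain_specific_terms(domain_tokens, general_tokens, min_count=2, min_ratio=5.0):
    """Keep tokens far more frequent in the domain corpus than in the general one.

    The ratio test and both thresholds are illustrative assumptions.
    """
    domain, general = Counter(domain_tokens), Counter(general_tokens)
    n_domain, n_general = len(domain_tokens), len(general_tokens)
    terms = set()
    for token, count in domain.items():
        if count < min_count:
            continue
        # Add-one smoothing so tokens unseen in the general corpus do not divide by zero.
        ratio = (count / n_domain) / ((general[token] + 1) / (n_general + 1))
        if ratio >= min_ratio:
            terms.add(token)
    return terms


def is_abbreviation(short, full):
    """Hypothetical lexical rule: same first letter, and the characters of
    `short` appear in order inside `full`."""
    if not short or len(short) >= len(full) or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def group_forms(pairs):
    """Group terms linked by morphological pairs into connected components
    (union-find), one simple form of graph analysis."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return sorted(sorted(group) for group in groups.values())
```

In this sketch, a token such as `js` that is common in a Stack Overflow-like corpus but absent from a general corpus passes the ratio test, `is_abbreviation("js", "javascript")` accepts the pair, and `group_forms` merges chained pairs into one group of forms for the same concept.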

