SEthesaurus: WordNet in Software Engineering

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING (2021)

Journal

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING

Volume 47, Issue 9, Pages 1960-1979

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TSE.2019.2940439

Keywords

Thesauri; Software engineering; Encyclopedias; Electronic publishing; Internet; Natural language processing; Software-specific thesaurus; morphological form; word embedding

Funding

Monash University
National Natural Science Foundation of China [61702041, 61602267, 61872263, 61202006]
Jiangsu Government Scholarship for Overseas Studies

Automated Summary

This paper proposes an automatic unsupervised approach to build a thesaurus for software engineering text, utilizing software-specific and general corpora to identify terms, infer morphological forms, and perform graph analysis. Experimental results show high coverage and accuracy of the approach, confirmed through manual verification of abbreviations and synonyms in the thesaurus.

Abstract

Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be used to harvest this knowledge base for software engineering tasks. However, a consistent vocabulary for each concept is essential to make effective use of these NLP techniques. Unfortunately, the same concept is often mentioned, intentionally or accidentally, in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques for dealing with such morphological forms are either designed for general English or rely mainly on domain-specific lexical rules. A thesaurus that contains software-specific terms and their commonly used morphological forms is desirable for normalizing software engineering text. However, constructing such a thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.
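The three steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the frequency-ratio test, the thresholds, and the single subsequence rule for abbreviations are all assumptions made for illustration, and the word-embedding step the abstract mentions is omitted entirely. The sketch contrasts term frequencies in a domain corpus and a general corpus, checks one simple lexical rule, and groups related forms as connected components of a graph.

```python
from collections import Counter, defaultdict


def domain_specific_terms(domain_tokens, general_tokens, min_ratio=5.0, min_count=2):
    """Keep terms that are relatively far more frequent in the domain corpus.

    min_ratio and min_count are hypothetical thresholds, not values from the paper.
    """
    d, g = Counter(domain_tokens), Counter(general_tokens)
    d_total, g_total = sum(d.values()), sum(g.values())
    terms = set()
    for term, count in d.items():
        if count < min_count:
            continue
        d_freq = count / d_total
        g_freq = (g[term] + 1) / (g_total + 1)  # add-one smoothing for unseen terms
        if d_freq / g_freq >= min_ratio:
            terms.add(term)
    return terms


def is_abbreviation(short, full):
    """Toy lexical rule: same first letter, and `short` is a subsequence of `full`."""
    if not short or len(short) >= len(full) or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)  # each `in` consumes the iterator


def group_forms(pairs):
    """Group forms linked by morphological relations (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return [sorted(members) for members in groups.values()]


if __name__ == "__main__":
    domain = "javascript js javascript js python the the".split()
    general = ["the"] * 50 + ["cat"] * 49 + ["python"]
    print(domain_specific_terms(domain, general))
    print(is_abbreviation("js", "javascript"))
    print(group_forms([("js", "javascript"), ("javascript", "javascrip")]))
```

In a fuller pipeline, candidate pairs would come from semantically similar terms (for example, nearest neighbours in an embedding space) and would then be filtered by rules like the one above before grouping.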
