Article

BenLem (A Bengali Lemmatizer) and Its Role in WSD

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/2835494

Keywords

Bengali; evaluation; Indic languages; lemmatizer; word sense disambiguation (WSD)

A lemmatization algorithm for Bengali is developed and evaluated, and its effectiveness for word sense disambiguation (WSD) is investigated. A key challenge in processing highly inflected languages is handling the frequent morphological variations of root words in text, so a lemmatizer is essential for developing natural language processing (NLP) tools for such languages. This work takes Bengali, the national language of Bangladesh and the second most popular language in the Indian subcontinent, as a reference. To design the Bengali lemmatizer (named BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that appropriate reverse transformations can be applied to a surface word to recover the corresponding lemma. BenLem is found to handle both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles from the FIRE Bengali News Corpus, comprising 3,342 surface words (excluding proper nouns), and found to be 81.95% accurate. The role of the lemmatizer in Bengali WSD is then investigated. Ten highly polysemous Bengali words are selected for sense disambiguation, and the WSD dataset is built from the FIRE corpus and a collection of Tagore's short stories. Several WSD systems are tested, and BenLem improves the performance of all of them, with statistically significant improvements.
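
The idea of applying reverse transformations to a surface word can be illustrated with a minimal sketch. It is not BenLem's actual rule set or algorithm: the suffix rules, the toy lexicon, and the names `REVERSE_RULES`, `LEXICON`, `lemmatize`, and `accuracy` are all hypothetical, and the accuracy measure is assumed here to be the fraction of surface words whose predicted lemma exactly matches a gold lemma.

```python
# Hypothetical reverse-transformation rules: (surface suffix, lemma-side replacement).
# Illustrative only; these are NOT the rules used by BenLem.
REVERSE_RULES = [
    ("ের", ""),  # genitive marker, e.g. দেশের -> দেশ
    ("রা", ""),  # plural marker, e.g. ছেলেরা -> ছেলে
    ("টি", ""),  # determiner, e.g. বইটি -> বই
    ("কে", ""),  # objective marker
]

# Hypothetical toy lexicon of known lemmas (an assumption of this sketch).
LEXICON = {"দেশ", "ছেলে", "বই"}


def lemmatize(surface, rules=REVERSE_RULES, lexicon=LEXICON):
    """Return a lemma for `surface`, or `surface` itself if no rule yields a known lemma."""
    if surface in lexicon:
        return surface
    # Try longer suffixes first so the most specific rule is applied first.
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if surface.endswith(suffix) and len(surface) > len(suffix):
            candidate = surface[: len(surface) - len(suffix)] + replacement
            # Accept the reverse transformation only if it produces a known lemma.
            if candidate in lexicon:
                return candidate
    return surface


def accuracy(pairs):
    """Fraction of (surface, gold_lemma) pairs for which lemmatize() matches the gold lemma."""
    if not pairs:
        return 0.0
    correct = sum(1 for surface, gold in pairs if lemmatize(surface) == gold)
    return correct / len(pairs)
```

In this sketch a candidate lemma is accepted only when it appears in the lexicon, so unknown words are returned unchanged rather than over-stripped. A real system would need a far larger rule inventory, including derivational patterns, and consistent Unicode normalization of the input text.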
