Article

Rank Diversity of Languages: Generic Behavior in Computational Linguistics

Journal

PLOS ONE
Volume 10, Issue 4

Publisher

Public Library of Science
DOI: 10.1371/journal.pone.0121898

Funding

  1. Programa de Apoyo a Proyectos de Investigacion e Innovacion Tecnologica of the Universidad Nacional Autonoma de Mexico [IN107414, IA101713]
  2. SNI membership of Consejo Nacional de Ciencia y Tecnologia, Mexico [47907]
  3. Consejo Nacional de Ciencia y Tecnologia [153190]

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: heads consist of words whose rank barely changes in time, bodies are words of general use, and tails comprise context-specific words whose rank varies considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
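The two ideas in the abstract, a per-rank diversity measure and a rank-dependent Gaussian random walk, can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not give them: diversity at rank k is taken to be the number of distinct words observed at that rank divided by the number of time slices, and the random step is taken to have a standard deviation proportional to the current rank (`sigma * k`). The function names `rank_diversity` and `simulate_rankings` and the parameter `sigma` are hypothetical, chosen for illustration only.

```python
import random


def rank_diversity(rankings):
    """Diversity per rank, under the assumed definition:
    (distinct words seen at a rank) / (number of time slices).

    rankings[t][k] is the word holding rank k+1 in time slice t.
    """
    n_slices = len(rankings)
    depth = min(len(r) for r in rankings)  # only ranks present in every slice
    return [len({r[k] for r in rankings}) / n_slices for k in range(depth)]


def simulate_rankings(n_words, n_slices, sigma=0.1, seed=0):
    """Toy Gaussian random walk on ranks (an illustrative assumption).

    At each step a word at rank k gets the score k + N(0, (sigma * k)^2),
    and words are re-ranked by score, so low ranks move less than high ranks.
    """
    rng = random.Random(seed)
    order = list(range(n_words))  # word ids, initially in rank order
    history = [order[:]]
    for _ in range(n_slices - 1):
        scored = [
            (k + rng.gauss(0.0, sigma * k), word)
            for k, word in enumerate(order, start=1)
        ]
        scored.sort()  # smallest score becomes rank 1
        order = [word for _, word in scored]
        history.append(order[:])
    return history


if __name__ == "__main__":
    history = simulate_rankings(n_words=500, n_slices=100)
    diversity = rank_diversity(history)
    for k in (1, 10, 100, 500):
        print(f"rank {k}: diversity {diversity[k - 1]:.2f}")
```

In this toy model the top ranks rarely change hands while high ranks are occupied by many different words over time, which is the qualitative head, body, and tail pattern the abstract describes. It is not fitted to any corpus.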
