Article

Entropy Rate Estimates for Natural Language-A New Extrapolation of Compressed Large-Scale Corpora

Journal

ENTROPY
Volume 18, Issue 10

Publisher

MDPI
DOI: 10.3390/e18100364

Keywords

entropy rate; universal compression; stretched exponential; language universals

Funding

  1. Japan Science and Technology Agency (JST, Precursory Research for Embryonic Science and Technology)

Abstract

One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question of the entropy of language dates back to Shannon's experiments of 1951, but in 1990 Hilberg raised doubts about the correct interpretation of those experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Korean, Chinese, and Japanese), and concludes that the entropy rate is positive. To obtain estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas several ansatzes have been proposed previously, here we use a new stretched exponential extrapolation function that yields a smaller error of fit. We thus conclude that the entropy rates of human languages are positive but approximately 20% smaller than the estimates obtained without extrapolation. Although the entropy rate estimates depend on the kind of script, the exponent of the ansatz function turns out to be constant across languages and governs the complexity of natural language in general. In other words, in spite of typological differences, all languages seem equally hard to learn, which partly confirms Hilberg's hypothesis.
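
The estimation procedure described in the abstract, compressing ever larger prefixes of a corpus and extrapolating the compression rate to infinite data length, can be illustrated with a short script. The sketch below is not the authors' code: the compressor (LZMA), the corpus file name, and the exact parameterization of the stretched-exponential ansatz, r(n) = h * exp(A * n^(beta - 1)), are assumptions for illustration; the paper's own ansatz form, compressors, and fitting details may differ.

# Minimal sketch (not the authors' code) of compression-based entropy rate
# estimation with a stretched-exponential extrapolation.
# Assumed ansatz: r(n) = h * exp(A * n**(beta - 1)); for beta < 1 the
# correction term vanishes as n -> infinity, so r(n) -> h.
import lzma
import numpy as np
from scipy.optimize import curve_fit

def compression_rate(text, n):
    """Bits per character needed to compress the first n characters."""
    data = text[:n].encode("utf-8")
    return 8.0 * len(lzma.compress(data, preset=9)) / n

def ansatz(n, h, A, beta):
    # Stretched-exponential extrapolation function (assumed form).
    return h * np.exp(A * n ** (beta - 1.0))

def estimate_entropy_rate(text, lengths):
    rates = np.array([compression_rate(text, int(n)) for n in lengths])
    # h: extrapolated entropy rate (bits/char); beta: Hilberg-like exponent.
    (h, A, beta), _ = curve_fit(
        ansatz, np.asarray(lengths, dtype=float), rates,
        p0=(1.0, 1.0, 0.8),
        bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]),
    )
    return h, A, beta

if __name__ == "__main__":
    # "corpus.txt" is a hypothetical placeholder for one of the corpora.
    corpus = open("corpus.txt", encoding="utf-8").read()
    ns = np.logspace(3, np.log10(len(corpus)), num=12, dtype=int)
    h, A, beta = estimate_entropy_rate(corpus, ns)
    print(f"extrapolated entropy rate ~ {h:.3f} bits/char, exponent beta = {beta:.3f}")

The fitted parameter h plays the role of the extrapolated entropy rate, while beta corresponds to the exponent that the abstract reports as roughly constant across languages; in practice the fit would be repeated over many corpora and compressors and the fitting error compared against the previously proposed ansatzes.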
