A Large Chinese Text Dataset in the Wild

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY (2019)

Journal

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY

Volume 34, Issue 3, Pages 509-521

Publisher

SCIENCE PRESS

DOI: 10.1007/s11390-019-1923-y

Keywords

Chinese text dataset; Chinese text detection; Chinese text recognition

Categories

Computer Science, Hardware & Architecture; Computer Science, Software Engineering

Funding

National Natural Science Foundation of China [61822204, 61521002]
Beijing Higher Institution Engineering Research Center
Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology

Abstract

In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character's background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
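The abstract describes each annotation as an underlying character, a bounding box, and six attributes, and reports top-1 accuracy for character recognition. The sketch below illustrates one way such a record and metric could be represented. It is not the dataset's actual file format or the authors' evaluation code: the class name `CharAnnotation`, the `(x, y, width, height)` box layout, the attribute keys, and the function `top1_accuracy` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharAnnotation:
    """One annotated character instance (hypothetical schema, not the real format)."""
    text: str                                # the underlying character
    bbox: Tuple[float, float, float, float]  # assumed layout: (x, y, width, height) in pixels
    # The dataset has six attributes per character; the keys used here are placeholders.
    attributes: Dict[str, bool] = field(default_factory=dict)


def top1_accuracy(predictions: List[str], annotations: List[CharAnnotation]) -> float:
    """Return the fraction of characters whose top-ranked prediction matches the label."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have the same length")
    if not annotations:
        return 0.0
    correct = sum(pred == ann.text for pred, ann in zip(predictions, annotations))
    return correct / len(annotations)


if __name__ == "__main__":
    annotations = [
        CharAnnotation("中", (12.0, 30.0, 24.0, 24.0), {"complex_background": False}),
        CharAnnotation("文", (40.0, 30.0, 24.0, 24.0), {"complex_background": True}),
    ]
    print(top1_accuracy(["中", "字"], annotations))
```

Keeping the attributes in a dictionary makes it easy to filter characters by a single flag and report accuracy per subset, which is one plausible use of per-character attributes.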
