Unsupervised learning of mDTD extraction patterns for Web text mining

INFORMATION PROCESSING & MANAGEMENT (2003)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 39, Issue 4, Pages 623-637

Publisher

PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/S0306-4573(03)00004-9

Keywords

Web text mining; information extraction; extraction pattern; document type definition; sequential covering algorithm

Categories

Computer Science, Information Systems; Information Science & Library Science

Abstract

This paper presents a new extraction pattern, called modified Document Type Definition (mDTD), which relies on analytical interpretation to identify extraction targets from the contents of Web documents. Starting from the conventional DTD used in XML documents, we develop two major extensions: first, we introduce an extended content model with type-specific operators and keywords, and second, we refine the way conventional DTD rules are interpreted. As a result of these two extensions, our mDTD can freely represent HTML structures and extraction targets. The goal of mDTD is to overcome the current major barriers in information extraction, namely domain portability (with minimal human intervention) and high performance. Human experts compose an mDTD as seed rules, and our system then automatically extracts a set of instances from structured documents on the Web using that mDTD. We use the extracted instances as inputs to the Sequential mDTD Learner (SmL), which generates new mDTD rules based on part-of-speech tags and lexical similarity features. This process does not require any hand-annotated corpus. We have experimented with 330 Korean and 220 English Web documents from audio and video shopping sites. The average extraction precision is 91.3% for Korean and 81.9% for English. © 2003 Elsevier Science Ltd. All rights reserved.
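
The keywords name a sequential covering algorithm as the basis for rule learning. The abstract does not give the details of SmL, so the following is only a minimal sketch of generic sequential covering, not the authors' learner. It assumes each extracted instance is a small dictionary of features (here a part-of-speech tag `pos` and the preceding token `prev`, both hypothetical) with a boolean label, and it represents a rule as a plain conjunction of feature tests rather than an mDTD rule.

```python
from typing import Dict, List, Tuple

# Assumed representation (not from the paper): an example is a feature
# dictionary plus a flag saying whether the token is an extraction target.
Features = Dict[str, str]
Example = Tuple[Features, bool]
Rule = Dict[str, str]  # conjunction of "feature == value" tests


def covers(rule: Rule, features: Features) -> bool:
    """A rule covers an example when every one of its tests matches."""
    return all(features.get(k) == v for k, v in rule.items())


def score(rule: Rule, examples: List[Example]) -> Tuple[float, int]:
    """Return (precision, number of positives covered) for a rule."""
    covered = [label for feats, label in examples if covers(rule, feats)]
    positives = sum(covered)
    precision = positives / len(covered) if covered else 0.0
    return precision, positives


def learn_one_rule(examples: List[Example]) -> Rule:
    """Greedily add tests until the rule covers no negative examples."""
    rule: Rule = {}
    while score(rule, examples)[0] < 1.0:
        # Candidate tests come from positives the current rule still covers.
        candidates = {
            (k, v)
            for feats, label in examples
            if label and covers(rule, feats)
            for k, v in feats.items()
            if k not in rule
        }
        if not candidates:
            break
        # Sorting makes tie-breaking deterministic.
        best = max(
            sorted(candidates),
            key=lambda kv: score({**rule, kv[0]: kv[1]}, examples),
        )
        rule[best[0]] = best[1]
    return rule


def sequential_covering(
    examples: List[Example], min_precision: float = 0.8
) -> List[Rule]:
    """Learn rules one at a time, removing the examples each rule covers."""
    rules: List[Rule] = []
    remaining = list(examples)
    while any(label for _, label in remaining):
        rule = learn_one_rule(remaining)
        if score(rule, remaining)[0] < min_precision:
            break
        rules.append(rule)
        remaining = [(f, l) for f, l in remaining if not covers(rule, f)]
    return rules


# Toy instances, invented purely for illustration.
data: List[Example] = [
    ({"pos": "NNP", "prev": "by"}, True),
    ({"pos": "NNP", "prev": "by"}, True),
    ({"pos": "NNP", "prev": "the"}, False),
    ({"pos": "CD", "prev": "$"}, False),
    ({"pos": "NN", "prev": "artist:"}, True),
]
rules = sequential_covering(data)
print(rules)
```

The outer loop is what makes the procedure "sequential": each learned rule accounts for some of the remaining positive instances, those instances are removed, and learning continues on what is left. In the setting described by the abstract, the labeled instances would come from applying the seed mDTD rather than from manual annotation.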
