Article

A Discriminative Model for Semi-Supervised Learning

Journal

JOURNAL OF THE ACM
Volume 57, Issue 3, Pages -

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/1706591.1706599

Keywords

Algorithms; Theory; Machine learning; semi-supervised learning; value of unlabeled data; sample complexity; cover bounds; uniform convergence bounds; structural risk minimization (SRM); data dependent SRM; efficient learning algorithms; multi-view classification; co-training

Funding

  1. NSF [IIS-0312814, CCR-0105488]
  2. IBM
  3. Directorate for Computer & Information Science & Engineering
  4. Division of Computing and Communication Foundations [0830540] Funding Source: National Science Foundation


Supervised learning, that is, learning from labeled examples, is an area of Machine Learning that has reached substantial maturity. It has generated general-purpose and practically successful algorithms, and its foundations are quite well understood and captured by theoretical frameworks such as the PAC-learning model and Statistical Learning theory. However, for many contemporary practical problems such as classifying web pages or detecting spam, there is often additional information available in the form of unlabeled data, which is often much cheaper and more plentiful than labeled data. As a consequence, there has recently been substantial interest in semi-supervised learning (using unlabeled data together with labeled data), since any useful information that reduces the amount of labeled data needed can be a significant benefit. Several techniques have been developed for doing this, along with experimental results on a variety of different learning problems. Unfortunately, the standard frameworks for reasoning about supervised learning do not capture the key aspects and assumptions underlying these semi-supervised learning methods.

In this article, we describe an augmented version of the PAC model designed for semi-supervised learning, which can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model where, in addition to a concept class C, one also proposes a compatibility notion: a type of compatibility that one believes the target concept should have with the underlying distribution of data. Unlabeled data is then potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the size of the search space from the whole set of hypotheses C down to those that, according to one's assumptions, are a priori reasonable with respect to the distribution. As we show, many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in this framework.

After proposing the model, we analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what the key quantities are that these numbers depend on. We also consider the algorithmic question of how to efficiently optimize for natural classes and compatibility notions, and provide several algorithmic results, including an improved bound for Co-Training with linear separators when the distribution satisfies independence given the label.
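To make the compatibility idea concrete, here is a minimal sketch in Python. It is not from the paper: the hypothesis class (1-D thresholds), the margin-style compatibility notion, and the names `estimated_compatibility`, `labeled_error`, and `tau` are all illustrative choices of this sketch. It shows the two-step pattern the abstract describes: use unlabeled data to estimate compatibility and prune the hypothesis space, then run ordinary empirical risk minimization on the labeled data over the pruned set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypotheses: 1-D thresholds h_t(x) = sign(x - t), standing in for the class C.
# Compatibility notion (an assumption of this sketch, in the spirit of the
# model): chi(h_t, x) = 1 if x lies at least `margin` away from the threshold
# t, else 0. compat(h, D) = E_{x~D}[chi(h, x)] is estimated from unlabeled data.

def estimated_compatibility(t, unlabeled_x, margin=0.25):
    """Fraction of unlabeled points that are margin-far from threshold t."""
    return np.mean(np.abs(unlabeled_x - t) >= margin)

def labeled_error(t, x, y):
    """Empirical 0/1 error of h_t on labeled data (labels in {-1, +1})."""
    pred = np.where(x >= t, 1, -1)
    return np.mean(pred != y)

# Synthetic data: two well-separated clusters; the true boundary sits near 0.
unlabeled_x = np.concatenate([rng.normal(-1, 0.2, 500), rng.normal(1, 0.2, 500)])
x_lab = np.array([-1.1, -0.9, 0.95, 1.05])
y_lab = np.array([-1, -1, 1, 1])

candidates = np.linspace(-2, 2, 401)

# Step 1: use unlabeled data alone to shrink the search space to hypotheses
# that are (nearly) fully compatible with the distribution, i.e., thresholds
# that do not cut through dense regions.
tau = 0.99
compatible = [t for t in candidates
              if estimated_compatibility(t, unlabeled_x) >= tau]

# Step 2: ordinary ERM over the (much smaller) compatible set, using labels.
best_t = min(compatible, key=lambda t: labeled_error(t, x_lab, y_lab))
print(f"{len(compatible)} of {len(candidates)} hypotheses survive; chose t = {best_t:.2f}")
```

The point of the sketch is the division of labor: unlabeled data never touches the labels, it only scores hypotheses against the distribution, which is why far fewer labeled examples are then needed to pick a good hypothesis from the surviving set.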
