Results 1  10
of
848
Latent dirichlet allocation
 Journal of Machine Learning Research
, 2003
"... We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a threelevel hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, ..."
Abstract

Cited by 2492 (63 self)
 Add to MetaCart
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a threelevel hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model. 1.
Machine Learning in Automated Text Categorization
 ACM COMPUTING SURVEYS
, 2002
"... The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this p ..."
Abstract

Cited by 1156 (20 self)
 Add to MetaCart
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Focused crawling: a new approach to topicspecific Web resource discovery
, 1999
"... The rapid growth of the WorldWide Web poses unprecedented scaling challenges for generalpurpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevan ..."
Abstract

Cited by 497 (9 self)
 Add to MetaCart
The rapid growth of the WorldWide Web poses unprecedented scaling challenges for generalpurpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a predefined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible adhoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more uptodate. To achieve such goaldirected crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
SemiSupervised Learning Literature Survey
, 2006
"... We review the literature on semisupervised learning, which is an area in machine learning and more generally, artificial intelligence. There has been a whole
spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semisupervised learning. This document is a chapter ..."
Abstract

Cited by 475 (8 self)
 Add to MetaCart
We review the literature on semisupervised learning, which is an area in machine learning and more generally, artificial intelligence. There has been a whole
spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semisupervised learning. This document is a chapter excerpt from the author’s
doctoral thesis (Zhu, 2005). However the author plans to update the online version frequently to incorporate the latest development in the field. Please obtain the latest
version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
Manifold regularization: A geometric framework for learning from labeled and unlabeled examples
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semisupervised framework that incorporates labeled and unlabeled data in a generalpurpose learner. Some transductive graph learning al ..."
Abstract

Cited by 343 (13 self)
 Add to MetaCart
We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semisupervised framework that incorporates labeled and unlabeled data in a generalpurpose learner. Some transductive graph learning algorithms and standard methods including Support Vector Machines and Regularized Least Squares can be obtained as special cases. We utilize properties of Reproducing Kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graphbased approaches) we obtain a natural outofsample extension to novel examples and so are able to handle both transductive and truly semisupervised settings. We present experimental evidence suggesting that our semisupervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework.
A framework for learning predictive structures from multiple tasks and unlabeled data
 Journal of Machine Learning Research
, 2005
"... One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semisupervised learning. Although a number of such methods ar ..."
Abstract

Cited by 325 (3 self)
 Add to MetaCart
One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semisupervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semisupervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semisupervised learning setting. 1.
Clustering with Bregman Divergences
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2005
"... A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergence ..."
Abstract

Cited by 315 (50 self)
 Add to MetaCart
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroidbased parametric clustering approaches, such as classical kmeans and informationtheoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by ratedistortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an ecient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
Unsupervised NamedEntity Extraction from the Web: An Experimental Study
 ARTIFICIAL INTELLIGENCE
, 2005
"... The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domainindependent, and scalable manner. The paper presents an overview of KNOWITALL’s novel architecture and design princip ..."
Abstract

Cited by 283 (38 self)
 Add to MetaCart
The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domainindependent, and scalable manner. The paper presents an overview of KNOWITALL’s novel architecture and design principles, emphasizing its distinctive ability to extract information without any handlabeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but suggested a challenge: How can we improve KNOWITALL’s recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domainspecific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies subclasses in order to boost recall. List Extraction locates lists of class instances, learns a “wrapper ” for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL’s domainindependent methods, the methods also obviate handlabeled training examples. The paper reports on experiments, focused on namedentity extraction, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4fold to 8fold increase in recall, while maintaining high precision, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Using Maximum Entropy for Text Classification
, 1999
"... This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, partofspeech tagging, and text segmentation. The underlying principl ..."
Abstract

Cited by 261 (5 self)
 Add to MetaCart
This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, partofspeech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally nonuniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the re...
The State of Record Linkage and Current Research Problems
 Statistical Research Division, U.S. Census Bureau
, 1999
"... This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful id ..."
Abstract

Cited by 221 (7 self)
 Add to MetaCart
This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin. Keywords: computer matching, modeling, iterative fitting, string comparison, optimization RsSUMs Cet article donne une vue d'ensemble sur les ...