Results 1  10
of
6,486
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1284 (13 self)
 Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify crosscutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
A Systematic Comparison of Various Statistical Alignment Models
 Computational Linguistics
, 2003
"... this article the problem of finding the word alignment of a bilingual sentencealigned corpus by using languageindependent statistical methods. There is a vast literature on this topic, and many different systems have been suggested to solve this problem. Our work follows and extends the methods in ..."
Abstract

Cited by 1250 (58 self)
 Add to MetaCart
this article the problem of finding the word alignment of a bilingual sentencealigned corpus by using languageindependent statistical methods. There is a vast literature on this topic, and many different systems have been suggested to solve this problem. Our work follows and extends the methods introduced by Brown, Della Pietra, Della Pietra, and Mercer (1993) by using refined statistical models for the translation process. The basic idea of this approach is to develop a model of the translation process with the word alignment as a hidden variable of this process, to apply statistical estimation theory to compute the "optimal" model parameters, and to perform alignment search to compute the best word alignment
Combining labeled and unlabeled data with cotraining
, 1998
"... We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a setting in which the description of each example can be partitioned into two distinct views, motivated by the ta ..."
Abstract

Cited by 1244 (28 self)
 Add to MetaCart
We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a setting in which the description of each example can be partitioned into two distinct views, motivated by the task of learning to classify web pages. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be su cient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment amuch smaller set of labeled examples. Speci cally, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PACstyle analysis for this setting, and, more broadly, a PACstyle framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real webpage data indicating that this use of unlabeled examples can lead to signi cant improvement of hypotheses in practice. As part of our analysis, we provide new re
The Mathematics of Statistical Machine Translation: Parameter Estimation
 Computational Linguistics
, 1993
"... this paper, we focus on the translation modeling problem. Before we turn to this problem, however, we should address an issue that may be a concern to some readers: Why do we estimate Pr(e) and Pr(fle) rather than estimate Pr(elf ) directly? We are really interested in this latter probability. Would ..."
Abstract

Cited by 1176 (1 self)
 Add to MetaCart
this paper, we focus on the translation modeling problem. Before we turn to this problem, however, we should address an issue that may be a concern to some readers: Why do we estimate Pr(e) and Pr(fle) rather than estimate Pr(elf ) directly? We are really interested in this latter probability. Wouldn't we reduce our problems from three to two by this direct approach? If we can estimate Pr(fle) adequately, why can't we just turn the whole process around to estimate Pr(elf)? To understand this, imagine that we divide French and English strings into those that are wellformed and those that are illformed. This is not a precise notion. We have in mind that strings like Il va la bibliothque, or I live in a house, or even Colorless green ideas sleep furiously are wellformed, but that strings like lava I1 bibliothque or a I in live house are not. When we translate a French string into English, we can think of ourselves as springing from a wellformed French string into the sea of wellformed English strings with the hope of landing on a good one. It is important, therefore, that our model for Pr(elf ) concentrate its probability as much as possible on wellformed English strings. But it is not important that our model for Pr(fle ) concentrate its probability on wellformed French strings. If we were to reduce the probability of all wellformed French strings by the same factor, spreading the probability thus 265 liberated over illformed French strings, there would be no effect on our translations: the argument that maximizes some function f(x) also maximizes cf(x) for any positive constant c. As we shall see below, our translation models are prodigal, spraying probability all over the place, most of it on illformed French strings. In fact, as we discuss in Section 4.5, two...
A Maximum Entropy approach to Natural Language Processing
 COMPUTATIONAL LINGUISTICS
, 1996
"... The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we des ..."
Abstract

Cited by 1082 (5 self)
 Add to MetaCart
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximumlikelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
Empirical Analysis of Predictive Algorithm for Collaborative Filtering
 Proceedings of the 14 th Conference on Uncertainty in Artificial Intelligence
, 1998
"... 1 ..."
HeadDriven Statistical Models for Natural Language Parsing
, 1999
"... Mitch Marcus was a wonderful advisor. He gave consistently good advice, and allowed an ideal level of intellectual freedom in pursuing ideas and research topics. I would like to thank the members of my thesis committee Aravind Joshi, Mark Liberman, Fernando Pereira and Mark Steedman  for the remar ..."
Abstract

Cited by 955 (16 self)
 Add to MetaCart
Mitch Marcus was a wonderful advisor. He gave consistently good advice, and allowed an ideal level of intellectual freedom in pursuing ideas and research topics. I would like to thank the members of my thesis committee Aravind Joshi, Mark Liberman, Fernando Pereira and Mark Steedman  for the remarkable breadth and depth of their feedback. I had countless impromptu but in uential discussions with Jason Eisner, Dan Melamed and Adwait Ratnaparkhi in the LINC lab. They also provided feedback on many drafts of papers and thesis chapters. Paola Merlo pushed me to think about many new angles of the research. Dimitrios Samaras gave invaluable feedback on many portions of the work. Thanks to James Brooks for his contribution to the work that comprises chapter 5 of this thesis. The community of faculty, students and visitors involved with the Institute for Research in Cognitive Science at Penn provided an intensely varied and stimulating environment. I would like to thank them collectively. Some deserve special mention for discussions that contributed quite directly to this research: Breck Baldwin, Srinivas Bangalore, Dan
Learning Bayesian networks: The combination of knowledge and statistical data
 Machine Learning
, 1995
"... We describe scoring metrics for learning Bayesian networks from a combination of user knowledge and statistical data. We identify two important properties of metrics, which we call event equivalence and parameter modularity. These properties have been mostly ignored, but when combined, greatly simpl ..."
Abstract

Cited by 913 (38 self)
 Add to MetaCart
We describe scoring metrics for learning Bayesian networks from a combination of user knowledge and statistical data. We identify two important properties of metrics, which we call event equivalence and parameter modularity. These properties have been mostly ignored, but when combined, greatly simplify the encoding of a user’s prior knowledge. In particular, a user can express his knowledge—for the most part—as a single prior Bayesian network for the domain. 1
Object class recognition by unsupervised scaleinvariant learning
 In CVPR
, 2003
"... We present a method to learn and recognize object class models from unlabeled and unsegmented cluttered scenes in a scale invariant manner. Objects are modeled as flexible constellations of parts. A probabilistic representation is used for all aspects of the object: shape, appearance, occlusion and ..."
Abstract

Cited by 871 (45 self)
 Add to MetaCart
We present a method to learn and recognize object class models from unlabeled and unsegmented cluttered scenes in a scale invariant manner. Objects are modeled as flexible constellations of parts. A probabilistic representation is used for all aspects of the object: shape, appearance, occlusion and relative scale. An entropybased feature detector is used to select regions and their scale within the image. In learning the parameters of the scaleinvariant object model are estimated. This is done using expectationmaximization in a maximumlikelihood setting. In recognition, this model is used in a Bayesian manner to classify images. The flexible nature of the model is demonstrated by excellent results over a range of datasets including geometrically constrained classes (e.g. faces, cars) and flexible objects (such as animals). 1.