Distributional Clustering Of English Words
- In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics
, 1993
Cited by 478 (24 self)
We describe and evaluate experimentally a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word cooccurrence, and the models evaluated with respect to held-out test data.
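As a rough illustration of the similarity measure named in this abstract, the sketch below builds relative frequency distributions from lists of observed contexts and computes the relative entropy (Kullback-Leibler divergence) between two of them. The function names and the toy context lists are assumptions made for illustration; the abstract does not describe the authors' implementation, smoothing, or the annealing procedure, none of which is reproduced here.

```python
import math
from collections import Counter


def relative_frequencies(contexts):
    """Map each observed context to its relative frequency."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}


def relative_entropy(p, q):
    """Relative entropy D(p || q) in nats.

    Returns infinity when q assigns zero probability to a context
    that p supports (no smoothing is applied in this sketch).
    """
    divergence = 0.0
    for context, p_c in p.items():
        q_c = q.get(context, 0.0)
        if q_c == 0.0:
            return math.inf
        divergence += p_c * math.log(p_c / q_c)
    return divergence


# Hypothetical toy data: verbs observed with two nouns as direct object.
wine = relative_frequencies(["drink", "drink", "pour", "buy"])
beer = relative_frequencies(["drink", "pour", "buy", "buy"])
distance = relative_entropy(wine, beer)
```

Relative entropy is asymmetric and is zero only when the two distributions coincide, so a small value of `distance` means the first word's contexts are well predicted by the second word's distribution.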
Clustering Gene Expression Patterns
, 1999
Cited by 273 (10 self)
Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. The corresponding algorithmic problem is to cluster multi-condition gene expression patterns. In this paper we describe a novel clustering algorithm that was developed for analysis of gene expression data. We define an appropriate stochastic error model on the input, and prove that under the conditions of the model, the algorithm recovers the cluster structure with high probability. The running time of the algorithm on an n-gene dataset is O(n^2 (log(n))^c). We also present a practical heuristic based on the same algorithmic ideas. The heuristic was implemented and its p...
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
, 1998
Cited by 268 (1 self)
The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
A Comparison of Two Learning Algorithms for Text Categorization
- In Third Annual Symposium on Document Analysis and Information Retrieval
, 1994
Cited by 239 (1 self)
This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information retrieval and natural language processing systems. Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results...
Training Algorithms for Linear Text Classifiers
, 1996
Cited by 216 (12 self)
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks.

1 Introduction

Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
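For readers unfamiliar with the two training algorithms named in this abstract, the sketch below shows the textbook single-example update rules for Widrow-Hoff (additive) and EG, exponentiated gradient (multiplicative, with weights renormalized to sum to one), under squared loss. The abstract itself gives no formulas, so the function names, the learning rate `eta`, and the toy vectors are assumptions for illustration and may differ from the exact variants the paper evaluates.

```python
import math


def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def widrow_hoff_step(w, x, y, eta):
    """Additive update: w_i <- w_i - 2 * eta * (w.x - y) * x_i."""
    error = dot(w, x) - y
    return [w_i - 2.0 * eta * error * x_i for w_i, x_i in zip(w, x)]


def eg_step(w, x, y, eta):
    """Multiplicative update followed by renormalization.

    w_i <- w_i * exp(-2 * eta * (w.x - y) * x_i) / Z,
    where Z makes the new weights sum to one. Assumes w is a
    positive vector that already sums to one.
    """
    error = dot(w, x) - y
    scaled = [w_i * math.exp(-2.0 * eta * error * x_i) for w_i, x_i in zip(w, x)]
    z = sum(scaled)
    return [s / z for s in scaled]


# Hypothetical toy example: one document with a single active feature,
# labelled as relevant (y = 1).
x, y = [1.0, 0.0], 1.0
w_additive = widrow_hoff_step([0.0, 0.0], x, y, eta=0.25)
w_multiplicative = eg_step([0.5, 0.5], x, y, eta=0.5)
```

Both rules move the prediction `dot(w, x)` toward the target `y`; they differ in that Widrow-Hoff changes weights by adding a multiple of the feature vector, while EG rescales each weight and keeps the vector on the probability simplex.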
Linear programming in linear time when the dimension is fixed
- J. ACM
, 1984
Cited by 168 (13 self)
It is demonstrated that the linear programming problem in d variables and n constraints can be solved in O(n) time when d is fixed. This bound follows from a multidimensional search technique which is applicable for quadratic programming as well. There is also developed an algorithm that is polynomial in both n and d provided d is bounded by a certain slowly growing function of n.
Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems - computations on matrices; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - geometrical problems and computations; sorting and searching; G.1.6 [Mathematics of Computing]: Optimization - linear programming
Toward Machine Emotional Intelligence: Analysis of Affective Physiological State
- IEEE TRANSACTIONS PATTERN ANALYSIS AND MACHINE INTELLIGENCE
, 200
Interactive learning using a "society of models"
- SUBMITTED TO SPECIAL ISSUE OF PATTERN RECOGNITION ON IMAGE DATABASE: CLASSIFICATION AND RETRIEVAL
Cited by 132 (10 self)
Digital library access is driven by features, but features are often context-dependent and noisy, and their relevance for a query is not always obvious. This paper describes an approach for utilizing many data-dependent, user-dependent, and task-dependent features in a semi-automated tool. Instead of requiring universal similarity measures or manual selection of relevant features, the approach provides a learning algorithm for selecting and combining groupings of the data, where groupings can be induced by highly specialized and context-dependent features. The selection process is guided by a rich example-based interaction with the user. The inherent combinatorics
Finding interesting associations without support pruning
- In ICDE
, 2000
Cited by 111 (13 self)
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
A Data-Driven Reflectance Model
- ACM TRANSACTIONS ON GRAPHICS
, 2003
Cited by 108 (5 self)
We present a generative model for isotropic bidirectional reflectance distribution functions (BRDFs) based on acquired reflectance data. Instead of using analytical reflectance models, we represent each BRDF as a dense set of measurements. This allows us to interpolate and extrapolate in the space of acquired BRDFs to create new BRDFs. We treat each acquired BRDF as a single high-dimensional vector taken from a space of all possible BRDFs. We apply both linear (subspace) and non-linear (manifold) dimensionality reduction tools in an effort to discover a lower-dimensional representation that characterizes our measurements. We let users define perceptually meaningful parametrization directions to navigate in the reduced-dimension BRDF space. On the low-dimensional manifold, movement along these directions produces novel but valid BRDFs.

