Results 1 - 10
of
29
Concept Decompositions for Large Sparse Text Data using Clustering
- Machine Learning
, 2000
"... . Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99 ..."
Abstract
-
Cited by 231 (23 self)
- Add to MetaCart
. Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
A divisive information-theoretic feature clustering algorithm for text classification
- Journal of Machine Learning Research
, 2003
"... High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new informationth ..."
Abstract
-
Cited by 82 (14 self)
- Add to MetaCart
High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new informationtheoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such “distributional clustering ” of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the “within-cluster Jensen-Shannon divergence ” while simultaneously maximizing the “between-cluster Jensen-Shannon divergence”. In comparison to the previously proposed agglomerative strategies our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
Efficient Clustering Of Very Large Document Collections
, 2001
"... An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical da ..."
Abstract
-
Cited by 74 (9 self)
- Add to MetaCart
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory ecient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-ecient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented - a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
Clustering Hypertext With Applications To Web Searching
- In Proceedings of the 11th ACM Conference on Hypertext and Hypermedia
, 2000
"... : Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using w ..."
Abstract
-
Cited by 45 (0 self)
- Add to MetaCart
: Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document) , and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representatives of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Keywords: cluster annotation, feature comb...
Enhanced Word Clustering for Hierarchical Text Classification
, 2002
"... In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm achieves higher classification accuracy especially at lower number of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from Dmoz Open Directory.
A formal framework for positive and negative detection schemes
- IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS PART B: CYBERNETICS
, 2004
"... In anomaly detection, the normal behavior of a process is characterized by a model, and deviations from the model are called anomalies. In behavior-based approaches to anomaly detection, the model of normal behavior is constructed from an observed sample of normally occurring patterns. Models of nor ..."
Abstract
-
Cited by 30 (6 self)
- Add to MetaCart
In anomaly detection, the normal behavior of a process is characterized by a model, and deviations from the model are called anomalies. In behavior-based approaches to anomaly detection, the model of normal behavior is constructed from an observed sample of normally occurring patterns. Models of normal behavior can represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). A formal framework is given for analyzing the tradeoffs between positive and negative detection schemes in terms of the number of detectors needed to maximize coverage. For realistically sized problems, the universe of possible patterns is too large to represent exactly (in either the positive or negative scheme). Partial matching rules generalize the set of allowable (or unallowable) patterns, and the choice of matching rule affects the tradeoff between positive and negative detection. A new match rule is introduced, called-chunks, and the generalizations induced by different partial matching rules are characterized in terms of the crossover closure. Permutations of the representation can be used to achieve more precise discrimination between normal and anomalous patterns. Quantitative results are given for the recognition ability of contiguous-bits matching together with permutations.
Feature Weighting in k-Means Clustering
- Machine Learning
, 2002
"... Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable ..."
Abstract
-
Cited by 26 (0 self)
- Add to MetaCart
Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
Visualizing Class Structure of Multidimensional Data
- Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics
, 1998
"... We consider the problem of visualizing multidimensional data that has been categorized into classes. Our goal in visualizing is to quickly absorb inter- and intra-class relationships. Towards this end, we introduce class-preserving projections of the multidimensional data onto twodimensional planes ..."
Abstract
-
Cited by 26 (5 self)
- Add to MetaCart
We consider the problem of visualizing multidimensional data that has been categorized into classes. Our goal in visualizing is to quickly absorb inter- and intra-class relationships. Towards this end, we introduce class-preserving projections of the multidimensional data onto twodimensional planes which can then be displayed on a computer screen. These class-preserving projections maintain the high-dimensional class structure, and are closely related to Fisher's linear discriminants. By displaying sequences of such two-dimensional projections and by moving continuously from one projection to the next, we can create illusions of smooth motion through a multidimensional display. Such sequences are termed class tours. We illustrate the proposed ideas by various computer simulations on the classical Iris plant dataset and a text corpus of book reviews. 1 Introduction It is often desirable to classify data into meaningful categories. The Yahoo! hierarchy of the World-Wide Web is a prime...
Class Visualization of High-Dimensional Data with Applications
, 2003
"... Consider the problem of visualizing high-dimensional data that has been categorized into various classes. Our goal in visualizing is to quickly absorb inter-class and intra-class relationships. Towards this end, class-preserving projections of the multidimensional data onto twodimensional planes, ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Consider the problem of visualizing high-dimensional data that has been categorized into various classes. Our goal in visualizing is to quickly absorb inter-class and intra-class relationships. Towards this end, class-preserving projections of the multidimensional data onto twodimensional planes, which can be displayed on a computer screen, are introduced. These class-preserving projections maintain the high-dimensional class structure, and are closely related to Fisher's linear discriminants. By displaying sequences of such two-dimensional projections and by moving continuously from one projection to the next, an illusion of smooth motion through a multidimensional display can be created. We call such sequences class tours. Furthermore, we overlay class-similarity graphs on our two-dimensional projections to capture the distance relationships in the original high-dimensional space. We illustrate

