Results 1-10 of 36
Concept Decompositions for Large Sparse Text Data using Clustering
Machine Learning, 2000
Cited by 301 (27 self)

Abstract: Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
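The spherical k-means procedure this abstract describes — unit-normalized document vectors, cosine-similarity assignment, and concept vectors as unit-normalized centroids — can be sketched in a few lines. This is an illustrative reconstruction from the abstract, not the authors' code; it assumes dense NumPy arrays for brevity, whereas the paper's setting is sparse:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Sketch of spherical k-means: rows of X are document vectors,
    normalized to unit Euclidean norm; each concept vector is the
    centroid of its cluster, renormalized to unit norm."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # initialize concept vectors from k randomly chosen documents
    concepts = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each document to the concept of highest cosine similarity
        labels = (X @ concepts.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                concepts[j] = c / np.linalg.norm(c)
    return labels, concepts
```

Stacking the k concept vectors and least-squares-projecting the document matrix onto their span would give a concept decomposition in the sense sketched at the end of the abstract.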
A divisive informationtheoretic feature clustering algorithm for text classification
Journal of Machine Learning Research, 2003
Cited by 108 (16 self)

Abstract: High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) suboptimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
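A divisive distributional word-clustering step of the kind the abstract describes can be sketched as follows. Each word is represented by its class distribution p(C|w); words join the cluster whose mean distribution is closest, and cluster means are prior-weighted averages. This sketch uses plain KL divergence for the assignment step as a simplification of the paper's weighted Jensen-Shannon objective, and the function names are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (smoothed)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def divisive_word_cluster(P, w, k, iters=30, seed=0):
    """Sketch of divisive distributional clustering of words.
    P[i] is the class distribution p(C|w_i); w[i] is the word prior.
    Assign each word to the nearest cluster mean in KL divergence,
    then recompute means as prior-weighted averages."""
    rng = np.random.default_rng(seed)
    means = P[rng.choice(len(P), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.array([np.argmin([kl(p, m) for m in means]) for p in P])
        for j in range(k):
            mask = labels == j
            if mask.any():
                means[j] = np.average(P[mask], axis=0, weights=w[mask])
    return labels, means
```

Because each reassignment and mean update can only decrease the (weighted) divergence objective, iterations of this kind monotonically decrease the criterion, matching the convergence claim in the abstract.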
Efficient Clustering Of Very Large Document Collections
2001
Cited by 92 (11 self)

Abstract: An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multithreaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
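The core of the vector-space-model construction the abstract refers to is building a sparse term-document matrix while storing only nonzero counts. A minimal single-threaded sketch in CSR (compressed sparse row) layout is below; whitespace tokenization stands in for real preprocessing, and this is not the authors' multithreaded implementation:

```python
def build_csr(docs):
    """Sketch: build a CSR-format sparse term-document matrix in one
    pass over the collection. Only nonzero term counts are stored,
    so memory is proportional to the number of nonzeros, not to
    n_docs * vocabulary_size."""
    vocab = {}                     # term -> column index
    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = {}
        for term in doc.split():   # stand-in for real tokenization
            j = vocab.setdefault(term, len(vocab))
            counts[j] = counts.get(j, 0) + 1
        for j, c in sorted(counts.items()):
            indices.append(j)
            data.append(c)
        indptr.append(len(indices))
    return data, indices, indptr, vocab
```

A clustering pass that iterates only over each row's `indptr[i]:indptr[i+1]` slice touches each nonzero a constant number of times, which is how linear time in the collection size becomes possible.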
Clustering Hypertext With Applications To Web Searching
In Proceedings of the 11th ACM Conference on Hypertext and Hypermedia, 2000
Cited by 48 (0 self)

Abstract: Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representative of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results.
Enhanced Word Clustering for Hierarchical Text Classification
2002
Cited by 44 (2 self)

Abstract: In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower numbers of features [2, 28]. However, the existing clustering techniques are agglomerative in nature and result in (i) suboptimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm achieves higher classification accuracy, especially at lower numbers of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Dmoz Open Directory.
A formal framework for positive and negative detection schemes
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2004
Cited by 41 (7 self)

Abstract: In anomaly detection, the normal behavior of a process is characterized by a model, and deviations from the model are called anomalies. In behavior-based approaches to anomaly detection, the model of normal behavior is constructed from an observed sample of normally occurring patterns. Models of normal behavior can represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). A formal framework is given for analyzing the tradeoffs between positive and negative detection schemes in terms of the number of detectors needed to maximize coverage. For realistically sized problems, the universe of possible patterns is too large to represent exactly (in either the positive or negative scheme). Partial matching rules generalize the set of allowable (or unallowable) patterns, and the choice of matching rule affects the tradeoff between positive and negative detection. A new match rule, called chunks, is introduced, and the generalizations induced by different partial matching rules are characterized in terms of the crossover closure. Permutations of the representation can be used to achieve more precise discrimination between normal and anomalous patterns. Quantitative results are given for the recognition ability of contiguous-bits matching together with permutations.
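The contiguous-bits partial matching rule the abstract evaluates is simple to state: a detector matches a pattern if the two strings agree on at least r consecutive positions. A sketch of that rule and its use in a negative-detection scheme follows; the function names are illustrative, and the paper's new chunks rule is not reproduced here:

```python
def r_contiguous_match(detector, pattern, r):
    """r-contiguous-bits rule: True if detector and pattern agree on
    at least r consecutive positions (strings of equal length)."""
    run = best = 0
    for d, p in zip(detector, pattern):
        run = run + 1 if d == p else 0
        best = max(best, run)
    return best >= r

def is_anomalous(pattern, detectors, r):
    """Negative scheme: detectors cover non-self strings, so a pattern
    is flagged anomalous if any detector matches it."""
    return any(r_contiguous_match(d, pattern, r) for d in detectors)
```

Lower r makes each detector more general (it covers more patterns), which is exactly the generalization-versus-precision tradeoff the framework quantifies.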
Feature Weighting in kMeans Clustering
Machine Learning, 2002
Cited by 35 (0 self)

Abstract: Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly different) relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground-truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
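Ideas (i)-(iv) above can be sketched concretely: each object is a tuple of feature vectors, per-space distortions are combined with convex weights, and the assignment step picks the center with the smallest combined distortion. Squared Euclidean distance stands in for each space's distortion measure, and the function names are illustrative, not the paper's:

```python
import numpy as np

def convex_distortion(x, c, weights):
    """Convex combination of per-feature-space distortions between an
    object x and a center c, each a tuple of feature vectors (idea iii).
    Squared Euclidean distance stands in for each space's distortion."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * float(np.sum((a - b) ** 2))
               for w, a, b in zip(weights, x, c))

def assign(objects, centers, weights):
    """One convex k-means assignment step: each object joins the
    center of smallest combined distortion (idea iv)."""
    return [min(range(len(centers)),
                key=lambda j: convex_distortion(obj, centers[j], weights))
            for obj in objects]
```

Idea (v) then amounts to an outer search over the weight simplex, scoring each resulting clustering by its within- versus between-cluster dispersion.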
Visualizing Class Structure of Multidimensional Data
Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics, 1998
Cited by 28 (5 self)

Abstract: We consider the problem of visualizing multidimensional data that has been categorized into classes. Our goal in visualizing is to quickly absorb inter- and intra-class relationships. Towards this end, we introduce class-preserving projections of the multidimensional data onto two-dimensional planes which can then be displayed on a computer screen. These class-preserving projections maintain the high-dimensional class structure, and are closely related to Fisher's linear discriminants. By displaying sequences of such two-dimensional projections and by moving continuously from one projection to the next, we can create illusions of smooth motion through a multidimensional display. Such sequences are termed class tours. We illustrate the proposed ideas by various computer simulations on the classical Iris plant data set and a text corpus of book reviews.
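One concrete way to realize a class-preserving projection of the kind described above is to project onto the plane spanned by differences of class means, which keeps those class means as separated in 2D as they are in the original space. This is an illustrative sketch consistent with the abstract, not the authors' code, and it assumes at least three classes:

```python
import numpy as np

def class_preserving_projection(X, y):
    """Sketch: project X onto the plane spanned by the differences of
    three class means. Distances among those means are preserved, so
    the chosen classes stay visually separated in the 2D display."""
    classes = np.unique(y)[:3]
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    D = (means[1:] - means[0]).T      # (d, 2): two mean-difference directions
    Q, _ = np.linalg.qr(D)            # orthonormal basis for the plane
    return X @ Q                      # (n, 2) screen coordinates
```

A class tour would then interpolate smoothly between the planes obtained from different triples of classes, giving the illusion of continuous motion through the high-dimensional display.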
Class Visualization of HighDimensional Data with Applications
2003
Cited by 17 (0 self)

Abstract: Consider the problem of visualizing high-dimensional data that has been categorized into various classes. Our goal in visualizing is to quickly absorb inter-class and intra-class relationships. Towards this end, class-preserving projections of the multidimensional data onto two-dimensional planes, which can be displayed on a computer screen, are introduced. These class-preserving projections maintain the high-dimensional class structure, and are closely related to Fisher's linear discriminants. By displaying sequences of such two-dimensional projections and by moving continuously from one projection to the next, an illusion of smooth motion through a multidimensional display can be created. We call such sequences class tours. Furthermore, we overlay class-similarity graphs on our two-dimensional projections to capture the distance relationships in the original high-dimensional space. We illustrate...