Survey of clustering data mining techniques, 2002
Cited by 177 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Discovering Word Senses from Text
In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2002
Cited by 159 (10 self)
Inventories of manually compiled dictionaries usually serve as a source for word senses. However, they often include many rare senses while missing corpus/domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning words to their most similar clusters. After assigning an element to a cluster, we remove their overlapping features from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering.
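The assignment step described in this abstract (pick the most similar cluster, then strip the overlapping features from the word before looking for further senses) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity, the `threshold` and `max_senses` parameters, and the function names are all assumptions made for illustration, and the committee-discovery phase is omitted entirely.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_senses(word_vec, centroids, threshold=0.3, max_senses=5):
    """Assign a word to clusters one at a time (hypothetical helper).

    After each assignment, the features shared with the chosen cluster
    centroid are removed from the word's vector, so that later rounds can
    surface less frequent senses and avoid duplicate ones.
    """
    remaining = dict(word_vec)
    senses = []
    while remaining and len(senses) < max_senses:
        candidates = [
            (cosine(remaining, centroid), name)
            for name, centroid in centroids.items()
            if name not in senses
        ]
        if not candidates:
            break
        similarity, best = max(candidates)
        if similarity < threshold:  # assumed stopping rule
            break
        senses.append(best)
        for feature in centroids[best]:
            remaining.pop(feature, None)  # drop overlapping features
    return senses

# Toy data, invented for illustration only.
centroids = {
    "fruit": {"eat": 1.0, "juice": 1.0},
    "company": {"stock": 1.0, "ceo": 1.0},
}
apple = {"eat": 2.0, "juice": 1.0, "stock": 1.0, "ceo": 1.0}
print(assign_senses(apple, centroids))
```

In the toy example the dominant "fruit" features are removed after the first assignment, which is what lets the weaker "company" sense score highly in the second round.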
The MetaCrawler Architecture for Resource Aggregation on the Web
IEEE Expert, 1997
Cited by 155 (1 self)
The MetaCrawler Softbot is a parallel Web search service that has been available at the University of Washington since June of 1995. It provides users with a single interface with which they can query popular general-purpose Web search services, such as Lycos[6] and AltaVista[1], and has some sophisticated features that allow it to obtain results of much higher quality than simply regurgitating the output from each search
Recognizing text genres with simple metrics using discriminant analysis
In Proceedings of the 15th Conference on Computational Linguistics, 1994
Cited by 132 (14 self)
A simple method for categorizing texts into pre-determined text genre categories using the standard statistical technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific for a certain corpus or information stream, and combine them into a small number of functions, with the parameters weighted on the basis of how useful they are for discriminating text genres. An application to information retrieval is discussed.
Text Types
There are different types of text. Texts “about” the same thing may be in differing genres, of different types, and of varying quality. Texts vary along several parameters, all relevant for the general information retrieval problem of matching reader needs and texts. Given this variation, in a text retrieval context the problems are (i) identifying genres, and (ii) choosing criteria to cluster texts of the same genre, with predictable precision and recall. This should not be confused with the issue of identifying topics, and choosing criteria that discriminate one topic from another. Although not orthogonal to genre-dependent variation, the variation that relates directly to content and topic is along other dimensions. Naturally, there is co-variance. Texts about certain topics may only occur in certain genres, and texts in certain genres may only treat certain topics; most topics do, however, occur in several
Incremental Clustering and Dynamic Information Retrieval, 1997
Cited by 129 (3 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
1 Introduction
We consider the following problem: as a sequence of points from a metric...
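To make the incremental setting concrete, here is a minimal sketch of the dual problem mentioned in the abstract: points arrive one at a time and each cluster must stay within a fixed size. It is a naive illustration only, not one of the paper's algorithms and with no performance guarantee. The function name `insert_point`, the Euclidean metric, and the `radius` parameter are assumptions.

```python
import math

def insert_point(centers, point, radius):
    """Insert one point and return the index of its cluster.

    The point joins the nearest existing center if it lies within `radius`;
    otherwise it opens a new cluster and becomes that cluster's center.
    Every member is within `radius` of its center, so by the triangle
    inequality each cluster has diameter at most 2 * radius.
    """
    best_index, best_distance = None, None
    for i, center in enumerate(centers):
        d = math.dist(center, point)  # Euclidean metric, assumed here
        if best_distance is None or d < best_distance:
            best_index, best_distance = i, d
    if best_index is not None and best_distance <= radius:
        return best_index
    centers.append(point)
    return len(centers) - 1

# Points arrive as a stream; earlier assignments are never revisited.
stream = [(0, 0), (0.5, 0), (3, 0), (3.2, 0.1), (0, 0.9)]
centers = []
labels = [insert_point(centers, p, radius=1.0) for p in stream]
print(labels, centers)
```

Because centers are fixed when they are first created, the result depends on arrival order, which is one reason simple rules of this kind can use more clusters than necessary.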
Dimensions of Meaning, 1992
Cited by 125 (4 self)
The representation of documents and queries as vectors in a high-dimensional space is well-established in information retrieval [1]. This paper proposes to represent the semantics of words and contexts in a text as vectors. The dimensions of the space are words, and the initial vectors are determined by the words occurring close to the entity to be represented, which implies that the space has several thousand dimensions (words). This makes the vector representations (which are dense) too cumbersome to use directly. Therefore, dimensionality reduction by means of a singular value decomposition is employed. The paper analyzes the structure of the vector representations and applies them to word sense disambiguation and thesaurus induction.
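The pipeline outlined in this abstract (co-occurrence vectors whose dimensions are words, followed by dimensionality reduction with a singular value decomposition) can be sketched as follows. This is a toy illustration under assumptions, not the paper's setup: the window size, the raw counts without weighting, the function names, and the example sentence are all invented here, and NumPy is assumed to be available.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Count how often each word occurs within `window` positions of another."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[index[word], index[tokens[other]]] += 1
    return vocab, counts

def reduce_dimensions(counts, k):
    """Project each word vector onto the top k singular directions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, counts = cooccurrence_matrix(tokens)
vectors = reduce_dimensions(counts, k=3)
print(vectors.shape)  # one k-dimensional row per vocabulary word
```

With a realistic vocabulary the count matrix has thousands of columns, and keeping only the leading singular directions is what makes the word vectors compact enough to compare directly.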
The FindMe Approach to Assisted Browsing
IEEE Expert, 1997
Cited by 116 (7 self)
While the explosion of on-line information has brought new opportunities for finding and using electronic data, it has also brought to the forefront the problem of isolating useful information and making sense of large multidimensional information spaces. In response to this problem, we have developed an approach to building data "tour guides," called FindMe systems. These programs know enough about an information space to be able to help a user navigate through it, making sure that the user not only comes away with items of useful information but also insights into the structure of the information space itself. In these systems, we have combined ideas of instance-based browsing, structuring retrieval around the critiquing of previously retrieved examples; and retrieval strategies, knowledge-based heuristics for finding relevant information. We illustrate these techniques with examples of working FindMe systems, and describe the similarities and differences between them.
1 Introduction...
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
Data Mining and Knowledge Discovery, 2002
Cited by 116 (4 self)
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections, 1993
Cited by 113 (5 self)
The Scatter/Gather document browsing method uses fast document clustering to produce table-of-contents-like outlines of large document collections. Previous work [1] developed linear-time document clustering algorithms to establish the feasibility of this method over moderately large collections. However, even linear-time algorithms are too slow to support interactive browsing of very large collections such as Tipster, the DARPA standard text retrieval evaluation collection. We present a scheme that supports constant interaction-time Scatter/Gather of arbitrarily large collections after near-linear time preprocessing. This involves the construction of a cluster hierarchy. A modification of Scatter/Gather employing this scheme, and an example of its use over the Tipster collection are presented.
1 Background
Our previous work on Scatter/Gather [1] has shown that document clustering can be used as a first-class tool for browsing large text collections. Browsing is distinguished from sea...
Creating efficient codebooks for visual recognition
In Proceedings of the IEEE International Conference on Computer Vision, 2005
Cited by 111 (12 self)
Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely (‘textons’) or sparsely (‘bags of features’ based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint based ones on these tasks and that SVM or Mutual Information based feature selection starting from a dense codebook further improves the performance.

