Results 1 - 10
of
25
Data Clustering: A Review
- ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract
-
Cited by 912 (9 self)
- Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Principal Direction Divisive Partitioning
- Data Mining and Knowledge Discovery
, 1997
"... We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high dimensional Euclidean space (i.e. in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by re ..."
Abstract
-
Cited by 81 (18 self)
- Add to MetaCart
We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high dimensional Euclidean space (i.e. in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by repeatedly splitting clusters into smaller clusters. The splits are not based on any distance or similarity measure. The documents are assembled in to a matrix which is very sparse. It is this sparsity that permits the algorithm to be very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
WebACE: A Web Agent for Document Categorization and Exploration
, 1998
"... We propose an agent for exploring and categorizing documents on the World Wide Web. The heart of the agent is an automatic categorization of a set of documents, combined with a process for generating new queries used to search for new related documents and filtering the resulting documents to extrac ..."
Abstract
-
Cited by 57 (16 self)
- Add to MetaCart
We propose an agent for exploring and categorizing documents on the World Wide Web. The heart of the agent is an automatic categorization of a set of documents, combined with a process for generating new queries used to search for new related documents and filtering the resulting documents to extract the set of documents most closely related to the starting set. The document categories are not given a-priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and search component of the agent. 1 Introduction The World Wide Web is a vast resource of information and services that continues to grow rapidly. Powerful search engines have been developed to aid in locating unfamiliar documents by category, contents, or subject. Relying on large indexes to documents located on the Web, search engines determine the URLs of those documents satisfying a use...
Partitioning-based clustering for web document categorization. Decision Support Systems
, 1999
"... Clustering techniques have been used by manyintelligent software agents in order to retrieve, lter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other simi ..."
Abstract
-
Cited by 56 (12 self)
- Add to MetaCart
Clustering techniques have been used by manyintelligent software agents in order to retrieve, lter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to de ne a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classi cation. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can e ectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-speci ed ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such ashierarchical agglomeration clustering, and Bayesian classi cation methods, such as AutoClass.
The Parameterized Complexity of Sequence Alignment and Consensus
, 1994
"... The Longest common subsequence problem is examined from the point of view of parameterized computational complexity. There are several different ways in which parameters enter the problem, such as the number of sequences to be analyzed, the length of the common subsequence, and the size of the alpha ..."
Abstract
-
Cited by 35 (13 self)
- Add to MetaCart
The Longest common subsequence problem is examined from the point of view of parameterized computational complexity. There are several different ways in which parameters enter the problem, such as the number of sequences to be analyzed, the length of the common subsequence, and the size of the alphabet. Lower bounds on the complexity of this basic problem imply lower bounds on a number of other sequence alignment and consensus problems. At issue in the theory of parameterized complexity is whether a problem which takes input (x; k) can be solved in time f(k) \Delta n ff where ff is independent of k (termed fixed-parameter tractability). It can be argued that this is the appropriate asymptotic model of feasible computability for problems for which a small range of parameter values covers important applications --- a situation which certainly holds for many problems in biological sequence analysis. Our main results show that: (1) The Longest Common Subsequence (LCS) parameterized by t...
Clutter reduction in multi-dimensional data visualization using dimension reordering
- In INFOVIS ’04: Proceedings of the IEEE Symposium on Information Visualization (INFOVIS’04
, 2004
"... Visual clutter denotes a disordered collection of graphical entities in information visualization. Clutter can obscure the structure present in the data. Even in a small dataset, clutter can make it hard for the viewer to find patterns, relationships and structure. In this paper, we define visual cl ..."
Abstract
-
Cited by 32 (1 self)
- Add to MetaCart
Visual clutter denotes a disordered collection of graphical entities in information visualization. Clutter can obscure the structure present in the data. Even in a small dataset, clutter can make it hard for the viewer to find patterns, relationships and structure. In this paper, we define visual clutter as any aspect of the visualization that interferes with the viewer’s understanding of the data, and present the concept of clutter-based dimension reordering. Dimension order is an attribute that can significantly affect a visualization’s expressiveness. By varying the dimension order in a display, it is possible to reduce clutter without reducing information content or modifying the data in any way. Clutter reduction is a display-dependent task. In this paper, we follow a three-step procedure for four different visualization techniques. For each display technique, first, we determine what constitutes clutter in terms of display properties; then we design a metric to measure visual clutter in this display; finally we search for an order that minimizes the clutter in a display.
Efficient phrase-based document indexing for Web document clustering
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2004
"... Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering ..."
Abstract
-
Cited by 31 (1 self)
- Add to MetaCart
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Maximum Likelihood Modelling Of Pronunciation Variation
, 1998
"... This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. The current work is based on a procedure for data-driven optimisation of the pronunciation dictionary which cr ..."
Abstract
-
Cited by 23 (0 self)
- Add to MetaCart
This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. The current work is based on a procedure for data-driven optimisation of the pronunciation dictionary which creates a single baseform per word in the vocabulary, subject to a maximum likelihood (ML) criterion [1]. In the current approach, we extend the ML formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The method improves the sub-word description of the vocabulary words, and has been shown to improve recognition performance on the DARPA Resource Management (RM) task. 1. INTRODUCTION Most contemporary speech recognisers employ some kind of sub-word unit as the basic modelling entity. Such recognis...
Web Page Categorization and Feature Selection Using Association Rule And . . .
- IN 7TH WORKSHOP ON INFORMATION TECHNOLOGIES AND SYSTEMS
, 1997
"... Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other s ..."
Abstract
-
Cited by 20 (11 self)
- Add to MetaCart
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques. which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discov- ering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, AutoClass.
An effective algorithm for string correction using generalized edit distances-III. Computational complexity of Xhe algorithm and some app~cations Infor~tion Sci
"... This paper deals with the problem of estimating a transmitted string X, from the corresponding received string Y, which is a noisy version of X,. We assume that Y contains*any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of X, were deleted in transmissi ..."
Abstract
-
Cited by 18 (10 self)
- Add to MetaCart
This paper deals with the problem of estimating a transmitted string X, from the corresponding received string Y, which is a noisy version of X,. We assume that Y contains*any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of X, were deleted in transmission. We have shown that for channels which cause independent errors, and whose error probabilities exceed those of noisy strings studied in the literature [ 121, at least 99.5 % of the erroneous strings will not contain two consecutive deletion errors. The best estimate X * of X, is defined as that element of H which minimizes the generalized Levenshtein distance D ( X/Y) between X and Y. Using dynamic programming principles, an algorithm is presented which yields X+ without computing individually the distances between every word of H and Y. Though this algorithm requires more memory, it can be shown that it is, in general, computationally less complex than all other existing algorithms which perform the same task. I.

