Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 351 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique ...
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
"... This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This ..."
Abstract

Cited by 170 (6 self)
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate ...
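The two-part code idea this abstract describes can be made concrete with a small numerical sketch. The example below is my own illustration, not the paper's formulation: it uses synthetic data, a Gaussian code length for residuals, and the common (k/2)·log2(n) approximation for the parameter cost, and selects a polynomial degree by minimizing the total description length.

```python
import numpy as np

# Hedged toy example of two-part MDL model selection:
# total description length = L(model) + L(data | model).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x**2 - x + rng.normal(scale=0.05, size=x.size)  # quadratic + noise

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    # Data cost: Gaussian code length for residuals, up to additive constants.
    data_bits = 0.5 * n * np.log2(max(np.mean(resid**2), 1e-12))
    # Model cost: (k/2) * log2(n) bits per fitted parameter (crude assumption).
    model_bits = 0.5 * k * np.log2(n)
    return data_bits + model_bits

best = min(range(1, 8), key=description_length)
print("degree chosen by MDL:", best)
```

Higher degrees keep shrinking the residuals, but past the true degree the savings no longer pay for the extra parameter bits, so the criterion settles near the generating model.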
Iterative Optimization and Simplification of Hierarchical Clusterings
 Journal of Artificial Intelligence Research
, 1995
"... Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high qual ..."
Abstract

Cited by 117 (2 self)
Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a 'tentative' clustering for initial examination, followed by iterative optimization, which continues to search in the background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been constructed ...
Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry
 Proteins
, 2003
"... An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Foldrecognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins hom ..."
Abstract

Cited by 61 (12 self)
An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Fold-recognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins homologous to the target. Remote homologs which may have significant structural similarity are often not detectable by sequence similarities alone. To address this, we incorporated predicted local structure, a generalization of secondary structure, into two-track profile HMMs. We did not rely on a simple helix-strand-coil definition of secondary structure, ...
Unsupervised Learning Using MML
 In Machine Learning: Proceedings of the Thirteenth International Conference (ICML '96)
, 1996
"... This paper discusses the unsupervised learning problem. An important part of the unsupervised learning problem is determining the number of constituent groups (components or classes) which best describes some data. We apply the Minimum Message Length (MML) criterion to the unsupervised learning prob ..."
Abstract

Cited by 50 (6 self)
This paper discusses the unsupervised learning problem. An important part of the unsupervised learning problem is determining the number of constituent groups (components or classes) which best describes some data. We apply the Minimum Message Length (MML) criterion to the unsupervised learning problem, modifying an earlier such MML application. We give an empirical comparison of criteria prominent in the literature for estimating the number of components in a data set. We conclude that the Minimum Message Length criterion performs better than the alternatives on the data considered here for unsupervised learning tasks.
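The component-counting question this abstract studies can be sketched with a crude two-part message length over a toy one-dimensional mixture. The bit costs and the simple k-means fit below are illustrative stand-ins of my own, not the paper's MML formulation or the Snob program: the message states each point's class, the class parameters, and then the residuals.

```python
import numpy as np

# Toy data: two well-separated 1-D Gaussian components.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 50), rng.normal(5.0, 0.5, 50)])

def kmeans_1d(x, k, iters=25):
    # Quantile initialization, then standard alternating updates.
    centers = np.quantile(x, np.linspace(0, 1, k))
    labels = np.zeros(x.size, dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return centers, labels

def message_length(x, k):
    centers, labels = kmeans_1d(x, k)
    mse = np.mean((x - centers[labels]) ** 2)
    n = x.size
    return (n * np.log2(k)                         # encode each point's class
            + 0.5 * k * np.log2(n)                 # encode class parameters
            + 0.5 * n * np.log2(max(mse, 1e-12)))  # encode residuals

best_k = min(range(1, 6), key=lambda k: message_length(x, k))
print("components chosen:", best_k)
```

Adding a component beyond the two that generated the data buys little residual compression but costs extra assignment and parameter bits, so the shortest total message sits at k = 2.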
Similarity-based approaches to natural language processing
, 1997
"... Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and naturallanguagebased user interfaces. However, even huge ..."
Abstract

Cited by 50 (3 self)
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, ...
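The core estimation idea, backing off from a sparse event to similar events weighted by KL divergence, can be sketched in a few lines. The toy verb-object distributions and the exponential weighting scheme below are assumptions for illustration, not the thesis's actual data or model.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p = 0 contribute nothing.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical smoothed conditional distributions P(object | verb) over 3 objects.
neighbors = {
    "eat":    np.array([0.70, 0.20, 0.10]),
    "devour": np.array([0.65, 0.25, 0.10]),
    "drink":  np.array([0.05, 0.15, 0.80]),
}
# Unreliable direct estimate for a rare verb we want to smooth.
target = np.array([0.60, 0.30, 0.10])

# Weight each neighbor by 2^(-beta * KL): near-identical distributions dominate.
beta = 5.0
weights = {v: 2.0 ** (-beta * kl(target, q)) for v, q in neighbors.items()}
total = sum(weights.values())
smoothed = sum(weights[v] * q for v, q in neighbors.items()) / total
print(np.round(smoothed, 3))
```

"eat" and "devour" are close to the target in KL terms and so dominate the weighted average, while the dissimilar "drink" contributes almost nothing; the result remains a proper distribution.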
A Non-behavioural, Computational Extension to the Turing Test
 In International Conference on Computational Intelligence & Multimedia Applications (ICCIMA '98)
, 1998
"... We also ask the following question: Given two programs H1 and H2 respectively of lengths l1 and l2, l1! l2, if H1 and H2 perform equally well (to date) on a Turing Test, which, if either, should be preferred for the future? We also set a challenge. If humans can presume intelligence in their ability ..."
Abstract

Cited by 35 (18 self)
We also ask the following question: given two programs H1 and H2, respectively of lengths l1 and l2 with l1 < l2, if H1 and H2 perform equally well (to date) on a Turing Test, which, if either, should be preferred for the future? We also set a challenge. If humans can presume intelligence in their ability to set the Turing Test, then we issue the additional challenge to researchers to get machines to administer the Turing Test.
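One standard way to make the length preference concrete, which I offer here only as an illustration of the question the abstract poses, is a Solomonoff-style prior P(H) proportional to 2^(-l): when two programs fit the evidence equally well, posterior weight falls almost entirely on the shorter one. The lengths below are hypothetical.

```python
# Hypothetical program lengths in bits (l1 < l2), equal fit to the evidence.
l1, l2 = 120, 150
w1, w2 = 2.0 ** -l1, 2.0 ** -l2   # 2^(-length) prior weights
p1 = w1 / (w1 + w2)               # posterior share of the shorter program H1
print(round(p1, 6))
```

With a 30-bit length gap, H1's share is 1/(1 + 2^-30), i.e. indistinguishable from 1 at this precision, which is one formal reading of "prefer the shorter program".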
Dynamic clustering using particle swarm optimization with application in unsupervised image segmentation
 2005
"... A new dynamic clustering approach (DCPSO), based on Particle Swarm Optimization, is proposed. This approach is applied to unsupervised image classification. The proposed approach automatically determines the "optimum " number of clusters and simultaneously clusters the data set with minima ..."
Abstract

Cited by 25 (0 self)
A new dynamic clustering approach (DCPSO), based on Particle Swarm Optimization, is proposed. This approach is applied to unsupervised image classification. The proposed approach automatically determines the "optimum" number of clusters and simultaneously clusters the data set with minimal user interference. The algorithm starts by partitioning the data set into a relatively large number of clusters to reduce the effects of initial conditions. Using binary particle swarm optimization, the "best" number of clusters is selected. The centers of the chosen clusters are then refined via the K-means clustering algorithm. The experiments conducted show that the proposed approach generally found the "optimum" number of clusters on the tested images.
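The pipeline shape described above (large initial center pool, binary selection of a subset, then K-means refinement) can be sketched on toy 2-D data. To keep the sketch short, plain random search over binary masks stands in for the paper's binary PSO, and the validity score is an assumed placeholder, so only the structure follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two tight synthetic 2-D blobs (stand-in for image pixel features).
data = np.vstack([rng.normal(0.0, 0.15, (50, 2)),
                  rng.normal(3.0, 0.15, (50, 2))])

# Relatively large initial pool of candidate centers, drawn from the data.
pool = np.vstack([data[rng.choice(50, 4, replace=False)],
                  data[50 + rng.choice(50, 4, replace=False)]])

def score(mask):
    # Assumed validity index: compactness plus a penalty per active center.
    centers = pool[mask]
    if len(centers) == 0:
        return np.inf
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).mean() + 0.25 * len(centers)

# Random search over binary masks: a stand-in for the paper's binary PSO.
best_mask, best_val = np.ones(8, bool), np.inf
for _ in range(300):
    mask = rng.random(8) < 0.5
    v = score(mask)
    if v < best_val:
        best_mask, best_val = mask, v

# K-means refinement of the selected centers, as in the final DCPSO step.
centers = pool[best_mask].copy()
for _ in range(10):
    labels = np.linalg.norm(data[:, None, :] - centers[None, :, :],
                            axis=2).argmin(axis=1)
    for j in range(len(centers)):
        if np.any(labels == j):
            centers[j] = data[labels == j].mean(axis=0)

print("clusters selected:", len(centers))
```

The per-center penalty makes oversized subsets lose to compact ones, so the search settles near the true cluster count and the refinement pulls the surviving centers onto the blob means.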
Circular Clustering Of Protein Dihedral Angles By Minimum Message Length
 In Proceedings of the 1st Pacific Symposium on Biocomputing (PSB1)
, 1996
"... this paper is given in [DADH95] and is available from ftp://www.cs.monash.edu.au/www/publications/1995/TR237.ps.Z.) Section 2introduces the MML principle and how it can be used for this circular clustering problem. The remaining sections give the results of the secondary structure groups [KaSa83] th ..."
Abstract

Cited by 15 (11 self)
... this paper is given in [DADH95] and is available from ftp://www.cs.monash.edu.au/www/publications/1995/TR237.ps.Z.) Section 2 introduces the MML principle and how it can be used for this circular clustering problem. The remaining sections give the results of the secondary structure groups [KaSa83] that resulted from applying Snob to cluster our dihedral angle data.