Results 1-10 of 38
BIRCH: an efficient data clustering method for very large databases
 In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
Cited by 557 (2 self)
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multidimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle “noise” (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH’s time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently ...
Knowledge-Based Artificial Neural Networks
, 1994
Cited by 183 (13 self)
Hybrid learning methods use theoretical knowledge of a domain and a set of classified examples to develop a method for accurately classifying examples not seen during training. The challenge of hybrid learning systems is to use the information provided by one source of information to offset information missing from the other source. By so doing, a hybrid learning system should learn more effectively than systems that use only one of the information sources. KBANN (Knowledge-Based Artificial Neural Networks) is a hybrid learning system built on top of connectionist learning techniques. It maps problem-specific "domain theories", represented in propositional logic, into neural networks and then refines this reformulated knowledge using backpropagation. KBANN is evaluated by extensive empirical tests on two problems from molecular biology. Among other results, these tests show that the networks created by KBANN generalize better than a wide variety of learning systems, as well as several t...
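The abstract above says that propositional rules are mapped into a neural network but does not give the mapping. The sketch below is one plausible way to encode a single conjunctive rule as a sigmoid unit, so that the unit fires only when the rule body is satisfied. The names `rule_to_unit`, `activate`, and `OMEGA`, the weight value 4.0, and the bias formula are illustrative assumptions and are not taken from this listing; the backpropagation refinement step is not shown.

```python
import math

OMEGA = 4.0  # assumed link weight; larger values give a sharper, more rule-like unit


def rule_to_unit(positive, negated, omega=OMEGA):
    """Encode the rule  head :- positive..., not negated...  as one unit.

    Positive antecedents get weight +omega and negated ones get -omega.
    The bias is chosen (an assumption of this sketch) so that the net input
    is positive only when every positive antecedent is true and every
    negated antecedent is false.
    """
    weights = {name: omega for name in positive}
    weights.update({name: -omega for name in negated})
    bias = -(len(positive) - 0.5) * omega
    return weights, bias


def activate(weights, bias, inputs):
    """Sigmoid activation of the unit for a dict of 0/1 input values."""
    net = bias + sum(w * inputs.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-net))


# Hypothetical rule: c :- a, not b.
weights, bias = rule_to_unit(positive=["a"], negated=["b"])
fires = activate(weights, bias, {"a": 1.0, "b": 0.0}) > 0.5
```

Because the encoded rule is an ordinary differentiable unit, its weights and bias could later be adjusted by gradient descent on labelled examples, which is the role the abstract assigns to backpropagation.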
Semi-supervised Clustering with User Feedback
, 2003
Cited by 124 (2 self)
We present a new approach to clustering based on the observation that "it is easier to criticize than to construct." Our approach of semi-supervised clustering allows a user to iteratively provide feedback to a clustering algorithm. The feedback is incorporated in the form of constraints which the clustering algorithm attempts to satisfy on future iterations. These constraints allow the user to guide the clusterer towards clusterings of the data that the user finds more useful. We demonstrate semi-supervised clustering with a system that learns to cluster news stories from a Reuters data set. Introduction Consider the following problem: you are given 100,000 text documents (e.g., papers, newsgroup articles, or web pages) and asked to group them into classes or into a hierarchy such that related documents are grouped together. You are not told what classes or hierarchy to use or what documents are related; you have some criteria in mind, but may not be able to say exactly w...
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Cited by 115 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
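The three ingredients named in the abstract above (a dissimilarity measure for categorical objects, modes in place of means, and frequency-based mode updates) can be illustrated with a minimal sketch. It assumes a simple-matching dissimilarity (the count of differing attributes) and a batch update in which each mode is recomputed after all records are assigned; the paper's exact measures and update schedule may differ. The function names and the toy records are hypothetical.

```python
from collections import Counter


def mismatch(a, b):
    """Simple-matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based mode: the most common category in each attribute column."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, initial_modes, max_iter=20):
    """Batch k-modes sketch: assign each record to its nearest mode, then recompute modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatch(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


# Toy categorical records (hypothetical data): colour, shape, size.
records = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("red", "oval", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("green", "square", "large"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[3]])
```

Each pass costs time proportional to the number of records times the number of clusters times the number of attributes, which is consistent with the scalability in records and clusters that the abstract reports.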
Graph-based hierarchical conceptual clustering
 International Journal on Artificial Intelligence Tools
, 2001
Cited by 32 (5 self)
Hierarchical conceptual clustering has been proven to be a useful data mining technique. Graph-based representation of structural information has been shown to be successful in knowledge discovery. The Subdue substructure discovery system provides the advantages of both approaches. In this paper we present Subdue and focus on its clustering capabilities. We use two examples to illustrate the validity of the approach both in structured and unstructured domains, as well as compare Subdue to an earlier clustering algorithm.
Circular Clustering Of Protein Dihedral Angles By Minimum Message Length
 In Proceedings of the 1st Pacific Symposium on Biocomputing (PSB1
, 1996
Cited by 15 (11 self)
this paper is given in [DADH95] and is available from ftp://www.cs.monash.edu.au/www/publications/1995/TR237.ps.Z.) Section 2 introduces the MML principle and how it can be used for this circular clustering problem. The remaining sections give the results of the secondary structure groups [KaSa83] that resulted from applying Snob to cluster our dihedral angle data.
Bootstrapping knowledge representations: from entailment meshes via semantic nets to learning webs
, 2001
An improved Bayesian Structural EM algorithm for learning Bayesian networks for clustering
 Pattern Recognition Letters
Cited by 11 (7 self)
The application of the Bayesian Structural EM algorithm to learn Bayesian networks for clustering implies a search over the space of Bayesian network structures alternating between two steps: an optimization of the Bayesian network parameters (usually by means of the EM algorithm) and a structural search for model selection. In this paper, we propose to perform the optimization of the Bayesian network parameters using an alternative approach to the EM algorithm: the BC+EM method. We provide experimental results to show that our proposal results in a more effective and efficient version of the Bayesian Structural EM algorithm for learning Bayesian networks for clustering. Key words: clustering, Bayesian networks, EM algorithm, Bayesian Structural EM algorithm, Bound and Collapse method. 1 Introduction One of the basic problems that arises in a great variety of fields, including pattern recognition, machine learning and statistics, is the so-called data clustering problem [1,5,6,10,1...
A General Similarity Framework for Horn Clause Logic
, 2001
Cited by 7 (6 self)
First-Order Logic formulæ are a powerful representation formalism characterized by the use of relations, which cause serious computational problems due to the phenomenon of indeterminacy (various portions of one description are possibly mapped in different ways onto another description). Being able to identify the correct corresponding parts of two descriptions would help to tackle the problem: hence, the need for a framework for comparison and similarity assessment. This could have many applications in Artificial Intelligence: guiding subsumption procedures and theory revision systems, implementing flexible matching, supporting instance-based learning and conceptual clustering. Unfortunately, few works on this subject are available in the literature. This paper focuses on Horn clauses, which are the basis for the Logic Programming paradigm, and proposes a novel similarity formula and evaluation criteria for identifying the description components that are more similar and hence more likely to correspond to each other, based only on their syntactic structure. Experiments on real-world datasets prove the effectiveness of the proposal, and the efficiency of the corresponding implementation in the above tasks.