Results 1  10
of
84
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1284 (13 self)
 Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify crosscutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
BIRCH: an efficient data clustering method for very large databases
 In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
"... Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of ..."
Abstract

Cited by 434 (2 self)
 Add to MetaCart
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of large datasets and minimization of 1/0 costs. This paper presents a data clustering method named Bfll (;”H (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and clynamicall y clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints). BIRCH can typically find a goocl clustering with a single scan of the data, and improve the quality further with a few aclditioual scans. BIRCH is also the first clustering algorithm proposerl in the database area to handle “noise) ’ (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH’S time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIR (;’H versus CLARA NS, a clustering method proposed recently for large datasets, and S11OW that BIRCH is consistently 1
The adaptive nature of human categorization
 Psychological Review
, 1991
"... A rational model of human categorization behavior is presented that assumes that categorization reflects the derivation of optimal estimates of the probability of unseen features of objects. A Bayesian analysis is performed of what optimal estimations would be if categories formed a disjoint partiti ..."
Abstract

Cited by 211 (2 self)
 Add to MetaCart
A rational model of human categorization behavior is presented that assumes that categorization reflects the derivation of optimal estimates of the probability of unseen features of objects. A Bayesian analysis is performed of what optimal estimations would be if categories formed a disjoint partitioning of the object space and if features were independently displayed within a category. This Bayesian analysis is placed within an incremental categorization algorithm. The resulting rational model accounts for effects of central tendency of categories, effects of specific instances, learning of linearly nonseparable categories, effects of category labels, extraction of basic level categories, baserate effects, probability matching in categorization, and trialbytrial learning functions. Although the rational model considers just I level of categorization, it is shown how predictions can be enhanced by considering higher and lower levels. Considering prediction at the lower, individual level allows integration of this rational analysis of categorization with the earlier rational analysis of memory (Anderson & Milson, 1989). Anderson (1990) presented a rational analysis ot 6 human cognition. The term rational derives from similar "rationalman" analyses in economics. Rational analyses in other fields are sometimes called adaptationist analyses. Basically, they are efforts to explain the behavior in some domain on the assumption that the behavior is optimized with respect to some criteria of adaptive importance. This article begins with a general characterization ofhow one develops a rational theory of a particular cognitive phenomenon. Then I present the basic theory of categorization developed in Anderson (1990) and review the applications from that book. Since the writing of the book, the theory has been greatly extended and applied to many new phenomena. Most of this article describes these new developments and applications. A Rational Analysis Several theorists have promoted the idea that psychologists might understand human behavior by assuming it is adapted to the environment (e.g., Brunswik, 1956; Campbell, 1974; Gib
Learning in the Presence of Concept Drift and Hidden Contexts
 Machine Learning
, 1996
"... . Online learning in domains where the target concept depends on some hidden context poses serious problems. A changing context can induce changes in the target concepts, producing what is known as concept drift. We describe a family of learning algorithms that flexibly react to concept drift and c ..."
Abstract

Cited by 190 (0 self)
 Add to MetaCart
. Online learning in domains where the target concept depends on some hidden context poses serious problems. A changing context can induce changes in the target concepts, producing what is known as concept drift. We describe a family of learning algorithms that flexibly react to concept drift and can take advantage of situations where contexts reappear. The general approach underlying all these algorithms consists of (1) keeping only a window of currently trusted examples and hypotheses; (2) storing concept descriptions and reusing them when a previous context reappears; and (3) controlling both of these functions by a heuristic that constantly monitors the system's behavior. The paper reports on experiments that test the systems' performance under various conditions such as different levels of noise and different extent and rate of concept drift. Keywords: Incremental concept learning, online learning, context dependence, concept drift, forgetting 1. Introduction The work presen...
Extensions to the kMeans Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
"... The kmeans algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the kmeans algorithm to categoric ..."
Abstract

Cited by 156 (2 self)
 Add to MetaCart
The kmeans algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the kmeans algorithm to categorical domains and domains with mixed numeric and categorical values. The kmodes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequencybased method to update modes in the clustering process to minimise the clustering cost function. With these extensions the kmodes algorithm enables the clustering of categorical data in a fashion similar to kmeans. The kprototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the kmeans and kmodes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
Iterative Optimization and Simplification of Hierarchical Clusterings
 Journal of Artificial Intelligence Research
, 1995
"... Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high qual ..."
Abstract

Cited by 103 (1 self)
 Add to MetaCart
Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a `tentative' clustering for initial examination, followed by iterative optimization, which continues to search in background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been construct...
BIRCH: a new data clustering algorithm and its applications
 Data Min. Knowl. Disc
, 1997
"... Abstract. Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of ..."
Abstract

Cited by 65 (0 self)
 Add to MetaCart
Abstract. Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new inmemory data structure called CFtree, which serves as an inmemory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two reallife problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
Introspective Multistrategy Learning: Constructing a Learnung Strategy under Reasoning Failure
 Artificial Intelligence
, 1996
"... Officer praised dog for barking at object." Enables Detect Drugs out FK Initiates Retrieval 5 6 Missing Figure 10. Forgetting to fill the tank with gas A=actual intention; E=expectation; Q=question; C=context; I=index; G=goal Tank Out of Gas Tank Full Tank Low Fill Tank Shoul ..."
Abstract

Cited by 56 (21 self)
 Add to MetaCart
Officer praised dog for barking at object." Enables Detect Drugs out FK Initiates Retrieval 5 6 Missing Figure 10. Forgetting to fill the tank with gas A=actual intention; E=expectation; Q=question; C=context; I=index; G=goal Tank Out of Gas Tank Full Tank Low Fill Tank Should have filled up with gas when tank low Expectation What Action to Do? KEY: G = goal; I = index; C = context; Q = question; E = expectation; A = actual intention Results At Store connections with related concepts. Other learning goals take multiple arguments. For instance, a knowledge differentiation goal (Cox & Ram, 1995) is a goal to determine a change in a body of knowledge such that two items are separated conceptually. In contrast, a knowledge reconciliation goal (Cox & Ram, 1995) is one that seeks to merge two items that were mistakenly considered separate entities. Both expansion goals and reconciliation goals may include or spawn a knowledge organization goal (Ram, 1993) that seeks to reorganize the existing knowledge so that it is made available to the reasoner at the appropriate time, as well as modify the structure or content of a concept itself. Such reorganization of knowledge affects the conditions under which a particular piece of knowledge is retrieved or the kinds of indexes associated with an item in memory.
Concept Formation in Structured Domains
, 1991
"... ions are made over the structural information (relations) ..."
Abstract

Cited by 52 (2 self)
 Add to MetaCart
ions are made over the structural information (relations)
Autonomous Learning of Sequential Tasks: Experiments and Analyses
 IEEE Transactions on Neural Networks
, 1998
"... : This paper presents a novel learning model Clarion, which is a hybrid model based on the twolevel approach proposed in Sun (1995). The model integrates neural, reinforcement, and symbolic learning methods to perform online, bottomup learning (i.e., learning that goes from neural to symbolic repr ..."
Abstract

Cited by 45 (29 self)
 Add to MetaCart
: This paper presents a novel learning model Clarion, which is a hybrid model based on the twolevel approach proposed in Sun (1995). The model integrates neural, reinforcement, and symbolic learning methods to perform online, bottomup learning (i.e., learning that goes from neural to symbolic representations). The model utilizes both procedural and declarative knowledge (in neural and symbolic representations respectively), tapping into the synergy of the two types of processes. It was applied to deal with sequential decision tasks. Experiments and analyses in various ways are reported that shed light on the advantages of the model. obstacles agent target Figure 1: Navigating Through A Minefield 1 Introduction This paper presents a model that unifies neural, symbolic, and reinforcement learning. It addresses the following three issues: (1) It deals with autonomous learning: It allows a situated agent to learn autonomously and continuously, from ongoing experience in the world, w...