A density-based algorithm for discovering clusters in large spatial databases with noise
, 1996
Cited by 1092 (59 self)
Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, which relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
BIRCH: an efficient data clustering method for very large databases
 In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
Cited by 434 (2 self)
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multidimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle “noise” (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently ...
An Efficient k-Means Clustering Algorithm: Analysis and Implementation
, 2000
Cited by 208 (3 self)
k-means clustering is a very popular clustering technique, which is used in numerous applications. Given a set of n data points in R^d and an integer k, the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is very easy to implement. It differs from most other approaches in that it precomputes a kd-tree data structure for the data points rather than the center points. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time. Second, we have implemented the algorithm and performed a number of empirical studies, both on synthetically generated data and on real...
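The objective and heuristic named in this abstract can be illustrated with a minimal sketch of plain Lloyd's algorithm. This is not the paper's filtering algorithm (it builds no kd-tree and simply scans every point on each iteration), and the function names `lloyd`, `sq_dist`, and `mean_sq_distance`, the iteration cap, and the random initialization are assumptions made for illustration only.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_sq_distance(points, centers):
    """The k-means objective: mean squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)

def lloyd(points, k, iters=20, seed=0):
    """Plain Lloyd's iteration (illustrative; not the kd-tree filtering variant)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the centroid of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = lloyd(pts, k=2)
```

Each iteration costs O(nk) distance computations in this sketch; reducing that per-iteration work is the kind of cost the abstract's kd-tree precomputation is aimed at.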
BIRCH: a new data clustering algorithm and its applications
 Data Min. Knowl. Disc
, 1997
Cited by 65 (0 self)
Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
DEMON: Mining and Monitoring Evolving Data
 IEEE Transactions on Knowledge and Data Engineering
, 2000
Cited by 55 (1 self)
Data mining algorithms have been the focus of much research recently. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition and deletion of blocks of data. Most data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records.
Learning simple relations: Theory and applications
 In Second SIAM Data Mining Conference
, 2002
Cited by 13 (1 self)
In addition to classic clustering algorithms, many different approaches to clustering are emerging for objects of special nature. In this article we deal with the grouping of rows and columns of a matrix with nonnegative entries. Two rows (or columns) are considered similar if the corresponding cross-distributions are close. This grouping is a dual clustering of two sets of elements, row and column indices. The introduced approach is based on minimizing the reduction of mutual information contained in a matrix that represents the relationship between two sets of elements. Our clustering approach contains many parallels with K-Means clustering due to certain common algebraic properties. The obtained results have many applications, including grouping of Web visit data.
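The quantity this abstract minimizes, the reduction of mutual information caused by a grouping, can be made concrete with a small sketch. It assumes the nonnegative matrix is normalized into a joint distribution and that a group of rows is represented by the sum of its rows. The function names and the row-only grouping are illustrative assumptions, not the paper's procedure, which groups rows and columns together.

```python
import math

def mutual_information(matrix):
    """Mutual information (in nats) of a nonnegative matrix read as a joint distribution."""
    total = sum(sum(row) for row in matrix)
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(row[j] for row in matrix) / total for j in range(len(matrix[0]))]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value > 0:  # zero cells contribute nothing
                p = value / total
                mi += p * math.log(p / (row_p[i] * col_p[j]))
    return mi

def merge_rows(matrix, groups):
    """Replace each group of row indices by the sum of its rows (assumed grouping model)."""
    n_cols = len(matrix[0])
    return [[sum(matrix[i][j] for i in group) for j in range(n_cols)] for group in groups]

def information_loss(matrix, groups):
    """Reduction of mutual information caused by grouping the rows."""
    return mutual_information(matrix) - mutual_information(merge_rows(matrix, groups))

# Hypothetical visit counts: rows 0 and 1 have proportional distributions over the columns.
visits = [[1, 2], [2, 4], [5, 0]]
loss_similar = information_loss(visits, [[0, 1], [2]])     # merges proportional rows
loss_dissimilar = information_loss(visits, [[0, 2], [1]])  # merges unlike rows
```

In this toy matrix, merging the two proportional rows loses essentially no information, while merging rows with different distributions does, which matches the abstract's notion that rows are similar when their cross-distributions are close.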
Tree-Based Partitioning Querying: A Methodology for Computing Medoids in Large Spatial Datasets
 VLDB J
Cited by 10 (5 self)
Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location-based services and sensor networks. Since the k-medoid problem is NP-hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever-increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order to read only a small part of the dataset. Compared to
Continuous Medoid Queries over Moving Objects. SSTD
, 2007
Cited by 5 (1 self)
In the k-medoid problem, given a dataset P, we are asked to choose k points in P as the medoids. The optimal medoid set minimizes the average Euclidean distance between the points in P and their closest medoid. Finding the optimal k medoids is NP-hard, and existing algorithms aim at approximate answers, i.e., they compute medoids that achieve a small, yet not minimal, average distance. Similarly, in this paper we also aim at approximate solutions. We consider, however, the continuous version of the problem, where the points in P move and our task is to maintain the medoid set on-the-fly (trying to keep the average distance small). To the best of our knowledge, this work constitutes the first attempt at continuous medoid queries. First, we consider centralized monitoring, where the points issue location updates whenever they move. A server processes the stream of generated updates and constantly reports the current medoid set. Next, we address distributed monitoring, where we assume that the data points have some computational capabilities, and they take over part of the monitoring task. In particular, the server installs adaptive filters (i.e., permissible spatial ranges, called safe regions) to the points, which report their location only when they move outside their filters. The distributed techniques reduce the frequency of location updates (and, thus, the network overhead and the server load), at the cost of a slightly higher average distance, compared to the centralized methods. Neither our centralized nor our distributed methods make any assumption about the data moving patterns (e.g., velocity vectors, trajectories, etc.) and both can be applied to an arbitrary number of medoids k. We demonstrate the efficiency and efficacy of our techniques through extensive experiments.
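The k-medoid objective and the safe-region filter described in this abstract can be sketched as follows. The exhaustive search is only a reference implementation for tiny inputs, not one of the paper's approximate methods, and the rectangular shape of the safe region plus all function names are assumptions made for illustration.

```python
import math
from itertools import combinations

def avg_distance(points, medoids):
    """Average Euclidean distance from each point to its closest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)

def best_medoids_bruteforce(points, k):
    """Try every k-subset of the data; feasible only for very small inputs."""
    return min(combinations(points, k), key=lambda ms: avg_distance(points, ms))

def must_report(location, safe_region):
    """Distributed filter: a point reports only after leaving its safe region.

    safe_region is assumed to be an axis-aligned rectangle (xmin, ymin, xmax, ymax).
    """
    x, y = location
    xmin, ymin, xmax, ymax = safe_region
    return not (xmin <= x <= xmax and ymin <= y <= ymax)

P = [(0, 0), (1, 0), (10, 0), (11, 0)]
medoids = best_medoids_bruteforce(P, k=2)
cost = avg_distance(P, medoids)

inside = must_report((1, 1), (0, 0, 2, 2))   # still within the filter: no update sent
outside = must_report((3, 1), (0, 0, 2, 2))  # left the filter: an update is sent
```

The brute-force search examines every k-subset of P, which is why practical methods settle for approximate medoid sets. The filter shows the trade-off the abstract describes: fewer location updates reach the server, so the maintained medoid set may be slightly stale.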
Shape-Based Clustering Of Enterprise CAD Databases
 Computer Aided Design & Applications
, 2005
Cited by 3 (0 self)
Cluster analysis is a primary data mining method for knowledge discovery in spatial databases, where the goal is to find ‘natural’ groups in a dataset based on a similarity or dissimilarity function for pairs of objects. With the number and size of spatial databases in various domains growing rapidly over the last couple of decades, methods for automated knowledge discovery in these datasets are becoming increasingly important. In the last couple of years, various similarity search methods for 3D CAD databases have been researched with the purpose of promoting engineering data reuse. However, unlike in other spatial domains, not much interest has yet been generated towards the task of automated knowledge discovery and data mining in 3D CAD databases. Moreover, most well-known clustering algorithms have a very high computational complexity when used directly, and hence are too inefficient when applied to large spatial databases. Developing an efficient clustering technique usually requires leveraging crucial domain knowledge. This paper proposes a simple and efficient system for automatic single-scan clustering of 3D CAD databases based on a shape similarity measure. The resulting system is a visual data mining tool to help the user quickly locate a seed model and search for similar models in the database. The paper also shows how the system can generate various statistical data which give new insight into the contents of the databases. The described system was implemented and evaluated on a large number of industrial CAD datasets and the results obtained were highly encouraging.