Results 1 - 10 of 102
Survey of clustering data mining techniques
, 2002
Cited by 247 (0 self)
Abstract:
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique ...
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data
 In Proceedings of the Second SIAM International Conference on Data Mining
, 2003
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 40 (0 self)
Abstract:
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
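The decomposition this abstract alludes to rests on a standard property of EM: the M-step reads only per-cluster sufficient statistics accumulated during the scan, never individual records again. A minimal in-memory sketch for a 1-D Gaussian mixture (illustrative only, not the paper's scalable implementation; all names are my own):

```python
import math

def em_gmm_1d(xs, k=2, iters=50):
    """Minimal EM for a 1-D Gaussian mixture.

    The M-step touches only per-cluster sufficient statistics
    (sum of responsibilities, weighted sum, weighted sum of squares),
    which is the property scalable variants exploit: regions of the
    data can be summarized by those sums instead of raw records.
    """
    s = sorted(xs)
    mu = [s[(i * (len(s) - 1)) // max(k - 1, 1)] for i in range(k)]  # spread init
    var = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: one scan, accumulating sufficient statistics only
        stats = [[0.0, 0.0, 0.0] for _ in range(k)]  # sum_r, sum_rx, sum_rx2
        for x in xs:
            dens = [w[j] / math.sqrt(2 * math.pi * var[j]) *
                    math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            z = sum(dens) or 1e-300
            for j in range(k):
                r = dens[j] / z
                stats[j][0] += r
                stats[j][1] += r * x
                stats[j][2] += r * x * x
        # M-step: re-estimate all parameters from the statistics alone
        for j in range(k):
            sr, srx, srx2 = stats[j]
            w[j] = sr / len(xs)
            mu[j] = srx / sr
            var[j] = max(srx2 / sr - mu[j] ** 2, 1e-6)
    return w, mu, var
```

Because the statistics are additive, well-separated regions of the data can contribute their sums once and then be discarded, which is what removes the need for repeated full scans.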
Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support
 In Proc. of Int. Conf. on Databases in Office, Engineering and Science
, 1999
Cited by 29 (1 self)
Abstract:
Spatial data mining algorithms heavily depend on the efficient processing of neighborhood relations, since the neighbors of many objects have to be investigated in a single run of a typical algorithm. Therefore, providing general concepts for neighborhood relations, as well as an efficient implementation of these concepts, will allow a tight integration of spatial data mining algorithms with a spatial database management system. This will speed up both the development and the execution of spatial data mining algorithms. In this paper, we define neighborhood graphs and paths and a small set of database primitives for their manipulation. We show that typical spatial data mining algorithms are well supported by the proposed basic operations. For finding significant spatial patterns, only certain classes of paths "leading away" from a starting object are relevant. We discuss filters allowing only such neighborhood paths, which will significantly reduce the search space for spatial data mini...
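The two ideas in this abstract, a neighborhood graph over a symmetric relation and filtered path expansion, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's database primitives; `related` and `keep` stand in for whatever predicate the DBMS provides:

```python
from itertools import combinations

def build_neighborhood_graph(objects, related):
    """Neighborhood graph for a symmetric relation `related(a, b)`.

    `related` stands in for a DBMS neighborhood primitive
    (e.g. intersects, adjacent, distance below a threshold).
    """
    graph = {o: set() for o in objects}
    for a, b in combinations(objects, 2):
        if related(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def extend_paths(graph, paths, keep):
    """One step of neighborhood-path expansion, pruned by a filter
    predicate `keep(path, next)` (e.g. "leads away from the start"),
    which is what shrinks the search space."""
    out = []
    for p in paths:
        for nxt in graph[p[-1]]:
            if nxt not in p and keep(p, nxt):
                out.append(p + [nxt])
    return out
```

For points on a line with `related = |a - b| <= 1` and a "leads away" filter such as `abs(nxt - p[0]) > abs(p[-1] - p[0])`, repeated expansion from a start object yields only monotonically outward paths.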
Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data
 In SIGMOD
, 2001
Cited by 26 (2 self)
Abstract:
The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strate...
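The core grid idea can be shown with a simplified in-memory epsilon-join: lay a grid of cell length ε over the space and compare each point only against points in its own and neighboring cells. This is a sketch in the same spirit, not the external-memory Epsilon Grid Order itself (which additionally sorts the cells lexicographically for I/O scheduling):

```python
import math
from collections import defaultdict
from itertools import product

def grid_epsilon_join(points, eps):
    """All pairs (i, j), i < j, with distance(points[i], points[j]) <= eps.

    Points are hashed into grid cells of side eps; a candidate pair can
    only span cells that differ by at most 1 in every dimension, so each
    point is checked against its own cell and the adjacent cells only.
    """
    dims = len(points[0])
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(math.floor(c / eps) for c in p)].append(i)
    result = []
    for cell, idxs in cells.items():
        for offset in product((-1, 0, 1), repeat=dims):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in idxs:
                for j in cells.get(other, ()):
                    # i < j emits each qualifying pair exactly once
                    if i < j and math.dist(points[i], points[j]) <= eps:
                        result.append((i, j))
    return sorted(result)
```

The `i < j` guard is what prevents a pair from being reported twice when the two cells are visited from both sides.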
Density-Connected Sets and their Application for Trend Detection in Spatial Databases
 Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 10-15, Menlo Park
, 1997
Cited by 24 (3 self)
Abstract:
Several clustering algorithms have been proposed for class identification in spatial databases such as earth observation databases. The effectiveness of well-known algorithms such as DBSCAN, however, is somewhat limited because they do not fully exploit the richness of the different types of data contained in a spatial database. In this paper, we introduce the concept of density-connected sets and present a significantly generalized version of DBSCAN. The major properties of this algorithm are as follows: (1) any symmetric predicate can be used to define the neighborhood of an object, allowing a natural definition in the case of spatially extended objects such as polygons, and (2) the cardinality function for a set of neighboring objects may take into account the non-spatial attributes of the objects as a means of assigning application-specific weights. Density-connected sets can be used as a basis to discover trends in a spatial database. We define trends in spatial databases and show how to apply the generalized DBSCAN algorithm to the task of discovering such knowledge. To demonstrate the practical impact of our approach, we performed experiments on a geographical information system on Bavaria which is representative of a broad class of spatial databases.
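The two generalizations named in the abstract, a pluggable neighborhood predicate and a pluggable cardinality function, can be sketched as follows. This is an illustrative reconstruction with simplified core/border handling, not the authors' algorithm or API:

```python
def gdbscan(objects, neighborhood, min_weight, weight=lambda objs: len(objs)):
    """Sketch of a generalized DBSCAN.

    `neighborhood(o)` returns the neighbor set under any symmetric
    predicate (distance, intersects, adjacency of polygons, ...), and
    `weight` is the cardinality function, which may incorporate
    non-spatial attributes instead of a plain count.
    Returns a dict: object -> cluster id (positive) or -1 for noise.
    """
    UNSEEN, NOISE = None, -1
    label = {o: UNSEEN for o in objects}
    cluster = 0
    for o in objects:
        if label[o] is not UNSEEN:
            continue
        seeds = neighborhood(o)
        if weight(seeds) < min_weight:
            label[o] = NOISE          # may be re-claimed as a border object
            continue
        cluster += 1                  # o is a core object: start a cluster
        label[o] = cluster
        frontier = list(seeds)
        while frontier:
            q = frontier.pop()
            if label[q] == NOISE:
                label[q] = cluster    # border object joins the cluster
            if label[q] is not UNSEEN:
                continue
            label[q] = cluster
            q_neigh = neighborhood(q)
            if weight(q_neigh) >= min_weight:
                frontier.extend(q_neigh)  # q is core too: keep expanding
    return label
```

With the classic choices `neighborhood(o) = {p : dist(p, o) <= eps}` and `weight = len`, this reduces to ordinary DBSCAN.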
Discovering Personal Gazetteers: An Interactive Clustering Approach
 In Proc. ACM GIS
, 2004
Cited by 23 (5 self)
Abstract:
Personal gazetteers record individuals' most important places, such as home, work, grocery store, etc. Using personal gazetteers in location-aware applications offers additional functionality and improves the user experience. However, systems then need some way to acquire them.
Unsupervised Distributed Clustering
, 2004
Cited by 20 (12 self)
Abstract:
Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of databases that is common nowadays. In this paper we propose a modification of a recently proposed algorithm, namely k-windows, that is able to achieve high-quality results in distributed computing environments.
Density-Based Clustering for Real-Time Stream Data
 In Proc. of KDD '07
, 2007
Cited by 20 (0 self)
Abstract:
Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms cannot find clusters of arbitrary shapes and cannot handle outliers. Further, they require knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density-decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data-stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
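The online/offline split the abstract describes can be sketched as a density grid with exponential decay. This is an illustrative toy in the same style, not the D-Stream algorithm itself; `cell`, `decay`, and `dense` are hypothetical parameter names, and decay is applied lazily only when a cell is touched:

```python
import math
from collections import defaultdict

class DensityGrid:
    """Toy density-based stream clusterer.

    Online step: map each record to a grid cell and update the cell's
    exponentially decayed density.  Offline step: group adjacent dense
    cells into clusters via connected components.
    """

    def __init__(self, cell, decay=0.99, dense=2.0):
        self.cell, self.decay, self.dense = cell, decay, dense
        self.density = defaultdict(float)
        self.last = {}   # cell -> time of last update (for lazy decay)
        self.t = 0

    def add(self, point):
        self.t += 1
        g = tuple(math.floor(c / self.cell) for c in point)
        dt = self.t - self.last.get(g, self.t)
        self.density[g] = self.density[g] * (self.decay ** dt) + 1.0
        self.last[g] = self.t

    def clusters(self):
        # connected components over dense cells adjacent along one axis
        dense = {g for g, d in self.density.items() if d >= self.dense}
        seen, out = set(), []
        for g in dense:
            if g in seen:
                continue
            comp, stack = set(), [g]
            while stack:
                c = stack.pop()
                if c in comp:
                    continue
                comp.add(c)
                seen.add(c)
                for i in range(len(c)):
                    for step in (-1, 1):
                        n = c[:i] + (c[i] + step,) + c[i + 1:]
                        if n in dense and n not in comp:
                            stack.append(n)
            out.append(comp)
        return out
```

Because clusters are unions of dense cells rather than spheres around k centroids, arbitrary shapes fall out naturally, and sparsely hit cells (the "sporadic grids" of the abstract) simply decay away.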
Algorithms and Applications for Spatial Data Mining
, 2001
Cited by 20 (0 self)
Abstract:
Introduction. Due to computerization and advances in scientific data collection, we are faced with a large and continuously growing amount of data, which makes it impossible to interpret all of it manually. Therefore, the development of new techniques and tools that support humans in transforming data into useful knowledge has been the focus of the relatively new and interdisciplinary research area "knowledge discovery in databases". Knowledge discovery in databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns from data; a pattern is an expression in some language describing a subset of the data or a model applicable to that subset (Fayyad et al., 1996). The process of KDD is interactive and iterative, involving several steps such as data selection, data reduction, data mining, and the evaluation of the data mining results. The heart of the process, however, is the data mining step ...