Results 1 - 10 of 166
Survey of clustering data mining techniques
, 2002
Cited by 351 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data
 in Proceedings of Second SIAM International Conference on Data Mining
, 2003
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 47 (1 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
A Fast Parallel Clustering Algorithm for Large Spatial Databases
 DATA MINING AND KNOWLEDGE DISCOVERY, 3, 263–290
, 1999
Cited by 39 (1 self)
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.
Density-Based Clustering for Real-Time Stream Data
 Proc. of KDD '07
, 2007
Cited by 34 (0 self)
Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are unable to find clusters of arbitrary shapes and cannot handle outliers. Further, they require knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
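The density decaying idea in the abstract above can be illustrated with a minimal sketch. The class name `DecayingGrid`, the fixed cell size, the decay factor `lam`, and the lazy per-cell update are all assumptions made for illustration; they are not taken from the paper and do not reproduce its definitions or its sporadic-grid removal.

```python
import math
from typing import Dict, Sequence, Set, Tuple

Cell = Tuple[int, ...]


class DecayingGrid:
    """Hypothetical grid whose per-cell density decays by `lam` per time step."""

    def __init__(self, cell_size: float, lam: float) -> None:
        if cell_size <= 0 or not 0.0 < lam < 1.0:
            raise ValueError("cell_size must be > 0 and lam must lie in (0, 1)")
        self.cell_size = cell_size
        self.lam = lam
        self.density: Dict[Cell, float] = {}
        self.last_update: Dict[Cell, int] = {}

    def cell_of(self, point: Sequence[float]) -> Cell:
        # Online step: map a record to the grid cell that contains it.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def density_at(self, cell: Cell, t: int) -> float:
        # Decay the stored density lazily from the cell's last update to time t.
        if cell not in self.density:
            return 0.0
        return self.density[cell] * self.lam ** (t - self.last_update[cell])

    def insert(self, point: Sequence[float], t: int) -> Cell:
        # Each new record adds 1 to the decayed density of its cell.
        cell = self.cell_of(point)
        self.density[cell] = self.density_at(cell, t) + 1.0
        self.last_update[cell] = t
        return cell

    def dense_cells(self, t: int, threshold: float) -> Set[Cell]:
        # Offline step (simplified): report the cells that are dense at time t.
        return {c for c in self.density if self.density_at(c, t) >= threshold}
```

Timestamps are assumed to be non-decreasing integers. An offline pass would then group adjacent dense cells into clusters; that step is left out here.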
Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data
 In SIGMOD
, 2001
Cited by 32 (2 self)
The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strate...
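The sort order described above can be sketched for a small in-memory self-join. This is only an illustration under stated assumptions: the function names `grid_cell` and `similarity_self_join` are hypothetical, Euclidean distance is assumed, and the external sorting and scheduling strategy of the paper are not reproduced.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def grid_cell(point: Point, eps: float) -> Tuple[int, ...]:
    # Cell of an equidistant grid with cell length eps.
    return tuple(math.floor(x / eps) for x in point)


def similarity_self_join(points: List[Point], eps: float) -> List[Tuple[int, int]]:
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    # Sort point indices by their grid cells, compared lexicographically.
    order = sorted(range(len(points)), key=lambda i: grid_cell(points[i], eps))
    cells = [grid_cell(points[i], eps) for i in order]
    pairs = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            # The first cell coordinate never decreases along the order, and a
            # pair within eps differs by at most one cell in every dimension,
            # so no later point can join with point a once this gap exceeds 1.
            if cells[b][0] > cells[a][0] + 1:
                break
            if math.dist(points[order[a]], points[order[b]]) <= eps:
                i, j = order[a], order[b]
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

The early `break` prunes only on the first dimension, which keeps the sketch short; it shows how the sort order bounds the candidates that have to be compared.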
Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support
 In Proc. of Int. Conf. on Databases in Office, Engineering and Science
, 1999
Cited by 31 (1 self)
Spatial data mining algorithms heavily depend on the efficient processing of neighborhood relations since the neighbors of many objects have to be investigated in a single run of a typical algorithm. Therefore, providing general concepts for neighborhood relations as well as an efficient implementation of these concepts will allow a tight integration of spatial data mining algorithms with a spatial database management system. This will speed up both the development and the execution of spatial data mining algorithms. In this paper, we define neighborhood graphs and paths and a small set of database primitives for their manipulation. We show that typical spatial data mining algorithms are well supported by the proposed basic operations. For finding significant spatial patterns, only certain classes of paths "leading away" from a starting object are relevant. We discuss filters allowing only such neighborhood paths which will significantly reduce the search space for spatial data mini...
A Clustering-based Approach for Discovering Interesting Places in Trajectories
Cited by 30 (6 self)
Because of the large amount of trajectory data produced by mobile devices, there is an increasing need for mechanisms to extract knowledge from this data. Most existing works have focused on the geometric properties of trajectories, but the concept of semantic trajectories has recently emerged, in which the background geographic information is integrated with trajectory sample points. In this new concept, trajectories are observed as a set of stops and moves, where stops are the most important parts of the trajectory. Stops and moves have been computed by testing the intersections of trajectories with a set of geographic objects given by the user. In this paper we present an alternative solution with the capability of finding interesting places that are not expected by the user. The proposed solution is a spatio-temporal clustering method, based on speed, to work with single trajectories. We compare the two different approaches with experiments on real data and show that the computation of stops using the concept of speed can be interesting for several applications.
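To make the idea of speed-based stops concrete, here is a simplified sketch. It is a plain threshold rule, not the clustering method proposed in the paper: the function name `find_stops`, the `(timestamp, x, y)` sample layout, and the `max_speed` and `min_duration` parameters are assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (timestamp in s, x in m, y in m)


def find_stops(traj: List[Sample], max_speed: float,
               min_duration: float) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) runs of samples that move slowly.

    A run qualifies as a stop when every consecutive segment in it has
    speed <= max_speed and the run lasts at least min_duration seconds.
    """
    stops: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for k in range(1, len(traj)):
        t0, x0, y0 = traj[k - 1]
        t1, x1, y1 = traj[k]
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else float("inf")
        if speed <= max_speed:
            if start is None:
                start = k - 1  # a slow run begins at the previous sample
        elif start is not None:
            # The run ended at sample k - 1; keep it if it lasted long enough.
            if traj[k - 1][0] - traj[start][0] >= min_duration:
                stops.append((start, k - 1))
            start = None
    if start is not None and traj[-1][0] - traj[start][0] >= min_duration:
        stops.append((start, len(traj) - 1))
    return stops
```

Every part of the trajectory outside the returned index ranges would count as a move under this simplified rule.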
Unsupervised and semi-supervised clustering: a brief survey
 7th ACM SIGMM international workshop on Multimedia information retrieval
Cited by 30 (0 self)
Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in the other clusters. This notion of similarity can be expressed in very different ways, according to the purpose of the study, to domain-specific assumptions and to prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items to predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to “guide” or “adjust” clustering (see section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., very many different methods were developed in several communities. Detailed surveys of this domain can be found in [25], [27] or [26]. In the following, we attempt to briefly review a few core concepts of cluster analysis and describe categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.
Density-Connected Sets and their Application for Trend Detection in Spatial Databases
 Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 10-15, Menlo Park
, 1997
Cited by 26 (3 self)
Several clustering algorithms have been proposed for class identification in spatial databases such as earth observation databases. The effectiveness of well-known algorithms such as DBSCAN, however, is somewhat limited because they do not fully exploit the richness of the different types of data contained in a spatial database. In this paper, we introduce the concept of density-connected sets and present a significantly generalized version of DBSCAN. The major properties of this algorithm are as follows: (1) any symmetric predicate can be used to define the neighborhood of an object, allowing a natural definition in the case of spatially extended objects such as polygons, and (2) the cardinality function for a set of neighboring objects may take into account the non-spatial attributes of the objects as a means of assigning application-specific weights. Density-connected sets can be used as a basis to discover trends in a spatial database. We define trends in spatial databases and show how to apply the generalized DBSCAN algorithm for the task of discovering such knowledge. To demonstrate the practical impact of our approach, we performed experiments on a geographical information system on Bavaria which is representative of a broad class of spatial databases.
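The two generalizations listed in the abstract, a neighborhood predicate and a weighted cardinality, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `gdbscan`, the parameter names, and the quadratic neighborhood scan are assumptions, and the predicate is assumed to be symmetric and reflexive.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1


def gdbscan(objects: Sequence[T],
            is_neighbour: Callable[[T, T], bool],
            weight: Callable[[T], float],
            min_weight: float) -> List[int]:
    """Label each object with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(objects)
    labels: List[Optional[int]] = [None] * n

    def neighbourhood(i: int) -> List[int]:
        # Linear scan; a spatial index would replace this in practice.
        return [j for j in range(n) if is_neighbour(objects[i], objects[j])]

    def is_core(nbrs: List[int]) -> bool:
        # Weighted cardinality of the neighbourhood instead of a plain count.
        return sum(weight(objects[j]) for j in nbrs) >= min_weight

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbourhood(i)
        if not is_core(nbrs):
            labels[i] = NOISE  # may be relabelled later as a border object
            continue
        labels[i] = cluster
        frontier = list(nbrs)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbourhood(j)
            if is_core(nbrs_j):
                frontier.extend(nbrs_j)  # keep expanding through core objects
        cluster += 1
    return [label if label is not None else NOISE for label in labels]
```

With a distance threshold as the predicate and a constant weight of 1, the sketch behaves like ordinary DBSCAN; swapping in a polygon-intersection predicate or an attribute-based weight changes what counts as dense without touching the expansion loop.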