Results 1–10 of 191
Survey of clustering data mining techniques
, 2002
Abstract

Cited by 251 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
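As a concrete illustration of "representing the data by fewer clusters", here is a minimal k-means sketch (not from the survey; the data and the deterministic seeding are illustrative assumptions): the dataset is summarized by k centroids, losing fine detail but achieving simplification.

```python
import numpy as np

def kmeans(X, init_centers, iters=20):
    """Lloyd's k-means: divide the rows of X into groups of similar
    points and summarize each group by its centroid."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(iters):
        # assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its current members
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # one group near (0, 0)
               rng.normal(8, 0.5, (20, 2))])  # one group near (8, 8)
# seed with one point from each end of the data (illustrative choice)
labels, centers = kmeans(X, init_centers=X[[0, -1]])
```

The 40 points are now modeled by just two centroids, which is the "data concept" the abstract refers to.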
Finding Generalized Projected Clusters in High Dimensional Spaces
Abstract

Cited by 141 (8 self)
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques, which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to trade off with better accuracy.
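A minimal sketch of the core idea (in the spirit of the paper's arbitrarily aligned projected clusters, not its actual algorithm): the subspace associated with a candidate cluster can be taken as the directions of least spread, i.e. the eigenvectors of the cluster's covariance with the smallest eigenvalues, which need not coincide with any original attributes.

```python
import numpy as np

def cluster_subspace(points, l):
    """Basis of the l-dimensional, arbitrarily oriented subspace in which
    the candidate cluster is tightest: eigenvectors of the cluster's
    covariance matrix with the l smallest eigenvalues."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, :l]                   # columns span the subspace

# a correlated "cluster": spread along the line x = y, tight in the
# orthogonal direction, which is not one of the original attributes
rng = np.random.default_rng(1)
t = rng.normal(0, 3, 200)
pts = np.column_stack([t, t + rng.normal(0, 0.1, 200)])
E = cluster_subspace(pts, l=1)
# projected onto E, the cluster's spread is tiny (roughly the 0.1 noise)
spread = float(np.std(pts @ E))
```

The recovered direction is approximately (1, -1)/sqrt(2), an inter-attribute correlation rather than a single attribute, which is exactly what axis-parallel subspace methods would miss.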
Clustering by Pattern Similarity in Large Data Sets
 In SIGMOD
Abstract

Cited by 122 (16 self)
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we propose, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
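The coherence test can be sketched with the 2x2-submatrix score that the pCluster model is commonly described with (a hedged reconstruction for illustration, not the paper's code or its efficient detection algorithm):

```python
from itertools import combinations

def pscore(x, y, a, b):
    """pScore of the 2x2 submatrix {x, y} x {a, b}: how far the two
    objects are from differing by a constant shift on these dimensions."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))

def is_pcluster(objects, dims, delta):
    """Objects form a pCluster on dims if every 2x2 submatrix has
    pScore <= delta, i.e. the objects rise and fall together even when
    their absolute magnitudes are far apart."""
    return all(pscore(x, y, a, b) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(dims, 2))

# two "genes" with very different magnitudes but the same pattern
g1 = [3.0, 5.0, 4.0, 8.0]
g2 = [13.1, 15.0, 14.2, 18.0]
coherent = is_pcluster([g1, g2], dims=range(4), delta=0.5)
# a third gene whose pattern does not track g1 at all
g3 = [3.0, 2.0, 9.0, 1.0]
incoherent = is_pcluster([g1, g3], dims=range(4), delta=0.5)
```

Note that g1 and g2 are roughly 10 units apart in Euclidean distance, so a distance-based model would never cluster them, yet their patterns match within 0.2.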
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
Abstract

Cited by 106 (2 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms.
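A single-pass, small-memory scheme of this kind can be sketched as divide-and-conquer over chunks (an illustrative simplification in the spirit of the streaming approach, not the paper's exact algorithm): each in-memory chunk is reduced to k weighted centers, and only those summaries are retained and re-clustered.

```python
import numpy as np

def weighted_kmeans(X, k, w=None, iters=15):
    """Lloyd's k-means on weighted points; returns the k centers and the
    total weight absorbed by each one."""
    w = np.ones(len(X)) if w is None else np.asarray(w, dtype=float)
    # deterministic, spread-out init: points at k quantiles of coordinate 0
    order = np.argsort(X[:, 0])
    centers = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = np.average(X[members], axis=0, weights=w[members])
    weights = np.array([w[labels == j].sum() for j in range(k)])
    return centers, weights

def stream_cluster(chunks, k):
    """One pass over the stream: each chunk that fits in memory is reduced
    to k weighted centers, and at the end only the summaries are clustered."""
    pts, wts = [], []
    for chunk in chunks:
        c, w = weighted_kmeans(chunk, k)
        pts.append(c)
        wts.append(w)
    return weighted_kmeans(np.vstack(pts), k, w=np.concatenate(wts))[0]

rng = np.random.default_rng(0)
chunks = [np.vstack([rng.normal(0, 0.3, (50, 2)),
                     rng.normal(8, 0.3, (50, 2))]) for _ in range(5)]
centers = stream_cluster(chunks, k=2)
```

Memory usage is bounded by one chunk plus the running list of k centers per chunk, which is the property the abstract calls crucial for stream data.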
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces
, 2000
Abstract

Cited by 106 (1 self)
Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current state-of-the-art technique for indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then index the reduced-dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e., most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dime...
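The GDR step described above can be sketched with a plain SVD-based PCA (illustrative; the data and variable names are mine). The key property for indexing is that an orthogonal projection never increases pairwise distances, so a range query evaluated in the reduced space can return false positives but never false dismissals.

```python
import numpy as np

def pca_reduce(X, d):
    """Global dimensionality reduction (GDR): center the data and keep the
    projection onto the top-d principal components."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:d].T, Vt[:d]

# globally correlated data: 10-d points that really vary along one line
rng = np.random.default_rng(0)
t = rng.normal(0, 1, (500, 1))
X = t @ rng.normal(0, 1, (1, 10)) + rng.normal(0, 0.01, (500, 10))
Xr, components = pca_reduce(X, d=1)

# reduced-space distance lower-bounds the true distance:
# candidates from a 1-d index are a superset of the true answers
d_reduced = np.linalg.norm(Xr[0] - Xr[1])
d_full = np.linalg.norm(X[0] - X[1])
```

When the data is only locally correlated, the top global components capture little variance, d_reduced becomes a loose lower bound, and the false-positive rate the abstract mentions grows, which motivates the paper's local approach.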
Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
 In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge discovery and data mining
, 1999
Abstract

Cited by 99 (1 self)
This paper introduces a novel approach to rating-based collaborative filtering. The new technique is most appropriate for e-commerce merchants offering one or more groups of relatively homogeneous items such as compact disks, videos, books, software and the like. In contrast with other known collaborative filtering techniques, the new algorithm is graph-theoretic, based on the twin new concepts of horting and predictability. As is demonstrated in this paper, the technique is fast, scalable, accurate, and requires only a modest learning curve. It makes use of a hierarchical classification scheme in order to introduce context into the rating process, and uses so-called creative links in order to find surprising and atypical items to recommend, perhaps even items which cross the group boundaries. The new technique is one of the key engines of the Intelligent Recommendation Algorithm (IRA) project, now being developed at IBM Research. In addition to several other recommendation engines, IRA contains a situation analyzer to determine the most appropriate mix of engines for a particular e-commerce merchant, as well as an engine for optimizing the placement of advertisements.
A Monte Carlo Algorithm for Fast Projective Clustering
 In Proceedings of the 2002 ACM SIGMOD International conference on Management of data
, 2002
Abstract

Cited by 83 (1 self)
We propose a mathematical formulation for the notion of an optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images.
Matrix Approximation and Projective Clustering via Volume Sampling
, 2006
Abstract

Cited by 63 (2 self)
Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a low-rank approximation D that minimizes ‖A − D‖_F within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we generalize this result in two ways. First, we prove that the additive error drops exponentially by iterating the sampling in an adaptive manner (adaptive sampling). Using this result, we give a pass-efficient algorithm for computing a low-rank approximation with reduced additive error. Our second result is that there exist k rows of A whose span contains the rows of a multiplicative (k+1)-approximation to the best rank-k matrix; moreover, this subset can be found by sampling k-subsets of rows from a natural distribution (volume sampling). Combining volume sampling with adaptive sampling yields the existence of a set of k + k(k+1)/ε rows whose span contains the rows of a multiplicative (1 + ε)-approximation. This leads to a PTAS for the following NP-hard ...
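The Frieze-Kannan-Vempala style length-squared row sampling that this work builds on can be sketched as follows (an illustrative reconstruction of the baseline, not the paper's adaptive or volume sampling): rows are drawn with probability proportional to their squared norms, rescaled, and the top right-singular directions of the small sketch define a rank-k projection of A.

```python
import numpy as np

def sample_rows(A, s, rng):
    """Length-squared sampling: pick s rows with probability proportional
    to their squared norms, rescaled so the sketch is an unbiased
    estimator of A^T A."""
    p = (A * A).sum(axis=1)
    p = p / p.sum()
    idx = rng.choice(len(A), size=s, p=p)
    return A[idx] / np.sqrt(s * p[idx])[:, None]

def low_rank_from_sample(A, k, s, rng):
    """Project A onto the top-k right singular directions of the sampled
    sketch: a rank-k approximation computed from only s rows."""
    S = sample_rows(A, s, rng)
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    V = Vt[:k].T
    return A @ V @ V.T

rng = np.random.default_rng(0)
# a nearly rank-2 matrix plus small noise
B = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 50))
A = B + 0.01 * rng.normal(size=(200, 50))
Ak = low_rank_from_sample(A, k=2, s=20, rng=rng)
err = np.linalg.norm(A - Ak)  # additive error, small with high probability
```

The paper's contributions sharpen this baseline: adaptive sampling shrinks the additive term exponentially over rounds, and volume sampling of k-subsets yields the multiplicative guarantee.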
A Framework for Projected Clustering of High Dimensional Data Streams
 IN PROC. OF VLDB
, 2004
Abstract

Cited by 58 (10 self)
The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan stream analysis methods have been proposed in this context. However, ...
Adaptive Dimension Reduction for Clustering High Dimensional Data
, 2002
Abstract

Cited by 57 (2 self)
It is well-known that for high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods were proposed to tackle this problem, but with only limited success. In this paper we propose a new approach to resolve this problem by repeated dimension reductions such that K-means or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and Internet newsgroups demonstrates the effectiveness of the proposed algorithm.
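The alternation the abstract describes (cluster in a low-dimensional subspace, then use the cluster membership to pick the next subspace) can be sketched like this; it is my own simplified reconstruction, and the initialization and the centroid-span subspace choice are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def kmeans_in_subspace(Y, k, iters=10):
    """Plain k-means in the (very low dimensional) projected space,
    deterministically seeded at k quantiles of the first coordinate."""
    idx = np.argsort(Y[:, 0])[np.linspace(0, len(Y) - 1, k).astype(int)]
    centers = Y[idx].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(Y[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

def adaptive_kmeans(X, k, rounds=4):
    """Alternate dimension reduction and clustering: K-means runs only in
    a (k-1)-dimensional subspace; the memberships are the 'bridge' used to
    re-derive the next subspace from the full-space cluster centroids."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k - 1].T                       # round 0: PCA subspace
    labels = None
    for _ in range(rounds):
        labels = kmeans_in_subspace(Xc @ V, k)
        C = np.vstack([Xc[labels == j].mean(axis=0)
                       for j in range(k) if (labels == j).any()])
        Q, _ = np.linalg.qr((C - C.mean(axis=0)).T)
        V = Q[:, :k - 1]                   # next subspace: centroid span
    return labels

rng = np.random.default_rng(0)
blob_a = rng.normal(0, 1, (60, 10))
blob_b = rng.normal(0, 1, (60, 10))
blob_b[:, 0] += 10                         # separated along one attribute
X = np.vstack([blob_a, blob_b])
labels = adaptive_kmeans(X, k=2)
```

Because K-means only ever runs in k-1 dimensions, each round is cheap, while the re-derived subspace keeps tracking the directions that actually separate the current clusters.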