Results 1 - 10 of 74
Data Clustering: 50 Years Beyond K-Means
, 2008
Abstract
Cited by 75 (3 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simplest clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
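For illustration, the K-means procedure this abstract surveys can be sketched as below. This is a minimal version of Lloyd's iteration, not the authors' code; the function name and tuple-based point representation are our own choices.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means over points given as tuples of floats."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers
```

The two-step structure (assign, then update) is exactly what makes the algorithm simple enough to remain in wide use, and also what ties it to the squared-error objective the abstract alludes to.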
Streaming Pattern Discovery in Multiple Time-Series
 In VLDB
, 2005
Abstract
Cited by 68 (15 self)
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection.
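The kind of incremental hidden-variable tracking SPIRIT performs can be sketched for a single hidden variable as below. The update rule here is a generic incremental-PCA-style projection/residual correction; the function name, the initial energy value, and the forgetting factor are our assumptions, not the paper's code.

```python
def track_direction(w, x, d, lam=0.96):
    """One tick of single-hidden-variable tracking (sketch):
    project the new sample x onto direction w, then nudge w
    toward the residual, with exponential forgetting lam."""
    y = sum(wi * xi for wi, xi in zip(w, x))          # hidden variable (projection)
    d = lam * d + y * y                               # running energy along w
    e = [xi - y * wi for wi, xi in zip(w, x)]         # residual of the sample
    w = [wi + (y / d) * ei for wi, ei in zip(w, e)]   # gradient-style correction
    norm = sum(wi * wi for wi in w) ** 0.5
    return [wi / norm for wi in w], y, d
```

Feeding perfectly correlated streams (e.g., the second always twice the first) drives w toward the shared direction, which is how one hidden variable can summarise several input streams.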
Ranking a stream of news
 In WWW ’05: Proceedings of the 14th international conference on World Wide Web
, 2005
Abstract
Cited by 27 (1 self)
According to a recent survey by Nielsen NetRatings, searching news articles is one of the most important activities online. Indeed, Google, Yahoo, MSN, and many others have launched commercial search engines that index news feeds. Despite this commercial interest, no academic research has focused on ranking a stream of news articles and a set of news sources. In this paper, we introduce this problem by proposing a ranking framework that models: (1) the process by which a stream of news articles is generated, (2) the clustering of news articles by topic, and (3) the evolution of news stories over time. The proposed algorithm ranks news information, finding the most authoritative news sources and identifying the most interesting events in the different categories to which news articles belong. All these ranking measures take time into account and can be obtained without a predefined sliding window of observation over the stream. The complexity of our algorithm is linear in the number of news items still under consideration at the time of a new posting, which allows a continuous online ranking process. Our ranking framework is validated on a collection of more than 300,000 news items, produced over two months by more than 2,000 news sources belonging to 13 different categories (World, U.S., Europe, Sports, Business, etc.). This collection is extracted from the index of comeToMyHead, an academic news search engine available online.
Compressing Large Boolean Matrices Using Reordering Techniques
, 2004
Abstract
Cited by 25 (1 self)
Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space- and time-efficient scalable storage and query mechanisms for such large matrices is a challenging problem.
A Novel Efficient Group Signature With Forward Security
 ICICS 2003, LNCS 2836
, 2003
Abstract
Cited by 21 (0 self)
Abstract—Given a data set P, a k-means query returns k points in space (called centers), such that the average squared distance between each point in P and its nearest center is minimized. Since this problem is NP-hard, several approximate algorithms have been proposed and used in practice. In this paper, we study continuous k-means computation at a server that monitors a set of moving objects. Re-evaluating k-means every time there is an object update imposes a heavy burden on the server (for computing the centers from scratch) and the clients (for continuously sending location updates). We overcome these problems with a novel approach that significantly reduces the computation and communication costs, while guaranteeing that the quality of the solution, with respect to the re-evaluation approach, is bounded by a user-defined tolerance. The proposed method assigns each moving object a threshold (i.e., range) such that the object sends a location update only when it crosses the range boundary. First, we develop an efficient technique for maintaining the k-means. Then, we present mathematical formulas and algorithms for deriving the individual thresholds. Finally, we justify our performance claims with extensive experiments. Index Terms—k-means, continuous monitoring, query processing.
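The client-side part of the threshold scheme described above amounts to a simple range check, sketched below. The function and parameter names are illustrative, not the paper's API; the paper's contribution lies in how the per-object thresholds are derived, which is not shown here.

```python
def needs_update(pos, anchor, threshold):
    """An object reports its location only when it has moved more than
    `threshold` away from its last reported position `anchor`.
    Comparing squared distances avoids a square root."""
    dist2 = sum((a - b) ** 2 for a, b in zip(pos, anchor))
    return dist2 > threshold ** 2
```

Because most objects stay inside their range most of the time, the server receives far fewer messages than under continuous reporting, which is the source of the communication savings the abstract claims.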
Density-Based Clustering for Real-Time Stream Data
 In Proc. of KDD '07
, 2007
Abstract
Cited by 20 (0 self)
Existing data-stream clustering algorithms such as CluStream are based on k-means. These algorithms cannot find clusters of arbitrary shapes and cannot handle outliers. Further, they require prior knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component, which maps each input data record into a grid, and an offline component, which computes the grid density and clusters the grids based on the density. The algorithm adopts a density-decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers, in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
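The density-decaying bookkeeping for a single grid can be sketched as below, assuming the common lazy-update form: a grid's density is decayed by the elapsed time only when a new point arrives, so no global sweep is needed per tick. The class name and default decay factor are our assumptions, not the paper's code.

```python
class GridCell:
    """Per-grid density with exponential decay, updated lazily on arrival."""

    def __init__(self, decay=0.998):
        self.decay = decay
        self.density = 0.0
        self.t_last = 0       # time of the last update to this cell

    def add_point(self, t):
        # Decay the stored density for the elapsed time, then add
        # the new point's weight of 1.
        self.density = self.density * self.decay ** (t - self.t_last) + 1.0
        self.t_last = t
```

Because the decay is applied retroactively per cell, cells that stop receiving points fade toward zero, which is what lets the offline component drop sporadic, outlier-fed grids.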
Stream Monitoring under the Time Warping Distance
Abstract
Cited by 17 (2 self)
Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams and to find subsequences that are similar to a given query sequence under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor pattern matching, monitoring of biomedical signals (e.g., EKG, ECG), and monitoring of environmental (seismic and volcanic) signals. DTW is a very popular distance measure, permitting accelerations and decelerations, and it has been studied for finite, stored sequence sets. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously and it is infeasible to save all the historical data. We propose SPRING, a novel algorithm that solves this problem. We provide a theoretical analysis and prove that SPRING does not sacrifice accuracy, while it requires constant space and time per time-tick. These are dramatic improvements over the naive method. Our experiments on real and realistic data illustrate that SPRING does indeed detect the qualifying subsequences correctly and that it can offer dramatic improvements in speed (up to 650,000 times) over the naive implementation.
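The constant-space-per-tick property can be seen in a sketch of subsequence DTW: only the previous column of the cumulative-distance matrix is kept, and a match may start at any tick. This follows the standard subsequence-DTW recurrence, not the paper's own code (which additionally tracks the starting position of each candidate match).

```python
import math

def spring_step(x, col, query):
    """Advance subsequence DTW by one stream value x.
    `col` is the previous column of cumulative distances, one per query position."""
    new = [0.0] * len(query)
    for j, q in enumerate(query):
        cost = abs(x - q)
        if j == 0:
            new[j] = cost                                   # match may start here
        else:
            new[j] = cost + min(new[j - 1], col[j], col[j - 1])
    return new

# Monitor a stream for the best-matching subsequence of `query`.
query = [1.0, 2.0, 3.0]
col = [math.inf] * len(query)
best = math.inf
for x in [5.0, 1.0, 2.0, 3.0, 9.0]:
    col = spring_step(x, col, query)
    best = min(best, col[-1])       # best full-query match ending so far
```

Each tick costs O(len(query)) time and space regardless of how long the stream has run, which is the "constant space and time per time-tick" claim in the abstract.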
A New Conceptual Clustering Framework
 MACHINE LEARNING
, 2004
Abstract
Cited by 12 (2 self)
We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.
BBM: Bayesian Browsing Model from Petabyte-Scale Data
Abstract
Cited by 11 (3 self)
Given a quarter petabyte of click-log data, how can we estimate the relevance of each URL for a given query? In this paper, we propose the Bayesian Browsing Model (BBM), a new modeling technique with the following advantages: (a) it does exact inference; (b) it is single-pass and parallelizable; (c) it is effective. We present two sets of experiments to test the model's effectiveness and efficiency. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click-log set, spanning a quarter petabyte of data, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.
Optimal Sampling from Sliding Windows
 In ACM PODS 2009
, 2009
Abstract
Cited by 10 (2 self)
The sliding windows model is an important case of the streaming model, where only the most “recent” elements of a stream remain active and the rest are discarded. The model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of sliding windows: windows with fixed size (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and bursty windows (e.g., where many items can arrive in “bursts” at a single step and where only items from the last t steps remain active, again for some fixed parameter t). Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a non-trivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly “translated” to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is oversampling: when k samples are required, the oversampling method maintains k′ > k samples in the hope that at least k samples have not expired. The obvious disadvantages of this method are twofold: (a) it introduces additional costs and thus decreases performance; and (b) the memory bounds are not deterministic, which is atypical for
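The oversampling baseline the abstract criticizes can be sketched as below: keep a pool of k′ > k (timestamp, item) pairs and, at query time, use only those still inside the window. The function and parameter names are illustrative, not the paper's; the point is the failure mode, namely that fewer than k samples may survive.

```python
import random

def sample_window(pool, t, window_len, k):
    """From a pool of k' > k previously sampled (timestamp, item) pairs,
    return k uniform samples among the unexpired ones, or None if
    oversampling did not leave enough active samples."""
    active = [item for ts, item in pool if t - ts < window_len]
    if len(active) < k:
        return None               # too many samples expired: the bound failed
    return random.sample(active, k)
```

The `None` branch is exactly the non-deterministic memory/correctness trade-off the abstract objects to: no fixed k′ guarantees that k samples survive every window, which motivates the optimal schemes the paper develops instead.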