Results 11–20 of 157
Scalable K-Means++
Cited by 24 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
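The sequential D²-weighted seeding that k-means++ performs (and that k-means|| parallelizes) can be sketched as follows; this is a minimal one-dimensional illustration with hypothetical names, not the papers' implementation:

```python
import random

def kmeans_pp_init(points, k):
    """k-means++ seeding on 1-D points: the first center is chosen
    uniformly, and each subsequent center is drawn with probability
    proportional to its squared distance to the nearest center chosen
    so far."""
    centers = [random.choice(points)]
    for _ in range(k - 1):
        # One full pass over the data: squared distance of every point
        # to its nearest already-chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = random.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])  # guard against floating-point edge case
    return centers
```

Each iteration of the outer loop is one full pass over the data, which is exactly the k-pass bottleneck the abstract refers to.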
Fast clustering using MapReduce
In KDD, 2011
Cited by 24 (4 self)
Clustering problems have numerous applications and are becoming more challenging with the growing size of the data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant-factor approximation guarantees. From a theoretical perspective, we give the first analysis showing that several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and run a time-consuming clustering algorithm, such as local search or Lloyd's algorithm, on the reduced data set. Our algorithms have sufficient flexibility to be used in practice, since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than those of the other algorithms, while running faster than any other parallel algorithm tested, for sufficiently large data sets.
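The sample-then-cluster idea described above can be illustrated with a small sketch. The names are hypothetical, plain Python stands in for MapReduce, and Gonzalez's greedy farthest-point heuristic (a classic 2-approximation for k-center) stands in for the paper's time-consuming subroutine:

```python
import random

def farthest_point(points, k):
    """Gonzalez's greedy 2-approximation for k-center on 1-D points:
    repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    return centers

def sample_then_cluster(points, k, sample_size):
    """Shrink the data by uniform sampling, then run the expensive
    clustering algorithm only on the sample. The paper's MapReduce
    rounds and approximation analysis are not reproduced here."""
    sample = random.sample(points, min(sample_size, len(points)))
    return farthest_point(sample, k)
```

The design point is that the expensive subroutine only ever sees the small sample, so its cost no longer depends on the full data size.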
A data placement strategy in scientific cloud workflows
FUTURE GENERATION COMPUTER SYSTEMS, 2010
BBM: Bayesian Browsing Model from Petabyte-scale Data
Cited by 21 (4 self)
Given a quarter petabyte of click log data, how can we estimate the relevance of each URL for a given query? In this paper, we propose the Bayesian Browsing Model (BBM), a new modeling technique with the following advantages: (a) it does exact inference; (b) it is single-pass and parallelizable; (c) it is effective. We present two sets of experiments to test model effectiveness and efficiency. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click-log set, spanning a quarter petabyte of data, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.
Optimal Sampling from Sliding Windows
In ACM PODS, 2009
Cited by 19 (3 self)
A sliding windows model is an important case of the streaming model, where only the most “recent” elements remain active and the rest are discarded from the stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of sliding windows: windows of fixed size (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and bursty windows (e.g., where many items can arrive in “bursts” at a single step and only items from the last t steps remain active, again for some fixed parameter t). Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a non-trivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly “translated” to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is oversampling: when k samples are required, the oversampling method maintains k′ > k samples in the hope that at least k samples have not expired. The obvious disadvantages of this method are twofold: (a) it introduces additional costs and thus decreases performance; and (b) the memory bounds are not deterministic, which is atypical for …
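The oversampling workaround, and the non-determinism the abstract criticizes, can be shown with a toy sketch for a fixed-size window over a finite stream (the names, and the use of a single plain reservoir over the whole stream, are my own simplifications, not the paper's construction):

```python
import random

def oversampled_window_sample(stream, n, k, k_prime):
    """Maintain a uniform reservoir of k' > k (index, item) pairs over
    the whole stream, then keep only the samples whose index still lies
    inside the window of the last n items. Returns k surviving samples,
    or None -- illustrating that success is not deterministic."""
    reservoir = []
    length = 0
    for i, x in enumerate(stream):
        length += 1
        if len(reservoir) < k_prime:
            reservoir.append((i, x))
        else:
            # Classic reservoir sampling replacement step.
            j = random.randint(0, i)
            if j < k_prime:
                reservoir[j] = (i, x)
        # Expired samples cannot be replaced eagerly: deletions are
        # implicit, so we only discover them at query time below.
    alive = [x for i, x in reservoir if i >= length - n]
    return alive[:k] if len(alive) >= k else None
```

If the stream is much longer than the window, most of the k′ samples will have expired by query time, which is exactly why oversampling alone gives no deterministic memory/success guarantee.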
Supervised clustering of streaming data for email batch detection
In International Conference on Machine Learning, 2007
Cited by 16 (1 self)
We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made – owing to the streaming nature of the data – the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
Using Association Rules for Fraud Detection in Web Advertising Networks
2005
Cited by 14 (5 self)
Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, Streaming-Rules, to report association rules with tight guarantees on errors, using limited processing per element and minimal space. The modular design of Streaming-Rules allows for integration with current stream management systems, since it employs existing techniques for finding frequent elements. The presentation emphasizes the applicability of the algorithm to fraud detection in advertising networks; such fraud instances have not been successfully detected by current techniques. Our experiments on synthetic data demonstrate scalability and efficiency, and on real data, potential fraud was discovered.
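The abstract notes that the algorithm reuses existing techniques for finding frequent elements. One such technique, due to the same authors, is the Space-Saving counter; a minimal sketch (the function name is mine):

```python
def space_saving(stream, m):
    """Space-Saving frequent-elements summary: keep at most m counters.
    A new element evicts the element with the smallest count and
    inherits that count, so each reported count overestimates the true
    count by at most the minimum counter value."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # Evict the least-counted element; the newcomer inherits its count.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters
```

The appeal for streaming settings is the fixed memory footprint (m counters) and O(1)-per-element amortized processing, matching the "limited processing per element, and minimal space" requirement stated above.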
A New Conceptual Clustering Framework
MACHINE LEARNING, 2004
Cited by 14 (2 self)
We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.
Anomaly Detection in a Mobile Communication Network
PROCEEDINGS OF THE NAACSOS, 2006
Cited by 12 (5 self)
Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the result of an increased number of people trying to call friends or family to tell them what is happening, or of a decrease in network usage caused by people being unable to use the network. Such events are anomalies, and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection.
A Survey On: Content Based Image Retrieval Systems Using Clustering Techniques For Large Data sets
Cited by 12 (0 self)
Content-based image retrieval (CBIR) is a new but widely adopted method for finding images in vast, unannotated image databases. As networks and multimedia technologies become more popular, users are no longer satisfied with traditional information retrieval techniques, so CBIR is becoming a source of exact and fast retrieval. In recent years, a variety of techniques have been developed to improve the performance of CBIR. Data clustering is an unsupervised method for extracting hidden patterns from huge data sets. With large data sets comes the possibility of high dimensionality, and achieving both accuracy and efficiency on high-dimensional data sets with an enormous number of samples is challenging. In this paper the clustering techniques are discussed and analysed. We also propose a method, HDK, that uses more than one clustering technique to improve the performance of CBIR. This method combines hierarchical and divide-and-conquer K-Means clustering with equivalency and compatibility relation concepts to improve the performance of K-Means on high-dimensional datasets. It also uses color, texture and shape features for an accurate and effective retrieval system.