Results 11 - 20
of
101
Data Clustering: 50 Years Beyond K-Means
, 2008
"... Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and m ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the illposed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection, and data clustering and large scale data clustering.
Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant Process: with Applications to Evolutionary Clustering
"... Clustering is an important data mining task for exploration and visualization of different data types like news stories, scientific publications, weblogs, etc. Due to the evolving nature of these data, evolutionary clustering, also known as dynamic clustering, has recently emerged to cope with the c ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
Clustering is an important data mining task for exploration and visualization of different data types like news stories, scientific publications, weblogs, etc. Due to the evolving nature of these data, evolutionary clustering, also known as dynamic clustering, has recently emerged to cope with the challenges of mining temporally smooth clusters over time. A good evolutionary clustering algorithm should be able to fit the data well at each time epoch, and at the same time results in a smooth cluster evolution that provides the data analyst with a coherent and easily interpretable model. In this paper we introduce the temporal Dirichlet process mixture model (TDPM) as a framework for evolutionary clustering. TDPM is a generalization of the DPM framework for clustering that automatically grows the number of clusters with the data. In our framework, the data is divided into epochs; all data points inside the same epoch are assumed to be fully exchangeable, whereas the temporal order is maintained across epochs. Moreover, The number of clusters in each epoch is unbounded: the clusters can retain, die out or emerge over time, and the actual parameterization of each cluster can also evolve over time in a Markovian fashion. We give a detailed and intuitive construction of this framework using the recurrent Chinese restaurant process (RCRP) metaphor, as well as a Gibbs sampling algorithm to carry out posterior inference in order to determine the optimal cluster evolution. We demonstrate our model over simulated data by using it to build an infinite dynamic mixture of Gaussian factors, and over real dataset by using it to build a simple non-parametric dynamic clustering-topic model and apply it to analyze the NIPS12 document collection.
Adaptive Mining Techniques for Data Streams Using Algorithm Output Granularity Mohamed
- Workshop (AusDM 2003), Held in conjunction with the 2003 Congress on Evolutionary Computation (CEC 2003
, 2003
"... Mining data streams is an emerging area of research given the potentially large number of business and scientific applications. A significant challenge in analyzing /mining data streams is the high data rate of the stream. In this paper, we propose a novel approach to cope with the high data rate of ..."
Abstract
-
Cited by 10 (5 self)
- Add to MetaCart
Mining data streams is an emerging area of research given the potentially large number of business and scientific applications. A significant challenge in analyzing /mining data streams is the high data rate of the stream. In this paper, we propose a novel approach to cope with the high data rate of incoming data streams. We termed our approach "algorithm output granularity". It is a resource-aware approach that is adaptable to available memory, time constraints, and data stream rate. The approach is generic and applicable to clustering, classification and counting frequent items mining techniques. We have developed a data stream clustering algorithm based on the algorithm output granularity approach. We present this algorithm and discuss its implementation and empirical evaluation. The experiments show acceptable accuracy accompanied with run-time efficiency. They show that the proposed algorithm outperforms the K-means in terms of running time while preserving the accuracy that our algorithm can achieve.
Cost-Efficient Mining Techniques for Data Streams
- In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004
, 2004
"... A data stream is a continuous and high-speed flow of data items. High speed refers to the phenomenon that the data rate is high relative to the computational power. The increasing focus of applications that generate and receive data streams stimulates the need for online data stream analysis tools. ..."
Abstract
-
Cited by 9 (8 self)
- Add to MetaCart
A data stream is a continuous and high-speed flow of data items. High speed refers to the phenomenon that the data rate is high relative to the computational power. The increasing focus of applications that generate and receive data streams stimulates the need for online data stream analysis tools. Mining data streams is a real time process of extracting interesting patterns from high-speed data streams. Mining data streams raises new problems for the data mining community in terms of how to mine continuous high-speed data items that you can only have one look at. In this paper, we propose algorithm output granularity as a solution for mining data streams. Algorithm output granularity is the amount of mining results that fits in main memory before any incremental integration. We show the application of the proposed strategy to build efficient clustering, frequent items and classification techniques. The empirical results for our clustering algorithm are presented and discussed which demonstrate acceptable accuracy coupled with efficiency in running time.
Anomaly Detection in a Mobile Communication Network
- Proceedings of the NAACSOS
, 2006
"... Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the results of an increased number of people trying to call friends or family to tell the ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the results of an increased number of people trying to call friends or family to tell them what is happening or a decrease in network usage caused by people being unable to use the network. Such events are anomalies and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection. Contact:
Mining Data Streams
- A Review, ACM SIGMOD Record
, 2005
"... Abstract: Mining data streams has raised a number of research challenges for the data mining community. These challenges include the limitations of computational resources, especially because mining streams of data most likely be done on a mobile device with limited resources. Also due to the contin ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Abstract: Mining data streams has raised a number of research challenges for the data mining community. These challenges include the limitations of computational resources, especially because mining streams of data most likely be done on a mobile device with limited resources. Also due to the continuality of data streams, the algorithm should have only one pass or less over the incoming data records. In this article, our Algorithm Output Granularity (AOG) approach in mining data streams is discussed. AOG is a novel adaptable approach that can cope with the challenging inherent features of data streams. We also show the results for AOG based clustering in a resource constrained environment.
Local correlation tracking in time series
- In ICDM
, 2006
"... We address the problem of capturing and tracking local correlations among time evolving time series. Our approach is based on comparing the local auto-covariance matrices (via their spectral decompositions) of each series and generalizes the notion of linear cross-correlation. In this way, it is pos ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
We address the problem of capturing and tracking local correlations among time evolving time series. Our approach is based on comparing the local auto-covariance matrices (via their spectral decompositions) of each series and generalizes the notion of linear cross-correlation. In this way, it is possible to concisely capture a wide variety of local patterns or trends. Our method produces a general similarity score, which evolves over time, and accurately reflects the changing relationships. Finally, it can also be estimated incrementally, in a streaming setting. We demonstrate its usefulness, robustness and efficiency on a wide range of real datasets. 1
Density-Based Clustering for Real-Time Stream Data
- Proc. Of KDD' 07
, 2007
"... Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are incompetent to find clusters of arbitrary shapes and cannot handle outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this paper ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are incompetent to find clusters of arbitrary shapes and cannot handle outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams. 1.
A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS
"... The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generat ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Some key synopsis methods include those of sampling, wavelets, sketches and histograms. In this chapter, we will provide a survey of the key synopsis techniques, and the mining techniques supported by such methods. We will discuss the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.
Mining deviants in time series data streams
- In SSDBM, Santorini Island
, 2004
"... One of the central tasks in managing, monitoring and mining data streams is that of identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of “deviants ” from Jagadish et ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
One of the central tasks in managing, monitoring and mining data streams is that of identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of “deviants ” from Jagadish et al [19] as outliers. Deviants are based on one of the most fundamental statistical concept of standard deviation (or variance). Formally, deviants are defined based on a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima, but rather these are appropriate local aberrations. Deviants are known to be of great mining value in time series databases. We present first-known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem—uni- vs multivariate time series, optimal vs nearoptimal vs heuristic solutions, offline vs streaming—our algorithms have the same framework of maintaining a hierarchical set of candidate deviants that are updated as the time series data gets progressively revealed. We show experimentally using real network traffic data (SNMP aggregate time series) as well as synthetic data that our algorithm is remarkably accurate in determining the deviants. 1.

