Results 1 - 10
of
43
Streaming Pattern Discovery in Multiple Time-Series
- In VLDB
, 2005
"... In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream col ..."
Abstract
-
Cited by 50 (14 self)
- Add to MetaCart
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection.
Ranking a Stream of News
- in WWW ’05: Proceedings of the 14th international conference on World Wide Web
, 2005
"... According to a recent survey made by Nielsen NetRatings, searching on news articles is one of the most important activity online. Indeed, Google, Yahoo, MSN and many others have proposed commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has foc ..."
Abstract
-
Cited by 21 (1 self)
- Add to MetaCart
According to a recent survey made by Nielsen NetRatings, searching on news articles is one of the most important activity online. Indeed, Google, Yahoo, MSN and many others have proposed commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has focused on ranking a stream of news articles and a set of news sources. In this paper, we introduce this problem by proposing a ranking framework which models: (1) the process of generation of a stream of news articles, (2) the news articles clustering by topics, and (3) the evolution of news story over the time. The ranking algorithm proposed ranks news information, finding the most authoritative news sources and identifying the most interesting events in the di#erent categories to which news article belongs. All these ranking measures take in account the time and can be obtained without a predefined sliding window of observation over the stream. The complexity of our algorithm is linear in the number of pieces of news still under consideration at the time of a new posting. This allow a continuous on-line process of ranking. Our ranking framework is validated on a collection of more than 300,000 pieces of news, produced in two months by more then 2000 news sources belonging to 13 di#erent categories (World, U.S, Europe, Sports, Business, etc). This collection is extracted from the index of comeToMyHead, an academic news search engine available online. Categories and Subject Descriptors H.3.1 [Information Storage And Retrieval]: Content Analysis and IndexingRetrieval models; Search process; H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage And Retrieval]: OnlineInformation Services General Terms Algorithms, Experimentat...
Compressing Large Boolean Matrices Using Reordering Techniques
, 2004
"... Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space and time e#cient scalable storage and query mechanisms for such ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space and time e#cient scalable storage and query mechanisms for such large matrices is a challenging problem.
A Novel Efficient
- Group Signature With Forward Security, ICICS 2003, LNCS 2836
, 2003
"... Abstract—Given a data set P,ak-means query returns k points in space (called centers), such that the average squared distance between each point in P and its nearest center is minimized. Since this problem is NP-hard, several approximate algorithms have been proposed and used in practice. In this pa ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
Abstract—Given a data set P,ak-means query returns k points in space (called centers), such that the average squared distance between each point in P and its nearest center is minimized. Since this problem is NP-hard, several approximate algorithms have been proposed and used in practice. In this paper, we study continuous k-means computation at a server that monitors a set of moving objects. Reevaluating k-means every time there is an object update imposes a heavy burden on the server (for computing the centers from scratch) and the clients (for continuously sending location updates). We overcome these problems with a novel approach that significantly reduces the computation and communication costs, while guaranteeing that the quality of the solution, with respect to the reevaluation approach, is bounded by a user-defined tolerance. The proposed method assigns each moving object a threshold (i.e., range) such that the object sends a location update only when it crosses the range boundary. First, we develop an efficient technique for maintaining the k-means. Then, we present mathematical formulas and algorithms for deriving the individual thresholds. Finally, we justify our performance claims with extensive experiments. Index Terms—k-Means, continuous monitoring, query processing. Ç
Data Clustering: 50 Years Beyond K-Means
, 2008
"... Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and m ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the illposed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection, and data clustering and large scale data clustering.
A New Conceptual Clustering Framework
- MACHINE LEARNING
, 2004
"... We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.
Anomaly Detection in a Mobile Communication Network
- Proceedings of the NAACSOS
, 2006
"... Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the results of an increased number of people trying to call friends or family to tell the ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the results of an increased number of people trying to call friends or family to tell them what is happening or a decrease in network usage caused by people being unable to use the network. Such events are anomalies and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection. Contact:
Stream Monitoring under the Time Warping Distance
"... Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor patte ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor pattern matching, and monitoring of biomedical signals (e.g., EKG, ECG), and monitoring of environmental (seismic and volcanic) signals. DTW is a very popular distance measure, permitting accelerations and decelerations, and it has been studied for finite, stored sequence sets. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously and it is infeasible to save all the historical data. We propose SPRING, a novel algorithm that can solve the problem. We provide a theoretical analysis and prove that SPRING does not sacrifice accuracy, while it requires constant space and time per time-tick. These are dramatic improvements over the naive method. Our experiments on real and realistic data illustrate that SPRING does indeed detect the qualifying subsequences correctly and that it can offer dramatic improvements in speed (up to 650,000 times) over the naive implementation. 1
Density-Based Clustering for Real-Time Stream Data
- Proc. Of KDD' 07
, 2007
"... Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are incompetent to find clusters of arbitrary shapes and cannot handle outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this paper ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are incompetent to find clusters of arbitrary shapes and cannot handle outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams. 1.
Distributed Pattern Discovery in Multiple Streams
- In Proc. of PAKDD
, 2006
"... and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

