Mining time-changing data streams
- IN PROC. OF THE 2001 ACM SIGKDD INTL. CONF. ON KNOWLEDGE DISCOVERY AND DATA MINING
, 2001
Cited by 196 (4 self)
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
MAFIA: A maximal frequent itemset algorithm for transactional databases
- In ICDE
, 2001
Cited by 187 (3 self)
We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.
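The vertical bitmap idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the MAFIA implementation: it omits the depth-first search, the pruning, and the bitmap compression, and the function names `build_vertical_bitmaps` and `support` are hypothetical. It only assumes that each item is stored as a bitmap over transaction IDs, so that the support of an itemset is the number of set bits in the AND of its items' bitmaps.

```python
from typing import Dict, Iterable, List


def build_vertical_bitmaps(transactions: List[Iterable[str]]) -> Dict[str, int]:
    """Map each item to an int whose bit t is set when transaction t contains it."""
    bitmaps: Dict[str, int] = {}
    for tid, transaction in enumerate(transactions):
        for item in set(transaction):
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset: Iterable[str], bitmaps: Dict[str, int], n_transactions: int) -> int:
    """Count the transactions that contain every item of the itemset."""
    combined = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        combined &= bitmaps.get(item, 0)  # unknown items occur in no transaction
    return bin(combined).count("1")


if __name__ == "__main__":
    # Illustrative toy data, not taken from the paper.
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
    maps = build_vertical_bitmaps(data)
    print(support({"a", "b"}, maps, len(data)))
```

With this layout, extending an itemset by one item costs a single bitwise AND over the bitmaps, which is one reason a vertical representation suits depth-first traversal of the itemset lattice.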
Clustering data streams: Theory and practice
- IEEE TKDE
, 2003
Cited by 75 (2 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms: Clustering, data streams, approximation algorithms.
Distributed streams algorithms for sliding windows
- In Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA)
, 2002
Cited by 48 (10 self)
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. Our results include: 1. For a single stream, we present the first ε-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst case time and space. We also present the first ε-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words. 2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space.
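To make the problem statement concrete, here is a minimal sketch of the exact baseline that the approximation schemes above improve on: counting the 1's among the N most recent bits of a single stream. It is not the paper's algorithm. It stores the whole window and therefore needs O(N) memory, whereas the abstract reports ε-approximations that use only logarithmic memory words. The class name `ExactWindowCounter` is hypothetical.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1's over the last `window_size` bits, using O(N) memory."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window = deque(maxlen=window_size)  # holds the most recent bits
        self.ones = 0  # running count of 1's currently inside the window

    def add(self, bit: int) -> None:
        if bit not in (0, 1):
            raise ValueError("bit must be 0 or 1")
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]  # the oldest bit is about to be evicted
        self.window.append(bit)
        self.ones += bit

    def count(self) -> int:
        return self.ones


if __name__ == "__main__":
    counter = ExactWindowCounter(window_size=3)
    for b in [1, 1, 0, 1, 0]:
        counter.add(b)
        print(counter.count())
```

Each update takes constant time, but the memory grows linearly with the window size, which is the cost the approximate single-stream schemes avoid.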
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
- In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga
, 2002
Cited by 21 (2 self)
Data mining has in recent years emerged as an interesting area in the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns occurring in the data. We discuss briefly some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 14 (5 self)
Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the presence of more items being added into the database. Because of the mostly append-only nature of updating time-series data, incremental mining would be very effective and efficient. Several algorithms for incremental mining of partial periodic patterns in time-series databases are proposed and are analyzed empirically. The new algorithms allow for online adaptation of the thresholds in order to produce interactive mining of partial periodic patterns. The storage overhead of the incremental online mining algorithms is analyzed. Results show that the storage overhead for storing the intermediate data structures pays off as the incremental online mining of partial periodic patterns proves to be significantly more efficient than the nonincremental, non-online versions. Moreover, a new problem, termed merge mining, is introduced as a generalization of incremental mining. Merge mining can be defined as merging the discovered patterns of two or more databases that are mined independently of each other. An algorithm for merge mining of partial periodic patterns in time-series databases is proposed and analyzed.
Conquering the divide: Continuous clustering of distributed data streams
- In Intl. Conf. on Data Engineering
, 2007
Cited by 14 (3 self)
Data is often collected over a distributed network, but in many cases, is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost, still providing guaranteed accuracy of the clustering. We focus on the k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
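For readers unfamiliar with k-center clustering, the following is a minimal sketch of one well-known centralized heuristic, the greedy farthest-point traversal, which gives a 2-approximation of the optimal radius. The abstract does not say which centralized algorithms the paper builds on, so this choice is an assumption made purely for illustration, and nothing here is distributed or continuous. The names `greedy_k_center` and `clustering_radius` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def greedy_k_center(points: Sequence[Point], k: int) -> List[Point]:
    """Pick up to k centers by repeatedly taking the point farthest from all centers."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # arbitrary first center
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=dist.__getitem__)
        if dist[idx] == 0:
            break  # every point already coincides with a center
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers


def clustering_radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """The k-center objective: the largest distance from a point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    # Illustrative toy data, not taken from the paper.
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    chosen = greedy_k_center(pts, k=2)
    print(chosen, clustering_radius(pts, chosen))
```

The objective minimized is the radius computed by `clustering_radius`, which is the quantity an accuracy guarantee for k-center refers to.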
Monitoring the Evolution of Web Usage Patterns
- Lecture Notes in Computer Science
, 2004
Cited by 9 (1 self)
With the ongoing shift from off-line to on-line business processes, the Web has become an important business platform, and for most companies it is crucial to have an on-line presence which can be used to gather information about their products and/or services. However, in many cases there is a difference between the intended and the effective usage of a web site and, presently, many web site operators analyze the usage of their sites to improve their usability. But especially in the context of the Internet, content and structure change rather quickly, and the way a web site is used may change often, either due to changing information needs of its visitors, or due to an evolving user group. Therefore, the discovered usage patterns need to be updated continuously to always reflect the current state. In this article, we introduce PAM, an automated Pattern Monitor, which can be used to observe changes to the behavior of a web site's visitors. It is based on a temporal representation of rules in which both the content
Using multiple windows to track concept drift
- In Intelligent Data Analysis Journal, Vol
, 2003
Cited by 7 (2 self)
In this paper we present a multiple window incremental learning algorithm that distinguishes between virtual concept drift and real concept drift. The algorithm is unsupervised and uses a novel approach to tracking concept drift that involves the use of competing windows to interpret the data. Unlike previous methods which use a single window to determine the drift in the data, our algorithm uses three windows of different sizes to estimate the change in the data. The advantage of this approach is that it allows the system to progressively adapt and predict the change thus enabling it to deal more effectively with different types of drift. We give a detailed description of the algorithm and present the results obtained from its application to two real world problems: computing the background image and sound recognition. We also compare its performance with FLORA, an existing concept drift tracking algorithm.
Efficiently mining approximate models of associations in evolving databases
- In Proc. of the 6th Int'l Conf. on Principles and Practices of Data Mining and Knowledge Discovery in Databases
, 2002
Cited by 6 (3 self)
Much of the existing work in machine learning and data mining has relied on devising efficient techniques to build accurate models from the data. Research on how the accuracy of a model changes as a function of dynamic updates to the databases is very limited. In this work we show that extracting this information (knowing which aspects of the model are changing, and how they are changing as a function of data updates) can be very effective for interactive data mining purposes, where response time is often more important than model quality as long as model quality is not too far off the best (exact) model. In this paper we consider the problem of generating approximate models within the context of association mining, a key data mining task. We propose a new approach to incrementally generate approximate models of associations in evolving databases. Our approach is able to detect how patterns evolve over time (an interesting result in its own right), and uses this information in generating approximate models with high accuracy at a fraction of the cost (of generating the exact model). Extensive experimental evaluation on real databases demonstrates the effectiveness and advantages of the proposed approach.

