Multi-Dimensional Sequential Pattern Mining
2001
Cited by 27 (4 self)
With our recently developed sequential pattern mining algorithms, such as PrefixSpan, it is possible to mine sequential user-access patterns from Web logs. While this information is very useful when redesigning web sites for easier perusal and fewer network traffic bottlenecks, it would be much richer if we could incorporate multiple dimensions of information. For example, if you knew the referral site that users frequently come from, you might be able to determine what information on your own web site is of interest to them, and enhance or separate this information as needed. Similarly, if you knew the weekday and time at which certain access patterns frequently occur, you could ensure that updated information is ready and available for these users. This thesis proposes and explores two different techniques, HYBRID and PSFP, to incorporate additional dimensions of information into the process of mining sequential patterns. It investigates the strengths and limitations of each approach. The HYBRID method first finds frequent dimension value combinations, and then mines sequential patterns from the set of sequences that satisfy each of these combinations. PSFP approaches the problem from the opposite direction. It mines the sequential patterns for the whole dataset only once (using PrefixSpan), and mines the corresponding frequent dimension patterns alongside each sequential pattern (using the existing association mining algorithm FP-growth). Experiments show that HYBRID is most effective at low support in datasets that are sparse with respect to dimension value combinations but dense with respect to the sequential patterns present. PSFP is the better alternative in every other case, including datasets that are dense with respect to both dimension value combinations and sequential ...
Cache-conscious frequent pattern mining on a modern processor
In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005
Cited by 22 (6 self)
In this paper, we examine the performance of frequent pattern mining algorithms on a modern processor. A detailed performance study reveals that even the best frequent pattern mining implementations, with highly efficient memory managers, still grossly under-utilize a modern processor. The primary performance bottlenecks are poor data locality and low instruction level parallelism (ILP). We propose a cache-conscious prefix tree to address this problem. The resulting tree improves spatial locality and also enhances the benefits from hardware cache line prefetching. Furthermore, the design of this data structure allows the use of a novel tiling strategy to improve temporal locality. The result is an overall speedup of up to 3.2 when compared with state-of-the-art implementations. We then show how these algorithms can be improved further by realizing a non-naive thread-based decomposition that targets simultaneously multi-threaded processors. A key aspect of this decomposition is to ensure cache re-use between threads that are co-scheduled at a fine granularity. This optimization affords an additional speedup of 50%, resulting in an overall speedup of up to 4.8. To
Mining periodic patterns with gap requirement from sequences
In SIGMOD, 2005
Cited by 13 (0 self)
We study the problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem, and study their characteristics. We also present a case study in which we apply our algorithms to some DNA sequences. We discuss some interesting patterns obtained from the case study.
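The gap requirement described in this abstract can be illustrated with a small counting routine. This is only a minimal sketch, not the paper's mining algorithm: the function name `count_gapped_occurrences` and the parameters `min_gap` and `max_gap` are hypothetical, and it assumes that the gap requirement means the number of skipped characters between consecutive matched characters must lie in a fixed range.

```python
def count_gapped_occurrences(seq, pattern, min_gap, max_gap):
    """Count the ways `pattern` matches `seq` when the number of skipped
    characters between consecutive matched positions lies in
    [min_gap, max_gap]. (Hypothetical helper, for illustration only.)"""
    if not pattern:
        return 0
    # ways[i] = number of matches of the pattern prefix seen so far
    # whose last matched character sits at position i of seq.
    ways = [1 if ch == pattern[0] else 0 for ch in seq]
    for symbol in pattern[1:]:
        nxt = [0] * len(seq)
        for i, ch in enumerate(seq):
            if ch != symbol:
                continue
            # The previous matched position j must satisfy
            # min_gap <= i - j - 1 <= max_gap.
            lo = max(0, i - max_gap - 1)
            hi = i - min_gap - 1
            if hi >= lo:
                nxt[i] = sum(ways[lo:hi + 1])
        ways = nxt
    return sum(ways)


# "AC" in "ACACAC" with 0 to 2 skipped characters between A and C.
matches = count_gapped_occurrences("ACACAC", "AC", 0, 2)
```

Here `matches` is 5. Dividing such a count by the number of candidate position tuples would give a frequency that could be compared against a threshold, but the exact normalization used in the paper is not stated in the abstract.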
InfoMiner+: Mining Partial Periodic Patterns with Gap Penalties
In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM’02), 2002
Cited by 11 (1 self)
In this paper, we focus on mining periodic patterns allowing some degree of imperfection in the form of random replacement from a perfect periodic pattern. Information gain was proposed to identify patterns with events of vastly different occurrence frequencies and adjust for the deviation from a pattern. However, it imposes no penalty when there are gaps between the pattern occurrences. In many applications, e.g., bioinformatics, it is important to identify subsequences in which a pattern repeats perfectly (or near perfectly). As a solution, we extend the information gain measure to include a penalty for gaps between pattern occurrences. We call this measure generalized information gain. Furthermore, we want to find a subsequence S such that, for a pattern P, the generalized information gain of P in S is high. This is particularly useful in locating repeats in DNA sequences. In this paper, we develop an effective mining algorithm, InfoMiner+, to simultaneously mine significant patterns and the associated subsequences.
Using Convolution to Mine Obscure Periodic Patterns In One Pass
Proceedings of the 9th International Conference on Extending Database Technology (EDBT’04), 2004
Cited by 11 (0 self)
The mining of periodic patterns in time series databases is an interesting data mining problem that can be envisioned as a tool for forecasting and predicting the future behavior of time series data. Existing periodic pattern mining algorithms either assume that the periodic rate (or simply the period) is user-specified, or try to detect potential values for the period in a separate phase. The former assumption is a considerable disadvantage, especially in time series databases where the period is not known a priori. The latter approach results in a multi-pass algorithm, which in turn is to be avoided in online environments (e.g., data streams). In this paper, we develop an algorithm that mines periodic patterns in time series databases with unknown or obscure periods such that discovering the period is part of the mining process. Based on
On-Line Analytical Mining of Association Rules
1998
Cited by 10 (0 self)
With wide applications of computers and automated data collection tools, massive amounts of data have been continuously collected and stored in databases, which creates an imminent need and great opportunities for mining interesting knowledge from data. Association rule mining is a data mining technique that discovers strong association or correlation relationships among data. The discovered rules may help market basket or cross-sales analysis, decision making, and business management. In this thesis, we propose and develop an interesting association rule mining approach, called on-line analytical mining of association rules, which integrates the recently developed OLAP (on-line analytical processing) technology with some efficient association mining methods. It leads to flexible, multi-dimensional, multi-level association rule mining with high performance. Several algorithms are developed based on this approach for mining various kinds of associations in multi-dimensional ...
Meta-Patterns: Revealing Hidden Periodic Patterns
IBM Research Report, 2001
Cited by 10 (1 self)
Discovery of periodic patterns in time series data has become an active research area with many applications. These patterns can be hierarchical in nature, where a higher level pattern may consist of repetitions of lower level patterns. Unfortunately, the presence of noise may prevent these higher level patterns from being recognized in the sense that two portions (of a data sequence) that support the same (high level) pattern may have different layouts of occurrences of basic symbols. There may not exist any common representation in terms of raw symbol combinations; hence such a (high level) pattern may not be expressible by any previous model (defined on raw symbols or symbol combinations) and would not be properly recognized by any existing method. In this paper, we propose a novel model, namely meta-pattern, to capture these high level patterns. As a more flexible model, the number of potential meta-patterns could be very large. A substantial difficulty lies in how to identify the proper pattern candidates. However, the well-known Apriori property is not able to provide sufficient pruning power. A new property, namely the component location property, is identified and used to conduct the candidate generation so that an efficient computation-based mining algorithm can be developed. Last but not least, we apply our algorithm to some real and synthetic sequences and some interesting patterns are discovered.
On Effective Classification of Strings with Wavelets
2002
Cited by 10 (3 self)
In recent years, the technological advances in mapping genes have made it increasingly easy to store and use a wide variety of biological data. Such data are usually in the form of very long strings for which it is difficult to determine the most relevant features for a classification task. For example, a typical DNA string may be millions of characters long, and there may be thousands of such strings in a database. In many cases, the classification behavior of the data may be hidden in the compositional behavior of certain segments of the string, which cannot be easily determined a priori. Another problem which complicates the classification task is that in some cases the classification behavior is reflected in the global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of the strings over different data sets, it is useful to develop an approach which is sensitive to both the global and local behavior of the strings for the purpose of classification. For this purpose, we exploit the multi-resolution property of wavelet decomposition in order to create a scheme which can mine classification characteristics at different levels of granularity. The resulting scheme turns out to be very effective in practice on a wide range of problems.
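The multi-resolution idea mentioned in this abstract can be sketched with a plain Haar decomposition. This is an illustrative sketch under stated assumptions, not the paper's classification scheme: the helpers `indicator` and `haar_decompose` are hypothetical names, the string is assumed to be encoded as a 0/1 indicator series for one symbol, and the input length is assumed to be a power of two.

```python
def indicator(seq, symbol):
    """Encode a string as a 0/1 series marking where `symbol` occurs
    (an assumed encoding, chosen only for illustration)."""
    return [1.0 if ch == symbol else 0.0 for ch in seq]


def haar_decompose(values):
    """Return (overall_average, details) for an unnormalized Haar transform.
    details[0] is the coarsest level and details[-1] the finest."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details.insert(0, diffs)  # coarser levels go to the front
        current = averages
    return current[0], details


# Global behavior: all G's sit in the first half of the string.
g_avg, g_details = haar_decompose(indicator("GGGGATAT", "G"))
# Local behavior: A alternates with T in the second half.
a_avg, a_details = haar_decompose(indicator("GGGGATAT", "A"))
```

In this toy string the coarsest coefficient for G is nonzero while its finer levels are all zero, reflecting a global trend, whereas A produces nonzero coefficients at the finest level, reflecting a local pattern. Coefficients like these could serve as candidate features at different granularities.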

