Mining Sequential Patterns: Generalizations and Performance Improvements
- Research Report RJ 9994, IBM Almaden Research
, 1995
Cited by 446 (3 self)
Abstract. The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is "5% of customers bought 'Foundation' and 'Ringworld' in one transaction, followed by 'Second Foundation' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
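The support definition in the abstract above can be illustrated with a minimal sketch. It covers only the basic problem (no time constraints, sliding window, or taxonomy) and is not GSP itself; the names `contains` and `support_count`, and the extra item 'Dune', are hypothetical choices made for this example.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def contains(data_sequence: List[Transaction], pattern: List[Transaction]) -> bool:
    """Return True if every pattern element is a subset of a distinct
    transaction of the data-sequence, with the order preserved."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest transaction that holds this element.
        while pos < len(data_sequence) and not element <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1  # the next element must come from a later transaction
    return True


def support_count(database: List[List[Transaction]], pattern: List[Transaction]) -> int:
    """Number of data-sequences that contain the pattern."""
    return sum(1 for seq in database if contains(seq, pattern))


# Hypothetical toy database: one data-sequence per customer.
database = [
    [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Foundation"}), frozenset({"Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Second Foundation"}), frozenset({"Foundation", "Ringworld"})],
    [frozenset({"Foundation", "Ringworld", "Dune"}), frozenset({"Dune"}),
     frozenset({"Second Foundation", "Dune"})],
]
pattern = [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})]
count = support_count(database, pattern)
```

Here the second customer does not count, because 'Foundation' and 'Ringworld' were bought in different transactions, and the third does not count because the order is reversed.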
Fast Subsequence Matching in Time-Series Databases
- SIGMOD 94
, 1994
Cited by 372 (18 self)
We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.
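The sliding-window idea in the abstract above can be sketched roughly as follows. Two simplifications are assumptions of this example and not the paper's method: the features are the window mean and range (the abstract does not say which features are extracted), and the trail is cut into fixed-size sub-trails, whereas the paper proposes its own division algorithm. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
MBR = Tuple[Point, Point]  # (lower corner, upper corner)


def window_features(series: List[float], w: int) -> List[Point]:
    """Slide a window of length w over the series; each window becomes one
    point (mean, range) in a 2-dimensional feature space. The list of points
    is the trail. The feature choice is an assumption of this sketch."""
    trail = []
    for start in range(len(series) - w + 1):
        window = series[start:start + w]
        trail.append((sum(window) / w, max(window) - min(window)))
    return trail


def trail_to_mbrs(trail: List[Point], points_per_mbr: int) -> List[MBR]:
    """Cut the trail into consecutive fixed-size sub-trails and return the
    minimum bounding rectangle of each one."""
    mbrs = []
    for i in range(0, len(trail), points_per_mbr):
        chunk = trail[i:i + points_per_mbr]
        lows = (min(p[0] for p in chunk), min(p[1] for p in chunk))
        highs = (max(p[0] for p in chunk), max(p[1] for p in chunk))
        mbrs.append((lows, highs))
    return mbrs


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trail = window_features(series, w=3)
rectangles = trail_to_mbrs(trail, points_per_mbr=2)
```

The resulting rectangles are what a spatial access method such as an R*-tree would index; the index itself is omitted here.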
Sampling Large Databases for Association Rules
, 1996
Cited by 329 (4 self)
Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
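A minimal sketch of the sample-then-verify idea from the abstract above, restricted to the frequent-itemset step that underlies association rules. Several things are assumptions of this example: the brute-force candidate enumeration, the `lowered_support` parameter, verification against the whole database, and all function names. Detecting and recovering missed itemsets in a second pass is not shown.

```python
import random
from itertools import combinations
from typing import FrozenSet, List, Set

Itemset = FrozenSet[str]


def support(itemset: Itemset, transactions: List[Itemset]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions: List[Itemset], min_support: float,
                      max_size: int = 3) -> Set[Itemset]:
    """Brute-force enumeration, adequate only for tiny illustrative inputs."""
    items = sorted(set().union(*transactions))
    result = set()
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if support(candidate, transactions) >= min_support:
                result.add(candidate)
    return result


def sample_then_verify(database: List[Itemset], min_support: float,
                       lowered_support: float, sample_size: int,
                       seed: int = 0) -> Set[Itemset]:
    """Mine candidates on a random sample with a lowered threshold, then keep
    only those that really reach min_support in the full database."""
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)
    candidates = frequent_itemsets(sample, lowered_support)
    return {c for c in candidates if support(c, database) >= min_support}


database = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"eggs"}),
]
exact = frequent_itemsets(database, min_support=0.5)
verified = sample_then_verify(database, min_support=0.5,
                              lowered_support=0.4, sample_size=4)
```

Because every candidate is checked against the full data, the output never contains a false positive; it can only miss itemsets that the sample failed to suggest, which is the rare case the abstract handles with a second pass.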
Mining Quantitative Association Rules in Large Relational Tables
, 1996
Cited by 305 (2 self)
We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be "10% of married people between age 50 and 60 have at least 2 cars". We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a "greater-than-expected-value" interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset. 1 Introduction Data mining, also known as knowledge discovery in databases, has been recognized as a new area for database research. The problem of discove...
Parallel Mining of Association Rules
- IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 203 (3 self)
We consider the problem of mining association rules on a shared-nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information. The best algorithm exhibits near perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm.
Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension
, 1995
Cited by 112 (15 self)
We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier [FK94a], real point sets: (a) violate consistently the "uniformity" and "independence" assumptions, (b) can often be described as "fractals", with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so-called "Correlation Dimension" D2 is the one that we need to predict the selectivity of spatial join. The main contribution is that, for all the real and synthetic point-sets we tried, the average number of neighbors for a given point of the point-set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (K `shape 0 ) for ea...
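The power law described in the abstract above suggests a simple way to estimate the correlation dimension: count neighbor pairs within several radii and fit the slope on a log-log scale. The sketch below assumes Euclidean distance, a quadratic-time pair count, and an ordinary least-squares fit; these choices and the function names are assumptions of this example, not details taken from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pair_count(points: List[Point], radius: float) -> int:
    """Number of ordered pairs of distinct points within the given radius.
    Dividing by len(points) gives the average number of neighbors."""
    count = 0
    for i, (x1, y1) in enumerate(points):
        for j, (x2, y2) in enumerate(points):
            if i != j and math.hypot(x1 - x2, y1 - y2) <= radius:
                count += 1
    return count


def correlation_dimension(points: List[Point], radii: List[float]) -> float:
    """Least-squares slope of log(pair count) against log(radius).
    Every radius must capture at least one pair, or log(0) is undefined."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Points on a diagonal line segment: a 1-dimensional set embedded in 2-d space.
n = 200
line_points = [(i / n, i / n) for i in range(n)]
d2_line = correlation_dimension(line_points, radii=[0.05, 0.1, 0.2])
```

For the line segment the estimate comes out close to 1, below the embedding dimension of 2, which is the kind of non-uniformity the abstract says real point sets exhibit.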
Pruning and Grouping Discovered Association Rules
, 1995
Cited by 70 (4 self)
Association rules are statements of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set X, then it has 1 also in the columns in set Y". Efficient methods exist for discovering association rules from large collections of data. The number of discovered rules can, however, be so large that the rules cannot be presented to the user. We show how the set of rules can be pruned by forming rule covers. A rule cover is a subset of the original set of rules such that for each row in the relation there is an applicable rule in the cover if and only if there is an applicable rule in the original set. We also discuss grouping of association rules by clustering, and present some experimental results of both pruning and grouping. Keywords: data mining, association rules, covers, clustering. 1 Introduction Association rules are an interesting class of database regularities, introduced by Agrawal, Imielinski, and Swami [AIS93]. An association rule is an expres...
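The rule-cover definition in the abstract above can be made concrete with a small greedy sketch. Two assumptions are made here that the abstract does not confirm: a rule is treated as applicable to a row when the row contains the rule's antecedent X, and the cover is built with the standard greedy set-cover heuristic. Rule names and data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def matched_rows(antecedent: FrozenSet[str], rows: List[FrozenSet[str]]) -> Set[int]:
    """Indices of rows to which a rule applies (assumption: the row contains
    every column of the rule's antecedent)."""
    return {i for i, row in enumerate(rows) if antecedent <= row}


def greedy_rule_cover(rules: Dict[str, FrozenSet[str]],
                      rows: List[FrozenSet[str]]) -> List[str]:
    """Pick rules until every row matched by some original rule is also
    matched by a chosen rule. Greedy choice: most still-uncovered rows."""
    matched = {name: matched_rows(x, rows) for name, x in rules.items()}
    uncovered = set().union(*matched.values())
    cover = []
    while uncovered:
        best = max(matched, key=lambda name: len(matched[name] & uncovered))
        cover.append(best)
        uncovered -= matched[best]
    return cover


rows = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"d"}),
]
# Rule name -> antecedent X (consequents are omitted in this sketch).
rules = {
    "r1": frozenset({"a"}),
    "r2": frozenset({"a", "b"}),
    "r3": frozenset({"c"}),
    "r4": frozenset({"b"}),
}
cover = greedy_rule_cover(rules, rows)
```

The last row is matched by no rule in the original set, so it stays unmatched by the cover as well, which is consistent with the "if and only if" in the definition.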
Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison
, 1995
Cited by 61 (0 self)
The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of IO operations, reducing IO cost has been the primary target of the algorithms presented in the literature. One of the most recent proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces IO cost to a constant amount by processing one datab...
Learning Action Strategies for Planning Domains
- ARTIFICIAL INTELLIGENCE
, 1997
Cited by 58 (2 self)
This paper reports on experiments where techniques of supervised machine learning are applied to the problem of planning. The input to the learning algorithm is composed of a description of a planning domain, planning problems in this domain, and solutions for them. The output is an efficient algorithm --- a strategy --- for solving problems in that domain. We test the strategy on an independent set of planning problems from the same domain, so that success is measured by its ability to solve complete problems. A system, L2Act, has been developed in order to perform these experiments. We have experimented with the blocks world domain, and the logistics domain, using strategies in the form of a generalization of decision lists, where the rules on the list are existentially quantified first order expressions. The learning algorithm is a variant of Rivest's [39] algorithm, improved with several techniques that reduce its time complexity. As the experiments demonstrate, generalization is a...
Security and Privacy Implications of Data Mining
- In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery
, 1996
Cited by 56 (1 self)
Data mining enables us to discover information we do not expect to find in databases. This can be a security/privacy issue: If we make information available, are we perhaps giving out more than we bargained for? This position paper discusses possible problems and solutions, and outlines ideas for further research in this area. 1 Introduction Database technology provides a number of advantages. Data mining is one of these; using automated tools to analyze corporate data can help find ways to increase efficiency of an organization. Another advantage of database technology is information sharing (including sharing with other organizations). For example, publicly accessible corporate telephone books can decrease the need for telephone operators (offloading this task to the caller...) Sharing need not be completely public - making inventory information available to suppliers can help a retail operation to avoid shortages, and can lower the supplier's cost (thus allowing the retailer to n...

