Results 1-10 of 147
Mining Sequential Patterns: Generalizations and Performance Improvements
 Research Report RJ 9994, IBM Almaden Research
, 1995
Cited by 748 (5 self)
Abstract. The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is "5% of customers bought 'Foundation' and 'Ringworld' in one transaction, followed by 'Second Foundation' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
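The support and maximum-gap definitions above can be illustrated with a minimal Python sketch. This is not the paper's GSP algorithm (which generates and counts candidates level-wise); it only checks containment of one pattern, with a backtracking search so that repeated items do not cause false misses:

```python
# Hedged sketch: support counting for sequential patterns with an
# optional maximum time gap between adjacent pattern elements.
# A data-sequence is a time-sorted list of (transaction-time, itemset)
# pairs; a pattern is a list of itemsets that must occur in order.

def contains(seq, pattern, max_gap=None, start=0, prev_time=None):
    """True if seq contains pattern, with adjacent matched elements
    at most max_gap time units apart (backtracking search)."""
    if not pattern:
        return True
    for j in range(start, len(seq)):
        time, items = seq[j]
        gap_ok = (max_gap is None or prev_time is None
                  or time - prev_time <= max_gap)
        if pattern[0] <= items and gap_ok:
            if contains(seq, pattern[1:], max_gap, j + 1, time):
                return True
    return False

def support(database, pattern, max_gap=None):
    """Number of data-sequences that contain the pattern."""
    return sum(contains(s, pattern, max_gap) for s in database)

db = [
    [(1, {"Foundation", "Ringworld"}), (5, {"Second Foundation"})],
    [(1, {"Foundation"}), (9, {"Second Foundation"})],
]
pat = [{"Foundation"}, {"Second Foundation"}]
print(support(db, pat))             # prints 2
print(support(db, pat, max_gap=5))  # prints 1: the 8-unit gap is rejected
```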
Fast subsequence matching in time-series databases
 PROCEEDINGS OF THE 1994 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA
, 1994
Cited by 528 (24 self)
We present an efficient indexing method to locate one-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into subtrails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.
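The sliding-window pipeline can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's method: features here are the magnitudes of the first k DFT coefficients, and subtrails are formed by fixed-size grouping rather than a cost-based split heuristic:

```python
import cmath

def dft_features(window, k=2):
    """Magnitudes of the first k DFT coefficients of one window."""
    n = len(window)
    return [abs(sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n)) / n)
            for f in range(k)]

def feature_trail(series, w, k=2):
    """Slide a window of length w over the series; each position maps
    to a point in k-dimensional feature space, forming a trail."""
    return [dft_features(series[i:i + w], k)
            for i in range(len(series) - w + 1)]

def mbrs(trail, group=4):
    """Cover consecutive subtrails with their Minimum Bounding
    Rectangles, returned as (lower corner, upper corner) pairs."""
    boxes = []
    for i in range(0, len(trail), group):
        sub = trail[i:i + group]
        lo = [min(p[d] for p in sub) for d in range(len(sub[0]))]
        hi = [max(p[d] for p in sub) for d in range(len(sub[0]))]
        boxes.append((lo, hi))
    return boxes

series = [float(x % 7) for x in range(20)]   # toy "stock price" series
boxes = mbrs(feature_trail(series, w=8), group=4)
```

The resulting boxes, rather than the raw windows, are what would be inserted into a spatial index.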
FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets
, 1995
Cited by 497 (23 self)
A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [25]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs), to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [8]); the nearest-neighbor or best-match query, etc. However, designing feature-extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient ret...
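The heart of the algorithm admits a compact sketch. Given two "pivot" objects a and b, the cosine law projects an object o onto the line through them at x_o = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 d(a,b)); later axes use residual distances with the already-assigned coordinates subtracted out. The pivot heuristic below (one refinement pass toward a far-apart pair) follows that idea, but details are simplified relative to the paper:

```python
def fastmap(objects, dist, k):
    """Map objects to k-dimensional points so that the given pairwise
    distances are approximately preserved (a sketch of FastMap)."""
    coords = {o: [] for o in objects}

    def residual(x, y):
        # distance remaining after the axes chosen so far
        d2 = dist(x, y) ** 2 - sum((cx - cy) ** 2
                                   for cx, cy in zip(coords[x], coords[y]))
        return max(d2, 0.0) ** 0.5

    for _ in range(k):
        # pivot heuristic: one refinement pass toward a far-apart pair
        a = objects[0]
        b = max(objects, key=lambda o: residual(a, o))
        a = max(objects, key=lambda o: residual(b, o))
        dab = residual(a, b)
        if dab == 0.0:                    # all residual distances vanish
            xs = {o: 0.0 for o in objects}
        else:
            xs = {o: (residual(a, o) ** 2 + dab ** 2
                      - residual(b, o) ** 2) / (2 * dab)
                  for o in objects}
        for o in objects:
            coords[o].append(xs[o])
    return coords

pts = [0.0, 3.0, 7.0]                     # objects living on a line
coords = fastmap(pts, lambda x, y: abs(x - y), k=1)
```

For objects that already lie on a line, one axis reproduces the distances exactly; in general the mapping is only approximate.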
Sampling Large Databases for Association Rules
, 1996
Cited by 465 (4 self)
Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
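The sample-and-verify idea can be sketched as follows. This is a simplified illustration, not the paper's algorithms: it mines the sample at a lowered threshold to reduce false negatives, then counts only the surviving candidates in one pass over the full database, and it omits the check the paper uses to detect the rare cases that need a second pass:

```python
import random
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """All itemsets up to max_size whose relative support meets the
    threshold (naive enumeration; fine for a sketch)."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            counts.update(combinations(sorted(set(t)), size))
    return {s for s, c in counts.items()
            if c >= min_support * len(transactions)}

def sample_then_verify(db, min_support, sample_frac, slack=0.8, seed=0):
    rng = random.Random(seed)
    sample = [t for t in db if rng.random() < sample_frac]
    # 1) mine the (small, memory-resident) sample at a lowered threshold
    candidates = frequent_itemsets(sample, slack * min_support)
    # 2) verify only those candidates in a single pass over the full db
    counts = Counter()
    for t in db:
        ts = set(t)
        counts.update(c for c in candidates if set(c) <= ts)
    return {c for c in candidates if counts[c] >= min_support * len(db)}

db = [("a", "b"), ("a", "b", "c"), ("a",), ("b",)] * 10
found = sample_then_verify(db, min_support=0.5, sample_frac=0.3)
```

With `sample_frac=1.0` the "sample" is the whole database and the result is exact by construction, which makes the sketch easy to check.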
Mining Quantitative Association Rules in Large Relational Tables
, 1996
Cited by 438 (3 self)
We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be "10% of married people between age 50 and 60 have at least 2 cars". We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a "greater-than-expected-value" interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.

1 Introduction

Data mining, also known as knowledge discovery in databases, has been recognized as a new area for database research. The problem of discove...
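A small sketch of the partitioning step: quantitative values are discretized into base intervals, and ranges of adjacent intervals then act as ordinary items, which is one simple way to realize "combining adjacent partitions". The cut points and the (attr, lo, hi) encoding are illustrative choices, not the paper's exact formulation:

```python
def quantize(value, cuts):
    """Index of the base interval containing value (cuts are ascending
    boundaries; n cuts define n + 1 base intervals)."""
    return sum(value >= c for c in cuts)

def range_items(attr, value, cuts):
    """All (attr, lo, hi) items -- ranges of adjacent base intervals --
    that contain the value, so rules may use merged partitions."""
    k = quantize(value, cuts)
    n = len(cuts) + 1
    return [(attr, lo, hi) for lo in range(k + 1) for hi in range(k, n)]

age_cuts = [50, 60]                 # base intervals: <50, [50,60), >=60
print(quantize(55, age_cuts))       # prints 1
print(range_items("age", 55, age_cuts))
```

A transaction's quantitative attributes are expanded into such range-items, after which any ordinary association-rule miner applies.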
Parallel Mining of Association Rules
 IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 319 (3 self)
We consider the problem of mining association rules on a shared-nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information. The best algorithm exhibits near-perfect scale-up behavior, yet requires only minimal overhead compared to the current best serial algorithm.
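One point on the trade-off spectrum is simple to sketch: each processor counts every candidate over its local data partition, and the per-candidate counts are then summed in a single exchange per pass. The sketch below simulates this sequentially (a hedged illustration only; the actual algorithms also address memory use and candidate partitioning):

```python
from collections import Counter

def local_counts(partition, candidates):
    """What one node computes over its own transactions."""
    return Counter({c: sum(set(c) <= set(t) for t in partition)
                    for c in candidates})

def global_counts(partitions, candidates):
    """Sum the per-node counts: one exchange per pass."""
    total = Counter()
    for part in partitions:
        total.update(local_counts(part, candidates))
    return total

parts = [[("a", "b"), ("a",)], [("b",), ("a", "b")]]
cands = [("a",), ("b",), ("a", "b")]
print(global_counts(parts, cands)[("a", "b")])   # prints 2
```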
Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension
, 1995
Cited by 126 (18 self)
We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier [FK94a], real point sets: (a) violate consistently the "uniformity" and "independence" assumptions, (b) can often be described as "fractals", with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so-called "Correlation Dimension" D2 is the one that we need to predict the selectivity of spatial join. The main contribution is that, for all the real and synthetic point sets we tried, the average number of neighbors for a given point of the point set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (K_shape) for ea...
Beyond streams and graphs: Dynamic tensor analysis
 In KDD
, 2006
Cited by 111 (16 self)
How do we find patterns in author-keyword associations, evolving over time? Or in DataCubes, with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, and rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example. We propose to envision such higher-order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space-efficient (it does not need to store the past) and (c) fully automatic, with no need for user-defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA. We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100 GB over 1 month) and one from DBLP (200 MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets.
Pruning and Grouping Discovered Association Rules
, 1995
Cited by 102 (5 self)
Association rules are statements of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set X, then it has 1 also in the columns in set Y". Efficient methods exist for discovering association rules from large collections of data. The number of discovered rules can, however, be so large that the rules cannot be presented to the user. We show how the set of rules can be pruned by forming rule covers. A rule cover is a subset of the original set of rules such that for each row in the relation there is an applicable rule in the cover if and only if there is an applicable rule in the original set. We also discuss grouping of association rules by clustering, and present some experimental results of both pruning and grouping.

Keywords: data mining, association rules, covers, clustering.

1 Introduction

Association rules are an interesting class of database regularities, introduced by Agrawal, Imielinski, and Swami [AIS93]. An association rule is an expres...
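The cover definition lends itself to a greedy sketch. Here a rule (X, Y) is taken to be "applicable" to a row when the row contains all columns in X; that reading, and the greedy selection, are illustrative assumptions rather than the paper's exact construction:

```python
def applicable_rows(rule, rows):
    """Indices of rows the rule applies to (here: antecedent X matches)."""
    X, _Y = rule
    return {i for i, row in enumerate(rows) if X <= row}

def rule_cover(rules, rows):
    """Greedily pick a subset of rules covering exactly the rows that
    some rule in the original set covers."""
    target = set()
    for r in rules:
        target |= applicable_rows(r, rows)
    cover, covered, remaining = [], set(), list(rules)
    while covered != target:
        best = max(remaining,
                   key=lambda r: len(applicable_rows(r, rows) - covered))
        cover.append(best)
        covered |= applicable_rows(best, rows)
        remaining.remove(best)
    return cover

rows = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
rules = [({"a"}, {"b"}), ({"a"}, {"c"}), ({"b"}, {"c"})]
print(len(rule_cover(rules, rows)))   # prints 2: one redundant rule pruned
```

The cover keeps every row reachable while dropping rules whose applicability is subsumed by the kept ones.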
Security and Privacy Implications of Data Mining
 In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery
, 1996
Cited by 94 (3 self)
Data mining enables us to discover information we do not expect to find in databases. This can be a security/privacy issue: if we make information available, are we perhaps giving out more than we bargained for? This position paper discusses possible problems and solutions, and outlines ideas for further research in this area.

1 Introduction

Database technology provides a number of advantages. Data mining is one of these; using automated tools to analyze corporate data can help find ways to increase the efficiency of an organization. Another advantage of database technology is information sharing (including sharing with other organizations). For example, publicly accessible corporate telephone books can decrease the need for telephone operators (offloading this task to the caller...). Sharing need not be completely public: making inventory information available to suppliers can help a retail operation to avoid shortages, and can lower the supplier's cost (thus allowing the retailer to n...