Results 1  10
of
312
Fast Algorithms for Mining Association Rules
, 1994
"... We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known a ..."
Abstract

Cited by 2900 (14 self)
 Add to MetaCart
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scaleup experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scaleup properties with respect to the transaction size and the number of items in the database.
Beyond Market Baskets: Generalizing Association Rules To Dependence Rules
, 1998
"... One of the more wellstudied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B. Motivated partly by the goal of generalizing beyond market bask ..."
Abstract

Cited by 534 (7 self)
 Add to MetaCart
One of the more wellstudied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B. Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chisquared test for independence from classical statistics. This leads to a measure that is upwardclosed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm’s effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
Dynamic Itemset Counting and Implication Rules for Market Basket Data
, 1997
"... We consider the problem of analyzing marketbasket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We in ..."
Abstract

Cited by 521 (6 self)
 Add to MetaCart
We consider the problem of analyzing marketbasket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the lowlevel efficiency of the algorithm. Second, we present a new way of generating "implication rules," which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of cooccurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results. 1 Introduction Within the area of data mining, the problem of deriving associations from data has recently received a great deal of attention. The prob...
Fast Subsequence Matching in TimeSeries Databases
 SIGMOD 94
, 1994
"... We present an efficient indexing method to locate 1dimensional subsequences witbin a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space ..."
Abstract

Cited by 447 (22 self)
 Add to MetaCart
(Show Context)
We present an efficient indexing method to locate 1dimensional subsequences witbin a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*tree [9]. In more deteil, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into subtrails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.
Efficient similarity search in sequence databases
, 1994
"... We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Anot ..."
Abstract

Cited by 443 (21 self)
 Add to MetaCart
We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lowerdimensionality space by using only the first few Fourier coe cients, we use Rtrees to index the sequences and e ciently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (13) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
Data Mining: An Overview from Database Perspective
 IEEE Transactions on Knowledge and Data Engineering
, 1996
"... Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have sh ..."
Abstract

Cited by 434 (26 self)
 Add to MetaCart
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Discovery of MultipleLevel Association Rules from Large Databases
 In Proc. 1995 Int. Conf. Very Large Data Bases
, 1995
"... Previous studies on mining association rules find rules at single concept level, however, mining association rules at multiple concept levels may lead to the discovery of more specific and concrete knowledge from data. In this study, a topdown progressive deepening method is developed for mining mu ..."
Abstract

Cited by 394 (32 self)
 Add to MetaCart
Previous studies on mining association rules find rules at single concept level, however, mining association rules at multiple concept levels may lead to the discovery of more specific and concrete knowledge from data. In this study, a topdown progressive deepening method is developed for mining multiplelevel association rules from large transaction databases by extension of some existing association rule mining techniques. A group of variant algorithms are proposed based on the ways of sharing intermediate results, with the relative performance tested on different kinds of data. Relaxation of the rule conditions for finding "levelcrossing" association rules is also discussed in the paper.
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in TimeSeries Databases
 In VLDB
, 1995
"... We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough nonoverlapping timeordered pairs of subsequences thar are similar. The model allows the amplitude of one of the two sequences to be scaled ..."
Abstract

Cited by 209 (6 self)
 Add to MetaCart
(Show Context)
We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough nonoverlapping timeordered pairs of subsequences thar are similar. The model allows the amplitude of one of the two sequences to be scaled by any suitable amount and its offset adjusted appropriately. Two subsequences are considered similar if one can be enclosed within an envelope of a specified width drawn around the other. The model also allows nonmatching gaps in the matching subsequences. The matching subsequences need not be aligned along the time axis. Given this model of similarity,we present fast search techniques for discovering all similar sequences in a set of sequences. These techniques can also be used to find all (sub)sequences similar to a given sequence. We applied this matching system to the U.S. mutual funds data and discovered interesting matches.
SLIQ: A Fast Scalable Classifier for Data Mining
, 1996
"... . Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most of the classification algorithms are designed only for memoryresident data, thus limiting their suitability for data mining large data sets. This pap ..."
Abstract

Cited by 206 (9 self)
 Add to MetaCart
(Show Context)
. Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most of the classification algorithms are designed only for memoryresident data, thus limiting their suitability for data mining large data sets. This paper discusses issues in building a scalable classifier and presents the design of SLIQ 1 , a new classifier. SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a novel presorting technique in the treegrowth phase. This sorting procedure is integrated with a breadthfirst tree growing strategy to enable classification of diskresident datasets. SLIQ also uses a new treepruning algorithm that is inexpensive, and results in compact and accurate trees. The combination of these techniques enables SLIQ to scale for large data sets and classify data sets irrespective of the number of classes, attributes, and examples (records), thus making it an ...
Toward integrating feature selection algorithms for classification and clustering
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2005
"... This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals ..."
Abstract

Cited by 156 (16 self)
 Add to MetaCart
(Show Context)
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some realworld applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.