A New Framework For Itemset Generation
- In: PODS 98, Symposium on Principles of Database Systems, 1998
- Cited by 53 (3 self)
The problem of finding association rules in a large database of sales transactions has been widely studied in the literature. We discuss some of the weaknesses of the large itemset method for association rule generation. We propose a different method for evaluating and finding itemsets, which we refer to as strongly collective itemsets. The concepts of "support" of an itemset and correlation of the items within an itemset are related, though not quite the same. This criterion stresses the importance of the actual correlation of the items with one another rather than the absolute support. Previously proposed methods to provide correlated itemsets are not necessarily applicable to very large databases. We present an algorithm that achieves very good computational efficiency while maintaining statistical robustness. Because this algorithm relies on relative measures rather than absolute measures such as support, the method can also be applied to find association rules in datasets in which items may appear in a sizeable percentage of the transactions (dense datasets), in datasets in which the items have varying density, or even to find negative association rules.
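The difference between absolute support and a relative correlation measure can be sketched in a few lines of Python. This is only an illustration: it uses lift (observed support divided by the support expected if the items were independent) as a stand-in relative measure, which is an assumption and not the collective-strength criterion the paper defines. The toy transactions and the function names `support` and `lift` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def lift(transactions, itemset):
    """Observed support divided by the support expected under independence.

    Assumption: lift is used here only as a simple relative measure.
    """
    expected = 1.0
    for item in itemset:
        expected *= support(transactions, [item])
    return support(transactions, itemset) / expected


# Hypothetical toy data: bread and milk are frequent, caviar and vodka are rare.
transactions = [frozenset(t) for t in [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "caviar", "vodka"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
    {"caviar", "vodka"},
]]

print(support(transactions, {"bread", "milk"}))          # 0.625
print(round(lift(transactions, {"bread", "milk"}), 2))   # 1.11
print(support(transactions, {"caviar", "vodka"}))        # 0.25
print(lift(transactions, {"caviar", "vodka"}))           # 4.0
```

In this toy data the frequent pair has high support but is barely more common than independence would predict, while the rare pair has low support but a strong relative association. A pure support threshold would rank the two pairs the other way around.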
Constructing Knowledge From Multivariate Spatiotemporal Data: Integrating Geographic Visualization (GVis) with Knowledge Discovery in Database (KDD) Methods
- International Journal of Geographical Information Science, 1999
- Cited by 49 (15 self)
In this paper, we develop an approach to the process of constructing knowledge through structured exploration of large spatiotemporal data sets. We begin by introducing our problem context and defining both Geographic Visualization (GVis) and Knowledge Discovery in Databases (KDD), the source domains for methods being integrated. Next, we review and compare recent GVis and KDD developments and consider the potential for their integration, emphasizing that an iterative process with user interaction is a central focus for uncovering interesting and meaningful patterns through each. We then introduce an approach to design of an integrated GVis-KDD environment directed to exploration and discovery in the context of spatiotemporal environmental data. The approach emphasizes a matching of GVis and KDD meta-operations. Following description of the GVis and KDD methods that are linked in our prototype system, we present a demonstration of the prototype applied to a typical spatiotemporal datas...
Online Generation of Association Rules
- IBM Research Division, T.J. Watson Research, 1998
- Cited by 47 (4 self)
We have a large database consisting of sales transactions. We investigate the problem of online mining of association rules in this large database. We show how to preprocess the data effectively in order to make it suitable for repeated online queries. The preprocessing algorithm takes into account the storage space available. We store the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output. This results in an online algorithm which is practically instantaneous in terms of response time. The algorithm also supports techniques for quickly discovering association rules from large itemsets. The algorithm is capable of finding rules with specific items in the antecedent or consequent. These association rules are presented in a compact form, eliminating redundancy. We believe that the elimination of redundancy in online generation of association rules from large itemsets is interesting in its own right.
Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results
- IEEE Bulletin of the Technical Committee on Data Engineering, 1998
- Cited by 45 (14 self)
Clustering of data in a high dimensional space is of great interest in many data mining applications. In this paper, we propose a method for clustering of data in a high dimensional space based on a hypergraph model. In this method, the relationships present in the original data in high dimensional space are mapped into a hypergraph. A hyperedge represents a relationship (affinity) among subsets of data, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on two different data sets: S&P500 stock data for the period of 1994-1996 and protein coding data. These experiments demonstrate that our approach is applicable and effective in high dimensional datasets. 1 Introduction Clustering in data mining is a disco...
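The objective described in this abstract, the total weight of the hyperedges cut by a partitioning, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the hypergraph, its weights, and the names `cut_weight`, `good`, and `bad` are hypothetical, and the code only evaluates a given partition. It does not implement the partitioning algorithm used in the paper.

```python
def cut_weight(hyperedges, assignment):
    """Sum the weights of hyperedges whose vertices fall in more than one part.

    hyperedges: list of (set_of_vertices, weight) pairs
    assignment: dict mapping each vertex to a partition id
    """
    total = 0.0
    for vertices, weight in hyperedges:
        parts = {assignment[v] for v in vertices}
        if len(parts) > 1:  # the hyperedge spans several partitions, so it is cut
            total += weight
    return total


# Hypothetical hypergraph: two strongly related groups joined by a weak hyperedge.
hyperedges = [
    ({"A", "B", "C"}, 3.0),
    ({"C", "D"}, 0.5),
    ({"D", "E", "F"}, 2.0),
]

good = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
bad = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1, "F": 1}

print(cut_weight(hyperedges, good))  # 0.5
print(cut_weight(hyperedges, bad))   # 3.0
```

The partition that keeps each strongly weighted hyperedge intact cuts only the weak one, which is the behaviour a cut-minimizing partitioner would favour.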
Mining Association Rules: Anti-Skew Algorithms
- In 14th Intl. Conf. on Data Engineering, 1998
- Cited by 40 (2 self)
Mining association rules among items in a large database has been recognized as one of the most important data mining problems. All proposed approaches for this problem require scanning the entire database at least or almost twice in the worst case. In this paper we propose several techniques which overcome the problem of data skew in the basket data. These techniques reduce the maximum number of scans to less than 2, and in most cases find all association rules in about 1 scan. Our algorithms employ prior knowledge collected during the mining process and/or via sampling, to further reduce the number of candidate itemsets and identify false candidate itemsets at an earlier stage.
Data mining for measuring and improving the success of web sites
- Data Mining and Knowledge Discovery, 2001
- Cited by 39 (2 self)
For many companies, competitiveness in e-commerce requires a successful presence on the web. Web sites are used to establish the company’s image, to promote and sell goods and to provide customer support. The success of a web site affects and reflects directly the success of the company in the electronic market. In this study, we propose a methodology to improve the “success” of web sites, based on the exploitation of navigation pattern discovery. In particular, we present a theory, in which success is modelled on the basis of the navigation behaviour of the site’s users. We then exploit WUM, a navigation pattern discovery miner, to study how the success of a site is reflected in the users’ behaviour. With WUM we measure the success of a site’s components and obtain concrete indications of how the site should be improved. We report on our first experiments with an online catalog, the success of which we have studied. Our mining analysis has shown very promising results, on the basis of which the site is currently undergoing concrete improvements.
Data Mining Library Reuse Patterns using Generalized Association Rules
- 2000
- Cited by 37 (2 self)
In this paper, we show how data mining can be used to discover library reuse patterns in existing applications. Specifically, we consider the problem of discovering library classes and member functions that are typically reused in combination by application classes. This paper improves upon our earlier research using "association rules" [8] by taking into account the inheritance hierarchy using "generalized association rules". This turns out to be a non-trivial but worthwhile endeavor.
A Microeconomic View of Data Mining
- 1998
- Cited by 36 (2 self)
We present a rigorous framework, based on optimization, for evaluating data mining operations such as associations and clustering, in terms of their utility in decision-making. This framework leads quickly to some interesting computational problems related to sensitivity analysis, segmentation and the theory of games. Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu. Supported in part by an Alfred P. Sloan Research Fellowship and by NSF Faculty Early Career Development Award CCR-9701399. Computer Science Division, Soda Hall, UC Berkeley, CA 94720. christos@cs.berkeley.edu. IBM Almaden Research Center, 650 Harry Road, San Jose CA 95120. pragh@almaden.ibm.com. 1 Introduction Data mining is about extracting interesting patterns from raw data. There is some agreement in the literature on what qualifies as a "pattern" (association rules and correlations [1, 2, 3, 5, 6, 12, 20, 21] as well as clustering of the data points [9], are ...
Feature Subset Selection by Bayesian networks: a comparison with genetic and sequential algorithms
- Cited by 35 (13 self)
In this paper we compare FSS-EBNA, a randomized, population-based and evolutionary algorithm, with two genetic and two other sequential search approaches on the well-known Feature Subset Selection (FSS) problem. In FSS-EBNA, the FSS problem, stated as a search problem, uses the EBNA (Estimation of Bayesian Network Algorithm) search engine, an algorithm within the EDA (Estimation of Distribution Algorithm) approach. The EDA paradigm grew out of the GA community in order to explicitly discover the relationships among the features of the problem and not disrupt them by genetic recombination operators. The EDA paradigm avoids the use of recombination operators, and it guarantees the evolution of the population of solutions and the discovery of these relationships by the factorization of the probability distribution of the best individuals in each generation of the search. In EBNA, this factorization is carried out by a Bayesian network induced by a chea...
An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases
- In ICDE, 2001
- Cited by 35 (2 self)
This paper discusses effective processing of similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods for processing similarity search that supports time warping fail to employ multi-dimensional indexes without false dismissal, since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database and thus suffer from serious performance degradation in large databases. Another method that employs the suffix tree, which does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to improve search performance in large databases without permitting any false dismissal. To attain this goal, we devise a new distance function D_tw-lb that consistently unde...
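The claim that a time warping distance does not satisfy the triangle inequality can be checked with a small Python sketch. The code below assumes a common dynamic-programming formulation with absolute differences as the per-element cost and summation along the warping path; the paper's exact definition may differ, and the function name `dtw` and the three sequences are hypothetical.

```python
def dtw(a, b):
    """Time warping distance between two non-empty numeric sequences.

    Assumption: element cost is |x - y| and path cost is the sum of element costs.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] holds the best cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


x = [0]
y = [0, 2]
z = [2, 2, 2]

print(dtw(x, y))  # 2.0
print(dtw(y, z))  # 2.0
print(dtw(x, z))  # 6.0, larger than dtw(x, y) + dtw(y, z) = 4.0
```

Because going through `y` is cheaper than the direct distance from `x` to `z`, pruning rules that rely on the triangle inequality cannot be applied to this distance without risking false dismissals.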

