Efficient similarity search in sequence databases
1994
Cited by 359 (19 self)
We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the first few Fourier coefficients, we use R-trees to index the sequences and efficiently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
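The two observations in this abstract can be checked on a toy example. The sketch below is only an illustration, not the authors' implementation: it uses a naive unitary DFT, a linear scan in place of the R-tree, and made-up sequences `x` and `y`. The function names (`dft`, `features`, `range_query`) are hypothetical. It shows that the distance is the same in both domains and that the distance over the first few coefficients never exceeds the true distance, so filtering on the truncated features cannot dismiss a true match.

```python
import cmath
import math


def dft(x):
    """Naive unitary DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance; works for real and complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def features(x, k):
    """Keep only the first k Fourier coefficients as the index key."""
    return dft(x)[:k]


def range_query(database, query, eps, k=3):
    """Filter on truncated features, then verify against the full sequences.

    A linear scan stands in for the R-tree here (an assumption of this sketch).
    """
    q_feat = features(query, k)
    candidates = [s for s in database if euclidean(features(s, k), q_feat) <= eps]
    return [s for s in candidates if euclidean(s, query) <= eps]


# Hypothetical toy sequences of equal length.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0]
y = [1.5, 2.5, 2.5, 4.5, 4.0, 4.5, 2.0, 2.5]

d_time = euclidean(x, y)                            # distance in the time domain
d_freq = euclidean(dft(x), dft(y))                  # same distance, by Parseval
d_feat = euclidean(features(x, 3), features(y, 3))  # lower bound on d_time

print(d_time, d_freq, d_feat)
print(range_query([x, y], x, eps=0.1))
```

The filter step may admit false alarms (sequences close in feature space but far apart in full), which is why the second pass re-checks candidates with the true distance.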
Scalable Algorithms for Association Mining
IEEE Transactions on Knowledge and Data Engineering, 2000
Cited by 138 (21 self)
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. In this paper we present efficient algorithms for the discovery of frequent itemsets, which forms the compute intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sub-lattices, which can be solved in memory. Efficient lattice traversal techniques are presented, which quickly identify all the long frequent itemsets, and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining ...
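The two-phase task described here (find frequent itemsets, then form rules) can be illustrated with a small sketch. Several choices below are assumptions made for illustration and are not taken from the abstract: a vertical layout that maps each item to the set of transaction ids containing it, a depth-first traversal that extends itemsets by shared prefix, and the toy market-basket data. The function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    # Vertical layout: item -> set of ids of the transactions that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    result = {}

    def extend(prefix, prefix_tids, candidates):
        # Each prefix defines an independent sub-lattice explored depth-first.
        for i, (item, tids) in enumerate(candidates):
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_support:
                itemset = prefix + (item,)
                result[frozenset(itemset)] = len(new_tids)
                extend(itemset, new_tids, candidates[i + 1:])

    extend((), set(), sorted(tidsets.items(), key=lambda kv: kv[0]))
    return result


def association_rules(freq, min_conf):
    """Form rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    rules = []
    for itemset, support in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = support / freq[lhs]  # every subset of a frequent set is frequent
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

freq = frequent_itemsets(transactions, min_support=3)
found_rules = association_rules(freq, min_conf=0.7)
for lhs, rhs, conf in found_rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

Support counting by set intersection avoids rescanning the transactions for every candidate, which is one reason the database layout matters for this phase.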
Efficient data mining for path traversal patterns
IEEE Transactions on Knowledge and Data Engineering, 1998
Cited by 128 (10 self)
In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling, and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted. Index Terms: Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.
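The first step, turning a raw access log into maximal forward references, can be sketched as follows. The abstract does not spell out the procedure, so this is one plausible reading rather than the paper's algorithm: the current path is kept as a list, a revisit of a page already on the path counts as a backward reference, and a path is emitted whenever a forward run ends. The function name and the toy session are hypothetical.

```python
def maximal_forward_references(session):
    """Split one user's page sequence into maximal forward references."""
    path = []               # pages on the current forward path
    moving_forward = False  # True if the last move extended the path
    result = []
    for page in session:
        if page in path:
            # Backward reference: a forward run just ended, so record it once.
            if moving_forward:
                result.append(list(path))
            # Back up to the revisited page and drop everything after it.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The session may end in the middle of a forward run.
    if moving_forward and path:
        result.append(list(path))
    return result


# Hypothetical toy session: each letter is a page visit, in order.
session = list("ABCDCBEGHGWAOUOV")
refs = ["".join(r) for r in maximal_forward_references(session)]
print(refs)
```

Consecutive backward moves (here `D` to `C` to `B`) emit only one path, because the flag is cleared after the first backward step.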
Outlier detection for high dimensional data
2001
Cited by 128 (0 self)
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
E-Commerce Recommendation Applications
2001
Cited by 110 (0 self)
Recommender systems are being used by an ever-increasing number of E-commerce sites to help consumers find products to purchase. What started as a novelty has turned into a serious business tool. Recommender systems use product knowledge -- either hand-coded knowledge provided by experts or "mined" knowledge learned from the behavior of consumers -- to guide consumers through the often-overwhelming task of locating products they will like. In this article we present an explanation of how recommender systems are related to some traditional database analysis techniques. We examine how recommender systems help E-commerce sites increase sales and analyze the recommender systems at six market-leading sites. Based on these examples, we create a taxonomy of recommender systems, including the inputs required from the consumers, the additional knowledge required from the database, the ways the recommendations are presented to consumers, the technologies used to create the recommendations, and t...
Finding Intensional Knowledge of Distance-Based Outliers
In VLDB, 1999
Cited by 53 (1 self)
Existing studies on outliers focus only on the identification aspect; none provides any intensional knowledge of the outliers---by which we mean a description or an explanation of why an identified outlier is exceptional. For many applications, a description or explanation is at least as vital to the user as the identification aspect. Specifically, intensional knowledge helps the user to: (i) evaluate the validity of the identified outliers, and (ii) improve one's understanding of the data. The two main issues addressed in this paper are: what kinds of intensional knowledge to provide, and how to optimize the computation of such knowledge. With respect to the first issue, we propose finding strongest and weak outliers and their corresponding structural intensional knowledge. With respect to the second issue, we first present a naive and a semi-naive algorithm. Then, by means of what we call path and semi-lattice sharing of I/O processing, we develop two optimized approaches. We provi...
QBISM: A Prototype 3-D Medical Image Database System
IEEE Data Engineering Bulletin, 1993
Cited by 22 (5 self)
this paper. However, these automatic or semi-automatic warping algorithms are extremely important for this application. It is precisely this technology that permits anatomic structure-based access to acquired medical images as well as comparisons among studies, even of different patients, as long as they have been warped to the same atlas. Furthermore, it enables the database to grow, and be queryable, without time-consuming manual segmentation of the data.
Hierarchical classification of real life documents
In Proceedings of the 1st SIAM International Conference on Data Mining, 2001
Cited by 16 (1 self)
Two features have successfully made on-line information comprehensible and accessible to people: hierarchically structured classes where topics are organized into a hierarchy of increasing specificity, and multi-classed documents where a document is classified into all relevant classes. One such information source is Yahoo!
Independence Diagrams: A Technique for Visual Data Mining
1998
Cited by 13 (2 self)
An important issue in data mining is the recognition of complex dependencies between attributes. Past techniques for identifying attribute dependence include correlation coefficients, scatterplots, and equiwidth histograms. These techniques are sensitive to outliers, and often are not sufficiently informative to identify the kind of attribute dependence present. We propose a new approach, which we call independence diagrams. We divide each attribute into ranges; for each pair of attributes, the combination of these ranges defines a two-dimensional grid. For each cell of this grid, we store the number of data items in it. We display the grid, scaling each attribute axis so that the displayed width of a range is proportional to the total number of data items within that range. The brightness of a cell is proportional to the density of data items in it. As a result, both attributes are independently normalized by frequency, ensuring insensitivity to outliers and skew, and ...
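The grid construction described in this abstract can be sketched in a few lines. The details below are assumptions for illustration: range boundaries are passed in by the caller, displayed widths are expressed as fractions of the total item count, and "density" is read as the cell count divided by the cell's displayed area. The function name and the toy data are hypothetical. Under that reading, two independent attributes produce a uniformly bright grid, while dependence shows up as bright and dark cells.

```python
from bisect import bisect_right


def independence_grid(points, x_edges, y_edges):
    """Build the 2-D count grid, per-range display widths, and cell densities.

    x_edges and y_edges are sorted interior boundaries between ranges.
    """
    nx, ny = len(x_edges) + 1, len(y_edges) + 1
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        counts[bisect_right(x_edges, x)][bisect_right(y_edges, y)] += 1

    n = len(points)
    # Displayed width of a range is proportional to the items it holds.
    x_width = [sum(row) / n for row in counts]
    y_width = [sum(counts[i][j] for i in range(nx)) / n for j in range(ny)]

    # Brightness is proportional to count per unit of displayed area.
    density = [
        [
            counts[i][j] / (x_width[i] * y_width[j]) if counts[i][j] else 0.0
            for j in range(ny)
        ]
        for i in range(nx)
    ]
    return counts, x_width, y_width, density


# Independent attributes: every (x, y) combination occurs exactly once.
independent = [(x, y) for x in range(4) for y in range(4)]
ind_counts, ind_xw, ind_yw, ind_density = independence_grid(independent, [2], [1])

# Fully dependent attributes: y always equals x.
dependent = [(i, i) for i in range(4)]
dep_counts, dep_xw, dep_yw, dep_density = independence_grid(dependent, [2], [2])

print(ind_density)  # uniform brightness
print(dep_density)  # bright diagonal, dark off-diagonal
```

Because each axis is scaled by its own marginal counts, a skewed attribute or a few extreme values change the range widths but not the uniformity of the independent case.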
HARMONY: Efficiently mining the best rules for classification
- In Proc. of SDM
, 2005
"... Many studies have shown that rule-based classification algorithms perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the classification rules in a coarsegrained manner. They usually use h ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Many studies have shown that rule-based classification algorithms perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the classification rules in a coarse-grained manner. They usually use heuristic methods to prune the search space, and select the rules based on the sequential database covering paradigm. Thus, the so-mined rules may not be the globally best rules for some instances in the training database. To make matters worse, these algorithms fail to fully exploit some more effective search space pruning methods in order to scale to large databases. In this paper we propose a new classifier, HARMONY, which directly mines the final set of classification rules. HARMONY uses an instance-centric rule-generation approach in the sense that it can assure that, for each training instance, one of the highest-confidence rules covering this instance is included in the result set, which helps a lot in achieving high classification accuracy. By introducing several novel search strategies and pruning methods into the traditional frequent itemset mining framework, HARMONY also has high efficiency and good scalability. Our thorough performance study with some large text and categorical databases has shown that HARMONY outperforms many well-known classifiers in terms of both accuracy and efficiency, and scales well w.r.t. the database size.

