Results 1  10
of
88
Data Mining: An Overview from Database Perspective
 IEEE Transactions on Knowledge and Data Engineering
, 1996
"... Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have sh ..."
Abstract

Cited by 386 (25 self)
 Add to MetaCart
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Optimal MultiStep kNearest Neighbor Search
, 1998
"... For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithm ..."
Abstract

Cited by 166 (19 self)
 Add to MetaCart
For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithms that are directly based on indexes work well for simple mediumdimensional similarity distance functions, they do not meet the efficiency requirements of complex highdimensional and adaptable distance functions. The use of a multistep query processing strategy is recommended in these cases, and our investigations substantiate that the number of candidates which are produced in the filter step and exactly evaluated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the stateoftheart algorithm for knearest neighbor search [Korn et al. 1996], we present a novel multistep algorithm which is guaranteed to produce the minim...
Efficient data mining for path traversal patterns
 IEEE Transactions on Knowledge and Data Engineering
, 1998
"... Abstract—In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed informationproviding environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we der ..."
Abstract

Cited by 154 (12 self)
 Add to MetaCart
Abstract—In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed informationproviding environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns¦i.e., large reference sequences¦from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted. Index Terms—Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.
Distancebased indexing for highdimensional metric spaces
 In Proc. ACM SIGMOD International Conference on Management of Data
, 1997
"... In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance based index structures are propos ..."
Abstract

Cited by 114 (3 self)
 Add to MetaCart
In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance based index structures are proposed for applications where the data domain is high dimensional, or the distance function used to compute distances between data objects is nonEuclidean. In this paper, we introduce a distance based index structure called multivantage point (mvp) tree for similarity queries on highdimensional metric spaces. The mvptree uses more than one vantage point to partition the space into spherical cuts at each level. It also utilizes the precomputed (at construction time) distances between the data points and the vantage points. We have done experiments to compare mvptrees with vptrees which have a similar partitioning strategy, but use only one vantage point at each level, and do not make use of the precomputed distances. Empirical studies show that mvptree outperforms the vptree 20 % to 80 % for varying query ranges and different distance distributions. 1.
A Probabilistic Approach to Fast Pattern Matching in Time Series Databases
 Proceedings of the 3 rd International Conference of Knowledge Discovery and Data Mining
, 1997
"... The problem of efficiently and accurately locating patterns of interest in massive time series data sets is an important and nontrivial problem in a wide variety of applications, including diagnosis and monitoring of complex systems, biomedical data analysis, and exploratory data analysis in scient ..."
Abstract

Cited by 100 (15 self)
 Add to MetaCart
The problem of efficiently and accurately locating patterns of interest in massive time series data sets is an important and nontrivial problem in a wide variety of applications, including diagnosis and monitoring of complex systems, biomedical data analysis, and exploratory data analysis in scientific and business time series. In this paper a probabilistic approach is taken to this problem. Using piecewise linear segmentations as the underlying representation, local features (such as peaks, troughs, and plateaus) are defined using a prior distribution on expected deformations from a basic template. Global shape information is represented using another prior on the relative locations of the individual features. An appropriately defined probabilistic model integrates the local and global information and directly leads to an overall distance measure between sequence patterns based on prior knowledge. A search algorithm using this distance measure is shown to efficiently and accurately f...
On Similarity Queries for TimeSeries Data: Constraint Specification and Implementation
, 1995
"... Constraints are a natural mechanism for the specification of similarity queries on timeseries data. However, to realize the expressive power of constraint programming in this context, one must provide the matching implementation technology for efficient indexing of very large data sets. In this pap ..."
Abstract

Cited by 98 (4 self)
 Add to MetaCart
Constraints are a natural mechanism for the specification of similarity queries on timeseries data. However, to realize the expressive power of constraint programming in this context, one must provide the matching implementation technology for efficient indexing of very large data sets. In this paper, we formalize the intuitive notions of exact and approximate similarity between timeseries patterns and data. Our definition of similarity extends the distance metric used in [2, 7] with invariance under a group of transformations. Our main observation is that the resulting, more expressive, set of constraint queries can be supported by a new indexing technique, which preserves all the desirable properties of the indexing scheme proposed in [2, 7].
Properties of embedding methods for similarity searching in metric spaces
 PAMI
, 2003
"... Complex data types—such as images, documents, DNA sequences, etc.—are becoming increasingly important in modern database applications. A typical query in many of these applications seeks to find objects that are similar to some target object, where (dis)similarity is defined by some distance functi ..."
Abstract

Cited by 79 (4 self)
 Add to MetaCart
Complex data types—such as images, documents, DNA sequences, etc.—are becoming increasingly important in modern database applications. A typical query in many of these applications seeks to find objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept at a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space so that the distances of the embedded objects approximates the actual distances. Thus, queries can be performed (for the most part) on the embedded objects. In this paper, we are especially interested in examining the issue of whether or not the embedding methods will ensure that no relevant objects are left out (i.e., there are no false dismissals and, hence, the correct result is reported). Particular attention is paid to the SparseMap, FastMap, and MetricMap embedding methods. SparseMap is a variant of Lipschitz embeddings, while FastMap and MetricMap are inspired by dimension reduction methods for Euclidean spaces (using KLT or the related PCA and SVD). We show that, in general, none of these embedding methods guarantee that queries on the embedded objects have no false dismissals, while also demonstrating the limited cases in which the guarantee does hold. Moreover, we describe a variant of SparseMap that allows queries with no false dismissals. In addition, we show that with FastMap and MetricMap, the distances of the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap and MetricMap to guarantee no false dismissals.
A Survey of Temporal Knowledge Discovery Paradigms and Methods
 IEEE Transactions on Knowledge and Data Engineering
, 2002
"... AbstractÐWith the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and impl ..."
Abstract

Cited by 78 (7 self)
 Add to MetaCart
AbstractÐWith the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit timevarying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining. Index TermsÐTemporal data mining, time sequence mining, trend analysis, temporal rules, semantics of mined rules. 1
Indexing Large Metric Spaces for Similarity Search Queries
, 1999
"... In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance based index structures are propos ..."
Abstract

Cited by 66 (0 self)
 Add to MetaCart
In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance based index structures are proposed for applications where the distance computations between objects of the data domain are expensive (such as high dimensional data), and the distance function used is metric. In this paper, we consider using distancebased index structures for similarity queries on large metric spaces. We elaborate on the approach of using reference points (vantage points) to partition the data space into spherical shelllike regions in a hierarchical manner. We introduce the multivantage point tree structure (mvptree) that uses more than one vantage points to partition the space into spherical cuts at each level. In answering similarity based queries, the mvptree also utilizes the precomputed (at con...
Efficient UserAdaptable Similarity Search in Large Multimedia Databases
 IN PROCEEDINGS OF THE INT. CONF. ON VERY LARGE DATA BASES
, 1997
"... Efficient useradaptable similarity search more and more increases in its importance for multimedia and spatial database systems. As a general similarity model for multidimensional vectors that is adaptable to application requirements and user preferences, we use quadratic form distance functions w ..."
Abstract

Cited by 61 (4 self)
 Add to MetaCart
Efficient useradaptable similarity search more and more increases in its importance for multimedia and spatial database systems. As a general similarity model for multidimensional vectors that is adaptable to application requirements and user preferences, we use quadratic form distance functions which have been successfully applied to color histograms in image databases [Fal+ 94]. The components a ij of the matrix A denote similarity of the components i and j of the vectors. Beyond the Euclidean distance which produces spherical query ranges, the similarity distance defines a new query type, the ellipsoid query. We present new algorithms to efficiently support ellipsoid query processing for various userdefined similarity matrices on existing precomputed indexes. By adapting techniques for reducing the dimensionality and employing a multistep query processing architecture, the method is extended to highdimensional data spaces. In particular, from our algorithm to reduce the simila...