OPTICS: Ordering Points To Identify the Clustering Structure
1999
Cited by 511 (49 self)
Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium-sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.
Spatial Data Mining: A Database Approach
1997
Cited by 112 (4 self)
Knowledge discovery in databases (KDD) is an important task in spatial databases since both the number and the size of such databases are rapidly growing. This paper introduces a set of basic operations which should be supported by a spatial database system (SDBS) to express algorithms for KDD in SDBS. For this purpose, we introduce the concepts of neighborhood graphs and paths and a small set of operations for their manipulation. We argue that these operations are sufficient for KDD algorithms considering spatial neighborhood relations by presenting the implementation of four typical spatial KDD algorithms based on the proposed operations. Furthermore, the efficient support of operations on large neighborhood graphs and on large sets of neighborhood paths by the SDBS is discussed. Neighborhood indices are introduced to materialize selected neighborhood graphs in order to speed up the processing of the proposed operations.
Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data
In SIGMOD, 2001
Cited by 33 (2 self)
The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strate...
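The sort order described in this abstract can be illustrated with a short sketch. It covers only the ordering step (assign each point to a grid cell of side length ε, then compare cells lexicographically) and not the external sorting or scheduling the abstract also mentions. The function names `grid_cell` and `epsilon_grid_sort` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def grid_cell(point: Sequence[float], eps: float) -> Tuple[int, ...]:
    """Return the integer coordinates of the grid cell containing `point`.

    The grid is equidistant with cell length `eps` in every dimension.
    """
    return tuple(math.floor(x / eps) for x in point)


def epsilon_grid_sort(points: List[Point], eps: float) -> List[Point]:
    """Sort points by the lexicographic order of their grid cells.

    Python compares tuples lexicographically, so the cell tuple can be
    used directly as the sort key.
    """
    return sorted(points, key=lambda p: grid_cell(p, eps))


if __name__ == "__main__":
    pts = [(0.9, 0.1), (0.1, 0.9), (0.1, 0.1), (1.2, 0.0)]
    for p in epsilon_grid_sort(pts, eps=0.5):
        print(grid_cell(p, 0.5), p)
```

In such an ordering, points whose cells agree on the leading dimensions end up next to each other in the sorted sequence, which is the property a sort-based join can exploit.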
A cost model and index architecture for the similarity join
In ICDE, 2001
Cited by 32 (8 self)
The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: Fine-grained index structures are beneficial for the CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
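The definition of the similarity join given in this abstract can be written down directly as a naive nested-loop join. This is only a reference sketch of the operation's semantics, not the index-based algorithm the paper proposes. It assumes Euclidean distance, which the abstract does not specify, and the function name `similarity_join` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def similarity_join(
    r: Sequence[Point], s: Sequence[Point], eps: float
) -> List[Tuple[Point, Point]]:
    """Return all pairs (p, q) with p in r, q in s and dist(p, q) <= eps.

    Assumption: the distance is Euclidean (math.dist, Python 3.8+).
    The cost is O(|r| * |s|) distance computations.
    """
    return [(p, q) for p in r for q in s if math.dist(p, q) <= eps]


if __name__ == "__main__":
    r = [(0.0, 0.0), (1.0, 1.0)]
    s = [(0.0, 0.5), (3.0, 3.0)]
    print(similarity_join(r, s, eps=1.0))
```

The quadratic number of distance computations in this baseline is the cost that index-based and sort-based join algorithms aim to reduce.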
Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases
Cited by 25 (6 self)
Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest neighbor queries are the most important queries. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. In this paper, we introduce a generic scheme for such data mining algorithms and we develop a method to transform such algorithms in a way that they can use multiple similarity queries, i.e. sets of queries issued simultaneously. We investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parall...
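The idea of issuing a set of similarity queries simultaneously can be sketched for the sequential-scan case: each database object is read once and tested against every query in the batch, so the scan is shared across queries. This is a minimal illustration of shared I/O only and is not the paper's method, which also addresses CPU cost. Euclidean distance stands in for the metric, and the name `multi_range_query` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def multi_range_query(
    database: Sequence[Point], queries: Sequence[Point], radius: float
) -> List[List[Point]]:
    """Answer several range queries in a single pass over the database.

    results[i] holds every object within `radius` of queries[i].
    Assumption: Euclidean distance is used as the metric.
    """
    results: List[List[Point]] = [[] for _ in queries]
    for obj in database:  # each object is read exactly once
        for i, q in enumerate(queries):
            if math.dist(obj, q) <= radius:
                results[i].append(obj)
    return results


if __name__ == "__main__":
    db = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(multi_range_query(db, [(0.0, 0.0), (5.0, 4.0)], radius=1.0))
```

The number of distance computations is the same as for separate scans; only the number of passes over the data drops from one per query to one per batch.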
High Performance Clustering Based on the Similarity Join
2000
Cited by 9 (0 self)
A broad class of algorithms for knowledge discovery in databases (KDD) relies heavily on similarity queries, i.e. range queries or nearest neighbor queries, in multidimensional feature spaces. Many KDD algorithms perform a similarity query for each point stored in the database. This approach causes serious performance degradation if the considered data set does not fit into main memory. Usual cache strategies such as LRU fail because the locality of KDD algorithms is typically not high enough. In this paper, we propose to replace repeated similarity queries by the similarity join, a database primitive prevalent in multimedia database systems. We present a schema to transform query-intensive KDD algorithms into a representation using the similarity join as a basic operation without affecting the correctness of the result of the considered algorithm. In order to perform a comprehensive experimental evaluation of our approach, we apply the proposed transformation to the clustering algor...
Automatic Extraction of Clusters from Hierarchical
Hierarchical clustering algorithms are typically more effective in detecting the true clustering structure of a data set than partitioning algorithms.
PERFORMANCE OPTIMIZATION OF DATABASE OPERATIONS IN SPATIAL DATABASE SYSTEMS
Knowledge discovery in databases (KDD) is an important task in spatial databases since both the number and the size of such databases are rapidly growing. The automated discovery of knowledge in databases is becoming increasingly important as the world’s wealth of data continues to grow exponentially. The main contribution of this paper is to introduce a set of basic operations which should be supported by a spatial database system (SDBS) to express algorithms for KDD in SDBS. The definition of such a set of basic operations and their efficient support by an SDBS will speed up the development of new spatial KDD algorithms and improve their performance. For this purpose, we introduce the concept of neighborhood graphs and paths and a small set of operations for their manipulation. We show that these operations are sufficient for KDD algorithms considering spatial neighborhood relations by presenting the implementation of typical spatial KDD algorithms based on the proposed operations. A wide variety of algorithms have been proposed for KDD. This involves evaluation of algorithms for optimizing the performance of the KDD operations. These algorithms are classified, and certain generic tasks are identified, such as clustering, classification, dependency analysis and deviation
Automatic Extraction of Clusters from Hierarchical Clustering Representations
Hierarchical clustering algorithms are typically more effective in detecting the true clustering structure of a data set than partitioning algorithms. However, hierarchical clustering algorithms do not actually create clusters, but compute only a hierarchical representation of the data set. This makes them unsuitable as an automatic preprocessing step for other algorithms that operate on detected clusters. This is true for both dendrograms and reachability plots, which have been proposed as hierarchical clustering representations, and which have different advantages and disadvantages. In this paper we first investigate the relation between dendrograms and reachability plots and introduce methods to convert them into each other, showing that they essentially contain the same information. Based on reachability plots, we then introduce a technique that automatically determines the significant clusters in a hierarchical cluster representation. This makes it possible, for the first time, to use hierarchical clustering as an automatic preprocessing step that requires no user interaction to select clusters from a hierarchical cluster representation.
ARC DP120104168, and NSFC61021004.
"... Efficient top-k similarity join processing ..."