An Efficient Approach to Clustering in Large Multimedia Databases with Noise
1998
Cited by 165 (7 self)
Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, are somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). The basic idea of our new approach is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The advantages of our new approach are (1) it has a firm mathematical basis, (2) it has good clustering properties in data sets with large amounts of noise, (3) it allows a compact mathematical ...
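The density model described in this abstract can be illustrated with a minimal sketch. It assumes a Gaussian influence function with a hypothetical width parameter `sigma`, and it climbs the density with a mean-shift style update chosen for brevity; the paper's own influence functions and hill-climbing procedure may differ, and all function names here are illustrative.

```python
import math

def influence(x, p, sigma):
    """Gaussian influence of data point p at location x (assumed kernel)."""
    sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def density(x, points, sigma=1.0):
    """Overall density at x: the sum of the influence functions of all points."""
    return sum(influence(x, p, sigma) for p in points)

def density_attractor(x, points, sigma=1.0, tol=1e-6, max_iter=200):
    """Move x uphill on the density until it stops moving.

    Each step replaces x by the influence-weighted mean of the data points
    (a mean-shift update, which never decreases a Gaussian kernel density).
    """
    x = list(x)
    for _ in range(max_iter):
        weights = [influence(x, p, sigma) for p in points]
        total = sum(weights)
        if total == 0.0:  # x is too far from every point to feel any influence
            break
        new_x = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        ]
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

# Two small groups of points; each start climbs to the attractor of its own group.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
left = density_attractor((0.5, 0.5), points, sigma=0.5)
right = density_attractor((4.5, 4.5), points, sigma=0.5)
```

Points whose climbs end at the same attractor would be assigned to the same cluster; a density threshold for discarding noise is omitted from this sketch.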
Query By Image Example: The Candid Approach
1995
Cited by 81 (1 self)
CANDID (Comparison Algorithm for Navigating Digital Image Databases) was developed to enable content-based retrieval of digital imagery from large databases using a query-by-example methodology. A user provides an example image to the system, and images in the database that are similar to that example are retrieved. The development of CANDID was inspired by the N-gram approach to document fingerprinting, where a "global signature" is computed for every document in a database and these signatures are compared to one another to determine the similarity between any two documents. CANDID computes a global signature for every image in a database, where the signature is derived from various image features such as localized texture, shape, or color information. A distance between probability density functions of feature vectors is then used to compare signatures. In this paper, we present CANDID and highlight two results from our current research: subtracting a "background" signature from ever...
Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering
1999
Cited by 74 (3 self)
Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-...
A Cost Model for Query Processing in High-Dimensional Data Spaces
2000
Cited by 43 (0 self)
During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography or molecular biology. An important research issue in the field of multimedia databases is similarity search in large data sets. Most current approaches addressing similarity search use the so-called feature approach which transforms important properties of the stored objects into points of a high-dimensional space (feature vectors). Thus, the similarity search is transformed into a neighborhood search in the feature space. For the management of the feature vectors, multidimensional index structures are usually applied. The performance of query processing can be substantially improved by opti...
Fast Nearest Neighbor Search in High-dimensional Space
In Proceedings of the 14th International Conference on Data Engineering, 1998
Cited by 41 (0 self)
Similarity search in multimedia databases requires efficient support for nearest-neighbor search on a large set of high-dimensional points as a basic operation for query processing. As recent theoretical results show, state-of-the-art approaches to nearest-neighbor search are not efficient in higher dimensions. In our new approach, we therefore precompute the result of any nearest-neighbor search, which corresponds to a computation of the Voronoi cell of each data point. In a second step, we store the Voronoi cells in an index structure efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e., it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates the high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the...
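The idea of turning nearest-neighbor search into a point query on precomputed cells can be shown in one dimension, where the Voronoi cell of each point is simply the interval between the midpoints to its neighbors. This is only a toy illustration under that assumption: the high-dimensional cell computation and the index structure from the paper are not reproduced, and the function names are hypothetical.

```python
import bisect

def build_cells(points):
    """Precompute 1-D Voronoi cells as sorted sites plus the midpoints between them."""
    sites = sorted(set(points))
    boundaries = [(a + b) / 2 for a, b in zip(sites, sites[1:])]
    return sites, boundaries

def nearest(query, sites, boundaries):
    """Answer a nearest-neighbor query by locating the cell that contains the query."""
    return sites[bisect.bisect_left(boundaries, query)]

sites, boundaries = build_cells([9.0, 1.0, 4.0])
```

After the precomputation, each query costs a single binary search over the cell boundaries instead of a distance computation against every stored point.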
Toward improved ranking metrics
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Cited by 36 (11 self)
In many computer vision algorithms, a metric or similarity measure is used to determine the distance between two features. The Euclidean or SSD (sum of the squared differences) metric is prevalent and justified from a maximum likelihood perspective when the additive noise distribution is Gaussian. Based on real noise distributions measured from international test sets, we have found that the Gaussian noise distribution assumption is often invalid. This implies that other metrics, which have distributions closer to the real noise distribution, should be used. In this paper, we consider three different applications: content-based retrieval in image databases, stereo matching, and motion tracking. In each of them, we experiment with different modeling functions for the noise distribution and compute the accuracy of the methods using the corresponding distance measures. In our experiments, we compared the SSD metric, the SAD (sum of the absolute differences) metric, the Cauchy metric, and the Kullback relative information. For several algorithms from the research literature which used the SSD or SAD, we showed that greater accuracy could be obtained by using the Cauchy metric instead. Index Terms: Maximum likelihood, ranking metrics, content-based retrieval, color indexing, stereo matching, motion tracking.
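The four measures compared in this abstract can be sketched as follows. The Cauchy form used here is an assumption: it is the negative log-likelihood of Cauchy-distributed noise with a hypothetical scale parameter `a`, up to an additive constant, which matches the maximum likelihood reasoning in the abstract but may not be the exact definition used in the paper. The Kullback function assumes both inputs are normalized histograms.

```python
import math

def ssd(x, y):
    """Sum of squared differences (maximum likelihood under Gaussian noise)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sad(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cauchy(x, y, a=1.0):
    """Assumed Cauchy metric: sum of log(1 + ((x_i - y_i) / a)^2)."""
    return sum(math.log(1 + ((xi - yi) / a) ** 2) for xi, yi in zip(x, y))

def kullback(p, q):
    """Kullback relative information between two normalized histograms.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A query and two candidates: one with small errors everywhere, one with a single outlier.
query = [0.0, 0.0, 0.0, 0.0]
small_errors = [1.0, 1.0, 1.0, 1.0]
one_outlier = [0.0, 0.0, 0.0, 3.0]
```

With these inputs, SSD ranks `small_errors` closer to the query, while the Cauchy form ranks `one_outlier` closer, because its logarithm grows slowly for large differences. This is the kind of ranking change that the choice of metric can produce.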
Dynamically optimizing high-dimensional index structures
In Proc. Int. Conf. on Extending Database Technology (EDBT), 2000
Cited by 12 (3 self)
In high-dimensional query processing, the optimization of the logical page-size of index structures is an important research issue. Even very simple query processing techniques such as the sequential scan are able to outperform indexes which are not suitably optimized. Page-size optimization based on a cost model faces the problem that the optimum depends not only on static schema information such as the dimension of the data space but also on dynamically changing parameters such as the number of objects stored in the database and the degree of clustering and correlation in the current data set. Therefore, we propose a method for adapting the page size of an index dynamically during insert processing. Our solution, called DABS-tree, uses a flat directory whose entries consist of an MBR, a pointer to the data page and the size of the data page. Before splitting pages in insert operations, a cost model is consulted to estimate whether the split operation is beneficial. Otherwise, the split is avoided and the logical page-size is adapted instead. A similar rule applies for merging when performing delete operations. We present an algorithm for the management of data pages with varying page-sizes in an index and show that all restructuring operations are locally restricted. We show in our experimental evaluation that the DABS-tree outperforms the X-tree by a factor of up to 4.6 and the sequential scan by a factor of up to 6.6.
On optimizing nearest neighbor queries in high-dimensional data spaces
In Proceedings of the 8th International Conference on Database Theory (ICDT), 2001
Cited by 8 (0 self)
Nearest-neighbor queries in high-dimensional space are of high importance in various applications, especially in content-based indexing of multimedia data. For an optimization of the query processing, accurate models for estimating the query processing costs are needed. In this paper, we propose a new cost model for nearest neighbor queries in high-dimensional space, which we apply to enhance the performance of high-dimensional index structures. The model is based on new insights into effects occurring in high-dimensional space and provides a closed formula for the processing costs of nearest neighbor queries depending on the dimensionality, the block size and the database size. From the wide range of possible applications of our model, we select two interesting samples: First, we use the model to prove the known linear complexity of the nearest neighbor search problem in high-dimensional space, and second, we provide a technique for optimizing the block size. For data of medium dimensionality, the optimized block size allows significant speed-ups of the query processing time when compared to traditional block sizes and to the linear scan.
Efficiency Issues Related to Probability Density Function Comparison
SPIE - Storage and Retrieval for Image and Video Databases, 1996
Cited by 5 (0 self)
The CANDID project (Comparison Algorithm for Navigating Digital Image Databases) employs probability density functions (PDFs) of localized feature information to represent the content of an image for search and retrieval purposes. A similarity measure between PDFs is used to identify database images that are similar to a user-provided query image. Unfortunately, signature comparison involving PDFs is a very time-consuming operation. In this paper, we look into some efficiency considerations when working with PDFs. Since PDFs can take on many forms, we look into tradeoffs between accurate representation and efficiency of manipulation for several data sets. In particular, we typically represent each PDF as a Gaussian mixture (e.g. as a weighted sum of Gaussian kernels) in the feature space. We find that by constraining all Gaussian kernels to have principal axes that are aligned to the natural axes of the feature space, computations involving these PDFs are simplified. We can also constr...
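Why axis-aligned kernels simplify the computation can be seen in a short sketch: when every Gaussian has a diagonal covariance, its density factors into a product of one-dimensional Gaussians, so no matrix inversion or determinant is needed. The representation of a component as a `(weight, mean, variances)` tuple is an assumption made for this illustration, not the CANDID data structure.

```python
import math

def axis_aligned_gaussian(x, mean, variances):
    """Density of a Gaussian whose principal axes follow the feature axes.

    The log-density is a sum of independent per-axis terms.
    """
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        log_p += -0.5 * math.log(2 * math.pi * vi) - (xi - mi) ** 2 / (2 * vi)
    return math.exp(log_p)

def mixture_pdf(x, components):
    """Weighted sum of axis-aligned Gaussian kernels; weights should sum to 1."""
    return sum(w * axis_aligned_gaussian(x, m, v) for w, m, v in components)

# Hypothetical two-component signature in a 2-D feature space.
signature = [
    (0.7, (0.0, 0.0), (1.0, 1.0)),
    (0.3, (3.0, 3.0), (0.5, 2.0)),
]
value_at_origin = mixture_pdf((0.0, 0.0), signature)
```

Evaluating one kernel costs time linear in the number of feature dimensions, compared with the quadratic cost of a full-covariance quadratic form.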
Multilevel Color Histogram Representation of Color Images by Peaks
1999
Cited by 4 (1 self)
This paper proposes the use of a vector of color histogram peaks as an efficient and effective approach to many image indexing problems. It shows that histogram peaks are more stable than general histogram bins when there are variations of scale. We also introduce the structure of a room recognition system which applies this indexing technique to omni-directional images of rooms. Experimental results show that using only peaks leads to significantly lower time and storage demands and still provides recognition rates across a database of hundreds of rooms.
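Extracting peaks from a color histogram can be sketched as below. The abstract does not define a peak, so this sketch assumes a peak is a bin whose count is strictly greater than both neighbors, with a hypothetical `min_count` threshold to ignore tiny bumps; the multilevel representation from the paper is not reproduced.

```python
def histogram_peaks(hist, min_count=0):
    """Return (bin_index, count) for every bin that is a strict local maximum."""
    peaks = []
    for i, count in enumerate(hist):
        # Treat the outside of the histogram as lower than any real count.
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i < len(hist) - 1 else -1
        if count > left and count > right and count >= min_count:
            peaks.append((i, count))
    return peaks

hist = [1, 5, 2, 2, 8, 3, 0, 4]
all_peaks = histogram_peaks(hist)
strong_peaks = histogram_peaks(hist, min_count=5)
```

An image would then be indexed by the short list of peaks rather than by the full histogram, which is where the time and storage savings mentioned in the abstract would come from.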

