Optimizing Queries over Multimedia Repositories
, 1996
Cited by 88 (9 self)
Multimedia repositories and applications that retrieve multimedia information are becoming increasingly popular. In this paper, we study the problem of selecting objects from multimedia repositories, and show how this problem relates to the processing and optimization of selection queries in other contexts, e.g., when some of the selection conditions are expensive user-defined predicates. We find that the problem has unique characteristics that lead to interesting new research questions and results. This article presents an overview of the results in [1]. An expanded version of that paper is in preparation [2]. 1 Query Model In this section we first describe the model that we use for querying multimedia repositories. Then, we briefly review related models for querying text and image repositories. 1.1 Our Query Model In our model, a multimedia repository consists of a set of multimedia objects, each with a distinct object identity. Each multimedia object has a set of attributes, like...
Optimizing top-k selection queries over multimedia repositories
, 2003
Cited by 42 (2 self)
Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A query on these attributes will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). Furthermore, unlike in the relational model, users may just want the k top-ranked objects for their selection queries, for a relatively small k. In addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. In this paper, we investigate how to optimize the processing of top-k selection queries over multimedia repositories. The access characteristics of the repositories and the above query model lead to novel issues in query optimization. In particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we present an efficient algorithm that solves the problem optimally with respect to our cost model and execution space when the predicates in the query are independent. We also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. Thus,
On a model of indexability and its bounds for range queries
 Journal of the ACM
, 2002
Cited by 24 (1 self)
We develop a theoretical framework to characterize the hardness of indexing data sets on block-access memory devices like hard disks. We define an indexing workload by a data set and a set of potential queries. For a workload, we can construct an indexing scheme, which is a collection of fixed-size subsets of the data. We identify two measures of efficiency for an indexing scheme on a workload: storage redundancy, r (how many times each item in the data set is stored), and access overhead, A (how many times more blocks than necessary a query retrieves).
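The two measures can be made concrete on a toy workload. The following is a minimal sketch, not the paper's formalism: it assumes redundancy is averaged over items, that access overhead is the worst case over queries of blocks read divided by the ideal ceil(|Q| / B), and that a brute-force search over block combinations is acceptable for tiny inputs. The function names are hypothetical.

```python
from itertools import combinations
from math import ceil


def storage_redundancy(blocks, data):
    """Average number of blocks in which each data item is stored (assumed definition)."""
    return sum(len(b) for b in blocks) / len(data)


def min_cover_size(blocks, query):
    """Smallest number of blocks whose union contains the query (brute force)."""
    for k in range(1, len(blocks) + 1):
        for combo in combinations(blocks, k):
            if query <= set().union(*combo):
                return k
    raise ValueError("query is not covered by the indexing scheme")


def access_overhead(blocks, queries, block_size):
    """Worst-case ratio of blocks read to the ideal ceil(|Q| / B)."""
    return max(
        min_cover_size(blocks, q) / ceil(len(q) / block_size) for q in queries
    )


# Toy workload: items 0..7, block size 2, queries are all ranges {i, i+1}.
data = set(range(8))
queries = [frozenset({i, i + 1}) for i in range(7)]
aligned = [frozenset({i, i + 1}) for i in range(0, 8, 2)]
shifted = aligned + [frozenset({i, i + 1}) for i in range(1, 7, 2)]

for name, scheme in (("aligned", aligned), ("aligned+shifted", shifted)):
    print(name, storage_redundancy(scheme, data), access_overhead(scheme, queries, 2))
```

On this toy workload the aligned scheme stores every item once but needs two blocks for a query such as {1, 2}, while adding the shifted blocks raises redundancy to 1.75 and brings the access overhead down to 1, which is the tradeoff the two measures are meant to expose.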
A Lower Bound Theorem for Indexing Schemes and its Application to Multidimensional Range Queries
 In Proceedings of the ACM Symposium on Principles of Database Systems
, 1998
Cited by 15 (1 self)
Indexing schemes were proposed by Hellerstein, Koutsoupias and Papadimitriou [7] to model data indexing on external memory. Using indexing schemes, the complexity of indexing is quantified by two parameters: storage redundancy and access overhead. There is a tradeoff between these two parameters, in the sense that for some problems it is not possible for both of these to be low. In this paper we derive a lower-bounds theorem for arbitrary indexing schemes. We apply our theorem to the particular problem of d-dimensional range queries. We first resolve the open problem of [7] for a tight lower bound for 2-dimensional range queries and extend our lower bound to d-dimensional range queries. We then show how the construction in our lower-bounds proof may be exploited to derive indexing schemes for d-dimensional range queries whose asymptotic complexity matches our lower bounds.
Nearest Neighbor Search in Multidimensional Spaces
, 1999
Cited by 10 (0 self)
The Nearest Neighbor Search problem is defined as follows: given a set P of n points, preprocess the points so as to efficiently answer queries that require finding the closest point in P to a query point q. If we are willing to settle for a point that is almost as close as the nearest neighbor, then we can relax the problem to the approximate Nearest Neighbor Search. Nearest Neighbor Search (exact or approximate) is an integral component in a wide range of applications that include multimedia databases, computational biology, data mining, and information retrieval. The common thread in all these applications is similarity search: given a database of objects, we want to return the object in the database that is most similar to a query object. The objects are mapped onto points in a high-dimensional metric space, and similarity search reduces to a nearest neighbor search. The dimension of the underlying space may be on the order of a few hundred or a few thousand; therefore, we r...
Fractal Dimension for Data Mining
Cited by 1 (0 self)
In this project, we introduce the concept of intrinsic "fractal" dimension of a data set and show how this can be used to aid in several data mining tasks. We are interested in answering questions about the performance of a method and also in comparing methods quickly. In particular, we discuss two specific problems: dimensionality reduction and vector quantization. In each of these problems, we show how the performance of a method is related to the fractal dimension of the data set. Using real and synthetic data sets, we validate these relationships and show how we can use this for faster evaluation and comparison of the methods.
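One common way to estimate an intrinsic dimension is box counting: count the occupied grid cells at several resolutions and take the slope of log(count) against log(cells per side). The sketch below assumes that estimator, which the abstract does not name, and assumes points already scaled to the unit cube; the function names and the choice of grid levels are illustrative only.

```python
from math import floor, log


def box_count(points, cells_per_side):
    """Number of grid cells on the unit cube that contain at least one point."""
    return len({
        tuple(min(floor(x * cells_per_side), cells_per_side - 1) for x in p)
        for p in points
    })


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def box_counting_dimension(points, levels=range(1, 6)):
    """Slope of log(occupied cells) versus log(cells per side) over dyadic grids."""
    xs = [log(2 ** k) for k in levels]
    ys = [log(box_count(points, 2 ** k)) for k in levels]
    return least_squares_slope(xs, ys)


# A diagonal line embedded in 2-D versus a filled square.
line = [(i / 1024, i / 1024) for i in range(1024)]
square = [(i / 32, j / 32) for i in range(32) for j in range(32)]
print(box_counting_dimension(line), box_counting_dimension(square))
```

The line lives in a two-dimensional embedding space but its estimate is close to 1, while the filled square comes out close to 2; that gap between embedding dimension and intrinsic dimension is the quantity the abstract relates to method performance.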
Fractal Dimension and Vector Quantization [Extended Abstract]
, 2003
Krishna Kumaraswamy, Center for Automated Learning and Discovery, Carnegie Mellon University, skkumar@cs.cmu.edu
Vasileios Megalooikonomou, Department of Computer and Information Sciences, Temple University, vasilis@cis.temple.edu
Is there a way to determine the performance of a Vector Quantization method? Is there a fast method for determining the performance of a Vector Quantization method? In this paper, we show the relationship between the concepts of Vector Quantization and intrinsic "Fractal" Dimension. We derive a formula predicting the error rate of a Vector Quantization method, given the fractal dimension of the data set. We show that our result holds on synthetic as well as on real data sets. We also discuss how our result can be put to better, faster use in several data mining tasks.
ABSTRACT Flip Korn
Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges of model fitting at streaming speeds are both technical (how to continually find fast and reliable parameter estimates on high-speed streams of skewed data using small space) and conceptual (how to validate the goodness-of-fit and stability of the model online). In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast and space-efficient, and that provide approximations of parameter value estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot to perform online model validation at streaming speeds. As a concrete application of our techniques, we focus on network traffic data, which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T’s Gigascope® data stream management system, to demonstrate the practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.
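The idea of fitting a power-law model with a least-squares straight line can be illustrated offline. For a Pareto distribution, P(X > x) = (x_min / x)^alpha, so log P(X > x) is linear in log x with slope -alpha. The sketch below is a batch illustration under that assumption only; it keeps every sample in memory and is not the sketch-based streaming algorithm the abstract describes. The function names are hypothetical.

```python
from math import log


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pareto_tail_index(samples):
    """Estimate alpha from the slope of the empirical log-log tail distribution.

    Samples must be positive. The i-th smallest value (0-indexed) is paired
    with the fraction (n - i) / n of samples that are at least as large.
    """
    ordered = sorted(samples)
    n = len(ordered)
    log_x = [log(x) for x in ordered]
    log_tail = [log((n - i) / n) for i in range(n)]
    slope, _ = fit_line(log_x, log_tail)
    return -slope


# Deterministic Pareto quantiles with alpha = 1.5 and x_min = 2.0.
alpha, x_min, n = 1.5, 2.0, 1000
samples = [x_min * ((n - i) / n) ** (-1 / alpha) for i in range(n)]
print(pareto_tail_index(samples))
```

Because the demo samples are exact Pareto quantiles, the fitted slope recovers alpha almost exactly; on real, noisy data the same regression gives only an approximate estimate, which is why the abstract pairs the fit with online goodness-of-fit checks.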
Large-scale Automated Forecasting using Fractals
, 2002
Forecasting has attracted a lot of research interest, with very successful methods for periodic time sequences. Here, we propose a fast, automated method to do nonlinear forecasting, for both periodic as well as chaotic time sequences. We use the technique of delay coordinate embedding, which needs several parameters; our contribution is the automated way of setting these parameters, using the concept of ‘intrinsic dimensionality’. Our operational system has fast and scalable algorithms for preprocessing and, using R-trees, also has fast methods for forecasting. The result of this work is a black box which, given a time series as input, finds the best parameter settings, and generates a prediction system. Tests on real and synthetic