Results 1 - 7 of 7
Random Sampling for Histogram Construction: How much is enough?
1998
Cited by 91 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
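To make the idea of a sampled equi-height histogram concrete, here is a minimal Python sketch. It is not the algorithm or the error metric from the paper: the function names (`equi_height_separators`, `sampled_separators`, `bucket_counts`, `max_bucket_error`) and the simple "worst bucket deviation" measure are assumptions made for illustration only. It uses uniform row-level sampling, not the adaptive page sampling the abstract describes.

```python
import bisect
import random


def equi_height_separators(values, num_buckets):
    """Return num_buckets - 1 separators splitting `values` into
    buckets holding roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


def sampled_separators(values, num_buckets, sample_size, seed=0):
    """Approximate the separators from a uniform random sample
    drawn without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return equi_height_separators(sample, num_buckets)


def bucket_counts(values, separators):
    """Count how many values fall into each bucket; a value equal
    to a separator goes into the bucket to its right."""
    counts = [0] * (len(separators) + 1)
    for v in values:
        counts[bisect.bisect_right(separators, v)] += 1
    return counts


def max_bucket_error(values, separators):
    """Illustrative error measure (an assumption, not the paper's metric):
    largest relative deviation of any bucket from the ideal height n / k."""
    counts = bucket_counts(values, separators)
    target = len(values) / len(counts)
    return max(abs(c - target) for c in counts) / target
```

Calling `sampled_separators` with increasing `sample_size` and feeding the result to `max_bucket_error` on the full data gives a rough feel for how the worst bucket's error shrinks as the sample grows.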
Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs
Bioinformatics, 2004
Cited by 40 (0 self)
Biological and engineered networks have recently been shown to display network motifs: a small set of characteristic patterns which occur much more frequently than in randomized networks with the same degree sequence. Network motifs were demonstrated to play key information processing roles in biological regulation networks. Existing algorithms for detecting network motifs act by exhaustively enumerating all subgraphs with a given number of nodes in the network. The runtime of such full enumeration algorithms increases strongly with network size. Here we present a novel algorithm that allows estimation of subgraph concentrations and detection of network motifs at a run time that is asymptotically independent of the network size. This algorithm is based on random sampling of subgraphs. Network motifs are detected with a surprisingly small number of samples in a wide variety of networks. Our method can be applied to estimate the concentrations of larger subgraphs in larger networks than was previously possible with full enumeration algorithms. We present results for high-order motifs in several biological networks and discuss their possible functions. Availability: A software tool for estimating subgraph concentrations and detecting network motifs (mfinder 2.0) and further information is available at:
Online Dynamic Reordering
2000
Cited by 9 (2 self)
We present a mechanism for providing dynamic user control during long running, data-intensive operations. Users can see partial results and dynamically direct the processing to suit their interests, thereby enhancing the interactivity in several contexts such as online aggregation and large-scale spreadsheets. In this paper, we develop a pipelining, dynamically tunable reorder operator for providing user control. Users specify preferences for different data items based on prior results, so that data of interest is prioritized for early processing. The reordering mechanism is efficient and non-blocking, and can be used over arbitrary data streams from files and indexes, as well as continuous data feeds. We also investigate several policies for the reordering based on the performance goals of various typical applications. We present performance results for reordering in the context of an Online Aggregation implementation in Informix, and in the context of sorting and scrolling in...
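The notion of a non-blocking, preference-driven reorder can be sketched in a few lines of Python. This is an assumption-laden illustration, not the operator described in the paper: `reorder`, `preference`, and `buffer_size` are hypothetical names, and the policy (emit the currently most preferred item from a small bounded buffer) is only one simple choice.

```python
from itertools import islice

_END = object()  # sentinel marking an exhausted input stream


def reorder(stream, preference, buffer_size):
    """Yield items from `stream`, favoring items with a higher
    `preference(item)` among those currently buffered.

    `preference` is evaluated again on every step, so a caller can
    change what it returns while the generator is running.
    """
    source = iter(stream)
    buffer = list(islice(source, buffer_size))
    while buffer:
        # Pick the buffered item with the highest current preference;
        # ties go to the item that arrived first.
        best = max(range(len(buffer)), key=lambda i: preference(buffer[i]))
        yield buffer.pop(best)
        # Refill the buffer with one more item, if any remain.
        item = next(source, _END)
        if item is not _END:
            buffer.append(item)
```

Because the generator holds only `buffer_size` items at a time, it produces output as soon as the first items arrive instead of waiting for the whole input, at the cost of only reordering within that window.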
On Sampling and Relational Operators
Bulletin of the Technical Committee on Data Engineering, 1999
Cited by 3 (2 self)
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. We highlight the primary difficulties, summarize the results of some recent work in this area, and indicate directions for future work.
Data Engineering
XML is fast becoming the intergalactic data speak alphabet for data and information exchange that hides the heterogeneity among the components of loosely-coupled, distributed systems and provides the glue that allows the individual components to take part in the loosely integrated system. Since much of this data is currently stored in relational database systems, simplifying the transformation of this data from and to XML in general, and from and to the agreed-upon exchange schema specifically, is an important feature that should improve the productivity of the programmer and the efficiency of this process. This article provides an overview of the features that are needed to provide access via HTTP and XML and presents the approach taken in Microsoft SQL Server. Keywords: loosely-coupled, distributed system architectures, XML, relational database systems
Bulletin of the Technical Committee on
XML is fast becoming the intergalactic data speak alphabet for data and information exchange that hides the heterogeneity among the components of loosely-coupled, distributed systems and provides the glue that allows the individual components to take part in the loosely integrated system. Since much of this data is currently stored in relational database systems, simplifying the transformation of this data from and to XML in general, and from and to the agreed-upon exchange schema specifically, is an important feature that should improve the productivity of the programmer and the efficiency of this process. This article provides an overview of the features that are needed to provide access via HTTP and XML and presents the approach taken in Microsoft SQL Server.
A Selectivity Model for Fragmented Relations: Applied in Information Retrieval
New application domains cause today’s database sizes to grow rapidly, posing great demands on technology. Data fragmentation facilitates techniques (like distribution, parallelization, and main-memory computing) meeting these demands. Also, fragmentation might help to improve efficient processing of query types such as top N. Database design and query optimization require a good notion of the costs resulting from a certain fragmentation. Our mathematically derived selectivity model facilitates this. Once its two parameters have been computed based on the fragmentation, after each (though usually infrequent) update, our model can forget the data distribution, resulting in fast and quite good selectivity estimation. We show experimental verification for Zipfian-distributed IR databases. Index Terms: Selectivity, fragmentation, Zipf, information retrieval, databases.

