Improved Histograms for Selectivity Estimation of Range Predicates
, 1996
Cited by 239 (20 self)
Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance.
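For readers unfamiliar with the idea, the following is a minimal sketch of how a histogram can be used to estimate the result size of a range predicate. It assumes a plain equi-width histogram and uniform spread of values inside each bucket; it is not one of the histogram types proposed in the paper, and the function names `build_equi_width` and `estimate_range` are hypothetical.

```python
def build_equi_width(values, lo, hi, num_buckets):
    """Count how many values fall into each equal-width bucket over [lo, hi]."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to hi lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_range(counts, lo, width, a, b):
    """Estimate how many values lie in [a, b).

    Assumption: values are spread uniformly inside each bucket, so a bucket
    contributes its count scaled by the fraction of the bucket that [a, b) covers.
    """
    total = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        total += count * overlap / width
    return total


if __name__ == "__main__":
    data = list(range(100))  # illustrative data, not from the paper
    counts, width = build_equi_width(data, lo=0, hi=100, num_buckets=10)
    print(estimate_range(counts, 0, width, 25, 45))
```

The estimate is only as good as the uniformity assumption inside each bucket, which is one reason the choice of bucket boundaries matters for estimation accuracy.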
Selectivity Estimation Without the Attribute Value Independence Assumption
, 1997
Cited by 198 (12 self)
The result size of a query that involves multiple attributes from the same relation depends on these attributes' joint data distribution, i.e., the frequencies of all combinations of attribute values. To simplify the estimation of that size, most commercial systems make the attribute value independence assumption and maintain statistics (typically histograms) on individual attributes only. In reality, this assumption is almost always wrong and the resulting estimations tend to be highly inaccurate. In this paper, we propose two main alternatives to effectively approximate (multidimensional) joint data distributions: (a) using a multidimensional histogram, and (b) using the Singular Value Decomposition (SVD) technique from linear algebra. An extensive set of experiments demonstrates the advantages and disadvantages of the two approaches and the benefits of both compared to the independence assumption.
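As an illustration of why the independence assumption can mislead, the sketch below compares an estimate built from per-attribute frequencies with the true count taken from the joint distribution. The toy relation and the function names are hypothetical and are not taken from the paper; neither of the paper's two proposed techniques is implemented here.

```python
from collections import Counter


def independence_estimate(rows, a_val, b_val):
    """Estimate |A = a_val AND B = b_val| from single-attribute frequencies only.

    Under the attribute value independence assumption the joint selectivity
    is taken to be the product of the two individual selectivities.
    """
    n = len(rows)
    freq_a = Counter(a for a, _ in rows)
    freq_b = Counter(b for _, b in rows)
    return n * (freq_a[a_val] / n) * (freq_b[b_val] / n)


def true_count(rows, a_val, b_val):
    """Count matching rows directly from the joint data distribution."""
    return sum(1 for a, b in rows if a == a_val and b == b_val)


if __name__ == "__main__":
    # Hypothetical relation with strongly correlated attributes (city, country).
    rows = (
        [("Paris", "France")] * 50
        + [("Lyon", "France")] * 25
        + [("Rome", "Italy")] * 25
    )
    print("independence estimate:", independence_estimate(rows, "Paris", "France"))
    print("true count:", true_count(rows, "Paris", "France"))
```

Because city determines country in this toy data, the product of the individual selectivities underestimates the true count, which is the kind of error the abstract refers to.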
The Average Distance in a Random Graph with Given Expected Degrees
Cited by 191 (13 self)
Random graph theory is used to examine the “small-world phenomenon”: any two strangers are connected through a short chain of mutual acquaintances. We will show that for certain families of random graphs with given expected degrees, the average distance is almost surely of order $\log n / \log \tilde{d}$, where $\tilde{d}$ is the weighted average of the sum of squares of the expected degrees. Of particular interest are power law random graphs in which the number of vertices of degree $k$ is proportional to $1/k^{\beta}$ for some fixed exponent $\beta$. For the case of $\beta > 3$, we prove that the average distance of the power law graphs is almost surely of order $\log n / \log \tilde{d}$. However, many Internet, social, and citation networks are power law graphs with exponents in the range $2 < \beta < 3$, for which the power law random graphs have average distance almost surely of order $\log \log n$ but have diameter of order $\log n$ (given some mild constraints on the average distance and maximum degree). In particular, these graphs contain a dense subgraph, which we call the core, having $n^{c / \log \log n}$ vertices. Almost all vertices are within distance $\log \log n$ of the core, although there are vertices at distance $\log n$ from the core.
Efficient filtering of XML documents with XPath expressions
, 2002
Cited by 172 (12 self)
We propose a novel index structure, termed XTrie, that supports the efficient filtering of XML documents based on XPath expressions. Our XTrie index structure offers several novel features that make it especially attractive for large-scale publish/subscribe systems. First, XTrie is designed to support effective filtering based on complex XPath expressions (as opposed to simple, single-path specifications). Second, our XTrie structure and algorithms are designed to support both ordered and unordered matching of XML data. Third, by indexing on sequences of element names organized in a trie structure and using a sophisticated matching algorithm, XTrie is able to both reduce the number of unnecessary index probes as well as avoid redundant matchings, thereby providing extremely efficient filtering. Our experimental results over a wide range of XML document and XPath expression workloads demonstrate that our XTrie index structure outperforms earlier approaches by wide margins.
Power laws, Pareto distributions and Zipf’s law
 Contemporary Physics
, 2005
Cited by 170 (0 self)
When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf’s law or the Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences, economics and finance, computer science, demography and the social sciences. For instance, the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people’s personal fortunes all appear to follow power laws. The origin of power-law behaviour has been a topic of debate in the scientific community for more than a century. Here we review some of the empirical evidence for the existence of power-law forms and the theories proposed to explain them.
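The definition in the first sentence of this abstract can be written compactly as follows. The symbols $p(x)$, $C$, and $\alpha$ are notation assumed here for illustration; the abstract itself does not fix any notation.

```latex
% Power law: probability varies inversely as a power of the value.
% C is a normalization constant and \alpha > 0 is the exponent (assumed notation).
p(x) = C\,x^{-\alpha}
\qquad\Longleftrightarrow\qquad
\ln p(x) = -\alpha \ln x + \ln C
```

The second form shows why a power law appears as a straight line with slope $-\alpha$ on a log-log plot.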
Evaluating Top-k Selection Queries
 In VLDB
, 1999
Cited by 139 (4 self)
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational DBMSs can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme.
1 Introduction. Internet search engines rank the objects in the results of selection queries according to how well these objects match the original selection condition. For such engines, query results are not flat sets of objects that match a given condition. Instead, query results are ranked starting ...
Eliminating Fuzzy Duplicates in Data Warehouses
 In VLDB
, 2002
Cited by 112 (3 self)
The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
Extracting Large-Scale Knowledge Bases From the Web
 Proceedings of the 25th VLDB Conference
, 1999
Cited by 103 (2 self)
The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of certain subgraphs. We focus on subgraphs that are signatures of web phenomena such as tightly focused topic communities, webrings, taxonomy trees, keiretsus, etc. For instance, the signature of a webring is a central page with bidirectional links to a number of other pages. We develop novel algorithms for such enumeration problems. A key technical contribution is the development of a model for the evolution of the web graph, based on experimental observations derived from a snapshot of the web. We argue that our algorithms run efficiently in this model, and use the model to explain some statistical phenomena on the web that emerged during our experiments. Finally, we describe the design and implementation of Campfire, a knowledge base of over one hundred thousand web communities.
The complex dynamics of collaborative tagging
 In Proceedings of international conference
, 2007
Cited by 95 (3 self)
The debate within the Web community over the optimal means by which to organize information often pits formalized classifications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems, including whether coherent categorization schemes can emerge from unsupervised tagging by users. This paper uses data from tagged sites on the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for “popular” sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and patterns prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags.