Results 11-20 of 154
Counting Twig Matches in a Tree
, 2001
Cited by 64 (2 self)
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e., small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: Set Hashing, used to estimate the number of matches of each query twiglet, and Maximal Overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML...
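The Set Hashing step named in this abstract builds on min-hash signatures. As a rough illustration only (not the paper's twiglet estimator), the Python sketch below estimates the resemblance of two sets from such signatures. The helper names, the salted BLAKE2 hash, the example labels, and the signature length of 256 are all assumptions made for this example.

```python
import hashlib


def salted_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash; each seed plays the role of a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """For every hash function, keep the minimum hash value over a non-empty set."""
    return [min(salted_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of |A ∩ B| / |A ∪ B|."""
    agree = sum(x == y for x, y in zip(sig_a, sig_b))
    return agree / len(sig_a)


# Hypothetical label sets: they share 50 of 150 distinct labels,
# so the true resemblance is 1/3.
a = {f"path{i}" for i in range(100)}
b = {f"path{i}" for i in range(50, 150)}
print(estimate_resemblance(minhash_signature(a), minhash_signature(b)))
```

Longer signatures lower the variance of the estimate at the cost of more space per set.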
Bypassing the embedding: Algorithms for low dimensional metrics
 In Proceedings of the 36th ACM Symposium on the Theory of Computing (STOC
, 2004
Cited by 64 (4 self)
The doubling dimension of a metric is the smallest $k$ such that any ball of radius $2r$ can be covered using $2^k$ balls of radius $r$. This concept for abstract metrics has been proposed as a natural analog to the dimension of a Euclidean space. If we could embed metrics with low doubling dimension into low dimensional Euclidean spaces, they would inherit several algorithmic and structural properties of the Euclidean spaces. Unfortunately, however, such a restriction on dimension does not suffice to guarantee embeddability in a normed space. In this paper we explore the option of bypassing the embedding. In particular we show the following for low dimensional metrics:
• Quasi-polynomial time $(1+\epsilon)$-approximation algorithm for various optimization problems such as TSP, k-median and facility location.
• $(1+\epsilon)$-approximate distance labeling scheme with optimal label length.
• $(1+\epsilon)$-stretch polylogarithmic storage routing scheme.
A Small Approximately Min-Wise Independent Family of Hash Functions
 Journal of Algorithms
, 1999
Cited by 56 (2 self)
In this paper we give a construction of a small approximately min-wise independent family of hash functions. The number of bits needed to represent each function is $O(\log n \cdot \log 1/\epsilon)$. This construction gives a solution to the main open problem of [2].

1 Introduction

A family of functions $H \subset [n] \to [n]$ (where $[n] = \{0, \ldots, n-1\}$) is called $\epsilon$-min-wise independent if for any $X \subset [n]$ and $x \in [n] - X$ we have

$$\Pr_{h \in H}\left[h(x) < \min h(X)\right] = \frac{1}{|X| + 1}(1 \pm \epsilon).$$

This definition can be generalized to the case when $|X|$ is restricted to be smaller than a prespecified bound $s$. Such families (restricted to the case when all functions from $H$ are permutations) were introduced and investigated in [2] and independently earlier in [6] (cf. [7]). The motivation for studying such families is to reduce the amount of randomness used by algorithms [6, 2, 3]. In particular (as pointed out in [2]) they have immediate application to efficient detection of similar documents in large...
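To make the definition concrete, the following Python sketch checks by brute force that the family of all permutations of $[n]$ meets it exactly, i.e. with $\epsilon = 0$. It is a sanity check on the definition, not the small family constructed in the paper; the function name and the toy parameters are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations


def prob_x_is_min(n, X, x):
    """Exact probability, over all permutations h of range(n), that h(x) < min h(X)."""
    hits = 0
    total = 0
    for h in permutations(range(n)):  # h[i] is the image of element i
        total += 1
        if h[x] < min(h[y] for y in X):
            hits += 1
    return Fraction(hits, total)


# With |X| = 3 the definition predicts 1 / (|X| + 1) = 1/4.
print(prob_x_is_min(5, {0, 1, 2}, 4))  # 1/4
```

Enumerating all $n!$ permutations is only feasible for tiny $n$, which is why families that are small yet approximately min-wise independent are of interest.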
On the Evolution of Clusters of Near-Duplicate Web Pages
 IN 1ST LATIN AMERICAN WEB CONGRESS
, 2003
Cited by 54 (3 self)
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
An Axiomatic Approach for Result Diversification
 WWW 2009 MADRID!
, 2009
Cited by 51 (1 self)
Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability ranking principle-based approach of presenting the most relevant results on top can be suboptimal, and hence the search engine would like to trade off relevance for diversity in the results. In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large-scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database.
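The trade-off between relevance and diversity can be illustrated with a generic greedy heuristic in the spirit of dispersion problems. The Python sketch below is not one of the paper's objectives or algorithms; the scoring rule, the `trade_off` parameter, and the toy data are assumptions made purely for illustration.

```python
def greedy_diversify(items, relevance, distance, k, trade_off=0.5):
    """Greedily pick k items, balancing relevance against distance to earlier picks."""
    remaining = list(items)
    chosen = []
    while remaining and len(chosen) < k:

        def score(item):
            # Novelty is the distance to the closest item chosen so far (0 if none).
            novelty = min((distance(item, c) for c in chosen), default=0.0)
            return (1 - trade_off) * relevance[item] + trade_off * novelty

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical results for an ambiguous query: two near-identical pages and one distinct page.
relevance = {"jaguar car": 1.0, "jaguar cars": 0.9, "jaguar cat": 0.5}
far_apart = {
    frozenset(("jaguar car", "jaguar cat")),
    frozenset(("jaguar cars", "jaguar cat")),
}


def distance(x, y):
    return 1.0 if frozenset((x, y)) in far_apart else 0.0


print(greedy_diversify(relevance, relevance, distance, k=2))
# ['jaguar car', 'jaguar cat']
print(greedy_diversify(relevance, relevance, distance, k=2, trade_off=0.0))
# ['jaguar car', 'jaguar cars']
```

With `trade_off=0.0` the heuristic reduces to ranking by relevance alone, which returns the two near-identical pages.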
Sampling Algorithms: Lower Bounds and Applications (Extended Abstract)
, 2001
Cited by 49 (2 self)
Ziv Bar-Yossef, Computer Science Division, U. C. Berkeley, Berkeley, CA 94720, zivi@cs.berkeley.edu; Ravi Kumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120, ravi@almaden.ibm.com; D. Sivakumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120, siva@almaden.ibm.com

We develop a framework to study probabilistic sampling algorithms that approximate general functions of the form $f : A^n \to B$, where $A$ and $B$ are arbitrary sets. Our goal is to obtain lower bounds on the query complexity of functions, namely the number of input variables $x_i$ that any sampling algorithm needs to query to approximate $f(x_1, \ldots, x_n)$. We define two quantitative properties of functions, the block sensitivity and the minimum Hellinger distance, that give us techniques to prove lower bounds on the query complexity. These techniques are quite general, easy to use, yet powerful enough to yield tight results. Our applications include the mean and higher statistical moments, the median and other selection functions, and the frequency moments, where we obtain lower bounds that are close to the corresponding upper bounds. We also point out some connections between sampling and streaming algorithms and lossy compression schemes.
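For context on what an upper bound for the mean looks like, the Python sketch below uses the standard Hoeffding inequality: for values in [0, 1], querying ln(2/δ) / (2ε²) random positions suffices for additive error ε with probability at least 1 − δ. This is a textbook bound used here for illustration, not a result or algorithm taken from the paper, and the function names are assumptions.

```python
import math
import random


def hoeffding_sample_size(epsilon, delta):
    """Number of queries sufficient for additive error epsilon with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def sampled_mean(values, epsilon, delta, rng):
    """Estimate the mean of values in [0, 1] by querying random positions with replacement."""
    n = hoeffding_sample_size(epsilon, delta)
    return sum(values[rng.randrange(len(values))] for _ in range(n)) / n


print(hoeffding_sample_size(0.1, 0.05))  # 185
data = [0.0, 1.0] * 5000                 # true mean is 0.5
print(sampled_mean(data, 0.1, 0.05, random.Random(0)))
```

The number of queries depends only on ε and δ, not on the length of the input.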
QuickSAND: Quick Summary and Analysis of Network Data
, 2001
Cited by 46 (8 self)
Monitoring and analyzing traffic data generated from large ISP networks imposes challenges both in the data gathering phase and in the data analysis itself. Still, both tasks are crucial for responding to the day-to-day challenges of engineering large networks with thousands of customers. In this paper we build on the premise that approximation is a necessary evil of handling massive datasets such as network data. We propose building compact summaries of the traffic data, called sketches, at distributed network elements and centers. These sketches are able to respond well to queries that seek features that stand out of the data. We call such features "heavy hitters." In this paper, we describe sketches and show how to use sketches to answer aggregate and trend-related queries and identify heavy hitters. This may be used for exploratory data analysis of network operations interest. We support our proposal by experimentally studying AT&T WorldNet data and performing a feasibility study on the Cisco NetFlow data collected at several routers.
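To give a feel for how a compact summary can surface heavy hitters, the Python sketch below uses a Count-Min sketch. This is a stand-in technique chosen for brevity, not the sketch construction described in the paper; the class name, table dimensions, and the synthetic traffic are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        # One salted hash per row selects the counter this key updates in that row.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._column(row, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.rows[row][self._column(row, key)] for row in range(self.depth))


# Synthetic traffic: one source sends 1000 packets, 200 other sources send one each.
sketch = CountMinSketch()
for _ in range(1000):
    sketch.add("10.0.0.1")
for i in range(200):
    sketch.add(f"192.168.1.{i}")

print(sketch.estimate("10.0.0.1"))  # at least 1000, at most 1200
```

The table size is fixed in advance, so memory use does not grow with the number of distinct sources.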
Scalable Techniques for Clustering the Web
 In Proc. of the WebDB Workshop
, 2000
Cited by 41 (5 self)
Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of URLs.
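The collision property described here can be sketched with one common LSH construction: min-hash signatures split into bands, where two pages become a candidate pair if any band matches exactly. The Python code below assumes that construction for illustration and is not claimed to be the paper's scheme; the function names, band sizes, and toy pages are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes):
    """Min-hash signature of a non-empty token set, using salted BLAKE2 hashes."""

    def h(seed, token):
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")

    return tuple(min(h(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_candidate_pairs(documents, bands=8, rows_per_band=4):
    """Return pairs of document names whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for name, tokens in documents.items():
        sig = minhash_signature(tokens, bands * rows_per_band)
        for band in range(bands):
            chunk = sig[band * rows_per_band:(band + 1) * rows_per_band]
            buckets[(band, chunk)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


pages = {
    "a": {"web", "page", "clustering", "with", "hashing"},
    "a_mirror": {"web", "page", "clustering", "with", "hashing"},
    "b": {"completely", "unrelated", "cooking", "recipe"},
}
print(lsh_candidate_pairs(pages))  # {('a', 'a_mirror')}
```

More bands make it easier for moderately similar pages to collide, while more rows per band make collisions stricter.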
Efficient semi-streaming algorithms for local triangle counting in massive graphs
 in KDD’08, 2008
Cited by 41 (4 self)
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach. Luca Becchetti was partially supported by EU Integrated
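As a point of reference for the quantity being estimated, the Python sketch below computes exact local triangle counts in memory by intersecting neighbor sets. It is a baseline for small graphs, not the semi-streaming approximation from the paper, which estimates these intersection sizes with min-wise independent permutations instead; the function name and the toy graphs are assumptions.

```python
from collections import defaultdict


def local_triangle_counts(edges):
    """Exact number of triangles incident to each node of an undirected simple graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    counts = {}
    for v, nbrs in neighbors.items():
        # Each triangle {v, u, w} is found twice: once via u and once via w.
        shared = sum(len(nbrs & neighbors[u]) for u in nbrs)
        counts[v] = shared // 2
    return counts


# A triangle on nodes 1, 2, 3 with a pendant edge to node 4.
print(local_triangle_counts([(1, 2), (2, 3), (1, 3), (3, 4)]))
# {1: 1, 2: 1, 3: 1, 4: 0}
```

This baseline keeps every adjacency set in memory, i.e. space proportional to the number of edges, which is the cost the semi-streaming setting is designed to avoid.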