Results 1  10
of
25
Computing on Data Streams
, 1998
"... In this paper we study the space requirement of algorithms that make only one (or a small number of) pass(es) over the input data. We study such algorithms under a model of data streams that we introduce here. We give a number of upper and lower bounds for problems stemming from queryprocessing, ..."
Abstract

Cited by 180 (3 self)
 Add to MetaCart
(Show Context)
In this paper we study the space requirement of algorithms that make only one (or a small number of) pass(es) over the input data. We study such algorithms under a model of data streams that we introduce here. We give a number of upper and lower bounds for problems stemming from queryprocessing, invoking in the process tools from the area of communication complexity.
PracticaJ selectivity estimation through adaptive sampling
 In Proc. ,4CM SIGMOD International Conf on Management of Data
, 1990
"... Recently we have proposed an adaptive, random sampling algorithm for general query size estlmatlon In earlier work we analyzed the asymptotic ef’l?clency and accuracy of the algorithm, m this paper we mvestlgate Its practlcahty as applied to selects and Jams First, we extend our previous analysis to ..."
Abstract

Cited by 164 (7 self)
 Add to MetaCart
Recently we have proposed an adaptive, random sampling algorithm for general query size estlmatlon In earlier work we analyzed the asymptotic ef’l?clency and accuracy of the algorithm, m this paper we mvestlgate Its practlcahty as applied to selects and Jams First, we extend our previous analysis to provide agmficantly improved bounds on the amount of samplmg necessary for a given level of accuracy Next, we provide “sanity bounds ” to deal with queries for which the underlying data 1s extremely skewed or the query result 1s very small Finally, we report on the performance of the estlmatlon algorithm as amplemented m a host language on a commercial relational system The results are encouraging, even with this loose couplmg between the estlmatlon algorithm and the DBMS 1
ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs
 NTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING
, 2002
"... Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can ..."
Abstract

Cited by 124 (19 self)
 Add to MetaCart
(Show Context)
Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graphrepresented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memoryefficient approach for approximating the complete "neighbourhood function" for a graph. For the Internet graph (268K nodes), ANF's highlyaccurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drilldown tasks and to repeatedly answer questions about changing data sources. To enable this drilldown, ANF employs new techniques for approximating neighbourhoodtype functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF's approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk resident graphs. Finally, we present some of our results from mining large graphs using ANF.
LinkBased Characterization and Detection of Web Spam
 In AIRWeb
, 2006
"... We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a stu ..."
Abstract

Cited by 54 (8 self)
 Add to MetaCart
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
Bifocal Sampling for SkewResistant Join Size Estimation
 In Proceedings of the 1996 ACM SIGMOD Intl. Conf. on Management of Data
, 1996
"... This paper introduces bifocal sampling, a new technique for estimating the size of an equijoin of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value, Distinct estimation procedures are employed ..."
Abstract

Cited by 39 (6 self)
 Add to MetaCart
This paper introduces bifocal sampling, a new technique for estimating the size of an equijoin of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value, Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some wellknown problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is f2(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(W Ig n). By contrast, previous algorithms using a sample of similar size may require the join size to be f2(n/ @ to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective. 1
Approximating Betweenness Centrality
, 2007
"... Betweenness is a centrality measure based on shortest paths, widely used in complex network analysis. It is computationallyexpensive to exactly determine betweenness; currently the fastestknown algorithm by Brandes requires O(nm) time for unweighted graphs and O(nm + n 2 log n) time for weighted ..."
Abstract

Cited by 39 (5 self)
 Add to MetaCart
Betweenness is a centrality measure based on shortest paths, widely used in complex network analysis. It is computationallyexpensive to exactly determine betweenness; currently the fastestknown algorithm by Brandes requires O(nm) time for unweighted graphs and O(nm + n 2 log n) time for weighted graphs, where n is the number of vertices and m is the number of edges in the network. These are also the worstcase time bounds for computing the betweenness score of a single vertex. In this paper, we present a novel approximation algorithm for computing betweenness centrality of a given vertex, for both weighted and unweighted graphs. Our approximation algorithm is based on an adaptive sampling technique that significantly reduces the number of singlesource shortest path computations for vertices with high centrality. We conduct an extensive experimental study on realworld graph instances, and observe that our random sampling algorithm gives very good betweenness approximations for biological networks, road networks and web crawls.
Using Rank Propagation and Probabilistic Counting for LinkBased Spam Detection
 In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD
, 2006
"... This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and di#cult to solve, mostly due to the large size ..."
Abstract

Cited by 38 (13 self)
 Add to MetaCart
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and di#cult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice.
Random Sampling from Databases  A Survey
 Statistics and Computing
, 1994
"... This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. ..."
Abstract

Cited by 27 (0 self)
 Add to MetaCart
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B + trees, hash files, spatial data structures (including Rtrees and quadtrees)). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling, and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
AQUA: System and techniques for approximate query answering
, 1998
"... In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to ..."
Abstract

Cited by 24 (5 self)
 Add to MetaCart
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) System, for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on either the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group bys and/or joins (especially, the multiway foreign key joins that are popular in OLAP). We present several new techniques for improving the accuracy of approximate query answers for this class of queries. We show how join sampling can significantly improve the approximation quality. We describe how biased sampling can be used to overcome the problem of group size disparities