Link-Based Characterization and Detection of Web Spam
In AIRWeb, 2006
Cited by 48 (8 self)
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
Efficient semi-streaming algorithms for local triangle counting in massive graphs
In KDD’08, 2008
Cited by 41 (4 self)
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach. Luca Becchetti was partially supported by EU Integrated
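As a rough illustration of how min-wise hashing can yield local triangle counts, the sketch below estimates the Jaccard coefficient J of the two neighborhoods on each edge (u, v), converts it to an intersection size via |N(u) ∩ N(v)| = J / (1 + J) · (|N(u)| + |N(v)|), and sums over incident edges. This is only a minimal in-memory sketch under stated assumptions, not the semi-streaming algorithms of the paper: the whole adjacency structure is held in memory, random labels stand in for min-wise independent permutations, and the function names and the `num_hashes` parameter are hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def exact_local_triangles(adj):
    """Exact count: T(u) = 1/2 * sum over neighbors v of |N(u) & N(v)|."""
    return {u: sum(len(adj[u] & adj[v]) for v in adj[u]) // 2 for u in adj}


def estimate_local_triangles(edges, num_hashes=200, seed=0):
    """Min-hash estimate of the number of triangles at every node.

    Assumption: the graph fits in memory and node ids are comparable
    (e.g. integers). Random labels approximate a random permutation.
    """
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    nodes = list(adj)
    matches = defaultdict(int)  # edge -> number of rounds with equal minima

    for _ in range(num_hashes):
        label = {v: rng.random() for v in nodes}
        # Smallest label seen in each node's neighborhood.
        min_label = {u: min(label[w] for w in adj[u]) for u in nodes}
        for u in nodes:
            for v in adj[u]:
                if u < v and min_label[u] == min_label[v]:
                    matches[(u, v)] += 1

    estimate = defaultdict(float)
    for u in nodes:
        for v in adj[u]:
            if u < v:
                j = matches[(u, v)] / num_hashes  # Jaccard estimate
                common = j / (1 + j) * (len(adj[u]) + len(adj[v]))
                estimate[u] += common
                estimate[v] += common
    # Each triangle at u is seen once per incident triangle edge, i.e. twice.
    return {u: estimate[u] / 2 for u in nodes}
```

On the complete graph with four nodes every node lies in exactly three triangles, so `estimate_local_triangles` should return values near 3 for each node, with accuracy improving as `num_hashes` grows.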
Link analysis for web spam detection
ACM Transactions on the Web, 2007
Cited by 25 (3 self)
We propose link-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page, applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. We build several automatic web spam classifiers using different techniques. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Based on these results we propose spam detection techniques which only consider the link structure of the Web, regardless of page contents. These statistical features are used to build a classifier that is tested over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, and orthogonal to their methods.
GRAPH DISTANCES IN THE DATA-STREAM MODEL
2008
Cited by 20 (3 self)
We explore problems related to computing graph distances in the data-stream model. The goal is to design algorithms that can process the edges of a graph in an arbitrary order given only a limited amount of working memory. We are motivated by both the practical challenge of processing massive graphs such as the web graph and the desire for a better theoretical understanding of the data-stream model. In particular, we are interested in the trade-offs between model parameters such as per-data-item processing time, total space, and the number of passes that may be taken over the stream. These trade-offs are more apparent when considering graph problems than they were in previous streaming work that solved problems of a statistical nature. Our results include the following: (1) Spanner construction: There exists a single-pass, $\tilde{O}(t n^{1+1/t})$-space, $\tilde{O}(t^2 n^{1/t})$-time-per-edge algorithm that constructs a $(2t+1)$-spanner. For $t = \Omega(\log n / \log\log n)$, the algorithm satisfies the semi-streaming space restriction of $O(n\,\mathrm{polylog}\,n)$ and has per-edge processing time $O(\mathrm{polylog}\,n)$. This resolves an open question from [J. Feigenbaum et al., Theoret. Comput. Sci., 348 (2005), pp. 207–216]. (2) Breadth-first-search (BFS) trees: For any even constant $k$, we show that any algorithm that computes the first $k$ layers of a BFS tree from a prescribed node with probability at least 2/3 requires either greater than $k/2$ passes or $\tilde{\Omega}(n^{1+1/k})$ space. Since constructing BFS trees is
Annotations in Data Streams
2009
Cited by 8 (5 self)
The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a nontrivial trade-off between the amount of annotation used and the space required to verify it. We also prove lower bounds on such trade-offs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend (including recent studies of multi-pass streaming, read/write streams and randomly ordered streams) of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.
On estimating path aggregates over streaming graphs
In International Symposium on Algorithms and Computation, 2006
Cited by 7 (1 self)
We consider the updatable streaming graph model, where edges of a graph arrive or depart in arbitrary sequence and are processed in an online fashion using sublinear space and time. We study the problem of estimating aggregate path metrics $P_k$, defined as the number of pairs of vertices that have a simple path between them of length $k$. For a streaming undirected graph with $n$ vertices, $m$ edges and $r$ components, we present an $\tilde{O}(m(m-r)^{-1/4})$ space algorithm for estimating $P_2$ and an $\Omega(\sqrt{m})$ space lower bound. We show that estimating $P_2$ over directed streaming graphs, and estimating $P_k$ over streaming graphs (whether directed or undirected), for any $k \geq 3$ requires $\Omega(n^2)$ space. We also present a space lower bound of $\Omega(n^2)$ for the problems of (a) deterministically testing the connectivity, and (b) estimating the size of transitive closure, of undirected streaming graphs that allow both edge insertions and deletions.
Streaming graph computations with a helpful advisor
Cited by 6 (2 self)
Motivated by the trend to outsource work to commercial cloud computing services, we consider a variation of the streaming paradigm where a streaming algorithm can be assisted by a powerful helper that can provide annotations to the data stream. We extend previous work on such annotation models by considering a number of graph streaming problems. Without annotations, streaming algorithms for graph problems generally require significant memory; we show that for many standard problems, including all graph problems that can be expressed with totally unimodular integer programming formulations, only constant memory is needed for single-pass algorithms given linear-sized annotations. We also obtain a protocol achieving optimal trade-offs between annotation length and memory usage for matrix-vector multiplication; this result contributes to a trend of recent research on numerical linear algebra in streaming models.
Validating XML documents in the streaming model with external memory
In ICDT, 2012
Cited by 6 (2 self)
We study the problem of validating XML documents of size $N$ against general DTDs in the context of streaming algorithms. The starting point of this work is a well-known space lower bound. There are XML documents and DTDs for which $p$-pass streaming algorithms require $\Omega(N/p)$ space. We show that when allowing access to external memory, there is a deterministic streaming algorithm that solves this problem with memory space $O(\log^2 N)$, a constant number of auxiliary read/write streams, and $O(\log N)$ total number of passes on the XML document and auxiliary streams. An important intermediate step of this algorithm is the computation of the First-Child-Next-Sibling (FCNS) encoding of the initial XML document in a streaming fashion. We study this problem independently, and we also provide memory-efficient streaming algorithms for decoding an XML document given in its FCNS encoding. Furthermore, validating XML documents encoding binary trees in the usual streaming model without external memory can be done with sublinear memory. There is a one-pass algorithm using $O(\sqrt{N \log N})$ space, and a bidirectional two-pass algorithm using $O(\log^2 N)$ space performing this task.
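For readers unfamiliar with the First-Child-Next-Sibling encoding mentioned above, the sketch below shows the standard definition: an ordered tree becomes a binary tree in which the left pointer leads to a node's first child and the right pointer to its next sibling. This is only an in-memory, recursive illustration of the encoding itself; it is not the streaming algorithm of the paper, and the class and function names (`Node`, `BinNode`, `to_fcns`, `from_fcns`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Ordered tree node, e.g. one XML element with its child elements."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class BinNode:
    """FCNS node: left = first child, right = next sibling."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


def _encode_forest(siblings: List[Node]) -> Optional[BinNode]:
    """Encode an ordered list of sibling trees as one binary tree."""
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return BinNode(head.label,
                   left=_encode_forest(head.children),
                   right=_encode_forest(rest))


def to_fcns(root: Node) -> Optional[BinNode]:
    """FCNS encoding of a single rooted tree."""
    return _encode_forest([root])


def _decode_forest(b: Optional[BinNode]) -> List[Node]:
    """Walk the right spine to recover siblings, recursing on left."""
    out = []
    while b is not None:
        out.append(Node(b.label, _decode_forest(b.left)))
        b = b.right
    return out


def from_fcns(b: BinNode) -> Node:
    """Inverse of to_fcns for a single rooted tree."""
    return _decode_forest(b)[0]
```

For a document shaped like `<a><b><d/></b><c/></a>`, the encoded root `a` has `b` as its left pointer, `b` has `d` on the left and `c` on the right, and decoding returns the original tree.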
Tight Lower Bounds for Multi-Pass Stream Computation via Pass Elimination
2008
Cited by 5 (3 self)
There is a natural relationship between lower bounds in the multi-pass stream model and lower bounds in multi-round communication. However, this connection is less understood than the connection between single-pass stream computation and one-way communication. In this paper, we consider data-stream problems for which reductions from natural multi-round communication problems do not yield tight bounds or do not apply. While lower bounds are known for some of these data-stream problems, many of these only apply to deterministic or comparison-based algorithms, whereas the lower bounds we present apply to any (possibly randomized) algorithms. Our results are particularly relevant to evaluating functions that are dependent on the ordering of the stream, such as the longest increasing subsequence and a variant of tree pointer jumping in which pointers are revealed according to a post-order traversal. Our approach is based on establishing “pass-elimination” type results that are analogous to the round-elimination results of Miltersen et al. [23] and Sen [29]. We demonstrate our approach by proving tight bounds for a range of data-stream problems including finding the longest increasing sequences (a problem that has recently become very
On Maximum Coverage in the Streaming Model & Application to Multi-topic Blog-Watch
Cited by 3 (0 self)
We generalize the graph streaming model to hypergraphs. In this streaming model, hyperedges are arriving online and any computation has to be done on-the-fly using a small amount of space. Each hyperedge can be viewed as a set of elements (nodes), so we refer to our proposed model as the “set-streaming” model of computation. We consider the problem of “maximum coverage”, in which k sets have to be selected that maximize the total weight of the covered elements. In the set-streaming model of computation, we show that our algorithm for maximum-coverage achieves an approximation factor of 1/4. When multiple passes are allowed, we also provide a Θ(log n) approximation algorithm for the set-cover. We next consider a multi-topic blog-watch application, an extension of blog-alert-like applications for handling simultaneous multiple-topic requests. We show how the problems of maximum-coverage and set-cover in the set-streaming model can be utilized to give efficient online solutions to this problem. We verify the effectiveness of our methods both on synthetic and real weblog data.
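To make the weighted maximum-coverage objective concrete, the sketch below implements the classical offline greedy heuristic: repeatedly pick the set that adds the most uncovered weight until k sets are chosen. This is explicitly not the set-streaming algorithm of the paper, which the abstract does not describe; it assumes all sets are available in memory and can be rescanned, and the function name and argument layout are hypothetical.

```python
def greedy_max_coverage(sets, weights, k):
    """Offline greedy baseline for weighted maximum coverage.

    sets    -- list of Python sets of elements
    weights -- dict mapping each element to a non-negative weight
    k       -- maximum number of sets to select
    Returns (indices of chosen sets, total weight covered).
    """
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        best, best_gain = None, 0.0
        for i, s in enumerate(sets):
            if i in chosen:
                continue
            # Marginal gain: weight of elements not yet covered.
            gain = sum(weights[e] for e in s - covered)
            if gain > best_gain:  # strict '>' keeps the earliest set on ties
                best, best_gain = i, gain
        if best is None:  # no remaining set adds any weight
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, sum(weights[e] for e in covered)
```

A streaming variant cannot rescan the sets in this way, which is what makes the single-pass guarantee quoted in the abstract a different and harder target than this baseline.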