Results 1–10 of 113
Combating web spam with TrustRank. In VLDB, 2004.
Cited by 291 (2 self)
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
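The trust-propagation idea in the abstract amounts to a PageRank computation whose teleport vector is concentrated on the hand-picked seed set. A minimal sketch of that idea (not the authors' implementation; the toy graph, node names, damping value, and iteration count are illustrative assumptions):

```python
# Sketch of the core TrustRank idea (not the authors' code): propagate
# trust from a hand-labeled seed set by running PageRank with the
# teleport distribution concentrated on the seeds.

def trustrank(out_links, seeds, alpha=0.85, iters=50):
    """out_links: dict node -> list of successors; seeds: trusted nodes."""
    nodes = list(out_links)
    # Teleport vector: uniform over the good seeds, zero elsewhere.
    d = {n: 1.0 / len(seeds) if n in seeds else 0.0 for n in nodes}
    t = dict(d)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * d[n] for n in nodes}
        for u in nodes:
            succ = out_links[u]
            if succ:
                share = alpha * t[u] / len(succ)
                for v in succ:
                    nxt[v] += share
        t = nxt
    return t

# Toy graph: "good" is the expert-approved seed; "spam" is unreachable
# from it and therefore accumulates no trust at all.
graph = {"good": ["a"], "a": ["b"], "b": ["a"], "spam": ["spam"]}
scores = trustrank(graph, seeds={"good"})
```

Pages reachable from the seeds inherit trust through links, while pages the good region never points to score zero, which is exactly the filtering behavior the paper exploits.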
Web Spam Taxonomy, 2005.
Cited by 168 (2 self)
Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.
SpamRank – Fully Automatic Link Spam Detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
Cited by 67 (5 self)
Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of whitelists or blacklists or other means of human intervention. We assume that spammed pages have a biased distribution of pages that contribute to the undeserved high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1,000-page stratified random sample with bias towards large PageRank values.
A survey of eigenvector methods of web information retrieval. SIAM Rev.
Cited by 66 (6 self)
Web information retrieval is significantly more challenging than traditional well-controlled, small document collection information retrieval. One main difference between traditional information retrieval and Web information retrieval is the Web’s hyperlink structure. This structure has been exploited by several of today’s leading Web search engines, particularly Google and Teoma. In this survey paper, we focus on Web information retrieval methods that use eigenvector computations, presenting the three popular methods of HITS, PageRank, and SALSA.
A survey on PageRank computing. Internet Mathematics, 2005.
Cited by 64 (0 self)
This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, based solely on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underlie PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss link-based search personalization and outline some aspects of PageRank infrastructure, from associated measures of convergence to link preprocessing.
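The basic computation this survey studies is the power method on the damped web transition matrix. A minimal sketch (the toy graph, damping factor, and tolerance below are illustrative assumptions, not taken from the survey):

```python
import numpy as np

# Minimal power-method PageRank, the baseline algorithm the survey
# reviews; graph and parameters here are illustrative.

def pagerank(adj, alpha=0.85, tol=1e-10):
    n = adj.shape[0]
    out = adj.sum(axis=1)
    # Row-stochastic transition matrix; dangling rows teleport uniformly.
    P = np.where(out[:, None] > 0,
                 adj / np.maximum(out, 1.0)[:, None],
                 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_next = alpha * r @ P + (1 - alpha) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
r = pagerank(adj)  # node 1 collects the most link mass
```

Each iteration costs one sparse matrix-vector product, which is why the acceleration and data-structure questions the survey raises matter at web scale.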
Dynamic Personalized PageRank in Entity-Relation Graphs, 2007.
Cited by 52 (2 self)
Extractors and taggers turn unstructured text into entity-relation (ER) graphs where nodes are entities (email, paper, person, conference, company) and edges are relations (wrote, cited, works-for). Typed proximity search of the form type=person NEAR company∼"IBM", paper∼"XML" is an increasingly useful search paradigm in ER graphs. Proximity search implementations either perform a PageRank-like computation at query time, which is slow, or precompute, store, and combine per-word PageRanks, which can be very expensive in terms of preprocessing time and space. We present HubRank, a new system for fast, dynamic, space-efficient proximity searches in ER graphs. During preprocessing, HubRank computes and indexes certain “sketchy” random walk fingerprints for a small fraction of nodes, carefully chosen using query log statistics. At query time, a small “active” subgraph is identified, bordered by nodes with indexed fingerprints. These fingerprints are adaptively loaded to various resolutions to form approximate personalized PageRank vectors (PPVs). PPVs at remaining active nodes are then computed iteratively. We report on experiments with CiteSeer’s ER graph and millions of real CiteSeer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index occupies 56 MB. Whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true PageRank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank’s average query time is 200–300 milliseconds; query-time PageRank computation takes 11 seconds on average.
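The random-walk fingerprints mentioned above are, at their core, Monte Carlo estimates of a personalized PageRank vector. A generic sketch of that estimate on a toy graph (this is the standard simulation idea only, an assumption of mine, not HubRank's actual index structure):

```python
import random
from collections import Counter

# Monte Carlo sketch of a personalized PageRank vector ("fingerprint"):
# run many decaying random walks from the source and record endpoints.

def ppr_fingerprint(out_links, source, alpha=0.85, walks=20000, seed=0):
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        node = source
        # Continue the walk with probability alpha; otherwise stop and
        # record the current node as this walk's endpoint.
        while out_links[node] and rng.random() < alpha:
            node = rng.choice(out_links[node])
        hits[node] += 1
    return {n: c / walks for n, c in hits.items()}

graph = {"s": ["a", "b"], "a": ["b"], "b": ["s"]}
ppv = ppr_fingerprint(graph, "s")  # endpoint frequencies approximate PPR
```

Endpoint frequencies converge to the personalized PageRank scores for the source node, which is why a compact sample of walks can stand in for an exact PPV at query time.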
Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005.
Cited by 47 (1 self)
Link spam is used to increase the ranking of certain target web pages by misleading the connectivity-based ranking algorithms in search engines. In this paper we study how web pages can be interconnected in a spam farm in order to optimize rankings. We also study alliances, that is, interconnections of spam farms. Our results identify the optimal structures and quantify the potential gains. In particular, we show that alliances can be synergistic and improve the rankings of all participants. We believe that the insights we gain will be useful in identifying and combating link spam.
Higher-Order Web Link Analysis Using Multilinear Algebra. IEEE International Conference on Data Mining, 2005.
Cited by 45 (16 self)
Linear algebra is a powerful and proven tool in web search. Techniques such as the PageRank algorithm of Brin and Page and the HITS algorithm of Kleinberg score web pages based on the principal eigenvector (or singular vector) of a particular nonnegative matrix that captures the hyperlink structure of the web graph. We propose and test a new methodology that uses multilinear algebra to elicit more information from a higher-order representation of the hyperlink graph. We start by labeling the edges in our graph with the anchor text of the hyperlinks so that the associated linear algebra representation is a sparse, three-way tensor. The first two dimensions of the tensor represent the web pages while the third dimension adds the anchor text. We then use the rank-1 factors of a multilinear PARAFAC tensor decomposition, which are akin to singular vectors of the SVD, to automatically identify topics in the collection along with the associated authoritative web pages.
Beyond PageRank: Machine learning for static ranking. In WWW ’06: Proceedings of the 15th International Conference on World Wide Web, 2006.
Cited by 44 (2 self)
Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
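The pairwise accuracy quoted above is a simple metric: the fraction of page pairs whose relative order under the model agrees with the human-judged order. A sketch of the generic metric (my illustration, not the paper's evaluation code; the page names and scores are made up):

```python
from itertools import combinations

# Pairwise accuracy of a static ranking against human judgments:
# for every judged pair, does the model score the better page higher?

def pairwise_accuracy(judged, scores):
    """judged: pages listed from most to least relevant.
    scores: model score per page (higher = ranked higher)."""
    agree, total = 0.0, 0
    for better, worse in combinations(judged, 2):
        if scores[better] == scores[worse]:
            agree += 0.5          # count ties as half, a common convention
        elif scores[better] > scores[worse]:
            agree += 1.0
        total += 1
    return agree / total

judged = ["a", "b", "c", "d"]
acc = pairwise_accuracy(judged, {"a": 0.9, "b": 0.4, "c": 0.6, "d": 0.1})
```

On this toy input the model misorders only the (b, c) pair, so the metric is 5/6; random scores would hover around the 50% baseline the abstract mentions.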
PageRank as a Function of the Damping Factor, 2005.
Cited by 36 (9 self)
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads part of the rank uniformly. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection [21]. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them, with provable convergence, for the k-th derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.
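The Maclaurin connection in the abstract's last claim is easy to check numerically: the k-th Power Method iterate, started from the scaled preference vector, coincides with the degree-k truncation of the series r(α) = (1 − α) Σ_j α^j v P^j. A small sketch (my illustration; the toy matrix and α value are assumptions, not from the paper):

```python
import numpy as np

# Numerical check of the Maclaurin/Power-Method correspondence:
# k power-method steps from (1-α)·v equal the degree-k series truncation.

P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # toy row-stochastic web graph
n = P.shape[0]
v = np.full(n, 1.0 / n)           # uniform preference vector
alpha = 0.85

def maclaurin(k):
    """Degree-k truncation of the power series in alpha."""
    term, acc = v.copy(), v.copy()
    for _ in range(k):
        term = alpha * term @ P
        acc = acc + term
    return (1 - alpha) * acc

def power_iter(k):
    """k Power Method steps from the scaled preference vector."""
    r = (1 - alpha) * v
    for _ in range(k):
        r = alpha * r @ P + (1 - alpha) * v
    return r

assert np.allclose(maclaurin(10), power_iter(10))
```

Unrolling the recurrence shows why: after k steps the iterate is (1 − α) Σ_{j≤k} α^j v P^j, exactly the degree-k Maclaurin polynomial, so iteration count and truncation degree are interchangeable.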