• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Topical TrustRank: Using Topicality to Combat Web Spam (2006)

by B Wu, G V, D B D
Venue:In WWW
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 22
Next 10 →

Opinion spam and analysis

by Nitin Jindal, Bing Liu - In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM , 2008
"... Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarizat ..."
Abstract - Cited by 33 (8 self) - Add to MetaCart
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.

Propagating Trust and Distrust to Demote Web Spam

by Baoning Wu , Vinay Goel, Brian D. Davison , 2006
"... Web spamming describes behavior that attempts to deceive search engine's ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how Pag ..."
Abstract - Cited by 22 (2 self) - Add to MetaCart
Web spamming describes behavior that attempts to deceive search engine's ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how PageRank propagates authority scores among Web pages. This type of propagation may be suited for propagating authority, but it is not optimal for calculating trust scores for demoting spam sites. In this paper,

Link-based similarity search to fight web spam

by András A. Benczúr, Károly Csalogány, Tamás Sarlós - In AIRWEB , 2006
"... www.ilab.sztaki.hu/websearch We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe ..."
Abstract - Cited by 15 (2 self) - Add to MetaCart
www.ilab.sztaki.hu/websearch We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong inside connectedness however gave rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classificators by spreading information along hyperlinks from white and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms. 1.

Web projections: Learning from contextual subgraphs of the web

by Jure Leskovec - In WWW ’07: Proceedings of the 16th international conference on World Wide Web , 2007
"... Graphical relationships among web pages have been leveraged as sources of information in methods for ranking search results. To date, specific graphical properties have been used in these analyses. We introduce web projections, a methodology that generalizes prior efforts on exploiting graphical rel ..."
Abstract - Cited by 10 (2 self) - Add to MetaCart
Graphical relationships among web pages have been leveraged as sources of information in methods for ranking search results. To date, specific graphical properties have been used in these analyses. We introduce web projections, a methodology that generalizes prior efforts on exploiting graphical relationships of the web in several ways. With the approach, we create subgraphs by projecting sets of pages and domains onto the larger web graph, and then use machine learning to construct predictive models that consider graphical properties as evidence. We describe the method and present experiments that illustrate the construction of predictive models of search result quality and user query reformulation.

SocialTrust: Tamper-Resilient Trust Establishment in Online Communities

by James Caverlee, Ling Liu, Steve Webb
"... Web 2.0 promises rich opportunities for information sharing, electronic commerce, and new modes of social interaction, all centered around the“social Web”of user-contributed content, social annotations, and person-to-person social connections. But the increasing reliance on this “social Web ” also p ..."
Abstract - Cited by 8 (2 self) - Add to MetaCart
Web 2.0 promises rich opportunities for information sharing, electronic commerce, and new modes of social interaction, all centered around the“social Web”of user-contributed content, social annotations, and person-to-person social connections. But the increasing reliance on this “social Web ” also places individuals and their computer systems at risk, creating opportunities for malicious participants to exploit the tight social fabric of these networks. With these problems in mind, we propose the SocialTrust framework for tamperresilient trust establishment in online communities. Social-Trust provides community users with dynamic trust values by (i) distinguishing relationship quality from trust; (ii) incorporating a personalized feedback mechanism for adapting as the community evolves; and (iii) tracking user behavior. We experimentally evaluate the SocialTrust framework using real online social networking data consisting of millions of MySpace profiles and relationships. We find that SocialTrust supports robust trust establishment even in the presence of large-scale collusion by malicious participants.

Dynamics of Large Networks

by Jurij Leskovec , 2008
"... A basic premise behind the study of large networks is that interaction leads to complex collective behavior. In our work we found very interesting and counterintuitive patterns for time evolving networks, which change some of the basic assumptions that were made in the past. We then develop models ..."
Abstract - Cited by 8 (0 self) - Add to MetaCart
A basic premise behind the study of large networks is that interaction leads to complex collective behavior. In our work we found very interesting and counterintuitive patterns for time evolving networks, which change some of the basic assumptions that were made in the past. We then develop models that explain processes which govern the network evolution, fit such models to real networks, and use them to generate realistic graphs or give formal explanations about their properties. In addition, our work has a wide range of applications: it can help us spot anomalous graphs and outliers, forecast future graph structure and run simulations of network evolution. Another important aspect of our research is the study of “local ” patterns and structures of propagation in networks. We aim to identify building blocks of the networks and find the patterns of influence that these blocks have on information or virus propagation over the network. Our recent work included the study of the spread of influence in a large person-to-person

Knowing a Web Page by the Company It Keeps

by Xiaoguang Qi, Brian D. Davison - CIKM , 2006
"... Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers on web data often produces unsatisfying results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this pape ..."
Abstract - Cited by 7 (2 self) - Add to MetaCart
Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers on web data often produces unsatisfying results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help with the page in question. In experiments to study the effect of these factors on our algorithm, we find that the method proposed is able to boost the classification accuracy of common textual classifiers from around 70 % to more than 90 % on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee

by Junghoo Cho, et al. , 2007
"... Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downl ..."
Abstract - Cited by 5 (1 self) - Add to MetaCart
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most ” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important ” part of the Web it will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage of the Web with a relatively small number of pages.

Extracting Link Spam using Biased Random Walks From Spam Seed Sets

by Baoning Wu
"... Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such as PageRank, HITS, and other derivatives are especially vulnerable to link spam. Link farms and link exchanges are two co ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such as PageRank, HITS, and other derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities – i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12 % precision in extracting large link farms and 80.46 % precision in extracting link exchange centroids.

Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn

by András A. Benczúr, Károly Csalogány, László Lukács, Dávid Siklósi - In Graph Labeling Workshop in conjunction with ECML/PKDD , 2007
"... Abstract. We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semisupervised learning has the assumption that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenome ..."
Abstract - Cited by 3 (3 self) - Add to MetaCart
Abstract. We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semisupervised learning has the assumption that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly churn occurs in bursts in groups of a social network. 1
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University