Detecting spam web pages through content analysis
- In Proceedings of the World Wide Web conference, 2006
Cited by 110 (3 self)
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
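The percentages quoted in the abstract above can be re-derived from its raw counts. The following is a minimal sketch of that arithmetic; the helper `pct` and the constant names are hypothetical and chosen only for illustration.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts taken from the abstract.
TOTAL_PAGES = 17168      # judged collection
SPAM_PAGES = 2364        # pages judged to be spam
SPAM_CAUGHT = 2037       # spam pages correctly identified
MISIDENTIFIED = 526      # spam and non-spam pages misidentified

detection_rate = pct(SPAM_CAUGHT, SPAM_PAGES)      # share of spam that was caught
spam_share = pct(SPAM_PAGES, TOTAL_PAGES)          # share of the collection that is spam
error_rate = pct(MISIDENTIFIED, TOTAL_PAGES)       # share of all pages misidentified

print(detection_rate, spam_share, error_rate)
```

Each figure is a simple ratio: the first is relative to the spam pages only, while the other two are relative to the whole judged collection.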
Know your neighbors: Web spam detection using the web topology
- In Proceedings of SIGIR, 2007
Cited by 43 (8 self)
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.
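Method (i) in the abstract above, relabeling every host in a cluster by majority vote, can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `smooth_by_cluster`, the label strings, and the toy hosts are assumptions, and the cluster assignment is taken as given rather than computed from the host graph.

```python
from collections import Counter, defaultdict

def smooth_by_cluster(predicted, cluster_of):
    """Relabel each host with the majority predicted label of its cluster.

    predicted:  dict mapping host -> label from a base classifier
    cluster_of: dict mapping host -> cluster id (assumed precomputed)
    """
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    # Most frequent label per cluster; ties fall to the label seen first.
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return {host: majority[cluster_of[host]] for host in predicted}

# Toy data (hypothetical): three hosts share cluster 0, one host is alone.
predicted = {
    "a.example": "spam",
    "b.example": "spam",
    "c.example": "nonspam",
    "d.example": "nonspam",
}
cluster_of = {"a.example": 0, "b.example": 0, "c.example": 0, "d.example": 1}
smoothed = smooth_by_cluster(predicted, cluster_of)
```

In the toy data, `c.example` is outvoted by its two cluster neighbors, which reflects the abstract's observation that linked hosts tend to belong to the same class.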
Link spam detection based on mass estimation
- In Proceedings of the 32nd International Conference on Very Large Databases. ACM, 2006
Cited by 21 (2 self)
Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page’s ranking. We discuss how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
Web Page Classification: Features and Algorithms
- 2007
Cited by 16 (0 self)
Classification of web page content is essential to many tasks in web information retrieval such as maintaining web directories and focused crawling. The uncontrolled nature of web content presents additional challenges to web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in web page classification, we note the importance of these web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Link-based similarity search to fight web spam
- In AIRWEB, 2006
Cited by 15 (2 self)
www.ilab.sztaki.hu/websearch
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances formed for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to evaluate spam filtering algorithms.
IRLbot: Scaling to 6 Billion Pages and Beyond
Cited by 15 (0 self)
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
Social Honeypots: Making Friends With A Spammer Near You
Cited by 14 (6 self)
Social networking communities have become an important communications platform, but the popularity of these communities has also made them targets for a new breed of social spammers. Unfortunately, little is known about these social spammers, their level of sophistication, or their strategies and tactics. Thus, in this paper, we provide the first characterization of social spammers and their behaviors. Concretely, we make two contributions: (1) we introduce social honeypots for tracking and monitoring social spam, and (2) we report the results of an analysis performed on spam data that was harvested by our social honeypots. Based on our analysis, we find that the behaviors of social spammers exhibit recognizable temporal and geographic patterns and that social spam content contains various distinguishing characteristics. These results are quite promising and suggest that our analysis techniques may be used to automatically identify social spam.
Site Level Noise Removal for Search Engines
- In Proc. of International World Wide Web Conference (WWW), 2006
Cited by 13 (3 self)
The currently booming search engine industry has led many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links over the web graph, whether they are spam or simple relationships between real-world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and eliminate site-level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after applying our techniques.
Improving Web Spam Classification using Rank-time Features
- 2007
Cited by 6 (0 self)
In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem occurs generally in learning problems and can be hard to detect. In particular, we find that ensuring no overlapping domains between test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
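The requirement in the abstract above, that no domain appears in both the training and the test set, can be sketched as a split over domains rather than over individual pages. This is a minimal sketch and not the paper's procedure: the function name `domain_separated_split`, its parameters, and the sample URLs are hypothetical, and the URL hostname is used as a stand-in for the domain, which is an assumption (a stricter version would group by registered domain).

```python
import random
from urllib.parse import urlsplit

def domain_separated_split(urls, test_fraction=0.25, seed=0):
    """Split URLs so that every hostname lands entirely in train or in test."""
    by_domain = {}
    for url in urls:
        host = urlsplit(url).hostname or ""   # assumption: hostname ~ domain
        by_domain.setdefault(host, []).append(url)

    domains = sorted(by_domain)               # sort for a reproducible shuffle
    random.Random(seed).shuffle(domains)
    n_test = max(1, int(len(domains) * test_fraction))
    test_domains = domains[:n_test]
    train_domains = domains[n_test:]

    train = [u for d in train_domains for u in by_domain[d]]
    test = [u for d in test_domains for u in by_domain[d]]
    return train, test

# Hypothetical sample: four hostnames, several pages each.
urls = [
    "http://a.example/1", "http://a.example/2",
    "http://b.example/1", "http://b.example/2", "http://b.example/3",
    "http://c.example/1",
    "http://d.example/1", "http://d.example/2",
]
train, test = domain_separated_split(urls)
train_hosts = {urlsplit(u).hostname for u in train}
test_hosts = {urlsplit(u).hostname for u in test}
```

Shuffling whole domains instead of pages means the test fraction is only approximate in terms of page count, which is the price of keeping the two sets disjoint at the domain level.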
Collaborative Blog Spam Filtering Using Adaptive Percolation Search
- 2006
Cited by 6 (0 self)
We propose a novel collaborative filtering method for link spam on blogs. The key idea is to rely on manual identification of spam and to share this information through a network of trust. The blogger who has identified a spam tells a small number of fellow bloggers (content implantation), and those who have not heard about it start a search using an adaptive percolation search; combined with content implantation, they contract the information about identified spam in only a fraction of the query period time without producing a large volume of traffic.

