Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference, 2006
Cited by 110 (3 self)
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
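The percentages quoted in this abstract can be rechecked from its raw counts. The short sketch below does only that arithmetic; the variable names and the `pct` helper are illustrative assumptions, not anything taken from the paper.

```python
# Counts quoted in the abstract above.
total_pages = 17168    # judged collection
spam_pages = 2364      # pages judged to be spam
caught = 2037          # spam pages correctly identified
misidentified = 526    # spam and non-spam pages classified wrongly


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


recall = pct(caught, spam_pages)              # share of spam pages detected
spam_share = pct(spam_pages, total_pages)     # share of the collection that is spam
error_rate = pct(misidentified, total_pages)  # share of all pages misclassified

print(recall, spam_share, error_rate)
```

Read this way, the 86.2% figure is measured against the spam pages only, while 13.8% and 3.1% are measured against the whole judged collection.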
Spam double-funnel: connecting web spammers with advertisers
In WWW, 2007
Cited by 28 (0 self)
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
A Quantitative Study of Forum Spamming Using Context-based Analysis
In Proc. Network and Distributed System Security (NDSS) Symposium, 2007
Cited by 18 (2 self)
Forum spamming has become a major means of search engine spamming. To evaluate the impact of forum spamming on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all the 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputably well maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that these analyses are very effective in identifying spam pages.
Link-based similarity search to fight web spam
In AIRWEB, 2006
Cited by 15 (2 self)
We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong inside connectedness, however, have given rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms.
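To make the idea of a similarity top list concrete, here is a minimal sketch for one of the measures named above, co-citation (two pages are similar when many pages link to both). It is an illustration under assumptions, not the authors' classifier: the toy link list, the labels, the function names, and the choice to score a page by the spam fraction of its top list are all hypothetical.

```python
from collections import defaultdict


def cocitation_scores(links, page):
    """For every other page, count the sources that link to both it and `page`."""
    targets_of = defaultdict(set)
    for src, dst in links:
        targets_of[src].add(dst)
    scores = defaultdict(int)
    for targets in targets_of.values():
        if page in targets:
            for other in targets:
                if other != page:
                    scores[other] += 1
    return dict(scores)


def spam_ratio_of_top_list(links, page, labels, k=3):
    """Fraction of spam among the k labeled pages most co-cited with `page`."""
    scores = cocitation_scores(links, page)
    labeled = [p for p in scores if p in labels]
    # Highest score first; ties broken by name so the result is deterministic.
    top = sorted(labeled, key=lambda p: (-scores[p], p))[:k]
    if not top:
        return None
    return sum(1 for p in top if labels[p] == "spam") / len(top)


# Hypothetical toy data: (source, target) hyperlinks and a few known labels.
links = [
    ("h1", "s1"), ("h1", "s2"), ("h1", "u"),
    ("h2", "s1"), ("h2", "u"),
    ("h3", "g1"), ("h3", "u"),
    ("h4", "g1"), ("h4", "g2"),
]
labels = {"s1": "spam", "s2": "spam", "g1": "honest", "g2": "honest"}

scores = cocitation_scores(links, "u")
ratio = spam_ratio_of_top_list(links, "u", labels, k=3)
print(scores, ratio)
```

In this toy graph the unknown page `u` is co-cited mostly with known spam pages, so its top list is spam-heavy; a real system would feed such features into a trained classifier rather than read the ratio directly.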
Adversarial Information Retrieval Aspects of Sponsored Search
2006
Cited by 7 (1 self)
Search engines are commercial entities that require revenue to survive. The most prevalent revenue stream for search engines is sponsored search, where content providers have search engines service their links to users in response to queries or in a contextual manner on relevant Web sites. In exchange for providing this service, content providers pay search engines based on the number of clicks (i.e., a click being a visit by a user to the content provider's Web page). This business model has proven to be very effective for the search engines, content providers, and searchers. However, click fraud, a unique form of adversarial information retrieval, threatens this business model and, therefore, the "free search" that has rapidly become indispensable to the daily lives of many people. In this paper, we outline how sponsored search is a unique form of information retrieval and not just a mode of advertising, what click fraud is, how click fraud happens, and what some possible countermeasures are.
A Large-Scale Study of Link Spam Detection by Graph Algorithms
Cited by 6 (0 self)
Link spam refers to attempts to promote the ranking of spammers’ web sites by deceiving link-based ranking algorithms in search engines. Spammers often create densely connected link structures of sites, so-called “link farms”. In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCC). Besides the largest SCC (core) in the center of the web, we have observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms as a reliable spam seed set by a minimum cut technique that separates links among spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand sites as spam with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
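The first of the three steps described above, decomposing a directed graph into strongly connected components, can be sketched as follows. This is a generic iterative version of Kosaraju's two-pass algorithm on a hypothetical toy edge list; the abstract does not say which SCC algorithm the authors used, and the clique and minimum-cut steps are not shown.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: depth-first search on the original graph, recording each node
    # when all of its outgoing edges have been explored.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in reverse finishing order; every
    # search started from an unassigned node collects exactly one component.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components


# Hypothetical toy web graph: a 3-site cycle, a 2-site cycle, and one leaf.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("e", "d"), ("e", "f")]
components = strongly_connected_components(edges)
print(sorted(components, key=len, reverse=True))
```

The explicit stacks avoid Python's recursion limit, which matters for graphs with millions of sites like the one studied here.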
On the evolution of search engine rankings
In Proceedings of the 2009 WEBIST Conference
Cited by 3 (3 self)
Since the early days of the web, users have been relying on it to get informed and make decisions. When the web was relatively small, web directories were built and maintained using human experts to screen and categorize pages according to their characteristics. By the mid-1990s, however, it was apparent that the human expert model of categorizing web pages does not scale. The first search engines appeared and they have been evolving ever since, taking over the role that web directories used to play. But what need makes a search engine evolve? Beyond the financial objectives, there is a need for quality in search results. Users interact with search engines through search query results. Search engines know that the quality of their ranking will determine how successful they are. If users perceive the results as valuable and reliable, they will use the search engine again. Otherwise, it is easy for them to switch to another search engine. Search results, however, are not simply based on well-designed scientific principles, but they are influenced by web spammers. Web spamming, the practice of introducing artificial text and links into web pages to affect the results of web searches, has been recognized as a major search engine problem. It is also a serious problem for users because they are not aware of it and they tend to confuse trusting the search engine with trusting the results of a search. In this paper, we analyze the influence that web spam has on the evolution of the search engines and we identify the strong relationship of spamming methods on the web to propagandistic techniques in society. Our analysis provides a foundation for understanding why spamming works and offers new insight on how to address it. In particular, it suggests that one could use social anti-propagandistic techniques to recognize web spam.
Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn
In Graph Labeling Workshop in conjunction with ECML/PKDD, 2007
Cited by 3 (3 self)
We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semi-supervised learning has the assumption that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly churn occurs in bursts in groups of a social network.
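One simple way to measure the neighbor-similarity assumption mentioned above is to count how often the two endpoints of an edge carry the same label. The sketch below does this on hypothetical toy data; the function name, the edge list, and the labels are assumptions, and the paper may well use a different measure.

```python
def label_agreement(edges, labels):
    """Fraction of edges between two labeled nodes whose labels match."""
    same = total = 0
    for u, v in edges:
        if u in labels and v in labels:  # skip edges touching unlabeled nodes
            total += 1
            if labels[u] == labels[v]:
                same += 1
    return same / total if total else None


# Hypothetical toy graph: spam pages mostly link to spam, honest to honest.
edges = [("s1", "s2"), ("s2", "s3"), ("h1", "h2"), ("h2", "s1"), ("h1", "x")]
labels = {"s1": "spam", "s2": "spam", "s3": "spam",
          "h1": "honest", "h2": "honest"}

agreement = label_agreement(edges, labels)
print(agreement)
```

A value well above what random labeling would give supports propagating labels along edges; a value near the chance level suggests the graph carries little label information.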
Incorporating Trust into Web Search
2007
Cited by 1 (0 self)
The Web today includes many pages intended to deceive search engines, in which content or links are created to attain an unwarranted result ranking. Since the links among web pages are used to calculate authority, ranking systems should take into consideration which pages contain content to be trusted and which do not. In this paper, we assume the existence of a mechanism, such as, but not limited to, Gyöngyi et al.’s TrustRank, to estimate the trustworthiness of a given page. However, unlike existing work that uses trust to identify or demote spam pages, we propose how to incorporate a given trust estimate into the process of calculating authority for a cautious surfer. We apply a total of forty-five queries over two large, real-world datasets to demonstrate that incorporating trust into an authority calculation using our cautious surfer can improve PageRank’s precision at 10 by 11-26% and average top-10 result quality by 53-81%.
Both Sides of the Digital Battle for a High Rank from a Search Engine
2005
Because of the financial gain in achieving a high search engine rank, modifying a web page to unfairly alter its ranking has become a common practice. Techniques to achieve this are generally known as web spam because of the adverse effect on the relevance of the results returned by the search engine. This paper contains an overview of the techniques used to create web spam and the defenses available to search engines. In addition, we comment on why further research into combating web spam is required. This paper provides insights into the state-of-the-art of both sides of the battle to achieve an unfairly high search engine placement for a page.

