Results 1 - 10
of
24
Detecting spam web pages through content analysis
- In Proceedings of the World Wide Web conference
, 2006
"... In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatica ..."
Abstract
-
Cited by 110 (3 self)
- Add to MetaCart
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
Topical TrustRank: using topicality to combat web spam
, 2006
"... Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm that can combat web spam. However, TrustRank is vulnerable in the sense that the seed set used by TrustRank may not be sufficiently representative to cover well the different topics on the ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm that can combat web spam. However, TrustRank is vulnerable in the sense that the seed set used by TrustRank may not be sufficiently representative to cover well the different topics on the Web. Also, for a given seed set, TrustRank has a bias towards larger communities. We propose the use of topical information to partition the seed set and calculate trust scores for each topic separately to address the above issues. A combination of these trust scores for a page is used to determine its ranking. Experimental results on two large datasets show that our Topical TrustRank has a better performance than TrustRank in demoting spam sites or pages. Compared to TrustRank, our best technique can decrease spam from the top ranked sites by as much as 43.1%.
Propagating Trust and Distrust to Demote Web Spam
, 2006
"... Web spamming describes behavior that attempts to deceive search engine's ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how Pag ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
Web spamming describes behavior that attempts to deceive search engine's ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how PageRank propagates authority scores among Web pages. This type of propagation may be suited for propagating authority, but it is not optimal for calculating trust scores for demoting spam sites. In this paper,
BLEWS: Using Blogs to Provide Context for News Articles
"... An overwhelming number of news articles are available every day via the internet. Unfortunately, it is impossible for us to peruse more than a handful; furthermore it is difficult to ascertain an article’s social context, i.e., is it popular, what sorts of people are reading it, etc. In this paper, ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
An overwhelming number of news articles are available every day via the internet. Unfortunately, it is impossible for us to peruse more than a handful; furthermore it is difficult to ascertain an article’s social context, i.e., is it popular, what sorts of people are reading it, etc. In this paper, we develop a system to address this problem in the restricted domain of political news by harnessing implicit and explicit contextual information from the blogosphere. Specifically, we track thousands of blogs and the news articles they cite, collapsing news articles that have highly overlapping content. We then tag each article with the number of blogs citing it, the political orientation of those blogs, and the level of emotional charge expressed in the blog posts that link to the news article. We summarize and present the results to the user via a novel visualization which displays this contextual information; the user can then find the most popular articles, the articles most cited by liberals, the articles most emotionally discussed in the political blogosphere, etc.
Disorder inequality: A combinatorial approach to nearest neighbor search
- In WSDM’08
"... We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values ..."
Abstract
-
Cited by 11 (4 self)
- Add to MetaCart
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
Detecting Link Spam using Temporal Information
- Proc. of ICDM-2006
, 2006
"... How to effectively protect against spam on search ranking results is an important issue for contemporary web search engines. This paper addresses the problem of combating one major type of web spam: ‘link spam. ’ Most of the previous work on anti link spam managed to make use of one snapshot of web ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
How to effectively protect against spam on search ranking results is an important issue for contemporary web search engines. This paper addresses the problem of combating one major type of web spam: ‘link spam. ’ Most of the previous work on anti link spam managed to make use of one snapshot of web data to detect spam, and thus it did not take advantage of the fact that link spam tends to result in drastic changes of links in a short time period. To overcome the shortcoming, this paper proposes using temporal information on links in detection of link spam, as well as other information. Specifically, it defines temporal features such as In-link Growth Rate (IGR) and In-link Death Rate (IDR) in a spam classification model (i.e., SVM). Experimental results on web domain graph data show that link spam can be successfully detected with the proposed method. 1.
Finding text reuse on the web
- In WSDM ’09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
, 2009
"... With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have been previ ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have been previously proposed for detecting such text reuse between different sources, however these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.
Mining page farms and its application in link spam detection
, 2007
"... Understanding the general relations of Web pages and their environments is important with a few interesting applications such as Web spam detection. In this thesis, we study the novel problem of page farm mining and its application in link spam detection. A page farm is the set of Web pages contribu ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
Understanding the general relations of Web pages and their environments is important with a few interesting applications such as Web spam detection. In this thesis, we study the novel problem of page farm mining and its application in link spam detection. A page farm is the set of Web pages contributing to (a major portion of) the PageRank score of a target page. We show that extracting page farms is computationally expensive, and propose heuristic methods. We propose the concept of link spamicity based on page farms to evaluate the degree of a Web page being link spam. Using a real sample of more than 3 million Web pages, we analyze the statistics of page farms. We examine the effectiveness of our spamicity-based link spam detection methods using a newly available real data set of spam pages. The empirical study results strongly indicate that our methods are effective. Keywords: page farm; link spam; PageRank; link spamicity Subject Terms: Web search engines; Text processing (Computer science); World Wide Web iii iv To my parents, and my sister. “I like the dreams of the future better than the history of the past.” v
Characterizing Web Spam Using Content and HTTP Session Analysis ABSTRACT
"... Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages. 1.
Detecting Splogs via Temporal Dynamics Using Self-Similarity Analysis
"... This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors. We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the blog temporal dynamics using selfsimilarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study the blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlation between attributes and posts, depending on the type of blogs (normal blogs and splogs). (c) We propose two types of novel temporal features to capture the splog temporal characteristics. In our splog detector, these novel features are combined

