A reference collection for Web spam
- SIGIR Forum, 2006
Cited by 36 (12 self)
We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating whether or not the hosts include Web spam aspects. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.
Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection
- In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), 2006
Cited by 26 (12 self)
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly because the large size of Web collections makes many algorithms infeasible in practice.
Web Spam Detection: link-based and content-based techniques
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. This platform is, to some extent, mediated by search engines in order to meet the needs of users seeking information. Search engines are the “dragons” that keep a valuable treasure: information [13]. Given the vast amount of information available on the Web, it is customary to answer queries with only a small set of results (typically 10 or 20 pages at most). Search engines must therefore rank Web pages in order to create a short list of high-quality results for users. Web spam can significantly degrade the quality of search engine results, so commercial search engines have a strong incentive to detect spam pages efficiently and accurately. Here we present the main techniques recently introduced for Web spam detection and demotion.
A Structural, Content-Similarity Measure for Detecting Spam Documents on the Web
Purpose: The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results degrade the performance of search engines and frustrate users, who are required to filter out useless information. To improve the quality of Web searches, the number of spam documents on the Web must be reduced, if they cannot be eradicated entirely. Design/methodology/approach: In this paper, we present a novel approach for identifying spam Web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in the markup data structure. Findings: By considering the content and markup of Web documents, we develop a spam-detection tool that is (i) reliable, since it correctly classifies 84.5% of spam and legitimate Web documents, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. Research limitations/implications: Because the bigram-correlation values employed in our spam-detection approach are computed from the unigram-correlation factors, the approach incurs additional computation time during spam detection and could produce a higher number of misclassified spam Web documents. Originality/value: We have verified that our spam-detection approach outperforms existing anti-spam methods by at least 3% in terms of F-measure.
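The title/body mismatch idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the correlation table, the scoring rule (average of each title word's best match in the body), the function names, and the 0.5 threshold are all assumptions made up for illustration, standing in for the precomputed word-correlation factors the abstract mentions.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, hand-made correlation factors for illustration only.
# The abstract says the real factors are precomputed; their values are not given.
CORRELATION: Dict[Tuple[str, str], float] = {
    ("cheap", "discount"): 0.8,
    ("flights", "airfare"): 0.9,
    ("flights", "airline"): 0.7,
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def correlation(a: str, b: str) -> float:
    """Look up a symmetric correlation factor; identical words score 1.0."""
    if a == b:
        return 1.0
    return CORRELATION.get((a, b), CORRELATION.get((b, a), 0.0))


def title_body_similarity(title: str, body: str) -> float:
    """Average, over title words, of the best correlation with any body word."""
    title_words = tokenize(title)
    body_words = tokenize(body)
    if not title_words or not body_words:
        return 0.0
    best_matches = [
        max(correlation(t, b) for b in body_words) for t in title_words
    ]
    return sum(best_matches) / len(title_words)


def looks_mismatched(title: str, body: str, threshold: float = 0.5) -> bool:
    """Flag a document whose title is weakly related to its body (assumed threshold)."""
    return title_body_similarity(title, body) < threshold
```

With this toy table, a page titled "Cheap flights" whose body reads "Discount airfare from every airline" scores well above the threshold, while the same title over an unrelated body scores 0.0 and is flagged.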

