Know your neighbors: Web spam detection using the web topology
In Proceedings of SIGIR, 2007
Cited by 43 (8 self)
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.
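As a rough illustration of methods (i) and (iii) from the abstract above, the following Python sketch relabels hosts by cluster majority vote and computes a neighbor-based feature. The function names, the input dictionaries, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify the clustering algorithm, the base classifier, or the exact feature definition.

```python
from collections import defaultdict


def majority_vote_by_cluster(predictions, clusters):
    """Method (i): give every host the majority label of its cluster.

    predictions: host -> True if the base classifier predicts spam
    clusters:    host -> cluster id (from any graph clustering step)
    """
    members = defaultdict(list)
    for host, cluster_id in clusters.items():
        members[cluster_id].append(host)

    relabeled = {}
    for hosts in members.values():
        spam_votes = sum(predictions[h] for h in hosts)
        # Assumption: a tie is resolved as non-spam.
        is_spam = spam_votes * 2 > len(hosts)
        for h in hosts:
            relabeled[h] = is_spam
    return relabeled


def neighbor_spam_fraction(predictions, links):
    """Method (iii): fraction of a host's neighbors predicted as spam.

    The returned value can be appended to each host's feature vector
    before retraining. Links are treated as undirected here (assumption).
    """
    neighbors = defaultdict(set)
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    fractions = {}
    for host in predictions:
        linked = neighbors.get(host, set())
        if linked:
            fractions[host] = sum(predictions[n] for n in linked) / len(linked)
        else:
            fractions[host] = 0.0  # isolated host: no neighbor evidence
    return fractions


# Hypothetical toy data: four hosts, two clusters, a chain of links.
predictions = {"a": True, "b": True, "c": False, "d": False}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1}
links = [("a", "b"), ("b", "c"), ("c", "d")]

relabeled = majority_vote_by_cluster(predictions, clusters)
fractions = neighbor_spam_fraction(predictions, links)
```

In this toy example host "c" is flipped to spam because two of the three hosts in its cluster are predicted as spam, which mirrors the observation that linked hosts tend to share a class.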
A reference collection for Web spam
SIGIR Forum, 2006
Cited by 36 (12 self)
We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating whether the hosts include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.
Measuring Similarity to Detect Qualified Links
2006
Cited by 1 (1 self)
The success of link-based ranking algorithms rests on the assumption that links imply merit of the target pages. However, on the real web, there exist links for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. In order to provide high quality search results, it is important to detect them and reduce their influence. In this paper, a method is proposed to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show that our approach is able to boost Bharat and Henzinger’s imp algorithm by around 9% in terms of precision. It also outperforms a previous approach focusing on link spam detection.
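The pipeline described in this abstract (score each link by source/target similarity, drop low-scoring links, then run link analysis on what remains) might be sketched as follows. This is a minimal Python sketch under stated assumptions: a single bag-of-words cosine similarity and a fixed threshold stand in for the paper's multiple similarity measures and trained classifier, and the names `cosine_similarity`, `drop_noisy_links`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)


def drop_noisy_links(links, page_text, threshold=0.1):
    """Keep only links whose source and target pages look related.

    The fixed threshold is a stand-in (assumption) for a classifier
    trained on several similarity features.
    """
    return [
        (src, dst)
        for src, dst in links
        if cosine_similarity(page_text[src], page_text[dst]) >= threshold
    ]


# Hypothetical toy data: p1 links to a related page and an unrelated one.
page_text = {
    "p1": "web spam detection",
    "p2": "spam detection methods",
    "p3": "cheap watches online",
}
links = [("p1", "p2"), ("p1", "p3")]

reduced_links = drop_noisy_links(links, page_text)
```

A link analysis algorithm would then be run on `reduced_links` rather than on the full link set.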
Application of DHT Protocol in IP Cloaking
The paper examines malicious spyware, which poses a significant threat to desktop security and undermines the integrity of the system. The misuse of websites to serve exploit code that compromises hosts on the Internet has increased drastically in recent years. Many approaches to tackle the problem of spam have been proposed. Spamming is any deliberate action taken solely to boost a Web page’s position in search engine results, incommensurate with the page’s real value. Web spam, the set of Web pages that result from spamming, is a deliberate manipulation of search engine indexes and is one of the search engine optimization methods. The paper provides an efficient way to prevent users from browsing malicious Web sites by offering a service that checks a Web site for malignity before the user opens it. If a Web site has been reported as malicious, the browser can warn the user and suggest not visiting it.
An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models
With the massive use of the internet and search engines, Web spam has become a major problem. Web spam can be detected by analyzing various features of web pages and categorizing the pages as spam or non-spam. The proposed work uses unsupervised learning algorithms to characterize web pages based on link-based and content-based features, comparing the various sources of information in the source and target pages. The first unsupervised technique considered is the Hidden Markov Model, which captures the different browsing patterns of users. Users may not only access the web through direct hyperlinks but may also jump from one page to another by typing URLs or by opening multiple windows. Unsupervised techniques have no previous class definitions to map outcomes to. As a result, they find all possible probabilities of relation between the source and target page. This helps to attain higher efficiency in the detection of web spam even if the dataset used is small. The other unsupervised methods used for the same task are the Self Organizing Map (SOM) and the Adaptive Resonance Theory (ART). Finally, the performance of all the techniques is evaluated and presented in increasing order of their performance metric.

