Know your neighbors: Web spam detection using the web topology
- In Proceedings of SIGIR, 2007
- Cited by 43 (8 self)
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.
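Two of the three topology steps named in the abstract, the cluster majority vote (i) and the neighbor-label features (iii), can be illustrated with a minimal sketch. The function names, the "spam"/"nonspam" label strings, and the dictionary inputs below are assumptions made for illustration; they are not taken from the paper, and the base classifier and the clustering itself are assumed to exist already.

```python
from collections import Counter, defaultdict


def majority_vote_by_cluster(predicted, cluster_of):
    """Relabel each host with the most common predicted label in its cluster."""
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    return {
        host: votes[cluster_of[host]].most_common(1)[0][0]
        for host in predicted
    }


def neighbor_spam_fraction(predicted, neighbors):
    """Per-host feature: fraction of linked hosts that were predicted as spam."""
    features = {}
    for host in predicted:
        linked = neighbors.get(host, [])
        spam = sum(1 for other in linked if predicted.get(other) == "spam")
        features[host] = spam / len(linked) if linked else 0.0
    return features
```

The second function only produces the extra feature; retraining the classifier on the original features plus this one would be a separate step.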
Site Level Noise Removal for Search Engines
- In Proc. of International World Wide Web Conference (WWW), 2006
- Cited by 13 (3 self)
The currently booming search engine industry has led many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures which have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links over the web graph, whether they are spam or simple relationships between real world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and eliminate site level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after having applied our techniques.
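As a rough illustration of what a site level mutual reinforcement check might look like, the sketch below aggregates page-level links into site-to-site counts and flags pairs of sites that link to each other heavily in both directions. This is not the paper's algorithm: the function names, the use of the URL host as the "site", and the `threshold` value are all assumptions chosen for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Treat the URL host as the site (a simplifying assumption)."""
    return urlsplit(url).netloc.lower()


def site_link_counts(page_links):
    """Count page-level links between distinct sites, per ordered site pair."""
    counts = Counter()
    for source, target in page_links:
        s, t = site_of(source), site_of(target)
        if s and t and s != t:
            counts[(s, t)] += 1
    return counts


def mutual_reinforcement_pairs(counts, threshold=2):
    """Flag site pairs with at least `threshold` links in both directions."""
    flagged = set()
    for (s, t), n in counts.items():
        if n >= threshold and counts.get((t, s), 0) >= threshold:
            flagged.add(tuple(sorted((s, t))))
    return flagged
```

Links between flagged pairs could then be down-weighted or dropped before computing a ranking; which of the two works better is not something this sketch addresses.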
Incorporating Trust into Web Search
- 2007
- Cited by 1 (0 self)
The Web today includes many pages intended to deceive search engines, in which content or links are created to attain an unwarranted result ranking. Since the links among web pages are used to calculate authority, ranking systems should take into consideration which pages contain content to be trusted and which do not. In this paper, we assume the existence of a mechanism, such as, but not limited to, Gyöngyi et al.’s TrustRank, to estimate the trustworthiness of a given page. However, unlike existing work that uses trust to identify or demote spam pages, we propose how to incorporate a given trust estimate into the process of calculating authority for a cautious surfer. We apply a total of forty-five queries over two large, real-world datasets to demonstrate that incorporating trust into an authority calculation using our cautious surfer can improve PageRank’s precision at 10 by 11-26% and average top-10 result quality by 53-81%.
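The abstract does not spell out the cautious surfer model, so the following is only one plausible reading, not the authors' formulation. It assumes trust scores in [0, 1] are already available, lets the surfer follow a link with probability `damping * trust[current page]`, chooses the target in proportion to the targets' trust, and otherwise jumps to a uniformly random page. The function name and all parameters are hypothetical.

```python
def trust_weighted_rank(out_links, trust, damping=0.85, iterations=50):
    """Power iteration for a PageRank-like score biased by per-page trust."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        jump_mass = 0.0
        for u in nodes:
            targets = out_links.get(u, [])
            weight = sum(trust.get(v, 0.0) for v in targets)
            # Follow links less often from pages that are trusted less.
            follow = damping * trust.get(u, 0.0) if weight > 0 else 0.0
            if follow > 0:
                for v in targets:
                    nxt[v] += rank[u] * follow * trust.get(v, 0.0) / weight
            jump_mass += rank[u] * (1.0 - follow)
        # Mass that did not follow a link is spread uniformly.
        for u in nodes:
            nxt[u] += jump_mass / n
        rank = nxt
    return rank
```

With a hub that links to one highly trusted and one barely trusted page, the trusted page ends up with the higher score, since it receives most of the hub's link-following mass.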
Combining Anchor Text Categorization and Graph Analysis for Paid Link Detection
In order to artificially boost the rank of commercial pages in search engine results, search engine optimizers pay for links to these pages on other websites. Identifying paid links is important for a web search engine to produce highly relevant results. In this paper we introduce a novel method of identifying such links. We start by training a classifier of anchor text topics and analyzing web pages for the diversity of their outgoing commercial links. Then we use this information and analyze the link graph of the Russian Web to find pages that sell links and sites that buy links, and to identify the paid links. Testing on manually marked samples showed high efficiency of the algorithm.
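The abstract does not say how diversity of outgoing commercial links is measured. One common choice, assumed here purely for illustration, is the Shannon entropy of the anchor text topics on a page. The sketch takes the topic labels as given (standing in for the output of the anchor text classifier), and the function names and the `min_entropy` cutoff are hypothetical.

```python
import math
from collections import Counter


def topic_entropy(topics):
    """Shannon entropy (in bits) of a page's outgoing commercial link topics."""
    counts = Counter(topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def candidate_link_sellers(page_topics, min_entropy=1.5):
    """Pages whose commercial links span unusually many unrelated topics."""
    return {
        page
        for page, topics in page_topics.items()
        if topic_entropy(topics) >= min_entropy
    }
```

A page linking to four unrelated commercial topics scores 2 bits, while a page whose commercial links all share one topic scores 0, so only the former would be passed on to the graph analysis stage.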
Emerging Applications of Link Analysis for Ranking
- 2007
The booming growth of digitally available information has thoroughly increased the popularity of search engine technology over the past years. At the same time, upon interacting with this overwhelming quantity of data, people usually inspect only the very few most relevant items for their task. It is thus very important to utilize high quality ranking measures which efficiently identify these items under the various information retrieval activities we pursue. In this thesis we provide a twofold contribution to the Information Retrieval field. First, we identify those application areas in which a user oriented ranking is missing, though extremely necessary in order to facilitate a qualitative access to relevant resources. Second, for each of these areas we propose appropriate ranking algorithms which exploit their underlying social characteristics, either at the macroscopic, or at the microscopic level. We achieve this by utilizing link analysis techniques, which build on top of the graph based representation of relations between resources in order to rank them or simply to identify social patterns relative to the investigated data set. We

