Detecting phrase-level duplication on the world wide web
- In Proceedings of the 28th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval
, 2005
Cited by 32 (1 self)
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated “spam” web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such “slice and dice” generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
Efficient similarity joins for near duplicate detection
- In WWW
, 2008
Cited by 32 (5 self)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show that our proposed algorithms outperform previous algorithms on several real datasets.
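The prefix filtering principle mentioned in this abstract can be illustrated with a minimal sketch. It assumes Jaccard similarity over token sets and a rarest-token-first global ordering; the function names `jaccard` and `prefix_filter_join` are hypothetical, and the sketch shows only the baseline principle, not the additional filters the paper proposes.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prefix_filter_join(records, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard similarity >= threshold.

    Assumes 0 < threshold <= 1 and string tokens. Empty records are skipped.
    """
    token_sets = [set(r) for r in records]
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for s in token_sets for tok in s)
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in token_sets]

    index = defaultdict(list)  # prefix token -> ids of earlier records
    results = []
    for i, tokens in enumerate(ordered):
        if not tokens:
            continue
        # A qualifying partner must share at least ceil(t * |x|) tokens, so two
        # qualifying records always share a token within their prefixes.
        # The small epsilon guards against floating-point round-up.
        min_overlap = math.ceil(threshold * len(tokens) - 1e-9)
        prefix_len = len(tokens) - min_overlap + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Only candidates are verified with the exact similarity.
        for j in candidates:
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                results.append((j, i))
    return sorted(results)
```

Only records that share a prefix token are ever compared, which is what lets the join avoid scoring all possible pairs.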
Thwarting the nigritude ultramarine: learning to identify link spam
- In Proceedings of the 16th European Conference on Machine Learning (ECML)
, 2005
Cited by 30 (0 self)
The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
Taper: Tiered approach for eliminating redundancy in replica synchronization
- In USENIX Conference on File and Storage Technologies
, 2005
Cited by 19 (0 self)
We present TAPER, a scalable data replication protocol that synchronizes a large collection of data across multiple geographically distributed replica locations. TAPER can be applied to a broad range of systems, such as software distribution mirrors, content distribution networks, backup and recovery, and federated file systems. TAPER is designed to be bandwidth efficient, scalable and content-based, and it does not require prior knowledge of the replica state. To achieve these properties, TAPER provides: i) four pluggable redundancy elimination phases that balance the trade-off between bandwidth savings and computation overheads, ii) a hierarchical hash tree based directory pruning phase that quickly matches identical data from the granularity of directory trees to individual files, iii) a content-based similarity detection technique using Bloom filters to identify similar files, and iv) a combination of coarse-grained chunk matching with finer-grained block matches to achieve bandwidth efficiency. Through extensive experiments on various datasets, we observe that in comparison with rsync, a widely-used directory synchronization tool, TAPER reduces bandwidth by 15% to 71%, performs faster matching, and scales to a larger number of replicas.
Using Bloom filters to refine web search results
- In Proc. 7th WebDB
, 2005
Cited by 13 (0 self)
Search engines have primarily focused on presenting the most relevant pages to the user quickly. A less well explored aspect of improving the search experience is to remove or group all near-duplicate documents in the results presented to the user. In this paper, we apply a Bloom filter based similarity detection technique to address this issue by refining the search results presented to the user. First, we present and analyze our technique for finding similar documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in compactly representing and quickly matching pages for similarity testing. Later, we demonstrate how a number of results of popular and random search queries retrieved from different search engines, Google, Yahoo, MSN, are similar and can be eliminated or re-organized.
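A rough sketch of how content-defined chunking and Bloom filters can combine for similarity testing is given below. Everything concrete in it is an assumption made for illustration: the window size, divisor, filter size, number of hash positions, the use of MD5 and SHA-1, and the function names `chunks`, `bloom`, and `similarity` are not taken from the paper.

```python
import hashlib

WINDOW = 8          # bytes hashed to decide a chunk boundary (assumed value)
DIVISOR = 16        # a boundary is expected about once per DIVISOR bytes (assumed)
FILTER_BITS = 1024  # Bloom filter size in bits (assumed)
NUM_HASHES = 3      # bit positions set per chunk (assumed)


def chunks(data: bytes):
    """Split data where a hash of the trailing WINDOW bytes hits a fixed residue.

    Boundaries depend only on local content, so an insertion early in a
    document shifts later chunks without changing their contents.
    """
    out, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        h = int.from_bytes(hashlib.md5(data[i - WINDOW:i]).digest()[:4], "big")
        if h % DIVISOR == 0:
            out.append(data[start:i])
            start = i
    if start < len(data):
        out.append(data[start:])
    return out


def bloom(data: bytes) -> int:
    """Bloom filter over the chunk hashes, stored as the bits of a Python int."""
    bits = 0
    for chunk in chunks(data):
        digest = hashlib.sha1(chunk).digest()
        for k in range(NUM_HASHES):
            pos = int.from_bytes(digest[4 * k:4 * k + 4], "big") % FILTER_BITS
            bits |= 1 << pos
    return bits


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of the bits set in a's filter that are also set in b's filter."""
    fa, fb = bloom(a), bloom(b)
    set_a = bin(fa).count("1")
    return bin(fa & fb).count("1") / set_a if set_a else 1.0
```

Comparing two pages then costs one bitwise AND over fixed-size filters, regardless of page length. The score is asymmetric and approximate, since unrelated chunks can collide on the same bit positions.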
Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages
- In Proceedings of WebDB
, 2004
Cited by 9 (0 self)
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.
The Discoverability of the Web
- In Proc. WWW
, 2007
Cited by 9 (1 self)
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
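The maximum cover formulation and the overhead measure in this abstract can be made concrete with a small sketch. It uses the standard greedy heuristic for maximum cover, which is an assumption here and not necessarily the algorithm the paper evaluates; the names `greedy_cover` and `overhead` and the page labels are hypothetical. Overhead is read as fetches of known pages divided by new pages discovered.

```python
def greedy_cover(links, budget):
    """Pick up to `budget` known pages to fetch, greedily maximizing new pages found.

    links: dict mapping a known page to the set of new pages it links to
           (the perfect-foreknowledge setting).
    Returns (chosen pages in pick order, set of discovered new pages).
    """
    discovered, chosen = set(), []
    remaining = dict(links)
    for _ in range(budget):
        if not remaining:
            break
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - discovered))
        if not remaining[best] - discovered:
            break  # no remaining page reveals anything new
        chosen.append(best)
        discovered |= remaining.pop(best)
    return chosen, discovered


def overhead(chosen, discovered):
    """Average number of fetches of known pages per new page discovered."""
    return len(chosen) / len(discovered)


links = {"hub": {"n1", "n2", "n3"}, "blog": {"n3", "n4"}, "old": {"n1"}}
chosen, discovered = greedy_cover(links, budget=3)
print(chosen, overhead(chosen, discovered))  # ['hub', 'blog'] 0.5
```

Under this reading, the 3% overhead quoted in the abstract corresponds to about 3 fetches of known pages per 100 new pages discovered.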
Undue influence: Eliminating the impact of link plagiarism on web search rankings
- In Proceedings of the 21st Annual ACM Symposium on Applied Computing
, 2006
Cited by 6 (1 self)
Link farm spam and replicated pages can greatly deteriorate link-based ranking algorithms like HITS. In order to identify and neutralize link farm spam and replicated pages, we look for sufficient material copied from one page to another. In particular, we focus on the use of “complete hyperlinks” to distinguish link targets by the anchor text used. We build and analyze the bipartite graph of documents and their complete hyperlinks to find pages that share anchor text and link targets. Link farms and replicated pages are identified in this process, permitting the influence of problematic links to be reduced in a weighted adjacency matrix. Experiments and user evaluation show significant improvement in the quality of results produced using HITS-like methods.
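The idea of a bipartite graph between documents and their complete hyperlinks can be sketched as follows. A complete hyperlink is modeled as an (anchor text, target URL) pair, as the abstract describes; the function name `shared_link_pairs`, the `min_shared` threshold, and the example URLs are assumptions for illustration, not the paper's actual procedure or parameters.

```python
from collections import defaultdict
from itertools import combinations


def shared_link_pairs(pages, min_shared=2):
    """Find page pairs that share at least `min_shared` complete hyperlinks.

    pages: dict mapping a page id to a set of (anchor_text, target_url) pairs.
    Returns a dict mapping (page_a, page_b), with page_a < page_b, to the
    number of complete hyperlinks the two pages have in common.
    """
    # One side of the bipartite graph: complete hyperlink -> pages containing it.
    by_link = defaultdict(set)
    for page, links in pages.items():
        for link in links:
            by_link[link].add(page)

    # Count co-occurrences of pages on the same complete hyperlink.
    shared = defaultdict(int)
    for holders in by_link.values():
        for a, b in combinations(sorted(holders), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


pages = {
    "p1": {("cheap pills", "http://t.example/a"), ("casino", "http://t.example/b")},
    "p2": {("cheap pills", "http://t.example/a"), ("casino", "http://t.example/b"),
           ("news", "http://n.example/")},
    "p3": {("cheap pills", "http://other.example/")},
}
print(shared_link_pairs(pages))  # {('p1', 'p2'): 2}
```

Page `p3` reuses an anchor text but points at a different target, so it does not match: both anchor text and target must agree. Links between flagged pairs could then be given reduced weight in the adjacency matrix used for ranking.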
The Co-Evolution of Systems and Communities in Free and Open Source Software Development
- In S. Koch (ed.), Free/Open Source Software Development, 59-82, Idea Group Publishing
, 2004
Cited by 5 (1 self)
Under consideration for publication in Knowledge and Information
Introducing the Portuguese web archive initiative
Cited by 4 (4 self)
This paper introduces the Portuguese Web Archive initiative, presenting its main objectives and work in progress. Term search over web archive collections is a desirable feature that raises new challenges. We discuss how the size of the term index could be reduced without significantly decreasing the quality of search results. The results obtained from the first crawl show that the Portuguese web comprises at least approximately 54 million contents, corresponding to 2.8 TB of data. The crawl of the Portuguese web was stored in 2 TB of disk space using the ARC compressed format.

