Spotsigs: robust and efficient near duplicate detection in large web collections
- In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
, 2008
Cited by 11 (0 self)
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near-duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
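The signature idea in contribution 1) can be illustrated with a minimal sketch. It is not the authors' implementation: the antecedent list, the stopword list, the `chain_len` parameter, and the use of plain set-based Jaccard similarity are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical word lists; the abstract does not say which stopwords are used.
ANTECEDENTS = {"a", "an", "the", "is", "was", "there", "do", "to", "that"}
STOPWORDS = ANTECEDENTS | {"of", "and", "in", "on", "for", "it", "by"}


def spot_signatures(text, chain_len=2):
    """Return the set of signatures 'antecedent:term1:...' found in text.

    Each occurrence of an antecedent is followed forward, skipping
    stopwords, until chain_len content terms have been collected.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following in STOPWORDS:
                continue  # stopwords never become part of a chain
            chain.append(following)
            if len(chain) == chain_len:
                break
        if len(chain) == chain_len:  # drop incomplete chains at the end
            signatures.add(token + ":" + ":".join(chain))
    return signatures


def jaccard(a, b):
    """Set-based Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, a navigation bar such as "Home News Sports Contact" contains no antecedents and therefore yields no signatures, which is the filtering effect the abstract describes.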
Building blocks for hierarchical latent variable models
- In Proceedings of the 3rd International Conference on Independent Component Analysis and Blind Signal Separation
, 2006
Cited by 10 (1 self)
We propose a new method for rapid 3D object indexing that combines feature-based methods with coarse alignment-based matching techniques. Our approach achieves a sublinear complexity on the number of models, maintaining at the same time a high degree of performance for real 3D sensed data that is acquired in largely uncontrolled settings. The key component of our method is to first index surface descriptors computed at salient locations from the scene into the whole model database using Locality Sensitive Hashing (LSH), a probabilistic approximate nearest neighbor method. Progressively complex geometric constraints are subsequently enforced to further prune the initial candidates and eliminate false correspondences due to inaccuracies in the surface descriptors and the errors of the LSH algorithm. The indexed models are selected based on the MAP rule using posterior probability of the models estimated in the joint 3D-signature space. Experiments with real 3D data employing a large database of vehicles, most of them very similar in shape, containing 1,000,000 features from more than 365 models demonstrate a high degree of performance in the presence of occlusion and obscuration, unmodeled vehicle interiors and part articulations, with an average processing time between 50 and 100 seconds per query. Index Terms: Three-dimensional object recognition, hashing, indexing, pose estimation, approximate nearest neighbor.
Tracking Web Spam with HTML Style Similarities
Cited by 7 (1 self)
Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as well as less legitimate spamdexing attempts (e.g. link farms, faked directories...). Pages built using the same generating method (template or script) share a common “look and feel” that is not easily detected by common text classification methods, but is more related to stylometry. In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Because the algorithm we propose is based on locality sensitive hashing (LSH), we briefly review this technique. We describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers, and give an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
DPTree: A balanced tree based indexing framework for peer-to-peer systems
, 2006
Cited by 7 (0 self)
Peer-to-peer (P2P) systems have been widely used for exchange of voluminous information and resources among thousands or even millions of users. Since shared data are normally identified by multiple attributes, a fundamental issue in P2P systems is to efficiently support complex queries on multidimensional data. Prior works suffer from some fundamental limitations, such as being constrained to support only certain types of queries, excessive maintenance overheads, etc. In this study, we propose a framework, called distributed peer tree (DPTree), which efficiently supports various types of queries on multidimensional data in P2P systems based on balanced tree indexes. DPTree achieves this efficiency through the following designs: 1) distributing the tree structure among peers in a way that preserves the nice properties of balanced tree structures yet avoids single points of failure and performance bottlenecks; 2) organizing peers into an overlay structure that enables efficient navigation yet is easy to maintain; 3) an efficient navigation algorithm; 4) an innovative wavelet-based load balancing mechanism. Through extensive performance evaluation, we verify the superiority of DPTree over existing works.
Peer-to-peer similarity search in metric spaces
- In Proceedings of VLDB’07
, 2007
Cited by 6 (3 self)
This paper addresses the efficient processing of similarity queries in metric spaces, where data is horizontally distributed across a P2P network. The proposed approach does not rely on arbitrary data movement, hence each peer joining the network autonomously stores its own data. We present SIMPEER, a novel framework that dynamically clusters peer data in order to build distributed routing information at super-peer level. SIMPEER allows the evaluation of range and nearest neighbor queries in a distributed manner that reduces communication cost, network latency, bandwidth consumption and computational overhead at each individual peer. SIMPEER utilizes a set of distributed statistics and guarantees that all objects similar to the query are retrieved, without necessarily flooding the network during query processing. The statistics are employed to estimate an adequate query radius for k-nearest neighbor queries and to transform such a query into a range query. Our experimental evaluation employs both real-world and synthetic data collections, and our results show that SIMPEER performs efficiently, even with a high degree of distribution.
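How cluster summaries can let a range query skip peers without missing results can be sketched as follows. The abstract does not give SIMPEER's actual routing rule; this sketch assumes each cluster is summarized by a center and a covering radius and applies the standard triangle-inequality test for metric spaces. The function name `clusters_to_visit` and the summary layout are hypothetical.

```python
import math


def clusters_to_visit(query, radius, clusters, dist=math.dist):
    """Return the names of clusters that may hold objects within
    `radius` of `query`.

    `clusters` maps a name to (center, cluster_radius), where every
    object of the cluster lies within cluster_radius of center
    (an assumed summary format). By the triangle inequality, a cluster
    with dist(query, center) > radius + cluster_radius cannot contain
    a qualifying object and can be skipped safely.
    """
    return [
        name
        for name, (center, cluster_radius) in clusters.items()
        if dist(query, center) <= radius + cluster_radius
    ]
```

Because the test only discards clusters that provably contain no match, pruning of this kind is compatible with the stated guarantee that all similar objects are retrieved. The `dist` argument can be any metric; Euclidean distance is used here only as a default.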
Modeling LSH for Performance Tuning
Cited by 5 (1 self)
Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical, partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though it obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process. To address this problem, we present a statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm that determines the probing parameter dynamically for each query. The adaptive probing method addresses the problem that even though the average performance is tuned to be optimal, the variance of the performance is extremely high. We experimented with three different datasets including audio, images and 3D shapes to evaluate our methods. The results show the accuracy of the proposed model: the recall errors predicted are within 5% of the real values for most cases; the adaptive search method reduces the standard deviation of recall by about 50% over the existing method.
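The role of a probing parameter can be illustrated with a deliberately simplified sketch. It is not the Multi-probe LSH algorithm modeled in the paper: it assumes a single table of random-hyperplane hash bits and probes extra buckets by flipping, one at a time, the bits whose projections lie closest to the hyperplane. The class name `HyperplaneLSH` and the parameters `num_bits` and `num_probes` are hypothetical.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Single-table random-hyperplane LSH with a simplified probing step."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # One random Gaussian hyperplane per hash bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def _project(self, vec):
        return [sum(p * v for p, v in zip(plane, vec))
                for plane in self.planes]

    def key(self, vec):
        # Each bit records on which side of a hyperplane the vector lies.
        return tuple(int(x >= 0) for x in self._project(vec))

    def add(self, item_id, vec):
        self.buckets[self.key(vec)].append(item_id)

    def query(self, vec, num_probes=0):
        """Return candidates from the home bucket plus num_probes
        neighboring buckets (assumed probing order: least confident
        bits first)."""
        proj = self._project(vec)
        home = tuple(int(x >= 0) for x in proj)
        keys = [home]
        order = sorted(range(len(proj)), key=lambda i: abs(proj[i]))
        for i in order[:num_probes]:
            flipped = list(home)
            flipped[i] ^= 1
            keys.append(tuple(flipped))
        candidates = []
        for k in keys:
            candidates.extend(self.buckets.get(k, []))
        return candidates
```

Raising `num_probes` can only add candidates, so it trades latency for recall; choosing that value per query is the kind of decision the adaptive search method in the abstract makes from its performance model.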
Cost-Aware Processing of Similarity Queries in Structured Overlays
- In IEEE P2P2006
, 2006
Cited by 4 (3 self)
Large-scale distributed data management with P2P systems requires similarity operators for queries, because we cannot assume that all users will agree on exactly the same schema and value representations, and because data quality problems arise from spelling errors and typos. In this paper, we present an approach for efficient processing of similarity selections and joins in a structured overlay. We show that there are several possible strategies exploiting DHT features to a different extent (i.e., key organization, routing, multicasting), and thus the choice of the best operator implementation in a given situation (selectivity, data distribution, load) should be based on cost information that allows the system to estimate the computation and communication costs of query execution plans. Hence, we present a cost model for similarity operations on structured data in a DHT and demonstrate the efficiency of our proposal by experimental results from a large-scale PlanetLab deployment.
Clustering Near-Duplicate Images in Large Collections
, 2007
Cited by 4 (0 self)
Near-duplicate images introduce problems of redundancy and copyright infringement in large image collections. The problem is acute on the web, where appropriation of images without acknowledgment of source is prevalent. In this paper, we present an effective clustering approach for near-duplicate images, using a combination of techniques from invariant image local descriptors and an adaptation of near-duplicate text-document clustering techniques; we extend our earlier approach of near-duplicate image pairwise identification for this clustering approach. We demonstrate that our clustering approach is highly effective for collections of up to a few hundred thousand images. We also show, via experimentation with real examples, that our approach presents a viable solution for clustering near-duplicate images on the Web.
Similarity Queries on Structured Data in Structured Overlays
- In Int. Workshop on Networking Meets Databases (NetDB’06), in conjunction with ICDE
, 2006
Cited by 3 (2 self)
Structured P2P systems based on distributed hash tables are a popular choice for building large-scale data management systems. Generally, they only support exact match queries, but data heterogeneities often demand more complex query types, particularly similarity queries. In this work, we suggest a vertical data organization, which allows for efficient processing of similarity queries on the instance level as well as on the schema level, and we introduce corresponding physical similarity operators. Our novel approach is shown to be suitable in conjunction with P-Grid, as an example of robust, large-scale and self-organizing P2P systems.
Cuckoo Ring: Balancing Workload for Locality Sensitive Hash
- Proc. IEEE Int’l Conf. Peer-to-Peer Computing (P2P)
, 2006
Cited by 1 (0 self)
Locality Sensitive Hash (LSH) is widely used in peer-to-peer (P2P) systems. Although it can support range or similarity queries, it breaks the load-balancing mechanism of traditional Distributed Hash Table (DHT) based systems by replacing consistent hashing with LSH. To solve the imbalance problem, current systems either weaken the locality-preserving ability from similarity-preserving to order-preserving or adopt a load-aware peer join mechanism. The first method does not support similarity queries, as it loses the similarity information, and the second method is greatly affected by the dynamic nature of P2P networks. In this paper, we propose a novel system, cuckoo ring, which preserves similarity information while remaining load balanced. It does not guide newly joining peers to the hot areas but moves the items in the hot areas to cold areas, so that short-lifetime peers are distributed uniformly across the network instead of being guided to the hot areas. Compared to traditional DHT systems, cuckoo ring maintains only a little more information about the global lightly loaded peers and the moved indexed items.

