Results 1  10
of
429
From frequency to meaning : Vector space models of semantics
 Journal of Artificial Intelligence Research
, 2010
"... Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are begi ..."
Abstract

Cited by 322 (3 self)
 Add to MetaCart
(Show Context)
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field. 1.
Google news personalization: scalable online collaborative filtering
 in WWW, 2007
"... Several approaches to collaborative filtering have been studied but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating p ..."
Abstract

Cited by 266 (0 self)
 Add to MetaCart
(Show Context)
Several approaches to collaborative filtering have been studied but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.
Kernelized localitysensitive hashing for scalable image search
 IEEE International Conference on Computer Vision (ICCV
, 2009
"... Fast retrieval methods are critical for largescale and datadriven vision applications. Recent work has explored ways to embed highdimensional features or complex distance functions into a lowdimensional Hamming space where items can be efficiently searched. However, existing methods do not apply ..."
Abstract

Cited by 165 (5 self)
 Add to MetaCart
(Show Context)
Fast retrieval methods are critical for largescale and datadriven vision applications. Recent work has explored ways to embed highdimensional features or complex distance functions into a lowdimensional Hamming space where items can be efficiently searched. However, existing methods do not apply for highdimensional kernelized data when the underlying feature embedding for the kernel is unknown. We show how to generalize localitysensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm’s sublinear time similarity search guarantees for a wide class of useful similarity functions. Since a number of successful imagebased kernels have unknown or incomputable embeddings, this is especially valuable for image retrieval tasks. We validate our technique on several largescale datasets, and show that it enables accurate and fast performance for examplebased object classification, feature matching, and contentbased retrieval. 1.
Efficient Similarity Search and Classification Via Rank Aggregation
 In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
, 2003
"... We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the ..."
Abstract

Cited by 152 (3 self)
 Add to MetaCart
(Show Context)
We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.
Finding nearduplicate web pages: a largescale evaluation of algorithms
 In Proc. of Int. Conf. on Information Retrieval (SIGIR
, 2006
"... Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “stateoftheart ” algorithms for finding nearduplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very lar ..."
Abstract

Cited by 113 (1 self)
 Add to MetaCart
(Show Context)
Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “stateoftheart ” algorithms for finding nearduplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding nearduplicate pairs on the same site, while both achieve high precision for nearduplicate pairs on different sites. Since Charikar’s algorithm finds more nearduplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al. ’s algorithm. We present a combined algorithm which achieves precision 0.79 with 79 % of the recall of the other algorithms.
Learning to hash with binary reconstructive embeddings
 in Proc. NIPS, 2009
"... Fast retrieval methods are increasingly critical for many largescale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitl ..."
Abstract

Cited by 110 (1 self)
 Add to MetaCart
(Show Context)
Fast retrieval methods are increasingly critical for many largescale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinatedescent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing stateoftheart techniques. 1
Nearestneighbor searching and metric space dimensions
 In NearestNeighbor Methods for Learning and Vision: Theory and Practice
, 2006
"... Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distan ..."
Abstract

Cited by 106 (0 self)
 Add to MetaCart
(Show Context)
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest neighbor searching in a variety of settings, for example: points in lowdimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kdtree ” approach in the metric space setting, using Voronoi regions of a subset in place of axisaligned boxes. 1
H (2010a) Largescale image retrieval with compressed Fisher vectors
 In: CVPR Perronnin F, Sánchez J, Liu Y (2010b) Largescale
"... The problem of largescale image search has been traditionally addressed with the bagofvisualwords (BOV). In this article, we propose to use as an alternative the Fisher kernel framework. We first show why the Fisher representation is wellsuited to the retrieval problem: it describes an image by ..."
Abstract

Cited by 104 (8 self)
 Add to MetaCart
The problem of largescale image search has been traditionally addressed with the bagofvisualwords (BOV). In this article, we propose to use as an alternative the Fisher kernel framework. We first show why the Fisher representation is wellsuited to the retrieval problem: it describes an image by what makes it different from other images. One drawback of the Fisher vector is that it is highdimensional and, as opposed to the BOV, it is dense. The resulting memory and computational costs do not make Fisher vectors directly amenable to largescale retrieval. Therefore, we compress Fisher vectors to reduce their memory footprint and speedup the retrieval. We compare three binarization approaches: a simple approach devised for this representation and two standard compression techniques. We show on two publicly available datasets that compressed Fisher vectors perform very well using as little as a few hundreds of bits per image, and significantly better than a very recent compressed BOV approach. 1.
Fast Image Search for Learned Metrics
"... We introduce a method that enables scalable image search for learned metrics. Given pairwise similarity and dissimilarity constraints between some images, we learn a Mahalanobis distance function that captures the images’ underlying relationships well. To allow sublinear time similarity search unde ..."
Abstract

Cited by 104 (11 self)
 Add to MetaCart
(Show Context)
We introduce a method that enables scalable image search for learned metrics. Given pairwise similarity and dissimilarity constraints between some images, we learn a Mahalanobis distance function that captures the images’ underlying relationships well. To allow sublinear time similarity search under the learned metric, we show how to encode the learned metric parameterization into randomized localitysensitive hash functions. We further formulate an indirect solution that enables metric learning and hashing for vector spaces whose high dimensionality make it infeasible to learn an explicit weighting over the feature dimensions. We demonstrate the approach applied to a variety of image datasets. Our learned metrics improve accuracy relative to commonlyused metric baselines, while our hashing construction enables efficient indexing with learned distances and very large databases.
Efficient similarity joins for near duplicate detection
 In WWW
, 2008
"... With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pair of records such that their similarities are no less than a given ..."
Abstract

Cited by 102 (8 self)
 Add to MetaCart
(Show Context)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in standalone and RDBMSbased settings. Experimental results show our proposed algorithms can outperforms previous algorithms on several real datasets.