## Author manuscript, published in "N/P" SEARCHING IN ONE BILLION VECTORS: RE-RANK WITH SOURCE CODING (2011)

### BibTeX

@MISC{Jégou11authormanuscript,

author = {Hervé Jégou and Romain Tavenard and Matthijs Douze and Laurent Amsaleg},

title = {SEARCHING IN ONE BILLION VECTORS: RE-RANK WITH SOURCE CODING},

year = {2011}

}

### Abstract

Recent indexing techniques inspired by source coding have proven successful at indexing billions of high-dimensional vectors in memory. In this paper, we propose an approach that re-ranks the neighbor hypotheses obtained by these compressed-domain indexing methods. In contrast to the usual post-verification scheme, which performs exact distance calculations on the short-list of hypotheses, the estimated distances are refined based on short quantization codes, which avoids reading the full vectors from disk. We have released a new public dataset of one billion 128-dimensional vectors and proposed an experimental setup to evaluate high-dimensional indexing algorithms on a realistic scale. Experiments show that our method accurately and efficiently re-ranks the neighbor hypotheses using little memory compared to the full vector representation.

Index Terms: nearest neighbor search, quantization, source coding, high dimensional indexing, large databases
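
The re-ranking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: a plain uniform scalar quantizer stands in for the source-coding-based quantizers, and every name here (`build_index`, `search`, `coarse_step`, `fine_step`) is hypothetical. The sketch only shows the principle: a first short code gives a rough distance estimate, and a second short code that encodes the quantization residual refines that estimate for the short-list, so the full vectors never need to be read back.

```python
def quantize(vec, step):
    """Uniform scalar quantizer (an assumption): one integer code per component."""
    return [round(v / step) for v in vec]


def reconstruct(codes, step):
    """Map integer codes back to approximate component values."""
    return [c * step for c in codes]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(database, coarse_step, fine_step):
    """Keep only two short codes per vector: a coarse code and a residual code."""
    index = []
    for x in database:
        coarse_code = quantize(x, coarse_step)
        coarse_approx = reconstruct(coarse_code, coarse_step)
        residual = [a - b for a, b in zip(x, coarse_approx)]
        refine_code = quantize(residual, fine_step)
        index.append((coarse_code, refine_code))
    return index


def search(query, index, coarse_step, fine_step, shortlist_size, k):
    """Rank with coarse codes, then re-rank the short-list with refined estimates."""
    # Stage 1: distance estimates from the coarse codes only.
    coarse_ranking = sorted(
        (sq_dist(query, reconstruct(coarse_code, coarse_step)), i)
        for i, (coarse_code, _) in enumerate(index)
    )
    shortlist = [i for _, i in coarse_ranking[:shortlist_size]]

    # Stage 2: add the decoded residual to sharpen the reconstruction,
    # then re-rank the short-list by the refined distance estimate.
    refined_ranking = []
    for i in shortlist:
        coarse_code, refine_code = index[i]
        approx = [c * coarse_step + r * fine_step
                  for c, r in zip(coarse_code, refine_code)]
        refined_ranking.append((sq_dist(query, approx), i))
    refined_ranking.sort()
    return [i for _, i in refined_ranking[:k]]


# Toy data: vectors 0 and 1 share the same coarse code, so stage 1 cannot
# tell them apart; the residual code lets stage 2 put the true neighbor first.
database = [[-0.375, 0.0], [0.375, 0.0], [3.0, 3.0]]
index = build_index(database, coarse_step=1.0, fine_step=0.125)
result = search([0.5, 0.0], index, 1.0, 0.125, shortlist_size=2, k=1)
print(result)  # [1]
```

In this toy setting the coarse codes alone would return vector 0 first (a tie broken by position), whereas the refined estimate correctly returns vector 1. The index holds only integer codes, which mirrors the memory argument made in the abstract.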

### Citations

5106 | Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision
- Lowe
- 2004
Citation Context ...remental manner by successive description layers. In order to evaluate our approach, we introduce a dataset of one billion vectors extracted from millions of images using the standard SIFT descriptor =-=[8]-=-. Testing on a large scale is important, as most ANN methods are usually evaluated on sets of unrealistic size, thereby ignoring memory issues that arise in real applications, where billions of vector... |

426 | High performance scalable image compression with EBCOT
- Taubman
Citation Context ...approximation resulting from the first ranking, and refines it using codes stored in RAM. There is an analogy between this approach and the scalable compression techniques proposed in the last decade =-=[13]-=-, where the term “scalable” means that a reconstruction of the compressed signal is refined in an incremental manner by successive description layers. In order to evaluate our approach, we introduce a... |

289 | Locality-sensitive hashing scheme based on p-stable distributions
- Datar, Immorlica, et al.
- 2004
Citation Context ...ectors representation. Index Terms— nearest neighbor search, quantization, source coding, high dimensional indexing, large databases 1. INTRODUCTION Approximate nearest neighbors (ANN) search methods =-=[3, 10, 14, 15]-=- are required to handle large databases, especially for computer vision [12] and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. Howev... |

187 | Fast approximate nearest neighbors with automatic algorithm configuration
- Muja, Lowe
- 2009
Citation Context ...ectors representation. Index Terms— nearest neighbor search, quantization, source coding, high dimensional indexing, large databases 1. INTRODUCTION Approximate nearest neighbors (ANN) search methods =-=[3, 10, 14, 15]-=- are required to handle large databases, especially for computer vision [12] and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. Howev... |

107 | Spectral hashing
- Weiss, Torralba, et al.
- 2008
Citation Context ...ectors representation. Index Terms— nearest neighbor search, quantization, source coding, high dimensional indexing, large databases 1. INTRODUCTION Approximate nearest neighbors (ANN) search methods =-=[3, 10, 14, 15]-=- are required to handle large databases, especially for computer vision [12] and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. Howev... |

70 | Product quantization for nearest neighbor search
- Jégou, Douze, et al.
Citation Context ... constraint. They are, however, significantly outperformed in terms of the trade-off between memory usage and accuracy by recent methods that cast high dimensional indexing to a source coding problem =-=[11, 5, 6]-=-, in particular the product quantization-based method of [5] exhibits impressive results for large scale image search [6]. State-of-the-art approaches usually perform a re-ranking stage to produce a r... |

64 | Nearest-Neighbors Methods in Learning and Vision. Theory and Practice
- Shakhnarovich, Darrell, et al.
- 2008
Citation Context ...g, high dimensional indexing, large databases 1. INTRODUCTION Approximate nearest neighbors (ANN) search methods [3, 10, 14, 15] are required to handle large databases, especially for computer vision =-=[12]-=- and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. However, most of these approaches are memory consuming, since several hash tables... |

63 | Aggregating local descriptors into a compact image representation - Jégou, Douze, et al. - 2010 |

55 | Improving bag-of-features for large scale image search
- Jégou, Douze, et al.
- 2010
Citation Context ...One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. However, most of these approaches are memory consuming, since several hash tables or trees are required. The methods of =-=[4, 15]-=-, which embeds the vector into a binary space, better satisfies the memory constraint. They are, however, significantly outperformed in terms of the trade-off between memory usage and accuracy by rece... |

49 | Small Codes and Large Databases for Recognition
- Torralba, Fergus, et al.
- 2008

48 | Content-based music information retrieval: current directions and future challenges
- Casey, Veltkamp, et al.
Citation Context ...ing, large databases 1. INTRODUCTION Approximate nearest neighbors (ANN) search methods [3, 10, 14, 15] are required to handle large databases, especially for computer vision [12] and music retrieval =-=[2]-=- applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. However, most of these approaches are memory consuming, since several hash tables or trees are required. ... |

15 | Learning a fine vocabulary
- Mikulik, Perdoch, et al.
Citation Context ...scale is important, as most ANN methods are usually evaluated on sets of unrealistic size, thereby ignoring memory issues that arise in real applications, where billions of vectors have to be handled =-=[4, 9]-=-. The groundtruth nearest-neighbors have been computed for 10000 queries using exact distance computations. To our knowledge, this set is the largest ever released to evaluate ANN algorithms against a... |

11 | NV-tree: An Efficient Disk-Based Index for Approximate Search in Very Large High-Dimensional Collections
- Lejsek, Ásmundsson, et al.
- 2009
Citation Context ...ions. To our knowledge, this set is the largest ever released to evaluate ANN algorithms against an exact linear scan on real data: the largest other experiment we are aware of is the one reported in =-=[7]-=-, where a private set of 179 million vectors was considered. [1] reports an experiment on 1 billion vectors, but on synthetic data with a known model exploited by the algorithm. Experiments performed ... |

7 | iSAX 2.0: Indexing and mining one billion time series
- Camerra, Palpanas, et al.
- 2010
Citation Context ... evaluate ANN algorithms against an exact linear scan on real data: the largest other experiment we are aware of is the one reported in [7], where a private set of 179 million vectors was considered. =-=[1]-=- reports an experiment on 1 billion vectors, but on synthetic data with a known model exploited by the algorithm. Experiments performed on this dataset show that the proposed approach offers an altern... |

5 | Searching with expectations - Sandhawalia, Jégou |