Results 1–10 of 15
Scalable similarity search with optimized kernel hashing
In SIGKDD, 2010
Cited by 26 (3 self)

Abstract:
Scalable similarity search is the core of many large-scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large-scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets, and so on. Starting with an idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel-based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nyström approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities such as label consistency. We evaluate our method using both vector and non-vector data sets at a large scale, up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.
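To make the kernel-hashing idea concrete, here is a minimal, hypothetical sketch: hash bits are the signs of projections of centered kernel responses against a small anchor set. The RBF kernel choice, the anchor count, and the random projection matrix `W` are illustrative stand-ins, not the optimized solution the paper derives.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, anchors, gamma=0.5):
    # Pairwise RBF kernel between rows of X and the anchor set.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = rng.normal(size=(100, 8))                      # data points
anchors = X[rng.choice(100, 16, replace=False)]    # Nystrom-style anchor subset
W = rng.normal(size=(16, 12))                      # projection to 12 bits (placeholder)

K = rbf_kernel(X, anchors)
K = K - K.mean(axis=0)                             # center kernel responses
codes = (K @ W > 0).astype(np.uint8)               # binary hash codes, shape (100, 12)
```

Because hashing goes through kernel values to anchors rather than raw features, the same recipe applies to any data type with a kernel (graphs, sequences, sets), which is the point of the paper.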
Secure image retrieval through feature protection
In Proc. IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP), 2009
Cited by 10 (1 self)

Abstract:
This paper addresses the problem of image retrieval from an encrypted database, where data confidentiality is preserved in both the storage and retrieval process. The paper focuses on image feature protection techniques which enable similarity comparison among protected features. By utilizing both signal processing and cryptographic techniques, three schemes are investigated and compared, including bit-plane randomization, random projection, and randomized unary encoding. Experimental results show that secure image retrieval can achieve retrieval performance comparable to conventional image retrieval techniques without revealing information about image content. This work enriches the area of secure information retrieval and can find applications in secure online services for images and videos. Index Terms — secure image retrieval, feature protection, relevance-based search, content-based image retrieval.
Similarity search and locality sensitive hashing using TCAMs
In SIGMOD, 2010
Cited by 5 (0 self)

Abstract:
Similarity search methods are widely used as kernels in various data mining and machine learning applications, including those in computational biology and web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e., for high-dimensional spaces the best known solutions offer little improvement over brute-force search and thus are unsuitable for large-scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial, in n, where n is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near-linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near-linear in the size of the database. TCAMs are high-performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit posi…
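The TCAM primitive the paper exploits can be illustrated in a few lines: each stored entry is a ternary string over {0, 1, *}, and a binary query matches an entry when it agrees on every non-wildcard position (the hardware performs this match over all entries in parallel). The table contents below are made up for illustration.

```python
def tcam_match(entry: str, query: str) -> bool:
    # An entry matches when every position is either a wildcard '*'
    # or equal to the corresponding query bit.
    return all(e in ("*", q) for e, q in zip(entry, query))

table = ["01*1", "1**0", "0000"]
hits = [e for e in table if tcam_match(e, "0111")]  # -> ["01*1"]
```

Encoding LSH buckets as such ternary patterns is what lets a single O(1) TCAM lookup stand in for many hash-table probes.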
Efficient distributed locality sensitive hashing
In CIKM, 2012
Cited by 5 (0 self)

Abstract:
(Key, Value) based distributed frameworks, such as MapReduce, Memcached, and Twitter Storm, are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large-scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. At its core, LSH is based on hashing the data points to a number of buckets such that similar points are more likely to map to the same buckets. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash-bucket lookup, it also entails a heavy network load. The Entropy LSH scheme proposed by Panigrahy significantly reduces the number of required hash tables by looking up a number of query offsets in addition to the query itself. While this improves the LSH space requirement, it does not help with (and in fact worsens) the search network efficiency, as now each query offset requires a network call. In this paper, focusing on the Euclidean space under the l2 norm and building on Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost while maintaining a good load balance between different machines. Our experiments also verify that our scheme results in significant network traffic reductions, which also bring about runtime improvements.
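A minimal sketch of the two ingredients discussed above, assuming the standard p-stable l2 hash family h(x) = floor((a·x + b) / w): one hash table maps each point to a bucket tuple, and Entropy LSH trades extra tables for extra lookups at randomly perturbed "query offsets". Parameter values and the Gaussian offset distribution are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, w = 16, 4, 4.0                 # dimension, bits per table, bucket width
A = rng.normal(size=(k, d))          # one random direction per hash component
b = rng.uniform(0, w, size=k)

def lsh_bucket(x):
    # p-stable l2 LSH: quantize k random projections into a bucket id.
    return tuple(np.floor((A @ x + b) / w).astype(int))

def query_offsets(q, n_offsets=8, radius=1.0):
    # Entropy LSH idea: probe buckets of perturbed copies of the query
    # instead of maintaining many independent hash tables.
    return [q + rng.normal(scale=radius, size=d) for _ in range(n_offsets)]

q = rng.normal(size=d)
buckets = {lsh_bucket(p) for p in [q] + query_offsets(q)}  # buckets to probe
```

In a distributed store, each distinct bucket in `buckets` is one network call; Layered LSH's contribution is arranging keys so those probes land on few machines.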
Sequential spectral learning to hash with multiple representations
2012
Cited by 4 (0 self)

Abstract:
Learning to hash involves learning hash functions from a set of images for embedding high-dimensional visual descriptors into a similarity-preserving low-dimensional Hamming space. Most existing methods resort to a single representation of images; that is, only one type of visual descriptor is used to learn a hash function to assign binary codes to images. However, images are often described by multiple different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate these multiple representations into learning a hash function, leading to multi-view hashing. In this paper we present a sequential spectral learning approach to multi-view hashing where a hash function is sequentially determined by solving the successive maximization of local variances subject to decorrelation constraints. We compute multi-view local variances by α-averaging view-specific distance matrices such that the best averaged distance matrix is determined by minimizing its α-divergence from the view-specific distance matrices. We also present a scalable implementation, exploiting a fast approximate k-NN graph construction method, in which α-averaged distances computed in small partitions determined by recursive spectral bisection are gradually merged in conquer steps until all examples are used. Numerical experiments on the Caltech-256, CIFAR-20, and NUS-WIDE datasets confirm the high performance of our method, in comparison to single-view spectral hashing as well as existing multi-view hashing methods.
Single versus multiple sorting in all pairs similarity search
Journal of Machine Learning Research – Proceedings Track, 13:145–160, 2010
Cited by 4 (1 self)

Abstract:
To save memory and improve speed, vectorial data such as images and signals are often represented as strings of discrete symbols (i.e., sketches). Charikar (2002) proposed a fast approximate method for finding neighbor pairs of strings by sorting and scanning with a small window. This method, which we shall call "single sorting", is applied to locality sensitive codes and is prevalently used in speed-demanding web-related applications. To improve on single sorting, we propose a novel method that employs blockwise masked sorting. Our method can dramatically reduce the number of candidate pairs which have to be verified by distance calculation, in exchange for an increased amount of sorting operations. It is thus especially attractive for high-dimensional dense data, where distance calculation is expensive. Empirical results show the efficiency of our method in comparison to single sorting and recent fast nearest neighbor methods.
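The single-sorting baseline the paper improves on can be sketched directly, under the assumption that sketches are equal-length strings: sort the sketches once, then treat every pair within a small sliding window of the sorted order as a neighbor candidate to be verified by exact distance. The sketch values and window size below are illustrative.

```python
def single_sorting_candidates(sketches, window=3):
    # Sort indices by sketch value; nearby positions in sorted order
    # are likely to hold similar sketches.
    order = sorted(range(len(sketches)), key=lambda i: sketches[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))  # canonical (small, large) pair
    return pairs

sketches = ["0110", "0111", "1010", "0110", "1011"]
cands = single_sorting_candidates(sketches)  # identical sketches 0 and 3 pair up
```

One sort gives O(n log n + candidates) work instead of the O(n^2) all-pairs scan; the cost is missing pairs whose sketches sort far apart, which the paper's blockwise masked sorting addresses.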
Semi-supervised discriminant hashing
In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2011
Cited by 3 (3 self)

Abstract:
Hashing refers to methods for embedding high-dimensional data into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes whose Hamming distances are small. Learning hash functions from data has recently been recognized as a promising approach to approximate nearest neighbor search for high-dimensional data. Most 'learning to hash' methods resort to either unsupervised or supervised learning to determine hash functions. Recently a semi-supervised learning approach was introduced to hashing, where pairwise constraints (must-link and cannot-link) from labeled data are leveraged while unlabeled data are used for regularization to avoid overfitting. In this paper we base our semi-supervised hashing on linear discriminant analysis, where hash functions are learned such that labeled data are used to maximize the separability between binary codes associated with different classes, while unlabeled data are used for regularization as well as for the balancing condition and pairwise decorrelation of bits. The resulting method is referred to as semi-supervised discriminant hashing (SSDH). Numerical experiments on the MNIST and CIFAR-10 datasets demonstrate that our method outperforms existing methods, especially in the case of short binary codes. Keywords — hashing, regularized discriminant analysis, semi-supervised learning.
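A hedged sketch of the discriminant idea behind this line of work: on labeled data, find projections that maximize between-class scatter relative to within-class scatter (plain LDA here), then threshold the projections at zero to obtain bits. The unlabeled-data regularization, balancing, and decorrelation terms of SSDH are omitted; the synthetic data, the ridge `eps`, and the 2-bit code length are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 6))
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 2.0
X[y == 2] -= 2.0                      # separate the three classes a little

mu = X.mean(axis=0)
# Within-class and between-class scatter matrices over the labeled data.
Sw = sum((X[y == c] - X[y == c].mean(0)).T @ (X[y == c] - X[y == c].mean(0))
         for c in range(3))
Sb = sum(20 * np.outer(X[y == c].mean(0) - mu, X[y == c].mean(0) - mu)
         for c in range(3))

eps = 1e-3                            # ridge so Sw is invertible
vals, vecs = np.linalg.eig(np.linalg.inv(Sw + eps * np.eye(6)) @ Sb)
W = np.real(vecs[:, np.argsort(-np.real(vals))[:2]])   # top-2 discriminant directions

codes = (X @ W > 0).astype(np.uint8)  # 2-bit binary code per point
```

Thresholding discriminant projections ties bit values to class structure, which is why short codes can already separate classes well.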
Deep Learning to Hash with Multiple Representations
Cited by 3 (0 self)

Abstract:
Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of data. However, objects are often described by multiple representations. For instance, images are described by a few different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH), compared to a recently proposed multimodal deep network as well as existing shallow models of hashing. Keywords — deep learning; harmonium; hashing; multi-view learning; restricted Boltzmann machines.
GAD: General Activity Detection for Fast Clustering on Large Data
Cited by 1 (1 self)

Abstract:
In this paper, we propose GAD (General Activity Detection) for fast clustering on large-scale data. Within this framework we design a set of algorithms for different scenarios: (1) the exact GAD algorithm EGAD, which is much faster than K-Means and gets the same clustering result; (2) approximate GAD algorithms with different assumptions, which are faster than EGAD while achieving different degrees of approximation; (3) GAD-based algorithms to handle the "large clusters" problem which appears in many large-scale clustering applications. Two existing activity detection algorithms, GT and CGAUTC, are special cases under the framework. The most important contribution of our work is that the framework is a general solution for exploiting activity detection for fast clustering in both exact and approximate scenarios, and our proposed algorithms within the framework can achieve very high speed. Extensive experiments have been conducted on several large datasets from various real-world applications; the results show that our proposed algorithms are effective and efficient.
Efficient Construction of Neighborhood Graphs by the Multiple Sorting Method
Cited by 1 (0 self)

Abstract:
Neighborhood graphs are gaining popularity as a concise data representation in machine learning. However, naive graph construction by pairwise distance calculation takes O(n^2) runtime for n data points, and this is prohibitively slow for millions of data points. For strings of equal length, the multiple sorting method (Uno, 2008) can construct an ε-neighbor graph in O(n+m) time, where m is the number of ε-neighbor pairs in the data. To introduce this remarkably efficient algorithm to continuous domains such as images, signals, and texts, we employ a random projection method to convert vectors to strings. Theoretical results are presented to elucidate the trade-off between approximation quality and computation time. Empirical results show the efficiency of our method in comparison to fast nearest neighbor alternatives.
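The vector-to-string conversion step can be sketched with sign random projections: each random hyperplane contributes one bit, so real vectors become equal-length bit strings whose Hamming distance roughly tracks angular distance, making string-based neighbor methods applicable. The bit count here is an illustrative choice, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(2)

def to_bitstrings(X, n_bits=32):
    # One random hyperplane per bit; the sign of each projection is the bit.
    H = rng.normal(size=(X.shape[1], n_bits))
    return ["".join("1" if v > 0 else "0" for v in row) for row in X @ H]

X = rng.normal(size=(5, 10))
strings = to_bitstrings(X)   # equal-length strings, ready for sorting-based search
```

The resulting strings are exactly the input format the multiple sorting method expects, which is how the continuous case is reduced to the string case.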