Results 1 - 5 of 5
Bloom filter and lossy dictionary based language models
, 2007
Abstract

Cited by 2 (0 self)
Language models are probability distributions over monolingual natural language text, used in many natural language processing tasks such as statistical machine translation, information retrieval, and speech processing. Since more well-formed training data means a better model, and text has become increasingly available via the Internet, the size of language-modelling n-gram data sets has grown exponentially in the past few years. The latest available data sets can no longer fit on a single computer. A recent investigation reported the first known use of a probabilistic data structure to create a randomised language model capable of storing probability information for massive n-gram sets in a fraction of the space normally needed. We report and compare the properties of lossy language models using two probabilistic data structures: the Bloom filter and the lossy dictionary. The Bloom filter has exceptional space requirements and only one-sided (false positive) errors, but it is computationally slow at scale, a potential drawback for a structure queried millions of times per sentence. Lossy dictionaries have low space requirements and are very fast, but have two-sided error that returns ...
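As a rough illustration of the data structure this abstract discusses, here is a minimal Bloom filter sketch in Python. It is not the paper's implementation; the sizes, hash scheme, and example n-grams are illustrative assumptions. It shows the one-sided error property the abstract mentions: every inserted n-gram is always found, while a lookup of an absent n-gram can (rarely) return a false positive.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: membership queries with one-sided (false positive) error."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive k independent bit positions by salting a cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical n-gram set; parameters chosen only for the example.
bf = BloomFilter(num_bits=10_000, num_hashes=7)
for ngram in ["in the house", "on the mat", "the cat sat"]:
    bf.add(ngram)

print("the cat sat" in bf)  # an inserted n-gram is never missed
```

Note that each membership query computes `num_hashes` hash values, which is the per-lookup cost the abstract flags as a drawback when the structure is probed millions of times per sentence.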
Randomised Features in Discriminative Machine Learning
, 2008
Abstract
Discriminative models are a class of learning methods where the focus is on learning class memberships, as opposed to generative models, where the interest is in full class densities. While several approaches to discriminative modelling exist, we concentrate on the Maximum Entropy framework, based on a theoretical argument developed by Jaynes [1957]. Maximum Entropy methods are feature-based: in order to infer an empirical distribution from the data, they encode relevant statistics using features. In general, the quality of the model grows with the number and scope of the features; unfortunately, the computational and memory resources needed to manipulate them also grow accordingly, often to an unmanageable extent. We investigate the possibility of representing features using randomised techniques. Exploring one class of important one-sided-error randomised data structures derived from the Bloom Filter, our study concentrates on the logarithmic-frequency Bloom Filter [Talbot and Osborne, 2007a,b] and the Bloom Map [Talbot and Talbot, 2008]. Both are introduced and tested in a discriminative learning ...
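The logarithmic-frequency Bloom filter this abstract cites can be sketched roughly as follows: a count c is quantised to a small number of levels, the pair (item, level) is inserted into an ordinary Bloom filter for every level up to the quantised count, and a query probes levels 1, 2, ... until the first miss. This is a hedged reconstruction of the general idea, not Talbot and Osborne's code; all parameters and names below are illustrative.

```python
import hashlib
import math

class LogFrequencyBloomFilter:
    """Sketch of a logarithmic-frequency Bloom filter.

    A count c is quantised to 1 + floor(log_b(c)) levels, and (item, j) is
    inserted for every level j up to that quantisation. Because the underlying
    filter has one-sided error, a query can only ever OVER-estimate a level,
    never under-estimate one."""

    def __init__(self, num_bits, num_hashes, base=2.0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.base = base
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item, count):
        levels = 1 + int(math.log(count, self.base))
        for j in range(1, levels + 1):
            for pos in self._positions(f"{item}#{j}"):
                self.bits[pos] = 1

    def quantised_count(self, item):
        # Probe increasing levels until the first miss.
        j = 0
        while all(self.bits[p] for p in self._positions(f"{item}#{j + 1}")):
            j += 1
        return j  # 0 means "probably unseen"

lf = LogFrequencyBloomFilter(num_bits=20_000, num_hashes=5)
lf.add("the cat", count=9)  # quantised to level 1 + floor(log2 9) = 4
lf.add("cat sat", count=1)  # quantised to level 1
print(lf.quantised_count("the cat"))
```

Storing log-quantised levels rather than exact counts is what lets such structures hold frequency statistics for very large feature or n-gram sets in little space.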
An Optimal Bloom Filter Replacement
Abstract
This paper considers space-efficient data structures for storing an approximation S′ to a set S such that S ⊆ S′ and any element not in S belongs to S′ with probability at most ɛ. The Bloom filter data structure, which solves this problem, has found widespread use. Our main result is a new RAM data structure that improves on Bloom filters in several ways:
• The time for looking up an element in S′ is O(1), independent of ɛ.
• The space usage is within a lower-order term of the lower bound.
• The data structure uses explicit hash function families.
• The data structure supports insertions and deletions on S in amortized expected constant time.
The main technical ingredient is a succinct representation of dynamic multisets. We also consider three recent generalizations of Bloom filters.
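The space claim in this abstract can be made concrete with a small worked calculation. The information-theoretic lower bound for an ɛ-approximate set is log2(1/ɛ) bits per element, while an optimally configured Bloom filter needs log2(e) ≈ 1.44 times that; the gap is what the paper's replacement structure closes up to lower-order terms. This is a standard back-of-the-envelope comparison, not a computation from the paper itself.

```python
import math

def lower_bound_bits_per_element(epsilon):
    """Information-theoretic lower bound for an epsilon-approximate set."""
    return math.log2(1 / epsilon)

def bloom_bits_per_element(epsilon):
    """Bits per element for an optimally configured classic Bloom filter:
    log2(1/eps) * log2(e), i.e. a factor log2(e) ~ 1.44 above the bound."""
    return math.log2(1 / epsilon) / math.log(2)

eps = 0.01  # example false positive rate
print(f"lower bound : {lower_bound_bits_per_element(eps):.2f} bits/element")
print(f"Bloom filter: {bloom_bits_per_element(eps):.2f} bits/element")
```

For ɛ = 1%, the bound is about 6.64 bits per element versus roughly 9.59 for a classic Bloom filter, regardless of how many elements are stored.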