## Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Citations: 5 (0 self)

### BibTeX

    @MISC{Guthrie_storingthe,
      author = {David Guthrie},
      title = {Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval},
      year = {}
    }

### Abstract

We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.
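To make the abstract's approach concrete, the following toy Python sketch stores a small fingerprint plus a value per n-gram. It uses ordinary linear probing rather than the paper's minimal perfect hash, and the hash functions, fingerprint width, and class name are illustrative assumptions; the fingerprint-collision failure mode, where an unseen n-gram can be mistaken for a stored one, mirrors the "innocuous errors" the abstract mentions.

```python
import hashlib

def _hash(ngram: str, salt: str) -> int:
    """Deterministic 64-bit hash of an n-gram (illustrative choice, not the paper's)."""
    return int.from_bytes(hashlib.md5((salt + ngram).encode()).digest()[:8], "big")

class FingerprintTable:
    """Toy fingerprint + value store. A minimal perfect hash would assign
    exactly one slot per key; here we oversize the table and linearly probe,
    so membership errors are limited to rare fingerprint collisions."""

    def __init__(self, ngrams, counts, fp_bits=16, load=0.8):
        self.size = int(len(ngrams) / load) + 1
        self.fp_bits = fp_bits
        self.fps = [None] * self.size   # small fingerprints instead of full keys
        self.vals = [0] * self.size     # associated counts/probabilities
        for g, c in zip(ngrams, counts):
            fp = _hash(g, "fp") & ((1 << fp_bits) - 1)
            i = _hash(g, "slot") % self.size
            while self.fps[i] is not None:      # linear probing on collision
                i = (i + 1) % self.size
            self.fps[i] = fp
            self.vals[i] = c

    def get(self, ngram, default=0):
        fp = _hash(ngram, "fp") & ((1 << self.fp_bits) - 1)
        i = _hash(ngram, "slot") % self.size
        while self.fps[i] is not None:
            if self.fps[i] == fp:   # fingerprint match stands in for key equality
                return self.vals[i]
            i = (i + 1) % self.size
        return default
```

Because only fingerprints are kept, the full n-gram strings never need to be stored, which is where the per-n-gram byte costs quoted above come from.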

### Citations

1473 | Space/time trade-offs in hash coding with allowable errors
- Bloom
- 1970
Citation Context ...emely attractive. Two major approaches have been used for storing language models: Bloom Filters and Bloomier Filters. We give an overview of both in what follows. 2.2.1 Bloom Filters A Bloom filter (Bloom, 1970) is a compact data structure for membership queries, i.e. queries of the form "Is this key in the Set?". This is a weaker structure than a dictionary or hash table which also associates a value with ...
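The Bloom filter described in this context can be sketched in a few lines; the salted-MD5 index derivation, default sizes, and class name here are illustrative assumptions, not the cited construction.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (Bloom, 1970): k bit positions per key are set
    on insert and checked on query. Never a false negative; false positives
    occur at a rate tunable via the bit-array size and k."""

    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k indices from salted digests (illustrative hashing choice).
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

As the context notes, this answers only "Is this key in the set?"; associating a value with each key requires the stronger dictionary-like structures the paper builds.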

889 | E.: Moses: Open source toolkit for statistical machine translation
- Koehn, Hoang, et al.
- 2007
Citation Context ...at uses Bloom filter based structures to store large language models and has been integrated so that it can be used as the language model storage for the Moses statistical machine translation system (Koehn et al., 2007). We use randLM with the BloomMap (Talbot and Talbot, 2008) storage structure option with 8 bit quantized values and an error rate equivalent to using 8 bit fingerprints (as recommended in the Moses ...

340 | Self-organized language modeling for speech recognition
- Jelinek
- 1990
Citation Context ... A range of lossy methods have been proposed, to reduce the storage requirements of LMs by discarding information. Methods include the use of entropy pruning techniques (Stolcke, 1998) or clustering (Jelinek et al., 1990; Goodman and Gao, 2000) to reduce the number of n-grams that must be stored. A key method is quantization (Whittaker and Raj, 2001), which reduces the value information stored with n-grams to a limit...

317 | Statistical Language Modeling Using the CMU-Cambridge Toolkit
- Clarkson, Rosenfeld
- 1997
Citation Context ...n store other information, e.g. a probability or count value. Most modern language modeling toolkits employ some version of a trie structure for storage, including SRILM (Stolcke, 2002), CMU toolkit (Clarkson and Rosenfeld, 1997), MITLM (Hsu and Glass, 2008), and IRSTLM (Federico and Cettolo, 2007) and implementations exist which are very compact (Germann et al., 2009). An advantage of this structure is that it allows the st...

261 | Trie memory
- Fredkin
- 1960
Citation Context ...r lossy methods are first applied to reduce the size of the model. 2.1 Language model storage using Trie structures A widely used approach for storing language models employs the trie data structure (Fredkin, 1960), which compactly represents sequences in the form of a prefix tree, where each step down from the root of the tree adds a new element to the sequence represented by the nodes seen so far. Where two ...

123 | Entropy-based Pruning of Backoff Language Models
- Stolcke
- 1998
Citation Context ...n-gram or less. 2 Related Work A range of lossy methods have been proposed, to reduce the storage requirements of LMs by discarding information. Methods include the use of entropy pruning techniques (Stolcke, 1998) or clustering (Jelinek et al., 1990; Goodman and Gao, 2000) to reduce the number of n-grams that must be stored. A key method is quantization (Whittaker and Raj, 2001), which reduces the value infor...

28 | Randomized language models via perfect hash functions
- Talbot, Brants
- 2008
Citation Context ...e model storage is the use of compact trie structures, but these structures do not scale well and require space proportional both to the number of n-grams and the vocabulary size. Recent advances (Talbot and Brants, 2008; Talbot and Osborne, 2007b) involve the development of Bloom filter based models, which allow a considerable reduction in the space required to store a model, at the cost of allowing some limited ext...

26 | Language model size reduction by pruning and clustering
- Goodman, Gao
- 2000
Citation Context ...ods have been proposed, to reduce the storage requirements of LMs by discarding information. Methods include the use of entropy pruning techniques (Stolcke, 1998) or clustering (Jelinek et al., 1990; Goodman and Gao, 2000) to reduce the number of n-grams that must be stored. A key method is quantization (Whittaker and Raj, 2001), which reduces the value information stored with n-grams to a limited set of discrete alte...

20 | English Gigaword. Linguistic Data Consortium
- Graff
- 2003
Citation Context ...ms may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram. 1 Introduction The availability of very large text collections, such as the Gigaword corpus of newswire (Graff, 2003), and the Google Web1T 1-5gram corpus (Brants and Franz, 2006), have made it possible to build models incorporating counts of billions of n-grams. The storage of these language models, however, prese...

18 | Quantization-based language model compression
- Whittaker, Raj
- 2001
Citation Context ...de the use of entropy pruning techniques (Stolcke, 1998) or clustering (Jelinek et al., 1990; Goodman and Gao, 2000) to reduce the number of n-grams that must be stored. A key method is quantization (Whittaker and Raj, 2001), which reduces the value information stored with n-grams to a limited set of discrete alternatives. It works by grouping together the values (probabilities or counts) associated with n-grams into cl...

10 | Back-off language model compression
- Harb, Chelba, et al.

6 | Randomised language modelling for statistical machine translation
- Talbot, Osborne
- 2007a
Citation Context ...he ranks array will be compressed, as shown in Figure 2. Much like the final stage of the CHD minimal perfect hash algorithm we employ a random access compression algorithm of Fredriksson and Nikitin (2007) to reduce the size required by the array of ranks. This method allows compression while retaining O(1) access to query the model. The first step in the compression is to encode the ranks array using ...

6 | Smoothed Bloom filter language models: Tera-scale LMs on the cheap
- Talbot, Osborne
- 2007b

5 | The Bloomier filter: an efficient data structure for static support lookup tables
- Chazelle, Kilian, Rubinfeld, Tal
- 2004

2 | Iterative language model estimation: efficient data structure & algorithms
- Hsu, Glass
- 2008
Citation Context ...ability or count value. Most modern language modeling toolkits employ some version of a trie structure for storage, including SRILM (Stolcke, 2002), CMU toolkit (Clarkson and Rosenfeld, 1997), MITLM (Hsu and Glass, 2008), and IRSTLM (Federico and Cettolo, 2007) and implementations exist which are very compact (Germann et al., 2009). An advantage of this structure is that it allows the stored n-grams to be enumerated...