## The Sequence Memoizer


### BibTeX

@MISC{Wood_gatsbycomputational,
  author = {Frank Wood and Lancelot James and Jan Gasthaus and Cédric Archambeau and Yee Whye Teh},
  title = {The Sequence Memoizer},
  year = {}
}

### Abstract

Probabilistic models of sequences play a central role in machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification, to name but a few applications. Unfortunately, real-world sequence data often exhibit long-range dependencies that can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture them. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.

### Citations

6955 | A Mathematical Theory of Communication
- Shannon
- 1948

Citation Context: ...stant time extensions to the SM model [1] have been developed which show great promise for language modeling and other applications. 6.2 Compression Shannon’s celebrated results in information theory [17] have led to lossless compression technology that, given a coding distribution, nearly optimally achieves the theoretical lower limit (given by the log-loss) on the number of bits needed to encode a se...
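The log-loss lower bound this snippet refers to can be made concrete: given a coding distribution, the number of bits an ideal entropy coder needs for a sequence is the sum of −log2 of the model's predictive probabilities. A minimal sketch (function names are illustrative, not from the paper):

```python
import math

def log_loss_bits(sequence, prob):
    """Total bits needed to encode `sequence` under a coding distribution.

    `prob(symbol, context)` returns the model's predictive probability of
    `symbol` given the preceding symbols; an idealized arithmetic coder
    achieves this bound to within a couple of bits over the whole sequence.
    """
    bits = 0.0
    for i, x in enumerate(sequence):
        bits += -math.log2(prob(x, sequence[:i]))
    return bits

# Uniform coding distribution over a 4-symbol alphabet: 2 bits/symbol.
uniform = lambda x, ctx: 0.25
print(log_loss_bits("acgt" * 5, uniform))  # 40.0 bits for 20 symbols
```

A better model assigns higher probability to what actually occurs, so its log-loss, and hence the compressed size, is smaller.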

1007 | Monte Carlo Statistical Methods
- Robert, Casella
- 1999

Citation Context: ...n analytic (2) solution but often does not. When it does not, as in this situation, it is often necessary to turn to numerical integration approaches, including sampling and Monte Carlo integration [16]. In the case of the Pitman-Yor process, E[G(s)] can be computed as described at a high level in the following way. In addition to the counts {N(s′)}s′∈Σ, assume there is another set of random “cou...
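Monte Carlo integration, as invoked here, replaces an intractable expectation with a sample average. A toy sketch of the idea (not the paper's sampler over seating arrangements):

```python
import random

def monte_carlo_expectation(sampler, f, n=100_000):
    """Estimate E[f(X)] by averaging f over n draws from `sampler`."""
    return sum(f(sampler()) for _ in range(n)) / n

# Example: E[X^2] for X ~ Uniform(0, 1) is 1/3.
random.seed(0)
est = monte_carlo_expectation(random.random, lambda x: x * x)
print(est)  # close to 0.3333
```

The same recipe underlies the paper's estimate of E[G(s)]: draw posterior samples of the latent counts, average the resulting predictive probabilities.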

970 | Sequential Monte Carlo Methods in Practice
- Doucet, de Freitas, et al.
- 2001

Citation Context: ... using samples from the posterior distribution. The samples are obtained using Gibbs sampling [16] as in [18, 21], which repeatedly makes local changes to the counts, and using sequential Monte Carlo [5] as in [7], which iter... [Table 1 — Source / Perplexity: Bengio et al. [2] 109.0; Mnih et al. [13] 83.9; 4-gram Interpolated Kneser-Ney [3, 18] 106.1; 4-gram Modified Kneser-Ney [3, 18] 102.4; 4-gram Hierarchical PYP [18]...]

927 | An Empirical Study of Smoothing Techniques for Language Modeling
- Chen, Goodman
- 1998

Citation Context: ...e. This has led to the development of creative approaches to its avoidance. The language modeling and text compression communities have generally called these smoothing or back-off methodologies (see [3] and references therein). In the following we will propose a Bayesian approach that retains uncertainty in parameter estimation and thus avoids over-confident estimates. 4. BAYESIAN MODELING As oppose...
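The smoothing/back-off idea the excerpt describes can be illustrated with the simplest interpolated estimator, Jelinek-Mercer interpolation between bigram and unigram relative frequencies. This is a hedged sketch of the general idea, not Kneser-Ney itself; names and the mixing weight `lam` are illustrative:

```python
from collections import Counter

def interpolated_bigram(tokens, lam=0.7):
    """Jelinek-Mercer-style interpolation, a simple relative of the
    back-off/smoothing methods discussed in the text:

        P(w | u) = lam * count(u, w) / count(u) + (1 - lam) * count(w) / N

    so no continuation gets probability zero as long as w occurred at all.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def prob(w, u):
        ml = bigrams[(u, w)] / unigrams[u] if unigrams[u] else 0.0
        return lam * ml + (1 - lam) * unigrams[w] / n

    return prob

p = interpolated_bigram("the cat sat on the mat".split())
print(p("cat", "the"))  # observed bigram: 0.7*0.5 + 0.3*(1/6) ≈ 0.4
print(p("mat", "cat"))  # unseen bigram still gets unigram mass ≈ 0.05
```

The Bayesian approach the paper proposes reaches a similar interpolated form, but derives the mixing weights from a posterior rather than fixing them by hand.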

229 | The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator
- Pitman, Yor
- 1997

Citation Context: ... quite rarely under a power-law, our estimates of G(s) will often be inaccurate. To encode our prior knowledge about power-law scaling, we use a prior distribution called the Pitman-Yor process (PYP) [15], which is a distribution over the discrete probability distribution G = {G(s)}s∈Σ. It has three parameters: a base distribution G0 = {G0(s)}s∈Σ which is the mean of the PYP and reflects our prior bel...
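The power-law behaviour of the Pitman-Yor process can be seen by simulating its underlying two-parameter Chinese restaurant process. A sketch using the text's parameterization (discount α, concentration c, with c = 0 as assumed there); the function name is illustrative:

```python
import random

def pyp_table_counts(n, alpha, c=0.0, seed=0):
    """Sample table occupancy counts from the two-parameter Chinese
    restaurant process underlying a Pitman-Yor process with discount
    `alpha` and concentration `c`.

    With n_seen customers seated, the next customer joins existing
    table k with probability (counts[k] - alpha) / (n_seen + c) and
    opens a new table with probability (c + alpha * num_tables) /
    (n_seen + c).  With alpha > 0 the table sizes follow a power law:
    a few very large tables and a long tail of singletons, mirroring
    word-frequency behaviour.
    """
    rng = random.Random(seed)
    counts = [1]  # the first customer opens the first table
    for n_seen in range(1, n):
        r = rng.random() * (n_seen + c)
        new_table_mass = c + alpha * len(counts)
        if r < new_table_mass:
            counts.append(1)
        else:
            r -= new_table_mass
            for k in range(len(counts)):
                r -= counts[k] - alpha
                if r < 0:
                    counts[k] += 1
                    break
            else:  # guard against floating-point rounding
                counts[-1] += 1
    return counts

counts = pyp_table_counts(5000, alpha=0.8)
# Expect a handful of large tables and many singletons.
```

Setting `alpha=0` recovers the ordinary Chinese restaurant process of the Dirichlet process, whose table sizes do not show the power-law head/tail split.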

165 | A Neural Probabilistic Language Model
- Bengio, Ducharme, et al.
- 2003

Citation Context: ...es are obtained using Gibbs sampling [16] as in [18, 21], which repeatedly makes local changes to the counts, and using sequential Monte Carlo [5] as in [7], which iter... [Table 1 — Source / Perplexity: Bengio et al. [2] 109.0; Mnih et al. [13] 83.9; 4-gram Interpolated Kneser-Ney [3, 18] 106.1; 4-gram Modified Kneser-Ney [3, 18] 102.4; 4-gram Hierarchical PYP [18] 101.9; Sequence Memoizer [21] 96.9. Table 1: Language mode...]

123 | Selective Studies and the Principle of Relative Frequency in Language
- Zipf
- 1932

Citation Context: ...ese in turn in the rest of this section. 4.1 Power-Law Scaling As with many other natural phenomena like social networks and earthquakes, occurrences of words in a language follow a power-law scaling [23]. This means that there are a small number of words that occur disproportionately frequently (e.g. the, to, of), and a very large number of rare words that, although each occurs rarely, when taken tog...
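A quick way to see what power-law scaling implies: sample tokens from an idealized Zipf distribution (probability of the rank-r word proportional to 1/r) and look at how the mass concentrates. Purely synthetic data, for illustration only:

```python
import random
from collections import Counter

# Draw tokens from an ideal Zipf law over a 1000-word vocabulary, then
# check the classic signature: the top few words soak up a large share
# of all tokens while most word types occur only a handful of times.
rng = random.Random(0)
vocab = [f"w{r}" for r in range(1, 1001)]
weights = [1.0 / r for r in range(1, 1001)]
tokens = rng.choices(vocab, weights=weights, k=50_000)

counts = Counter(tokens)
top10_share = sum(c for _, c in counts.most_common(10)) / len(tokens)
rare_types = sum(1 for c in counts.values() if c <= 5)
print(f"top-10 words cover {top10_share:.0%} of tokens; "
      f"{rare_types} of {len(counts)} word types occur <= 5 times")
```

The "small head, long tail" shape is exactly what a maximum-likelihood estimate handles badly: most of the vocabulary is observed too rarely to estimate reliably.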

120 | Coalescents with Multiple Collisions
- Pitman
- 1999

Citation Context: ...nore G11 and marginalize it out from the model. Fortunately, a remarkable property related to an operation on Pitman-Yor processes called coagulation allows us to perform this marginalization exactly [14]. Specifically, in the case of G11|G1 ∼ PY(α2, G1) and G011|G11 ∼ PY(α3, G11), the property states simply that G011|G1 ∼ PY(α2α3, G1) where G11 has been marginalized out. In other words, the prior for ...
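The coagulation property quoted here is what lets a chain of Pitman-Yor priors collapse: the marginalized prior's discount is simply the product of the discounts along the path (concentrations taken to be zero, as elsewhere in the text). A trivial sketch with an illustrative function name:

```python
from math import prod

def collapse_discounts(discounts):
    """Discount of the collapsed prior after marginalizing interior nodes:
    G11|G1 ~ PY(a2, G1) and G011|G11 ~ PY(a3, G11) collapse to
    G011|G1 ~ PY(a2 * a3, G1); a longer chain collapses to the product
    of its discounts."""
    return prod(discounts)

# Marginalizing the interior nodes of a chain with discounts 0.9, 0.8, 0.5:
print(collapse_discounts([0.9, 0.8, 0.5]))  # ≈ 0.36
```

This is why non-branching chains of contexts can be removed from the tree without changing the model: each collapsed edge just carries the product of the discounts it absorbed.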

115 | Unbounded Length Contexts for PPM
- Cleary, Teahan
- 1997

Citation Context: ... sequence xT, xT−1, . . . , x1. Similar extensions from fixed-length to unbounded-length contexts, followed by reductions in the context trees, have also been developed in the compression literature [4, 19]. 5.3 Inference and Prediction As a consequence of the two marginalization steps described in the previous subsection, inference in the full sequence memoizer model with an infinite number of paramete...

89 | A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes
- Teh
- 2006

Citation Context: ...over each symbol xi in x, given that its context consisting of the previous n symbols xi−n:i−1 is u, is simply Gu. The hierarchical Bayesian model in (4) is called the hierarchical Pitman-Yor process [18]. It formally encodes our context tree similarity assumption about the conditional distributions using dependence among them induced by the hierarchy, with more similar distributions being more depend...
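The hierarchical PYP's predictive rule can be sketched with the common one-table-per-symbol-type approximation, under which it reduces to interpolated Kneser-Ney smoothing (a connection the cited literature makes explicit). Class and parameter names are illustrative; this is not the paper's sampler:

```python
from collections import defaultdict, Counter

class SimplePYPLM:
    """Skeletal hierarchical Pitman-Yor predictive rule (a sketch).
    With one table per symbol type the rule becomes:

        P(w | u) = (max(c(u,w) - d, 0)
                    + (theta + d * t(u)) * P(w | suffix(u))) / (c(u) + theta)

    where c(u,w) counts symbol w after context u, t(u) is the number of
    distinct symbols seen after u, and suffix(u) drops u's oldest symbol.
    """

    def __init__(self, order=2, discount=0.5, theta=0.0, vocab_size=1000):
        self.order, self.d, self.theta = order, discount, theta
        self.vocab_size = vocab_size
        self.counts = defaultdict(Counter)  # context tuple -> symbol counts

    def train(self, tokens):
        for i, w in enumerate(tokens):
            for n in range(self.order + 1):  # update every suffix context
                self.counts[tuple(tokens[max(0, i - n):i])][w] += 1

    def prob(self, w, context):
        return self._prob(w, tuple(context[-self.order:]) if self.order else ())

    def _prob(self, w, u):
        # Base case: uniform distribution over an assumed closed vocabulary;
        # otherwise back off to the context with the oldest symbol dropped.
        base = 1.0 / self.vocab_size if not u else self._prob(w, u[1:])
        c_u = sum(self.counts[u].values())
        if c_u == 0:
            return base
        num = max(self.counts[u][w] - self.d, 0.0)
        t_u = len(self.counts[u])
        return (num + (self.theta + self.d * t_u) * base) / (c_u + self.theta)

lm = SimplePYPLM(order=2, discount=0.75, vocab_size=100)
lm.train("a rose is a rose is a rose".split())
print(lm.prob("rose", ["is", "a"]))   # high: "rose" always follows "is a"
print(lm.prob("zebra", ["is", "a"]))  # small: falls through to the base
```

The discount `d` skims mass from every observed count and redistributes it through the back-off chain, which is what produces the power-law-friendly predictions the text describes.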

80 | Interpolating Between Types and Tokens by Estimating Power-Law Generators
- Goldwater, Griffiths, et al.
- 2006

Citation Context: ... all conditional distributions. It is worth noting that there is a well known connection between the hierarchical PYP and a type of smoothing for m-gram language models called interpolated Kneser-Ney [10, 18]. [Figure 2 residue: panels (a)–(c) of binary context trees over Gε, G0, G1, . . . , G111; caption begins “Figure 2: (a) Ful...”]

77 | Bayesian Data Analysis. Chapman & Hall/CRC
- Gelman, Carlin, et al.
- 2004

Citation Context: ...United States of. These contexts all share the same length-three suffix. In this section and the following one, we will discuss how this assumption can be codified using a hierarchical Bayesian model [11, 8]. To start we will only consider fixed, finite-length contexts. When we do this we say that we are making an nth-order Markov assumption. This means that each symbol only depends on the last n observ...
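The nth-order Markov assumption in code: condition each symbol only on the previous n. A minimal count-based sketch (names illustrative):

```python
from collections import Counter, defaultdict

def markov_counts(sequence, n=2):
    """Empirical conditional distributions under an nth-order Markov
    assumption: each symbol is modeled as depending only on the n
    symbols immediately before it."""
    table = defaultdict(Counter)
    for i in range(n, len(sequence)):
        context = tuple(sequence[i - n:i])
        table[context][sequence[i]] += 1
    return {u: {w: c / sum(cs.values()) for w, c in cs.items()}
            for u, cs in table.items()}

dist = markov_counts(list("abababac"), n=2)
print(dist[("a", "b")])  # {'a': 1.0}
```

These raw relative frequencies are exactly the over-confident maximum-likelihood estimates that the Bayesian treatment in the text is designed to avoid.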

73 | From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction
- Giegerich, Kurtz
- 1997

Citation Context: ...quence x (independent of |Σ|). At this point some readers may notice that the compact context tree has a structure reminiscent of a data structure for efficient string operations called a suffix tree [9]. In fact the structure of the compact context tree is given by the suffix tree for the reverse sequence xT, xT−1, . . . , x1. Similar extensions from fixed-length to unbounded-length contexts, foll...
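The suffix-tree relationship can be demonstrated with a naive suffix trie of the reversed sequence: every trie node is one context. Collapsing each non-branching chain of nodes into a single edge (path compression) is what brings the node count down to O(T). A sketch using a deliberately naive O(T²) construction, for illustration only:

```python
def suffix_trie_nodes(s):
    """Build a naive (uncompressed) suffix trie of `s` and count nodes.

    The compact context tree in the text is the suffix tree of the
    *reversed* sequence: merging non-branching chains of trie nodes
    into single edges reduces the node count to O(T), independent of
    the alphabet size.
    """
    root, nodes = {}, 1
    for i in range(len(s)):
        cur = root
        for ch in s[i:]:
            if ch not in cur:
                cur[ch] = {}
                nodes += 1
            cur = cur[ch]
    return nodes

# Contexts of "abracadabra" correspond to suffixes of its reverse.
print(suffix_trie_nodes("abracadabra"[::-1]))
```

The non-root node count equals the number of distinct substrings, which grows quadratically in general; the linear-time constructions surveyed in the cited paper avoid ever materializing this trie.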

39 | The Context-Tree Weighting Method: Extensions
- Willems

Citation Context: ... sequence xT, xT−1, . . . , x1. Similar extensions from fixed-length to unbounded-length contexts, followed by reductions in the context trees, have also been developed in the compression literature [4, 19]. 5.3 Inference and Prediction As a consequence of the two marginalization steps described in the previous subsection, inference in the full sequence memoizer model with an infinite number of paramete...

26 | A Hierarchical Dirichlet Language Model. Natural Language Engineering
- MacKay, Peto
- 1995

Citation Context: ...United States of. These contexts all share the same length-three suffix. In this section and the following one, we will discuss how this assumption can be codified using a hierarchical Bayesian model [11, 8]. To start we will only consider fixed, finite-length contexts. When we do this we say that we are making an nth-order Markov assumption. This means that each symbol only depends on the last n observ...

11 | Lossless Compression Based on the Sequence Memoizer
- Gasthaus, Wood, et al.
- 2010

Citation Context: ...ples from the posterior distribution. The samples are obtained using Gibbs sampling [16] as in [18, 21], which repeatedly makes local changes to the counts, and using sequential Monte Carlo [5] as in [7], which iter... [Table 1 — Source / Perplexity: Bengio et al. [2] 109.0; Mnih et al. [13] 83.9; 4-gram Interpolated Kneser-Ney [3, 18] 106.1; 4-gram Modified Kneser-Ney [3, 18] 102.4; 4-gram Hierarchical PYP [18] 101.9; Seq...]

11 | A Stochastic Memoizer for Sequence Data
- Wood, Archambeau, et al.
- 2009

Citation Context: ...ation (5), we use stochastic (Monte Carlo) approximations where the expectation is approximated using samples from the posterior distribution. The samples are obtained using Gibbs sampling [16] as in [18, 21], which repeatedly makes local changes to the counts, and using sequential Monte Carlo [5] as in [7], which iter... [Table 1 — Source / Perplexity: Bengio et al. [2] 109.0; Mnih et al. [13] 83.9; 4-gram Interpolated Knes...]

5 | Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor
- Bartlett, Pfau, et al.
- 2010

Citation Context: ... extra arguments. extensions to higher orders impractical. The SM model directly fixes this problem while remaining computationally tractable. Constant space, constant time extensions to the SM model [1] have been developed which show great promise for language modeling and other applications. 6.2 Compression Shannon’s celebrated results in information theory [17] have led to lossless compression tech...

4 | Improving a Statistical Language Model Through Non-Linear Prediction
- Mnih, Zhang, et al.
- 2009

Citation Context: ...Gibbs sampling [16] as in [18, 21], which repeatedly makes local changes to the counts, and using sequential Monte Carlo [5] as in [7], which iter... [Table 1 — Source / Perplexity: Bengio et al. [2] 109.0; Mnih et al. [13] 83.9; 4-gram Interpolated Kneser-Ney [3, 18] 106.1; 4-gram Modified Kneser-Ney [3, 18] 102.4; 4-gram Hierarchical PYP [18] 101.9; Sequence Memoizer [21] 96.9. Table 1: Language modeling performance for a ...]

3 | CTW Website
- Willems
- 2009

Citation Context: ...l entropy encoding, lower is better) for the Calgary corpus, a standard benchmark collection of diverse filetypes. The results for unbounded-length context PPM are from [4]. The results for CTW are from [20]. The bzip2 and gzip results come from running the corresponding standard Unix command-line tools with no extra arguments. extensions to higher orders impractical. The SM model directly fixes this pro...

2 | Improvements to the Sequence Memoizer
- Gasthaus, Teh
- 2010

Citation Context: ...ty around the mean G0. When α = 0 the Pitman-Yor process loses its power-law properties and reduces to the more well-known Dirichlet process. In this paper we assume c = 0 instead for simplicity; see [6] for the more general case when c is allowed to be positive. When we write G ∼ PY(α, G0) it means that G has a prior given by a Pitman-Yor process with the given parameters. Figure 1 illustrates the p...

2 | Large Text Compression Benchmark
- Mahoney
- 2014

Citation Context: ... types and varying lengths. In addition to the experiments on the Calgary corpus, SM compression performance was also evaluated on a 100 MB excerpt of the English version of Wikipedia (XML text dump) [12]. On this excerpt, the SM model achieved a log-loss of 1.66 bits/symbol, amounting to a compressed file size of 20.80 MB. While this is worse than the 16.23 MB achieved by the best demonstrated Wikipedia co...

2 | A New PPM Variant for Chinese Text Compression
- Wu, Teahan

Citation Context: ...entative text file, the Chinese Union Version of the Bible, we achieved a log-loss of 4.91 bits per Chinese character, which is significantly better than the best result in the literature (5.44 bits) [22]. 7. CONCLUSIONS The sequence memoizer achieves improved compression and language modeling performance. These application-specific performance improvements are arguably by themselves worthwhile scient...