## Separating Precision and Mean in Dirichlet-enhanced High-order Markov Models

### BibTeX

```bibtex
@MISC{Takahashi_separatingprecision,
  author = {Rikiya Takahashi},
  title  = {Separating Precision and Mean in Dirichlet-enhanced High-order Markov Models},
  year   = {}
}
```

### Abstract

Robustly estimating the state-transition probabilities of high-order Markov processes is an essential task in many applications such as natural language modeling or protein sequence modeling. We propose a novel estimation algorithm called Hierarchical Separated Dirichlet Smoothing (HSDS), where Dirichlet distributions are hierarchically assumed to be the prior distributions of the state-transition probabilities. The key idea in HSDS is to separate the parameters of a Dirichlet distribution into the precision and mean, so that the precision depends on the context while the mean is given by the lower-order distribution. HSDS is designed to outperform Kneser-Ney smoothing, currently the state-of-the-art technique for N-gram natural language models, especially when the number of states is small. Our experiments in protein sequence modeling showed the superiority of HSDS in both perplexity evaluation and classification tasks.
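
As a concrete illustration of the abstract's key idea, the reparameterization of a Dirichlet parameter vector into a scalar precision and a normalized mean can be sketched as follows (a minimal sketch; the function names are illustrative, not from the paper):

```python
# Sketch: split a Dirichlet parameter vector a = (a_1, ..., a_K) into a
# scalar precision alpha and a mean vector theta, so that a_s = alpha * theta_s.
# In HSDS the precision is context-dependent while the mean comes from the
# lower-order distribution; here we only show the reparameterization itself.

def separate(a):
    """Split Dirichlet parameters into (precision, mean)."""
    alpha = sum(a)                    # precision: total concentration
    theta = [x / alpha for x in a]    # mean: normalized parameters
    return alpha, theta

def combine(alpha, theta):
    """Rebuild the Dirichlet parameter vector from precision and mean."""
    return [alpha * t for t in theta]

alpha, theta = separate([2.0, 1.0, 1.0])
# alpha == 4.0, theta == [0.5, 0.25, 0.25]
```

Separating the two roles is what lets the model tie the mean to the lower-order distribution while fitting the precision per context.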

### Citations

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language
- Chen, Goodman
- 1999
Citation Context: ...es to estimate the state-transition probabilities in high-order Markov processes. Using state-transition probabilities for N-grams, a high-order Markov process is often used to model natural language [1], protein sequences [2], or the dynamics of consumers [3], where one state is assigned to each word, amino acid, or customer type, respectively. In these applications, the statistical robustness of th...

305 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: ...interpolate the probabilities of the N-grams, (N−1)-grams, and lower-order distributions. In natural language modeling, the most advanced smoothing techniques currently used are Kneser-Ney smoothing [4] and its derivative versions [1]. The essence of Kneser-Ney smoothing and its derivatives is a modification of the state-transition frequencies in calculating the lower-order distributions, so that an...
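
The "modification of the state-transition frequencies" that this context describes is, in interpolated Kneser-Ney, the use of continuation counts — how many distinct contexts precede each state — in place of raw frequencies when building the lower-order distribution. A minimal sketch of that standard construction for bigrams (names are illustrative):

```python
# Sketch: Kneser-Ney continuation counts. The lower-order distribution is
# built from the number of distinct preceding contexts per state, not from
# raw frequencies.
from collections import defaultdict

def continuation_counts(bigrams):
    """For each state w, count the distinct states that precede it."""
    preceding = defaultdict(set)
    for prev, w in bigrams:
        preceding[w].add(prev)
    return {w: len(p) for w, p in preceding.items()}

# "york" is frequent but almost always follows "new", so its continuation
# count (1) is far below its raw frequency (3).
data = [("new", "york"), ("new", "york"), ("new", "york"),
        ("the", "dog"), ("a", "dog")]
counts = continuation_counts(data)
# counts == {"york": 1, "dog": 2}
```

This is why Kneser-Ney's lower-order distribution assigns little mass to states that occur often but only in a few stereotyped contexts.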

247 | The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation Context: ...Hierarchical Separated Dirichlet Smoothing (HSDS), Interpolated Kneser-Ney Smoothing (IKNS), Modified Kneser-Ney Smoothing (MKNS), Absolute Discounting (ABSD) [12], and Witten-Bell smoothing (WBS) [13]. The smoothing methods except for IKNS and MKNS were selected to compare the performances broadly. The formulas used in IKNS and MKNS are described in [1], where we adopted the versions without cross...

234 | The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator
- Pitman, Yor
- 1997
Citation Context: ...[Fig. 1. Distributions of the states controlled by the Dirichlet precision α] ...the transition probability p_hs on the posterior distribution is given as

⟨p_hs | D⟩ = (n_hs + α_h θ_π(h)s) / (n_h + α_h).  (5)

Following Minka in [10], we call α_h the "Dirichlet precision" and θ_π(h) the "Dirichlet mean". The Dirichlet mean is hierarchically given by the expectation of the lower-order distribution, ma...
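
The posterior mean in Eq. (5) is a precision-weighted interpolation between the empirical transition frequencies and the lower-order mean; a minimal sketch, with illustrative names rather than the paper's code:

```python
# Sketch of Eq. (5): <p_hs | D> = (n_hs + alpha_h * theta_s) / (n_h + alpha_h).
# counts holds the transition frequencies n_hs for one context h, alpha is the
# Dirichlet precision alpha_h, and theta is the lower-order (Dirichlet) mean.

def posterior_mean(counts, alpha, theta):
    """Posterior-mean transition probabilities for one context."""
    n_h = sum(counts)
    return [(n + alpha * t) / (n_h + alpha) for n, t in zip(counts, theta)]

probs = posterior_mean([3, 1, 0], alpha=2.0, theta=[0.5, 0.3, 0.2])
# As alpha -> 0 the estimate approaches the empirical frequencies;
# as alpha -> infinity it approaches the lower-order mean theta.
```

The precision thus directly controls how strongly the context's counts are smoothed toward the lower-order distribution.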

184 | On structuring probabilistic dependency in stochastic language modeling. Computer Speech & Language 8
- Ney
- 1994
Citation Context: ...[Fig. 3. Test-set perplexity in Reuters-21578 dataset] ...Hierarchical Separated Dirichlet Smoothing (HSDS), Interpolated Kneser-Ney Smoothing (IKNS), Modified Kneser-Ney Smoothing (MKNS), Absolute Discounting (ABSD) [12], and Witten-Bell smoothing (WBS) [13]. The smoothing methods except for IKNS and MKNS were selected to compare the performances broadly. The formulas used in IKNS and MKNS are described in [1], where...

163 | Reuters-21578 text categorization test collection, distribution 1.0
- Lewis
- 1997
Citation Context: ...pted the versions without cross-validation in estimating the discounting factors. 4.1 Natural Language Modeling: As natural language data, we used the Reuters-21578 text categorization test collection [14], which is a popular English corpus mainly used for text categorization research. By extracting all of the text, we prepared 172,900 sentences that consisted of a total of 2,804,960 words, and divided...

89 | A hierarchical Bayesian language model based on Pitman-Yor processes
- Teh
Citation Context: ...e the prior distribution is given by P(p_h) = Dir(p_h; α θ_π(h)). Yet the original MacKay and Peto hierarchical Dirichlet language model was shown to be noncompetitive with other smoothing techniques [8]. While they discuss extensions that make the Dirichlet precision context-dependent by classifying the contexts, HSDS is the first competitive method to extend the hierarchical Dirichlet language...

81 | Interpolating between types and tokens by estimating power law generators
- Goldwater, Griffiths, et al.
- 2006
Citation Context: ...troduce the concept of the effective frequency by deriving the infimum of the likelihood of the training data. Since the observed frequencies n_h obey a Dirichlet-multinomial (Polya) distribution, Eq. (6) gives the likelihood of D under the set of hyperparameters Φ = {α_h, θ_π(h); h ∈ S^(N−1)}. We referred to [10] in deriving Inequality (7).

P(D | Φ) ∝ ∏_h [Γ(α_h) / Γ(n_h + α_h)] ∏_{s: n_hs>0} [Γ(n_hs + α_h θ_π(h)s) / Γ(α_h θ_π(h)s)]  (6)

...
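
For a single context h, the Dirichlet-multinomial (Polya) term inside the product of Eq. (6) can be evaluated stably in log space using only the standard library's lgamma; a sketch under illustrative names:

```python
# Sketch: log of  Gamma(a) / Gamma(n + a) * prod_{s: n_s > 0}
# Gamma(n_s + a * t_s) / Gamma(a * t_s),  the Polya likelihood term of
# Eq. (6) for one context, with a = alpha_h and t = the Dirichlet mean.
from math import lgamma

def polya_log_likelihood(counts, alpha, theta):
    """Log-likelihood of one context's counts under a Polya distribution."""
    n = sum(counts)
    ll = lgamma(alpha) - lgamma(n + alpha)
    for n_s, t_s in zip(counts, theta):
        if n_s > 0:
            ll += lgamma(n_s + alpha * t_s) - lgamma(alpha * t_s)
    return ll

# Single draw with a uniform mean: either outcome has probability 1/2.
ll = polya_log_likelihood([1, 0], alpha=1.0, theta=[0.5, 0.5])
# ll == log(1/2)
```

Working in log space avoids the overflow that the raw gamma ratios would hit for realistic counts.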

26 | A hierarchical Dirichlet language model. Natural Language Engineering
- MacKay, Peto
- 1995
Citation Context: ...et distribution is a specific parameter for each context, to incorporate the numbers of unique states depending on that context. Our model is an extension of the hierarchical Dirichlet language model [9]. For a given discrete-state space S whose size is |S|, assume we want to predict a prospective state s_N that will follow a state sequence s_1, s_2, ..., s_(N−1) with bounded length N ≥ 1. Since s_N is a...

19 | A Bayesian interpretation of interpolated Kneser-Ney
- Teh
- 2006
Citation Context: ...encies n_h obey a Dirichlet-multinomial (Polya) distribution, Eq. (6) gives the likelihood of D under the set of hyperparameters Φ = {α_h, θ_π(h); h ∈ S^(N−1)}. We referred to [10] in deriving Inequality (7).

P(D | Φ) ∝ ∏_h [Γ(α_h) / Γ(n_h + α_h)] ∏_{s: n_hs>0} [Γ(n_hs + α_h θ_π(h)s) / Γ(α_h θ_π(h)s)]  (6)

≥ ∏_h [Γ(ᾱ_h) / Γ(n_h + ᾱ_h)] exp[(Ψ(n_h + ᾱ_h) − Ψ(ᾱ_h))(ᾱ_h − α_h)] ∏_{s: n_hs>0} [Γ(n_hs + ᾱ_h θ̄_π(h)s) / Γ(ᾱ_h θ̄...

18 | DBsubLoc: database of protein subcellular localization
- Guo, Hua, et al.
- 2004
Citation Context: ...in calculating the lower-order distributions. 4.2 Protein Sequence Modeling: For protein sequence data, we performed classification tests as well as a perplexity evaluation, using the DBsubloc dataset [15]. Though DBsubloc is a protein database mainly used for protein subcellular localization, because of the amount of available data, we only classified the unlabeled data into one of 4 types of organism...

16 | A hidden Markov model of customer relationship dynamics. Working paper
- Netzer, Lattin, et al.
- 2005
Citation Context: ...-order Markov processes. Using state-transition probabilities for N-grams, a high-order Markov process is often used to model natural language [1], protein sequences [2], or the dynamics of consumers [3], where one state is assigned to each word, amino acid, or customer type, respectively. In these applications, the statistical robustness of the estimated state-transition probabilities often becomes...

4 | BLMT: Statistical Sequence Analysis using N-Grams
- Ganapathiraju, Manoharan, et al.
Citation Context: ...e-transition probabilities in high-order Markov processes. Using state-transition probabilities for N-grams, a high-order Markov process is often used to model natural language [1], protein sequences [2], or the dynamics of consumers [3], where one state is assigned to each word, amino acid, or customer type, respectively. In these applications, the statistical robustness of the estimated state-trans...

4 | Beyond Newton's method
- Minka
- 2002
Citation Context: ...approximated by the gamma distribution that has the same expectation. We can immediately derive the following equation, which we can solve quickly with the modified Newton-Raphson method proposed in [11].

Ψ(n_h + α*_h) − Ψ(α*_h) = 1/α*_h + ∑_{s: n_hs>0} θ_π(h)s [Ψ(n_hs + α*_h θ_π(h)s) − Ψ(α*_h θ_π(h)s)]  (12)

Note that the estimated α*_h values tend to be underestimated when n_h is small, because the...
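
Eq. (12) is a one-dimensional root-finding problem in α*_h. The paper solves it with the modified Newton-Raphson method of [11]; the sketch below instead uses plain bisection for simplicity, and exploits the identity Ψ(x + n) − Ψ(x) = Σ_{k=0}^{n−1} 1/(x + k) for integer counts so that no special-function library is needed (all names are illustrative):

```python
# Sketch: solve Eq. (12) for the Dirichlet precision alpha*_h by bisection.

def psi_diff(x, n):
    """Psi(x + n) - Psi(x) for integer n >= 0, via the digamma recurrence."""
    return sum(1.0 / (x + k) for k in range(n))

def residual(alpha, counts, theta):
    """LHS minus RHS of Eq. (12); a root is a candidate alpha*_h."""
    n_h = sum(counts)
    lhs = psi_diff(alpha, n_h)
    rhs = 1.0 / alpha + sum(
        t * psi_diff(alpha * t, n_s)
        for n_s, t in zip(counts, theta) if n_s > 0
    )
    return lhs - rhs

def solve_alpha(counts, theta, lo=0.5, hi=10.0, iters=100):
    """Bisection; assumes the residual changes sign on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(lo, counts, theta) * residual(mid, counts, theta) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Skewed counts with a uniform lower-order mean yield a finite precision.
alpha_star = solve_alpha([9, 1, 0], [1/3, 1/3, 1/3])
```

Bisection is slower than the modified Newton-Raphson update but has no divergence failure modes, which makes it a convenient reference implementation.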