## Latent-Variable Modeling of String Transductions with Finite-State Methods ∗

Citations: 17 (3 self)

### BibTeX

@MISC{Dreyer_latent-variablemodeling,

author = {Markus Dreyer and Jason R. Smith and Jason Eisner},

title = {Latent-Variable Modeling of String Transductions with Finite-State Methods ∗},

year = {}

}

### Abstract

String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method, reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38–92%.

### Citations

991 | Moses: Open source toolkit for statistical machine translation
- Koehn, Hoang, et al.
- 2007
Citation Context ...ters wide, and successive window positions overlap. This stands in contrast to a competing approach (Sherif and Kondrak, 2007; Zhao et al., 2007) that is inspired by phrase-based machine translation (Koehn et al., 2007), which segments the input string into substrings that are transduced independently, ignoring context. At the other extreme, Freitag and Khadivi (2007) use no alignment; each feature takes its ow... |
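The overlapping-window idea in the excerpt above can be sketched in a few lines. The padding symbol and default width below are illustrative choices, not the paper's exact configuration:

```python
def windows(s, width=3):
    """Enumerate overlapping character windows over a padded string.

    Successive positions overlap (stride 1), so each character is scored
    in several contexts. This contrasts with phrase segmentation, which
    would split s into disjoint substrings transduced independently.
    """
    pad = "#" * (width - 1)  # '#' is an illustrative boundary symbol
    padded = pad + s + pad
    return [padded[i:i + width] for i in range(len(padded) - width + 1)]

print(windows("abc", width=3))  # ['##a', '#ab', 'abc', 'bc#', 'c##']
```

Note that every interior character appears in `width` different windows, which is what lets features see left and right context around each position.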

519 | On the limited memory BFGS method for large scale optimization
- Liu, Nocedal
- 1989
Citation Context ...l log-likelihood $\sum_{(x,y^*) \in C} \log p_\theta(y^* \mid x) - \|\theta\|^2 / 2\sigma^2$ (2), where C is a supervised training corpus. To maximize (2) during training, we apply the gradient-based optimization method L-BFGS (Liu and Nocedal, 1989). E.g., Toutanova et al. (2008) improve MT performance by selecting correct morphological forms from a knowledge source. We instead focus on generalizing from observed forms and generating new f... |
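The training setup in the excerpt, maximizing an L2-regularized conditional log-likelihood with L-BFGS, can be sketched on a toy model. The binary logistic form and the data below are stand-ins for illustration, not the paper's transduction model:

```python
import numpy as np
from scipy.optimize import minimize

# Toy supervised corpus: 2-dim feature vectors with binary labels.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1, 0, 1, 0])
sigma2 = 10.0  # regularization variance sigma^2

def neg_objective(theta):
    # Negative of: sum_i log p(y_i | x_i) - ||theta||^2 / (2 sigma^2).
    # We negate because scipy minimizes, while training maximizes.
    z = X @ theta
    log_p = y * z - np.logaddexp(0.0, z)  # log p(y | x) for a logistic model
    return -(log_p.sum() - theta @ theta / (2.0 * sigma2))

result = minimize(neg_objective, np.zeros(2), method="L-BFGS-B")
print(result.x)  # learned weights
```

The same pattern scales to the paper's setting: swap `neg_objective` for the (negated) regularized corpus log-likelihood and supply its gradient for efficiency.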

319 | Finite-state transducers in language and speech processing
- Mohri
- 1997
Citation Context ...e problem is that the probability of each string y is a sum over many paths in T[x] that reflect different alignments of y to x. Although it is straightforward to use a determinization construction (Mohri, 1997) to collapse these down to a single path per y (so that ŷ is easily read off the single best path), determinization can increase the WFSA’s size exponentially. We approximate by pruning T[x] bac... |

200 | Learning string edit distance
- Ristad, Yianilos
- 1998
Citation Context ...or solving such word transduction problems. Our results in morphology generation show that the presented approach improves upon the state of the art. 2 Model Structure A weighted edit distance model (Ristad and Yianilos, 1998) would consider each character in isolation. To consider more context, we pursue a very natural generalization. Given an input x, we evaluate a candidate output y by moving a sliding window over the ... |
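The baseline the excerpt generalizes from is classical weighted edit distance, where each character is scored in isolation. A minimal dynamic-programming sketch with fixed (rather than learned, Ristad-and-Yianilos-style) costs:

```python
def weighted_edit_distance(x, y, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Standard DP for weighted edit distance. Each edit is scored
    independently of its neighbors -- exactly the context-blindness the
    sliding-window model is designed to overcome."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,      # delete x[i-1]
                d[i][j - 1] + ins_cost,      # insert y[j-1]
                d[i - 1][j - 1]
                + (0.0 if x[i - 1] == y[j - 1] else sub_cost),
            )
    return d[m][n]

print(weighted_edit_distance("brechen", "gebrochen"))  # 3.0
```

Two insertions ("ge") plus one substitution (e to o) give cost 3.0; note nothing in the recurrence can condition a cost on surrounding characters.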

89 | Efficient generation in primitive Optimality Theory
- Eisner
- 1997
Citation Context ...ne recognition—and phonological features and syllable boundaries. Indeed, our local log-linear features over several aligned latent strings closely resemble the soft constraints used by phonologists (Eisner, 1997). Finally, rather than define a fixed set of feature templates as in Fig. 2, we would like to refine empirically useful features during training, resulting in language-specific backoff patterns and a... |

44 | Conditional and Joint Models for Grapheme-to-Phoneme Conversion - Chen - 2003 |

31 | OpenFst: a general and efficient weighted finite-state transducer library - Allauzen, Riley, et al. - 2007 |

31 | Applying many-to-many alignments and Hidden Markov Models to letter-to-phoneme conversion
- Jiampojamarn, Kondrak, et al.
- 2007
Citation Context ... latent variables, and training methods that port well across languages and string-transduction tasks. We would like to use features that look at wide context on the input side, which is inexpensive (Jiampojamarn et al., 2007). Latent variables we wish to consider are an increased number of word classes; more flexible regions—see Petrov et al. (2007) on learning a state transition diagram for acoustic regions in phone rec... |

31 | Substring-based transliteration
- Sherif, Kondrak
- 2007
Citation Context ...robability based on the material that appears within the current window. The window is a few characters wide, and successive window positions overlap. This stands in contrast to a competing approach (Sherif and Kondrak, 2007; Zhao et al., 2007) that is inspired by phrase-based machine translation (Koehn et al., 2007), which segments the input string into substrings that are transduced independently, ignoring context. ... |

27 | Investigations on joint-multigram models for grapheme-to-phoneme conversion - Bisani, Ney |

23 | Bi-directional conversion between graphemes and phonemes using a joint n-gram model - Galescu, Allen - 2001 |

20 | Computational complexity of problems on probabilistic grammars and transducers
- Casacuberta, Higuera
- 2000
Citation Context ...in the WFSA T[x], where $T = \pi_x^{-1} \circ U_\theta \circ \pi_y$ is the trained transducer that maps x nondeterministically to y. Alas, it is NP-hard to find the highest-probability string in a WFSA, even an acyclic one (Casacuberta and Higuera, 2000). The problem is that the probability of each string y is a sum over many paths in T[x] that reflect different alignments of y to x. Although it is straightforward to use a determinization construct... |
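The core difficulty in the excerpt, that a string's probability is a sum over many alignment paths, means the single best path need not name the best string. A tiny hypothetical path list makes the gap concrete (the strings and weights are invented for illustration):

```python
from collections import defaultdict

# Hypothetical acyclic WFSA T[x], flattened into (output string, path
# probability) pairs. Several paths (alignments) can emit the same y,
# so p(y) is the SUM of its path probabilities.
paths = [("bat", 0.20), ("bat", 0.25), ("bad", 0.40)]

# Viterbi-style answer: the string labeling the single best path.
best_path_string = max(paths, key=lambda p: p[1])[0]

# Correct answer: marginalize over alignments before taking the argmax.
string_prob = defaultdict(float)
for y, p in paths:
    string_prob[y] += p
best_string = max(string_prob, key=string_prob.get)

print(best_path_string, best_string)  # bad bat -- the criteria disagree
```

Here "bad" wins on best path (0.40) but "bat" wins on total probability (0.45), which is why exact decoding needs determinization (potentially exponential) or the pruning approximation the paper uses.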

17 | Efficient, feature-based, conditional random field parsing - Finkel, Kleeman, et al. - 2008 |

10 | Efficient training methods for maximum entropy language modelling
- Wu, Khudanpur
- 2000
Citation Context ...), each of which counts the occurrences in A of a particular n-gram of alignment characters. The log-linear framework lets us include n-gram features of different lengths, a form of backoff smoothing (Wu and Khudanpur, 2000). We use additional backoff features on alignment strings to capture phonological, morphological, and orthographic generalizations. Examples are found in features (b)-(h) in Fig. 2. Feature (b) match... |
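Counting alignment n-grams of several lengths, the backoff-smoothing device in the excerpt, is easy to sketch. The `"x:y"` alignment-character encoding below is illustrative, not the paper's exact alphabet:

```python
def ngram_features(alignment, max_n=3):
    """Count n-grams of lengths 1..max_n over an alignment string.
    Mixing lengths in one log-linear model lets short n-grams act as
    backoff features for their longer, sparser extensions."""
    feats = {}
    for n in range(1, max_n + 1):
        for i in range(len(alignment) - n + 1):
            gram = tuple(alignment[i:i + n])
            feats[gram] = feats.get(gram, 0) + 1
    return feats

# Toy alignment characters: input char paired with output char.
A = ["b:b", "r:r", "e:o", "c:c"]
print(ngram_features(A, max_n=2))
```

Each feature's count would then be multiplied by a learned weight; the unigram `("e:o",)` fires for every context in which that substitution occurs, backing off the rarer bigrams that contain it.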

9 | A sequence alignment model based on the averaged perceptron - Freitag, Khadivi - 2007 |

8 | Learning structured models for phone recognition - Petrov, Pauls, et al. - 2007 |

7 | constraints and morphological preprocessing for grapheme-to-phoneme conversion - Phonological |

1 | Programming pearls [column]
- Bentley
- 1986
Citation Context ...eatures. How do we specify these features to the above construction? Rather than writing ordinary code to extract features from a window, we find it convenient to harness FSTs as a “little language” (Bentley, 1986) for specifying entire sets of features. A feature template T is a nondeterministic FST that maps the contents of the sliding window, such as abc, to one or more features, which are also described a... |
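A feature template mapping one window to many features can be mimicked without an FST library. The sketch below enumerates the same outputs a simple nondeterministic template would: each window character is either kept or replaced by a wildcard `?` (the wildcard symbol and the keep-or-abstract template are this sketch's assumptions, not the paper's actual templates):

```python
from itertools import product

def expand_template(window):
    """Map a window's contents to a set of backoff features by
    nondeterministically keeping each character or abstracting it to
    '?'. A real implementation would compose the window with a small
    nondeterministic FST; this enumeration yields the same output set."""
    options = [(ch, "?") for ch in window]
    return {"".join(choice) for choice in product(*options)}

print(sorted(expand_template("abc")))
```

For a width-3 window this emits 2³ = 8 features, from the fully specific `abc` down to the fully abstracted `???`, which is exactly the kind of backoff-pattern family a template compactly describes.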

1 | Smooth bilingual n-gram translation - Fonollosa - 2007 |