Results 1 - 2 of 2
Optimizing Lexical and N-gram Coverage Via Judicious Use of Linguistic Data
- In Proc. European Conf. on Speech Technology
Abstract - Cited by 31 (4 self)
I study the effect of various types and amounts of North American Business (NAB) language data on the quality of the derived vocabulary, and use my findings to derive an improved ranking of the words, using only 19% of the NAB corpus. I then study the conflicting effects of increased vocabulary size on a speech recognizer's accuracy, and use the result to pick an optimal vocabulary size. A similar analysis of n-gram coverage yields a very different outcome, with the best system being the one based on the most data.

1. Vocabulary Optimization

1.1. OOV curve minimization

Since Out-Of-Vocabulary (OOV) rate directly affects Word Error Rate, with every OOV word in the test data resulting in at least one (and often more) recognition errors, I set out to minimize the expected OOV rate of the test data. More generally, my goal was to understand how availability of various types and amounts of training data, from various time periods, affects the quality of the derived vocabulary. Given a colle...
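
As a rough illustration of the OOV rate discussed above, the following minimal sketch measures how the OOV rate of a test text changes with vocabulary size. It assumes the vocabulary is simply the N most frequent training words; that ranking is an assumption made here for illustration and is not the improved ranking the paper derives. The function names and the toy sentences are hypothetical.

```python
from collections import Counter


def top_n_vocabulary(training_tokens, n):
    """Return the n most frequent training words as a set.

    Plain frequency ranking is an assumption of this sketch,
    not the ranking method described in the paper.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(n)}


def oov_rate(vocabulary, test_tokens):
    """Fraction of test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for token in test_tokens if token not in vocabulary)
    return misses / len(test_tokens)


# Toy data, invented for this example only.
train = "the market rose as the bank cut rates and the market cheered".split()
test = "the bank raised rates as the market fell".split()

# Print one point of the OOV curve per vocabulary size.
for size in (1, 2, 9):
    vocab = top_n_vocabulary(train, size)
    print(size, oov_rate(vocab, test))
```

Sweeping the vocabulary size in this way traces an OOV curve; words that never occur in the training text stay out of vocabulary no matter how large the vocabulary grows.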
Optimizing Lexical and N-gram Coverage
Abstract
I study the effect of various types and amounts of North American Business (NAB) language data on the quality of the derived vocabulary, and use my findings to derive an improved ranking of the words, using only 19% of the NAB corpus. I then study the conflicting effects of increased vocabulary size on a speech recognizer's accuracy, and use the result to pick an optimal vocabulary size. A similar analysis of n-gram coverage yields a very different outcome, with the best system being the one based on the most data.

