Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p^2 (2000) [34 citations — 0 self]
http://acl.ldc.upenn.edu/C/C00/C00-1027.pdf
http://nlp3.korea.ac.kr/proceeding/coling2000/COLI
Abstract:
Repetition is very common. Adaptive language models, which allow probabilities to change or adapt after seeing just a few words of a text, were introduced in speech recognition to account for text cohesion. Suppose a document mentions Noriega once. What is the chance that he will be mentioned again? If the first instance has probability p, then under standard (bag-of-words) independence assumptions, two instances ought to have probability p^2, but we find the probability is actually closer to p/2. The first mention of a word obviously depends on frequency, but surprisingly, the second does not. Adaptation depends more on lexical content than frequency; there is more adaptation for content words (proper nouns, technical terminology and good keywords for information retrieval), and less adaptation for function words, cliches and ordinary first names.
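The p^2 versus p/2 comparison in the abstract can be illustrated with a minimal sketch. It assumes document-level counting: p is estimated as the fraction of documents containing the word at least once, and the chance of two instances as the fraction containing it at least twice. The function name `adaptation_estimates`, the toy corpus, and this particular estimator are illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter


def adaptation_estimates(docs, word):
    """Estimate first-mention and repeat-mention probabilities for `word`.

    docs: list of tokenized documents (each a list of strings).
    Returns (p, joint, conditional):
      p           fraction of documents with at least one occurrence
      joint       fraction of documents with at least two occurrences
      conditional joint / p, the chance of a second mention given a first
    """
    n = len(docs)
    counts = [Counter(doc)[word] for doc in docs]
    df1 = sum(1 for c in counts if c >= 1)  # documents with >= 1 occurrence
    df2 = sum(1 for c in counts if c >= 2)  # documents with >= 2 occurrences
    p = df1 / n
    joint = df2 / n
    conditional = df2 / df1 if df1 else 0.0
    return p, joint, conditional


# Hypothetical toy corpus: 10 documents, 2 mention "noriega", 1 of them twice.
toy_docs = [
    ["noriega", "fled", "noriega", "returned"],
    ["troops", "sought", "noriega"],
] + [["markets", "rose", "again"]] * 8

p, joint, conditional = adaptation_estimates(toy_docs, "noriega")
independence_prediction = p ** 2  # what bag-of-words independence would predict

print("p =", p)
print("observed chance of two instances =", joint)
print("independence prediction p^2 =", independence_prediction)
print("chance of a second mention given a first =", conditional)
```

The toy corpus is constructed so that the observed chance of two instances equals p/2, far above p^2; on real text these quantities would have to be measured per word, and they need not match the toy figures.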
Citations
| 144 | Frequency Analysis of English Usage – Francis - 1982 |
| 53 | Poisson mixtures – Gale - 1995 |
| 38 | Context And Structure In Automated Full-Text Information Access. Doctor of Philosophy Thesis – Hearst - 1994 |
| 20 | Dynamic nonlocal language modeling via hierarchical topic-based adaptation – Florian, Yarowsky - 1999 |

