Toward the Principled Utilization . . .
Abstract
The language modelling approach to Information Retrieval (IR) has generated much interest in the field since its conception in 1998 [73]. However, some serious questions have been asked about the integrity of the language modelling approach. Specifically, it does not model relevance explicitly, unlike traditional probabilistic models of IR such as the Binary Independence Model [93]. Instead, it relies upon several underlying assumptions which are touted as being correlated with relevance. In this document, we provide a review of current state-of-the-art language modelling approaches to IR and discuss the conjecture surrounding the language modelling approach. We then provide a study which analyzes the relationship between perplexity and Average Precision that underpins the language modelling approach. We conclude this document by detailing some potential future directions of the Ph.D., the expected contributions of the work, and the proposed timetable.
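For readers unfamiliar with the two quantities the abstract relates, the following is a minimal sketch of how perplexity and Average Precision are commonly computed. It is not the study's experimental setup: the function names, the add-alpha smoothed unigram model, and the assumption that every relevant document appears in the ranking are illustrative choices, not details taken from the abstract.

```python
import math
from collections import Counter


def average_precision(ranked_relevance):
    """Average Precision for one ranked list of 0/1 relevance labels.

    Assumption: every relevant document appears somewhere in the ranking,
    so the number of hits equals the number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of non-empty test_tokens under an add-alpha unigram model.

    Assumption: the vocabulary is the union of training and test tokens.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + alpha * len(vocab)
    log_prob = sum(math.log((counts[t] + alpha) / total) for t in test_tokens)
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-log_prob / len(test_tokens))
```

Lower perplexity means the model assigns higher probability to the observed text, while higher Average Precision means relevant documents are ranked earlier.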
More Than Words: Using Token Context to Improve Canonicalization Of Historical German
2010
Processing Informal, Romanized Pakistani Text Messages
Abstract
Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%.
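The abstract reports word and character error rates without defining them. A minimal sketch of the conventional definition (Levenshtein edit distance divided by reference length) follows. The function names are hypothetical, and the abstract does not confirm that the authors computed the metrics exactly this way.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref_text, hyp_text):
    """Edit distance over whitespace-separated tokens, per reference word."""
    ref_words = ref_text.split()
    return edit_distance(ref_words, hyp_text.split()) / len(ref_words)


def char_error_rate(ref_text, hyp_text):
    """Edit distance over characters, per reference character."""
    return edit_distance(ref_text, hyp_text) / len(ref_text)
```

Both rates assume a non-empty reference. Under this definition, a reduction of over 50% means the system's rate is less than half the baseline's.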
A Natural Law of Succession
1995
Abstract
Consider the following problem. You are given an alphabet of k distinct symbols and are told that the i-th symbol occurred exactly n_i times in the past. On the basis of this information alone, you must now estimate the conditional probability that the next symbol will be i. In this report, we present a new solution to this fundamental problem in statistics and demonstrate that our solution outperforms standard approaches, both in theory and in practice.

