Treebanks Gone Bad: Parser Evaluation and Retraining using a Treebank of Ungrammatical Sentences
Cited by 3 (1 self)
This article describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The treebank creation procedure involves the automatic introduction of frequently occurring grammatical errors into the sentences in an existing treebank, and the minimal transformation of the original analyses in the treebank so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in texts (as people do), and can be used to induce a grammar capable of analysing such sentences. This article demonstrates these two applications using the Penn Treebank. In a robustness evaluation experiment, two state-of-the-art statistical parsers are evaluated on an ungrammatical version of Section 23 of the Wall Street Journal (WSJ) portion of the Penn Treebank. This experiment shows that the performance of both parsers degrades with grammatical noise. A breakdown by error type is provided for both parsers. A second experiment retrains both parsers using an ungrammatical version of WSJ Sections 2-21. This experiment indicates that an ungrammatical treebank is a useful resource in improving parser robustness to grammatical errors, but that the correct combination of grammatical and ungrammatical training data has yet to be determined.
Contextual Bearing on Linguistic Variation in Social Media
Cited by 3 (1 self)
Microtexts, like SMS messages, Twitter posts, and Facebook status updates, are a popular medium for real-time communication. In this paper, we investigate the writing conventions that different groups of users use to express themselves in microtexts. Our empirical study investigates properties of lexical transformations as observed within Twitter microtexts. The study reveals that different populations of users exhibit different amounts of shortened English terms and different shortening styles. The results reveal valuable insights into how human language technologies can be effectively applied to microtexts.
SMS based Interface for FAQ Retrieval
Cited by 2 (0 self)
Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS-based Question Answering (QA) services. However, automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based question answering system for SMS users. We handle the noise in an SMS query by formulating the query similarity over FAQ questions as a combinatorial search problem. The search space consists of combinations of all possible dictionary variations of tokens in the noisy query. We present an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation. We demonstrate the effectiveness of our approach on two real-life datasets.
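The search space described above can be made concrete with a brute-force sketch. It enumerates every combination of dictionary variants for the noisy tokens and scores each combination against each FAQ question by word overlap. This is not the efficient algorithm the abstract refers to: the `best_faq_match` function, the hand-written `variants` table, and the overlap score are all assumptions chosen to keep the example small.

```python
from itertools import product


def best_faq_match(noisy_tokens, variants, faq_questions):
    """Exhaustively search variant combinations for the best FAQ match.

    `variants` maps a noisy token to its candidate dictionary words
    (a hypothetical, hand-built table here). Tokens without an entry
    are kept as they are. Returns (score, question, chosen_words).
    """
    candidate_lists = [variants.get(token, [token]) for token in noisy_tokens]
    best = (-1, None, None)
    for combination in product(*candidate_lists):
        words = set(combination)
        for question in faq_questions:
            # Score: number of distinct words shared with the FAQ question.
            score = len(words & set(question.lower().split()))
            if score > best[0]:
                best = (score, question, combination)
    return best


if __name__ == "__main__":
    variants = {
        "wat": ["what", "wait"],
        "r": ["are"],
        "ur": ["your", "our"],
    }
    faqs = [
        "what are your opening timings",
        "how do i reset my password",
    ]
    print(best_faq_match(["wat", "r", "ur", "timings"], variants, faqs))
```

The number of combinations grows multiplicatively with the number of variants per token, which is why exhaustive enumeration is only practical for very short queries.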
Towards Computational Guessing of Unknown Word Meanings: The Ontological Semantic Approach
Cited by 1 (1 self)
The paper describes a computational approach for guessing the meanings of previously unaccounted words in an implemented system for natural language processing. Interested in comparing the results to what is known about human guessing, it reviews a largely educational approach, partially based on cognitive psychology, to teaching humans, mostly children, to acquire new vocabulary from contextual clues, as well as the lexicographic efforts to account for neologisms. It then goes over the previous NLP efforts in processing new words and establishes the difference—mostly, much richer semantic resources—of the proposed approach. Finally, the results of a computer experiment that guesses the meaning of a non-existent word, placed as the direct object of 100 randomly selected verbs, from the known meanings of these verbs, with methods of the ontological semantics technology, are presented and discussed. While the results are promising percentage-wise, ways to improve them within the approach are briefly outlined.
Unsupervised Mining of Lexical Variants from Noisy Text
Cited by 1 (0 self)
The amount of data produced in user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax and even semantics and present significant problems to downstream applications which make use of this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants given a large volume of text. We demonstrate the utility of this method by applying it to normalize text messages found in the online social media service, Twitter, into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.
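For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a system output and a reference, divided by the reference length. The sketch below computes it and a relative reduction between two systems. The function names and example strings are illustrative assumptions, and the abstract does not say whether its 20% figure is relative or absolute.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which `new_wer` improves on `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("see you tomorrow", "c u tomorrow"))
    print(relative_reduction(0.10, 0.08))
```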

