New Tools for Web-Scale N-grams
Cited by 18 (10 self)
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally-similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. The tools will allow novel sources of information to be applied to long-standing natural language challenges.
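The definition of an N-gram corpus given in this abstract (a count for every word sequence up to length N) can be illustrated with a minimal Python sketch. The function name `count_ngrams` and the toy sentence are assumptions made for illustration; this is not the tooling the paper describes, and it omits the part-of-speech annotation the paper adds.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy input standing in for web-scale text (hypothetical example).
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_n=2)
```

Here `counts[("the",)]` is 2 and `counts[("the", "cat")]` is 1; a sequence that never occurs, such as `("cat", "the")`, has count 0.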
Exploring web scale language models for search query processing
In Proceedings of WWW 2010
Cited by 11 (7 self)
It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large-scale analysis of the extent of the language differences has been lacking. In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information-theoretic analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation for the observations in past work. We apply these web-scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing, and long query segmentation. By controlling the size and the order of different language models, we find the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains, for instance in query bracketing, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have a marked accuracy advantage over smaller ones.
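The perplexity metric mentioned in this abstract can be sketched for a smoothed bigram model as follows. This is an illustrative sketch only: the abstract does not say which smoothing method the authors use, so add-k smoothing is swapped in as a simple stand-in, and the function names and toy queries are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context and bigram counts from whitespace-tokenized text."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>", "<unk>"}
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        contexts.update(toks[:-1])            # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
        vocab.update(s.split())
    return contexts, bigrams, vocab

def perplexity(model, sentence, k=1.0):
    """Perplexity under an add-k smoothed bigram model (assumed smoothing)."""
    contexts, bigrams, vocab = model
    words = [w if w in vocab else "<unk>" for w in sentence.split()]
    toks = ["<s>"] + words + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(toks, toks[1:]):
        p = (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))
        log_prob += math.log(p)
    # Exponentiated average negative log-probability per predicted token.
    return math.exp(-log_prob / (len(toks) - 1))

# Toy "query log" (made-up data).
model = train_bigram(["new york hotels", "new york weather", "cheap hotels"])
in_order = perplexity(model, "new york hotels")
scrambled = perplexity(model, "hotels york new")
```

On this toy data the well-formed query receives a lower perplexity than the scrambled one, which is the sense in which perplexity tracks how well a model fits a text stream.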
Grammatical Error Correction with Alternating Structure Optimization
Cited by 3 (3 self)
We present a novel approach to grammatical error correction based on Alternating Structure Optimization. As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one-million-word corpus of learner English available for research purposes. We conduct an extensive evaluation for article and preposition errors using various feature sets. Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. Our approach also outperforms two commercial grammar-checking software packages.
Search right and thou shalt find... Using Web Queries for Learner Error Detection
Cited by 2 (0 self)
We investigate the use of web search queries for detecting errors in non-native writing. Distinguishing a correct sequence of words from a sequence with a learner error is a baseline task that any error detection and correction system needs to address. Using a large corpus of error-annotated learner data, we investigate whether web search result counts can be used to distinguish correct from incorrect usage. In this investigation, we compare a variety of query formulation strategies and a number of web resources, including two major search engine APIs and a large web-based n-gram corpus.
Algorithm Selection and Model Adaptation for ESL Correction Tasks
Cited by 2 (0 self)
We consider the problem of correcting errors made by English as a Second Language (ESL) writers and address two issues that are essential to making progress in ESL error correction: algorithm selection and model adaptation to the first language of the ESL learner. A variety of learning algorithms have been applied to correct ESL mistakes, but comparisons were often made between incomparable data sets. We conduct an extensive, fair comparison of four popular learning methods for the task, reversing conclusions from earlier evaluations. Our results hold for different training sets, genres, and feature sets. A second key issue in ESL error correction is the adaptation of a model to the first language of the writer. Errors made by non-native speakers exhibit certain regularities and, as we show, models perform much better when they use knowledge about the error patterns of non-native writers. We propose a novel way to adapt a learned algorithm to the first language of the writer that is both cheaper to implement and performs better than other adaptation methods.
Unsupervised Acquisition of Lexical Knowledge From N-grams: Final Report of the 2009 JHU CLSP Workshop
Cited by 1 (0 self)
This report describes a variety of work that uses web-scale N-gram data. This
MALWARE DETECTION BASED ON STRUCTURAL AND BEHAVIOURAL FEATURES OF API CALLS
In this paper, we propose a five-step approach to detect obfuscated malware by investigating the structural and behavioural features of API calls. We have developed a fully automated system to disassemble and extract API call features effectively from executables. Using n-gram statistical analysis of binary content, we are able to classify whether an executable file is malicious or benign. Our experimental results with a dataset of 242 malware samples and 72 benign files have shown a promising accuracy of 96.5% for the unigram model. We also provide a preliminary analysis of our approach using a support vector machine (SVM): varying n from 1 to 5, we analyse performance in terms of accuracy, false positives, and false negatives. By applying SVM, we propose to train the classifier and derive an optimum n-gram model for detecting both known and unknown malware efficiently. Keywords: Code obfuscation, Feature extraction, Malware, n-gram, SVM.
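The n-gram analysis of binary content for n from 1 to 5 can be illustrated with a small feature-extraction sketch. This is only an assumption about what such features might look like: the abstract does not specify the authors' feature definition, the function name `byte_ngram_profile` is hypothetical, and the sample bytes are made up. Classifier training is left out.

```python
from collections import Counter

def byte_ngram_profile(data: bytes, n: int) -> dict:
    """Relative frequency of each byte n-gram in `data`."""
    total = len(data) - n + 1
    if total <= 0:
        return {}  # input shorter than n: no n-grams to count
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}

# Made-up 9-byte sample standing in for the content of an executable.
sample = bytes.fromhex("558bec83ec08558bec")

# One profile per n-value, mirroring the n = 1..5 range in the abstract.
profiles = {n: byte_ngram_profile(sample, n) for n in range(1, 6)}
```

Each profile is a sparse frequency vector; vectors of this kind could then be passed to a classifier such as an SVM.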
Word Sense Disambiguation with Automatically Acquired Knowledge
Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding semantics and many real-world applications, word sense disambiguation has been extensively studied in Natural Language Processing and Computational Linguistics. However, existing methods either narrowly focus on a few specific words due to their reliance on expensive manually annotated training text, or give only mediocre performance in real-world settings. Broad coverage and disambiguation quality are critical for real-world natural language processing applications. In this paper we present a fully automatic disambiguation method that utilizes two readily available knowledge sources: a dictionary and knowledge extracted from unannotated text. Such an automatic approach overcomes the knowledge acquisition bottleneck and makes broad-coverage word sense disambiguation feasible in practice. Evaluated with two large-scale WSD evaluation corpora, our system significantly outperforms the best unsupervised system and achieves performance similar to that of the top-performing supervised systems.
Data-Driven Correction of Function Words in Non-Native English
We extend the n-gram-based data-driven prediction approach (Elghafari, Meurers and Wunsch, 2010) to identify function word errors in non-native academic texts as part of the Helping Our Own (HOO) Shared Task. We focus on substitution errors for four categories: prepositions, determiners, conjunctions, and quantifiers. These error types make up 12% of the errors annotated in the HOO training data. In our best submission in terms of the error detection score, we detected 67% of preposition and determiner substitution errors, 40% of conjunction substitution errors, and 33% of quantifier substitution errors. For approximately half of the errors detected, we were also able to provide an appropriate correction.
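One way to picture n-gram-based prediction of function words is the sketch below: each candidate is substituted into the slot, and the candidate whose surrounding n-gram is most frequent is proposed. This is a simplified illustration and not the system described in the abstract; the candidate list, the count table (with made-up numbers), and the function names are all assumptions.

```python
# Hypothetical candidate set for the preposition category.
PREPOSITIONS = ["in", "on", "at", "of", "for"]

# Toy counts with made-up numbers, standing in for a large n-gram corpus.
NGRAM_COUNTS = {
    ("depends", "on", "the"): 950,
    ("depends", "of", "the"): 12,
    ("depends", "in", "the"): 3,
}

def best_candidate(left, right, candidates, counts):
    """Return the candidate whose n-gram with its context is most frequent."""
    scored = [(counts.get((*left, c, *right), 0), c) for c in candidates]
    best_count, best = max(scored)
    return best if best_count > 0 else None  # no evidence: no prediction

def check(left, written, right, candidates=PREPOSITIONS, counts=NGRAM_COUNTS):
    """Suggest a substitution if the data prefers another word, else None."""
    suggestion = best_candidate(left, right, candidates, counts)
    if suggestion is None or suggestion == written:
        return None
    return suggestion

# "depends of the" is flagged and "on" is proposed as the correction.
suggestion = check(("depends",), "of", ("the",))
```

Returning `None` when no candidate has any count keeps the sketch from flagging contexts for which the count table has no evidence.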

