New Tools for Web-Scale N-grams
Cited by 18 (10 self)
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally-similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. The tools will allow novel sources of information to be applied to long-standing natural language challenges.
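The abstract's definition of an N-gram corpus (a table of how often each word sequence up to length N occurs) can be illustrated with a minimal sketch. The function name `ngram_counts` and the toy sentence are assumptions made for illustration; this is not the tooling the paper proposes, and it ignores part-of-speech annotation entirely.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous word sequence of length 1..max_n (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input, invented for this example.
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, max_n=3)
```

A real web-scale corpus would store such a table on disk, typically with a frequency cutoff, rather than in an in-memory `Counter`.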
Using the web as an implicit training set: application to structural ambiguity resolution
- In: Proceedings of HLT-EMNLP, Vancouver, British
, 2005
Cited by 16 (1 self)
Recent work has shown that very large corpora can act as training data for NLP algorithms even without explicit labels. In this paper we show how the use of surface features and paraphrases in queries against search engines can be used to infer labels for structural ambiguity resolution tasks. Using unsupervised algorithms, we achieve 84% precision on PP-attachment and 80% on noun compound coordination.
Web-Scale N-gram Models for Lexical Disambiguation
Cited by 16 (4 self)
Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combine information from multiple and overlapping segments of context. On the tasks of preposition selection and context-sensitive spelling correction, the supervised system reduces disambiguation error by 20-24% over the current state-of-the-art.
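One way to picture "combining information from multiple and overlapping segments of context" is to score each candidate word by summing log counts over every N-gram that covers the ambiguous position. The sketch below is an assumption-laden illustration, not the authors' system: the function names, the add-one smoothing, and all counts in `toy_counts` are invented for the example.

```python
import math


def context_ngrams(left, candidate, right, max_n):
    """All n-grams of 2..max_n words that contain the candidate position."""
    tokens = left + [candidate] + right
    pos = len(left)
    spans = []
    for n in range(2, max_n + 1):
        for start in range(pos - n + 1, pos + 1):
            # Keep only windows that fit inside the available context.
            if start >= 0 and start + n <= len(tokens):
                spans.append(tuple(tokens[start:start + n]))
    return spans


def score(left, candidate, right, counts, max_n=3):
    """Sum of log(count + 1) over all overlapping context n-grams."""
    return sum(math.log(counts.get(gram, 0) + 1)
               for gram in context_ngrams(left, candidate, right, max_n))


def choose(left, candidates, right, counts, max_n=3):
    """Pick the candidate whose context n-grams are best attested."""
    return max(candidates, key=lambda c: score(left, c, right, counts, max_n))


# Toy counts, invented for illustration only (not real web frequencies).
toy_counts = {
    ("depends", "on"): 500, ("on", "the"): 900,
    ("depends", "on", "the"): 200, ("on", "the", "weather"): 30,
    ("depends", "in"): 2, ("in", "the"): 1000,
    ("in", "the", "weather"): 10,
}
best = choose(["depends"], ["on", "in"], ["the", "weather"], toy_counts)
```

With these toy counts the preposition "on" wins because it is supported by several overlapping windows, even though the single bigram "in the" is more frequent than "on the".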
Using verbs to characterize noun-noun relations
- In Proc. of the 12th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA), Bulgaria
, 2006
Cited by 14 (8 self)
We present a novel, simple, unsupervised method for characterizing the semantic relations that hold between nouns in noun-noun compounds. The main idea is to discover predicates that make explicit the hidden relations between the nouns. This is accomplished by writing Web search engine queries that restate the noun compound as a relative clause containing a wildcard character to be filled in with a verb. A comparison to results from the literature suggests this is a promising approach.
Solving Relational Similarity Problems Using the Web as a Corpus
Cited by 12 (5 self)
We present a simple linguistically-motivated method for characterizing the semantic relations that hold between two nouns. The approach leverages the vast size of the Web in order to build lexically-specific features. The main idea is to look for verbs, prepositions, and coordinating conjunctions that can help make explicit the hidden relations between the target nouns. Using these features in instance-based classifiers, we demonstrate state-of-the-art results on various relational similarity problems, including mapping noun-modifier pairs to abstract relations like TIME, LOCATION and CONTAINER, characterizing linguistic predicates like CAUSE, USE, and FROM, classifying the relations between nominals in context, and solving SAT verbal analogy problems. In essence, the approach puts together some existing ideas, showing that they apply generally to various semantic tasks, finding that verbs are especially useful features.
Web Text Corpus for Natural Language Processing
, 2006
Cited by 10 (0 self)
Web text has been successfully used as training data for many NLP applications.
Unsupervised Recognition of Literal and Non-Literal Use of Idiomatic Expressions
- In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
Cited by 10 (3 self)
We propose an unsupervised method for distinguishing literal and non-literal usages of idiomatic expressions. Our method determines how well a literal interpretation is linked to the overall cohesive structure of the discourse. If strong links can be found, the expression is classified as literal, otherwise as idiomatic. We show that this method can help to tell apart literal and non-literal usages, even for idioms which occur in canonical form.
Efficient Handling of N-gram Language Models for Statistical Machine Translation
Cited by 10 (0 self)
Statistical machine translation, as well as other areas of human language processing, has recently pushed toward the use of large-scale n-gram language models. This paper presents efficient algorithmic and architectural solutions which have been tested within the Moses decoder, an open source toolkit for statistical machine translation. Experiments are reported with a high-performing baseline, trained on the Chinese-English NIST 2006 Evaluation task and running on a standard Linux 64-bit PC architecture. Comparative tests show that our representation halves the memory required by SRI LM Toolkit, at the cost of 44% slower translation speed. However, as it can take advantage of memory mapping on disk, the proposed implementation seems to scale up much better to very large language models: decoding with a 289-million 5-gram language model runs in 2.1Gb of RAM.
A Feedback-Augmented Method for Detecting Errors in the Writing of Learners of English
Cited by 8 (0 self)
This paper proposes a method for detecting errors in article usage and singular plural usage based on the mass count distinction. First, it learns decision lists from training data generated automatically to distinguish mass and count nouns. Then, in order to improve its performance, it is augmented by feedback that is obtained from the writing of learners. Finally, it detects errors by applying rules to the mass count distinction. Experiments show that it achieves a recall of 0.71 and a precision of 0.72 and outperforms other methods used for comparison when augmented by feedback.
A study of using search engine page hits as a proxy for n-gram frequencies
- In Proceedings of the RANLP’05
, 2005
Cited by 7 (1 self)
The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.

