Web-Scale Information Extraction in KnowItAll
2004
Cited by 61 (6 self)
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITALL, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner.
Web-Scale N-gram Models for Lexical Disambiguation
Cited by 16 (4 self)
Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combine information from multiple and overlapping segments of context. On the tasks of preposition selection and context-sensitive spelling correction, the supervised system reduces disambiguation error by 20-24% over the current state-of-the-art.
A study of using search engine page hits as a proxy for n-gram frequencies
In Proceedings of the RANLP’05, 2005
Cited by 7 (1 self)
The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significant for the task examined.
Shallow parsing using noisy and non-stationary training material
Journal of Machine Learning Research, 2002
Cited by 5 (0 self)
Shallow parsers are usually assumed to be trained on noise-free material, drawn from the same distribution as the testing material. However, when the training set is either noisy or drawn from a different distribution, performance may be degraded. Using the parsed Wall Street Journal, we investigate the performance of four shallow parsers (maximum entropy, memory-based learning, N-grams and ensemble learning) trained using various types of artificially noisy material. Our first set of results shows that shallow parsers are surprisingly robust to synthetic noise, with performance gradually decreasing as the rate of noise increases. Further results show that no single shallow parser performs best in all noise situations. Final results show that simple, parser-specific extensions can improve noise-tolerance. Our second set of results addresses the question of whether naturally occurring disfluencies undermine performance more than does a change in distribution. Results using the parsed Switchboard corpus suggest that, although naturally occurring disfluencies might harm performance, differences in distribution between the training set and the testing set are more significant.
Machine Learning Approach for Context-Sensitive Error Detection
Proc. Int’l Conf. Intelligent Computing and Information Systems (ICICIS ’05), 2005
Cited by 1 (1 self)
Context-sensitive spelling errors are those errors resulting from mistyping or mispronouncing a word, where the resulting misspelled word is a valid language/dictionary word. For example, in “This building is bigger then our building”, the word ‘then’ is a context-sensitive spelling error and the intended word is ‘than’. This paper describes an effective approach for detecting context-sensitive spelling errors. Detecting and correcting context-sensitive spelling errors is a very difficult and important problem that needs careful consideration. Working with this problem will involve facing the very difficult problem of natural language semantics. The proposed approach is a machine-learning-based approach. The approach has been fully implemented and evaluated with a large number of experiments. The results reported in this paper are encouraging and show that the method is effective. Overall, the method is capable of detecting context-sensitive errors with an accuracy in the range of approximately 86% to 95%. Keywords: Natural language processing, Computational Linguistics, Context-sensitive Errors, Machine Learning.
Unsupervised Acquisition of Lexical Knowledge From N-grams: Final Report of the 2009 JHU CLSP Workshop
Cited by 1 (0 self)
This report describes a variety of work that uses web-scale N-gram data. This

