Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing, 2005
Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
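The combination of Web counts and corpus counts mentioned in this abstract can be illustrated with a minimal sketch. It is not the article's actual implementation: the function names, the mixing weight `lam`, the `min_count` threshold, and the toy counts are all assumptions made for illustration, and the counts are taken as plain dictionaries rather than fetched from a search engine.

```python
from typing import Callable, Iterable, Mapping


def relative_freq(counts: Mapping[str, int], ngram: str) -> float:
    """Relative frequency of an n-gram within one count table (0.0 if the table is empty)."""
    total = sum(counts.values())
    return counts.get(ngram, 0) / total if total else 0.0


def interpolate(web: Mapping[str, int], corpus: Mapping[str, int],
                ngram: str, lam: float = 0.5) -> float:
    """Linear interpolation: weight `lam` on the Web estimate, `1 - lam` on the corpus estimate."""
    return lam * relative_freq(web, ngram) + (1.0 - lam) * relative_freq(corpus, ngram)


def backoff(web: Mapping[str, int], corpus: Mapping[str, int],
            ngram: str, min_count: int = 1) -> float:
    """Backoff: trust the corpus estimate when the n-gram was seen at least
    `min_count` times there, otherwise fall back to the Web estimate."""
    if corpus.get(ngram, 0) >= min_count:
        return relative_freq(corpus, ngram)
    return relative_freq(web, ngram)


def pick_best(candidates: Iterable[str], score: Callable[[str], float]) -> str:
    """Unsupervised decision rule: choose the candidate with the highest score."""
    return max(candidates, key=score)


# Toy counts, invented for illustration only (not taken from the article).
web_counts = {"strong tea": 9000, "powerful tea": 1000}
corpus_counts = {"strong tea": 3, "powerful tea": 0}

best = pick_best(
    ["strong tea", "powerful tea"],
    lambda ng: interpolate(web_counts, corpus_counts, ng),
)
```

In this sketch the choice between backoff and interpolation is left to the caller; which of the two helps, and with what weights, would have to be determined per task.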
Exploring the boundaries: Gene and protein identification in biomedical text
- In Proceedings of the BioCreative Workshop, 2004
Cited by 15 (5 self)
Background: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools. Methods: We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts. Results: This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the “open” evaluation and a precision of 0.78 and recall of 0.85 in the “closed” evaluation. Conclusions: Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches.
Background: The explosion of information in the biomedical domain and particularly in genetics has highlighted the need for automated text information extraction techniques. MEDLINE, the primary research database serving the biomedical community, currently contains over 14 million abstracts, with 60,000 new abstracts appearing each month. There is also an impressive number of molecular biological databases covering an
A study of using search engine page hits as a proxy for n-gram frequencies
- In Proceedings of the RANLP’05, 2005
Cited by 7 (1 self)
The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significant for the task examined.
Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns
Cited by 6 (1 self)
Semantic relatedness is a very important factor for the coreference resolution task. To obtain this semantic information, corpus-based approaches commonly leverage patterns that can express a specific semantic relation. The patterns, however, are designed manually and thus are not necessarily the most effective ones in terms of accuracy and breadth. To deal with this problem, in this paper we propose an approach that can automatically find the effective patterns for coreference resolution. We explore how to automatically discover and evaluate patterns, and how to exploit the patterns to obtain the semantic relatedness information. The evaluation on the ACE data set shows that the pattern-based semantic information is helpful for coreference resolution.
Learning Dutch coreference resolution
- In Fifteenth Computational Linguistics in the Netherlands Meeting (CLIN, 2004
Cited by 2 (0 self)
This paper presents a machine learning approach to the resolution of coreferential relations between nominal constituents in Dutch. It is the first significant automatic approach to the resolution of coreferential relations between nominal constituents for this language. The corpus-based strategy was enabled by the annotation of a substantial corpus (ca. 12,500 noun phrases) of Dutch news magazine text with coreferential links for pronominal, proper noun and common noun coreferences. Based on the hypothesis that different types of information sources contribute to a correct resolution of different types of coreferential links, we propose a modular approach in which a separate module is trained per NP type.
The task of coreference resolution: Although largely unexplored for Dutch, automatic coreference resolution is a research area which is becoming increasingly popular in natural language processing (NLP) research. It is a weakness and therefore a key task in applications such as machine translation, automatic summarization and information extraction for which text understanding is of crucial importance.
Unsupervised Acquisition of Lexical Knowledge From N-grams: Final Report of the 2009 JHU CLSP Workshop
Cited by 1 (0 self)
This report describes a variety of work that uses web-scale N-gram data. This
But What Do They Mean? Modelling Contrast Between Speakers in Dialogue Signalled by “But”
- 2005
Understanding what is being communicated in a dialogue involves determining how it is coherent, that is, how the successive turns in the dialogue are related, what the speakers’ intentions, goals, beliefs, and expectations are and how they relate to each other’s responses. This thesis aims to address how turns in dialogue are related when one speaker indicates contrast with something in the preceding discourse signalled by “but”. Different relations cued by “but” will be distinguished and characterised when they relate material spanning speaker turns, and an implementation in a working dialogue system is specified with the aim of enabling a better model of dialogue understanding and achieving more precise response generation. A large amount of research in discourse addresses coherence in monologue, and much of it focuses on cases in which the coherence relation is explicitly signalled via a cue-phrase or discourse marker (e.g., “on the other hand”, “but”, et cetera) which provides an explicit cue about the nature of the underlying relation linking the two clauses. However, despite research on Speech Acts, planning research into speakers’
An XML-based Tool for Tracking English Inclusions in German Text
The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring changing patterns of English inclusion usage. The corpus used for the classification covers three different domains. We report the classification results and illustrate their value to linguistic and NLP research.

