BART: A modular toolkit for coreference resolution
- In Association for Computational Linguistics (ACL) Demo Session
, 2008
Cited by 8 (2 self)
Developing a full coreference system able to run all the way from raw text to semantic interpretation is a considerable engineering effort. Accordingly, few off-the-shelf tools are available for researchers whose interests are not primarily in coreference, or who want to concentrate on a specific aspect of the problem. We present BART, a highly modular toolkit for developing coreference applications. In the Johns Hopkins workshop on using lexical and encyclopedic knowledge for entity disambiguation, the toolkit was used to extend a reimplementation of the Soon et al. (2001) proposal with a variety of additional syntactic and knowledge-based features, and to experiment with alternative resolution processes, preprocessing tools, and classifiers.
The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue
- Language Resources and Evaluation Journal (to appear)
, 2009
Named Entity Recognition in Tweets: An Experimental Study
, 2011
Cited by 6 (3 self)
People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by rebuilding the NLP pipeline, beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at:
CCASH: A Web Application Framework for Efficient Distributed Language Resource Development
- Proceedings of LREC 2010, Valletta
, 2010
Cited by 1 (1 self)
We introduce CCASH (Cost-Conscious Annotation Supervised by Humans), an extensible web application framework for cost-efficient annotation. CCASH provides a framework in which cost-efficient annotation methods such as Active Learning (AL) can be explored via user studies and afterwards applied to large annotation projects. We describe CCASH's architecture and the technologies on which it is built. CCASH allows custom annotation tasks to be built from a growing set of useful annotation widgets. It also allows annotation methods (such as AL) to be implemented in any language. Being a web application framework, CCASH offers secure, centralized data and annotation storage and facilitates collaboration among multiple annotators. By default it records timing information about each annotation and provides facilities for recording custom statistics. The CCASH framework has been used to evaluate a novel annotation strategy presented in a concurrently published paper, and will be used in the future to annotate a large Syriac corpus.
Cascaded Filtering for Topic-Driven Multi-Document Summarization
Cited by 1 (0 self)
This paper presents the first participation of EMLR's NLP group in the DUC summarization competitions. Our system combines document filtering and sentence ranking, using lexical chains and graph matching algorithms with the topic, on top of several annotation layers in the MMAX2 annotation tool. The system ranked 14th out of 30 participating teams in manual annotation, and ranked particularly well on linguistic quality.
By all these lovely tokens... Merging Conflicting Tokenizations
Cited by 1 (0 self)
Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text is a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences for the handling of queries on annotated corpora.
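To make the idea of reconciling concurrent tokenizations concrete, here is a minimal sketch. It is not the algorithm or the XML format from the paper; it only assumes that each tokenization is given as character-offset spans over the same text, in the spirit of standoff annotation. The function names (merge_tokenizations, align) and the example text are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_tokenizations(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the minimal segments induced by the token boundaries of both inputs.

    Stretches of text that no token covers (e.g. whitespace) are dropped.
    """
    spans = a + b
    boundaries = sorted({offset for span in spans for offset in span})

    def covered(start: int, end: int) -> bool:
        # A segment is kept only if it lies inside some original token.
        return any(s <= start and end <= e for s, e in spans)

    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if covered(s, e)]


def align(tokens: List[Span], segments: List[Span]) -> List[List[int]]:
    """Map each original token to the indices of the segments it contains."""
    return [
        [i for i, (s, e) in enumerate(segments) if ts <= s and e <= te]
        for ts, te in tokens
    ]


# Hypothetical example: two tokenizers disagree on the contraction.
text = "don't stop"
tok_a = [(0, 2), (2, 5), (6, 10)]          # do | n't | stop
tok_b = [(0, 3), (3, 4), (4, 5), (6, 10)]  # don | ' | t | stop

segments = merge_tokenizations(tok_a, tok_b)
links_a = align(tok_a, segments)
links_b = align(tok_b, segments)
```

Because both tokenizations point into the same segment list rather than into each other, annotations attached to either one can be compared or queried through the shared segments without altering the original text.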
Boemie ontology-based text annotation tool
Cited by 1 (0 self)
The huge amount of information available on the Web creates a need for effective information extraction systems that can produce metadata satisfying users' information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus from which to learn extraction models. The production of such corpora can be significantly facilitated by annotation tools that are able to annotate, according to a defined ontology, not only named entities but, most importantly, the relations between them. This paper describes the BOEMIE ontology-based annotation tool, which is able to locate blocks of text that correspond to specific types of named entities, fill tables corresponding to ontology concepts with those named entities, and link the filled tables based on relations defined in the domain ontology. Additionally, it can annotate blocks of text that refer to the same topic. The tool has a user-friendly interface and supports automatic pre-annotation, annotation comparison, and customization to other annotation schemata. The annotation tool has been used in a large-scale annotation task involving 3000 web pages regarding athletics. It has also been used in another annotation task involving 503 web pages with medical information, in different languages.
Annotating Question Types in Social Q&A Sites
In all domains, including eHumanities, it is crucial to understand how people seek information and what kinds of questions they ask. In this paper, we present an annotation study of domain-specific questions collected from the current leading social Question and Answer site, namely Yahoo! Answers. We define
Information Extraction with the Darmstadt Knowledge Processing Software Repository
Current Natural Language Processing (NLP) systems feature high-complexity processing pipelines that require the use of components at different levels of linguistic and application-specific processing. These components often have to interface with external libraries (e.g., for machine learning and information retrieval) as well as tools for human annotation and visualization. At the UKP Lab, we are working on the Darmstadt Knowledge Processing Software Repository (DKPro) (Gurevych et al., 2007a; Müller et al., 2008) to create a highly flexible, scalable and easy-to-use toolkit that allows rapid creation of complex NLP pipelines for semantic information processing on demand. The DKPro repository consists of several main parts created to serve the purposes of different NLP application areas.
• DKPro core components are general-purpose analysis components. Core components are readers for generic text and XML files, and annotators for standard preprocessing tasks like tokenization, sentence splitting, POS-tagging and lemmatization, stop word removal, parsing, and others. The core also includes annotation consumers, e.g. one that can produce output in the format used by the general-purpose annotation tool MMAX2 (Müller & Strube, 2006).
• DKPro information retrieval components supply functionality for all phases of information retrieval, including indexing, retrieval, and (qualitative and quantitative) evaluation. The components

