Results 1 - 10
of
11
Adding nesting structure to words
- In Developments in Language Theory, LNCS 4036
, 2006
"... We propose the model of nested words for representation of data with both a linear ordering and a hierarchically nested matching of items. Examples of data with such dual linear-hierarchical structure include executions of structured programs, annotated linguistic data, and HTML/XML documents. Neste ..."
Abstract
-
Cited by 42 (10 self)
- Add to MetaCart
We propose the model of nested words for representation of data with both a linear ordering and a hierarchically nested matching of items. Examples of data with such dual linear-hierarchical structure include executions of structured programs, annotated linguistic data, and HTML/XML documents. Nested words generalize both words and ordered trees, and allow both word and tree operations. We define nested word automata—finite-state acceptors for nested words, and show that the resulting class of regular languages of nested words has all the appealing theoretical properties that the classical regular word languages enjoys: deterministic nested word automata are as expressive as their nondeterministic counterparts; the class is closed under union, intersection, complementation, concatenation, Kleene-*, prefixes, and language homomorphisms; membership, emptiness, language inclusion, and language equivalence are all decidable; and definability in monadic second order logic corresponds exactly to finite-state recognizability. We also consider regular languages of infinite nested words and show that the closure properties, MSO-characterization, and decidability of decision problems carry over. The linear encodings of nested words give the class of visibly pushdown languages of words, and this class lies between balanced languages and deterministic context-free languages. We argue that for algorithmic verification of structured programs, instead of viewing the program as a context-free language over words, one should view it as a regular language of nested words (or equivalently, a visibly pushdown language), and this would allow model checking of many properties (such as stack inspection, pre-post conditions) that are not expressible in existing specification logics. We also study the relationship between ordered trees and nested words, and the corresponding automata: while the analysis complexity of nested word automata is the same as that of classical tree automata, they combine both bottom-up and top-down traversals, and enjoy expressiveness and succinctness benefits over tree automata. 1
Towards an alternative implementation of NXT’s query language via XQuery
- In EACL Workshop on Multi-dimensional Markup in Natural Language Processing (NLPXML
, 2006
"... The NITE Query Language (NQL) has been used successfully for analysis of a number of heavily cross-annotated data sets, and users especially value its elegance and flexibility. However, when using the current implementation, many of the more complicated queries that users have formulated must be run ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The NITE Query Language (NQL) has been used successfully for analysis of a number of heavily cross-annotated data sets, and users especially value its elegance and flexibility. However, when using the current implementation, many of the more complicated queries that users have formulated must be run in batch mode. For a re-implementation, we require the query processor to be capable of handling large amounts of data at once, and work quickly enough for on-line data analysis even when used on complete corpora. Early results suggest that the most promising implementation strategy is one that involves the use of XQuery on a multiple file data representation that uses the structure of individual XML files to mirror tree structures in the data, with redundancy where a data node has multiple parents in the underlying data object model. 1
Querying Linguistic Trees
"... Abstract. Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structu ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract. Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structures of interest, or to project out large slices of a corpus for external analysis. Existing languages suffer from a variety of problems in the areas of expressiveness, efficiency, and naturalness for linguistic query. We describe the domain of linguistic trees and discuss the expressive requirements for a query language. Then we present a language that can express a wide range of queries over these trees, and show that the language is first-order complete over trees. 1.
Mining Syntactically Annotated Corpora with XQuery
"... This paper presents a uniform approach to data extraction from syntactically annotated corpora encoded in XML. XQuery, which incorporates XPath, has been designed as a query language for XML. The combination of XPath and XQuery offers flexibility and expressive power, while corpus specific functions ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
This paper presents a uniform approach to data extraction from syntactically annotated corpora encoded in XML. XQuery, which incorporates XPath, has been designed as a query language for XML. The combination of XPath and XQuery offers flexibility and expressive power, while corpus specific functions can be added to reduce the complexity of individual extraction tasks. We illustrate our approach using examples from dependency treebanks for Dutch. 1
System for Querying Syntactically Annotated Corpora
"... This paper presents a system for querying treebanks. The system consists of a powerful query language with natural support for cross-layer queries, a client interface with a graphical query builder and visualizer of the results, a command-line client interface, and two substitutable query engines: a ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
This paper presents a system for querying treebanks. The system consists of a powerful query language with natural support for cross-layer queries, a client interface with a graphical query builder and visualizer of the results, a command-line client interface, and two substitutable query engines: a very efficient engine using a relational database (suitable for large static data), and a slower, but paralel-computing enabled, engine operating on treebank files (suitable for “live ” data). 1
2007, ‘Graphical query for linguistic treebanks
- In: 10th Conference of the Pacific Association for Computational Linguistics
"... Databases of hierarchically annotated text occupy a central place in linguistic research and language technology development. We describe a new approach to tree query which we call “Query by Annotation”. Users express a query by annotating a tree, and the annotation is compiled into an expression in ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Databases of hierarchically annotated text occupy a central place in linguistic research and language technology development. We describe a new approach to tree query which we call “Query by Annotation”. Users express a query by annotating a tree, and the annotation is compiled into an expression in a path language. The result trees are overlaid with the original query, permitting the user to see why they match. Since queries and results are annotated trees, users can easily refine and resubmit their queries. The approach to Query by Annotation is motivated and exemplified using databases of linguistic trees, or treebanks. 1
Fast Query for Large Treebanks
"... A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, widecoverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologi ..."
Abstract
- Add to MetaCart
A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, widecoverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologies, do not scale up. We describe a method for storage, indexing, and query of treebanks that uses an information retrieval engine. Several experiments with a large treebank demonstrate excellent scaling characteristics for a wide range of query types. This work facilitates the curation of much larger treebanks, and enables them to be used effectively in a variety of scientific and engineering tasks. 1
Representing and Querying Standoff XML
"... The paper discusses the representation and exploitation of multi-level annotated linguistic data. We first present a standoff XML representation, which distributes information over separate, standoff layers and allows us to represent annotations of various kinds in a uniform, generic way. This for ..."
Abstract
- Add to MetaCart
The paper discusses the representation and exploitation of multi-level annotated linguistic data. We first present a standoff XML representation, which distributes information over separate, standoff layers and allows us to represent annotations of various kinds in a uniform, generic way. This format serves as our interchange format. We further introduce an XML-inline representation that is designed to provide for a more efficient processing of the data. This format is computed on the basis of the standoff representation and uses fragments to represent overlapping elements. We then compare both representations by testing their performance with regard to a testsuite. Not surprisingly, the inline variant performs much better than the standoff variant, in particular with more complex queries.
GenerIE: Information Extraction Using Database Queries
"... Abstract — Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be re-applied from scratch to the entire text co ..."
Abstract
- Add to MetaCart
Abstract — Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be re-applied from scratch to the entire text corpus even though only a small part of the corpus might be affected. In this demonstration proposal, we describe a novel paradigm for information extraction: we store the parse trees output by text processing in a database, and then express extraction needs using queries, which can be evaluated and optimized by databases. Compared with the existing approaches, database queries for information extraction enable generic extraction and minimize reprocessing. However, such an approach also poses a lot of technical challenges, such as language design, optimization and automatic query generation. We will present the opportunities and challenges that we met when building GenerIE, a system that implements this paradigm. I.
IEEE TRANSACTIONS ON KNOWLEDGE & DATA ENGINEERING 1 Parse Tree Database for Information Extraction
"... Abstract—Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules targeting the extraction of a particular kind of information. A major drawback of such approach is that whenever a new extraction goal emerges or a module is improved, extraction ..."
Abstract
- Add to MetaCart
Abstract—Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules targeting the extraction of a particular kind of information. A major drawback of such approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be re-applied from scratch to the entire text corpus even though only a small part of the corpus might be affected. In this paper, we describe a novel approach for information extraction so that extraction needs are expressed in the form of database queries, which are evaluated and optimized by databases. Using database queries for information extraction enables generic extraction and minimizes reprocessing of data. In addition, our approach provides two different query generation components that can automatically form database queries for extraction from training datasets, as well as from unlabeled data through a mechanism inspired by the pseudo-relevance feedback approach found in protein-protein interactions and drug-protein-metabolic relations from two sets of corpus. Experiments show that our approach achieves a precision of 83.6 % and recall of 58.6 % (F-measure of 64.2%) for the extraction of protein-protein interactions from the BioCreative 2 corpus, while achieving a precision of 85.0 % and recall of 26.0 % (F-measure of 39.8%) for drug-protein-metabolic relations.

