Question answering techniques for the world wide web
In Tutorial presentation at EACL, 2003
Cited by 4 (0 self)
Question answering systems have become increasingly popular because they deliver users short, succinct answers instead of overloading them with a large number of irrelevant documents. The vast amount of information readily available on the World Wide Web presents new opportunities and challenges for question answering. In order for question answering systems to benefit from this vast store of useful knowledge, they must cope with large volumes of useless data. Many characteristics of the World Wide Web distinguish Web-based question answering from question answering on closed corpora such as newspaper texts. The Web is vastly larger in size and boasts incredible “data redundancy,” which renders it amenable to statistical techniques for answer extraction. A data-driven approach can yield high levels of performance and nicely complements traditional question answering techniques driven by information extraction. In addition to enormous amounts of unstructured text, the Web also contains pockets of structured and semistructured knowledge that can serve as a valuable resource for question answering. By organizing these resources and annotating them with natural language, we can successfully incorporate Web knowledge into question answering systems. This tutorial surveys recent Web-based question answering technology, focusing on two separate paradigms: knowledge mining using statistical tools and knowledge annotation using database concepts. Both approaches can employ a wide spectrum of techniques ranging in linguistic sophistication from simple “bag-of-words” treatments to full syntactic parsing.
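To make the "data redundancy" idea concrete, here is a minimal sketch of redundancy-based answer voting with a bag-of-words treatment. It is not the tutorial's method: the function name `vote_answers`, the tokenization, and the toy snippets in the usage note are all assumptions made for illustration. Each candidate n-gram gets one vote per snippet in which it appears, and n-grams containing a question word are discarded.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vote_answers(question: str, snippets: Iterable[str],
                 n: int = 1, top: int = 3) -> List[Tuple[str, int]]:
    """Rank candidate n-grams by the number of snippets that contain them.

    Hypothetical helper: a crude stand-in for statistical answer extraction.
    """
    # Words of the question itself are not useful as answers.
    question_words = set(question.lower().split())
    counts: Counter = Counter()
    for snippet in snippets:
        # Crude tokenization: split on whitespace, strip punctuation, lowercase.
        tokens = [t.strip(".,;:!?\"'").lower() for t in snippet.split()]
        tokens = [t for t in tokens if t]
        # Use a set so each snippet votes at most once per n-gram.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for gram in grams:
            if not any(tok in question_words for tok in gram.split()):
                counts[gram] += 1
    return counts.most_common(top)
```

With the made-up snippets "Hamlet was written by Shakespeare.", "Shakespeare wrote Hamlet around 1600." and "The play Hamlet, by Shakespeare, is a tragedy.", the question "who wrote hamlet" ranks "shakespeare" first because it is the only candidate present in all three snippets. A real system would add answer-type filtering and stopword handling on top of this counting step.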
Suffix arrays on words
In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, volume 4580 of LNCS, 2007
Cited by 4 (2 self)
Abstract. Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array also improves query time (i.e., fewer random-access binary-search steps), making queries faster by a factor of up to 3.
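As an illustration of what a word-based suffix array stores, the sketch below indexes only the k word-start positions of a text and answers word-aligned pattern queries by binary search. It is a naive construction by comparison sorting, not the O(n)-time, O(k)-space algorithm the abstract describes; the function names and the choice of a single space as word separator are assumptions.

```python
from typing import List


def word_suffix_array(text: str) -> List[int]:
    """Return word-start offsets of `text`, sorted by the suffix starting there.

    Naive sketch: sorting by sliced suffixes is far from linear time.
    Assumes words are separated by single spaces.
    """
    starts = [i for i in range(len(text))
              if text[i] != " " and (i == 0 or text[i - 1] == " ")]
    return sorted(starts, key=lambda i: text[i:])


def find(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return sorted offsets where `pattern` occurs starting at a word boundary."""
    lo, hi = 0, len(sa)
    # Lower bound: first indexed suffix whose prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the sorted order.
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)
```

For the text "to be or not to be" the array holds six entries instead of eighteen, and `find` reports "to be" at offsets 0 and 13. Because only word starts are indexed, a pattern that begins in the middle of a word is never reported, which is the trade-off a word-based index makes for its smaller size.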
Full-Subtopic Retrieval with Keyphrase-based Search Results Clustering
Cited by 4 (1 self)
We consider the problem of retrieving multiple documents relevant to the single subtopics of a given web query, termed “full-subtopic retrieval”. To solve this problem we present a novel search results clustering algorithm that generates clusters labeled by keyphrases. The keyphrases are extracted from the generalized suffix tree built from the search results and merged through an improved hierarchical agglomerative clustering procedure. We also introduce a novel measure for evaluating full-subtopic retrieval performance, namely “Subtopic Search Length under k document sufficiency”. Using a test collection specifically designed for evaluating subtopic retrieval, we found that our algorithm outperformed both other existing search results clustering algorithms and also a search results re-ranking method that emphasized diversity of results (at least for k>1; i.e., when we are interested in retrieving more than one relevant document per subtopic). Our approach has been implemented into KeySRC (Keyphrase-based Search Results Clustering), a full web clustering engine available online at
Sparse compact directed acyclic word graphs
In: Stringology, 2006
Cited by 2 (0 self)
Abstract. The suffix tree of string w represents all suffixes of w, and thus it supports full indexing of w for exact pattern matching. On the other hand, a sparse suffix tree of w represents only a subset of the suffixes of w, and therefore it supports sparse indexing of w. There has been a wide range of applications of sparse suffix trees, e.g., natural language processing and biological sequence analysis. Word suffix trees are a variant of sparse suffix trees that are defined for strings that contain a special word delimiter #. Namely, the word suffix tree of string w = w1 w2 ··· wk, consisting of k words each ending with #, represents only the k suffixes of w of the form wi ··· wk. Recently, we presented an algorithm which builds word suffix trees in O(n) time with O(k) space, where n is the length of w. In addition, we proposed sparse directed acyclic word graphs (SDAWGs) and an on-line algorithm for constructing them, working in O(n) time and space. As a further achievement of this research direction, this paper introduces yet another text indexing structure named sparse compact directed acyclic word graphs (SCDAWGs). We show that the size of SCDAWGs is smaller than that of word suffix trees and SDAWGs, and present an SCDAWG construction algorithm that works in O(n) time with O(k) space and in an on-line manner.
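The definition of the k word suffixes can be illustrated with a short sketch. It enumerates the suffixes wi ··· wk of a #-delimited string and stores them in an uncompacted trie of nested dictionaries. This is only meant to show which suffixes are represented: it is not a word suffix tree, SDAWG, or SCDAWG, it does not meet the O(n)-time, O(k)-space bounds, and all function names are assumptions.

```python
from typing import Dict, List


def word_suffixes(w: str, delim: str = "#") -> List[str]:
    """Return the k suffixes of `w` that start at a word boundary.

    `w` must consist of words that each end with `delim`.
    """
    if not w.endswith(delim):
        raise ValueError("every word, including the last, must end with the delimiter")
    # A word starts at position 0 and right after every delimiter except the last.
    starts = [0] + [i + 1 for i, c in enumerate(w[:-1]) if c == delim]
    return [w[s:] for s in starts]


def build_trie(suffixes: List[str]) -> Dict:
    """Insert each suffix character by character into a nested-dict trie."""
    root: Dict = {}
    for suffix in suffixes:
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def occurs_at_word_start(trie: Dict, pattern: str) -> bool:
    """True if `pattern` is a prefix of some indexed word suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

For w = "ab#c#ab#" there are k = 3 word suffixes ("ab#c#ab#", "c#ab#", "ab#"), compared with 8 suffixes in a full index. A query such as "c#a" succeeds because it begins at a word boundary, while "b#" fails because it begins inside a word.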
Inter-document similarity in web searches
2004
Cited by 1 (1 self)
Learning Feature-Value Grammars from Plain Text
In Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, 1998
Cited by 1 (0 self)
This paper outlines preliminary work aimed at learning Feature-Value Grammars from plain text. Common suffixes are gleaned from a word suffix tree and used to form a first approximation of how regular inflection is marked. Words are generalised according to these suffixes and then subjected to trigram analysis in an attempt to identify agreement dependencies. They are subsequently labeled with a feature whose value is given by the common suffix. A means for converting the feature dependencies into a unification grammar is described wherein feature structures are projected on to unlabeled words. Irregularly inflected words are subsumed into common categories through the process of unification.
BODHI: A Database Engine for . . .
2006
Biodiversity research generates and uses a variety of data spanning across diverse domains, including taxonomy, geo-spatial and genetic domains, which vary greatly in their structural features and complexities, query processing costs and storage volumes. In this thesis, we present BODHI, a database engine that seamlessly integrates these diverse types of data, spanning the range from molecular to organism-level information. BODHI is a native object-oriented database system built around a publicly available micro-kernel and extensible query processor, and offers a functionally comprehensive query interface. The server is partitioned into three service modules: object, spatial and sequence, each handling the associated data domain and providing appropriate storage, modeling interfaces, and evaluation algorithms for predicates over the corresponding data types. To accelerate query response times, a variety of specialized access structures are included for each domain. Our experiments with complex cross-domain queries over a representative
Synchronisation Compression
2002
Original Aims of the Project: The project aims to investigate and implement a novel method for compression and decompression of data (primarily electronic mail) to be synchronised with a remote device.
Work Completed: A compressor and decompressor were implemented to run on a standard PC. Data structures and algorithms which offered improved performance were investigated and the system modified accordingly. The system’s compression performance (on e-mail) was analysed and its behaviour modified accordingly. The system was tested with data other than e-mail. A user-friendly implementation was produced to run on commodity hardware.
STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY
The topic of the chapter is string data structures with applications in the field of computational molecular biology. Let Σ be a finite alphabet consisting of a set of characters (or symbols). The cardinality of the alphabet, denoted by |Σ|, expresses the number of distinct characters in the alphabet. A string or word is an ordered list

