Evolving GATE to Meet New Challenges in . . .
, 1998
Cited by 22 (2 self)
In this paper we present recent work on GATE, a widely-used framework and graphical development environment for creating and deploying Language Engineering components and resources in a robust fashion. The GATE architecture has facilitated the development of a number of successful applications for various language processing tasks (such as Information Extraction, dialogue and summarisation), the building and annotation of corpora and the quantitative evaluations of LE applications. The focus of this paper is on recent developments in response to new challenges in Language Engineering: Semantic Web, integration with Information Retrieval and data mining, and the need for machine learning support.
Almost-Constant-Time Clustering of Arbitrary Corpus Subsets
- In ACM SIGIR 97
, 1997
Cited by 21 (0 self)
Methods exist for constant-time clustering of corpus subsets selected via Scatter/Gather browsing [3]. In this paper we expand on those techniques, giving an algorithm for almost-constant-time clustering of arbitrary corpus subsets. This algorithm is never slower than clustering the document set from scratch, and for medium-sized and large sets it is significantly faster. This algorithm is useful for clustering arbitrary subsets of large corpora (obtained, for instance, by a boolean search) quickly enough to be useful in an interactive setting.
1 Introduction
Document clustering has emerged as an important tool for the presentation and navigation of document collections. For example, the Scatter/Gather browsing paradigm clusters documents into topic-coherent groups and presents descriptive textual summaries to the user [2]. Informed by the summaries, the user may select clusters, thereby forming a subcollection, for iterative examination. The clustering and reclustering is done ...
The use of categories and clusters for organizing retrieval results
- Natural Language Information Retrieval
, 1999
Cited by 19 (1 self)
An important problem for information access systems is that of organizing large sets of documents that have been retrieved in response to a query. Text categorization and text clustering are two natural language processing tasks whose results can be applied to document organization. This chapter describes user interfaces that use categories and clusters to organize retrieval results, and examines the relationship between the two.
Protofoil: Storing and Finding the Information Worker's Paper Documents in an Electronic File Cabinet
- CHI
, 1994
Cited by 15 (1 self)
Although the document imaging industry has taken off in the last few years, document image filing for the individual information worker is still not widespread or effective. In this paper, we focus on building an electronic filing system for paper documents that supports the ad hoc, multifarious work of information workers. Motivated by interviews with researchers and a survey of descriptive studies of paper document filing, we have focussed on minimizing or delaying costs of document filing and supporting a rich variety of methods for accessing and using stored documents. We have implemented a prototype system called Protofoil for storing, retrieving, and manipulating paper documents as electronic images that integrates many user interface (paper and workstation) and information retrieval technologies. Protofoil has been tested through use in our laboratory, and has been deployed in a field study at a lawyer's office.
an overview
- American Family Physician
, 2007
Cited by 15 (0 self)
CloVR-Metagenomics: Functional and taxonomic microbial community characterization from metagenomic whole-genome shotgun (WGS) sequences – standard operating procedure, version 1.0
Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis
- In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2
, 2000
Cited by 15 (4 self)
This paper presents a taxonomy of previous work on infrastructures, architectures and development environments for representing and processing Language Resources (LRs), corpora, and annotations. This classification is then used to derive a set of requirements for a Software Architecture for Language Engineering (SALE). The analysis shows that a SALE should address common problems and support typical activities in the development, deployment, and maintenance of LE software. The results will be used in the next phase of construction of an infrastructure for LR production, distribution, and access.
Snippet Search: a Single Phrase Approach to Text Access
- In Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association
, 1991
Cited by 15 (1 self)
this paper. In the worst case, the inner loop of this algorithm is executed
Xerox Site Report: Four TREC-4 Tracks
- The Fourth Text REtrieval Conference (TREC-4
, 1996
Cited by 12 (2 self)
this document sample than one would expect by chance. The terms are selected according to a binomial likelihood ratio test [10], comparing their occurrence in the first 20 documents to their occurrence in the rest of the collection. The selected terms are then weighted in proportion to the significance of their occurrence in the sampled documents. Since it uses the baseline results, this run may also be affected by the programming error described above.

      query set   base    infl    infl-np   expand
all   Q1-25       0.454   0.484   0.492     0.467
all   Q26-50      0.174   0.204   0.212     0.267
20    Q1-25       0.718   0.718   0.722     0.722
20    Q26-50      0.306   0.354   0.378     0.402

Table 8: Average precision at all relevant docs (all) and average precision at 20 docs (20) for Spanish queries.

The corrected Spanish performance figures are presented in Table 8. We include four different runs: (1) base = stop list but no morphological analysis, (2) infl = text lemmatized (stemmed) with inflectional morphology, (3) infl-np = noun phrase weight doubled, and (4) expand = query expansion. The uncorrected performance figures for infl-np and expand on Q26-50 (corresponding to our submitted runs) are 0.190/0.366 and 0.238/0.380 respectively. We present separate results for queries 1-25 (used for TREC-3) and 26-50 (used for TREC-4) since the former are substantially longer, and we note that the results reflect the difference in length. We find that lemmatization using inflectional morphology helps in most cases, making a 3-5% absolute difference in performance. However, when the queries are long and the user is examining fewer than 20 documents, there is no improvement. These conclusions agree with the results obtained for English [13], although the Spanish inflectional morphology is somewhat more effective than its English counterpart. Doubling ...
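The term-selection step described in this excerpt (comparing a term's occurrence in the top-ranked documents with its occurrence in the rest of the collection) can be sketched roughly as follows. This is only an illustration, assuming a log-likelihood ratio statistic for two binomial samples; the function names, the 3.84 cutoff, the requirement that the term be over-represented in the top documents, and the use of the statistic itself as the term weight are all assumptions, not details taken from the report.

```python
import math


def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k successes in n trials (constant term dropped)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def binomial_llr(k1, n1, k2, n2):
    """Likelihood ratio statistic -2 log(lambda) for two binomial samples.

    k1 of n1 top-ranked documents contain the term;
    k2 of n2 remaining documents contain the term.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    return 2.0 * (
        _log_likelihood(k1, n1, p1)
        + _log_likelihood(k2, n2, p2)
        - _log_likelihood(k1, n1, pooled)
        - _log_likelihood(k2, n2, pooled)
    )


def select_terms(top_docs, rest_docs, threshold=3.84):
    """Return {term: weight} for terms over-represented in top_docs.

    Each document is a set of terms. The threshold (hypothetical choice)
    is roughly the 95% point of a chi-square with one degree of freedom.
    """
    n1, n2 = len(top_docs), len(rest_docs)
    weights = {}
    for term in set().union(*top_docs):
        k1 = sum(term in doc for doc in top_docs)
        k2 = sum(term in doc for doc in rest_docs)
        score = binomial_llr(k1, n1, k2, n2)
        # Keep only terms that are more frequent in the top documents.
        if k1 / n1 > k2 / n2 and score > threshold:
            weights[term] = score  # assumed weighting: the statistic itself
    return weights
```

In the setting of the excerpt, `top_docs` would hold the first 20 retrieved documents and `rest_docs` the remainder of the collection.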
Xerox TREC-5 Site Report: Routing, Filtering, NLP, and Spanish Tracks
- In Proceedings of the Fifth Text REtrieval Conference (TREC-5
, 1997
Cited by 8 (0 self)
this report is divided into three sections. The first section describes our work on routing and filtering (Hull, Schutze, and Pedersen), the second section covers the NLP track (Grefenstette, Schulze, and Gaussier), and the final section addresses the Spanish track (Grefenstette, Schulze, and Hull).
2 Routing and Filtering
Document Routing as Statistical Classification
- In AAAI Spring Symposium on Machine Learning in Information Access Technical Papers
, 1996
Cited by 8 (0 self)
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
The Routing Problem
Of the two classical information retrieval tasks, document routing is most amenable to machine learning. A fixed, standing query, and a training collection of
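For readers unfamiliar with the baseline named in this abstract, the standard Rocchio relevance-feedback update can be sketched as below. This is a generic textbook form and not the paper's implementation: the sparse dictionary representation, the function names, the default values of `alpha`, `beta`, and `gamma`, and the dropping of non-positive weights are all assumptions made for illustration.

```python
from collections import defaultdict


def centroid(docs):
    """Mean term-weight vector of a list of sparse documents ({term: weight})."""
    if not docs:
        return {}
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).

    The parameter defaults are illustrative, not taken from the paper.
    Terms whose updated weight is not positive are dropped (a common
    convention, assumed here).
    """
    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    updated = {}
    for term in set(query) | set(rel_c) | set(non_c):
        weight = (
            alpha * query.get(term, 0.0)
            + beta * rel_c.get(term, 0.0)
            - gamma * non_c.get(term, 0.0)
        )
        if weight > 0:
            updated[term] = weight
    return updated
```

In a routing setting, `relevant` and `nonrelevant` would come from the judged training collection for the standing query, and the updated vector would then be used to score incoming documents.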