RCV1: A new benchmark collection for text categorization research
- Journal of Machine Learning Research, 2004
Cited by 312 (5 self)
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as
Evaluating Message Understanding Systems: An Analysis of . . .
- Computational Linguistics, 1993
Cited by 48 (2 self)
This paper describes and analyzes the results of the Third Message Understanding Conference (MUC-3). It reviews the purpose, history, and methodology of the conference, summarizes the participating systems, discusses issues of measuring system effectiveness, describes the linguistic phenomena tests, and provides a critical look at the evaluation in terms of the lessons learned. One of the common problems with evaluations is that the statistical significance of the results is unknown. In the discussion of system performance, the statistical significance of the evaluation results is reported, and the use of approximate randomization to calculate the statistical significance of the results of MUC-3 is described.
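The abstract above names approximate randomization as its significance test. A minimal sketch of the general technique follows, assuming two systems scored on the same set of items; the function name, parameters, and the choice of mean score difference as the test statistic are illustrative assumptions, not details taken from the MUC-3 paper.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean score
    between two systems evaluated on the same items (hypothetical helper)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired item by item")
    rng = random.Random(seed)
    n = len(scores_a)
    # Test statistic on the real assignment of scores to systems.
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = 0.0
        total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A small returned value suggests the observed difference would rarely arise if the two systems were interchangeable; the number of trials trades run time for resolution of the estimate.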
A Multilevel Approach to Intelligent Information Filtering: Model, System, and Evaluation
- ACM Transactions on Information Systems, 1997
Cited by 45 (5 self)
In this article, a filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties. A filtering system, SIFTER, has been implemented based on the model, using established techniques in information retrieval and artificial intelligence. These techniques include document representation by a vector-space model, document classification by unsupervised learning, and user modeling by reinforcement learning. The system can filter information based on content and a user's specific interests. The user's interests are automatically learned with only limited user intervention in the form of optional relevance feedback for documents. We also describe experimental studies conducted with SIFTER to filter computer and information science documents collected from the Internet and commercial database services. The experimental results demonstrate that the system performs very well in filtering documents in a realistic problem setting.
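The vector-space document representation mentioned in the abstract above can be illustrated with a short sketch. It assumes raw term counts and cosine similarity; the abstract does not state SIFTER's actual term weighting or similarity measure, so both function names and both choices here are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector of raw term counts
    (an assumed weighting; real systems often use tf-idf instead)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

In a filtering setting, a score like this could be computed between an incoming document and a stored profile vector, with documents above a chosen threshold passed on to the user.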
Querying Text Databases for Efficient Information Extraction
- In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003
Cited by 37 (9 self)
A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adapt to new databases and domains. In this paper, we develop an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases, which can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents.
Coupling Information Retrieval and Information Extraction: A New Text Technology for Gathering Information from the Web
- In Proceedings of the 5th Computer-Assisted Information Searching on Internet Conference (RIAO'97), 1997
Cited by 15 (3 self)
The techniques of information retrieval and information extraction are complementary, but to date there has been little concrete work aimed at integrating the two. We describe how each of these techniques contributes to the process of transferring information from generator to user, summarise the issues which must be addressed if they are to work together, and report the results of some preliminary experiments on coupling them which indicate that these technologies can be jointly used to construct a structured data resource from free text on the WWW.
Unknown Value Lists and Their Use for Semantic Analysis in IDA - the Integrated Deductive Approach to Natural Language Interface Design
- Proc. of the Australasian Database Conf., 1996
Cited by 9 (9 self)
The framework of our research originates from natural language processing and deductive database technology. Deductive databases possess superior functionality relevant to the efficient solution of many problems in practical applications, yet they are still not widely known. We identified the absence of any user-friendly interface as the main obstacle. Natural language interfaces have been proposed as an optimal candidate; however, in spite of the vast number of ambitious attempts to build natural language front ends, the results achieved were rather disappointing. In our opinion, the main reason for this is missing integration, which is responsible for insufficient performance and wrong interpretation. In our Integrated Deductive Approach (IDA), the interface constitutes an integral part of the database system itself, which guarantees the consistent mapping from the user query to the appropriate semantic application model. This paper focuses on the semantic analysis, for which we introduce unknown value list (UVL) analysis, a technique that operates directly on the evaluation of database values and deep forms of functional words; that is, syntactic analysis is only applied if necessary for disambiguation. We prove the feasibility of the IDA approach by use of a case study, the design and implementation of a production planning and control system.
Machine Learning for Information Extraction from Online Documents
- 1996
Cited by 5 (1 self)
The experiment described here was designed for two things: to test the feasibility of a learning approach to information extraction in a real-world domain, and to uncover evidence that by using multiple learners it is possible to achieve better performance than by using a single learner. Because the documents used in this experiment are taken unmodified from a real online environment designed for human-to-human communication, the task is a challenging one. Its difficulty varies considerably from field to field, but in all cases, in order to conclude that this approach is feasible, I require of each learner that its performance is substantially better than that of a random guesser. Of course, in practice the required performance level is defined by the intended application. Consequently, my argument for feasibility is informal. Some applications may be able to exploit a well-behaved precision-recall curve, so I look for this from the learners tested here. We cannot
Constraint Based Event Recognition for Information Extraction
- 1995
Cited by 2 (0 self)
A common feature of news reports is the reference to events other than the one which is central to the discourse. Previous research has suggested Gricean explanations for this; more generally, the phenomenon has been referred to simply as "journalistic style". Whatever the underlying reasons, recent investigations into information extraction have emphasised the need for a better understanding of the mechanisms that can be used to recognise and distinguish between multiple events in discourse. Existing information extraction systems approach the problem of event recognition in a number of ways. However, although frameworks and techniques for black box evaluations of information extraction systems have been developed in recent years, almost no attention has been given to the evaluation of techniques for event recognition, despite general acknowledgment of the inadequacies of current implementations. Not only is it unclear which mechanisms are useful, but there is also little consensus as...
Towards The Automatic Identification Of Adjectival Scales: Clustering Adjectives According To Meaning
- 1993
In this paper we present a method to group adjectives according to their meaning, as a first step towards the automatic identification of adjectival scales. We discuss the properties of adjectival scales and of groups of semantically related adjectives and how they imply sources of linguistic knowledge in text corpora. We describe how our system exploits this linguistic knowledge to compute a measure of similarity between two adjectives, using statistical techniques and without having access to any semantic information about the adjectives. We also show how a clustering algorithm can use these similarities to produce the groups of adjectives, and we present results produced by our system for a sample set of adjectives. We conclude by presenting evaluation methods for the task at hand, and analyzing the significance of the results obtained.

