Results 1 - 10
of
185
Unsupervised namedentity extraction from the web: An experimental study.
- Artificial Intelligence,
, 2005
"... Abstract The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOW-ITALL's novel architecture and ..."
Abstract
-
Cited by 372 (39 self)
- Add to MetaCart
(Show Context)
Abstract The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOW-ITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOW-ITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
A Flexible Learning System for Wrapping Tables and Lists in HTML Documents
- In International World Wide Web Conference
, 2002
"... this paper we will discuss some of the more important representational issues for wrapper learners, focusing on the specific problem of extracting text from web pages. We argue that pure DOM- or token-based representations of web pages are inadequate for the purpose of learning wrappers ..."
Abstract
-
Cited by 108 (5 self)
- Add to MetaCart
this paper we will discuss some of the more important representational issues for wrapper learners, focusing on the specific problem of extracting text from web pages. We argue that pure DOM- or token-based representations of web pages are inadequate for the purpose of learning wrappers
Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
, 1999
"... A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new appl ..."
Abstract
-
Cited by 90 (20 self)
- Add to MetaCart
A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
Wrapper Maintenance: A Machine Learning Approach
- Journal of Artificial Intelligence Research
, 2003
"... The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and e#cient generation of wrappers, the development of tools for wrapper maintenance has received less attention. ..."
Abstract
-
Cited by 86 (17 self)
- Add to MetaCart
(Show Context)
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and e#cient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an e#cient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year.
Electric elves: Applying agent technology to support human organizations
, 2001
"... The operation of a human organization requires dozens of everyday tasks to ensure coherence in organizational activities, to monitor the status of such activities, to gather information relevant to the organization, to keep everyone in the organization informed, etc. Teams of software agents can aid ..."
Abstract
-
Cited by 79 (27 self)
- Add to MetaCart
The operation of a human organization requires dozens of everyday tasks to ensure coherence in organizational activities, to monitor the status of such activities, to gather information relevant to the organization, to keep everyone in the organization informed, etc. Teams of software agents can aid humans in accomplishing these tasks, facilitating the organization’s coherent functioning and rapid response to crises, while reducing the burden on humans. Based on this vision, this paper reports on Electric Elves, a system that has been operational, 24/7, at our research institute since June 1, 2000. Tied to individual user workstations, fax machines, voice, mobile devices such as cell phones and palm pilots, Electric Elves has assisted us in routine tasks, such as rescheduling meetings, selecting presenters for research meetings, tracking people’s locations, organizing lunch meetings, etc. We discuss the underlying AI technologies that led to the success of Electric Elves, including technologies devoted to agenthuman interactions, agent coordination, accessing multiple heterogeneous information sources, dynamic assignment of organizational tasks, and deriving information about organization members. We also report the results of deploying Electric Elves in our own research organization.
On automatic information extraction from large web sites
, 2004
"... Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of chan ..."
Abstract
-
Cited by 77 (7 self)
- Add to MetaCart
Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised—that is, fully automatic—wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system
Active Learning with Multiple Views
, 2002
"... Active learners alleviate the burden of labeling large amounts of data by detecting and asking the user to label only the most informative examples in the domain. We focus here on active learning for multi-view domains, in which there are several disjoint subsets of features (views), each of which i ..."
Abstract
-
Cited by 54 (1 self)
- Add to MetaCart
(Show Context)
Active learners alleviate the burden of labeling large amounts of data by detecting and asking the user to label only the most informative examples in the domain. We focus here on active learning for multi-view domains, in which there are several disjoint subsets of features (views), each of which is sufficient to learn the target concept. In this paper we make several contributions. First, we introduce Co-Testing, which is the first approach to multi-view active learning. Second, we extend the multi-view learning framework by also exploiting weak views, which are adequate only for learning a concept that is more general/specific than the target concept. Finally, we empirically show that Co-Testing outperforms existing active learners on a variety of real world domains such as wrapper induction, Web page classification, advertisement removal, and discourse tree parsing. 1.
Automatic Data Extraction from Lists and Tables in Web Sources
- In Proceedings of the workshop on Advances in Text Extraction and Mining (IJCAI-2001), Menlo Park
, 2001
"... We describe a technique for extracting data from lists and tables and grouping it by rows and columns. This is done completely automatically, using only some very general assumptions about the structure of the list. We have developed a suite of unsupervised learning algorithms that induce the ..."
Abstract
-
Cited by 47 (5 self)
- Add to MetaCart
(Show Context)
We describe a technique for extracting data from lists and tables and grouping it by rows and columns. This is done completely automatically, using only some very general assumptions about the structure of the list. We have developed a suite of unsupervised learning algorithms that induce the structure of lists by exploiting the regularities both in the format of the pages and the data contained in them. Among the tools used are AutoClass for automatic classication of data and grammar induction of regular languages. The approach was tested on 14 Web sources providing diverse data types, and we found that for 10 of these sources we were able to correctly nd lists and partition the data into columns and rows.
Automatic Annotation of Data Extracted from Large Web Sites
- Proc. Sixth International Workshop on the Web and Databases (WebDB 2003
, 2003
"... Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can p ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
(Show Context)
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the approach, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach has a general validity and therefore it can be applied together with other wrapper generator systems. We have experimented the prototype over several real-life web sites obtaining encouraging results.
Interactive Learning of Node Selecting Tree Transducer
- MACHINE LEARNING
, 2007
"... We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particul ..."
Abstract
-
Cited by 40 (14 self)
- Add to MetaCart
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to