From dirt to shovels: Fully automatic tool generation from ad hoc data
In POPL, 2008
Cited by 24 (9 self)
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semistructured query engine, several format converters, a statistical analyzer and data visualization routines, directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler and automatically generate fully functional tools. We evaluate the performance of our inference algorithm, showing that it scales linearly in the size of the training data and completes in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
Corrected co-training for statistical parsers
In ICML-03 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003
Cited by 20 (2 self)
Corrected co-training (Pierce & Cardie, 2001) and the closely related co-testing (Muslea et al., 2000) are active learning methods which exploit redundant views to reduce the cost of manually creating labeled training data. We extend these methods to statistical parsing algorithms for natural language. Because creating complex parse structures by hand is significantly more time-consuming than selecting labels from a small set, it may be easier for the human to correct the learner’s partially accurate output than to generate the complex label from scratch. The goal of our work is to minimize the number of corrections that the annotator must make. To reduce the human effort in correcting machine-parsed sentences, we propose a novel approach, which we call one-sided corrected co-training, and show that this method requires only a third as many manual annotation decisions as corrected co-training/co-testing to achieve the same improvement in performance.
Extracting Web data using instance-based learning
In WISE-05, 2005
Cited by 12 (1 self)
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling and addresses a major problem with inductive learning (or wrapper induction), namely that the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates, and pages of the same template can usually be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It significantly outperforms existing state-of-the-art systems.
Text Mining through Semi Automatic Semantic Annotation
Proc. of PAKM’2006, 2006
Cited by 7 (5 self)
The Web is the greatest information source in human history.
Information Extraction from Multi-Document Threads
2003
Cited by 5 (0 self)
Information extraction (IE) is the task of extracting fragments of important information from natural language documents. Most IE research involves algorithms for learning to exploit regularities inherent in the textual information and language use, and such systems generally assume that each document can be processed in isolation. We are extending IE techniques to multi-document extraction tasks, in which the information to be extracted is distributed across several documents. For example, many kinds of work-flow transactions are realized as sequences of electronic mail messages comprising a conversation among several participants. We show that IE performance can be improved by harnessing the structural and temporal relationships between documents.
Deploying information agents on the web
In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), 2003
Cited by 4 (2 self)
The information resources on the Web are vast, but much of the Web is based on a browsing paradigm that requires someone to actively seek information. Instead, one would like to have information agents that continuously attend to one's personal information needs. Such agents need to be able to extract the relevant information from web sources, integrate data across sites, and execute efficiently in a networked environment. In this paper I describe the technologies we have developed to rapidly construct and deploy information agents on the Web. These include wrapper learning to convert online sources into agent-friendly resources, query planning and record linkage to integrate data across different sites, and streaming dataflow execution to efficiently execute agent plans. I also describe how we applied this work within the Electric Elves project to deploy a set of agents for continuous monitoring of travel itineraries.
Semi-Automatic Semantic Annotations for Web Documents
Proc. SWAP 2005, 2nd Italian Semantic Web Workshop, 2005
Cited by 4 (0 self)
Semantic annotation of Web documents is the only way to make the Semantic Web vision a reality. Considering the scale and dynamics of the World Wide Web, the largest knowledge base ever built, it becomes clear that we cannot afford to annotate Web documents manually.
An Overview and Classification of Adaptive Approaches to Information Extraction
Journal on Data Semantics, IV:172–212. LNCS 3730, 2005
Cited by 2 (1 self)
Most of the information stored in digital form is hidden in natural language texts. Extracting and storing it in a formal representation (e.g. in the form of relations in databases) allows efficient querying, easy administration and further automatic processing of the extracted data. The area of information extraction (IE) comprises techniques, algorithms and methods performing two important tasks: finding (identifying) the desired, relevant data and storing it in an appropriate form for future use. The rapidly increasing number and diversity of IE systems are evidence of continuous activity and growing attention to this field. At the same time, it is becoming more and more difficult to survey the scope of IE and to see the advantages of certain approaches and how they differ from others. In this paper we identify and describe promising approaches to IE. Our focus is adaptive systems that can be customized for new domains through training or the use of external knowledge sources. Based on the observed origins and requirements of the examined IE techniques, a classification of different types of adaptive IE systems is established.
Automated Semantic Analysis of Schematic Data
Cited by 1 (0 self)
Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable and richly structured, semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using lightweight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selected fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of devices with constrained interaction modalities, such as small-screen handhelds. Fine-grained information extraction from Web pages, which is typically performed using page-specific syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.

