Results 1 - 10
of
14
Learning Page-Independent Heuristics for Extracting Data from Web Pages
- In AAAI Spring Symposium on Intelligent Agents in Cyberspace
, 1999
"... One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers"---programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers"---programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers. Keywords: information integration, machine learning, extraction. 1 Introduction A number of recent systems operate by taking informatio...
A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition
, 2003
"... Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
TINTIN: A System for Retrieval in Text Tables
- IN PROCEEDINGS OF THE SECOND ACM INTERNATIONAL CONFERENCE ON DIGITAL LIBRARIES
, 1997
"... Tables form an important kind of data element in text retrieval. Often, the gist of an entire news article or other exposition can be concisely captured in tabular form. In this paper, we examine the utility of exploiting information other than the key words in a digital document to provide the u ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
Tables form an important kind of data element in text retrieval. Often, the gist of an entire news article or other exposition can be concisely captured in tabular form. In this paper, we examine the utility of exploiting information other than the key words in a digital document to provide the users with more flexible and powerful query capabilities. More specifically, we exploit the structural information in a document to identify tables and their component fields and let the users query based on these fields. Our empirical results have demonstrated that heuristic method based table extraction and component tagging can be performed effectively and efficiently. Moreover, our experiments in retrieval using the TINTIN system have strongly indicated that such structural decomposition can facilitate better representation of user's information needs and hence more effective retrieval of tables.
Jumping Connections: A Graph-Theoretic Model for Recommender Systems
- MASTER’S THESIS, VIRGINIA TECH
, 2001
"... Recommender systems have become paramount to customize information access and reduce information overload. They serve multiple uses, ranging from suggesting products and artifacts (to consumers), to bringing people together by the connections induced by (similar) reactions to products and services. ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Recommender systems have become paramount to customize information access and reduce information overload. They serve multiple uses, ranging from suggesting products and artifacts (to consumers), to bringing people together by the connections induced by (similar) reactions to products and services. This thesis presents a graph-theoretic model that casts recommendation as a process of ‘jumping connections’ in a graph. In addition to emphasizing the social network aspect, this viewpoint provides a novel evaluation criterion for recommender systems. Algorithms for recommender systems are distinguished not in terms of predicted ratings of services/artifacts, but in terms of the combinations of people and artifacts that they bring together. We present an algorithmic framework drawn from random graph theory and outline an analysis for one particular form of jump called a ‘hammock.’ Experimental results on two datasets collected over the Internet demonstrate the validity of this approach.
Selection of Passages for Information Reduction
, 1996
"... is a huge manual undertaking, particularly when there are fifty or more texts. Unfortunately, full-text understanding is not yet feasible as an alternative and information extract techniques themselves rely on large numbers of training texts with manually encoded answer keys. By locating and present ..."
Abstract
-
Cited by 10 (5 self)
- Add to MetaCart
is a huge manual undertaking, particularly when there are fifty or more texts. Unfortunately, full-text understanding is not yet feasible as an alternative and information extract techniques themselves rely on large numbers of training texts with manually encoded answer keys. By locating and presenting relevant passages to the user, we will have significantly reduced the time and effort expenditure. Alternatively, we could save an automated informationextraction system from processing an entire text by focusing the system on those portions of the text most likely to contain the desired information. This work integrates a case-based reasoner with an IR engine to reduce the information bottleneck. SPIRE [Se- This research was supported byNSF Grant no. EEC-9209623, State/Industry/University Cooperative Research on Intelligent Information Retrieval, Digital Equipment Corporation and the National Center for Automated Information Research. lection of Passages for Inf
The Partial Evaluation Approach to Information Personalization
- ACM Transactions on Information Systems
, 2001
"... Information personalization refers to the automatic adjustment of information content, structure, and presentation tailored to an individual user. By reducing information overload and customizing information access, personalization systems have emerged as an important segment of the Internet economy ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
Information personalization refers to the automatic adjustment of information content, structure, and presentation tailored to an individual user. By reducing information overload and customizing information access, personalization systems have emerged as an important segment of the Internet economy. This paper presents a systematic modeling methodology --- PIPE (`Personalization is Partial Evaluation') --- for personalization. Personalization systems are designed and implemented in PIPE by modeling an information-seeking interaction in a programmatic representation. The representation supports the description of information-seeking activities as partial information and their subsequent realization by partial evaluation, a technique for specializing programs. We describe the modeling methodology at a conceptual level and outline representational choices. We present two application case studies that use PIPE for personalizing web sites and describe how PIPE suggests a novel evaluation criterion for information system designs. Finally, we mention several fundamental implications of adopting the PIPE model for personalization and when it is (and is not) applicable.
Personalizing Interactions with Information Systems
- in Advances in Computers
, 2002
"... Personalization constitutes the mechanisms and technologies necessary to customize information access to the end-user. It can be defined as the automatic adjustment of information content, structure, and presentation tailored to the individual. In this chapter, we study personalization from the view ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
Personalization constitutes the mechanisms and technologies necessary to customize information access to the end-user. It can be defined as the automatic adjustment of information content, structure, and presentation tailored to the individual. In this chapter, we study personalization from the viewpoint of personalizing interaction. The survey covers mechanisms for information-finding on the web, advanced information retrieval systems, dialogbased applications, and mobile access paradigms. Specific emphasis is placed on studying how users interact with an information system and how the system can encourage and foster interaction. This helps bring out the role of the personalization system as a facilitator which reconciles the user's mental model with the underlying information system's organization. Three tiers of personalization systems are presented, paying careful attention to interaction considerations. These tiers show how progressive levels of sophistication in interaction can be achieved. The chapter also surveys systems support technologies and niche application domains.
Program transformations for information personalization
, 2004
"... Personalization constitutes the mechanisms necessary to automatically customize information content, structure, and presentation to the end-user to reduce information overload. Unlike traditional approaches to personalization, the central theme of our approach is to model a website as a program and ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
Personalization constitutes the mechanisms necessary to automatically customize information content, structure, and presentation to the end-user to reduce information overload. Unlike traditional approaches to personalization, the central theme of our approach is to model a website as a program and conduct website transformation for personalization by program transformation (e.g., partial evaluation, program slicing). The goal of this paper is study personalization through a program transformation lens, and develop a formal model, based on program transformations, for personalized interaction with hierarchical hypermedia. The specific research issues addressed involve identifying and developing program representations and transformations suitable for classes of hierarchical hypermedia, and providing supplemental interactions for improving the personalized experience. The primary form of personalization discussed is out-of-turn interaction – a technique which empowers a user navigating a hierarchical website to postpone clicking on any of the hyperlinks presented on the current page and, instead, communicate the
A Language for Specifying and Comparing Table Recognition Strategies
, 2004
"... Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth, ’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strat-egy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suit-able decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpreta-tion tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demon-strate how RSL allows strategies to be specified at the level of decisions rather than ii algorithms, and we compare results of our strategy implementations using new tech-niques. In particular, we introduce historical recall and precision metrics. Con-ventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result. iii
Explaining Scenarios for Information Personalization
- ACM Transactions on Computer-Human Interaction
, 2001
"... Personalization customizes information access. The PIPE (`Personalization is Partial Evaluation') modeling methodology represents interaction with an information space as a program. The program is then specialized to a user's known interests or information seeking activity by the technique of partia ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Personalization customizes information access. The PIPE (`Personalization is Partial Evaluation') modeling methodology represents interaction with an information space as a program. The program is then specialized to a user's known interests or information seeking activity by the technique of partial evaluation. In this paper, we elaborate PIPE by considering requirements analysis in the personalization lifecycle. We investigate the use of scenarios as a means of identifying and analyzing personalization requirements. As our first result, we show how designing a PIPE representation can be cast as a search within a space of PIPE models, organized along a partial order. This allows us to view the design of a personalization system, itself, as specialized interpretation of an information space. We then exploit the underlying equivalence of explanation-based generalization (EBG) and partial evaluation to realize high-level goals and needs identified in scenarios

