Learning Hidden Markov Models for Information Extraction Actively from Partially Labeled Text
, 2002
Cited by 21 (0 self)
A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. Consequently, it is of enormous importance to construct tools that allow users to extract information from textual documents as easily as it can be extracted from structured databases. Information Extraction (IE)...
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
- In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2005
Cited by 13 (5 self)
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1%-29.0% improvements).
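The baseline named in this abstract (taking the lines in the largest font size as the title) can be illustrated with a minimal sketch. The `Line` record, the `largest_font_title` function, and the sample data are hypothetical names introduced here for illustration; the abstract does not say how font sizes are obtained or how ties and multi-line titles are handled, so joining all largest-font lines in document order is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One rendered line of an HTML body (hypothetical record)."""
    text: str
    font_size: float


def largest_font_title(lines):
    """Baseline sketch: join, in document order, the non-empty lines
    whose font size equals the maximum font size in the document."""
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return ""
    max_size = max(ln.font_size for ln in candidates)
    # Assumption: every line at the maximum size belongs to the title.
    return " ".join(ln.text.strip() for ln in candidates if ln.font_size == max_size)


body = [
    Line("Home | About", 10.0),
    Line("Title Extraction", 24.0),
    Line("from HTML Bodies", 24.0),
    Line("Some paragraph text.", 12.0),
]
print(largest_font_title(body))
```

The supervised method described in the abstract goes beyond this baseline by also using position and font weight as features; none of that is modeled here.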
Active Learning of Partially Hidden Markov Models
- In Proceedings of the ECML/PKDD Workshop on Instance Selection
, 2001
Cited by 12 (2 self)
We consider the task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled observation sequences are available for training. This setting is motivated by the information extraction problem, where only a few tokens in the training documents are given a semantic tag while most tokens are unlabeled. We first describe the partially hidden Markov model together with an algorithm for learning HMMs from partially labeled data. We then present an active learning algorithm that selects "difficult" unlabeled tokens and asks the user to label them. We study empirically by how much active learning reduces the required data labeling effort, or increases the quality of the learned model achievable with a given amount of user effort.
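The selection step in this abstract (picking "difficult" unlabeled tokens to show the user) might be sketched as follows. The abstract does not define "difficult", so this sketch assumes a margin criterion: a token is harder the smaller the gap between its two most probable labels. The function name `select_difficult_tokens`, the per-token posterior dictionaries, and the sample values are all hypothetical; in practice the posteriors would come from the trained model.

```python
def select_difficult_tokens(posteriors, labeled, k=1):
    """Return the indices of the k unlabeled tokens with the smallest
    margin between their two most probable labels.

    posteriors: list of dicts mapping label -> probability, one per token
    labeled:    set of token indices that already carry a user label
    """
    def margin(dist):
        top = sorted(dist.values(), reverse=True)
        second = top[1] if len(top) > 1 else 0.0
        return top[0] - second

    unlabeled = [i for i in range(len(posteriors)) if i not in labeled]
    # Smallest margin first: these are the tokens the model is least sure about.
    return sorted(unlabeled, key=lambda i: margin(posteriors[i]))[:k]


posteriors = [
    {"name": 0.90, "other": 0.10},
    {"name": 0.55, "other": 0.45},
    {"name": 0.50, "other": 0.50},  # already labeled, so never selected
    {"name": 0.20, "other": 0.80},
]
print(select_difficult_tokens(posteriors, labeled={2}, k=2))
```

Other uncertainty measures (for example, entropy of the label distribution) would slot into `margin` without changing the rest of the loop.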
Ontology-Based Extraction of RDF Data from the World Wide Web
, 2003
Cited by 11 (0 self)
The simplicity and proliferation of the World Wide Web (WWW) have taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hindrance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular, the BYU ontology-based data-extraction system, to extract information from the WWW based on a Semantic Web ontology to produce Semantic Web data with respect to that ontology. As an example of how the generated Semantic Web data can be used, we provide an application to browse the extracted data and the source documents together. In this sense, the extracted data is superimposed over or is an index over the source documents. Our experiments with ontologies in four application domains show that our approach can indeed extract Semantic Web data from the WWW with precision and recall similar to that achieved by the underlying information extraction system and make that data accessible to Semantic Web applications.
Exploiting ASP for Semantic Information Extraction
- In Proceedings ASP05 - Answer Set Programming: Advances in Theory and Implementation
, 2005
Cited by 7 (4 self)
The paper describes HıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic, HıLεX combines both semantic and syntactic knowledge for powerful information extraction. In particular, exploiting background knowledge stored in a domain ontology significantly strengthens the information extraction mechanisms. HıLεX is founded on a new two-dimensional representation of documents, and heavily exploits DLP+, an extension of disjunctive logic programming for ontology representation and reasoning which has recently been implemented on top of DLV. The domain ontology is represented in DLP+, and the extraction patterns are encoded by DLP+ reasoning modules, whose execution yields the actual extraction of information from the input document. HıLεX can extract information from both HTML and flat text documents.
Natural Language Guided Dialogues for Accessing the Web
- In the Proceedings of the Fifth International Conference on Text, Speech and Dialogue. Brno, Czech Republic, 2001. Springer-Verlag in Lecture Notes in Artificial Intelligence subseries of LNCS series as Volume 2448
, 2002
Cited by 4 (2 self)
This paper proposes the use of ontologies representing domain and linguistic knowledge for guiding natural language (NL) communication on Web contents. This proposal deals with the problem of accessing and processing the Web data required to answer user queries. Concepts and communication acts are represented in the conceptual ontology (CO). Domain-restricted grammars and lexicons are obtained automatically by adapting the general linguistic knowledge to cover the communication acts for a particular domain. The use of domain-restricted grammars and lexicons has proved to be efficient, especially when the user is guided in introducing the NL queries. Once the query has been processed, the system fires the appropriate wrappers to extract the data from the Web. The domain concepts described in the CO provide a unifying framework to represent the knowledge obtained from the various Web sources. Following this proposal, a dialogue system for accessing, in Spanish, a set of Web sites in the travel domain has been implemented.
Web mining
- In Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge Discovery Handbook
, 2005
Cited by 2 (0 self)
The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems and web usage mining.
An Overview and Classification of Adaptive Approaches to Information Extraction
- Journal on Data Semantics, IV:172–212. LNCS 3730
, 2005
Cited by 2 (1 self)
Most of the information stored in digital form is hidden in natural language texts. Extracting and storing it in a formal representation (e.g. in the form of relations in databases) allows efficient querying, easy administration and further automatic processing of the extracted data. The area of information extraction (IE) comprises techniques, algorithms and methods performing two important tasks: finding (identifying) the desired, relevant data and storing it in an appropriate form for future use. The rapidly increasing number and diversity of IE systems are evidence of continuous activity and growing attention to this field. At the same time it is becoming more and more difficult to survey the scope of IE and to see the advantages of certain approaches and how they differ from others. In this paper we identify and describe promising approaches to IE. Our focus is adaptive systems that can be customized for new domains through training or the use of external knowledge sources. Based on the observed origins and requirements of the examined IE techniques, a classification of different types of adaptive IE systems is established.
Automated Information Extraction from Web Sources: a Survey
Cited by 2 (0 self)
The Web contains an enormous quantity of information which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years some authors have addressed the problem of converting Web documents from unstructured or semi-structured format into a structured and therefore machine-understandable format such as, for example, XML. In this paper we briefly survey some of the most promising and recently developed extraction tools.

