Automatic Annotation of Data Extracted from Large Web Sites
- Proc. Sixth International Workshop on the Web and Databases (WebDB 2003), 2003
Cited by 21 (0 self)
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the approach, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach has a general validity and therefore it can be applied together with other wrapper generator systems. We have tested the prototype on several real-life web sites, with encouraging results.
Learning effective surface text patterns for information extraction
- Proceedings of the EACL Workshop on Adaptive Text Extraction and Mining, pages 1–8, 2006
Cited by 12 (5 self)
We present a novel method to identify effective surface text patterns using an internet search engine. Precision is only one of the criteria to identify the most effective patterns among the candidates found. Another aspect is frequency of occurrence. Also, a pattern has to relate diverse instances if it expresses a non-functional relation. The learned surface text patterns are applied in an ontology population algorithm, which not only learns new instances of classes but also new instance pairs of relations. We present some first experiments with these methods.
From Information to Knowledge: Harvesting Entities and Relationships from Web Sources
Cited by 7 (4 self)
There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-language Web sources. Recent endeavors of this kind include DBpedia, EntityCube, KnowItAll, ReadTheWeb, and our own YAGO-NAGA project (and others). The goal is to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations as well as temporal contexts, with high precision and high recall. This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting.
Automatic Ontology Population by Googling
- In: Proceedings of the 17th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC), 2005
Cited by 5 (1 self)
We discuss a method to populate ontologies with the use of googled text fragments. We populate an ontology by the use of hand-crafted domain-specific relation patterns, which can be seen as a generalization of Hearst patterns. The algorithm described uses instances of some class returned by Google to find instances of other classes. A case study on populating an ontology on the movie domain is presented as an illustration of the method. We present the algorithm in detail and discuss the results of our work.
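The pattern-based step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `extract_related`, the `{known}`/`{slot}` pattern syntax, and the sample snippets are all hypothetical, and the snippets are hard-coded instead of being fetched from a search engine. The sketch assumes that related instances appear as runs of capitalised words and that the literal text of the pattern contains no regex metacharacters.

```python
import re
from collections import Counter

# Assumption: a related instance looks like a run of capitalised words.
CAPITALISED_RUN = r"((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"

def extract_related(instance, pattern, snippets):
    """Count the candidates that fill the open slot of `pattern` for a known instance."""
    # The known instance is matched literally; the open slot captures the candidate.
    regex = re.compile(pattern.format(known=re.escape(instance), slot=CAPITALISED_RUN))
    counts = Counter()
    for text in snippets:
        for match in regex.finditer(text):
            counts[match.group(1)] += 1
    return counts

# Hypothetical snippets standing in for search-engine results.
snippets = [
    "Jaws, directed by Steven Spielberg, was released in 1975.",
    "The thriller Jaws, directed by Steven Spielberg and based on the novel.",
    "Jaws, directed by a young filmmaker, became a hit.",
]
directors = extract_related("Jaws", "{known}, directed by {slot}", snippets)
print(directors.most_common(1))
```

Counting how often each candidate is seen gives a simple way to rank candidates, on the assumption that correct instances recur across independent snippets more often than noise does.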
Tagging artists using cooccurrences on the web
- Proceedings Third Philips Symposium on Intelligent Algorithms (SOIA 2006), pages 171–182, 2006
Cited by 2 (1 self)
We present an efficient unsupervised approach to finding subjective artist meta-data on the world wide web. Since we are interested in the collective knowledge on artists as available on the web, our method is based on the extraction of information from multiple web pages. We use co-occurrences of pairs of artists on the web to identify similarity between artists. To determine the applicability of tags to artists we follow the same approach. We use Google to find the co-occurrences on the web, either by analyzing Google excerpts found by querying patterns or by scanning full documents. Since the same tags are often applicable to related artists, we use similarity between artists to improve the tagging. We tested and compared the two co-occurrence extraction methods on two different domains: finding the most appropriate genres for music artists, and finding art-styles for painters. The results are convincing and show that the use of similar artists indeed improves the precision of the tagging. Key words: information extraction, Google, co-occurrence analysis, tagging, artist similarity.
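The idea that tags of similar artists can support the tagging of a given artist can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the scoring formula, so the weighted sum, the `weight` parameter, the function name `best_tag`, and the toy co-occurrence counts are all hypothetical.

```python
def best_tag(artist, tag_counts, similar, weight=0.5):
    """Pick the tag with the highest combined co-occurrence score for `artist`.

    tag_counts maps artist -> {tag: co-occurrence count};
    similar maps artist -> list of similar artists.
    """
    scores = {}
    # The artist's own artist/tag co-occurrence counts.
    for tag, count in tag_counts.get(artist, {}).items():
        scores[tag] = scores.get(tag, 0.0) + count
    # Assumption: similar artists contribute their counts at a reduced weight.
    for neighbour in similar.get(artist, []):
        for tag, count in tag_counts.get(neighbour, {}).items():
            scores[tag] = scores.get(tag, 0.0) + weight * count
    return max(scores, key=scores.get)

# Toy counts; artist names A, B, C are placeholders.
tag_counts = {
    "A": {"jazz": 5, "rock": 6},
    "B": {"jazz": 9},
    "C": {"jazz": 7, "rock": 1},
}
similar = {"A": ["B", "C"]}

alone = best_tag("A", tag_counts, {})            # own counts only
with_similar = best_tag("A", tag_counts, similar)  # own counts plus neighbours
print(alone, with_similar)
```

In this toy data the artist's own counts narrowly favour one tag, while the counts of the similar artists shift the decision to the other, which is the kind of correction the abstract attributes to using artist similarity.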
Creating a Dead Poets Society: Extracting a social network of historical persons from the web
- In: Proceedings of the Sixth International Semantic Web Conference and the Second Asian Semantic Web Conference (ISWC + ASWC 2007), volume 4825 of LNCS, Busan, Korea, 2007
Cited by 1 (1 self)
We present a simple method to extract information from search engine snippets. Although the techniques presented are domain independent, this work focuses on extracting biographical information of historical persons from multiple unstructured sources on the Web. We first find a list of persons and their periods of life by querying the periods and scanning the retrieved snippets for person names. Subsequently, we find biographical information for the persons extracted. In order to gain insight into the mutual relations among the persons identified, we create a social network using co-occurrences on the Web. Although we use uncontrolled and unstructured Web sources, the information extracted is reliable. Moreover, we show that Web Information Extraction can be used to create both informative and enjoyable applications.
Generating Dynamic and Adaptive Knowledge Models for Web-Based Resources
- International Conference on Design Science, 2006
Cited by 1 (1 self)
The number of web pages on the World Wide Web continues to grow exponentially as more and more people, organizations, and businesses rely on the Internet to share and search information. These web-based resources are generally organized and presented in a hierarchical manner based on a certain categorization structure. We call the categorization structures used for organizing web-based resources their respective knowledge models. Such knowledge models are usually static: their creation and maintenance are normally done manually offline or in a human-mediated manner that makes it difficult and time-consuming to adapt them to the dynamically changing web-based resources. More importantly, while different users may have different perspectives and needs with respect to the existing knowledge model, they are restricted to the view provided by the current model. For example, in a university website, faculty members are usually listed by departments. A web user may instead want to view how faculty members can be grouped by their research interests. In this paper, we propose a user-centric approach that will automatically identify the web resources of interest and systematically generate and maintain knowledge models for the identified web resources while adapting to different user viewpoints. We apply such technologies
Documentum ECI Self-Repairing Wrappers: Performance Analysis
Cited by 1 (0 self)
Documentum Enterprise Content Integration (ECI) services is a content integration middleware that provides one-query access to the Intranet and Internet content resources. The ECI Adapter technology offers an interface to any application for data and metadata extraction from unstructured Web pages. It offers a unique framework of wrapper production, automatic recovery and maintenance, developed at Xerox Research Centre Europe and based on state-of-the-art algorithms from machine learning and grammatical inference. In this presentation we analyze the performance of ECI adapters deployed in current commercial installations. We benefit from accessing reports on daily tests for all ECI commercially deployed adapters collected from June 2003 to September 2005. Using the daily reports, we analyze different aspects of the wrapper technology.
Decomposition-based Optimization of Reload Strategies in the World Wide Web
Cited by 1 (1 self)
Web sites, Web pages and the data on pages are available only for specific periods of time and are deleted afterwards from a client’s point of view. An important task in order to retrieve information from the Web is to consider Web information in the course of time. Different strategies like push and pull strategies may be applied for this task. Since push services are usually not available, pull strategies have to be conceived in order to optimize the retrieved information with respect to the age of retrieved data and its completeness. In this article we present a new procedure to optimize retrieved data from Web pages by page decomposition. By deploying an automatic wrapper induction technique a page is decomposed into functional segments. Each segment is considered as an independent component for the analysis of the time behavior of the page. Based on this decomposition we present a new component-based download strategy. By applying this method to Web pages it is shown that for a fraction of Web data the freshness of retrieved data may be improved significantly compared to traditional methods.
unknown title
The data on the web is highly unstructured and is sometimes present without any HTML tags, so it becomes difficult to query these web sites and extract data from them. It is also difficult to merge data after collecting it from various websites, as it is in different formats and data types. A machine cannot understand unstructured data on its own; moreover, it needs both structure and content in order to extract data from the web. We need an algorithm that can generate structure from this unstructured data automatically, without any manual intervention.

