Results 11 - 20 of 37
Semi-supervised learning of semantic classes for query . . .
2009
Cited by 3 (1 self)
Understanding intents from search queries can improve a user’s search experience and boost a site’s advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of needed manual annotations, but much human effort is still required to maintain these as search topics evolve over time. This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance – even with far less training data. We focus our study on understanding
Exploring schema repositories with Schemr
In Proceedings of the SIGMOD Conference, Demonstration Program, 2009
Cited by 2 (1 self)
Schemr is a schema search engine, and provides users the ability to search for and visualize schemas stored in a metadata repository. Users may search by keywords and by example – using schema fragments as query terms. Schemr uses a novel search algorithm, based on a combination of text search and schema matching techniques, as well as a structurally-aware scoring metric. Schemr presents search results in a GUI that allows users to explore which elements match and how well they do. The GUI supports interactions, including panning, zooming, layout and drilling-in. We demonstrate schema search and visualization, introduce Schemr as a new component of the information integration toolbox, and discuss its benefits in several applications.
Mapping web pages to database records via link paths
In CIKM, 2010
Cited by 2 (2 self)
In this paper we propose a new knowledge management task which aims to map Web pages to their corresponding records in a structured database. For example, the DBLP database contains records for many computer scientists, and most of these persons have public Web pages; if we can map the database record to the appropriate Web page, then the new information could be used to further describe the person’s database record. To accomplish this goal we employ link paths, which contain anchor texts from multiple paths through the Web ending at the Web page in question. We hypothesize that the information from these link paths can be used to generate an accurate Web page to database record mapping. Experiments on two large, real-world data sets (DBLP and IMDB for the structured data; computer science faculty members’ Web pages and official movie homepages for the Web page data) show that our method does provide an accurate mapping. Finally, we conclude by issuing a call for further research on this promising new task.
Unexpected Results in Automatic List Extraction on the Web
Cited by 2 (2 self)
The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (i.e., lists, record sets, tables, etc.) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page’s DOM structure is not sufficient for the general list-finding task.
Understanding tables on the web
2010
Cited by 2 (2 self)
The Web contains a wealth of information, and a key challenge is to make this information machine processable. Because natural language understanding at web scale remains difficult and costly at present, in this paper we focus our attention on understanding well-structured HTML tables on the Web. From 0.3 billion Web documents, we obtain 1.95 billion tables, and 0.5-1% of these contain meaningful information about various entities and their properties. Our work focuses on detecting these tables, understanding their content, and using the obtained information and knowledge to support important applications such as search. Our starting point is a rich, general-purpose taxonomy whose content is harvested automatically from the Web and search log data. We use the taxonomy to help us interpret and understand tables. We then use the content we understand to enrich the taxonomy, which, in turn, enables us to understand more tables. We report large-scale experimental results that demonstrate the feasibility of this approach, and we build a semantic search engine over tables to demonstrate how structured data can empower information retrieval on the Web.
A First Tutorial on Dataspaces
2008
Cited by 1 (0 self)
Dataspace systems offer services on data without requiring upfront semantic integration. In sharp contrast with existing information-integration systems, dataspaces systems offer best-effort answers even before semantic mappings are
Mesa: A Search Engine for Querying Web Tables
Cited by 1 (0 self)
The volume of structured data on the Web has grown considerably in the recent past. In contrast to unstructured (textual) documents, which can be searched through simple keyword-based interfaces, the presence of structure enables rich queries to be posed against Web data. In this paper we present a search engine designed for querying structured information sources on the Web and show how our system can support on-the-fly, complex queries over content published in hundreds of HTML tables.
Redundancy-Driven Web Data Extraction and Integration
Cited by 1 (1 self)
A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.
Functional Dependency Generation and Applications in pay-as-you-go data integration systems
Recently, the opportunity of extracting structured data from the Web has been identified by a number of research projects. One such example is that millions of relational-style HTML tables can be extracted from the Web. Traditional data integration approaches do not scale over such corpora with hundreds of small tables in one domain. To solve this problem, previous work has proposed pay-as-you-go data integration systems to provide, with little up-front cost, base services over loosely integrated information. One key component of such systems, which has received little attention to date, is a framework to gauge and improve the quality of the integration. We propose a framework based on functional dependencies (FDs). Unlike in traditional database design, where FDs are specified as statements of truth about all possible instances of the database, in a web environment FDs are not specified over the data tables. Instead, we generate FDs by counting-based algorithms over many data sources, and extend the FDs with probabilities to capture the inherent uncertainties in them. Given these probabilistic FDs, we show how to solve two problems to improve data and schema quality in a pay-as-you-go system: (1) pinpointing dirty data sources and (2) normalizing large mediated schemas. We describe these techniques and evaluate them over real-world data sets extracted from the Web.
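The abstract does not spell out the counting-based algorithm, so the following is only a minimal sketch of one plausible reading: the probability of an FD is taken to be the fraction of sources, among those that contain both attributes, in which the dependency holds. The function names `fd_holds` and `fd_probability`, the list-of-dicts table representation, and the per-source counting rule are all assumptions made for illustration, not the paper's method.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in one table (a list of dicts)."""
    seen = {}
    for row in rows:
        key = row[lhs]
        # A violation is two rows agreeing on lhs but differing on rhs.
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True


def fd_probability(sources, lhs, rhs):
    """Assumed per-source counting: share of applicable sources where lhs -> rhs holds.

    A source is applicable if it is non-empty and its first row has both attributes.
    Returns None when no source is applicable.
    """
    applicable = [
        rows for rows in sources
        if rows and lhs in rows[0] and rhs in rows[0]
    ]
    if not applicable:
        return None
    holding = sum(1 for rows in applicable if fd_holds(rows, lhs, rhs))
    return holding / len(applicable)


# Hypothetical example data: three city/country tables and one unrelated table.
sources = [
    [{"city": "Paris", "country": "France"}, {"city": "Lyon", "country": "France"}],
    [{"city": "Paris", "country": "France"}, {"city": "Paris", "country": "USA"}],
    [{"city": "Rome", "country": "Italy"}],
    [{"name": "x"}],
]
probability = fd_probability(sources, "city", "country")
```

Under this reading, a source that violates an FD most other sources satisfy (the second table here) would be a candidate dirty source, which loosely corresponds to problem (1) in the abstract.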

