Results 11 - 14 of 14
Ontology-centric Source Selection for Meta-querier Customization
Abstract
With an increasing number of semi-structured data sources, meta-queriers are introduced to facilitate effective information retrieval from multiple data sources that are accessible through query forms. However, a one-size-fits-all meta-querier cannot cater for various individual needs. In meta-querier customization, source selection is arguably one of the most critical problems. This paper proposes a capability-based source selection method to meet user needs in terms of query capabilities. The major challenges include modeling, understanding, and matching the user needs and source capabilities. Our solution is based on a lightweight ontology, M-Ontology, which is generated from a number of verified mappings between heterogeneous query forms of the data sources. With the assistance of the concepts and relations in M-Ontology, user demands and source capabilities are modeled as concept sets, identified through query-form annotation, and matched by an additive utility function. The experiments on real-world data illustrate the potential of this ontology-centric method.
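The matching step can be illustrated with a minimal sketch. The abstract only states that concept sets are matched by an additive utility function; the per-concept weights, the function names, and the tie-breaking rule below are assumptions made for illustration, not the paper's definitions.

```python
def additive_utility(demand_weights, source_concepts):
    """Sum the weights of the demanded concepts that a source supports.

    demand_weights: dict mapping concept name -> importance weight (assumed).
    source_concepts: set of concept names the source's query form covers.
    """
    return sum(w for concept, w in demand_weights.items()
               if concept in source_concepts)


def rank_sources(demand_weights, sources):
    """Order source names by decreasing utility, breaking ties by name."""
    scored = [(additive_utility(demand_weights, concepts), name)
              for name, concepts in sources.items()]
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical user demand and source capabilities.
demand = {"title": 2.0, "author": 1.0, "price": 0.5}
sources = {
    "A": {"title", "price"},
    "B": {"title", "author"},
    "C": {"isbn"},
}
ranking = rank_sources(demand, sources)
```

Because the utility is additive, each concept contributes independently, so adding a concept to a source can never lower its score.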
Center for Automation Research, Institute for Advanced Computer Studies,
Abstract
Spatial applications often require the ability to perform similarity search over a collection of point sets. For example, given a geographical distribution of a disease outbreak, find k historical outbreaks with similar spatial distributions from a data collection D. In this paper, we study the problem of similarity search over a collection of point sets using the Hausdorff distance, which is a measure commonly used to determine the maximum discrepancy between two point sets. To avoid computing the Hausdorff distance for all point sets S in D, one may compute an optimistic estimate (i.e., lower bound value) of the actual Hausdorff distance HausDist(Q,S) for each S to rule out sets that are obviously dissimilar to Q. In our investigation, we observed that a commonly used method (called BscLB) to compute an estimate may not produce a result which is indicative of the actual Hausdorff distance. Consequently, we propose a method (called EnhLB) which produces a tighter estimate than the existing one. We then formulate a similarity search algorithm which uses a combination of BscLB and EnhLB to find similar point sets efficiently. In addition, we also extend our method to support an outlier-resistant variant of the Hausdorff distance called the modified Hausdorff distance. We compare our proposed algorithm with an algorithm using only BscLB. The results of our experiments show a reduction in computation time of 72 % for searches using the Hausdorff distance and a reduction of 53 % using the modified Hausdorff distance.
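The filter-and-refine idea in this abstract can be sketched as follows. The lower bound used here is a simple bounding-box bound chosen for illustration; it is not the paper's BscLB or EnhLB, whose definitions are not given in the abstract. All function names are hypothetical.

```python
import heapq
from math import dist


def hausdorff(a, b):
    """Exact Hausdorff distance between two non-empty point sets."""
    def directed(x, y):
        return max(min(dist(p, q) for q in y) for p in x)
    return max(directed(a, b), directed(b, a))


def bounding_box(points):
    """Axis-aligned bounding box as (lower corner, upper corner)."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))


def point_to_box(p, box):
    """Euclidean distance from a point to a box (0 if the point is inside)."""
    lo, hi = box
    return sum(max(l - c, 0.0, c - h) ** 2 for c, l, h in zip(p, lo, hi)) ** 0.5


def mbr_lower_bound(a, b):
    """Illustrative lower bound: no point of b lies closer to p than b's box does."""
    box_a, box_b = bounding_box(a), bounding_box(b)
    return max(max(point_to_box(p, box_b) for p in a),
               max(point_to_box(p, box_a) for p in b))


def k_most_similar(query, collection, k):
    """Return the k nearest sets as sorted (distance, index) pairs."""
    bounds = sorted((mbr_lower_bound(query, s), i)
                    for i, s in enumerate(collection))
    best = []  # max-heap emulated with negated distances
    for lb, i in bounds:
        if len(best) == k and lb >= -best[0][0]:
            break  # every remaining set is at least as far as the k-th best
        heapq.heappush(best, (-hausdorff(query, collection[i]), i))
        if len(best) > k:
            heapq.heappop(best)
    return sorted((-nd, i) for nd, i in best)


query = [(0, 0), (1, 0)]
collection = [[(0, 0), (1, 0)], [(10, 10), (11, 10)], [(0, 1), (1, 1)]]
nearest = k_most_similar(query, collection, 2)
```

Candidates are visited in order of increasing lower bound, so the loop can stop as soon as the next bound is no better than the current k-th best exact distance. A tighter bound prunes more sets before the exact computation, which is the motivation the abstract gives for EnhLB.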
Recovering Semantics of Tables on the Web
Abstract
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
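For reference, the simple majority scheme that the abstract uses as a baseline might look like the sketch below. This is not the paper's formal evidence model. The class-label database is represented as a plain dictionary, and the function name and `min_fraction` threshold are assumptions for illustration.

```python
from collections import Counter


def label_column(values, class_db, min_fraction=0.5):
    """Attach every label carried by more than min_fraction of the cell values.

    values: list of cell strings from one table column.
    class_db: dict mapping a lowercased value -> set of class labels (assumed shape).
    """
    if not values:
        return []
    counts = Counter(label
                     for v in values
                     for label in class_db.get(v.lower(), ()))
    return sorted(label for label, n in counts.items()
                  if n / len(values) > min_fraction)


# Hypothetical, deliberately noisy class-label database.
class_db = {
    "paris": {"city", "capital"},
    "berlin": {"city", "capital"},
    "lyon": {"city"},
    "apple": {"company", "fruit"},
}
labels = label_column(["Paris", "Berlin", "Lyon", "Apple"], class_db)
```

Here "city" covers three of the four values and is attached, while "capital" covers exactly half and is not. A fixed fraction like this treats all labels alike, which is one reason a model of evidence could do better on a noisy database.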
Answering Table Queries on the Web using Column Keywords
Abstract
We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper. Given a set of Web tables T1,..., Tn, and a query Q with q sets of keywords Q1,..., Qq, decide for each Ti if it is relevant to Q and if so, identify the mapping between the columns of Ti and query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism of exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a corpus of 25 million web tables show a significant boost in accuracy over baseline IR methods.
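The column-mapping subtask for a single table can be pictured as a small bipartite matching problem. The sketch below is a heavily simplified illustration, not the paper's graphical model: it scores only header text with Jaccard token overlap (an assumed similarity) and finds the best one-to-one mapping by brute force, which is practical only for small tables. All names are hypothetical.

```python
from itertools import permutations


def overlap(keywords, header):
    """Jaccard overlap between the tokens of a keyword set and a column header."""
    a, b = set(keywords.lower().split()), set(header.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def best_column_mapping(query_columns, headers):
    """Return (score, mapping), where mapping[i] is the header index for query column i.

    Returns (-1.0, ()) when the table has fewer columns than the query.
    """
    best = (-1.0, ())
    for perm in permutations(range(len(headers)), len(query_columns)):
        score = sum(overlap(q, headers[j])
                    for q, j in zip(query_columns, perm))
        if score > best[0]:
            best = (score, perm)
    return best


query_columns = ["country name", "population"]
headers = ["Population 2010", "Capital", "Country"]
score, mapping = best_column_mapping(query_columns, headers)
```

A table whose best score falls below some threshold could be treated as irrelevant to the query. Deciding relevance jointly across tables, as the abstract describes, requires the additional clues that this sketch leaves out.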

