Semistructured data
, 1997
Cited by 230 (0 self)
In semistructured data, the information that is normally associated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semistructured data there is no separate schema; in others it exists but only places loose constraints on the data. Semistructured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it as semistructured for the purposes of browsing. This tutorial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expressive language for querying and transformation, and optimization problems.
1 The motivation
The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the convergence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul’s recent survey paper PI. The slides for this tutorial will be made available from a section of the Penn database home page
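As a rough illustration of what "self-describing" means in the abstract above, the following Python sketch stores records as nested labeled values and recovers a loose schema from the data itself. The records, the function names `label_paths` and `discovered_schema`, and the dotted-path notation are all assumptions made for this example; they are not taken from the tutorial.

```python
# Hypothetical example data: two "person" records that share no fixed schema.
people = [
    {"name": "Alice", "email": "alice@example.org"},
    {"name": {"first": "Bob", "last": "Smith"}, "phone": ["555-0100", "555-0101"]},
]


def label_paths(value, prefix=()):
    """Yield every label path that occurs inside a nested dict/list value."""
    if isinstance(value, dict):
        for label, child in value.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(value, list):
        # List items share the label of the list that contains them.
        for child in value:
            yield from label_paths(child, prefix)


def discovered_schema(records):
    """Collect the distinct label paths seen across all records."""
    return sorted({".".join(p) for r in records for p in label_paths(r)})


print(discovered_schema(people))
```

The labels travel with the data, so the "schema" here is only a summary computed after the fact: `name` is a string in one record and a nested structure in the other, which a conventional fixed schema would not allow.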
Quality-driven Integration of Heterogeneous Information Systems
- In VLDB Conference
, 1999
Cited by 82 (15 self)
Integrated access to information that is spread over multiple, distributed, and heterogeneous sources is an important problem in many scientific and commercial domains. While much work has been done on query processing and choosing plans under cost criteria, very little is known about the important problem of incorporating the information quality aspect into query planning. In this paper we describe a framework for multidatabase query processing that fully includes the quality of information in many facets, such as completeness, timeliness, accuracy, etc. We seamlessly include information quality into a multidatabase query processor based on a view-rewriting mechanism. We model information quality at different levels to ultimately find a set of high-quality query answering plans.
Building light-weight wrappers for legacy web data-sources using w4f
- In Proc. of VLDB
, 1999
"... sahuguet@saul.cis.upenn.edu ..."
K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources
, 2000
Cited by 52 (4 self)
The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, "on-the-fly" integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear "winner". Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.
Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts
- Genome Informatics
, 1999
Cited by 48 (3 self)
See-Kiong Ng (1), Marie Wong (2); skng@krdl.org.sg, marie@bic.nus.edu.sg. (1) Kent Ridge Digital Labs, 21 Heng Mui Keng Terrace, Singapore 119613. (2) NUS Bioinformatics Centre, National University of Singapore, Singapore 119260.
Abstract: We are entering a new era of research where the latest scientific discoveries are often first reported online and are readily accessible by scientists worldwide. This rapid electronic dissemination of research breakthroughs has greatly accelerated the current pace in genomics and proteomics research. The race to the discovery of a gene or a drug has now become increasingly dependent on how quickly a scientist can scan through voluminous amounts of information available online to construct the relevant picture (such as protein-protein interaction pathways) as it takes shape amongst the rapidly expanding pool of globally accessible biological data (e.g. GENBANK) and scientific literature (e.g. MEDLINE). We describe a prototype system for automatic...
Knowledge-Based Integration of Neuroscience Data Sources
, 2000
Cited by 27 (11 self)
The need for information integration is paramount in many biological disciplines, because of the large heterogeneity in both the types of data involved and in the diversity of approaches (physiological, anatomical, biochemical, etc.) taken by biologists to study the same or correlated phenomena. However, the very heterogeneity makes the task of information integration very difficult since two approaches studying different aspects of the same phenomena may not even share common attributes in their schema description. This paper develops a wrapper-mediator architecture which extends the conventional data- and view-oriented information mediation approach by incorporating additional knowledge-modules that bridge the gap between the heterogeneous data sources. The semantic integration of the disparate local data sources employs F-logic as a data and knowledge representation and reasoning formalism. We show that the rich object-oriented modeling features of F-logic together with its declarative rule language and the uniform treatment of data and metadata (schema information) make it an ideal candidate for complex integration tasks. We substantiate this claim by elaborating on our integration architecture and illustrating the approach using real world examples from the neuroscience domain. The complete integration framework is currently under development; a first prototype establishing the viability of the approach is operational.
Web Ecology: Recycling HTML pages as XML documents using W4F
- In ACM SIGMOD Workshop on the Web and Databases (WebDB)
, 1999
Cited by 24 (2 self)
In this paper we present the World-Wide Web Wrapper Factory (W4F), a Java toolkit to generate wrappers for Web data sources. Some key features of W4F are an expressive language to extract information from HTML pages in a structured way, a mapping to export it as XML documents and some visual tools to assist the user during wrapper creation. Moreover, the entire description of wrappers is fully declarative. As an illustration, we demonstrate how to use W4F to create XML gateways that serve HTML pages transparently and on the fly as XML documents with their DTDs.
1 Introduction
The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases and this proportion keeps increasing. But Web data sources also consist of stand-alone HTML pages hand-coded by individuals, which provide very useful information such as reviews, digests, links, etc. As soon as we want to go beyond the basic m...
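To make the wrapper idea in this abstract concrete, here is a minimal Python sketch of the general pattern: extract structured pieces from an HTML page, then export them as XML. It is not W4F, which is a Java toolkit with its own declarative extraction language; the class `LinkExtractor`, the function `html_to_xml`, and the `<links>`/`<link>` element names are assumptions invented for this example, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_to_xml(html):
    """Extraction step followed by a fixed mapping to an XML document."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("links")
    for href, text in parser.links:
        link = ET.SubElement(root, "link", href=href)
        link.text = text
    return ET.tostring(root, encoding="unicode")


page = '<p>See <a href="a.html">First <b>page</b></a> and <a href="b.html">Second</a>.</p>'
print(html_to_xml(page))
```

In this sketch the extraction rules are hard-coded in the parser class. The abstract's point is that W4F instead states such rules and the XML mapping declaratively, so a wrapper can be described rather than programmed.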
Looking at the Web through XML glasses
, 1999
Cited by 23 (1 self)
The Web so far has been incredibly successful at delivering information to human users. So successful actually, that there is now an urgent need to go beyond a browsing human and make information accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data-models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, for which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generat...
Query Processing with Description Logic Ontologies Over Object-Wrapped Databases
- In Proc. of the 14th International Conference on Scientific and Statistical Database Management (SSDBM’02)
, 2001
Cited by 17 (8 self)
This paper presents an approach to answering queries over an ontology modelled using a description logic. The ontology acts as a global schema, providing a declarative description of the concepts of the domain, the instances of which are stored in (potentially many) object-wrapped sources. Queries are expressed using terms from the rich vocabulary of the ontology, and are translated into an equivalent calculus expression, which references only the objects available in the source databases. The query is then optimized on the basis of information from the ontology and the source databases. Distinctive features of the approach include: the use of the expressive ALCQI description logic, which supports both ontology definition and query expression; the adoption of a global-as-view approach to relating the ontology to the sources; and the use of the ontology to direct semantic optimization of queries phrased over specific sources. The approach is being developed in, and is illustrated using examples from, bioinformatics.
Optimized Seamless Integration of Biomolecular Data
- IEEE symposium on Bio-Informatics and Biomedical Engineering (BIBE’2001), Washington DC
, 2001
Cited by 13 (2 self)
Today, scientific data is inevitably digitized, stored in a variety of heterogeneous formats, and is accessible over the Internet. Scientists need to access an integrated view of multiple remote or local heterogeneous data sources. They then integrate the results of complex queries and apply further analysis and visualization to support the task of scientific discovery. Building a digital library for scientific discovery requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, as well as data that is locally materialized in warehouses or is generated by software. We consider several tasks to provide optimized and seamless integration of biomolecular data. Challenges to be addressed include capturing and representing source capabilities; developing a methodology to acquire and represent metadata about source contents and access costs; and decision support to select sources and capabilities using cost based and semantic knowledge, and generating low cost query evaluation plans.

