Service-Based Distributed Querying on the Grid
- IN PROC. OF THE 1ST INT. CONF. ON SERVICE ORIENTED COMPUTING
, 2003
Cited by 32 (21 self)
Service-based approaches (such as Web Services and the Open Grid Services Architecture) have gained considerable attention recently for supporting distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises the question as to how database management systems and technologies can best be deployed or adapted for use in such an environment. This paper explores one aspect of service-based computing and data management, viz., how to integrate query processing technology with a service-based Grid. The paper describes in detail the design and implementation of a service-based distributed query processor for the Grid. The query processor is service-based in two orthogonal senses: firstly, it supports querying over data storage and analysis resources that are made available as services, and, secondly, its internal architecture factors out as services the functionalities related to the construction of distributed query plans on the one hand, and to their execution over the Grid on the other. The resulting system both provides a declarative approach to service orchestration in the Grid, and demonstrates how query processing can benefit from dynamic access to computational resources on the Grid.
Integration of Biological Sources: Current Systems and Challenges Ahead
- Sigmod Record
, 2004
Cited by 27 (0 self)
This paper surveys the area of biological and genomic sources integration, which has recently become a major focus of the data integration research field. The challenges that an integration system for biological sources must face are due to several factors such as the variety and amount of data available, the representational heterogeneity of the data in the different sources, and the autonomy and differing capabilities of the sources.
Distributed Query Processing on the Grid
, 2002
Cited by 25 (14 self)
Distributed query processing (DQP) has been widely used in data intensive applications where data of relevance to users is stored in multiple locations. This paper argues: (i) that DQP can be important in the Grid, as a means of providing high-level, declarative languages for integrating data access and analysis
Learning Classifiers from Semantically Heterogeneous Data
- In: Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004), Agia
, 2004
Cited by 19 (14 self)
Semantically heterogeneous and distributed data sources are quite common in several application domains such as bioinformatics and security informatics. In such a setting, each data source has an associated ontology. Different users or applications need to be able to query such data sources for statistics of interest (e.g., statistics needed to learn a predictive model from data). Because no single ontology meets the needs of all applications or users in every context, or for that matter, even a single user in different contexts, there is a need for principled approaches to acquiring statistics from semantically heterogeneous data. In this paper, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. We show how these constraints can be used to derive mappings from source ontologies to the user ontology. We observe that most of the learning algorithms use only certain statistics computed from data in the process of generating the hypothesis that they output. We show how the ontology mappings can be used to answer statistical queries needed by algorithms for learning classifiers from data viewed from a certain user perspective. The resulting algorithms offer a powerful approach to data-driven knowledge acquisition over the Semantic Web.
Containment of relational queries with annotation propagation
- In Proceedings of the International Workshop on Database and Programming Languages (DBPL
, 2003
Cited by 17 (7 self)
We study the problem of determining whether a query is contained in another when queries can carry along annotations from source data. We say that a query is annotation-contained in another if the annotated output of the former is contained in the latter on every possible annotated input database. We study the relationship between query containment and annotation-containment and show that annotation-containment is a more refined notion in general. As a consequence, the usual equivalences used by a typical query optimizer may no longer hold when queries can carry along annotations from the source to the output. Despite this, we show that the same annotated result is obtained whether intermediate constructs of a query are evaluated with set or bag semantics. We also give a necessary and sufficient condition, via homomorphisms, that checks whether a query is annotation-contained in another. Even though our characterization suggests that annotation-containment is more complex than query containment, we show that the annotation-containment problem is NP-complete, thus putting it in the same complexity class as query containment. In addition, we show that the annotation placement problem, which was first shown to be NP-hard in [BKT02], is in fact DP-hard; the exact complexity of this problem remains open.
Towards a model of provenance and user views in scientific workflows
- In Data Integration in the Life Sciences
, 2006
Cited by 16 (4 self)
Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final results, is derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance. In this paper, we provide a formal model of provenance for scientific workflows which is general (i.e. can be used with existing workflow systems, such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries we encountered in a number of case studies. Interestingly, our model not only takes into account the chained and nested structure of scientific workflows, but also allows users to ask for provenance at different levels of abstraction (user views).
Biological Data Integration: Wrapping Data and Tools
, 2002
Cited by 13 (0 self)
Nowadays scientific data is inevitably digital and stored in a wide variety of formats in heterogeneous systems. Scientists need to access an integrated view of remote or local heterogeneous data sources with advanced data accessing, analyzing, and visualization tools. Building a digital library for scientific data requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, as well as data generated by software. We present an approach to wrapping web data sources, databases, flat files, or data generated by tools through a database view mechanism. Generally, a wrapper has two tasks: first, it sends a query to the source to retrieve data and, second, it builds the expected output with respect to the virtual structure. To perform these two tasks, our wrappers are composed of, respectively, a retrieval component based on an intermediate object view mechanism called search views, which maps the source capabilities to attributes, and an eXtensible Markup Language (XML) engine. The originality of the approach consists of: 1) a generic view mechanism to seamlessly access data sources with limited capabilities and 2) the ability to wrap data sources as well as the useful specific tools they may provide. Our approach has been developed and demonstrated as part of the multidatabase system supporting queries via uniform object protocol model (OPM) interfaces.
Selecting biomedical data sources according to user preferences
- In: ISMB/ECCB 2004
, 2004
Cited by 13 (9 self)
Selecting biomedical data sources according to user preferences
Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
- DATA INTEGRATION IN THE LIFE SCIENCES
, 2005
Cited by 9 (7 self)
We present INDUS (Intelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from autonomous, distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user or an application to view a collection of such data sources (regardless of location, internal structure and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We used the INDUS framework to design algorithms for learning probabilistic models (e.g., Naive Bayes models) for predicting GO functional classification of a protein based on training sequences that are distributed among SWISSPROT and MIPS data sources. Mappings such as EC2GO and MIPS2GO were used to resolve the semantic differences between these data sources when answering queries posed by the learning algorithms. Our results show that INDUS can be successfully used for integrative analysis of data from multiple sources needed for collaborative discovery in computational biology.
BioLingua: A Programmable Knowledge Environment for Biologists
, 2005
Cited by 9 (1 self)
BioLingua is an interactive, web-based programming environment that enables biologists to analyze biological systems by combining knowledge and data through direct end-user programming. BioLingua embeds a mature symbolic programming language in a frame-based knowledge environment, integrating genomic and pathway knowledge about a class of similar organisms. The BioLingua language provides interfaces to numerous state-of-the-art bioinformatic tools, making these available as an integrated package through the novel use of web-based programmability and an integrated Wiki-based community code and data store. The pilot instantiation of BioLingua, which has been developed in collaboration with several cyanobacteriologists, integrates knowledge about a subset of cyanobacteria with the Gene Ontology, KEGG, and BioCyc knowledge bases. We introduce the BioLingua concept, architecture, and language, and give several examples of its use in complex analyses. Extensive documentation is available online at http://nostoc.stanford.edu/Docs/index.html.

