Results 1 - 10 of 39
Daw: Duplicate-aware federated query processing over the web of data
In ISWC, 2013
Cited by 17 (10 self)
Abstract. Over the last years the Web of Data has developed into a large compendium of interlinked data sets from multiple domains. Due to the decentralised architecture of this compendium, several of these datasets contain duplicated data. Yet, so far, only little attention has been paid to the effect of duplicated data on federated querying. This work presents DAW, a novel duplicate-aware approach to federated querying over the Web of Data. DAW is based on a combination of min-wise independent permutations and compact data summaries. It can be directly combined with existing federated query engines in order to achieve the same query recall values while querying fewer data sources. We extend three well-known federated query processing engines – DARQ, SPLENDID, and FedX – with DAW and compare our extensions with the original approaches. The comparison shows that DAW can greatly reduce the number of queries sent to the endpoints, while keeping high query recall values. Therefore, it can significantly improve the performance of federated query processing engines. Moreover, DAW provides a source selection mechanism that maximises the query recall, when the query processing is limited to a subset of the sources.
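The abstract names min-wise independent permutations as the basis for detecting duplicated data across sources. The Python sketch below illustrates that general idea only, not DAW's actual algorithm or summary format: it approximates min-wise permutations with seeded SHA-1 hashes, estimates the Jaccard overlap of two sources from their signatures, and greedily skips a source that looks like a near-duplicate of one already selected. The function names, the 64-hash signature length, and the 0.9 threshold are all assumptions made for illustration.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature: per seeded hash, the minimum value over items.

    Assumes `items` is a non-empty iterable of strings (e.g. serialised triples).
    """
    items = list(items)
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def select_sources(signatures, threshold=0.9):
    """Greedily keep a source unless it nearly duplicates one already kept.

    `signatures` maps a (hypothetical) source name to its MinHash signature.
    """
    selected = []
    for name, sig in signatures.items():
        if all(estimated_jaccard(sig, signatures[kept]) < threshold
               for kept in selected):
            selected.append(name)
    return selected
```

With one signature per source, `select_sources` returns the sources worth querying. A source holding an exact copy of an earlier source's triples produces an identical signature and is skipped.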
SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data
Cited by 11 (0 self)
Abstract. The distributed and heterogeneous nature of Linked Open Data requires flexible and federated techniques for query evaluation. In order to evaluate current federation querying approaches a general methodology for conducting benchmarks is mandatory. In this paper, we present a classification methodology for federated SPARQL queries. This methodology can be used by developers of federated querying approaches to compose a set of test benchmarks that covers diverse characteristics of different queries and allows for comparability. We further develop a heuristic called SPLODGE for automatic generation of benchmark queries that is based on this methodology and takes into account the number of sources to be queried and several complexity parameters. We evaluate the adequacy of our methodology and the query generation strategy by applying them to the 2011 Billion Triple Challenge data set.
LHD: Optimising Linked Data Query Processing Using
Cited by 9 (0 self)
In the past few years a large volume of Linked Data has been published, and processing distributed SPARQL queries over the Linked Data cloud is becoming increasingly challenging. The high data traffic cost and response time significantly affect the performance of distributed SPARQL queries as the number of SPARQL endpoints and the volume of data at each endpoint increase. In this context, parallelisation is a promising way to fully exploit the potential of connections to SPARQL endpoints and thus improve the efficiency of querying Linked Data. We propose LHD, a distributed SPARQL engine that is built on a highly parallel infrastructure and is able to minimise query response time, and we evaluate its performance using a BSBM-based environment.
A Heuristic-Based Approach for Planning Federated SPARQL Queries
Cited by 9 (1 self)
Abstract. A large number of SPARQL endpoints are available to access the Linked Open Data cloud, but query capabilities still remain very limited. Thus, to support efficient semantic data management of federations of endpoints, existing SPARQL query engines need to be equipped with new functionalities. First, queries need to be decomposed into sub-queries that are not only answerable by the available endpoints, but also executable in a way that minimizes bandwidth usage. Second, query engines have to be able to gather the answers produced by the endpoints and merge them following a plan that reduces intermediate results. We address these problems and propose techniques that rely only on information about the predicates of the datasets accessible through the endpoints, to identify bushy plans composed of sub-queries that can be efficiently executed. These techniques have been implemented on top of one existing RDF engine, and their performance has been studied on the FedBench benchmark. Experimental results show that our approach may support successful evaluation of queries when other federated query engines fail, either because endpoints are unable to execute the sub-queries or because federated query plans are too expensive.
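The decomposition step described above, which uses only knowledge of the predicates each endpoint exposes, can be illustrated with a small Python sketch. This is not the paper's heuristic and it does not build bushy plans. It only shows one plausible grouping rule: triple patterns that exactly one endpoint can answer are merged into a single sub-query for that endpoint, and every other pattern stays a one-pattern sub-query. The function name `decompose` and the endpoint and predicate names used with it are hypothetical.

```python
def decompose(patterns, endpoint_predicates):
    """Split triple patterns into sub-queries using predicate knowledge only.

    patterns: list of (subject, predicate, object) tuples.
    endpoint_predicates: dict mapping endpoint name -> set of predicates it serves.
    Returns a list of (endpoints, patterns) pairs. A pattern that no endpoint
    can answer is returned with an empty endpoint list.
    """
    exclusive = {}  # endpoint -> patterns that only this endpoint can answer
    shared = []     # patterns answerable by zero or several endpoints
    for pattern in patterns:
        predicate = pattern[1]
        capable = sorted(
            ep for ep, preds in endpoint_predicates.items() if predicate in preds
        )
        if len(capable) == 1:
            exclusive.setdefault(capable[0], []).append(pattern)
        else:
            shared.append((capable, [pattern]))
    return [([ep], pats) for ep, pats in exclusive.items()] + shared
```

Merging the exclusive patterns means each endpoint receives one larger sub-query and can evaluate the joins locally, which is one way to reduce the intermediate results shipped back to the federation engine.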
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
Cited by 7 (5 self)
Abstract. Efficient federated query processing is of significant importance to tame the large amount of data available on the Web of Data. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches beyond triple pattern-wise source selection has not received much attention. This work presents HiBISCuS, a novel hypergraph-based source selection approach to federated SPARQL querying. Our approach can be directly combined with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We extend three well-known SPARQL query federation engines with HiBISCuS and compare our extensions with the original approaches on FedBench. Our evaluation shows that HiBISCuS can efficiently reduce the total number of sources selected without losing recall. Moreover, our approach significantly reduces the execution time of the selected engines on most of the benchmark queries.
TopFed: TCGA Tailored Federated Query Processing and Linking to LOD
Cited by 4 (4 self)
Background: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such large datasets is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis. Methods: We address these issues by transforming the TCGA data into the Semantic Web
A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems
2009
Cited by 4 (4 self)
The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of interlinked and distributed datasets from multiple domains. Running complex queries on this compendium often requires accessing data from different endpoints within one query. The abundance of datasets and the need for running complex queries has thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. However, the granularity of previous evaluations of such systems has not allowed deriving insights concerning their behavior in the different steps involved in federated query processing. In this work, we perform extensive experiments to compare state-of-the-art SPARQL endpoint federation systems using the comprehensive performance evaluation framework FedBench. In addition to considering the traditional query runtime as an evaluation criterion, we extend the scope of our performance evaluation by considering criteria that have received little attention in previous studies. In particular, we consider the number of sources selected, the total number of SPARQL ASK requests used, the completeness of answers, and the source selection time. We show that these criteria have a significant impact on the overall query runtime of existing systems. Moreover, we extend FedBench to mirror a highly distributed data environment and assess the behavior of existing systems using the same performance criteria. As a result, we provide a detailed analysis of the experimental outcomes that reveals novel insights for improving current and future SPARQL federation systems.
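Two of the criteria listed above are simple to state as formulas. The Python sketch below shows one way they could be computed, under assumptions the abstract does not spell out: completeness is taken as the fraction of expected answers that were returned, and the number of sources selected is summed over the triple patterns of a query. Both function names are hypothetical.

```python
def completeness(returned, expected):
    """Fraction of the expected answers that the engine actually returned."""
    expected = set(expected)
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    return len(set(returned) & expected) / len(expected)


def total_sources_selected(selection):
    """Sum, over all triple patterns, of the number of sources chosen for each.

    selection: dict mapping a triple-pattern label -> set of selected sources.
    """
    return sum(len(sources) for sources in selection.values())
```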
Fedra: Query Processing for SPARQL Federations with Divergence
Cited by 3 (3 self)
Abstract. Data replication and deployment of local SPARQL endpoints improve scalability and availability of public SPARQL endpoints, making the consumption of Linked Data a reality. This solution requires synchronization and specific query processing strategies to take advantage of replication. However, existing replication-aware techniques in federations of SPARQL endpoints do not consider data dynamicity. We propose Fedra, an approach for querying federations of endpoints that benefits from replication. Participants in Fedra federations can copy fragments of data from several datasets, and describe them using provenance and views. These descriptions enable Fedra to reduce the number of selected endpoints while satisfying user divergence requirements. Experiments on real-world datasets suggest savings of up to three orders of magnitude.
A Hybrid Approach to Linked Data Query Processing with Time Constraints
Cited by 2 (1 self)
In addition to RDF data within documents published according to the Linked Data principles, SPARQL endpoints are also a potential source of a great deal of Linked Data. The execution of queries using languages such as SPARQL can utilise both of these types of data sources. In this paper we present a hybrid approach to answering SPARQL queries that makes use of both link traversal-based and distributed query processing-based approaches in order to combine query answering over the Web of Linked Data and SPARQL endpoints respectively. The technique differs from existing work in that link traversal and endpoint queries take place in parallel without a static query plan. It is demonstrated how, using a set of heuristics and optimisation techniques, this can be effective when answering queries with time constraints (incomplete answers are acceptable in order to minimise execution time). An evaluation of the technique is presented using the FedBench Linked Data queries with query execution time limited to 10 seconds, with an analysis of the answers that can be provided within this time limit.
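Running several kinds of source in parallel under a hard time limit, and accepting an incomplete answer set, can be sketched in Python with the standard `concurrent.futures` module. This is an illustration of the time-constraint behaviour only, not the paper's heuristics or optimisations. Each source is assumed to be a plain callable that returns a set of answers, and the function name `answer_within` is hypothetical.

```python
import concurrent.futures


def answer_within(sources, time_limit):
    """Run all source callables in parallel; return answers ready by the deadline.

    sources: list of zero-argument callables, each returning an iterable of answers
             (for example, one wrapping link traversal and one wrapping an
             endpoint query; both are assumptions of this sketch).
    time_limit: deadline in seconds. Sources that finish later are ignored,
                so the result may be incomplete.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = [executor.submit(source) for source in sources]
    done, _ = concurrent.futures.wait(futures, timeout=time_limit)
    # Do not block on stragglers; their threads finish in the background.
    executor.shutdown(wait=False)
    answers = set()
    for future in done:
        if future.exception() is None:  # skip sources that failed
            answers.update(future.result())
    return answers
```

A call such as `answer_within([traverse_links, query_endpoint], 10)` would mirror the 10-second limit used in the evaluation, where `traverse_links` and `query_endpoint` are placeholder callables.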
PAnG: Finding Patterns in Annotation Graphs
Cited by 2 (1 self)
Annotation graph datasets are a natural representation of scientific knowledge. They are common in the life sciences and health sciences, where concepts such as genes, proteins or clinical trials are annotated with controlled vocabulary terms from ontologies. We present a tool, PAnG (Patterns in Annotation Graphs), that is based on a complementary methodology of graph summarization and dense subgraphs. The elements of a graph summary correspond to a pattern and its visualization can provide an explanation of the underlying knowledge. Scientists can use PAnG to develop hypotheses and for exploration.