Results 1 - 9 of 9
From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System
Cited by 4 (0 self)
Big data analytics often requires processing complex queries using massive parallelism, where the main performance metric is the communication cost incurred during data reshuffling. In this paper, we describe a system that can efficiently compute complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.
Changing the face of database cloud services with personalized service level agreements.
In CIDR, 2015
Cited by 3 (1 self)
We develop and evaluate an approach for generating Personalized Service Level Agreements (PSLAs) that separate cloud users from the details of compute resources behind a cloud database management service. PSLAs retain the possibility to trade off performance for cost and do so in a manner specific to the user's database.
GYM: A Multiround Join Algorithm In MapReduce And Its Analysis
Cited by 2 (0 self)
We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the query in $O(d + \log(n))$ rounds and $O(n(\mathrm{IN}^w + \mathrm{OUT})^2 / M)$ communication cost, where M is the memory available per machine in the cluster and IN and OUT are the sizes of input and output of the query, respectively. M is assumed to be $\mathrm{IN}^{1/\epsilon}$, for some constant $\epsilon > 1$. Using GYM we achieve two main results: (1) Every width-w query can be computed in $O(n)$ rounds of MapReduce with $O(n(\mathrm{IN}^w + \mathrm{OUT})^2 / M)$ ...
A Demonstration of the BigDAWG Polystore System
Cited by 1 (0 self)
This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all", we build on top of a variety of storage engines, each designed for a specialized use case. To illustrate the promise of this approach, we demonstrate its effectiveness on a hospital application using data from an intensive care unit (ICU). This complex application serves the needs of doctors and researchers and provides real-time support for streams of patient data. It showcases novel approaches for querying across multiple storage engines, data visualization, and scalable real-time analytics.
Let's Rethink Join Optimization in Distributed Systems
Distributed shared-nothing systems that process large-scale data have seen unprecedented developments over the last decade. The specific problem we consider is the evaluation of a conjunctive join query Q of m relations R1, ..., Rm on a cluster of p distributed machines. Let IN and OUT be the sizes of the input tables and the output of Q, respectively. Let MAX-OUT be the maximum possible output of Q under all instances of the input tables. We characterize the performance of distributed algorithms with two parameters: (1) the number of rounds of communication required between machines; and (2) the network IO cost of the algorithm, which consists of the total communication incurred between the machines, including writing the results to a distributed file system or database.
Distributed Data Deduplication
Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
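The blocking idea described in this abstract can be illustrated with a minimal single-machine sketch. This is not the paper's Dis-Dedup distribution strategy; the function name `candidate_pairs`, the sample records, and the choice of zip code as blocking key are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(tuples, block_key):
    """Group tuples by a blocking key and yield only within-block pairs."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[block_key(t)].append(t)
    for members in blocks.values():
        # Still quadratic, but in the block size rather than the relation size.
        yield from combinations(members, 2)


# Hypothetical sample relation: (name, zip code).
records = [
    ("Ann Lee", "98105"),
    ("Anne Lee", "98105"),
    ("Bob Roy", "10001"),
    ("Robert Roy", "10001"),
    ("Cy Young", "60601"),
]

# Hypothetical blocking function: tuples sharing a zip code form one block.
pairs = list(candidate_pairs(records, block_key=lambda r: r[1]))
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "of", all_pairs, "pairs compared")
```

With this sample data only 2 of the 10 possible pairs are compared. A skewed block size distribution would leave one large block dominating the cost, which is the kind of imbalance a distribution strategy across workers has to handle.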
The Myria Big Data Management and Analytics System and Cloud Service
In this paper, we present an overview of the Myria stack for big data management and analytics that we developed in the database group at the University of Washington and that we have been operating as a cloud service aimed at domain scientists around the UW campus. We highlight Myria's key design choices and innovations and report on our experience with using Myria for various data science use cases.
Towards an Analytics Query Engine
This vision paper presents new challenges and opportunities in the area of distributed data analytics, at the core of which are data mining and machine learning. At first, we provide an overview of the current state of the art in the area and then analyse two aspects of data analytics systems, semantics and optimization. We argue that these aspects will emerge as important issues for the data management community in the next years and propose promising research directions for solving them.
GYM: A Multiround Join Algorithm In MapReduce
We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the query in $O(d)$ rounds and $O(n(\mathrm{IN}^w + \mathrm{OUT}))$ communication and computation cost. Using GYM we achieve two main results: (1) Every width-w query can be computed in $O(n)$ rounds of MapReduce with $O(n(\mathrm{IN}^w + \mathrm{OUT}))$ cost; (2) Every width-w query can be computed in $O(\log(n))$ rounds of MapReduce with $O(n(\mathrm{IN}^{3w} + \mathrm{OUT}))$ cost. We achieve our second result by showing how to construct a $O(\log(n))$-depth and width-3w GHD of a query of width w. We describe another general technique to construct even shorter depth GHDs with longer widths, effectively showing a spectrum of tradeoffs one can make between communication and computation and the number of rounds of MapReduce. By simulating MapReduce in the PRAM model, our second main result also implies the result of Gottlob et al. [12] that computing acyclic and constant-width queries are in NC. In fact, for certain queries, our approach yields significantly fewer PRAM steps than does the construction of the latter paper. However, we achieve our results using only Yannakakis’s algorithm, which has been perceived to have a sequential nature. Instead, we surprisingly show that Yannakakis’s algorithm can be parallelized significantly by giving it as input short-depth GHDs of queries.
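For readers unfamiliar with Yannakakis's algorithm, the sequential version can be sketched on a small acyclic query. The sketch below is not GYM and involves no MapReduce rounds or GHDs; it assumes a three-relation path query R(a,b), S(b,c), T(c,d), and the helper names `semijoin`, `join`, and `chain_join` as well as the sample data are hypothetical.

```python
def semijoin(left, right, li, ri):
    """Keep tuples of `left` whose column li matches column ri of some tuple in `right`."""
    keys = {r[ri] for r in right}
    return [l for l in left if l[li] in keys]


def join(left, right, li, ri):
    """Hash join on left[li] == right[ri]; the join column of `right` is dropped."""
    index = {}
    for r in right:
        index.setdefault(r[ri], []).append(r)
    return [l + r[:ri] + r[ri + 1:] for l in left for r in index.get(l[li], [])]


def chain_join(R, S, T):
    """Evaluate R(a,b) JOIN S(b,c) JOIN T(c,d) along the join tree R - S - T."""
    # Bottom-up semijoin pass, from the leaf T toward the root R.
    S1 = semijoin(S, T, 1, 0)
    R1 = semijoin(R, S1, 1, 0)
    # Top-down semijoin pass, from the root R back toward the leaf T.
    S2 = semijoin(S1, R1, 0, 1)
    T2 = semijoin(T, S2, 0, 1)
    # After both passes no dangling tuples remain, so the joins do no wasted work.
    RS = join(R1, S2, 1, 0)      # tuples (a, b, c)
    return join(RS, T2, 2, 0)    # tuples (a, b, c, d)


# Hypothetical sample relations.
R = [(1, 10), (2, 20), (3, 30)]
S = [(10, 100), (20, 200), (40, 400)]
T = [(100, "x"), (300, "y")]

result = chain_join(R, S, T)
print(result)
```

The two semijoin passes are what bound the work by input plus output size: every tuple that survives them appears in at least one output tuple.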