Results 1–9 of 9
From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System
"... Big data analytics often requires processing complex queries using massive parallelism, where the main performance metrics is the communication cost incurred during data reshuffling. In this paper, we describe a system that can compute efficiently complex join queries, including queries with cyclic ..."
Abstract

Cited by 4 (0 self)
Big data analytics often requires processing complex queries using massive parallelism, where the main performance metric is the communication cost incurred during data reshuffling. In this paper, we describe a system that can efficiently compute complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.
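The worst-case optimal sequential evaluation this abstract refers to can be illustrated on the classic triangle query. The sketch below is a minimal generic-join-style enumeration for Q(a, b, c) = R(a, b) ⋈ S(b, c) ⋈ T(a, c); the relation names, the variable order (a, b, c), and the set-intersection indexing are illustrative assumptions, not the paper's implementation.

```python
def triangle_join(R, S, T):
    """Enumerate triangles by binding one variable at a time, intersecting
    the candidate values from every relation that constrains it."""
    # Index each input relation by its first attribute for fast lookup.
    R_by_a, S_by_b, T_by_a = {}, {}, {}
    for a, b in R:
        R_by_a.setdefault(a, set()).add(b)
    for b, c in S:
        S_by_b.setdefault(b, set()).add(c)
    for a, c in T:
        T_by_a.setdefault(a, set()).add(c)

    out = []
    # Bind a: it must appear in both R and T.
    for a in R_by_a.keys() & T_by_a.keys():
        # Bind b: it must extend (a, .) in R and appear in S.
        for b in R_by_a[a] & S_by_b.keys():
            # Bind c: intersect the candidates from S and T.
            for c in S_by_b[b] & T_by_a[a]:
                out.append((a, b, c))
    return out

R = [(1, 2), (1, 3)]
S = [(2, 3), (3, 1)]
T = [(1, 3)]
print(triangle_join(R, S, T))  # [(1, 2, 3)]
```

Binding variables rather than joining relation-by-relation is what keeps the running time within the worst-case output bound on cyclic queries such as this one.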
Changing the face of database cloud services with personalized service level agreements.
In CIDR, 2015
"... ABSTRACT We develop and evaluate an approach for generating Personalized Service Level Agreements (PSLAs) that separate cloud users from the details of compute resources behind a cloud database management service. PSLAs retain the possibility to tradeoff performance for cost and do so in a manner ..."
Abstract

Cited by 3 (1 self)
We develop and evaluate an approach for generating Personalized Service Level Agreements (PSLAs) that separate cloud users from the details of compute resources behind a cloud database management service. PSLAs retain the possibility to trade off performance for cost, and do so in a manner specific to the user's database.
GYM: A Multiround Join Algorithm In MapReduce And Its Analysis
"... We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the ..."
Abstract

Cited by 2 (0 self)
We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the query in O(d + log(n)) rounds and O(n(IN^w + OUT)^2 / M) communication cost, where M is the memory available per machine in the cluster and IN and OUT are the sizes of the input and output of the query, respectively. M is assumed to be IN^{1/ε}, for some constant ε > 1. Using GYM we achieve two main results: (1) Every width-w query can be computed in O(n) rounds of MapReduce with O(n(IN^w + OUT)^2 / M ...
A Demonstration of the BigDAWG Polystore System
"... ABSTRACT This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for largescale analytics, but also for realtime streaming support, smaller analytics at interactive speeds, data visualization, and crosss ..."
Abstract

Cited by 1 (0 self)
This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all", we build on top of a variety of storage engines, each designed for a specialized use case. To illustrate the promise of this approach, we demonstrate its effectiveness on a hospital application using data from an intensive care unit (ICU). This complex application serves the needs of doctors and researchers and provides real-time support for streams of patient data. It showcases novel approaches for querying across multiple storage engines, data visualization, and scalable real-time analytics.
Let's Rethink Join Optimization in Distributed Systems
"... ABSTRACT Distributed sharednothing systems that process largescale data has seen unprecedented developments over the last decade. The specific problem we consider is the evaluation a conjunctive join query Q of m relations R1, ..., Rm on a cluster of p distributed machines. Let IN and OUT be th ..."
Abstract
Distributed shared-nothing systems that process large-scale data have seen unprecedented developments over the last decade. The specific problem we consider is the evaluation of a conjunctive join query Q of m relations R1, ..., Rm on a cluster of p distributed machines. Let IN and OUT be the sizes of the input tables and of the output of Q, respectively. Let MAXOUT be the maximum possible output of Q over all instances of the input tables. We characterize the performance of distributed algorithms with two parameters: (1) the number of rounds of communication required between machines; and (2) the network I/O cost of the algorithm, which consists of the total communication incurred between the machines, including writing the results to a distributed file system or database.
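One-round distributed join algorithms in this setting commonly shuffle tuples with a HyperCube (Shares) scheme: the p machines are arranged in a grid with one dimension per join variable, and each tuple is replicated to every machine whose coordinates agree with the tuple's hashed attributes. A minimal sketch for the triangle query R(a, b) ⋈ S(b, c) ⋈ T(a, c); the relation names, the grid shape, and the use of Python's built-in hash are illustrative assumptions, not the paper's algorithm.

```python
def hypercube_destinations(rel, tup, shares):
    """Return the grid coordinates of every machine that must receive
    `tup` of relation `rel` under a (pa x pb x pc) HyperCube shuffle."""
    pa, pb, pc = shares
    h = lambda v, m: hash(v) % m
    if rel == "R":   # R(a, b): replicate along the unconstrained c dimension
        a, b = tup
        return [(h(a, pa), h(b, pb), z) for z in range(pc)]
    if rel == "S":   # S(b, c): replicate along the a dimension
        b, c = tup
        return [(x, h(b, pb), h(c, pc)) for x in range(pa)]
    if rel == "T":   # T(a, c): replicate along the b dimension
        a, c = tup
        return [(h(a, pa), y, h(c, pc)) for y in range(pb)]
    raise ValueError(rel)
```

The key invariant is that any potential output tuple (a, b, c) has all three of its contributing input tuples meet on the single machine (h(a), h(b), h(c)), so each machine can run a purely local join after one round of communication; the replication along the missing dimension is what the network I/O cost accounting above must charge for.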
Distributed Data Deduplication
"... ABSTRACT Data deduplication refers to the process of identifying tuples in a relation that refer to the same real world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid co ..."
Abstract
Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic in the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called DisDedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy through extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
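The blocking idea the abstract builds on can be shown in a few lines: group records by a cheap key, then pay the quadratic similarity cost only inside each block. The blocking key (lowercased first character) and the similarity measure (difflib's SequenceMatcher ratio) below are illustrative stand-ins, not the paper's DisDedup strategy.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def dedup_pairs(records, threshold=0.8):
    """Return candidate duplicate pairs found within blocks."""
    # Block on a cheap key: the lowercased first character.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[0].lower()].append(rec)
    matches = []
    for block in blocks.values():
        # Quadratic pairwise comparison, but only inside each block.
        for x, y in combinations(block, 2):
            if SequenceMatcher(None, x, y).ratio() >= threshold:
                matches.append((x, y))
    return matches

names = ["Alice Smith", "alice smith", "Bob Jones", "Robert Jones"]
print(dedup_pairs(names))  # [('Alice Smith', 'alice smith')]
```

Note the recall tradeoff inherent to blocking: "Bob Jones" and "Robert Jones" land in different blocks and are never compared. The distribution problem the paper addresses is how to spread blocks of very different sizes across workers so no single node inherits one huge block's quadratic cost.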
The Myria Big Data Management and Analytics System and Cloud Service
"... ABSTRACT In this paper, we present an overview of the Myria stack for big data management and analytics that we developed in the database group at the University of Washington and that we have been operating as a cloud service aimed at domain scientists around the UW campus. We highlight Myria&apos ..."
Abstract
In this paper, we present an overview of the Myria stack for big data management and analytics that we developed in the database group at the University of Washington and that we have been operating as a cloud service aimed at domain scientists around the UW campus. We highlight Myria's key design choices and innovations and report on our experience with using Myria for various data science use cases.
Towards an Analytics Query Engine
"... This vision paper presents new challenges and opportunities in the area of distributed data analytics, at the core of which are data mining and machine learning. At rst, we provide an overview of the current state of the art in the area and then analyse two aspects of data analytics systems, seman ..."
Abstract
This vision paper presents new challenges and opportunities in the area of distributed data analytics, at the core of which are data mining and machine learning. First, we provide an overview of the current state of the art in the area, and then analyse two aspects of data analytics systems: semantics and optimization. We argue that these aspects will emerge as important issues for the data management community in the coming years, and we propose promising research directions for solving them.
GYM: A Multiround Join Algorithm In MapReduce
"... We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the ..."
Abstract
We study the problem of computing the join of n relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis’s algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width w and depth d, and computes the query in O(d) rounds and O(n(IN^w + OUT)) communication and computation cost. Using GYM we achieve two main results: (1) Every width-w query can be computed in O(n) rounds of MapReduce with O(n(IN^w + OUT)) cost; (2) Every width-w query can be computed in O(log(n)) rounds of MapReduce with O(n(IN^{3w} + OUT)) cost. We achieve our second result by showing how to construct an O(log(n))-depth, width-3w GHD of a query of width w. We describe another general technique to construct even shorter-depth GHDs with larger widths, effectively showing a spectrum of tradeoffs one can make between communication and computation and the number of rounds of MapReduce. By simulating MapReduce in the PRAM model, our second main result also implies the result of Gottlob et al. [12] that computing acyclic and constant-width queries is in NC. In fact, for certain queries, our approach yields significantly fewer PRAM steps than the construction of the latter paper. However, we achieve our results using only Yannakakis’s algorithm, which has been perceived to have a sequential nature. Instead, we show, surprisingly, that Yannakakis’s algorithm can be parallelized significantly by giving it short-depth GHDs of queries as input.
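The sequential algorithm that GYM generalizes can be sketched on the simplest acyclic case, a path query R1(a, b) ⋈ R2(b, c) ⋈ R3(c, d): a bottom-up semi-join pass removes dangling tuples, a top-down pass finishes the reduction, and the final join then never produces intermediate results larger than the output. The relations and the left-deep join tree below are illustrative assumptions; GYM distributes these passes over arbitrary GHDs.

```python
def semijoin(R, S, on):
    """Keep the tuples of R whose join attribute appears in S.
    `on` = (attribute index in R, attribute index in S)."""
    i, j = on
    keys = {s[j] for s in S}
    return [r for r in R if r[i] in keys]

def yannakakis_path(R1, R2, R3):
    # Bottom-up pass: reduce each parent by its child.
    R2 = semijoin(R2, R3, on=(1, 0))
    R1 = semijoin(R1, R2, on=(1, 0))
    # Top-down pass: reduce each child by its (reduced) parent.
    R2 = semijoin(R2, R1, on=(0, 1))
    R3 = semijoin(R3, R2, on=(0, 1))
    # Joining the fully reduced relations yields no dangling tuples.
    return [(a, b, c, d)
            for (a, b) in R1
            for (b2, c) in R2 if b == b2
            for (c2, d) in R3 if c == c2]

R1 = [(1, 2), (4, 5)]
R2 = [(2, 3), (5, 9)]
R3 = [(3, 7)]
print(yannakakis_path(R1, R2, R3))  # [(1, 2, 3, 7)]
```

The semi-join passes are what the dangling tuples (4, 5) and (5, 9) never survive; the "sequential nature" the abstract mentions is the chain of passes down the tree, which is why feeding the algorithm a short-depth GHD directly shortens the round count.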