Results 1 - 10 of 70
Starfish: A Self-tuning System for Big Data Analytics
- In CIDR, 2011
"... Timely and cost-effective analytics over “Big Data ” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, ..."
Abstract - Cited by 79 (6 self)
Timely and cost-effective analytics over “Big Data” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces—is a popular choice for big data analytics. Most practitioners of big data analytics—like computational scientists, systems researchers, and business analysts—lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.
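The “tuning knobs” the abstract refers to are ordinary Hadoop configuration parameters that Starfish chooses automatically. As a point of reference, here is a minimal sketch of what manual tuning looks like; the parameter keys are real Hadoop 2.x names, but the values are illustrative only, not recommendations from the paper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ManualTuning {
    // A handful of the many knobs a self-tuning system would set for you.
    public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.9f);  // buffer spill threshold
        conf.setInt("mapreduce.job.reduces", 32);                 // number of reduce tasks
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress map output
        return Job.getInstance(conf, "hand-tuned job");
    }
}
```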
Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent
- In KDD, 2011
"... We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel “stratified ” variant of SGD, we ..."
Abstract - Cited by 73 (7 self)
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel “stratified” variant of SGD, we obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations and has good scalability properties.
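To make the approach concrete, the following is a minimal single-machine sketch of the SGD update that DSGD distributes. The key observation behind stratification is that entries in disjoint row and column blocks touch disjoint factor rows, so d workers can apply this update concurrently without conflicts. The names and the blocking comment are illustrative, not the paper's exact formulation:

```java
public class SgdMfSketch {
    // Rank-r factors: W is m x r, H is n x r; V(i,j) is approximated by dot(W[i], H[j]).
    static void sgdStep(double[][] W, double[][] H,
                        int i, int j, double vij,
                        double lr, double lambda) {
        double pred = 0;
        for (int k = 0; k < W[i].length; k++) pred += W[i][k] * H[j][k];
        double err = vij - pred;
        for (int k = 0; k < W[i].length; k++) {
            double wi = W[i][k], hj = H[j][k];
            W[i][k] += lr * (err * hj - lambda * wi); // gradient step on row factor
            H[j][k] += lr * (err * wi - lambda * hj); // gradient step on column factor
        }
    }
    // Stratification (illustrative): tile V into d x d blocks; a "stratum" is a
    // set of d blocks sharing no rows or columns (a permutation of block columns),
    // so each of d workers can run sgdStep on its own block in parallel.
}
```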
Efficient processing of k nearest neighbor joins using MapReduce
, 2012
"... k nearest neighbor join (kNN join), designed to find k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining ap-plications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensiv ..."
Abstract - Cited by 31 (2 self)
k nearest neighbor join (kNN join), designed to find k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join on a centralized machine efficiently. In this paper, we investigate how to perform kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust, and scalable.
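The reducer-side work described above amounts to a local kNN join within each group. A brute-force per-group sketch follows; the paper's actual contribution lies in the pruning-aware mapping that keeps these groups small and few, and all names here are illustrative:

```java
import java.util.*;

public class GroupKnnSketch {
    // For each r in this group's R-partition, find its k nearest neighbors
    // among the group's S-partition by exhaustive distance comparison.
    static Map<double[], List<double[]>> knnJoin(
            List<double[]> rGroup, List<double[]> sGroup, int k) {
        Map<double[], List<double[]>> result = new HashMap<>();
        for (double[] r : rGroup) {
            List<double[]> s = new ArrayList<>(sGroup);
            s.sort(Comparator.comparingDouble(p -> dist(r, p)));
            result.put(r, new ArrayList<>(s.subList(0, Math.min(k, s.size()))));
        }
        return result;
    }

    static double dist(double[] a, double[] b) { // Euclidean distance
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(d);
    }
}
```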
Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds
"... Abstract—Running MapReduce programs in the public cloud introduces the important problem: how to optimize resource provisioning to minimize the financial charge for a specific job? In this paper, we study the whole process of MapReduce processing and build up a cost function that explicitly models t ..."
Abstract - Cited by 23 (1 self)
Running MapReduce programs in the public cloud raises an important problem: how should resource provisioning be optimized to minimize the financial charge for a specific job? In this paper, we study the whole process of MapReduce processing and build a cost function that explicitly models the relationship between the amount of input data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce job. The model parameters can be learned from test runs with a small number of nodes. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a time deadline, or minimizes the running time under a financial budget. Experimental results show that this cost model performs well on the tested MapReduce programs.
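The abstract does not give the cost function's exact form, so the sketch below uses an assumed shape (time linear in per-slot load, plus a fixed overhead) purely to illustrate the decision problem: choose the slot counts that minimize dollar cost subject to a deadline. All coefficients would come from fitting against small test runs, as the paper describes:

```java
public class ProvisioningSketch {
    // Assumed (illustrative) model: running time with m map slots and r reduce
    // slots on input of size D; a, b, c are fitted constants, f() captures the
    // Reduce function's complexity.
    static double jobTime(double D, int m, int r, double a, double b, double c) {
        return a * D / m + b * f(D) / r + c;
    }
    static double f(double D) { return D * Math.log(D); } // assumed complexity

    // Minimize money = pricePerSlotHour * (m + r) * time, under a deadline.
    static int[] cheapestUnderDeadline(double D, double deadline, double price,
                                       double a, double b, double c) {
        double best = Double.MAX_VALUE;
        int[] bestMR = null;
        for (int m = 1; m <= 512; m++) {
            for (int r = 1; r <= 512; r++) {
                double t = jobTime(D, m, r, a, b, c);
                if (t > deadline) continue;           // infeasible: misses deadline
                double money = price * (m + r) * t;
                if (money < best) { best = money; bestMR = new int[]{m, r}; }
            }
        }
        return bestMR; // null if no allocation meets the deadline
    }
}
```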
Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
"... MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computatio ..."
Abstract - Cited by 22 (0 self)
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains, including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. We introduce, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. We focus on the optimization opportunities presented by the large space of configuration parameters for these programs. We also introduce a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. All components have been prototyped for the popular Hadoop MapReduce system. The effectiveness of each component is demonstrated through a comprehensive evaluation using representative MapReduce programs from various application domains.
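To give a feel for what a What-if Engine answers, here is a toy sketch of estimating a job's running time under a hypothetical configuration, starting from per-record costs measured on one instrumented run. The profile fields and the estimation formula are assumptions for illustration, not the actual Starfish profile schema or cost model:

```java
public class WhatIfSketch {
    // Toy job profile collected by a profiler on one instrumented run.
    static class Profile {
        long inputRecords;
        double mapCostPerRecord;    // seconds of map work per input record
        double reduceCostPerRecord; // seconds of reduce work per shuffled record
        double mapSelectivity;      // map output records per input record
    }

    // What-if question: estimated time if the same program is re-run
    // with a different number of map and reduce tasks.
    static double estimateTime(Profile p, int mapTasks, int reduceTasks) {
        double mapPhase = p.inputRecords * p.mapCostPerRecord / mapTasks;
        double shuffledRecords = p.inputRecords * p.mapSelectivity;
        double reducePhase = shuffledRecords * p.reduceCostPerRecord / reduceTasks;
        return mapPhase + reducePhase; // ignores startup, skew, and overlap
    }
}
```

A cost-based optimizer in this spirit would call such an estimator across the configuration space and keep the cheapest feasible setting.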
Analytics over large-scale multidimensional data
- In Proc. DOLAP’11, 2011
"... In this paper, we provide an overview of state-of-the-art research issues and achievements in the field of analytics over big data, and we extend the discussion to analytics over big multidimensional data as well, by highlighting open problems and actual research trends. Our analytical contribution ..."
Abstract - Cited by 20 (0 self)
In this paper, we provide an overview of state-of-the-art research issues and achievements in the field of analytics over big data, and we extend the discussion to analytics over big multidimensional data as well, highlighting open problems and current research trends. We complete our analytical contribution with several novel research directions arising in this field, which plays a leading role in next-generation Data Warehousing and OLAP research.
ActiveSLA: A profit-oriented admission control framework for database-as-a-service providers
- In SoCC, 2011
"... The system overload is a common problem in a Database-asa-Serice (DaaS) environment because of unpredictable and bursty workloads from various clients. Due to the service delivery nature of DaaS, such system overload usually has direct economic impact on the service provider, who has to pay penaltie ..."
Abstract - Cited by 20 (6 self)
System overload is a common problem in a Database-as-a-Service (DaaS) environment because of unpredictable and bursty workloads from various clients. Due to the service delivery nature of DaaS, such overload usually has a direct economic impact on the service provider, who has to pay penalties if the system performance does not meet clients’ service level agreements (SLAs). In this paper, we investigate techniques that prevent system overload by using admission control. We propose a profit-oriented admission control framework, called ActiveSLA, for DaaS providers. ActiveSLA is an end-to-end framework that consists of two components. First, a prediction module estimates the probability that a new query will finish execution before its deadline. Second, based on the predicted probability, a decision module determines whether or not to admit the given query into the database system. The decision is made with a profit optimization objective, where the expected profit is derived from the service level agreements between the service provider and its clients. We present extensive real-system experiments with standard database benchmarks, under different traffic patterns, DBMS settings, and SLAs. The results demonstrate that ActiveSLA makes admission control decisions that are both more accurate and more profitable than several state-of-the-art methods.
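The decision rule the abstract describes reduces to a comparison of expected profits. A minimal sketch, assuming a simple two-outcome SLA (a reward if the deadline is met, a penalty otherwise); the real framework derives these terms from richer SLA functions:

```java
public class ActiveSlaDecisionSketch {
    // p: predicted probability (from the prediction module) that the query
    // finishes before its deadline; reward and penalty come from the SLA.
    static boolean admit(double p, double reward, double penalty) {
        double expectedProfit = p * reward - (1 - p) * penalty;
        return expectedProfit > 0; // admit only if profitable in expectation
    }
}
```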
Distributed Data Management Using MapReduce
, 2013
"... MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation mo ..."
Abstract - Cited by 19 (7 self)
MapReduce is a framework for processing and managing large-scale data sets in a distributed cluster, and it has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed towards making it more usable and efficient for supporting database-centric operations. In this paper, we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on the support of distributed data management and processing using the MapReduce framework.
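The map/reduce interface this survey builds on is easiest to see in the canonical Hadoop word-count example, shown here only to illustrate the programming model (standard Hadoop API, not code from the paper):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : line.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum)); // emit (word, total count)
        }
    }
}
```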
Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework
- In SIGMOD, 2011
"... To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data ware-house system, Llama, a hybrid data management system which combines the features of row-wise and col ..."
Abstract - Cited by 18 (2 self)
To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, Llama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on the TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiments are conducted on EC2. The results show that Llama has excellent load performance, and that its query performance is significantly better than that of the traditional MapReduce framework based on row-wise storage.
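Llama's correlation groups amount to vertically partitioning each table so that columns usually read together land in the same file. An illustrative sketch of that partitioning step follows; the group assignments and names are hypothetical, not Llama's actual storage format:

```java
import java.util.*;

public class ColumnGroupSketch {
    // Split each row into per-group projections, e.g. with columnGroups =
    // {{0, 1}, {2, 3}}: group 0 holds columns 0 and 1, group 1 holds 2 and 3.
    static List<List<String[]>> partition(List<String[]> rows,
                                          int[][] columnGroups) {
        List<List<String[]>> groupFiles = new ArrayList<>();
        for (int g = 0; g < columnGroups.length; g++) groupFiles.add(new ArrayList<>());
        for (String[] row : rows) {
            for (int g = 0; g < columnGroups.length; g++) {
                String[] proj = new String[columnGroups[g].length];
                for (int c = 0; c < proj.length; c++)
                    proj[c] = row[columnGroups[g][c]];
                groupFiles.get(g).add(proj); // each list models one group's file
            }
        }
        return groupFiles;
    }
}
```

Queries touching only one group then read a fraction of each table, which is where the load and scan savings over row-wise storage come from.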
Efficient Multi-way Theta-Join Processing Using MapReduce
, 2012
"... Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed c ..."
Abstract - Cited by 17 (0 self)
Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in real practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some prior work has used the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency.
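For background, the standard way to run a single theta-join in one MapReduce job is to tile the |R| x |S| comparison space with a grid of reducer cells and replicate each tuple to every cell that could contain its matches (the 1-Bucket-Theta idea). The sketch below shows that generic mapper-side assignment, not this paper's chain-join algorithm:

```java
import java.util.*;

public class ThetaJoinMapSketch {
    // With a rows x cols reducer grid, an R-tuple goes to every cell in one
    // randomly chosen grid row, and an S-tuple to every cell in one randomly
    // chosen grid column; each cell checks theta locally on its R x S pairs.
    static List<Integer> cellsForR(int rows, int cols, Random rnd) {
        int row = rnd.nextInt(rows);
        List<Integer> cells = new ArrayList<>();
        for (int c = 0; c < cols; c++) cells.add(row * cols + c);
        return cells;
    }

    static List<Integer> cellsForS(int rows, int cols, Random rnd) {
        int col = rnd.nextInt(cols);
        List<Integer> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) cells.add(r * cols + col);
        return cells;
    }
    // Every (r, s) pair meets in exactly one cell, the intersection of r's grid
    // row and s's grid column, so no result is missed or produced twice.
}
```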