Airavat: Security and Privacy for MapReduce, 2009
Cited by 18 (1 self)
The cloud computing paradigm, which involves distributed computation on multiple large-scale datasets, will become successful only if it ensures privacy, confidentiality, and integrity for the data belonging to individuals and organizations. We present Airavat, a novel integration of decentralized information flow control (DIFC) and differential privacy that provides strong security and privacy guarantees for MapReduce computations. Airavat allows users to use arbitrary mappers, prevents unauthorized leakage of sensitive data during the computation, and supports automatic declassification of the results when the latter do not violate individual privacy. Airavat minimizes the amount of trusted code in the system and allows users without security expertise to perform privacy-preserving computations on sensitive data. Our prototype implementation demonstrates the flexibility of Airavat on a wide variety of case studies. The prototype is efficient, with run-times on Amazon’s cloud computing infrastructure within 25% of a MapReduce system with no security.
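For readers unfamiliar with differential privacy, the following is a minimal sketch of the Laplace mechanism, a standard way to release a count with calibrated noise. It is not Airavat's implementation, and the abstract does not say which mechanism Airavat uses; the function names, the sample records, and the choice of epsilon are assumptions for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with noise calibrated to sensitivity 1.

    Adding or removing one record changes a count by at most 1, so the
    noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
ages = [23, 35, 41, 58, 62, 67, 70]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 60, epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy at the cost of a noisier answer; the noise averages out to zero over many releases, but each release also consumes privacy budget.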
Cloud Technologies for Bioinformatics Applications
- IEEE Transactions on Parallel and Distributed Systems, 2010
Cited by 12 (4 self)
Executing a large number of independent jobs, or jobs comprising a large number of tasks that perform minimal intertask communication, is a common requirement in many domains. Various technologies, ranging from classic job schedulers to the latest cloud technologies such as MapReduce, can be used to execute these “many-tasks” in parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu sequence alignment application and an EST (Expressed Sequence Tag) sequence assembly program. First, we compare the performance of these cloud technologies using the above applications and also compare them with a traditional MPI implementation in one application. Next, we analyze the effect of inhomogeneous data on the scheduling mechanisms of the cloud technologies. Finally, we present a comparison of the performance of the cloud technologies on virtual and non-virtual hardware platforms.
Design Patterns for Efficient Graph Algorithms in MapReduce
- In MLG ’10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, 2010
Cited by 5 (2 self)
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
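As a point of reference for the message-passing formulation this abstract mentions, here is a minimal single-process sketch of PageRank written as a map phase followed by a reduce phase. It is not taken from the paper and does not implement its three design patterns; the function names, the damping factor of 0.85, and the toy graph are assumptions for illustration.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; the abstract does not state one


def map_phase(graph, ranks):
    """Emit (neighbor, rank share) messages along every outgoing edge."""
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for neighbor in neighbors:
            yield neighbor, share


def reduce_phase(graph, messages):
    """Sum the messages addressed to each node and apply damping."""
    incoming = defaultdict(float)
    for node, share in messages:
        incoming[node] += share
    n = len(graph)
    return {node: (1 - DAMPING) / n + DAMPING * incoming[node] for node in graph}


def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds from a uniform start."""
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        ranks = reduce_phase(graph, map_phase(graph, ranks))
    return ranks


# Toy graph (hypothetical): every node has at least one outgoing edge,
# so no dangling-node handling is needed in this sketch.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_graph)
```

In a real MapReduce job the graph structure must also travel from mappers to reducers on every iteration, which is one of the costs (partitioning, serializing, and distributing the graph) that the abstract identifies as a shortcoming of existing practice.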
Does erasure coding have a role to play in my data center?
Cited by 2 (0 self)
Today replication has become the de facto standard for storing data within and across data centers that process data-intensive workloads. Erasure coding (a form of software RAID), although heavily researched and theoretically more space efficient than replication, has complex tradeoffs which are not well-understood by practitioners. Today’s data centers have diverse foreground and background data-intensive workloads, and getting these tradeoffs right is becoming increasingly important. Through a series of realistic data center deployment scenarios and workload characteristics, coupled with the implementation of a prototype Hadoop library with erasure coding functionalities, we revisit traditional metrics (performance and dollar cost), present new tradeoffs (power proportionality and complexity) and make recommendations on directions worth researching.
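The space-efficiency claim in this abstract can be made concrete with a short calculation. The sketch below compares raw storage per byte of user data for n-way replication and for a code with k data blocks and m parity blocks. The configurations shown (3-way replication and a 6+3 code) are assumed examples, not figures from the paper.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with n-way replication."""
    return float(copies)


def erasure_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data for a k-data, m-parity code."""
    return (data_blocks + parity_blocks) / data_blocks


# Assumed example configurations, not taken from the paper.
# 3-way replication survives the loss of 2 copies; a 6+3 maximum distance
# separable code such as Reed-Solomon survives the loss of any 3 blocks.
three_way = replication_overhead(3)      # 3.0
six_plus_three = erasure_overhead(6, 3)  # 1.5
```

The storage saving is only one side of the tradeoff: with a code of this kind, rebuilding one lost block means reading k surviving blocks, whereas replication reads a single copy.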
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications
Cited by 2 (2 self)
Cloud computing offers new approaches for scientific computing that leverage the major commercial hardware and software investment in this area. The case for closely coupled applications in clouds is still unclear, as synchronization costs are still higher than on optimized MPI machines. However, loosely coupled problems are very important in many fields and can achieve good cloud performance even when pleasingly parallel steps are followed by reduction operations as supported by MapReduce. Clouds can be used in several ways, and here we compare four different approaches using two biomedical applications. We look at the cloud infrastructure service based virtual machine utility computing models of Amazon AWS and Microsoft Windows Azure, and the MapReduce-based computing frameworks Apache Hadoop (deployed on raw hardware as well as on virtual machines) and Microsoft DryadLINQ. We compare performance, showing strong variations in cost between different EC2 machine choices and comparable performance between the utility computing (spawn off a set of jobs) and managed parallelism (MapReduce) approaches. The MapReduce approach was the most user friendly.
Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother?
Cited by 1 (1 self)
Modern information retrieval research has evolved a standard workflow that involves first indexing a document collection and then running ad hoc queries sequentially to evaluate retrieval effectiveness using standard test collections. This paper explores how aspects of this workflow might change in a MapReduce cluster-based environment. First, we present and evaluate two algorithms for inverted indexing that take advantage of the programming model’s sorting mechanism to different extents. The running times of both algorithms scale linearly in terms of collection size up to 102 million web pages. Second, we show that it is possible to efficiently perform batch query evaluation with MapReduce by scanning all postings lists in parallel, as opposed to sequentially accessing each postings list. Third, we explore an approach that forgoes inverted indexing altogether and simply computes all query–document scores from document vectors themselves. Experimental results challenge us to think differently about previous assumptions in information retrieval, and show that brute force approaches are surprisingly compelling under certain circumstances: parallel scan of postings can effectively take advantage of large clusters and parallel scan of documents fits naturally with ranking functions that use document-level features.
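To illustrate what inverted indexing looks like in the map/reduce style, the following is a minimal in-memory sketch. It is not a reproduction of either algorithm evaluated in the paper: here the reducer sorts each postings list itself, whereas the abstract says the paper's algorithms rely on the framework's sorting mechanism to different extents. All function names and the toy collection are assumptions.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Emit one (term, (doc_id, term_frequency)) pair per distinct term."""
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group values by key, standing in for the framework's sort/shuffle."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_postings(term, postings):
    """Produce the postings list for one term, ordered by document id."""
    return term, sorted(postings)


def build_index(documents):
    """Run map, shuffle, and reduce over an in-memory collection."""
    pairs = (pair for doc_id, text in documents.items()
             for pair in map_document(doc_id, text))
    return dict(reduce_postings(t, p) for t, p in shuffle(pairs).items())


docs = {1: "map reduce map", 2: "reduce scan"}  # hypothetical toy collection
index = build_index(docs)
```

Each postings list holds (document id, term frequency) pairs, which is the structure a batch query evaluator would scan in parallel.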
The limitation of MapReduce: A probing case and a lightweight solution
- In Proc. of the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, 2010
Cited by 1 (1 self)
MapReduce is arguably the most successful parallelization framework, especially for processing large data sets in datacenters comprising commodity computers. However, difficulties are observed in porting sophisticated applications to MapReduce, despite the existence of numerous parallelization opportunities. Intrinsically, the MapReduce design allows a program to scale up to handle extremely large data sets, but constrains a program’s ability to process smaller data items and exploit variable degrees of parallelization opportunities, which are likely to be the common case in general applications. In this paper, we analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite. MRlite can efficiently process moderate-size data with dependences among numerous computational steps. Meanwhile, the parallelization of each step emulates the MapReduce model. Hence, the MRlite framework can also scale up for large data sets if massive parallelism with minimal dependence exists. MRlite can significantly improve the flexibility and parallel execution performance for a number of typical programs. Our evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.
Keywords: Distributed computing; Parallel architectures
Scalable Modular Genome Assembly on Campus Grids
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, is naturally parallel; however, most current implementations are tied to uncommon high end hardware. We solve this problem by introducing a modular, scalable framework for genome assembly that runs on a wide variety of distributed environments without forcing end users to purchase specialized hardware or become experts in parallel programming. For large problems, the framework carefully handles task and data management while also achieving fault-tolerant speedup with good efficiency on several scales of resources. We show results for several assembly-related problems ranging from 738 thousand to over 84 million alignments using campus grid resources ranging from a small cluster to several hundred nodes at each of three institutions. These results show strong scaling beyond 512 nodes using a custom alignment module.
Highly Scalable Genome Assembly on Campus Grids
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit, and replaces two independent sequential modules with scalable replacements: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while the scalable aligner exploits the distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
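The speedup and efficiency figures in this abstract can be related with a short calculation. The sketch below assumes the conventional definition of parallel efficiency as speedup divided by the number of workers; the abstract does not state its definition or the exact node count, so the derived worker count is an inference, not a reported result.

```python
def parallel_efficiency(speedup, workers):
    """Conventional definition (assumed here): efficiency = speedup / workers."""
    return speedup / workers


def implied_workers(speedup, efficiency):
    """Invert the definition to estimate how many workers were used."""
    return speedup / efficiency


# Figures quoted in the abstract: 927x speedup at 71.3 percent efficiency.
workers = implied_workers(927, 0.713)
```

Under that assumption the run would have used roughly 1,300 workers, which is consistent with the abstract's "more than a thousand nodes".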
Supplementary Methods
spaced seeds for different sensitivity criteria, we proposed the following three methods to generate fully sensitive periodic multiple seeds. For large genome re-sequencing applications, multiple index tables can be queried with the MapReduce framework as proposed in [8] to increase the mapping efficiency and sensitivity by utilizing the higher weight of multiple seeds.

1.1 Design paired periodic seeds with exhaustive search

The design of single periodic seeds can be generalized to find same-length periodic multiple seeds. Tables 1 and 2 show the increase in weight for different period lengths that results from using paired rather than single seeds. Fig 1 displays the local maximum of weight-length ratios for paired seed periods.

Table 1. The maximum period weight for single and paired seeds at different sensitivity levels (columns: full sensitivity to 2 substitutions, 3 substitutions, 4 substitutions).

