Scaling the Mobile Millennium System in the Cloud
Cited by 1 (1 self)
We report on our experience scaling up the Mobile Millennium traffic information system using cloud computing and the Spark cluster computing framework. Mobile Millennium uses machine learning to infer traffic conditions for large metropolitan areas from crowdsourced data, and Spark was specifically designed to support such applications. Many studies of cloud computing frameworks have demonstrated scalability and performance improvements for simple machine learning algorithms. Our experience implementing a real-world machine learning-based application corroborates such benefits, but we also encountered several challenges that have not been widely reported. These include: managing large parameter vectors, using memory efficiently, and integrating with the application’s existing storage infrastructure. This paper describes these challenges and the changes they required in both the Spark framework and the Mobile Millennium software. While we focus on a system for traffic estimation, we believe that the lessons learned are applicable to other machine learning-based applications.
Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.
Sonora: A Platform for Continuous Mobile-Cloud Computing
This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming interfaces that coherently integrate a broad range of existing techniques from mobile, database, and distributed systems. These range from support for disconnected operation to relational and event-driven models. Sonora’s execution engine is a fault-tolerant distributed runtime that supports user-facing continuous sensing and processing services in the cloud. Key features of this engine are its dynamic load balancing mechanisms and a novel failure recovery protocol that performs checkpoint-based partial rollback recovery with selective re-execution. To illustrate the relevance and power of the stream abstraction in describing complex mobile-cloud services, we evaluate Sonora’s design in the context of two services. We also validate Sonora’s design, demonstrating that Sonora is efficient, scalable, and provides responsive fault tolerance.
Motivation and Approach
I am a systems security researcher who harnesses machine learning to solve problems. My interest in systems security stems from my desire to work on practical problems that affect the average computer user. My interest in learning-based approaches to security comes from the realization that as data sources become larger and more complex, a handful of well-crafted heuristics is simply not sufficient. Together, the desire to work on practical problems and the observation that we should move beyond heuristics motivate my research. Practical problems in security interest me because I identify with the average user. Criminals and other miscreants commit a plethora of abuses over systems that serve users, with the acts ranging from being inconvenient (such as flooding an email inbox with spam) to being outright criminal (in the case of tricking us into revealing private information, or attempting to take over our machines). Therefore, there is great satisfaction when I can reduce the effectiveness of abusers and protect users from harm, and applied security research provides me this opportunity. Much of my research has addressed security problems in Internet services (such as Web mail and social networking sites), where the goal is to detect and block malicious activity. Because of the size of the user base for these services, the scale of the data is large and the diversity in how abuses manifest themselves is high. In this situation, it is difficult to discern one clear signal in the data for detecting abuse; instead, we find ourselves needing to sift through many indicative but faint signals for this purpose. Rather than using domain expertise to construct a small set of well-crafted
Composable Incremental and Iterative Data-Parallel Computation with Naiad
We report on the design and implementation of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation. This combination is enabled by a new computational model we call differential dataflow, in which incremental computation can be performed using a partial, rather than total, order on time. Naiad extends standard batch data-parallel processing models like MapReduce, Hadoop, and Dryad/DryadLINQ, to support efficient incremental updates to the inputs in the manner of a stream processing system, while at the same time enabling arbitrarily nested fixed-point iteration. In this paper, we evaluate a prototype of Naiad that uses shared memory on a single multi-core computer. We apply Naiad to various computations, including several graph algorithms, and observe good scaling properties and efficient incremental recomputation.
Supporting Bulk Synchronous Parallelism in Map-Reduce Queries
One of the major drawbacks of the Map-Reduce (MR) model is that, to simplify reliability and fault tolerance, it does not preserve data in memory across consecutive MR jobs: an MR job must dump its data to the distributed file system before it can be read by the next MR job. This restriction imposes a high overhead on complex MR workflows and graph algorithms, such as PageRank, which require repetitive MR jobs. The Bulk Synchronous Parallelism (BSP) programming model, on the other hand, has been recently advocated as an alternative to the MR model that does not suffer from this restriction, and, under certain circumstances, allows complex repetitive algorithms to run entirely in the collective memory of a cluster. We present a framework for translating complex declarative queries for scientific and graph data analysis applications to both MR and BSP evaluation plans, leaving the choice to be made at run-time based on the available resources. If the resources are sufficient, the query will be evaluated entirely in memory based on the BSP model; otherwise, the same query will be evaluated based on the MR model.
Themis: An I/O-Efficient MapReduce
“Big Data” computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present Themis, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amount possible for data sets that cannot fit in memory. In order to minimize I/O, Themis makes fundamentally different design decisions from previous MapReduce implementations. Themis performs a wide variety of MapReduce jobs, including click log analysis, DNA read sequence alignment, and PageRank, at nearly the speed of TritonSort’s record-setting sort performance [29].
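The "exactly twice" I/O budget described in the Themis abstract can be illustrated with a small two-pass sort. This is only a sketch under stated assumptions, not Themis's actual design: the function name `two_pass_sort`, the first-character range partitioning, and the newline-delimited record format are all hypothetical choices made for illustration. Pass one reads each record once and writes it to an on-disk partition; pass two reads each partition once, sorts it in memory, and writes it to the output.

```python
import os
import tempfile


def two_pass_sort(input_path, output_path, num_partitions=4, key_space=256):
    """Sort newline-delimited records with two reads and two writes per record.

    Hypothetical illustration: assumes every partition fits in memory and
    that empty lines can be skipped.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part-{i}") for i in range(num_partitions)]
        parts = [open(p, "w") for p in paths]
        try:
            # Pass 1: read each record once, write it once to a range partition.
            with open(input_path) as src:
                for line in src:
                    record = line.rstrip("\n")
                    if not record:
                        continue  # skip empty records
                    # Partition by first character so partitions are ordered.
                    code = min(ord(record[0]), key_space - 1)
                    parts[code * num_partitions // key_space].write(record + "\n")
        finally:
            for handle in parts:
                handle.close()

        # Pass 2: read each partition once, sort in memory, write output once.
        with open(output_path, "w") as out:
            for path in paths:
                with open(path) as part:
                    records = part.read().splitlines()
                for record in sorted(records):
                    out.write(record + "\n")
```

Because the partitions are ordered by key range, concatenating the individually sorted partitions yields a globally sorted output without a further merge pass over disk.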
Differential dataflow
Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain, with reference to a publicly available prototype system called Naiad, how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.
The HaLoop Approach to Large-Scale Iterative Data Analysis
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework, that is designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets. Compared with Hadoop, on average, HaLoop improved runtimes by a factor of 1.85 and shuffled only 4% as much data between mappers and reducers in the applications that we tested.
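The idea behind inter-iteration caching can be sketched on a single machine with PageRank, one of the iterative workloads the abstract mentions. This is a minimal illustration, not HaLoop's API or implementation: the names `build_adjacency` and `pagerank`, the damping factor, and the iteration count are assumptions. The link structure is loop-invariant, so it is built once and reused by every iteration; only the rank vector changes between iterations.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Loop-invariant step: group out-links by source node, done once."""
    adjacency = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        adjacency[src].append(dst)
        nodes.update((src, dst))
    return dict(adjacency), sorted(nodes)


def pagerank(edges, iterations=20, damping=0.85):
    """Iterate PageRank while reusing the cached adjacency structure."""
    adjacency, nodes = build_adjacency(edges)  # cached across iterations
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        incoming = defaultdict(float)
        dangling = 0.0  # rank held by nodes with no out-links
        for node in nodes:
            targets = adjacency.get(node)
            if targets:
                share = ranks[node] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                dangling += ranks[node]
        # Only the rank vector is recomputed; adjacency is never rebuilt.
        ranks = {
            node: (1 - damping) / n + damping * (incoming[node] + dangling / n)
            for node in nodes
        }
    return ranks
```

In a plain MapReduce formulation, each iteration would re-read and re-shuffle the link structure; keeping it cached is the kind of saving the abstract attributes to inter-iteration caches.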
unknown title
“Big Data” computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present ThemisMR, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amount possible for data sets that cannot fit in memory. In order to minimize I/O, ThemisMR makes fundamentally different design decisions from previous MapReduce implementations. ThemisMR performs a wide variety of MapReduce jobs, including click log analysis, DNA read sequence alignment, and PageRank, at nearly the speed of TritonSort’s record-setting sort performance.

