Data Indexing for Stateful, Large-scale Data Processing
Abstract
Bulk data processing models like MapReduce are popular because they enable users to harness the power of tens of thousands of commodity machines with little programming effort. However, these systems have recently come under fire for lacking features common to parallel relational databases. One key weakness of these architectures is that they do not provide any underlying data indexing. Indexing could potentially provide large increases in performance for workloads that join data across distinct inputs, a common operation. This paper explores the challenges of incorporating indexed data into these processing systems. In particular, we explore using indexed data to support stateful groupwise processing. Access to persistent state is a key requirement for incremental processing, allowing operations to incorporate data updates without recomputing from scratch. With indexing, groupwise processing can randomly access state, avoiding costly sequential scans. While the random access performance of current table-based stores (Bigtable) is disappointing, the characteristics of solid-state drives (SSDs) promise to make this a relevant optimization for such systems. Experience with an initial prototype and a simple model of system performance indicate that such techniques can halve job runtime in the common case. Finally, we outline the integration and fault tolerance issues this system architecture presents.
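To make the contrast between sequential scans and indexed state access concrete, the following is a minimal sketch, not the paper's system. It assumes the per-group state is a simple running count and uses a Python dict as a stand-in for an indexed store; the function names (group_updates, update_with_scan, update_with_index) are hypothetical and chosen only for illustration. Each function returns the number of state records it had to read.

```python
from collections import defaultdict


def group_updates(updates):
    """Group incoming (key, value) updates by key, as a groupwise operator would."""
    groups = defaultdict(list)
    for key, value in updates:
        groups[key].append(value)
    return groups


def update_with_scan(state_records, updates):
    """Apply updates by scanning every stored state record (no index).

    state_records is a list of (key, count) pairs. Every record is read,
    even when only a few groups actually receive updates.
    """
    groups = group_updates(updates)
    reads = 0
    seen = set()
    new_state = []
    for key, count in state_records:
        reads += 1
        if key in groups:
            count += sum(groups[key])
            seen.add(key)
        new_state.append((key, count))
    # Groups that had no prior state are appended as new records.
    for key, values in groups.items():
        if key not in seen:
            new_state.append((key, sum(values)))
    return new_state, reads


def update_with_index(state_index, updates):
    """Apply updates through random access to an indexed store.

    state_index is a dict standing in for an indexed table (an assumption
    of this sketch). Only the groups that appear in the updates are touched.
    """
    groups = group_updates(updates)
    reads = 0
    for key, values in groups.items():
        reads += 1
        state_index[key] = state_index.get(key, 0) + sum(values)
    return state_index, reads


state = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
updates = [("b", 10), ("e", 5), ("b", 1)]

scanned_state, scan_reads = update_with_scan(state, updates)
indexed_state, index_reads = update_with_index(dict(state), updates)
```

In this toy setting both paths produce the same state, but the scan reads every stored record while the indexed path touches only the groups that received updates. Whether that translates into shorter job runtime depends on the random access cost of the underlying store, which is the trade-off the abstract raises for Bigtable and SSDs.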

