## Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework (2009)

### BibTeX

    @MISC{Lin09lowlatency,
      author = {Jimmy Lin and Shravya Konda and Samantha Mahindrakar},
      title = {Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework},
      year = {2009}
    }

### Abstract

Hadoop is an open source implementation of Google’s MapReduce programming model that has recently gained popularity as a practical approach to distributed information processing. This work explores the use of memcached, an open-source distributed in-memory object caching system, to provide low-latency, high-throughput access to static global resources in Hadoop. Such a capability is essential to a large class of MapReduce algorithms that require, for example, querying language model probabilities, accessing model parameters in iterative algorithms, or performing joins across relational datasets. Experimental results on a simple demonstration application illustrate that memcached provides a feasible general-purpose solution for rapidly accessing global key-value pairs from within Hadoop programs. Our proposed architecture exhibits the desirable scaling characteristic of linear increase in throughput with respect to cluster size. To our knowledge, this application of memcached in Hadoop is novel. Although considerable opportunities for increased performance remain, this work enables implementation of algorithms that do not have satisfactory solutions at scale today.
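The access pattern the abstract describes, mappers fetching static global key-value pairs from a shared in-memory cache rather than loading the full resource locally, can be sketched as follows. This is an illustrative sketch only: `FakeCache` stands in for a real memcached client (whose `get_multi` interface this mimics), and the unseen-word floor of `-10.0` is an assumed placeholder, not the paper's configuration.

```python
# Sketch of the pattern from the abstract: a mapper looks up global
# key-value pairs (e.g., language model log probabilities) in a shared
# cache. A dict-backed stand-in replaces a real memcached client so the
# example is self-contained.

class FakeCache:
    """Stands in for a memcached client's batched get_multi interface."""
    def __init__(self, store):
        self.store = store

    def get_multi(self, keys):
        # Like memcached, return only the keys actually present.
        return {k: self.store[k] for k in keys if k in self.store}

def mapper(record, cache):
    """Process one input record, fetching needed parameters from the cache."""
    key, words = record
    probs = cache.get_multi(words)  # one batched round trip, not per-word gets
    score = sum(probs.get(w, -10.0) for w in words)  # -10.0: assumed unseen floor
    return (key, score)

cache = FakeCache({"the": -1.0, "cat": -3.0})
print(mapper((1, ["the", "cat", "zyx"]), cache))  # (1, -14.0)
```

Batching lookups with `get_multi` rather than issuing one request per word is the standard way to amortize network round-trip latency against a cache server.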

### Citations

8132 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...providing access to distributed language models exist [31, 12], but in Section 4 we propose a more general-purpose solution. Model Parameters in Iterative Algorithms. Expectation-maximization, or EM [9], describes a class of algorithms that iteratively estimates parameters of a probabilistic model with latent (i.e., hidden) variables, given a set of observations. Each iteration requires access to mo...

2144 | The PageRank Citation Ranking: Bringing Order to the Web
- Page, Brin, et al.
- 1998
Citation Context: ...applied to all input records, which generates results that are aggregated by the “reducer”. The runtime groups together values by keys. In the standard MapReduce implementation of the well-known PageRank [24] algorithm, each iteration corresponds to a MapReduce job. Under this framework, a programmer needs only to provide implementations of the mapper and reducer. On top of a distributed file system, the ...
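The context above describes the core MapReduce contract: a mapper is applied to every input record, the runtime groups emitted values by key, and a reducer aggregates each group. A minimal in-memory simulation, with one damping-free PageRank-style iteration as the example job (all names here are illustrative, not from the paper):

```python
from collections import defaultdict

# Minimal in-memory simulation of the MapReduce flow described above:
# apply the mapper to every record, group emitted values by key, then
# reduce each group.

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # a mapper may emit many pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# One PageRank-style iteration: each page distributes rank/out-degree to
# its outgoing links; the reducer sums incoming contributions (the
# damping factor is omitted for brevity).
def pr_mapper(record):
    page, (rank, links) = record
    return [(dest, rank / len(links)) for dest in links]

def pr_reducer(key, values):
    return sum(values)

graph = [("a", (1.0, ["b", "c"])), ("b", (1.0, ["c"])), ("c", (1.0, ["a"]))]
print(run_mapreduce(graph, pr_mapper, pr_reducer))
# {'b': 0.5, 'c': 1.5, 'a': 1.0}
```

Running PageRank to convergence would repeat this as a sequence of jobs, matching the observation that each iteration corresponds to one MapReduce job.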

1734 | Mapreduce: Simplified data processing on large clusters
- Dean, Ghemawat
- 2004
Citation Context: ...e remain, this work enables implementation of algorithms that do not have satisfactory solutions at scale today. 1 Introduction Hadoop is an open source implementation of Google’s MapReduce framework [7] for distributed processing. It has recently gained popularity as a practical, cost-effective approach to tackling data-intensive problems, and has stimulated a flurry of research on applications to m...

1275 | Xen and the Art of Virtualization
- Barham, Dragovic, et al.
- 2003
Citation Context: ...it is important to note that results in Section 6 were obtained from experiments on EC2, which is a virtualized environment where processes incur a noticeable performance overhead (see, for example, [1]). While this affects all cluster configurations equally and will not likely impact results qualitatively, absolute numbers may not be entirely accurate. In the current setup, latency appears to be th...

1177 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
Citation Context: ...on requires access to model parameters from the previous iteration, thus presenting a distributed data access problem. One well-known EM algorithm is word alignment in statistical machine translation [3], where the task is to induce word translation probabilities based on parallel training data (pairs of sentences in different languages known to be translations of each other). Word alignment is a cri...

885 | A language modeling approach to information retrieval
- Ponte, Croft
- 1998
Citation Context: ...a formulation ignores the richness and complexity of natural language—disregarding syntax, semantics, and even word order—this simple model has proven to be effective in text retrieval applications [25, 17]. 1: procedure Map(k, s); 2: p ← 0; 3: for all w ∈ s do; 4: p ← p + LogP(w); 5: Emit(k, p). Figure 4: Pseudo-code of algorithm for scoring a sentence with a unigram language model. Given a sequence of na...
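The Figure 4 pseudo-code quoted in this context translates directly into runnable form. The probability table below is an illustrative stand-in for the global resource the paper serves from memcached, and the unseen-word floor is an assumption, not the paper's smoothing scheme:

```python
import math

# Direct translation of the quoted Figure 4 pseudo-code: score a sentence
# under a unigram language model by summing per-word log probabilities.
# LOGP stands in for the global log-probability table; UNSEEN is an
# illustrative floor for out-of-vocabulary words.

LOGP = {"the": math.log(0.07), "cat": math.log(0.001)}
UNSEEN = math.log(1e-9)

def map_score(k, sentence):
    """procedure Map(k, s): p <- 0; for all w in s: p <- p + LogP(w); Emit(k, p)."""
    p = 0.0
    for w in sentence:
        p += LOGP.get(w, UNSEEN)
    return (k, p)  # Emit(k, p)

print(map_score("s1", ["the", "cat"]))
```

Because each word's probability is an independent key-value lookup, this is exactly the workload profile (many small random reads against a static table) that motivates a distributed cache.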

703 | A study of smoothing methods for language models applied to Ad Hoc information retrieval
- Zhai, Lafferty
- 2001
Citation Context: ...epresents the count of word w in the entire collection. It is well-known that smoothing, the process of assigning non-zero probabilities to unseen events, plays an important role in language modeling [21, 30]. For the purposes of characterizing cache performance, however, the exact probability distribution is unimportant. Computing the maximum-likelihood log probabilities is easily expressed as two MapRed...
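The two-pass computation this context mentions (count each word's collection frequency cf(w), then normalize to maximum-likelihood log probabilities log(cf(w)/|C|)) can be sketched as below. In Hadoop each pass would be a MapReduce job; plain Python stands in here, and the toy collection is illustrative:

```python
import math
from collections import Counter

# Sketch of the two-pass maximum-likelihood computation: pass 1 counts
# word occurrences across the collection; pass 2 normalizes by the total
# token count and takes logs.

def word_counts(collection):
    counts = Counter()
    for sentence in collection:      # pass 1: word count
        counts.update(sentence)
    return counts

def ml_log_probs(counts):
    total = sum(counts.values())     # pass 2: normalize, take logs
    return {w: math.log(c / total) for w, c in counts.items()}

collection = [["the", "cat"], ["the", "dog"]]
probs = ml_log_probs(word_counts(collection))
print(round(probs["the"], 4))  # log(2/4) = -0.6931
```

As the context notes, maximum-likelihood estimates assign zero probability to unseen words; smoothing would redistribute mass to them, but the exact distribution does not matter for characterizing cache performance.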

635 | Statistical phrase-based translation
- Koehn, Och, et al.
- 2003
Citation Context: ...large number of observations (i.e., a large text collection); see [21] for an introduction. One concrete example is the decoding (i.e., translation) phase in a statistical machine translation system [15, 20], where translation hypotheses need to be scored against a language model (to maximize the probability that the generated output is fluent natural language, subjected to various translation constraint...

530 | Bigtable: A distributed storage system for structured data
- Chang, Dean, et al.
- 2006
Citation Context: ...erefore which probabilities are required) until the algorithm is well underway. Two-pass solutions are possible, but at the cost of added complexity. A distributed data store (e.g., Google’s Bigtable [4] or Amazon’s Dynamo [8]) might provide a possible solution for accessing global resources that are too big to fit into memory on any single machine (and cannot be easily partitioned). HBase 3 is an op...

451 | Minimum error rate training in statistical machine translation
- Och
- 2003
Citation Context: ...mize the probability that the generated output is fluent natural language, subjected to various translation constraints). One popular optimization technique, called minimum error-rate training (MERT) [22, 27], requires the repeated decoding of many training sentences, which can be straightforwardly parallelized with MapReduce using mappers. As the algorithm maps over training sentences, it needs to rapidl...

364 | A hierarchical phrase-based model for statistical machine translation
- Chiang
- 2005
Citation Context: ...ervers in the 80-slave cluster configuration. can take a few seconds with standard phrase-based models [15] and significantly longer (more than ten seconds) for more sophisticated hierarchical models [5]. 7.1 Further Optimizations Nevertheless, there are considerable opportunities for further increasing cache performance. These represent fairly standard engineering practices, but are nevertheless wor...

359 | Pig latin: a not-so-foreign language for data processing
- Olston, Reed, et al.
- 2008
Citation Context: ...machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectures [14, 16, 26]. It is no exaggeration to say that MapReduce in general, and Hadoop in particular, has reinvigorated work on applic...

355 | Dynamo: Amazon’s highly available key-value store
- DeCandia
- 2007
Citation Context: ...ties are required) until the algorithm is well underway. Two-pass solutions are possible, but at the cost of added complexity. A distributed data store (e.g., Google’s Bigtable [4] or Amazon’s Dynamo [8]) might provide a possible solution for accessing global resources that are too big to fit into memory on any single machine (and cannot be easily partitioned). HBase 3 is an open-source implementatio...

305 | Document language models, query models, and risk minimization for information retrieval
- Lafferty, Zhai
- 2001
Citation Context: ...a formulation ignores the richness and complexity of natural language—disregarding syntax, semantics, and even word order—this simple model has proven to be effective in text retrieval applications [25, 17]. 1: procedure Map(k, s); 2: p ← 0; 3: for all w ∈ s do; 4: p ← p + LogP(w); 5: Emit(k, p). Figure 4: Pseudo-code of algorithm for scoring a sentence with a unigram language model. Given a sequence of na...

140 | K.: Map-reduce for machine learning on multicore
- Chu, Kim, et al.
- 2006
Citation Context: ...processing. It has recently gained popularity as a practical, cost-effective approach to tackling data-intensive problems, and has stimulated a flurry of research on applications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataf...

138 | Evaluating MapReduce for multi-core and multiprocessor systems
- Ranger, Raghuraman, et al.
- 2007
Citation Context: ...ditionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectures [14, 16, 26]. It is no exaggeration to say that MapReduce in general, and Hadoop in particular, has reinvigorated work on applications, algorithms, and approaches to distributed processing of large datasets. The ...

128 | Map-reduce-merge: simplified relational data processing on large clusters
- Yang, Dasdan, et al.
- 2007
Citation Context: ...lications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectures [14, 16, 26]. It is no exaggeration to say that MapReduce in general, and Hado...

119 | Large language models in machine translation
- Brants, Popat, et al.
- 2007
Citation Context: ...odel probabilities into memory. The fundamental bottleneck here of course is memory, which constrains the maximum size of the language model. Web-scale language models can have billions of parameters [2] and may not easily fit into memory on a single machine. Client-server architectures for providing access to distributed language models exist [31, 12], but in Section 4 we propose a more general-purp...

58 | Mars: A mapreduce framework on graphics processors
- He, Fang, et al.
Citation Context: ...ditionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectures [14, 16, 26]. It is no exaggeration to say that MapReduce in general, and Hadoop in particular, has reinvigorated work on applications, algorithms, and approaches to distributed processing of large datasets. The ...

54 | Statistical machine translation
- Lopez
Citation Context: ...large number of observations (i.e., a large text collection); see [21] for an introduction. One concrete example is the decoding (i.e., translation) phase in a statistical machine translation system [15, 20], where translation hypotheses need to be scored against a language model (to maximize the probability that the generated output is fluent natural language, subjected to various translation constraint...

31 | Pairwise document similarity in large collections with mapreduce
- Elsayed, Lin, et al.
Citation Context: ...o tackling data-intensive problems, and has stimulated a flurry of research on applications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectu...

26 | Fully distributed EM for very large datasets
- Wolfe, Haghighi, et al.
- 2008
Citation Context: ...processing. It has recently gained popularity as a practical, cost-effective approach to tackling data-intensive problems, and has stimulated a flurry of research on applications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataf...

14 | Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce
Citation Context: ...arity as a practical, cost-effective approach to tackling data-intensive problems, and has stimulated a flurry of research on applications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alte...

10 | Distributed language modeling for N-best list re-ranking
- Zhang, Hildebrand, et al.
- 2006
Citation Context: ...scale language models can have billions of parameters [2] and may not easily fit into memory on a single machine. Client-server architectures for providing access to distributed language models exist [31, 12], but in Section 4 we propose a more general-purpose solution. Model Parameters in Iterative Algorithms. Expectation-maximization, or EM [9], describes a class of algorithms that iteratively estimates...

9 | MapReduce for the Cell BE architecture
- Kruijf, Sankaralingam
- 2007
Citation Context: ...ditionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementations on different architectures [14, 16, 26]. It is no exaggeration to say that MapReduce in general, and Hadoop in particular, has reinvigorated work on applications, algorithms, and approaches to distributed processing of large datasets. The ...

6 | Large-scale distributed language modeling
- Emami, Papineni, et al.
- 2007
Citation Context: ...scale language models can have billions of parameters [2] and may not easily fit into memory on a single machine. Client-server architectures for providing access to distributed language models exist [31, 12], but in Section 4 we propose a more general-purpose solution. Model Parameters in Iterative Algorithms. Expectation-maximization, or EM [9], describes a class of algorithms that iteratively estimates...

6 | Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce
- Lin
Citation Context: ...repeating until convergence. We have implemented the capabilities described here in Cloud 9, an open-source MapReduce library for Hadoop, 6 being developed at the University of Maryland for teaching [18] and a number of ongoing research projects [11, 19]. Although the idea of providing distributed access to key-value pairs using a client-server architecture is certainly not new (and researchers have ...

5 | Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with mapreduce
- Lin
- 2008
Citation Context: ...cost-effective approach to tackling data-intensive problems, and has stimulated a flurry of research on applications to machine learning [6, 28], statistical machine translation [10], corpus analysis [19], and text retrieval [11]. Additionally, researchers have examined extensions to the programming model [29], developed a complementary dataflow language [23], and also built alternative implementation...