Results 1 -
4 of
4
Phrase Clustering for Discriminative Learning
"... We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition a ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition and query classification. Our results show that phrase clusters offer significant improvements over word clusters. Our NER system achieves the best current result on the widely used CoNLL benchmark. Our query classifier is on par with the best system in KDDCUP 2005 without resorting to labor intensive knowledge engineering efforts. 1
The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce
- In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval
"... This paper explores the problem of “stragglers ” in Map-Reduce: a common phenomenon where a small number of mappers or reducers takes significantly longer than the others to complete. The effects of these stragglers include unnecessarily long wall-clock running times and sub-optimal cluster utilizat ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
This paper explores the problem of “stragglers ” in Map-Reduce: a common phenomenon where a small number of mappers or reducers takes significantly longer than the others to complete. The effects of these stragglers include unnecessarily long wall-clock running times and sub-optimal cluster utilization. In many cases, this problem cannot simply be attributed to hardware idiosyncrasies, but is rather caused by the Zipfian distribution of input or intermediate data. I present a simple theoretical model that shows how such distributions impose a fundamental limit on the amount of parallelism that can be extracted from a large class of algorithms where all occurrences of the same element must be processed together. A case study in parallel ad hoc query evaluation highlights the severity of the stragglers problem. Fortunately, a simple modification of the input data cuts end-to-end running time in half. This example illustrates some of the issues associated with designing efficient MapReduce algorithms for real-world datasets.
Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework
, 2009
"... Hadoop is an open source implementation of Google’s MapReduce programming model that has recently gained popularity as a practical approach to distributed information processing. This work explores the use of memcached, an open-source distributed in-memory object caching system, to provide low-laten ..."
Abstract
- Add to MetaCart
Hadoop is an open source implementation of Google’s MapReduce programming model that has recently gained popularity as a practical approach to distributed information processing. This work explores the use of memcached, an open-source distributed in-memory object caching system, to provide low-latency, high-throughput access to static global resources in Hadoop. Such a capability is essential to a large class of MapReduce algorithms that require, for example, querying language model probabilities, accessing model parameters in iterative algorithms, or performing joins across relational datasets. Experimental results on a simple demonstration application illustrate that memcached provides a feasible general-purpose solution for rapidly accessing global key-value pairs from within Hadoop programs. Our proposed architecture exhibits the desirable scaling characteristic of linear increase in throughput with respect to cluster size. To our knowledge, this application of memcached in Hadoop is novel. Although considerable opportunities for increased performance remain, this work enables implementation of algorithms that do not have satisfactory solutions at scale today. 1

