Results 1 - 5 of 5
Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother?
Cited by 1 (1 self)
Modern information retrieval research has evolved a standard workflow that involves first indexing a document collection and then running ad hoc queries sequentially to evaluate retrieval effectiveness using standard test collections. This paper explores how aspects of this workflow might change in a MapReduce cluster-based environment. First, we present and evaluate two algorithms for inverted indexing that take advantage of the programming model’s sorting mechanism to different extents. The running times of both algorithms scale linearly in terms of collection size up to 102 million web pages. Second, we show that it is possible to efficiently perform batch query evaluation with MapReduce by scanning all postings lists in parallel, as opposed to sequentially accessing each postings list. Third, we explore an approach that forgoes inverted indexing altogether and simply computes all query–document scores from document vectors themselves. Experimental results challenge us to think differently about previous assumptions in information retrieval, and show that brute force approaches are surprisingly compelling under certain circumstances: parallel scan of postings can effectively take advantage of large clusters and parallel scan of documents fits naturally with ranking functions that use document-level features.
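As a rough illustration of the inverted indexing idea this abstract mentions, the sketch below simulates the map, shuffle, and reduce phases in a single Python process. It is not either of the paper's two algorithms: the function names (`map_doc`, `reduce_term`, `build_index`), the whitespace tokenizer, and the in-memory grouping are all assumptions made for the example, and the sort inside the reducer stands in for work that a real MapReduce framework could hand to its own sorting mechanism.

```python
from collections import Counter, defaultdict


def map_doc(doc_id, text):
    """Mapper (hypothetical): emit (term, (doc_id, term_frequency)) per distinct term."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def reduce_term(term, postings):
    """Reducer (hypothetical): sort one term's postings by document id."""
    return term, sorted(postings)


def build_index(docs):
    """Simulate map, shuffle (group by term), and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_term(term, postings) for term, postings in grouped.items())


# Toy collection, invented for the example.
docs = {1: "map reduce map", 2: "reduce index"}
index = build_index(docs)
```

Each postings list maps a term to `(doc_id, term_frequency)` pairs, so `index["reduce"]` holds one posting for each of the two toy documents.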
A Study of Skew in MapReduce Applications
Cited by 1 (1 self)
This paper presents a study of skew (highly variable task runtimes) in MapReduce applications. We describe various causes and manifestations of skew as observed in real-world Hadoop applications. Runtime task distributions from these applications demonstrate the presence and negative impact of skew on performance behavior. We discuss best practices recommended for avoiding such behavior and their limitations.
U N I V E R
Twitter is a fast-growing social networking site with millions of users. Social relationships on Twitter are based on the follower concept rather than a mutual friend concept, so the following action is not necessarily mutual between users. Twitter users can be ranked using the PageRank method, since follower relationships can be represented as a social graph and the number of followers reflects influence propagation. In this dissertation we implemented Incremental PageRank using the Hadoop MapReduce framework. We improved the existing Incremental PageRank method based on the idea that applying a threshold restriction reduces the number of affected nodes (descendants of changed nodes) that go to the recalculation stage. We named our approach the Incremental+ method. Our experimental results show that the Incremental+ PageRank method is scalable: we successfully applied it to calculate PageRank values for 1.47 billion Twitter following relations. The Incremental+ method also produced the same ranking result as other methods, even though we used an approximation approach in calculating the PageRank value. The results also show that the Incremental+ method is efficient because it reduced the number of inputs per iteration and also reduced
UMD and USC/ISI: TREC 2010 Web Track Experiments with Ivory
Ivory is a web-scale retrieval engine we have been developing for the past two years, built around a cluster-based environment running Hadoop, the open-source implementation of the MapReduce programming model. Building on successes last year at TREC, we explored two major directions this year: more sophisticated retrieval models and large-scale graph analysis for spam detection. We describe results of ad hoc retrieval experiments with latent concept expansion and a greedily-learned linear ranking model. Although neither model is novel, our experiments provide some insight on the behavior of these two approaches at scale, on collections larger than those previously studied. We also discuss our link-based spam filtering algorithm that operated on the entire web graph of ClueWeb09. Unfortunately, results in the spam track were worse than the baseline provided by the track organizers.
COUNTING TRIANGLES IN MASSIVE GRAPHS WITH MAPREDUCE
Abstract. Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec. total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.
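The wedge sampling idea referred to in this abstract can be sketched in a few lines for the global clustering coefficient. This is a single-process illustration under stated assumptions, not the paper's MapReduce implementation: the function name `wedge_sample_clustering`, the adjacency-dict input format, and the fixed seed are all hypothetical. A wedge is a path of length two; centers are drawn in proportion to the number of wedges they anchor, and the fraction of sampled wedges whose endpoints are also connected estimates the coefficient.

```python
import random


def wedge_sample_clustering(adj, num_samples, seed=0):
    """Estimate the global clustering coefficient by sampling wedges.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    rng = random.Random(seed)
    # Only nodes with at least two neighbors can be the center of a wedge.
    centers = [v for v in adj if len(adj[v]) >= 2]
    if not centers:
        return 0.0
    # A node of degree d is the center of d * (d - 1) / 2 wedges.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for center in rng.choices(centers, weights=weights, k=num_samples):
        # Pick two distinct neighbors uniformly; they form the wedge endpoints.
        a, b = rng.sample(sorted(adj[center]), 2)
        if b in adj[a]:
            closed += 1  # the wedge is closed, i.e. part of a triangle
    return closed / num_samples


# Toy graphs, invented for the example.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
```

On the toy graphs, every wedge in `triangle` is closed and no wedge in `star` is, so the estimates are exactly 1.0 and 0.0 regardless of the random draws; on real graphs the estimate varies with the sample size and seed.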

