RDFPath: Path Query Processing on Large RDF Graphs with MapReduce
Abstract. The MapReduce programming model has gained traction in different application areas in recent years, ranging from the analysis of log files to the computation of the RDFS closure. Yet, for most users the MapReduce abstraction is too low-level since even simple computations have to be expressed as Map and Reduce phases. In this paper we propose RDFPath, an expressive RDF path query language geared towards casual users that benefits from the scaling properties of the MapReduce framework by automatically transforming declarative path queries into MapReduce jobs. Our evaluation on a real world data set shows the applicability of RDFPath for investigating typical graph properties like shortest paths.
Parallel Data Processing with MapReduce: A Survey
MapReduce, a prominent parallel data processing tool, is gaining significant momentum in both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates about its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.
I V E R
Parallel Database Management Systems are the dominant technology used for large-scale data analysis. The experience of query evaluation techniques used by Database Management Systems combined with the processing power offered by parallelism are some of the reasons for the wide use of the technology. On the other hand, MapReduce is a new technology that is spreading quickly and becoming a commonly used tool for processing large amounts of data. Fault tolerance, parallelism, and scalability are only some of the characteristics that the framework can provide to any system based on it. The basic idea behind this work is to modify the query evaluation techniques used by parallel Database Management Systems so that they use the Hadoop MapReduce framework as the underlying execution engine. For the purposes of this work we have focused on join evaluation. We have designed and implemented three algorithms which modify the data-flow of the MapReduce framework in order to simulate the data-flow that parallel Database Management Systems use to execute query evaluation. More specifically, we have implemented three algorithms that execute parallel hash join: Simple Hash Join is the implementation
Contents lists available at SciVerse ScienceDirect
journal homepage: www.elsevier.com/locate/pmc Fast track article Looking ahead in pervasive computing: Challenges and opportunities in
Social Content Matching in MapReduce
Matching problems are ubiquitous. They occur in economic markets, labor markets, internet advertising, and elsewhere. In this paper we focus on an application of matching for social media. Our goal is to distribute content from information suppliers to information consumers. We seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity, e.g., ensuring that no consumer is overwhelmed with data and that all suppliers have chances to deliver their content. We propose two matching algorithms, GreedyMR and StackMR, geared for the MapReduce paradigm. Both algorithms have provable approximation guarantees, and in practice they produce high-quality solutions. While both algorithms scale extremely well, we can show that StackMR requires only a poly-logarithmic number of MapReduce steps, making it an attractive option for applications with very large datasets. We experimentally show the trade-offs between quality and efficiency of our solutions on two large datasets coming from real-world social-media web sites.
Abstract
, 2011
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference of LDA. In this paper, we propose a technique called MapReduce LDA (Mr. LDA) to accommodate very large corpus collections in the MapReduce framework. In contrast to other techniques to scale inference for LDA, which use Gibbs sampling, we use variational inference. Our solution efficiently distributes computation and is relatively simple to implement. More importantly, this variational implementation, unlike highly tuned and specialized implementations, is easily extensible. We demonstrate two extensions of the model possible with this scalable framework: informed priors to guide topic discovery and modeling topics from a multilingual corpus.
Parallel Collaborative Filtering for Streaming Data
, 2011
We present a distributed stochastic gradient descent algorithm for performing low-rank matrix factorization on streaming data. Low-rank matrix factorization is often used as a technique for collaborative filtering. As opposed to recent algorithms that perform matrix factorization in parallel on a batch of training examples [4], our algorithm operates on a stream of incoming examples. We experimentally compare our algorithm with a state-of-the-art method for performing low-rank matrix factorization on batch data.
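As a rough illustration of what a per-example update on a stream can look like, here is a minimal single-worker sketch of online stochastic gradient descent for low-rank matrix factorization. It is not the paper's distributed algorithm: the class name `StreamingMF`, the constant initialization, the rank, and the learning rate are all assumptions made for this example, and regularization is left out.

```python
from collections import defaultdict


class StreamingMF:
    """Online SGD for low-rank matrix factorization (illustrative sketch only)."""

    def __init__(self, rank=2, learning_rate=0.1, init=0.1):
        self.learning_rate = learning_rate
        # Factor vectors are created lazily the first time a user or item is seen.
        # Constant initialization is an assumption that keeps the example deterministic.
        self.users = defaultdict(lambda: [init] * rank)
        self.items = defaultdict(lambda: [init] * rank)

    def predict(self, user, item):
        # Predicted rating is the dot product of the two factor vectors.
        return sum(p * q for p, q in zip(self.users[user], self.items[item]))

    def update(self, user, item, rating):
        # One gradient step on the squared error of a single incoming example.
        p, q = self.users[user], self.items[item]
        step = self.learning_rate * (rating - self.predict(user, item))
        # Both updates use the old values of p and q.
        self.users[user] = [pi + step * qi for pi, qi in zip(p, q)]
        self.items[item] = [qi + step * pi for pi, qi in zip(p, q)]


model = StreamingMF()
for _ in range(200):
    # Hypothetical stream: the same (user, item, rating) example arriving repeatedly.
    model.update("u1", "i1", 1.0)
```

Each example is consumed once as it arrives and only the two affected factor vectors change, which is what distinguishes this from batch factorization over a stored training set.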
HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce
We propose a set of open-source software modules to perform structured Perceptron Training, Prediction and Evaluation within the Hadoop framework. Apache Hadoop is a freely available environment for running distributed applications on a computer cluster. The software is designed within the MapReduce paradigm. Thanks to distributed computing, the proposed software substantially reduces execution times while handling huge data sets. The distributed Perceptron training algorithm preserves convergence properties and thus guarantees the same accuracy as the serial Perceptron. The presented modules can be executed as stand-alone software or easily extended or integrated in complex systems. The execution of the modules applied to specific NLP tasks can be demonstrated and tested via an interactive web interface that allows the user to inspect the status and structure of the cluster and interact with the MapReduce jobs.
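The abstract does not say how training is distributed. One common scheme for distributing perceptron training is iterative parameter mixing, where each map task runs one perceptron epoch over its shard starting from the current weights and the reduce step averages the resulting weight vectors. The sketch below assumes that scheme and uses a plain binary perceptron without a bias term rather than a structured one; the function names `train_shard`, `mix`, and `train` are hypothetical and not part of the toolkit.

```python
def train_shard(weights, shard):
    """Map step: one perceptron epoch over a single shard, starting from shared weights."""
    w = list(weights)
    for features, label in shard:
        score = sum(wi * xi for wi, xi in zip(w, features))
        if label * score <= 0:  # misclassified (or on the boundary): apply the update
            w = [wi + label * xi for wi, xi in zip(w, features)]
    return w


def mix(weight_vectors):
    """Reduce step: uniform average of the per-shard weight vectors."""
    n = len(weight_vectors)
    return [sum(column) / n for column in zip(*weight_vectors)]


def train(shards, dim, epochs):
    """Alternate map and reduce steps for a fixed number of epochs."""
    w = [0.0] * dim
    for _ in range(epochs):
        w = mix([train_shard(w, shard) for shard in shards])
    return w


# Hypothetical toy data: two shards of (features, label) pairs with labels in {+1, -1}.
shards = [
    [((1.0, 0.0), 1), ((-1.0, 0.0), -1)],
    [((2.0, 1.0), 1), ((-2.0, -1.0), -1)],
]
weights = train(shards, dim=2, epochs=2)
```

On a real cluster the list comprehension in `train` would be replaced by one map task per shard; here everything runs in a single process to keep the data flow visible.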
V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a small real dataset. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL never finished. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.
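To make the two-stage idea concrete, here is a small single-process sketch of an all-pairs multiset similarity join. It is not the V-SMART-Join algorithm itself: the similarity measure is assumed to be Ruzicka (generalized Jaccard) similarity, which the abstract does not name, and the function name `similar_pairs` and the sample data are hypothetical. The first stage collects partial results (per-entity cardinalities and per-pair overlaps found through an inverted index), and the second stage computes the exact similarity for every candidate pair.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_pairs(multisets, threshold):
    """Return {(a, b): similarity} for entity pairs with Ruzicka similarity >= threshold."""
    # Stage 1a: cardinality of each multiset (sum of element multiplicities).
    cardinality = {entity: sum(counts.values()) for entity, counts in multisets.items()}

    # Stage 1b: inverted index from element to the entities that contain it.
    index = defaultdict(list)
    for entity, counts in multisets.items():
        for element, multiplicity in counts.items():
            index[element].append((entity, multiplicity))

    # Stage 1c: accumulate the multiset intersection size for every co-occurring pair.
    overlap = defaultdict(int)
    for postings in index.values():
        for (a, ma), (b, mb) in combinations(sorted(postings), 2):
            overlap[(a, b)] += min(ma, mb)

    # Stage 2: exact similarity for each candidate pair, using |A ∪ B| = |A| + |B| - |A ∩ B|.
    result = {}
    for (a, b), intersection in overlap.items():
        union = cardinality[a] + cardinality[b] - intersection
        similarity = intersection / union
        if similarity >= threshold:
            result[(a, b)] = similarity
    return result


# Hypothetical data: each IP is a multiset of the cookies observed with it.
ips = {
    "ip1": Counter({"c1": 2, "c2": 1}),
    "ip2": Counter({"c1": 1, "c2": 1}),
    "ip3": Counter({"c3": 4}),
}
pairs = similar_pairs(ips, threshold=0.5)
```

Pairs that share no element never appear in the inverted index together, so they are never scored. The quadratic pair enumeration per element is exactly the kind of cost that skewed data makes expensive in practice.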
MapReduce for Parallel Reinforcement Learning
Abstract. We investigate the parallelization of reinforcement learning algorithms using MapReduce, a popular parallel computing framework. We present parallel versions of several dynamic programming algorithms, including policy evaluation, policy iteration, and off-policy updates. Furthermore, we design parallel reinforcement learning algorithms to deal with large-scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms. We give time and space complexity analysis of the proposed algorithms. This study demonstrates how parallelization opens new avenues for solving large-scale reinforcement learning problems.
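One way to picture policy evaluation in map and reduce terms is sketched below, under stated assumptions. It is not the paper's formulation: the transition tuple layout `(state, next_state, probability, reward)`, the function names, and the fixed number of sweeps are all choices made for this example. Each map call turns one transition under the fixed policy into a partial Bellman backup keyed by its source state, and the reduce step sums the partial backups per state.

```python
from collections import defaultdict


def map_phase(transitions, values, gamma):
    """Emit (state, partial backup) for every transition under the fixed policy."""
    for state, next_state, probability, reward in transitions:
        yield state, probability * (reward + gamma * values[next_state])


def reduce_phase(pairs):
    """Sum the partial backups that share the same state key."""
    totals = defaultdict(float)
    for state, contribution in pairs:
        totals[state] += contribution
    return dict(totals)


def evaluate_policy(transitions, gamma, sweeps):
    """Run a fixed number of synchronous sweeps, starting from all-zero values."""
    states = {s for t in transitions for s in (t[0], t[1])}
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        updated = reduce_phase(map_phase(transitions, values, gamma))
        # States with no outgoing transition keep a value of zero.
        values = {s: updated.get(s, 0.0) for s in states}
    return values


# Hypothetical two-state chain: A moves to B with reward 1, B loops on itself with reward 0.
transitions = [("A", "B", 1.0, 1.0), ("B", "B", 1.0, 0.0)]
values = evaluate_policy(transitions, gamma=0.5, sweeps=10)
```

Because every backup in a sweep reads only the previous sweep's values, the map calls are independent of each other, which is the property that makes this form of the computation easy to spread across workers.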

