Observing and Preventing Leakage in MapReduce
- Cédric Fournet
Citations
1372 | Probabilistic encryption
- Goldwasser, Micali
- 1984
Citation Context ...times split A into M batches $A_m$ such that $A = \Vert_{m \in [1,M]} A_m$. $M(a_i)$ applies an operation M on $a_i$, while $M(A)$ applies the operation element-wise on A. Cryptographic Primitives. Our solutions rely on semantically secure encryption, pseudo-random permutations, and pseudo-random functions. We use the usual cryptographic notions of negligibility and indistinguishability. As we treat cryptographic schemes as abstract building blocks, we avoid committing to either an asymptotic or a concrete security model. Similarly we keep cryptographic keys and their sizes implicit. Semantically secure encryption [14] guarantees that every encryption of the same message is very likely to map to a different ciphertext. That is, given two ciphertexts the adversary cannot distinguish whether they correspond to two encryptions of the same message or encryptions of two different messages. This strong security property is possible due to a probabilistic encryption algorithm that uses a random nonce every time an encryption algorithm is invoked. We use $[p]$ to denote a semantically secure encryption of a plaintext p. We sometimes overload this notation, using $[D]$ to denote an encrypted dataset D, where each record... |
657 | Sorting networks and their applications
- Batcher
- 1968
Citation Context ...huffle Given an encrypted dataset $[D]$ as input, the shuffle yields some permutation $[\pi(D)]$ as output. Since D can be large, we want an efficient implementation of the shuffle within a secure MapReduce framework. Moreover, we want to ensure that the observations about the network traffic that the adversary can make (as described in §4.2) do not leak any information about the data (except its size) and the shuffle permutation π. Hence, an adversary that observes a job implementing either $\pi_0$ or $\pi_1$ should not be able to say whether the output is an encryption of $\pi_0(D)$ or $\pi_1(D)$. Sorting networks [5, 1] provide the security guarantees above: their network traffic is independent of the data. However, since these algorithms perform sorting, they incur a logarithmic depth computational overhead (plus additional constants). Instead, for our solutions, we choose the parallel version of the Melbourne Shuffle [22], which offers the same security guarantees, and we implement it as two successive runs of the MapReduce job described below. We refer to [22] for a detailed analysis of the algorithm. Algorithm 1 Melbourne Shuffle Mapper: Mapper($[d_i, \ldots, d_{i+b}]$) with π, R, max included, for example, ... |
657 | Fully homomorphic encryption using ideal lattices
- Gentry
- 2009
Citation Context ...gained widespread prominence. In particular, the MapReduce framework is routinely used to outsource such tasks in a simple, scalable, and cost-effective manner. As can be expected, reliance on a cloud provider for processing sensitive data entails new integrity and privacy risks. Several recent works explore different trade-offs between performance, security, and (partial) trust in the cloud. Most proposals involve protecting data at rest, using some form of authenticated encryption, and protecting data in use with either advanced cryptography or secure hardware. Although homomorphic encryption [12] may address our privacy concerns, it remains impractical for general processing of large data, in particular when they involve complex, dynamic intermediate data. [Footnotes: *This is an extended version of the work to appear in the proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS 2015). †Work done at Microsoft Research.] Conversely, limited trust assumptions on the cloud infrastructure may lead to efficient solutions, but their actual security guarantees are less clear. As a concrete example, VC3 [26] recently showed that, by relying on the new Intel SGX infrastructure ... |
530 | Probability and computing: randomized algorithms and probabilistic analysis
- Mitzenmacher, Upfal
- 2005 |
309 | Software protection and simulation on oblivious RAMs
- Goldreich, Ostrovsky
- 1996
Citation Context ...rmation on the input datasets.) Our attacks suggest that, even with the use of encryption and secure hardware, stronger methods are required to avoid leakage through traffic analysis. To remedy this problem, we propose a new definition of data privacy for MapReduce—essentially, that observable I/O should look independent of the input dataset—and we describe two practical solutions that meet this definition, and thus prevent our attacks. As a trivial solution, we may pad all accesses and communications to their potential maximal lengths. Similarly, we may apply generic oblivious RAM techniques [13] and oblivious sorting on top of a MapReduce implementation. However, such solutions would incur a polylogarithmic overhead, and they would preclude any speedups enabled by the parallel nature of MapReduce. We further discuss related baseline solutions in §9. Intuitively, many existing mechanisms already in place in MapReduce frameworks to achieve good performance should also help us for privacy. Mappers and reducers often use large I/O buffers, making it harder to track individual records within large, encrypted batches of data. Similarly, for jobs with adequate load balancing, one would expe... |
292 | UCI machine learning repository
- Bache
- 2013
Citation Context ...ords, we show that it remains possible when the input records are somewhat sorted, and that MapReduce traffic still leaks information about many statistics in the input data. Our goal is not to uncover new facts about these datasets, readily available from their plaintext, but to show that, more surprisingly, those facts are also available to an adversary that merely observes encrypted traffic. Our experiments also suggest that naive techniques based on padding inputs and outputs would be of limited value for these datasets. Our experiments are based on two datasets: • U.S. 1990 Census Sample [18] (900 MB). The dataset contains 2.5 million personal records. Every record has 120 attributes, including the Age, Gender, POW (place of work), POB (place of birth), MS (marital status), etc. Some attributes have been discretized: for instance, Age ranges over 8 age groups, such as 20–29. • New York 2013 Taxi Rides [28] (24 GB). This dataset contains records for all the taxi rides (yellow cabs) in New York City in 2013. It is split into 12 monthly segments, and each segment contains approximately 14 million records. The records have 14 attributes and describe trip details including the hashed lic... |
227 | Introduction to Modern Cryptography
- Katz, Lindell
- 2008
Citation Context ...ifferent ciphertext. That is, given two ciphertexts the adversary cannot distinguish whether they correspond to two encryptions of the same message or encryptions of two different messages. This strong security property is possible due to a probabilistic encryption algorithm that uses a random nonce every time an encryption algorithm is invoked. We use $[p]$ to denote a semantically secure encryption of a plaintext p. We sometimes overload this notation, using $[D]$ to denote an encrypted dataset D, where each record may be encrypted separately. The second primitive, a pseudo-random permutation π [16], is an efficiently computable keyed permutation function. Its security property is expressed as indistinguishability from a truly random permutation. That is, if an adversary observes an output of π and a truly random permutation, he is not able to distinguish the two. We use $\pi(i)$ to denote the location of the ith record according to π and, again overloading notation, use $\pi(D)$ to denote a dataset that contains the records of D permuted according to π. The third primitive, a pseudo-random function f [16], is a keyed cryptographic primitive that is indistinguishable from a truly random function w... |
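The keyed pseudo-random function in the excerpt above can be illustrated with a short sketch. It assumes HMAC-SHA256 as the PRF, which is a common instantiation but not one the excerpt names. The function name `prf_reducer`, the key, and the reducer count are hypothetical, and mapping a record key to a reducer number is just one possible use.

```python
import hashlib
import hmac


def prf_reducer(secret_key: bytes, record_key: bytes, num_reducers: int) -> int:
    """Map a record key to a reducer index with a keyed PRF.

    HMAC-SHA256 is an assumed stand-in for the abstract PRF f. Without
    secret_key, the resulting indices should look random to an observer.
    """
    digest = hmac.new(secret_key, record_key, hashlib.sha256).digest()
    # Reduce the 256-bit output to an index; the small modulo bias is ignored here.
    return int.from_bytes(digest, "big") % num_reducers


if __name__ == "__main__":
    key = b"hypothetical-job-key"
    for k in (b"alice", b"bob", b"alice"):
        # The same record key always yields the same reducer under one secret key.
        print(k, prf_reducer(key, k, 8))
```

The function is deterministic for a fixed secret key, so equal record keys still meet at the same reducer, while a party without the key cannot predict the assignment.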
123 | CryptDB: protecting confidentiality with encrypted query processing
- Popa, Redfield, et al.
- 2011
Citation Context ...sus 76 122 Taxi Rides Jan 160 153 performing the shuffle do. The Shuffle-in-the-Middle may still be of interest if one cannot perform the offline phase (shuffling) of the Shuffle & Balance solution before running MapReduce jobs. We also note that if hiding key distribution is not required, the Shuffle & Balance solution can perform a “lighter” version of the online phase and reduce the amount of padding. In this case, one pads only to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be expl... |
85 | Towards estimation error guarantees for distinct values
- Charikar, Chaudhuri, et al.
- 2000
Citation Context ... be enough for estimating α well (e.g., consider a map function with a very restrictive filter). In this case, we may need to increase s to get a higher number of sampled intermediate key-value pairs. We note that in both cases we chose to overestimate our parameters. Overestimation leads to higher padding cost but lower probability of failure of the mapper protocol. Estimating |K|. Given the output of a map function on a sample we wish to estimate the number of distinct keys in the output the MapReduce job will produce on the whole dataset D. To this end, we use the estimation technique from [9] to set our estimate to an upper bound on |K|: $\sum_{i>1} e_i + \frac{|D|}{|X_s|} e_1$, where $e_i$ is the number of distinct keys in $X_s$ that occur exactly i times in $X_s$. We note that better estimates can be achieved if prior information on distribution of K is known in advance. 7.3 Mapping vs. Bin Packing In this section, we explain how we use the statistics collected in the online stage to produce a secure balanced assignment. In particular, we explain how to allocate sufficient bandwidth between mappers and reducers to fit any distribution with the same α. (Recall that Definition 3 enables us to leak only the ... |
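The upper bound on |K| quoted in the excerpt above can be computed as in the following sketch. It is a sketch under assumptions: it reads the bound as the number of keys seen more than once plus (|D| / |X_s|) times the number of keys seen exactly once, and it takes |X_s| to be the number of sampled key-value pairs. The function name and arguments are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def estimate_distinct_keys_upper(sample_keys: Iterable[Hashable],
                                 dataset_size: int) -> float:
    """Upper-bound estimate of the number of distinct keys in the full output.

    sample_keys: keys of the sampled intermediate pairs (X_s, assumed reading).
    dataset_size: |D|, the size of the whole dataset.
    """
    keys = list(sample_keys)
    if not keys:
        raise ValueError("sample must be non-empty")
    # occurrences[k] = number of times key k appears in the sample.
    occurrences = Counter(keys)
    # e[i] = number of distinct keys that appear exactly i times.
    e = Counter(occurrences.values())
    seen_more_than_once = sum(count for i, count in e.items() if i > 1)
    # Keys seen exactly once are scaled up by |D| / |X_s|.
    scale = dataset_size / len(keys)
    return seen_more_than_once + scale * e.get(1, 0)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c", "c", "c", "d"]
    # e_1 = 2 (b, d), keys seen more than once = 2 (a, c), scale = 70 / 7 = 10.
    print(estimate_distinct_keys_upper(sample, 70))  # 22.0
```

Singleton keys dominate the estimate because each one may stand for many unseen keys in the rest of the dataset, which is why the bound errs on the side of overestimation.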
82 | Airavat: Security and Privacy for MapReduce
- Roy, Setty, et al.
- 2010
Citation Context ...stedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph... |
66 | Privacy-preserving access of outsourced data via oblivious RAM simulation
- Goodrich, Mitzenmacher
- 2011
Citation Context ...fferential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of MapReduce that resembles a sequential version of our Shuffle-in-the-Middle solution using a sorting network instead of a shuffle to protect against traffic analysis. This method can be parallelized using a step from §6 where oblivious sorting uses a reducer number (computed as a pseudo-random function of each key) to sort key-value pairs and returns reducer keys in the clear. In independent parallel work, Dinh et al. [11] also consider securing MapReduce using a mix network to shuffle traffic between mappers and reducers. The three solutions above rely either on obliv... |
37 | Innovative instructions and software model for isolated execution.
- Mckeen, Alexandrovich, et al.
- 2013
Citation Context ... may address our privacy concerns, it remains impractical for general processing of large data, in particular when they involve complex, dynamic intermediate data. Conversely, limited trust assumptions on the cloud infrastructure may lead to efficient solutions, but their actual security guarantees are less clear. As a concrete example, VC3 [26] recently showed that, by relying on the new Intel SGX infrastructure [19] to protect local mapper and reducer processing, one can adapt the popular Hadoop framework [2] and achieve strong integrity and confidentiality for large MapReduce tasks with a small performance overhead. All data is systematically AES-GCM-encrypted, except when processed within hardware-protected, remotely-attested enclaves that include just the code for mapping and reducing data, whereas the rest of the Hadoop distributed infrastructure need not be trusted. They report an average 4% performance overhead for typical MapReduce jobs. Trusting Intel’s CPUs may be adequate for many commercial ap... |
29 | Processing analytical queries over encrypted data.
- Tu, Kaashoek, et al.
- 2013
Citation Context ...ot required, the Shuffle & Balance solution can perform a “lighter” version of the online phase and reduce the amount of padding. In this case, one pads only to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map... |
28 | Simple demographics often identify people uniquely.
- Sweeney
- 2000
Citation Context ...pper can be correlated with intermediate key-value pairs in A, and (2) there is variation between values in each row and column. The first condition depends on how mappers read and write their input and output, while the second condition depends on the data. Network traffic from two or more jobs can easily be combined and lead to a ‘job composition’ attack. The adversary observes a matrix A from each job and, as long as the same input data is used, he can label each input batch with the results of such inferences. For example, he can observe jobs on zip code, gender and date of birth. Sweeney [27] showed that combinations of such simple demographics often already identify people uniquely. In the rest of this section we show how the adversary can still correlate mappers’ inputs and outputs for less trivial input datasets. Granularity: observing traffic on input batches. In general a mapper can process a sequence (or batch) of records (to amortize the cost of encryption, for example). If the mapper reads a batch, there are several ways in which it could control its I/O. For example, it could sequentially read a record and immediately return the corresponding key-value pair; it could buffe... |
22 | Bin packing approximation algorithms: Survey and classification.
- Coffman Jr., Csirik, et al.
- 2013
Citation Context ...arameters (Section 7.2). Our problem, at heart, is an instance of bin packing, so we first review bin packing basics before giving our algorithm. Bin packing. The (offline) bin packing instance is expressed in terms of a fixed bin capacity c and a list of items, each with a weight at most c. (In our case, a key is viewed as an item and its frequency as its weight.) The goal of bin packing algorithms is to minimize the number of bins N needed to allocate all items without an overflow. Since the offline bin-packing problem is NP-hard, approximation algorithms, such as First Fit Decreasing (FFD) [10], return both a number of bins and a guarantee on how far it can be from the optimum in the worst case. The FFD algorithm places items in decreasing weight order, allocating new bins on demand: it places the heaviest item in the first bin, then proceeds with the next item and tries to place it into one of the open bins (i.e., the bins that already have items) without exceeding its capacity. If there is no space left in any open bin, it places the item in a new bin. Bin Packing, Obliviously. Our problem is more general: given only some maximal item weight α, we must find a bin capacity c and an... |
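The First Fit Decreasing procedure described in the excerpt above can be sketched as follows. This is a minimal illustration of standard FFD only, not of the paper's oblivious variant; the function name and the sample weights are hypothetical.

```python
from typing import List, Sequence


def first_fit_decreasing(weights: Sequence[int], capacity: int) -> List[List[int]]:
    """Pack items into bins of a fixed capacity using First Fit Decreasing."""
    if any(w > capacity for w in weights):
        raise ValueError("every item weight must be at most the bin capacity")
    bins: List[List[int]] = []   # items placed in each open bin
    remaining: List[int] = []    # free capacity left in each open bin
    # Place items from heaviest to lightest.
    for w in sorted(weights, reverse=True):
        for idx, free in enumerate(remaining):
            if w <= free:
                # First open bin with enough room takes the item.
                bins[idx].append(w)
                remaining[idx] -= w
                break
        else:
            # No open bin fits the item, so open a new one.
            bins.append([w])
            remaining.append(capacity - w)
    return bins


if __name__ == "__main__":
    # In the paper's setting a key is an item and its frequency is its weight.
    print(first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10))
    # [[8, 2], [4, 4, 1, 1]]
```

Sorting first means the heavy items claim bins early and the light items fill the gaps, which is what gives FFD its worst-case guarantee relative to the optimum.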
19 | SecureMR: A service integrity assurance framework for MapReduce
- Wei, Du, et al.
- 2009
Citation Context ...d data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of Map... |
18 | Orthogonal security with Cipherbase.
- Arasu, Blanas, et al.
- 2013
Citation Context ...istribution is not required, the Shuffle & Balance solution can perform a “lighter” version of the online phase and reduce the amount of padding. In this case, one pads only to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possi... |
14 | Shielding Applications from an Untrusted Cloud with Haven.
- Baumann, Peinado, et al.
- 2014
Citation Context ...y to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et ... |
11 | The HybrEx Model for Confidentiality and Privacy in Cloud Computing
- Ko, Jeon, et al.
- 2011
Citation Context ...in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of MapReduce that resembles a sequential version of our Shuffle-in-the-Middle solution using a sorting network instead of a shuffle to protect against traffic analysis. This method can be parallelized using a step from §6 where oblivious sorting uses a reducer number (co... |
9 | Controlled-Channel Attacks: Deterministic Side Channels for Untrusted Operating Systems.
- Xu, Cui, et al.
- 2015
Citation Context ...identiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive proc... |
7 | GraphSC: Parallel secure computation made easy.
- Nayak, Wang, et al.
- 2015
Citation Context ...-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of MapReduce that resembles a sequential version of our Shuffle-in-the-Middle solution using a sorting network instead of a shuffle to protect against traffic analysis. This method can be parallelized using a step from §6 where oblivious sorting uses a reducer number (computed as a pseudo-random function of each key) to sort key-value pairs and returns reducer keys in the clear. In independent parallel work... |
7 | MrCrypt: Static analysis for secure cloud computations.
- Tetali, Lesani, et al.
- 2013
Citation Context ...ides Jan 160 153 performing the shuffle do. The Shuffle-in-the-Middle may still be of interest if one cannot perform the offline phase (shuffling) of the Shuffle & Balance solution before running MapReduce jobs. We also note that if hiding key distribution is not required, the Shuffle & Balance solution can perform a “lighter” version of the online phase and reduce the amount of padding. In this case, one pads only to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems ... |
5 | TrustedDB: A trusted hardware-based database with privacy and data confidentiality.
- Bajaj, Sion
- 2014
Citation Context ... if hiding key distribution is not required, the Shuffle & Balance solution can perform a “lighter” version of the online phase and reduce the amount of padding. In this case, one pads only to hide the difference between mappers’ inputs without executing bin packing. 9. RELATED WORK Several systems protect confidentiality of data in the cloud. CryptDB [24] and MrCrypt [29] use partial homomorphic encryption to run some computations on encrypted data; they neither protect confidentiality of code, nor guarantee the integrity of results. On the upside, they do not use trusted hardware. TrustedDB [4], Cipherbase [3], and Monomi [31] use trusted hardware to process database queries over encrypted data, but do not protect the confidentiality and integrity of all code and data. Haven [6] can run databases on a single machine. All systems above are vulnerable to side-channel attacks. For example, Xu et al. [33] show how side-channel attacks can be exploited in systems such as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defen... |
5 | PRISM: privacy-preserving search in MapReduce.
- Blass, Pietro, et al.
- 2012
Citation Context ...uch as Haven where an untrusted operating system controls page faults. We also refer the reader to [33] for an overview on side-channel attacks. Several security-enhanced MapReduce systems have been proposed. Airavat [25] defends against possibly malicious map function implementations using differential privacy. SecureMR [32] is an integrity enhancement for MapReduce that relies on redundant computations. Ko et al. propose a hybrid security model for MapReduce where sensitive data is handled in a private cloud while non-sensitive processing is outsourced to a public cloud provider [17]. PRISM [7] is a privacy-preserving word search scheme for MapReduce that utilizes private information retrieval methods. Nayak et al. [21] propose a programming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of MapReduce that resembles a sequential version of our Shuffle-in-the-Middle solution using a sorting network instead of a shuffle to protect against traffic analysis. This method can be parallelized using a step from §6 where oblivious sorting uses a reducer number (computed as a... |
4 | Oblivious parallel RAM. Cryptology ePrint Archive, Report 2014/594
- Boyle, Chung, et al.
- 2014 |
4 | VC3: Trustworthy data analytics in the cloud using SGX.
- Schuster, Costa, et al.
- 2015
Citation Context ...nced cryptography or secure hardware. Although homomorphic encryption [12] may address our privacy concerns, it remains impractical for general processing of large data, in particular when they involve complex, dynamic intermediate data. Conversely, limited trust assumptions on the cloud infrastructure may lead to efficient solutions, but their actual security guarantees are less clear. As a concrete example, VC3 [26] recently showed that, by relying on the new Intel SGX infrastructure [19] to protect local mapper and reducer processing, one can adapt the popular Hadoop framework [2] and achieve strong integrity and confidentiality for large MapReduce tasks with a small performance overhead. All data is systematically AES-GCM-encrypted, except when processed within hardware-protected, remotely-attested enclaves that include just the code for mapping and reducing data, whereas the rest of the Hadoop distributed infrastructure need not be trusted. They report an average 4% performance overhead for typical Ma... |
3 | M2R: Enabling stronger privacy in MapReduce computation.
- Dinh, Saxena, et al.
- 2015
Citation Context ...gramming model for secure parallel processing of data represented as a graph using oblivious sorting and garbled circuits. Goodrich and Mitzenmacher [15] describe a simulation of MapReduce that resembles a sequential version of our Shuffle-in-the-Middle solution using a sorting network instead of a shuffle to protect against traffic analysis. This method can be parallelized using a step from §6 where oblivious sorting uses a reducer number (computed as a pseudo-random function of each key) to sort key-value pairs and returns reducer keys in the clear. In independent parallel work, Dinh et al. [11] also consider securing MapReduce using a mix network to shuffle traffic between mappers and reducers. The three solutions above rely either on oblivious sort or mix network, and thus incur a logarithmic depth overhead. In comparison, our use of the Melbourne Shuffle in our first solution, Shuffle in the Middle, requires only two additional map-reduce jobs, and incurs a constant depth overhead. Besides, our second solution, Shuffle & Balance, dominates the first, even with the Melbourne Shuffle: the security guarantees are stronger (it hides key distributions, mapper output sizes, and reducer ... |
3 | The Melbourne shuffle: Improving oblivious storage in the cloud.
- Ohrimenko, Goodrich, et al.
- 2014
Citation Context ... SHUFFLE-IN-THE-MIDDLE SOLUTION Our first solution prevents intermediate traffic analysis on a job by securely shuffling all the key-value pairs produced by the Mappers and consumed by the Reducers. Hence, the adversary may still observe the volume of intermediate traffic for each mapper and for each reducer, but it cannot trace traffic from reducers back to individual mappers. We present our solution using a data shuffle algorithm as a black box that, given [X] and a pseudo-random permutation π on 1, ..., |X|, returns [π(X)]. We then describe our implementation of the Melbourne Shuffle algorithm [22] using MapReduce jobs (§6.1). We finally show that our solution meets Definition 2 (§6.2). Let X_M be the output of the mappers, and X_R the output of the shuffle passed to the reducers. X_M and π are given as input to a data shuffle job to permute the records. The output X_R of the shuffle is then grouped and sent to the Reducers by the MapReduce framework, as before. Figure 7: Overview of the Shuffle-in-the-Middle solution (see §6) where all data elements are encrypted and Mapper and Reducer code is executed inside of the secure region. In our solution the shuffle is implemented as two MapReduce... |
2 | On taxis and rainbows: Lessons from NYC’s improperly anonymized taxi logs.
- Pandurangan
- 2014
Citation Context ...ords for all the taxi rides (yellow cabs) in New York City in 2013. It is split into 12 monthly segments, and each segment contains approximately 14 million records. The records have 14 attributes and describe trip details including the hashed license number, pickup date and time, drop-off date and time, and number of passengers. The first dataset is representative of personal data commonly stored in the databases of medical institutions, insurance companies, and banks. The second dataset contains sensitive information and, despite some basic anonymization, is susceptible to inference attacks [23, 30]. Some of these attacks use MapReduce [23] to extract correlation between the rides (in plaintext). We show that the same kind of information can also be extracted by traffic analysis. In this section, the adversary is assumed to have the following subset of the capabilities described in §4.1. He observes only basic aggregate jobs, which all go as follows: M splits the records, with the attribute used for aggregation (e.g., the Age) as key; hence R receives all records with the same attribute value; it may return their count, or any other function of their contents. He is also assumed to have ... |
2 | Riding with the stars: Passenger privacy in the NYC taxicab dataset.
- Tockar
- 2014
Citation Context ...ords for all the taxi rides (yellow cabs) in New York City in 2013. It is split into 12 monthly segments, and each segment contains approximately 14 million records. The records have 14 attributes and describe trip details including the hashed license number, pickup date and time, drop-off date and time, and number of passengers. The first dataset is representative of personal data commonly stored in the databases of medical institutions, insurance companies, and banks. The second dataset contains sensitive information and, despite some basic anonymization, is susceptible to inference attacks [23, 30]. Some of these attacks use MapReduce [23] to extract correlation between the rides (in plaintext). We show that the same kind of information can also be extracted by traffic analysis. In this section, the adversary is assumed to have the following subset of the capabilities described in §4.1. He observes only basic aggregate jobs, which all go as follows: M splits the records, with the attribute used for aggregation (e.g., the Age) as key; hence R receives all records with the same attribute value; it may return their count, or any other function of their contents. He is also assumed to have ... |
1 | An O(n log n) sorting network.
- Ajtai, Komlos, et al.
- 1983
Citation Context ...huffle Given an encrypted dataset [D] as input, the shuffle yields some permutation [π(D)] as output. Since D can be large, we want an efficient implementation of the shuffle within a secure MapReduce framework. Moreover, we want to ensure that the observations about the network traffic that the adversary can make (as described in §4.2) do not leak any information about the data (except its size) and the shuffle permutation π. Hence, an adversary that observes a job implementing either π_0 or π_1 should not be able to say whether the output is an encryption of π_0(D) or π_1(D). Sorting networks [5, 1] provide the security guarantees above: their network traffic is independent of the data. However, since these algorithms perform sorting, they incur a logarithmic depth computational overhead (plus additional constants). Instead, for our solutions, we choose the parallel version of the Melbourne Shuffle [22], which offers the same security guarantees, and we implement it as two successive runs of the MapReduce job described below. We refer to [22] for a detailed analysis of the algorithm. Algorithm 1 Melbourne Shuffle Mapper: Mapper([d_i, ..., d_{i+b}]) with π, R, max included, for example, ... |