Results 1 - 10 of 58
Differential privacy via wavelet transforms
- In ICDE, 2010
"... Abstract — Privacy preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ɛ-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve ɛ-differential privacy, however, offer litt ..."
Cited by 100 (8 self)
Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ɛ-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve ɛ-differential privacy, however, offer little data utility. In particular, if the output dataset is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures ɛ-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.
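The recipe described above (apply a wavelet transform to the frequency counts, add calibrated Laplace noise, invert) can be sketched roughly as follows. This is a minimal Python illustration with an unnormalized Haar transform and a single noise scale, not the paper's Privelet instantiation with its per-level weighting; all function names here are placeholders.

    import numpy as np

    def haar_transform(x):
        # Unnormalized Haar decomposition of a length-2^k histogram:
        # store pairwise half-differences, recurse on pairwise averages.
        coeffs = []
        cur = np.asarray(x, dtype=float)
        while len(cur) > 1:
            coeffs.append((cur[0::2] - cur[1::2]) / 2.0)
            cur = (cur[0::2] + cur[1::2]) / 2.0
        coeffs.append(cur)  # overall average
        return coeffs

    def haar_inverse(coeffs):
        cur = coeffs[-1]
        for diff in reversed(coeffs[:-1]):
            nxt = np.empty(2 * len(cur))
            nxt[0::2] = cur + diff
            nxt[1::2] = cur - diff
            cur = nxt
        return cur

    def private_histogram(hist, epsilon):
        # Changing one count by 1 perturbs at most one coefficient per level,
        # and those perturbations sum (in L1) to at most 1, so Laplace(1/epsilon)
        # noise per coefficient gives epsilon-DP by the standard Laplace
        # mechanism.  The paper's weighting is finer and yields better utility;
        # this is only the transform-then-noise skeleton.
        coeffs = haar_transform(hist)
        noisy = [c + np.random.laplace(scale=1.0 / epsilon, size=len(c)) for c in coeffs]
        return haar_inverse(noisy)

    def range_count(noisy_hist, lo, hi):
        # A range-count query touches only O(log n) wavelet coefficients, which
        # is why the noise in its answer grows polylogarithmically rather than
        # linearly in the domain size.
        return float(noisy_hist[lo:hi + 1].sum())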
Differentially private empirical risk minimization
- 2010
"... Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical ris ..."
Cited by 77 (6 self)
Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ɛ-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. We prove theoretical results showing that our algorithms preserve privacy and give generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
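A rough sketch of the objective-perturbation idea for regularized logistic regression: draw a random linear term and add it to the regularized empirical risk before optimizing. The noise distribution below follows the general shape used by Chaudhuri et al. (Gamma-distributed norm, uniform direction), but the curvature-dependent adjustment of the privacy budget is omitted, so treat this as an illustration under simplifying assumptions rather than the paper's exact algorithm.

    import numpy as np
    from scipy.optimize import minimize

    def objective_perturbation_logreg(X, y, eps, lam):
        # Assumes rows of X have L2 norm <= 1 and labels y are in {-1, +1}.
        n, d = X.shape
        # Random linear perturbation: uniformly random direction, with a norm
        # drawn from Gamma(d, 2/eps).
        direction = np.random.normal(size=d)
        direction /= np.linalg.norm(direction)
        b = np.random.gamma(shape=d, scale=2.0 / eps) * direction

        def perturbed_objective(w):
            margins = y * (X @ w)
            loss = np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + exp(-m))
            return loss + 0.5 * lam * w @ w + (b @ w) / n

        # The optimizer's output is released directly; no further noise is added.
        return minimize(perturbed_objective, np.zeros(d), method="L-BFGS-B").x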
Data mining with differential privacy
- In KDD, 2010
"... We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differ-ential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular in-dividual’s record, thereby restricting data leaks through ..."
Cited by 43 (0 self)
We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual’s record, thereby restricting data leaks through the results. The privacy-preserving interface ensures unconditionally safe access to the data and does not require any privacy expertise from the data miner. However, as we show in the paper, a naive utilization of the interface to construct privacy-preserving data mining algorithms could lead to inferior data mining results. We address this problem by considering the privacy and the algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods chosen by the data miner. We demonstrate that this choice could make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.
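To make the "data access interface" concrete, here is a toy sketch assuming a hypothetical PrivateCountInterface that answers count queries with Laplace noise and charges them to a budget, plus a deliberately naive split-selection heuristic built on top of it. The paper's point is precisely that privacy-aware algorithm design (for instance, using the exponential mechanism to score splits) does far better than this kind of naive use of the interface.

    import numpy as np

    class PrivateCountInterface:
        # Hypothetical interface: each count query is answered with Laplace
        # noise of scale 1/eps and charged against a global privacy budget.
        def __init__(self, records, total_eps):
            self.records = records            # list of dicts: attributes + "label"
            self.budget = total_eps

        def noisy_count(self, predicate, eps):
            if eps > self.budget:
                raise RuntimeError("privacy budget exhausted")
            self.budget -= eps
            true_count = sum(1 for r in self.records if predicate(r))
            return true_count + np.random.laplace(scale=1.0 / eps)

    def naive_choose_split(interface, attributes, values, eps_per_query):
        # Naive heuristic: pick the attribute whose noisy per-value class
        # counts look most skewed.  With a small eps the noise easily swamps
        # the signal, which is the failure mode the abstract warns about.
        best_attr, best_score = None, -np.inf
        for attr in attributes:
            score = 0.0
            for v in values[attr]:
                pos = interface.noisy_count(
                    lambda r, a=attr, v=v: r[a] == v and r["label"] == 1, eps_per_query)
                neg = interface.noisy_count(
                    lambda r, a=attr, v=v: r[a] == v and r["label"] == 0, eps_per_query)
                score += abs(pos - neg)
            if score > best_score:
                best_attr, best_score = attr, score
        return best_attr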
Differentially Private Data Cubes: Optimizing Noise Sources and Consistency
"... Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table’s dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishi ..."
Cited by 40 (3 self)
Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table’s dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals’ privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual, and then compute the remaining cuboids from the initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids, such that the maximal noise in all published cuboids will be within a factor (ln |L| + 1)² of the optimal, where |L| is the number of cuboids to be published, or the number of precise cuboids will be within a factor (1 − 1/e) of the optimal. We also show how to enforce consistency in the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the amounts of error of different publishing algorithms, and show that our approaches outperform baselines significantly.
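The structure of the approach can be illustrated with a small pandas sketch: compute a chosen "initial" cuboid with Laplace noise, then derive coarser cuboids by summing its noisy cells, which is pure post-processing and costs no extra privacy budget. The hard part the paper solves, choosing the initial set so the worst-case noise is within a factor (ln |L| + 1)² of optimal, is not reproduced here; the initial set below is simply the full-dimensional cuboid, and the dimension names are invented for the example.

    import numpy as np
    import pandas as pd

    def noisy_cuboid(fact, dims, eps):
        # Count cuboid over `dims`; adding or removing one row of the fact
        # table changes one cell by 1, so Laplace(1/eps) noise gives eps-DP.
        counts = fact.groupby(list(dims)).size().astype(float)
        return counts + np.random.laplace(scale=1.0 / eps, size=len(counts))

    def derive_cuboid(noisy, keep_dims):
        # Coarser cuboid obtained by aggregating already-noisy cells:
        # post-processing, so no additional privacy budget is consumed,
        # although the summed noise grows with the number of merged cells.
        return noisy.groupby(level=list(keep_dims)).sum()

    # Toy usage with hypothetical dimensions A, B, C.
    fact = pd.DataFrame({"A": [0, 0, 1, 1], "B": [0, 1, 0, 1], "C": [1, 1, 0, 0]})
    base = noisy_cuboid(fact, ("A", "B", "C"), eps=1.0)
    cuboid_ab = derive_cuboid(base, ("A", "B"))   # derived, no extra noise draw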
Learning in a large function space: Privacy-preserving mechanisms for SVM learning
- CoRR, abs/0911.5708 (submitted), 2009
"... Abstract. The ubiquitous need for analyzing privacy-sensitive information— including health records, personal communications, product ratings, and social network data—is driving significant interest in privacy-preserving data analysis across several research communities. This paper explores the rele ..."
Cited by 38 (4 self)
The ubiquitous need for analyzing privacy-sensitive information—including health records, personal communications, product ratings, and social network data—is driving significant interest in privacy-preserving data analysis across several research communities. This paper explores the release of Support Vector Machine (SVM) classifiers while preserving the privacy of training data. The SVM is a popular machine learning method that maps data to a high-dimensional feature space before learning a linear decision boundary. We present efficient mechanisms for finite-dimensional feature mappings and for (potentially infinite-dimensional) mappings with translation-invariant kernels. In the latter case, our mechanism borrows a technique from large-scale learning to learn in a finite-dimensional feature space whose inner product uniformly approximates the desired feature space inner product (the desired kernel) with high probability. Differential privacy is established using algorithmic stability, a property used in learning theory to bound generalization error. Utility—when the private classifier is pointwise close to the non-private classifier with high probability—is proven using smoothness of regularized empirical risk minimization with respect to small perturbations to the feature mapping. Finally we conclude with lower bounds on the differential privacy of any mechanism approximating the SVM.
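A sketch of the translation-invariant-kernel case: approximate the RBF kernel with random Fourier features (Rahimi and Recht), train a linear SVM in that finite-dimensional space, and perturb the released weight vector. The Laplace noise scale below is a placeholder; the paper calibrates it from the stability of regularized ERM (it depends on the regularization parameter, the sample size, and the feature map), so this shows the pipeline rather than the calibrated mechanism.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rff_map(d, n_features, gamma, rng):
        # Random Fourier features approximating exp(-gamma * ||x - x'||^2).
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        return lambda X: np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    def private_rbf_svm(X, y, eps, C=1.0, n_features=200, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        phi = rff_map(X.shape[1], n_features, gamma, rng)
        clf = LinearSVC(C=C).fit(phi(X), y)
        w = clf.coef_.ravel()
        # Placeholder noise scale; the paper derives the correct scale from
        # the algorithmic stability of the regularized SVM objective.
        w_private = w + rng.laplace(scale=1.0 / eps, size=w.shape)
        return phi, w_private, float(clf.intercept_[0])

    def predict(phi, w_private, intercept, X_new):
        return np.sign(phi(X_new) @ w_private + intercept)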
iReduct: Differential privacy with reduced relative errors
- In SIGMOD, 2011
"... Prior work in differential privacy has produced techniques for answering aggregate queries over sensitive data in a privacypreserving way. These techniques achieve privacy by adding noise to the query answers. Their objective is typically to minimize absolute errors while satisfying differential pri ..."
Cited by 32 (4 self)
Prior work in differential privacy has produced techniques for answering aggregate queries over sensitive data in a privacy-preserving way. These techniques achieve privacy by adding noise to the query answers. Their objective is typically to minimize absolute errors while satisfying differential privacy. Thus, query answers are injected with noise whose scale is independent of whether the answers are large or small. The noisy results for queries whose true answers are small therefore tend to be dominated by noise, which leads to inferior data utility. This paper introduces iReduct, a differentially private algorithm for computing answers with reduced relative errors. The basic idea of iReduct is to inject different amounts of noise to different query results, so that smaller (larger) values are more likely to be injected with less (more) noise. The algorithm is based on a novel resampling technique that employs correlated noise to improve data utility. Performance is evaluated on an instantiation of iReduct that generates marginals, i.e., projections of multi-dimensional histograms onto subsets of their attributes. Experiments on real data demonstrate the effectiveness of our solution.
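The budget-allocation intuition can be sketched in a simple two-phase form: spend part of the budget on rough estimates, then give queries whose estimates are small a larger share of the remaining budget so their noise scale, and hence their relative error, shrinks. iReduct itself refines answers iteratively using a correlated-noise resampling technique so earlier draws are reused rather than discarded; none of that machinery appears in this toy version.

    import numpy as np

    def two_phase_relative_error(true_answers, eps_total):
        # Assumes each query has sensitivity 1 and budgets compose sequentially.
        answers = np.asarray(true_answers, dtype=float)
        m = len(answers)
        eps1 = eps2 = eps_total / 2.0

        # Phase 1: equal shares of eps1 give rough (heavily noised) estimates.
        rough = answers + np.random.laplace(scale=m / eps1, size=m)
        rough = np.maximum(rough, 1.0)          # guard against non-positive estimates

        # Phase 2: split eps2 inversely to the rough magnitudes, so smaller
        # answers receive more budget and therefore a smaller noise scale.
        weights = (1.0 / rough) / np.sum(1.0 / rough)
        per_query_eps = eps2 * weights
        return answers + np.random.laplace(scale=1.0 / per_query_eps)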
Consensus-based distributed support vector machines
- Journal of Machine Learning Research, 11:1663–1707, 2010
"... This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, t ..."
Cited by 23 (1 self)
This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the desired classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Unlike existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.
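A bare-bones consensus sketch, assuming a global average for simplicity: each node takes a few subgradient steps on its local hinge loss plus the ADMM penalty term, then all nodes agree on an averaged iterate. The paper's algorithm is fully decentralized (nodes average only with graph neighbors), solves the local subproblems exactly, and extends to nonlinear SVMs; this only conveys the split-into-subproblems-with-consensus-constraints structure, and all names are illustrative.

    import numpy as np

    def local_update(w, X, y, z, u, rho, lam, steps=50, lr=0.01):
        # Approximate minimizer of: hinge loss + (lam/2)||w||^2
        #                           + (rho/2)||w - z + u||^2
        # via plain subgradient descent (kept simple on purpose).
        for _ in range(steps):
            margins = y * (X @ w)
            active = margins < 1.0
            grad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
            grad += lam * w + rho * (w - z + u)
            w = w - lr * grad
        return w

    def consensus_linear_svm(node_data, rho=1.0, lam=0.01, iters=50):
        # node_data: list of (X_i, y_i) pairs, one per node; labels in {-1, +1}.
        d = node_data[0][0].shape[1]
        ws = [np.zeros(d) for _ in node_data]
        us = [np.zeros(d) for _ in node_data]
        z = np.zeros(d)
        for _ in range(iters):
            ws = [local_update(w, X, y, z, u, rho, lam)
                  for w, (X, y), u in zip(ws, node_data, us)]
            z = np.mean([w + u for w, u in zip(ws, us)], axis=0)   # consensus variable
            us = [u + w - z for w, u in zip(ws, us)]               # dual updates
        return z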
Adversarial Machine Learning
- 2011
"... In this paper (expanded from an invited talk at AISEC 2010), we discuss an emerging field of study: adversarial machine learning—the study of effective machine learning techniques against an adversarial opponent. In this paper, we: give a taxonomy for classifying attacks against online machine learn ..."
Cited by 19 (5 self)
In this paper (expanded from an invited talk at AISEC 2010), we discuss an emerging field of study: adversarial machine learning—the study of effective machine learning techniques against an adversarial opponent. Specifically, we: give a taxonomy for classifying attacks against online machine learning algorithms; discuss application-specific factors that limit an adversary’s capabilities; introduce two models of an adversary’s capabilities; explore the limits of an adversary’s knowledge about the algorithm, feature space, training, and input data; explore vulnerabilities in machine learning algorithms; discuss countermeasures against attacks; introduce the evasion challenge; and discuss privacy-preserving learning techniques.
Probabilistic Inference and Differential Privacy
"... We identify and investigate a strong connection between probabilistic inference and differential privacy, the latter being a recent privacy definition that permits only indirect observation of data through noisy measurement. Previous research on differential privacy has focused on designing measurem ..."
Cited by 16 (1 self)
We identify and investigate a strong connection between probabilistic inference and differential privacy, the latter being a recent privacy definition that permits only indirect observation of data through noisy measurement. Previous research on differential privacy has focused on designing measurement processes whose output is likely to be useful on its own. We consider the potential of applying probabilistic inference to the measurements and measurement process to derive posterior distributions over the data sets and model parameters thereof. We find that probabilistic inference can improve accuracy, integrate multiple observations, measure uncertainty, and even provide posterior distributions over quantities that were not directly measured.
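For a single Laplace-noised count, the inference described above reduces to multiplying a prior over the true count by the Laplace likelihood of the observation; a grid-based toy version follows. The paper goes considerably further (richer measurement processes, posteriors over model parameters, integration of multiple observations), so this is only the smallest instance of the idea, with all names chosen for the example.

    import numpy as np

    def posterior_over_count(observation, eps, prior, support):
        # p(count | observation) is proportional to
        # prior(count) * LaplacePdf(observation; loc=count, scale=1/eps).
        scale = 1.0 / eps
        likelihood = np.exp(-np.abs(observation - support) / scale) / (2.0 * scale)
        unnormalized = prior * likelihood
        return unnormalized / unnormalized.sum()

    # Toy usage: uniform prior over counts 0..100, one noisy observation.
    support = np.arange(101)
    prior = np.full(len(support), 1.0 / len(support))
    posterior = posterior_over_count(observation=42.3, eps=0.5, prior=prior, support=support)
    # A second independent observation of the same count is integrated by
    # multiplying in its likelihood as well, which tightens the posterior.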
Distributed Autonomous Online Learning: Regrets and Intrinsic Privacy-Preserving Properties
"... Abstract—Online learning has become increasingly popular on handling massive data. The sequential nature of online learning, however, requires a centralized learner to store data and update parameters. In this paper, we consider online learning with distributed data sources. The autonomous learners ..."
Cited by 15 (0 self)
Online learning has become increasingly popular for handling massive data. The sequential nature of online learning, however, requires a centralized learner to store data and update parameters. In this paper, we consider online learning with distributed data sources. The autonomous learners update local parameters based on local data sources and periodically exchange information with a small subset of neighbors in a communication network. We derive a regret bound for strongly convex functions that generalizes the work by Ram et al. (2010) for convex functions. More importantly, we show that our algorithm has intrinsic privacy-preserving properties, and we prove necessary and sufficient conditions for privacy preservation in the network. These conditions imply that for networks with greater-than-one connectivity, a malicious learner cannot reconstruct the subgradients (and sensitive raw data) of other learners, which makes our algorithm appealing in privacy-sensitive applications.
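The protocol outlined above, local subgradient steps interleaved with parameter exchanges over a communication graph, can be sketched as follows (squared loss for concreteness; the sample streams and neighbor lists are assumptions of this toy version). The privacy observation in the paper concerns exactly these exchanged parameters: with network connectivity greater than one, a single malicious neighbor cannot invert them to recover another learner's subgradients.

    import numpy as np

    def distributed_online_learning(streams, neighbors, dim, lr=0.1, rounds=100):
        # streams[i](t) -> (x, y): the t-th sample seen by learner i (assumed API).
        # neighbors[i]: indices of learner i's neighbors in the communication graph.
        n = len(streams)
        W = [np.zeros(dim) for _ in range(n)]
        for t in range(rounds):
            # Local step: each learner descends the subgradient of its own
            # squared loss on its newest sample, with a decaying step size.
            for i in range(n):
                x, y = streams[i](t)
                grad = 2.0 * (W[i] @ x - y) * x
                W[i] = W[i] - (lr / np.sqrt(t + 1)) * grad
            # Exchange step: average parameters with the local neighborhood only
            # (only parameters cross the network, never raw data).
            W = [np.mean([W[i]] + [W[j] for j in neighbors[i]], axis=0)
                 for i in range(n)]
        return W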