Results 1  10
of
18
B.: Deriving Private Information from Randomized Data
 37–48, ACM SIGMOD Conference
, 2005
"... Randomization has emerged as a useful technique for data disguising in privacypreserving data mining. Its privacy properties have been studied in a number of papers. Kargupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. ..."
Abstract

Cited by 87 (2 self)
 Add to MetaCart
Randomization has emerged as a useful technique for data disguising in privacypreserving data mining. Its privacy properties have been studied in a number of papers. Kargupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized. We believe that the key factor is the correlations among attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based our proposed data reconstructions schemes. Our studies have shown that when the correlations are high, the original data can be reconstructed more accurately, i.e., more private information can be disclosed. To improve privacy, we propose a modified randomization scheme, in which we let the correlation of random noises “similar ” to the original data. Our results have shown that the reconstruction accuracy of both PCAbased and BEbased schemes become worse as the similarity increases.
Secure Regression on Distributed Databases
 J. Computational and Graphical Statist
, 2004
"... We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to ma ..."
Abstract

Cited by 25 (14 self)
 Add to MetaCart
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.
A Statistical Framework for Differential Privacy
"... One goal of statistical privacy research is to construct a data release mechanism that protects individual privacy while preserving information content. An example is a random mechanism that takes an input database X and outputs a random database Z according to a distribution Qn(·X). Differential p ..."
Abstract

Cited by 15 (3 self)
 Add to MetaCart
One goal of statistical privacy research is to construct a data release mechanism that protects individual privacy while preserving information content. An example is a random mechanism that takes an input database X and outputs a random database Z according to a distribution Qn(·X). Differential privacy is a particular privacy requirement developed by computer scientists in which Qn(·X) is required to be insensitive to changes in one data point in X. This makes it difficult to infer from Z whether a given individual is in the original database X. We consider differential privacy from a statistical perspective. We consider several datarelease mechanisms that satisfy the differential privacy requirement. We show that it is useful to compare these schemes by computing the rate of convergence of distributions and densities constructed from the released data. We study a general privacy method, called the exponential mechanism, introduced by McSherry and Talwar (2007). We show that the accuracy of this method is intimately linked to the rate at which the probability that the empirical distribution concentrates in a small ball around the true distribution.
Data Dissemination and Disclosure Limitation In a World . . .
 STATIST. SCI
, 2004
"... Given the public's everincreasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdatadata on individual units, such as individuals or establishments. In such a world, an al ..."
Abstract

Cited by 13 (9 self)
 Add to MetaCart
Given the public's everincreasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdatadata on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The riskutility framework is illustrated for regression models.
Privacy preserving analysis of vertically partitioned data using secure matrix products
 J. Official
, 2004
"... Reluctance of statistical agencies and other data owners to share their possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting mutually beneficial analyses. In this paper, we propose a protocol for securely computing matrix products in v ..."
Abstract

Cited by 12 (10 self)
 Add to MetaCart
Reluctance of statistical agencies and other data owners to share their possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting mutually beneficial analyses. In this paper, we propose a protocol for securely computing matrix products in vertically partitioned data, i.e., the data sets have the same subjects but disjoint attributes. This protocol allows data owners to estimate coefficients and standard errors of linear regressions, and to examine regression model diagnostics, without disclosing the values of their attributes to each other or to third parties. The protocol can be used to perform other procedures for which sample means and covariances are sufficient statistics. 1
Privacyaware Regression Modeling of Participatory Sensing Data
"... Many participatory sensing applications use data collected by participants to construct a public model of a system or phenomenon. For example, a health application might compute a model relating exercise and diet to amount of weight loss. While the ultimately computed model could be public, the indi ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
Many participatory sensing applications use data collected by participants to construct a public model of a system or phenomenon. For example, a health application might compute a model relating exercise and diet to amount of weight loss. While the ultimately computed model could be public, the individual input and output data traces used to construct it may be private data of participants (e.g., their individual food intake, lifestyle choices, and resulting weight). This paper proposes and experimentally studies a technique that attempts to keep such input and output data traces private, while allowing accurate model construction. This is significantly different from perturbationbased techniques in that no noise is added. The main contribution of the paper is to show a certain data transformation at the client side that helps keeping the client data private while not introducing any additional error to model construction. We particularly focus on linear regression models which are widely used in participatory sensing applications. We use the data set from a mapbased participatory sensing service to evaluate our scheme. The service in question is a green navigation service that constructs regression models from participant data to predict the fuel consumption of vehicles on road segments. We evaluate our proposed mechanism by providing empirical evidence that: i) an individual data trace is generally hard to reconstruct with any reasonable accuracy, and ii) the regression model constructed using the transformed traces has a much smaller error than one based on additive data
Privacypreserving svm classification on vertically partitioned data
 in PanAsia Conference on Knowledge Discover and Data Mining (PAKDD
, 2006
"... Abstract. Classical data mining algorithms implicitly assume complete access to all data, either in centralized or federated form. However, privacy and security concerns often prevent sharing of data, thus derailing data mining projects. Recently, there has been growing focus on finding solutions to ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
Abstract. Classical data mining algorithms implicitly assume complete access to all data, either in centralized or federated form. However, privacy and security concerns often prevent sharing of data, thus derailing data mining projects. Recently, there has been growing focus on finding solutions to this problem. Several algorithms have been proposed that do distributed knowledge discovery, while providing guarantees on the nondisclosure of data. Classification is an important data mining problem applicable in many diverse domains. The goal of classification is to build a model which can predict an attribute (binary attribute in this work) based on the rest of attributes. We propose an efficient and secure privacypreserving algorithm for support vector machine (SVM) classification over vertically partitioned data. 1
Differential Privacy with Compression
"... Abstract—This work studies formal utility and privacy guarantees for a simple multiplicative database transformation, where the data are compressed by a random linear or affine transformation, reducing the number of data records substantially, while preserving the number of original input variables. ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Abstract—This work studies formal utility and privacy guarantees for a simple multiplicative database transformation, where the data are compressed by a random linear or affine transformation, reducing the number of data records substantially, while preserving the number of original input variables. We provide an analysis framework inspired by a recent concept known as differential privacy. Our goal is to show that, despite the general difficulty of achieving the differential privacy guarantee, it is possible to publish synthetic data that are useful for a number of common statistical learning applications. This includes high dimensional sparse regression [24], principal component analysis (PCA), and other statistical measures [16] based on the covariance of the initial data. I.
Compressed regression
, 2007
"... Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random l ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m ≪ n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for ℓ1regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence. ” In addition, we show that ℓ1regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence. ” Finally, we characterize the privacy properties of the compression procedure in informationtheoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
Secure, privacypreserving analysis of distributed databases
 Technometrics
"... There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unw ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology—specifically, secure multiparty computation and networking—can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the “complement ” of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood and Bayesian analysis. For lowrisk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce incentives to owners not to be honest. 1 1