Results 1 - 7 of 7
Secure Regression on Distributed Databases
- J. Computational and Graphical Statist, 2004
Cited by 17 (12 self)
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.
Data Dissemination and Disclosure Limitation In a World . . .
- STATIST. SCI, 2004
Cited by 13 (9 self)
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata, that is, data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
Secure, privacy-preserving analysis of distributed databases
- Technometrics
Cited by 4 (3 self)
There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the “complement” of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce owners' incentives to be dishonest.
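The idea of sharing only summed sufficient statistics can be sketched in a few lines of Python. This is a minimal simulation, not the paper's protocol: it assumes a ring-style secure summation with a random mask, a public modulus MODULUS, a fixed-point scale SCALE, and a one-predictor regression, and all function names are hypothetical. All owners are simulated in one process, so the sketch shows only the arithmetic, not the networking.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, far larger than any true sum
SCALE = 10**6        # assumed fixed-point scale for encoding real numbers


def secure_sum(local_values):
    """Simulate ring-based secure summation of one number per owner.

    The first owner adds a secret random mask to its value and passes the
    running total on; every later owner adds its own value; the first
    owner finally removes the mask. Each intermediate total is masked, so
    an owner sees no usable partial sum.
    """
    encoded = [round(v * SCALE) % MODULUS for v in local_values]
    mask = secrets.randbelow(MODULUS)
    running = (mask + encoded[0]) % MODULUS
    for value in encoded[1:]:
        running = (running + value) % MODULUS
    total = (running - mask) % MODULUS
    if total > MODULUS // 2:  # map the residue back to a signed value
        total -= MODULUS
    return total / SCALE


def local_stats(xs, ys):
    """Sufficient statistics of one owner for the model y = a + b*x."""
    return [len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys))]


def secure_simple_regression(owners):
    """Fit y = a + b*x on the union of the owners' records.

    Only the five summed statistics are revealed, never the records.
    """
    stats = [local_stats(xs, ys) for xs, ys in owners]
    n, sx, sy, sxx, sxy = (secure_sum(column) for column in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


if __name__ == "__main__":
    # Three hypothetical owners whose records all lie on y = 1 + 2x.
    owners = [([0.0, 1.0], [1.0, 3.0]),
              ([2.0, 3.0], [5.0, 7.0]),
              ([4.0], [9.0])]
    print(secure_simple_regression(owners))
```

With only two owners, either one can recover the other's statistics by subtracting its own contribution from the total, which is the “complement” mentioned in the abstract; the protection is meaningful only with three or more owners.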
Secure analysis of distributed chemical databases without data integration
- J. Computer-Aided Molecular Design, November 2005
Cited by 3 (3 self)
We present a method for performing statistically valid linear regressions on the union of distributed chemical databases that preserves confidentiality of those databases. The method employs secure multi-party computation to share local sufficient statistics necessary to compute least squares estimators of regression coefficients, error variances and other quantities of interest. We illustrate with an example containing four companies' rather different databases. Key words: Chemical database, distributed data, regression model, secure multi-party computation
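Why local sufficient statistics are enough can be seen from the standard least squares formulas. The notation below is an assumption made for illustration and is not taken from the paper: owner k (k = 1, ..., K) holds a design matrix X_k and response vector y_k with n_k records, and the model has p coefficients.

```latex
% Pooled cross-products are sums of the owners' local cross-products.
X^{\top}X = \sum_{k=1}^{K} X_k^{\top}X_k, \qquad
X^{\top}y = \sum_{k=1}^{K} X_k^{\top}y_k, \qquad
y^{\top}y = \sum_{k=1}^{K} y_k^{\top}y_k, \qquad
n = \sum_{k=1}^{K} n_k .

% The estimators on the union of the databases depend on the data only
% through those sums.
\hat{\beta} = \left(X^{\top}X\right)^{-1} X^{\top}y, \qquad
\hat{\sigma}^{2} = \frac{y^{\top}y - \hat{\beta}^{\top}X^{\top}y}{n - p} .
```

Each sum can therefore be computed entry by entry with a secure summation step, and no owner needs to reveal individual records.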
Secure statistical analysis of distributed databases using partially trusted third parties. Manuscript in preparation
- In Statistical Methods in Counterterrorism, 2005
Cited by 2 (1 self)
A continuing need in the contexts of homeland security, national defense and counterterrorism is for statistical analyses that “integrate” data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, immigration records and other sources might have prevented the terrorist attacks of September 11, 2001, or might be able to prevent
to organize the research and ensure communication:
identifying promising research paths for the statistical sciences, applied mathematics and decision sciences in problems of National Defense and Homeland Security (NDHS), and initiating research on them. This effort was especially important because previous efforts by these communities had failed to create a self-sustaining research momentum on NDHS. In addition, there have been few research efforts that have spanned the statistical sciences, the applied mathematical sciences and the decision sciences.
2. Working Groups
Four Working Groups operated throughout the year, whose principal function, as in all SAMSI programs, was
Privacy-Preserving Maximum Likelihood Estimation for Distributed Data
Recent technological advances enable the collection of huge amounts of data. Commonly, these data are generated, stored, and owned by multiple entities that are unwilling to cede control of their data. This distributed environment requires statistical tools that can produce correct results while preserving data privacy. Privacy-preserving protocols have been proposed for specific statistical analyses such as linear regression, clustering, and classification. In this paper, we present methods and protocols for privacy-preserving maximum likelihood estimation in general settings. We discuss both horizontally and vertically partitioned data, and propose procedures that allow participating parties to withdraw from the joint computation. Logistic regression is used to demonstrate our method.
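For horizontally partitioned data, the log-likelihood is a sum over records, so its gradient and Hessian are sums of per-party contributions. The Python sketch below illustrates that additivity for logistic regression with one predictor. It is an illustration under stated assumptions, not the paper's protocol: the function names are hypothetical, plain sums stand in for a secure summation step, and the withdrawal procedures and vertically partitioned case are not covered.

```python
import math


def local_score_and_information(beta, xs, ys):
    """One party's contribution to the gradient (score) and to the
    information matrix (negative Hessian) of the logistic log-likelihood
    for the model logit(p) = b0 + b1 * x."""
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return [g0, g1, h00, h01, h11]


def distributed_logistic_mle(parties, iterations=25):
    """Newton-Raphson on the pooled likelihood using only summed
    per-party contributions; raw records never leave a party."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        contributions = [local_score_and_information(beta, xs, ys)
                         for xs, ys in parties]
        # Plain sums stand in for secure summation (assumption of this sketch).
        g0, g1, h00, h01, h11 = (sum(col) for col in zip(*contributions))
        det = h00 * h11 - h01 * h01
        # Newton step: beta + inverse(information) * score, for a 2x2 matrix.
        beta = [beta[0] + (h11 * g0 - h01 * g1) / det,
                beta[1] + (h00 * g1 - h01 * g0) / det]
    return beta


if __name__ == "__main__":
    # Two hypothetical parties holding different records on the same variables.
    parties = [([0.0, 1.0, 2.0], [0, 1, 0]),
               ([3.0, 4.0, 1.5], [1, 1, 0])]
    print(distributed_logistic_mle(parties))
```

Because every iteration exchanges only five summed numbers, the estimate agrees with the fit on the pooled data up to floating-point rounding. The sketch assumes the pooled data are not perfectly separable, since otherwise the maximum likelihood estimate does not exist and the iteration does not converge.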

