Results 1-10 of 16
Secure Regression on Distributed Databases
J. Computational and Graphical Statistics, 2004
Cited by 38 (17 self)
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.
Data Dissemination and Disclosure Limitation In a World . . .
Statist. Sci., 2004
Cited by 24 (15 self)
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata, that is, data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
Secure, privacy-preserving analysis of distributed databases
Technometrics
Cited by 18 (6 self)
There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the “complement” of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce owners' incentives to be dishonest.
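The abstract above does not spell out a protocol, so the following is only a minimal sketch of one well-known way to share a sum without revealing individual contributions: a ring-based secure summation in which the first owner adds a random mask that it removes at the end. The function name `secure_sum`, the modulus, and the restriction to non-negative integers whose total is below the modulus are all assumptions made for illustration, and the ring of owners is simulated inside one process.

```python
import secrets


def secure_sum(local_values, modulus=2**61 - 1):
    """Simulate a ring-based secure summation (hypothetical sketch).

    Assumes each owner holds a non-negative integer and that the true
    total is smaller than `modulus`.
    """
    # Owner 1 draws a secret random mask and passes on mask + its own value.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus
    # Every later owner sees only a masked running total and adds its value.
    for value in local_values[1:]:
        running = (running + value) % modulus
    # The total returns to owner 1, who removes the mask and shares the sum.
    return (running - mask) % modulus


total = secure_sum([3, 5, 7])  # three hypothetical owners
```

Each owner can subtract its own value from the shared total to obtain the complement, but in this sketch the intermediate messages it sees are masked, so it cannot attribute parts of that complement to particular owners.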
“Secure” log-linear and logistic regression analysis of distributed databases
Privacy in Statistical Databases: CENEX-SDC Project International Conference, PSD 2006, 2006
Cited by 7 (6 self)
The machine learning community has focused on confidentiality problems associated with statistical analyses that “integrate” data stored in multiple, distributed databases where there are barriers to simply integrating the databases. This paper discusses various techniques which can be used to perform statistical analysis for categorical data, especially in the form of log-linear analysis and logistic regression over partitioned databases, while limiting confidentiality concerns. We show how ideas from the current literature that focus on “secure” summations and secure regression analysis can be adapted or generalized to the categorical data setting.
Valid statistical analysis for logistic regression with multiple sources
Protecting Persons While Protecting the People: Second Annual Workshop on Information Privacy and National Security, ISIPS 2008, 2008
Cited by 6 (4 self)
Considerable effort has gone into understanding issues of privacy protection of individual information in single databases, and various solutions have been proposed depending on the nature of the data, the ways in which the database will be used and the precise nature of the privacy protection being offered. Once data are merged across sources, however, the nature of the problem becomes far more complex and a number of privacy issues arise for the linked individual files that go well beyond those that are considered with regard to the data within individual sources. In the paper, we propose an approach that gives full statistical analysis on the combined database without actually combining it. We focus mainly on logistic regression, but the method and tools described may be applied to other statistical models as well.
Secure statistical analysis of distributed databases using partially trusted third parties. Manuscript in preparation
In Statistical Methods in Counterterrorism, 2005
Cited by 6 (1 self)
A continuing need in the contexts of homeland security, national defense and counterterrorism is for statistical analyses that “integrate” data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, immigration records and other sources might have prevented the terrorist attacks of September 11, 2001, or might be able to prevent
Secure Analysis of Distributed Chemical Databases Without Data Integration
Journal of Computer-Aided Molecular Design, 2005
Cited by 4 (4 self)
We present a method for performing statistically valid linear regressions on the union of distributed chemical databases that preserves confidentiality of those databases. The method employs secure multiparty computation to share local sufficient statistics necessary to compute least squares estimators of regression coefficients, error variances and other quantities of interest. We illustrate with an example containing four companies' rather different databases.
Key words: Chemical database, distributed data, regression model, secure multiparty computation
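To illustrate why sharing local sufficient statistics is enough for least squares, here is a minimal sketch for the simplest case of one predictor plus an intercept. The four owners and their data are made up for illustration, the helper names `local_stats` and `pooled_fit` are hypothetical, and the plain `sum` stands in for whatever secure summation step a real protocol would use; none of this is taken from the paper itself.

```python
def local_stats(xs, ys):
    """Sufficient statistics one owner computes on its own records."""
    return (
        len(xs),                              # n
        sum(xs),                              # sum of x
        sum(ys),                              # sum of y
        sum(x * x for x in xs),               # sum of x^2
        sum(x * y for x, y in zip(xs, ys)),   # sum of x*y
    )


def pooled_fit(stats_list):
    """Least squares intercept and slope from summed local statistics."""
    # In a real protocol each of these five sums would be computed securely.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*stats_list))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Four hypothetical owners holding disjoint records that satisfy y = 2x + 1.
owners = [
    ([0, 1], [1, 3]),
    ([2, 3, 4], [5, 7, 9]),
    ([5], [11]),
    ([6, 7], [13, 15]),
]
intercept, slope = pooled_fit([local_stats(xs, ys) for xs, ys in owners])
```

The estimates are the same as those obtained from a regression on the combined records, because the normal equations depend on the data only through these sums.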
Privacy-Preserving Maximum Likelihood Estimation for Distributed Data
Cited by 1 (0 self)
Recent technological advances enable the collection of huge amounts of data. Commonly, these data are generated, stored, and owned by multiple entities that are unwilling to cede control of their data. This distributed environment requires statistical tools that can produce correct results while preserving data privacy. Privacy-preserving protocols have been proposed for specific statistical analyses such as linear regression, clustering, and classification. In this paper, we present methods and protocols for privacy-preserving maximum likelihood estimation in general settings. We discuss both horizontally and vertically partitioned data, and propose procedures that allow participating parties to withdraw from the joint computation. Logistic regression is used to demonstrate our method.
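As a rough illustration of maximum likelihood estimation on horizontally partitioned data, the sketch below runs Newton-Raphson for a logistic regression with an intercept and one predictor, where each owner contributes only its local score and information terms at every iteration. This is an assumed, simplified scheme and not the paper's protocol: the function names and data are hypothetical, the plain `sum` stands in for a secure summation, and party withdrawal and vertical partitioning are not handled.

```python
import math


def local_score_and_info(beta, xs, ys):
    """One owner's score vector and information matrix entries at beta."""
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # fitted probability
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1, h00, h01, h11)


def distributed_logistic_mle(owners, iterations=25):
    """Newton-Raphson using only sums of the owners' local quantities."""
    beta = (0.0, 0.0)
    for _ in range(iterations):
        parts = [local_score_and_info(beta, xs, ys) for xs, ys in owners]
        # A real protocol would compute these five totals securely.
        g0, g1, h00, h01, h11 = (sum(col) for col in zip(*parts))
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (information matrix)^{-1} * score
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        beta = (beta[0] + step0, beta[1] + step1)
    return beta


# Three hypothetical owners with overlapping (non-separable) outcomes.
owners = [
    ([-2.0, -1.0, 0.0], [0, 0, 1]),
    ([1.0, 2.0], [0, 1]),
    ([-1.0, 0.0, 1.0], [1, 0, 1]),
]
beta_hat = distributed_logistic_mle(owners)
```

Because the log-likelihood is a sum over records, the summed scores and information matrices match those of the combined data, so the iterations follow the same path as a fit on the pooled records.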
to organize the research and ensure communication:
identifying promising research paths for the statistical sciences, applied mathematics and decision sciences in problems of National Defense and Homeland Security (NDHS), and initiating research on them. This effort was especially important because previous efforts by these communities had failed to create a self-sustaining research momentum on NDHS. In addition, few research efforts had spanned the statistical sciences, the applied mathematical sciences and the decision sciences.
2. Working Groups
Four Working Groups operated throughout the year, whose principal function, as in all SAMSI programs, was
www.niss.org Data Dissemination and Disclosure Limitation in a World Without Microdata: A Risk-Utility Framework for Remote Access Analysis Servers, 2004
Given the public’s ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata, that is, data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.