Secure Regression on Distributed Databases
- J. Computational and Graphical Statist
, 2004
Cited by 17 (12 self)
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in such a way that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis can then be performed on the integrated data.
Data Dissemination and Disclosure Limitation In a World . . .
- STATIST. SCI
, 2004
Cited by 13 (9 self)
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata, that is, data on individual units such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
Preserving confidentiality of high-dimensional tabular data: Statistical and computational issues
- AND COMPUTING
, 2003
Cited by 12 (8 self)
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
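As a rough illustration of iterative proportional fitting, which the abstract above names, the following minimal sketch fits a dense two-way table to given row and column totals. It is not the paper's implementation: the paper works with sparse, high-dimensional tables, while the function name `ipf`, the dense list-of-lists layout and the example totals here are all assumptions made for illustration.

```python
def ipf(seed, row_targets, col_targets, max_iters=100, tol=1e-10):
    """Fit a two-way table to target margins by iterative proportional fitting.

    seed        -- starting table (list of lists of non-negative numbers)
    row_targets -- desired row totals
    col_targets -- desired column totals (same grand total as row_targets)
    """
    table = [list(row) for row in seed]  # work on a copy
    for _ in range(max_iters):
        # Step 1: rescale each row so that it sums to its target.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                table[i] = [x * row_targets[i] / total for x in row]
        # Step 2: rescale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows match as well.
        row_error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical margins, invented for this example only.
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], row_targets=[30.0, 70.0], col_targets=[40.0, 60.0])
```

Starting from a uniform seed, the fitted table is the independence table for the given margins. A different seed would carry its own association structure into the result.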
A framework for evaluating the utility of data altered to protect confidentiality
, 2006
Cited by 10 (6 self)
When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of survey respondents' identities and attribute values. To select among the wide variety of data alteration methods, agencies require tools for evaluating the utility of proposed data releases. Such utility measures can be combined with disclosure risk measures to gauge risk-utility tradeoffs of competing methods. In this paper, we present utility measures focused on differences in inferences obtained from the altered data and corresponding inferences obtained from the original data. Using both genuine and simulated data, we show how the measures can be used in a decision-theoretic formulation for evaluating disclosure limitation procedures.
Data quality: A statistical perspective
- Statistical Methodology
, 2006
Cited by 3 (3 self)
We present the old-but-new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations in computer science, total quality management and statistics are reviewed. Two case studies based on an EDA approach to data quality are used to motivate a set of research challenges for statistics that span theory, methodology and software tools.
Secure statistical analysis of distributed databases using partially trusted third parties. Manuscript in preparation
- In Statistical Methods in Counterterrorism
, 2005
Cited by 2 (1 self)
A continuing need in the contexts of homeland security, national defense and counterterrorism is for statistical analyses that “integrate” data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, immigration records and other sources might have prevented the terrorist attacks of September 11, 2001, or might be able to prevent
Distortion Measures for
- J. Official Statist
, 2003
Cited by 1 (0 self)
Data swapping is a common technique for statistical disclosure limitation, but its effects on real data are not completely understood. In this paper, we consider measures that can be used to quantify the distortion to the data caused by data swapping when the variables in the data set are categorical. These measures are applied to a data set derived from the Current Population Survey. Their behavior is studied and compared for various values of the swapping rate and different choices of the swapped variable.
Data Swapping as a
- J. Official Statist
, 2003
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and, possibly, constraints on the unswapped attributes. Risk-utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced.
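The notion of a frontier of non-dominated candidates in (risk, utility) space can be sketched in a few lines. The sketch below assumes that lower risk and higher utility are both preferred. The names `Candidate`, `dominates` and `risk_utility_frontier`, as well as every number in the example, are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str        # label for a candidate release (hypothetical)
    risk: float      # disclosure risk, lower is better
    utility: float   # data utility, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.risk <= b.risk and a.utility >= b.utility
    strictly_better = a.risk < b.risk or a.utility > b.utility
    return no_worse and strictly_better


def risk_utility_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up candidate releases, for illustration only.
candidates = [
    Candidate("swap_1pct", risk=0.30, utility=0.95),
    Candidate("swap_5pct", risk=0.20, utility=0.85),
    Candidate("swap_5pct_alt", risk=0.25, utility=0.80),  # worse than swap_5pct on both axes
    Candidate("swap_10pct", risk=0.10, utility=0.60),
]
frontier = risk_utility_frontier(candidates)
```

Here `swap_5pct_alt` drops out because `swap_5pct` has both lower risk and higher utility. The remaining three candidates trade risk against utility, so choosing among them requires a further decision rule.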
Working Paper No. 30
, 2003
In this paper we give an overview of various approaches to implementing statistical disclosure control for tabular data released through the Web. We consider three generic groups of statistical disclosure control methods: source data perturbation, output perturbation and query-set restriction. Considering different types of Web sites and implementation approaches, we discuss the appropriateness and effectiveness of such statistical disclosure control methods.
www.niss.org Bayesian Multiscale Multiple Imputation with Implications to Data Confidentiality
, 2008
Many scientific, sociological and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such data sets contain missing observations that can be accurately imputed using the method we propose, Bayesian multiscale multiple imputation. This method borrows information both longitudinally and across different levels of aggregation to produce accurate imputations of missing observations as well as estimates that respect the constraints imposed by the multiscale nature of the data. Our approach couples dynamic linear models with a novel imputation step based on singular normal distribution theory. Although our method is of independent interest, one important implication of such methodology is its potential effect on confidential databases protected by means of cell suppression. In order to demonstrate the proposed methodology and to assess the effectiveness of disclosure practices in longitudinal databases, we conduct a large-scale empirical study using the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). In the course of our empirical investigation, we find that several of the predicted cells are accurate to within 1%, raising potential concerns for data confidentiality.

