Results 1–10 of 14
Secure Regression on Distributed Databases
 J. Computational and Graphical Statistics, 2004
"... We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to ma ..."
Abstract

Cited by 25 (14 self)
 Add to MetaCart
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.
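The paper's own protocols are not reproduced in this listing. As a hedged illustration of one standard building block often used for such secure computations, the sketch below simulates secure summation: each party's private quantity (for regression, an entry of X'X or X'y) is added into a masked running total passed around a ring, so the exact sum is recovered while no party sees another's raw value. The function name and ring simulation are my own assumptions for illustration, not the paper's implementation.

```python
import random

def secure_sum(local_values, modulus=2**62):
    """Simulate secure summation among parties holding local_values.

    Party 0 adds a random mask r to its value and passes the running
    total around the ring; each subsequent party adds its own value.
    Party 0 finally subtracts r, recovering the exact sum while no
    party ever sees another party's raw contribution.
    """
    r = random.randrange(modulus)              # mask known only to party 0
    running = (local_values[0] + r) % modulus  # party 0 starts the ring
    for v in local_values[1:]:                 # each party adds its share
        running = (running + v) % modulus
    return (running - r) % modulus             # party 0 removes the mask
```

Applying the same protocol component-wise to each agency's local blocks of the normal equations would yield the shared X'X and X'y, from which any party can fit the regression.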
Data swapping as a decision problem
 J. Official Statistics, 2003
"... We construct a decisiontheoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and possibly, constraints on the u ..."
Abstract

Cited by 14 (10 self)
 Add to MetaCart
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and, possibly, constraints on the unswapped attributes. Risk-utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the Current Population Survey, the well-studied “Czech auto worker data” and data on schools and administrators generated by the National Center for Education Statistics.
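The risk-utility frontier described here is simply the set of candidates not dominated in (risk, utility) space, which can be computed by direct pairwise comparison. The candidate names and scores below are hypothetical, and the convention that lower risk and higher utility are preferred is an assumption for illustration.

```python
def frontier(candidates):
    """Return the candidates not dominated in (risk, utility) space.

    Each candidate is (name, risk, utility); lower risk and higher
    utility are better.  Candidate a dominates b if a is at least as
    good on both criteria and strictly better on at least one.
    """
    def dominates(a, b):
        return (a[1] <= b[1] and a[2] >= b[2]
                and (a[1] < b[1] or a[2] > b[2]))
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# Hypothetical swap-rate candidates as (name, risk, utility);
# "swap 15%" is dominated by "swap 10%" and falls off the frontier.
releases = [("swap 1%", 0.9, 0.95), ("swap 5%", 0.5, 0.80),
            ("swap 10%", 0.4, 0.60), ("swap 15%", 0.45, 0.55),
            ("swap 20%", 0.3, 0.30)]
```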
Preserving confidentiality of high-dimensional tabular data: Statistical and computational issues
 Statistics and Computing, 2003
"... Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of crosstabulations (marginal sub ..."
Abstract

Cited by 14 (10 self)
 Add to MetaCart
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality-threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
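As a minimal sketch of the sparsity idea the abstract mentions (my own illustration, not the paper's data structure): storing only the nonzero cells of a contingency table as a dictionary lets a marginal sub-table be computed in time proportional to the number of nonzero cells rather than the full table size.

```python
from collections import defaultdict

def marginal(cells, keep):
    """Compute a marginal sub-table from a sparse contingency table.

    `cells` maps index tuples (one coordinate per variable) to counts;
    only nonzero cells are stored.  `keep` lists the variable positions
    retained in the marginal; all other coordinates are summed out.
    """
    out = defaultdict(int)
    for idx, count in cells.items():
        out[tuple(idx[k] for k in keep)] += count
    return dict(out)

# A sparse 3-way table: only 3 of the 2*2*2 cells are nonzero.
table = {(0, 0, 1): 5, (0, 1, 1): 2, (1, 0, 0): 7}
```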
Data Dissemination and Disclosure Limitation In a World . . .
 Statistical Science, 2004
"... Given the public's everincreasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdatadata on individual units, such as individuals or establishments. In such a world, an al ..."
Abstract

Cited by 13 (9 self)
 Add to MetaCart
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata: data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
A framework for evaluating the utility of data altered to protect confidentiality
 2006
"... When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of survey respondents ’ identities and attribute values. To select among the wide variety of data alteration methods, agencies require tools for evalua ..."
Abstract

Cited by 12 (6 self)
 Add to MetaCart
When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of survey respondents' identities and attribute values. To select among the wide variety of data alteration methods, agencies require tools for evaluating the utility of proposed data releases. Such utility measures can be combined with disclosure risk measures to gauge risk-utility tradeoffs of competing methods. In this paper, we present utility measures focused on differences between inferences obtained from the altered data and corresponding inferences obtained from the original data. Using both genuine and simulated data, we show how the measures can be used in a decision-theoretic formulation for evaluating disclosure limitation procedures.
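One concrete measure in this spirit, sketched here as an assumption rather than the paper's exact proposal, scores utility by how much a confidence interval computed from the altered data overlaps the corresponding interval computed from the original data.

```python
def interval_overlap(orig, alt):
    """Utility score based on confidence-interval overlap.

    `orig` and `alt` are (lower, upper) intervals for the same
    estimand, computed from the original and altered data.  The score
    averages the overlapping fraction of each interval: 1 means the
    intervals coincide, 0 means they are disjoint.
    """
    lo = max(orig[0], alt[0])
    hi = min(orig[1], alt[1])
    if hi <= lo:
        return 0.0
    return 0.5 * ((hi - lo) / (orig[1] - orig[0])
                  + (hi - lo) / (alt[1] - alt[0]))
```

Averaged over a battery of estimands, a score near 1 says an analyst would draw essentially the same inferences from the altered release.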
Bounds for cell entries in two-way tables given conditional relative frequencies
 In Privacy in Statistical Databases – PSD 2004, Lecture Notes in Computer Science, 2004
"... Abstract. In recent work on statistical methods for confidentiality and disclosure limitation, Dobra and Fienberg (2000, 2003) and Dobra (2002) have generalized BonferroniFréchetHoeffding bounds for cell entries in kway contingency tables given marginal totals. In this paper, we consider extensio ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
In recent work on statistical methods for confidentiality and disclosure limitation, Dobra and Fienberg (2000, 2003) and Dobra (2002) have generalized Bonferroni-Fréchet-Hoeffding bounds for cell entries in k-way contingency tables given marginal totals. In this paper, we consider extensions of their approach focused on upper and lower bounds for cell entries given arbitrary sets of marginals and conditionals. We give a complete characterization of the two-way table problem and discuss some implications for statistical disclosure limitation. In particular, we employ tools from computational algebra to describe the locus of all possible tables under the given constraints and discuss how this additional knowledge affects disclosure.
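For the marginal-only case that this paper extends, the classical two-way bounds are explicit: with row totals r_i and column totals c_j summing to N, each cell satisfies max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j). The sketch below computes these bounds; the conditional-frequency case treated in the paper is beyond it.

```python
def cell_bounds(row_totals, col_totals):
    """Fréchet bounds for each cell of a two-way table given margins.

    For fixed row totals r_i and column totals c_j summing to N, each
    cell n_ij satisfies max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j).
    Returns a dict mapping (i, j) to (lower, upper).
    """
    n = sum(row_totals)
    assert n == sum(col_totals), "margins must be consistent"
    return {(i, j): (max(0, r + c - n), min(r, c))
            for i, r in enumerate(row_totals)
            for j, c in enumerate(col_totals)}
```

A narrow bound on a small cell is exactly the confidentiality threat: the released marginals pin the suppressed count down to a short interval.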
Data quality: A statistical perspective
 Statistical Methodology, 2006
"... We present the oldbut–new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations i ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
We present the old-but-new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations in computer science, total quality management and statistics are reviewed. Two case studies based on an EDA approach to data quality are used to motivate a set of research challenges for statistics that span theory, methodology and software tools.
Secure statistical analysis of distributed databases using partially trusted third parties. Manuscript in preparation
 In Statistical Methods in Counterterrorism, 2005
"... A continuing need in the contexts of homeland security, national defense and counterterrorism is for statistical analyses that “integrate ” data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
A continuing need in the contexts of homeland security, national defense and counterterrorism is for statistical analyses that “integrate” data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, immigration records and other sources might have prevented the terrorist attacks of September 11, 2001, or might be able to prevent ...
Distortion Measures for
 J. Official Statistics, 2003
"... Data swapping is a common technique for statistical disclosure limitation, but its effects on real data are not understood completely. In this paper, we consider measures that can be used to quantify distortion to the data engendered by data swapping when the variables in the data set are categor ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Data swapping is a common technique for statistical disclosure limitation, but its effects on real data are not completely understood. In this paper, we consider measures that can be used to quantify the distortion to the data engendered by data swapping when the variables in the data set are categorical. These measures are applied to a data set derived from the Current Population Survey. Their behavior is studied and compared for various values of the swapping rate and different choices of the variable swapped.
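A simple distortion measure of the kind studied here (an illustrative assumption, not necessarily one of the paper's measures) is the total variation distance between the empirical joint distributions of the categorical variables before and after swapping.

```python
from collections import Counter

def total_variation(records_a, records_b):
    """Total variation distance between two categorical data sets.

    Each record is a tuple of category values.  The distance is half
    the L1 difference between the empirical joint distributions: 0
    means identical distributions, 1 means disjoint support.
    """
    pa, pb = Counter(records_a), Counter(records_b)
    na, nb = len(records_a), len(records_b)
    cells = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in cells)
```

Because swapping preserves univariate margins by construction, a measure like this one isolates the damage done to the joint distribution as the swap rate grows.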
Bayesian Multiscale Multiple Imputation with Implications to Data Confidentiality
 2008
"... Many scientific, sociological and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such data sets experience m ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Many scientific, sociological and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such data sets contain missing observations that can be accurately imputed using the method we propose, known as Bayesian multiscale multiple imputation. This method borrows information both longitudinally and across different levels of aggregation to produce accurate imputations of missing observations as well as estimates that respect the constraints imposed by the multiscale nature of the data. Our approach couples dynamic linear models with a novel imputation step based on singular normal distribution theory. Although our method is of independent interest, one important implication of such methodology is its potential effect on confidential databases protected by means of cell suppression. In order to demonstrate the proposed methodology and to assess the effectiveness of disclosure practices in longitudinal databases, we conduct a large-scale empirical study using the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). During the course of our empirical investigation it is determined that several of the predicted cells are within 1% accuracy, thus causing potential concerns for data confidentiality.
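The confidentiality concern in this abstract can be seen in miniature without any of the paper's machinery: when published values must add up across scales, a single suppressed cell is exactly determined by subtraction. The function and the quarterly numbers below are a hypothetical illustration of that simplest failure mode, which the paper's multiscale imputation generalizes to partial, probabilistic recovery across many scales.

```python
def recover_suppressed(components, total):
    """Recover a single suppressed value inside an additive aggregate.

    `components` are the published component values, with the
    suppressed one given as None; `total` is the published aggregate.
    With one suppression and no complementary protection, the hidden
    value is forced exactly by the additivity constraint.
    """
    known = [v for v in components if v is not None]
    assert components.count(None) == 1, "sketch handles one suppressed cell"
    return total - sum(known)

# Hypothetical quarterly employment counts with Q3 suppressed but the
# annual total published: the suppressed value is fully determined.
quarters = [120, 135, None, 140]
annual_total = 515
```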