Secure Regression on Distributed Databases
J. Computational and Graphical Statist., 2004
Abstract

Cited by 38 (17 self)
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.
A framework for evaluating the utility of data altered to protect confidentiality
, 2005
Abstract

Cited by 33 (14 self)
When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of survey respondents' identities and attribute values. To select among the wide variety of data alteration methods, agencies require tools for evaluating the utility of proposed data releases. Such utility measures can be combined with disclosure risk measures to gauge risk-utility tradeoffs of competing methods. In this paper, we present utility measures focused on differences in inferences obtained from the altered data and corresponding inferences obtained from the original data. Using both genuine and simulated data, we show how the measures can be used in a decision-theoretic formulation for evaluating disclosure limitation procedures.
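One concrete way to compare inferences from altered and original data, in the spirit of the abstract above, is the overlap of the two confidence intervals for the same estimand. The formula below is an illustrative measure of this kind, not the paper's exact definition:

```python
def interval_overlap_utility(ci_orig, ci_alt):
    """Average relative overlap of two confidence intervals.

    Returns 1.0 when the altered-data interval reproduces the
    original-data interval exactly, and 0.0 when they are disjoint.
    (Illustrative formula, not the paper's exact measure.)
    """
    lo = max(ci_orig[0], ci_alt[0])
    hi = min(ci_orig[1], ci_alt[1])
    if hi <= lo:
        return 0.0
    overlap = hi - lo
    # Average the overlap as a fraction of each interval's width.
    return 0.5 * (overlap / (ci_orig[1] - ci_orig[0])
                  + overlap / (ci_alt[1] - ci_alt[0]))

# Identical intervals give utility 1; disjoint intervals give 0.
assert interval_overlap_utility((0.0, 1.0), (0.0, 1.0)) == 1.0
assert interval_overlap_utility((0.0, 1.0), (2.0, 3.0)) == 0.0
```

Such a score can be paired with a disclosure risk score to place each candidate release in (risk, utility) space.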
Data Dissemination and Disclosure Limitation In a World . . .
Statist. Sci., 2004
Abstract

Cited by 24 (15 self)
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata: data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
Data swapping as a decision problem
J. Official Statist., 2003
Abstract

Cited by 24 (14 self)
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, the swap attribute(s) and, possibly, constraints on the unswapped attributes. Risk–utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the Current Population Survey, the well-studied "Czech auto worker data" and data on schools and administrators generated by the National Center for Education Statistics.
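The risk–utility frontier described above is a Pareto frontier: a candidate release survives only if no other candidate has both lower (or equal) risk and higher (or equal) utility, with at least one strict improvement. A minimal sketch, with made-up candidate labels and scores:

```python
def risk_utility_frontier(candidates):
    """Return labels of candidates not dominated in (risk, utility) space.

    candidates: list of (label, risk, utility) triples, where lower risk
    and higher utility are better.  (Sketch; the encoding is made up.)
    """
    frontier = []
    for name, risk, util in candidates:
        dominated = any(
            r <= risk and u >= util and (r < risk or u > util)
            for _, r, u in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical swap-rate candidates: (label, disclosure risk, data utility).
candidates = [
    ("swap 1%", 0.9, 0.95),
    ("swap 5%", 0.5, 0.80),
    ("swap 10%", 0.3, 0.60),
    ("swap 5% constrained", 0.6, 0.70),  # dominated by "swap 5%"
]
assert risk_utility_frontier(candidates) == ["swap 1%", "swap 5%", "swap 10%"]
```

Only the frontier needs to be examined further, which is what shrinks the decision problem.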
Assessing the Risk of Disclosure of Confidential Categorical Data
, 2003
Abstract

Cited by 21 (6 self)
Disclosure limitation involves the application of statistical tools to limit the identification of information on individuals (and enterprises) included as part of statistical data bases such as censuses and sample surveys. We outline the major issues involved in assessing disclosure risk and assuring the protection of confidentiality for data bases, especially those in the form of multiway contingency tables, and we present a Bayesian framework for thinking about such problems both from the perspective of an intruder and the agency trying to protect its data.
Preserving confidentiality of high-dimensional tabular data: Statistical and computational issues
AND COMPUTING, 2003
Abstract

Cited by 20 (12 self)
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal subtables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality-threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
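The simplest instance of the cell-bound computation mentioned above is the classical two-way case: given only row and column totals, each cell of the table is sharply bounded by its Fréchet bounds. A sketch of that base case (the generalized shuttle algorithm handles arbitrary sets of released marginals of a k-way table, which this does not attempt):

```python
import numpy as np

def frechet_bounds(row_totals, col_totals):
    """Sharp bounds on each cell of a two-way table given its margins.

    For cell (i, j):  max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j),
    where N is the grand total.
    """
    r = np.asarray(row_totals)
    c = np.asarray(col_totals)
    n = r.sum()
    assert n == c.sum(), "margins must share a grand total"
    upper = np.minimum.outer(r, c)
    lower = np.maximum(r[:, None] + c[None, :] - n, 0)
    return lower, upper

lower, upper = frechet_bounds([60, 40], [70, 30])
# Cell (0, 0): between max(0, 60 + 70 - 100) = 30 and min(60, 70) = 60.
assert lower[0, 0] == 30 and upper[0, 0] == 60
```

A narrow bound on a small cell is exactly the confidentiality threat the paper is concerned with: if lower and upper coincide, releasing the margins reveals that cell exactly.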
Data quality: A statistical perspective
Statistical Methodology, 2006
Abstract

Cited by 15 (8 self)
We present the old-but-new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations in computer science, total quality management and statistics are reviewed. Two case studies based on an EDA approach to data quality are used to motivate a set of research challenges for statistics that span theory, methodology and software tools.
Bounds for cell entries in two-way tables given conditional relative frequencies
In Privacy in Statistical Databases – PSD 2004, Lecture Notes in Computer Science, 2004
Abstract

Cited by 11 (6 self)
Abstract. In recent work on statistical methods for confidentiality and disclosure limitation, Dobra and Fienberg (2000, 2003) and Dobra (2002) have generalized Bonferroni–Fréchet–Hoeffding bounds for cell entries in k-way contingency tables given marginal totals. In this paper, we consider extensions of their approach focused on upper and lower bounds for cell entries given arbitrary sets of marginals and conditionals. We give a complete characterization of the two-way table problem and discuss some implications for statistical disclosure limitation. In particular, we employ tools from computational algebra to describe the locus of all possible tables under the given constraints and discuss how this additional knowledge affects disclosure.
Privacy-preserving analysis of vertically partitioned data using secure matrix products
J. Official Statistics, 2009
Abstract

Cited by 10 (0 self)
Reluctance of statistical agencies and other data owners to share possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting mutually beneficial analyses. In this article, we propose a protocol for conducting secure regressions and similar analyses on vertically partitioned data – databases with identical records but disjoint sets of attributes. This protocol allows data owners to estimate coefficients and standard errors of linear regressions, and to examine regression model diagnostics, without disclosing the values of their attributes to each other. No third parties are involved. The protocol can be used to perform other procedures for which sample means and covariances are sufficient statistics. The basis is an algorithm for secure matrix multiplication, which is used by pairs of owners to compute off-diagonal blocks of the full data covariance matrix. Key words: Distributed databases; secure matrix product; vertically partitioned data; regression; data confidentiality.
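The assembly step in the abstract above can be sketched directly: the diagonal blocks of X'X are local, and only the off-diagonal cross-product block requires the secure matrix product. In this simplified sketch the cross-product is computed in the clear purely to show how the blocks combine; the attribute split and data are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical vertically partitioned data: identical records, disjoint
# attributes.  Owner A holds the intercept and x1 (plus the response y);
# owner B holds x2 and x3.
n = 200
XA = np.column_stack([np.ones(n), rng.normal(size=n)])
XB = rng.normal(size=(n, 2))
y = XA @ [1.0, 2.0] + XB @ [-1.0, 0.5] + rng.normal(scale=0.1, size=n)

# Diagonal blocks of X'X are computed locally; only the off-diagonal
# block XA'XB would need the secure matrix product in the real protocol.
cross = XA.T @ XB
XtX = np.block([[XA.T @ XA, cross], [cross.T, XB.T @ XB]])
Xty = np.concatenate([XA.T @ y, XB.T @ y])
beta = np.linalg.solve(XtX, Xty)

# Matches ordinary least squares on the (never materialized) joined data.
beta_full, *_ = np.linalg.lstsq(np.hstack([XA, XB]), y, rcond=None)
assert np.allclose(beta, beta_full)
```

Neither owner ever needs the other's attribute values, only the agreed cross-product block.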
Bayesian Multiscale Multiple Imputation with Implications to Data Confidentiality
, 2008
Abstract

Cited by 6 (3 self)
Many scientific, sociological and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such data sets contain missing observations that can be accurately imputed using the method we propose, known as Bayesian multiscale multiple imputation. This method borrows information both longitudinally and across different levels of aggregation to produce accurate imputations of missing observations as well as estimates that respect the constraints imposed by the multiscale nature of the data. Our approach couples dynamic linear models with a novel imputation step based on singular normal distribution theory. Although our method is of independent interest, one important implication of such methodology is its potential effect on confidential databases protected by means of cell suppression. In order to demonstrate the proposed methodology and to assess the effectiveness of disclosure practices in longitudinal databases, we conduct a large-scale empirical study using the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). During the course of our empirical investigation we find that several of the predicted cells are imputed to within 1% accuracy, raising potential concerns for data confidentiality.