## Software Systems for Tabular Data Releases (2002)

### Other Repositories/Bibliography

Venue: INT. J. UNCERTAINTY, FUZZINESS AND KNOWLEDGE BASED SYSTEMS

Citations: 21 (16 self)

### BibTeX

@ARTICLE{Dobra02softwaresystems,
  author = {Adrian Dobra and Alan F. Karr and Ashish P. Sanil and Stephen E. Fienberg},
  title = {Software Systems for Tabular Data Releases},
  journal = {INT. J. UNCERTAINTY, FUZZINESS AND KNOWLEDGE BASED SYSTEMS},
  year = {2002},
  volume = {10}
}

### Abstract

We describe two classes of software systems that release tabular summaries of an underlying database. Table servers respond to user queries for (marginal) sub-tables of the "full" table summarizing the entire database, and are characterized by dynamic assessment of disclosure risk in light of previously answered queries. Optimal tabular releases are static releases of sets of sub-tables that maximize the amount of information released, as given by a measure of data utility, subject to a constraint on disclosure risk. We discuss the underlying abstractions (primarily the query space, together with released and unreleasable sub-tables and their frontiers), computational algorithms and issues, especially scalability, and prototype software implementations.
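The bookkeeping behind a table server can be illustrated with a small Python sketch. It is not the authors' implementation: the names `query_space`, `marginal`, `TableServer` and `min_count` are hypothetical, and the risk rule (refuse a sub-table that contains a non-empty cell below `min_count`) is a deliberately simple, static stand-in. The systems described in the abstract instead assess risk dynamically, from everything released so far. What the sketch does show is the query space (one marginal sub-table per non-empty subset of variables, so 8,191 for 13 variables) and a released frontier that stores only the maximal released sub-tables.

```python
from collections import Counter
from itertools import combinations


def query_space(k):
    """All non-empty subsets of k variables; each one names a marginal sub-table."""
    return [frozenset(c) for r in range(1, k + 1) for c in combinations(range(k), r)]


def marginal(records, variables):
    """Count table over the chosen variables (only non-empty cells are stored)."""
    idx = sorted(variables)
    return Counter(tuple(rec[i] for i in idx) for rec in records)


class TableServer:
    """Toy table server: answers marginal queries and tracks the released frontier."""

    def __init__(self, records, min_count):
        self.records = records
        self.min_count = min_count  # hypothetical risk threshold (assumption)
        self.frontier = set()       # maximal released variable sets

    def is_released(self, variables):
        # Releasing a sub-table also releases every marginal of it.
        return any(variables <= f for f in self.frontier)

    def query(self, variables):
        variables = frozenset(variables)
        table = marginal(self.records, variables)
        if self.is_released(variables):
            return table
        # Stand-in risk rule (assumption): refuse if any non-empty cell is small.
        if min(table.values()) < self.min_count:
            return None
        # Drop frontier elements that the new sub-table now covers.
        self.frontier = {f for f in self.frontier if not f <= variables}
        self.frontier.add(variables)
        return table
```

A call such as `TableServer(records, min_count=2).query({0, 1})` returns either the requested count table or `None` when the stand-in rule refuses it. Because the frontier holds only maximal sets, checking whether a sub-table is already implied by earlier releases is a subset test against a short list.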

### Citations

3110 |
UCI repository of machine learning databases
- Blake, Keogh, et al.
- 1998
Citation Context ...asable constitutes a formidable computational burden. Larger values of the risk threshold β decrease the information that can be released. For β = 4, the optimal released frontier is $RF^*(\beta = 4) = \{[1, 2, 8, 11, 13],\ [2, 3, 4, 7, 8, 11, 13],\ [2, 3, 7, 8, 11, 12, 13],\ [3, 4, 5, 7, 8, 13],\ [3, 7, 8, 9, 11, 13],\ [2, 7, 8, 10, 11, 13],\ [3, 6, 8, 12, 13]\}$, and 319 sub-tables (3.89%) are released. For β = 5, the alg... |

1174 | Graphical Models
- Lauritzen
- 1996
Citation Context ...calability and other computational challenges. Problem Formulation. Specifically, if we require that the sub-tables in R constitute the minimal sufficient statistics of a decomposable graphical model =-=[1, 15, 25]-=- (see Figure 5), then first of all the number of candidate releases decreases dramatically. But, more important, in this case, both UB(C, R) and LB(C, R) in (4) can be expressed as explicit functions ... |

541 | Discrete multivariate analysis: theory and practice
- Bishop, Fienberg, et al.
- 1975
Citation Context ...orate a suitably defined value of releasing T [14, 24, 35]. One example of value is the accuracy with which the full table T can be reconstructed from R(t) ∪ T by means of iterative proportional fitting =-=[1]-=-. The decision whether to respond to a query need not be taken immediately. Instead, for example, the system could accumulate requests for different sub-tables and employ user interest as a measure of... |

475 |
Graphical Models in Applied Multivariate Statistics
- Whittaker
- 1990
Citation Context ...restricting R in (3) to have the “right” special structure, we can overcome both scalability and other computational challenges. We exploit, in this context, the statistical theory on graphical models =-=[24, 28, 41]-=-, which shows that the conditional dependencies induced by the sub-tables in RF among the variables cross-classified in a table of counts consistent with RF can be visualized by means of an independenc... |

297 | Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam’s Window
- Madigan, Raftery
- 1994
Citation Context ... log-linear models fitted to the full table [1], searching for OTRs has many of the same characteristics of searching through all possible log-linear models or the subclass of all decomposable models =-=[15, 27, 38]-=-. The second way (§5.2) “solves” (3) heuristically, by ordering the marginal sub-tables of T according to a particular notion of risk, and then constructing the heuristically optimal release R ∗ by gr... |

263 | Bayesian Graphical Models for Discrete Data
- Madigan, York
- 1995
Citation Context ...restricting R in (3) to have the “right” special structure, we can overcome both scalability and other computational challenges. We exploit, in this context, the statistical theory on graphical models =-=[24, 28, 41]-=-, which shows that the conditional dependencies induced by the sub-tables in RF among the variables cross-classified in a table of counts consistent with RF can be visualized by means of an independenc... |

172 |
Introduction to Graphical Modelling
- Edwards
- 1995
Citation Context ... log-linear models fitted to the full table [1], searching for OTRs has many of the same characteristics of searching through all possible log-linear models or the subclass of all decomposable models =-=[15, 27, 38]-=-. The second way (§5.2) “solves” (3) heuristically, by ordering the marginal sub-tables of T according to a particular notion of risk, and then constructing the heuristically optimal release R ∗ by gr... |

101 |
Data Swapping: A Technique for Disclosure Control
- Dalenius, Reiss
- 1982
Citation Context ...of these, such as aggregation [22, 23, 26], cell suppression [6, 30], perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping =-=[7]-=- and jittering (addition of noise to numerical data attributes) [40], operate directly on the underlying database, prior to formation of tables. The software systems described in this paper, in effe... |

87 |
Decomposition by clique separators
- Tarjan
- 1985
Citation Context ...roken” into components such that (i) every component is associated with exactly one fixed sub-table in the frontier; and (ii) no released sub-table is “split” between two components. Reducible graphs =-=[35, 26]-=- are generalizations of decomposable graphs. A reducible graph is one that can be at least partially decomposed, although the resulting components of the decomposition may correspond to more than ... |

59 |
Disclosure limitation using perturbation and related methods for categorical data
- Fienberg, Makov, et al.
- 1998
Citation Context ... reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation [22, 23, 26], cell suppression [6, 30], perturbation =-=[12, 13, 20, 21]-=- and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes) [40], operate directly on the ... |

57 |
Disclosure limitation methods and information loss for tabular data
- Duncan, Fienberg, et al.
- 2001
Citation Context ... reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation [22, 23, 26], cell suppression [6, 30], perturbation =-=[12, 13, 20, 21]-=- and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes) [40], operate directly on the ... |

50 |
A fast procedure for model search in multidimensional contingency tables
- Edwards, Havranek
- 1985
Citation Context ...[Table 2: Czech auto worker data from =-=[16]-=-.] The higher w∗(T), the safer it is to release T. Finally, we construct a release R∗ by adding sub-tables to it in order of decreasing critical width until no more can be added without exceeding ... |

48 |
Confidentiality, uniqueness, and disclosure limitation for categorical data
- Fienberg, Makov
- 1998
Citation Context ...hat it comprises more than p% of the sum. A typical value in practice is p = 60%. For tables with sample counts, uniqueness in the sample may or may not pose a risk of disclosure. Thus the literature =-=[18, 19, 31]-=- looks at the proportion of sample uniques that are in fact population uniques. If this is small, then the risk is small even in the presence of a large number of counts of 1 or 2 in the table. A majo... |

47 |
Network models for complementary cell suppression
- Cox
- 1995
Citation Context ...s strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation [22, 23, 26], cell suppression =-=[6, 30]-=-, perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes)... |

46 |
Bounds for cell entries in contingency tables given marginal totals and decomposable graphs
- Dobra, Fienberg
Citation Context ...ative measure of risk, which we employ in §3 and 5. In the important cases that the released marginals constitute a decomposable model, the bounds are both sharp and computable using scalable methods =-=[4, 10]-=-; see §3.2. Risk Reduction. Numerous strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregat... |

34 |
An algorithm to calculate the lower and upper bounds of the elements of an array given its marginals
- Buzzigoli, Giusti
- 1999
Citation Context ...ative measure of risk, which we employ in §3 and 5. In the important cases that the released marginals constitute a decomposable model, the bounds are both sharp and computable using scalable methods =-=[4, 10]-=-; see §3.2. Risk Reduction. Numerous strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregat... |

34 |
Statistical disclosure control in practice, Lecture Notes in Statistics
- Willenborg, de Waal
- 1996
Citation Context ...utable using scalable methods [4, 10]; see §3.2. Risk Reduction. Numerous strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data =-=[39]-=-. Some of these, such as aggregation [22, 23, 26], cell suppression [6, 30], perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data sw... |

33 | Obtaining information while preserving privacy: a Markov perturbation method for tabular data. Eurostat. Statistical Data Protection '98 Lisbon
- Duncan, Fienberg
- 1999
Citation Context ... reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [42]. Some of these, such as aggregation [21, 22, 25], cell suppression [6, 31], perturbation =-=[11, 12, 19, 20]-=- and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes) [43], operate directly on the ... |

32 |
A constructive procedure for unbiased controlled rounding
- Cox
- 1987
Citation Context ...asing T together with the one-dimensional sub-tables corresponding to all variables that do not appear in T. For example, if T = [1, 2, 3] for a six-dimensional table then MPR(T) = {[1, 2, 3], [4], [5], [6]}. In order to assess how “dangerous” a sub-table T would be if it were released, it suffices to consider MPR(T), since it is embedded in any other possible release containing T. Moreover, beca... |

30 |
Java 2 platform enterprise edition specification
- Sun Microsystems
Citation Context ... any given time as well as consequences of releasing particular sub-tables can be readily discerned. Figure 2 shows the architecture of a more powerful table server implemented using server-side Java =-=[32]-=-. This prototype operates on a 14-dimensional full table derived from the 1994 and 1995 CPSs. The full table has approximately 435,000,000 cells, and is extremely (but realistically!) sparse, principa... |

29 | Efficient stepwise selection in decomposable models
- Deshpande, Garofalakis, et al.
- 2001
Citation Context ...e neighborhood N(R) is taken to consist of all releases defined by decomposable independence graphs obtained by deleting or adding one edge from the graph associated with R. Very efficient algorithms =-=[8]-=- exist for finding N(R). Because any two decomposable graphs can be linked by a sequence of decomposable graphs that differ by exactly one edge [25], the resulting Markov chain is irreducible, as requ... |

29 |
Optimal decomposition by clique separators
- Leimer
- 1993
Citation Context ...roken” into components such that (i) every component is associated with exactly one fixed sub-table in the frontier; and (ii) no released sub-table is “split” between two components. Reducible graphs =-=[35, 26]-=- are generalizations of decomposable graphs. A reducible graph is one that can be at least partially decomposed, although the resulting components of the decomposition may correspond to more than ... |

22 |
Statistical tools for disclosure limitation in multiway contingency tables
- Dobra
- 2002
Citation Context ...ub-tables $R^*(\beta = 3)$ whose frontier is $RF^*(\beta = 3) = \{[1, 7, 8, 11, 12, 13],\ [7, 8, 11, 12, 13],\ [2, 3, 7, 8, 11, 12, 13],\ [2, 3, 4, 7, 8, 11, 13],\ [2, 5, 7, 8, 11, 13],\ [5, 6, 7, 8, 11, 13],\ [2, 4, 7, 9, 11, 13],\ [2, 7, 8, 10, 11, 13]\}$, and which contains five 6-way and two 7-way sub-tables. The release contains a total of 351 sub-tables, representing 4.25% of the 8,191 sub-tables of the full 13-way table. S... |

17 | A decision-theoretic approach to data disclosure problems
- Trottini
- 2001
Citation Context ...s on the associated frontiers RF (t) and UF (t) of releasing the 5–way sub-table indicated by the cursor. Another alternative is release rules that incorporate a suitably defined value of releasing T =-=[14, 24, 35]-=-. One example of value is the accuracy with which the full table T can be reconstructed from R(t) ∪ T by means of iterative proportional fitting [1]. The decision whether to respond to a query need not b... |

14 | Web-based Systems that Disseminate Information from Data but
- Karr, Lee, et al.
- 2001
Citation Context ...e §3.2. Risk Reduction. Numerous strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation =-=[22, 23, 26]-=-, cell suppression [6, 30], perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to ... |

14 |
A Bayesian, Species-sampling-inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment
- Samuels
- 1998
Citation Context ...hat it comprises more than p% of the sum. A typical value in practice is p = 60%. For tables with sample counts, uniqueness in the sample may or may not pose a risk of disclosure. Thus the literature =-=[18, 19, 31]-=- looks at the proportion of sample uniques that are in fact population uniques. If this is small, then the risk is small even in the presence of a large number of counts of 1 or 2 in the table. A majo... |

12 | Disseminating information but protecting confidentiality
- Karr, Lee, et al.
- 2001
Citation Context ...e §3.2. Risk Reduction. Numerous strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation =-=[22, 23, 26]-=-, cell suppression [6, 30], perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to ... |

10 |
Auditing disclosure in multi-way tables with cell suppression: Simplex and shuttle solutions
- Roehrig
- 1999
Citation Context ...other. Reflecting historical usage, a typical risk criterion is accuracy of bounds based on R(t) for sensitive (small count) cells in the full table. Such bounds can be computed using network methods =-=[6, 29]-=- and NISS-developed generalizations of the “shuttle algorithm” [4]. There are also exact techniques for special cases, which are described in detail in §5. Unreleasable Set and Frontier. Whenever an a... |

9 |
Modelling user uncertainty for disclosure risk and data utility
- Trottini, Fienberg
- 2002
Citation Context ...he utility of the released information — the set R of sub-tables — is maximized, subject to an upper bound on the disclosure risk. This risk–utility approach builds on other risk–utility formulations =-=[14, 24, 36, 37]-=- being investigated under the NISS DG project. Risk Criteria and Thresholds. A typical risk criterion is tightness of bounds based on R for small count cells in the full table; a specific example is {... |

7 | Analysis of aggregated data in survey sampling with application to fertilizer/pesticide usage surveys - Lee, Holloman, et al. - 2001 |

6 |
Current Population Survey, 2002. Information available on-line at www.bls.census.gov/cps/cpsmain.htm
- Bureau
Citation Context ...re 1. This prototype is valuable for its engaging, but non-scalable, visualization of the query space. It operates on an 8–dimensional full table of data from the 1993 Current Population Survey (CPS) =-=[2, 3]-=-. The underlying data come from a sample survey carried out by the US Census Bureau, which monthly gathers data on approximately 50,000 households across the US. As seen in the figure, the critical su... |

5 |
Report of the panel on confidentiality and data access
- Duncan, Wolf, et al.
- 1993
Citation Context ...ions as well, must balance concern over confidentiality of their data — in particular, identities of data subjects and sensitive attributes — with their obligation to report information to the public =-=[11]-=-. Advances in information technology threaten confidentiality: for example, powerful capabilities enable record linkage across multiple databases. However, new technologies can also protect confidenti... |

4 | Uniqueness and disclosure risk: Urn models and simulation
- Fienberg, Makov
Citation Context ...hat it comprises more than p% of the sum. A typical value in practice is p = 60%. For tables with sample counts, uniqueness in the sample may or may not pose a risk of disclosure. Thus the literature =-=[18, 19, 31]-=- looks at the proportion of sample uniques that are in fact population uniques. If this is small, then the risk is small even in the presence of a large number of counts of 1 or 2 in the table. A majo... |

4 |
Java Servlet Technology. Information available on-line at java.sun.com/products/servlet
- Sun Microsystems, Inc
Citation Context ... frontier display facility, shown in Figure 4, monitors evolution of RF (t). As shown in Figure 2 the HTML-based user-query processing as well as overall program logic is implemented by Java Servlets =-=[33]-=-; we use Apache’s Tomcat [34] as the servlet engine. Beyond straightforward tasks such as interaction with the database and output generation, the most significant work performed is the real-time comp... |

2 |
A progress report to the National Center for Education Statistics: Disclosure-limited statistical analysis of confidential data to support NSF-sponsored Digital Government grant
- Keller–McNulty, Duncan
- 2001
Citation Context ...s on the associated frontiers RF (t) and UF (t) of releasing the 5–way sub-table indicated by the cursor. Another alternative is release rules that incorporate a suitably defined value of releasing T =-=[14, 24, 35]-=-. One example of value is the accuracy with which the full table T can be reconstructed from R(t) ∪ T by means of iterative proportional fitting [1]. The decision whether to respond to a query need not b... |

1 |
Post randomization for statistical disclosure control: Theory and implementation
- Wolf
- 1998
Citation Context ... reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation [22, 23, 26], cell suppression [6, 30], perturbation =-=[12, 13, 20, 21]-=- and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes) [40], operate directly on the ... |

1 |
Testing protection of suppressed cells in complex additive tables. Unpublished manuscript
- Saalfeld
Citation Context ...s strategies exist for reducing disclosure risk and thereby protecting against identity disclosure for subjects of tabular data [39]. Some of these, such as aggregation [22, 23, 26], cell suppression =-=[6, 30]-=-, perturbation [12, 13, 20, 21] and controlled rounding [5], operate on the full table itself. Other methods, such as data swapping [7] and jittering (addition of noise to numerical data attributes)... |

1 |
Fitting all possible decomposable and graphical models to multiway contingency tables
- Whittaker
- 1984
Citation Context ... log-linear models fitted to the full table [1], searching for OTRs has many of the same characteristics of searching through all possible log-linear models or the subclass of all decomposable models =-=[15, 27, 38]-=-. The second way (§5.2) “solves” (3) heuristically, by ordering the marginal sub-tables of T according to a particular notion of risk, and then constructing the heuristically optimal release R ∗ by gr... |