Results 1-10 of 23
A sequential importance sampling algorithm for generating random graphs with prescribed degrees
2006
Cited by 23 (0 self)
Random graphs with a given degree sequence are a useful model capturing several features absent in the classical Erdős-Rényi model, such as dependent edges and non-binomial degrees. In this paper, we use a characterization due to Erdős and Gallai to develop a sequential algorithm for generating a random labeled graph with a given degree sequence. The algorithm is easy to implement and allows surprisingly efficient sequential importance sampling. Applications are given, including simulating a biological network and estimating the number of graphs with a given degree sequence.
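The Erdős-Gallai characterization mentioned in this abstract can be checked directly. The following is a minimal sketch of that test only, not the paper's sampling algorithm; the function name `is_graphical` is hypothetical. A non-negative integer sequence d_1 >= ... >= d_n with even sum is the degree sequence of a simple graph exactly when, for every k, d_1 + ... + d_k <= k(k-1) + sum over i > k of min(d_i, k).

```python
def is_graphical(degrees):
    """Return True if `degrees` is the degree sequence of some simple graph.

    Implements the Erdős-Gallai inequalities; `is_graphical` is an
    illustrative name, not taken from the paper.
    """
    d = sorted(degrees, reverse=True)
    n = len(d)
    # Degrees must be non-negative and sum to an even number (handshake lemma).
    if any(x < 0 for x in d) or sum(d) % 2 != 0:
        return False
    for k in range(1, n + 1):
        lhs = sum(d[:k])
        rhs = k * (k - 1) + sum(min(x, k) for x in d[k:])
        if lhs > rhs:
            return False
    return True
```

For instance, `[3, 3, 3, 3]` (the complete graph on four vertices) passes the test, while `[3, 3, 1, 1]` fails at k = 2.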
A Sampling-Based Approach to Information Recovery
Cited by 12 (5 self)
There has been a recent resurgence of interest in research on noisy and incomplete data. Many applications require information to be recovered from such data. Ideally, an approach for information recovery should have the following features. First, it should be able to incorporate prior knowledge about the data, even if such knowledge is in the form of complex distributions and constraints for which no closed-form solutions exist. Second, it should be able to capture complex correlations and quantify the degree of uncertainty in the recovered data, and further support queries over such data. The database community has developed a number of approaches for information recovery, but none is general enough to offer all of the above features. To overcome these limitations, we take a significantly more general approach to information recovery based on sampling. We apply sequential importance sampling, a technique from statistics that works for complex distributions and dramatically outperforms naive sampling when data is constrained. We illustrate the generality and efficiency of this approach in two application scenarios: cleansing RFID data, and recovering information from published data that has been summarized and randomized for privacy.
Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study
Cited by 9 (1 self)
Genome-wide association studies (GWAS) aim at discovering the association between genetic variations, particularly single-nucleotide polymorphisms (SNPs), and common diseases, and have been well recognized as one of the most important and active areas in biomedical research. Also renowned is the privacy implication of such studies, which has been brought into the limelight by the recent attack proposed by Homer et al. Homer's attack demonstrates that it is possible to identify a participant of a GWAS by analyzing the allele frequencies of a large number of SNPs. Such a threat, unfortunately, was found in our research to be significantly understated. In this paper, we demonstrate that individuals can actually be identified from even a relatively small set of statistics, such as those routinely published in GWAS papers. We present two attacks. The first one extends Homer's attack with a much more powerful test statistic, based on the correlations among different SNPs described by the coefficient of determination (r²). This attack can determine the presence of an individual in a GWAS from the statistics related to a couple of hundred SNPs. The second attack can lead to complete disclosure of hundreds of the participants' SNPs, by analyzing the information derived from the published statistics. We also found that those attacks can succeed even when the precision of the statistics is low and part of the data is missing, which limits the effectiveness of such simple defenses. We evaluated our attacks on real human genomes from the International HapMap project, and concluded that such threats are completely realistic.
Algebraic statistics and contingency table problems: Log-linear models, likelihood estimation and disclosure limitation
IMA Volumes in Mathematics and its Applications: Emerging Applications of Algebraic Geometry, 2008
Cited by 5 (2 self)
Contingency tables have provided a fertile ground for the growth of algebraic statistics. In this paper we briefly outline some features of this work and point to open research problems. We focus on the problem of maximum likelihood estimation for log-linear models and a related problem of disclosure limitation to protect the confidentiality of individual responses. Risk of disclosure has often been measured either formally or informally in terms of information contained in marginal tables linked to a log-linear model analysis and has focused on disclosure potential of small cell counts, especially those equal to 1 or 2. One way to assess risk is to compute bounds for cell entries given a set of released marginals. Both of these methodologies become complicated for large sparse tables. This paper revisits the problem of computing bounds for cell entries and picks up on a theme first suggested in Fienberg [21] that there is an intimate link between the ideas on bounds and the existence of maximum likelihood estimates, and shows how these ideas can be made rigorous through the underlying mathematics of the same geometric/algebraic framework. We illustrate the linkages through a series of examples. We also discuss the more complex problem of releasing marginal and conditional information. We illustrate the statistical features of the methodology on two examples and then conclude with a series of open problems.
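As a small illustration of bounds for cell entries given released marginals, the simplest case is a two-way table whose row and column totals are published. The sketch below computes the classical Fréchet bounds, max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j), which are sharp in this two-way setting. It is not the paper's method for general sets of marginals, and the function name `frechet_bounds` is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """Return a matrix of (lower, upper) bounds for each cell of a two-way table.

    Assumes only the one-way margins are released; `frechet_bounds` is an
    illustrative name, not taken from the paper.
    """
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same grand total")
    return [
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]
```

With row totals `[3, 7]` and column totals `[4, 6]`, the bottom-right cell is bounded between 3 and 6. A narrow interval around a small count is the kind of situation that signals disclosure risk.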
Markov chains, quotient ideals and connectivity with positive margins
In Algebraic and Geometric Methods in Statistics, 2010
Negative Examples for Sequential Importance Sampling of Binary Contingency Tables. Submitted. Available from Mathematics arXiv math.ST/0606650
Cited by 5 (0 self)
The sequential importance sampling (SIS) algorithm has gained considerable popularity for its empirical success. One of its noted applications is to the binary contingency tables problem, an important problem in statistics, where the goal is to estimate the number of 0/1 matrices with prescribed row and column sums. We give a family of examples in which the SIS procedure, if run for any subexponential number of trials, will underestimate the number of tables by an exponential factor. This result holds for any of the usual design choices in the SIS algorithm, namely the ordering of the columns and rows. These are apparently the first theoretical results on the efficiency of the SIS algorithm for binary contingency tables. Finally, we present experimental evidence that the SIS algorithm is efficient for row and column sums that are regular. Our work is a first step in determining rigorously the class of inputs for which SIS is effective.
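To make the counting problem concrete, here is a minimal sketch of an SIS estimator for the number of 0/1 matrices with given row and column sums. It uses a deliberately simple proposal as an assumption of this example: each column's ones are placed uniformly among rows that still have capacity. This is not the proposal analyzed in the paper, and the names `sis_weight` and `estimate_table_count` are hypothetical. Each trial returns the inverse of its proposal probability, or 0 if it fails to produce a valid table, so the average over trials is an unbiased estimate of the count.

```python
import random
from math import comb


def sis_weight(row_sums, col_sums, rng):
    """Run one SIS trial and return its importance weight (0 if it fails)."""
    remaining = list(row_sums)
    weight = 1
    for c in col_sums:
        # Rows that can still receive a one in this column.
        available = [i for i, r in enumerate(remaining) if r > 0]
        if len(available) < c:
            return 0
        # The column is chosen uniformly among comb(len(available), c) options.
        weight *= comb(len(available), c)
        for i in rng.sample(available, c):
            remaining[i] -= 1
    # The trial yields a valid table only if every row sum is met exactly.
    return weight if all(r == 0 for r in remaining) else 0


def estimate_table_count(row_sums, col_sums, trials=1000, seed=0):
    """Average the SIS weights over `trials` independent runs."""
    rng = random.Random(seed)
    total = sum(sis_weight(row_sums, col_sums, rng) for _ in range(trials))
    return total / trials
```

For row and column sums that are all 1, every trial succeeds with the same weight, so the estimate equals n! exactly. Less regular margins produce variable weights, which is where the choice of proposal starts to matter.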
The Generalized Shuttle Algorithm
2008
Cited by 4 (1 self)
Bounds for the cell counts in multi-way contingency tables given a set of marginal totals arise in a variety of different statistical contexts including disclosure limitation. We describe the Generalized Shuttle Algorithm for computing integer bounds of multi-way contingency tables induced by arbitrary linear constraints on cell counts. We study the convergence properties of our method by exploiting the theory of discrete graphical models and demonstrate the sharpness of the bounds for some specific settings. We give a procedure for adjusting these bounds to the sharp bounds that can also be employed to enumerate all tables consistent with the given constraints. Our algorithm for computing sharp bounds and enumerating multi-way contingency tables is the first approach that relies exclusively on the unique structure of the categorical data and does not employ any other optimization techniques such as linear or integer programming. We illustrate how our algorithm can be used to compute exact p-values of goodness-of-fit tests in exact conditional inference. Many statistical research problems involve working with sets of multi-way contingency tables defined by a set of constraints (e.g., marginal totals or structural zeroes).
Datamining and Disclosure Limitation for Categorical Statistical Databases
Proceedings of Workshop on Privacy and Security Aspects of Data Mining, Fourth IEEE International Conference on Data Mining (ICDM 2004), 2004
Cited by 4 (2 self)
There are many distinctions between statistical research databases and those arising in commercial or administrative settings, and thus different issues regarding confidentiality and privacy protection on the one hand and data access and the use of databases on the other. Data integration across multiple databases raises issues in both domains, especially with regard to protection against intruders. This paper highlights some methods developed to limit possible disclosure of confidential information from statistical databases while at the same time publicly releasing sufficient information to allow users, whether dataminers or other more traditional statistical analysts, to reach proper statistical conclusions from their analyses. The disclosure limitation tools discussed include: data perturbation and simulation, partial releases, and sampling, with a special focus on partial release of data from multidimensional cross-classifications or contingency tables.
Normal Toric Ideals of Low Codimension
2008
Cited by 1 (0 self)
Every normal toric ideal of codimension two is minimally generated by a Gröbner basis with squarefree initial monomials. A polynomial time algorithm is presented for checking whether a toric ideal of fixed codimension is normal.