Results 1-10 of 31
Algebraic Algorithms for Sampling from Conditional Distributions
Annals of Statistics, 1995
Cited by 192 (16 self)
Abstract:
We construct Markov chain algorithms for sampling from discrete exponential families conditional on a sufficient statistic. Examples include generating tables with fixed row and column sums and higher dimensional analogs. The algorithms involve finding bases for associated polynomial ideals and so an excursion into computational algebraic geometry.
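In the two-way case, the basic moves of such a Markov chain have a simple concrete form: adding +1/-1 around a 2x2 subrectangle changes cell counts while preserving every row and column sum. A minimal sketch in Python follows; the function names are invented here, and a real application would add a Metropolis acceptance step and, for higher-dimensional tables, the Gröbner-basis computation the abstract refers to:

```python
import random

def basic_move(table, rng=random):
    # Pick two rows and two columns; add +1/-1 on the four crossing
    # cells so that every row sum and column sum is unchanged.
    i, j = rng.sample(range(len(table)), 2)
    k, l = rng.sample(range(len(table[0])), 2)
    s = rng.choice([1, -1])
    new = [row[:] for row in table]
    new[i][k] += s
    new[i][l] -= s
    new[j][k] -= s
    new[j][l] += s
    # Reject any move that would create a negative count.
    if min(new[i][k], new[i][l], new[j][k], new[j][l]) < 0:
        return table
    return new

def walk(table, steps, seed=0):
    # Random walk over tables sharing the margins of the start table.
    rng = random.Random(seed)
    t = [row[:] for row in table]
    for _ in range(steps):
        t = basic_move(t, rng)
    return t
```

Every table visited by `walk` has the same row and column sums as the starting table, which is exactly the conditioning on the sufficient statistic described above.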
Three Centuries of Categorical Data Analysis: Loglinear Models and Maximum Likelihood Estimation
Cited by 6 (3 self)
Abstract:
The common view of the history of contingency tables is that it begins in 1900 with the work of Pearson and Yule, but it extends back at least into the 19th century. Moreover, it remains an active area of research today. In this paper we give an overview of this history, focusing on the development of loglinear models and their estimation via the method of maximum likelihood. S. N. Roy played a crucial role in this development with two papers coauthored with his students S. K. Mitra and Marvin Kastenbaum, at roughly the temporal midpoint of this development. Then we describe a problem that eluded Roy and his students: the implications of sampling zeros for the existence of maximum likelihood estimates for loglinear models. Understanding the problem of nonexistence is crucial to the analysis of large sparse contingency tables. We introduce some relevant results from the application of algebraic geometry to the study of this statistical problem.
On the Index of Dissimilarity for Lack of Fit in Log Linear Models
Cited by 4 (0 self)
Abstract:
The index of dissimilarity, often denoted by Delta, is commonly used, especially in social science and with large datasets, to describe the lack of fit of models for categorical data. In this paper the definition and sampling properties of the index are investigated for general loglinear models. It is argued that in some applications a standardized version of the index is appropriate for interpretation. A simple, approximate variance formula is derived for the index, whether standardized or not. A simple bias reduction formula is also given. The accuracy of these formulae and of confidence intervals based upon them is investigated in a simulation study based on large-scale social mobility data. Key words: bias reduction; dissimilarity index; extended hypergeometric; folded normal; iterative proportional fitting; iterative scaling; stratified sampling.
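The index itself has a one-line definition: half the total absolute difference between observed and fitted cell proportions, interpretable as the fraction of cases that would have to move cells for the model to fit exactly. A minimal sketch (the function name is invented here; the paper's standardized and bias-reduced variants are not shown):

```python
def dissimilarity_index(observed, fitted):
    # Delta = 0.5 * sum over cells of |observed proportion - fitted
    # proportion|; ranges from 0 (perfect fit) to just under 1.
    n = sum(observed)
    m = sum(fitted)
    return 0.5 * sum(abs(o / n - f / m) for o, f in zip(observed, fitted))
```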
Sequences of regressions and their independences
2012
Cited by 4 (1 self)
Abstract:
Ordered sequences of univariate or multivariate regressions provide statistical models for analysing data from randomized, possibly sequential interventions, from cohort or multi-wave panel studies, but also from cross-sectional or retrospective studies. Conditional independences are captured by what we name regression graphs, provided the generated distribution shares some properties with a joint Gaussian distribution. Regression graphs extend purely directed, acyclic graphs by two types of undirected graph, one type for components of joint responses and the other for components of the context vector variable. We review the special features and the history of regression graphs, prove criteria for Markov equivalence and discuss the notion of simpler statistical covering models. Knowledge of Markov equivalence provides alternative interpretations of a given sequence of regressions, is essential for machine learning strategies and permits the use of the simple graphical criteria of regression graphs on graphs for which the corresponding criteria are in general more complex. Under the known conditions that a Markov equivalent directed acyclic graph exists for any given regression graph, we give a polynomial-time algorithm to find one such graph.
Sparse Contingency Tables and High-Dimensional Log-Linear Models for Alternative Splicing
in Full-Length cDNA Libraries, Research Report 132, Swiss Federal Institute of Technology, 2006
Cited by 3 (0 self)
Abstract:
Corinne Dahinden is a PhD student at the Seminar für Statistik, ETH Zürich, CH-8092 Zürich, ...
LINEAR MODELS ANALYSIS OF INCOMPLETE MULTIVARIATE CATEGORICAL DATA
1972
Cited by 1 (0 self)
Abstract:
This research deals with experiments or surveys producing multivariate categorical data which is incomplete, in the sense that not all variables of interest are measured on every subject or element of the sample. For the most part, incompleteness is taken to arise by design, rather than by random failure of the measurement process. In these circumstances, one can often assume that counts derived from appropriate disjoint subsets of the data arise from independent multinomial distributions with linearly related parameters. Best asymptotically normal estimates of these parameters may be determined by maximizing the likelihood of the observations or by minimizing Pearson's X², Neyman's X², ...
Data Engineering
Cited by 1 (0 self)
Abstract:
A growing number of applications need access to video data stored in digital form on secondary storage devices (e.g., video-on-demand, multimedia messaging). As a result, video servers that are responsible for the storage and retrieval, at fixed rates, of hundreds of videos from disks are becoming increasingly important. Since video data tends to be voluminous, several disks are usually used in order to store the videos. A challenge is to devise schemes for the storage and retrieval of videos that distribute the workload evenly across disks, reduce the cost of the server and, at the same time, provide good response times to client requests for video data. In this paper, we present schemes that are based on striping videos (fine-grained as well as coarse-grained) across disks in order to effectively utilize disk bandwidth. For the schemes, we show how an optimal-cost server architecture can be determined if data for a certain prespecified number of videos is to be concurrently retrieved...
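The fine-grained variant of striping can be sketched as a round-robin block-to-disk mapping: consecutive blocks of a video land on consecutive disks, so a sequential playback load spreads evenly over the whole array. The function below is an invented illustration of that idea, not the paper's actual scheme:

```python
def stripe_layout(num_blocks, num_disks, start_disk=0):
    # Round-robin (fine-grained) striping: block b of a video is
    # placed on disk (start_disk + b) mod num_disks, so no disk
    # holds more than one block more than any other.
    return [(b, (start_disk + b) % num_disks) for b in range(num_blocks)]
```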
Bulletin of the Technical Committee on Data Engineering, December 1997
 Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, volume 282 of SIGMOD Record
1997
Abstract:
In this paper we describe and evaluate several popular techniques for data reduction. Historically, the primary need for data reduction has been internal to a database system, in a cost-based query optimizer. The need is for the query optimizer to estimate the cost of alternative query plans cheaply: clearly the effort required to do so must be much smaller than the effort of actually executing the query, and yet the cost of executing any query plan depends strongly upon the numerosity of specified attribute values and the selectivities of specified predicates. To address these query optimizer needs, many databases keep summary statistics. Sampling techniques have also been proposed.

More recently, there has been an explosion of interest in the analysis of data in warehouses. Data warehouses can be extremely large, yet obtaining answers quickly is important. Often, it is quite acceptable to sacrifice the accuracy of the answer for speed. Particularly in the early, more exploratory, stages of data analysis, interactive response times are critical, while tolerance for approximation errors is quite high. Data reduction, thus, becomes a pressing need.

The query optimizer need for estimates was completely internal to the database, and the quality of the estimates used was observable by a user only very indirectly, in terms of the performance of the database system. On the other hand, the more recent data analysis needs for approximate answers directly expose the user to the estimates obtained. Therefore the nature and quality of these estimates becomes more salient. Moreover, to the extent that these estimates are being used as part of a data analysis task, there may often be "byproducts" such as, say, a hierarchical clustering of data, that are of value to the analyst in an...
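The sampling-based estimation mentioned above can be illustrated with a toy uniform-sample estimator (names invented here; production systems use far more careful estimators, summary statistics, and confidence bounds):

```python
import random

def estimate_count(rows, predicate, sample_size, seed=0):
    # Scale up the hit count observed in a uniform random sample --
    # the kind of cheap approximate answer a cost-based optimizer
    # or an exploratory warehouse query can tolerate.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    hits = sum(1 for r in sample if predicate(r))
    return hits * len(rows) / len(sample)
```

With the full table as the sample the estimate is exact; smaller samples trade accuracy for speed, which is precisely the trade-off the abstract describes.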
Data Engineering
Abstract:
XML is fast becoming the intergalactic data speak alphabet for data and information exchange that hides the heterogeneity among the components of loosely-coupled, distributed systems and provides the glue that allows the individual components to take part in the loosely integrated system. Since much of this data is currently stored in relational database systems, simplifying the transformation of this data from and to XML in general, and from and to the agreed-upon exchange schema specifically, is an important feature that should improve the productivity of the programmer and the efficiency of this process. This article provides an overview of the features that are needed to provide access via HTTP and XML and presents the approach taken in Microsoft SQL Server. Keywords: loosely-coupled distributed system architectures, XML, relational database systems