## Causal inference using the algorithmic Markov condition (2008)

Citations: 11 (11 self)

### BibTeX

@MISC{Janzing08causalinference,
  author = {Dominik Janzing and Bernhard Schölkopf},
  title = {Causal inference using the algorithmic Markov condition},
  year = {2008}
}

### Abstract

Inferring the causal structure that links n observables is usually based upon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when only single observations are present. We develop a theory of how to generate causal graphs explaining similarities between single objects. To this end, we replace the notion of conditional stochastic independence in the causal Markov condition with the vanishing of conditional algorithmic mutual information and describe the corresponding causal inference rules. We explain why a consistent reformulation of causal inference in terms of algorithmic complexity implies a new inference principle that also takes into account the complexity of conditional probability densities, making it possible to select among Markov equivalent causal graphs. This insight provides a theoretical foundation for a heuristic principle proposed in earlier work. We also discuss how to replace Kolmogorov complexity with decidable complexity criteria. This can be seen as an algorithmic analog of replacing the empirically undecidable question of statistical independence with practical independence tests that are based on implicit or explicit assumptions on the underlying distribution.

### Citations

8563 | Elements of Information Theory
- Cover, Thomas
- 2006
Citation Context: ...oted by {0, 1}*. Recall that the Kolmogorov complexity K(s) of a string s ∈ {0, 1}* is defined as the length of the shortest program that generates s using a previously defined universal Turing machine [13, 14, 15, 16, 17, 18, 19]. The conditional Kolmogorov complexity K(t|s) [18] of a string t given another string s is the length of the shortest program that can generate t from s. In order to keep our notation simple we use K(x, y...
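Since K(s) is uncomputable, practical work approximates it from above with a real compressor. The sketch below is an illustration added to this summary, not part of the paper: it uses zlib's compressed length as a stand-in for K(s), and C(s + t) − C(s) as a stand-in for the conditional complexity K(t|s).

```python
import random
import zlib

def C(s: bytes) -> int:
    """Compressed length in bytes: a computable upper-bound proxy for K(s)."""
    return len(zlib.compress(s, 9))

def C_cond(t: bytes, s: bytes) -> int:
    """Proxy for the conditional complexity K(t|s): the extra compressed
    length needed for t once s is already available."""
    return C(s + t) - C(s)

random.seed(0)
shared = bytes(random.randrange(256) for _ in range(2000))  # common "history"
x = shared + bytes(random.randrange(256) for _ in range(100))
y = shared + bytes(random.randrange(256) for _ in range(100))

assert C(x) <= len(x) + 32       # compression never costs much more than the literal string
assert C_cond(y, x) < C(y) // 4  # knowing x makes y far cheaper to describe
```

Like any compression-based proxy, this only upper-bounds the true complexities, but the qualitative comparisons it supports are what the inference rules in the paper need.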

1681 | An Introduction to Kolmogorov Complexity and its Applications
- Li, Vitányi
- 1993
Citation Context: ...oted by {0, 1}*. Recall that the Kolmogorov complexity K(s) of a string s ∈ {0, 1}* is defined as the length of the shortest program that generates s using a previously defined universal Turing machine [13, 14, 15, 16, 17, 18, 19]. The conditional Kolmogorov complexity K(t|s) [18] of a string t given another string s is the length of the shortest program that can generate t from s. In order to keep our notation simple we use K(x, y...

1363 | Quantum Computation and Quantum Information
- Nielsen, Chuang
- 2000
Citation Context: ...given the background information “s1 is a human genetic sequence”, s1 can be... Note, however, that sending quantum systems between the nodes could transmit a kind of information (“quantum information” [26]) that cannot be phrased in terms of bits. It is known that this enables completely new communication scenarios, e.g. quantum cryptography. The relevance of quantum information transfer for causal inf...

1103 | Numerical Modelling of the
- S, ERRAUD, et al.
- 2000
Citation Context: ...tion because we will later introduce an algorithmic version. The fact that conditional irrelevance not only occurs in the context of statistical dependences has been emphasized in the literature (e.g. [4, 1]) in the context of describing abstract properties (like semi-graphoid axioms) of the relation · ⊥ · | ·. We will therefore state the causal Markov condition also in an abstract form that does not refer to...

650 | Quantum theory, the Church-Turing principle and the universal quantum computer
- Deutsch
- 1985
Citation Context: ...represent causal mechanisms by programs written for some universal Turing machine is basically in the spirit of various interpretations of the Church-Turing thesis. One formulation, given by Deutsch [25], states that every process taking place in the real world can be simulated by a Turing machine. Here we assume that the way different systems influence each other by physical signals can be simulate...

546 | Mathematical Methods of Statistics
- Cramér
- 1946
Citation Context: ...n covariant if P(y|x + t) = P(y − t|x) ∀t ∈ R. A conditional distribution P(Y|X) with density... Apart from this, we will also need the following well-known concept from statistical estimation theory [32]: Definition 8 (Fisher information) Let p(x) be a continuously differentiable probability density of P(X) on R. Then the Fisher information is defined as F(P(X)) := ∫ (d/dx ln p(x))² p(x) dx. Then w...
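The Fisher information of Definition 8 can be checked numerically: for a Gaussian with variance σ² it equals 1/σ². The sketch below is an illustration added here (function names are ours, not the paper's); it evaluates the integral on a uniform grid.

```python
import numpy as np

def fisher_information(p, xs):
    """Numerically evaluate F(P) = integral of (d/dx ln p(x))^2 p(x) dx
    on a uniform grid xs."""
    score = np.gradient(np.log(p(xs)), xs)  # d/dx ln p(x)
    return float(np.sum(score**2 * p(xs)) * (xs[1] - xs[0]))

sigma = 2.0
gauss = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
xs = np.linspace(-20.0, 20.0, 20001)

F = fisher_information(gauss, xs)
assert abs(F - 1.0 / sigma**2) < 1e-3  # for N(0, sigma^2): F = 1/sigma^2
```

The grid must extend far into the tails so that the truncated integral captures essentially all of the mass of (d/dx ln p)² p.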

521 | Three approaches to the quantitative definition of information
- Kolmogorov
- 1965
Citation Context: ...oted by {0, 1}*. Recall that the Kolmogorov complexity K(s) of a string s ∈ {0, 1}* is defined as the length of the shortest program that generates s using a previously defined universal Turing machine [13, 14, 15, 16, 17, 18, 19]. The conditional Kolmogorov complexity K(t|s) [18] of a string t given another string s is the length of the shortest program that can generate t from s. In order to keep our notation simple we use K(x, y...

404 | A formal theory of inductive inference
- Solomonoff
- 1964

329 | A theory of program size formally identical to information theory
- Chaitin
- 1975

229 | The Interpretation of Quantum Mechanics - Omnès

226 | On the length of the programs for computing finite binary sequences: Statistical considerations - Chaitin - 1969

205 | Minimum complexity density estimation
- Barron, Cover
- 1991
Citation Context: ...MDL) approaches [21] appear promising from the theoretical point of view due to their close relation to Kolmogorov complexity. We rephrase the minimum complexity estimator described by Barron and Cover [30]: Given a string-valued random variable X and a sample x1, ..., xm drawn from P(X), set P̂m(X) := argmin_Q { K(Q) − Σ_{j=1}^m log Q(xj) }, where Q runs over all probability densities on the probability space u...
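The two-part score in this estimator balances the description length of the model against the codelength of the data under it. The sketch below is a hypothetical illustration (not the paper's construction): the uncomputable K(Q) is replaced by an explicit parameter codelength of (1/2)·log2 m bits per free parameter, a common MDL surrogate, and the score is compared for a fully specified Gaussian versus one with a fitted mean.

```python
import math
import random

random.seed(0)
m = 500
data = [random.gauss(3.0, 1.0) for _ in range(m)]  # sample with true mean 3

def data_bits(xs, mu, sigma=1.0):
    """-sum_j log2 Q(x_j) for a Gaussian Q with the given parameters."""
    ll = sum(-0.5 * math.log(2 * math.pi * sigma**2)
             - (x - mu)**2 / (2 * sigma**2) for x in xs)
    return -ll / math.log(2)

mu_hat = sum(data) / m

# Two-part scores: surrogate for K(Q), plus the data codelength under Q.
score_fixed = 0.0 + data_bits(data, 0.0)                     # fully specified Q: K(Q) ~ 0
score_fitted = 0.5 * math.log2(m) + data_bits(data, mu_hat)  # one fitted parameter

# Here the likelihood gain of fitting the mean dwarfs the model-cost penalty.
assert score_fitted < score_fixed
```

With data actually drawn from the simpler model, the same score would typically prefer the fixed Q, since the small likelihood gain from fitting would not repay the (1/2)·log2 m penalty.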

204 | The Direction of Time
- Reichenbach
Citation Context: ...n two objects means that one has causally influenced the other or there is a third one influencing both. The statistical version of this principle is part of Reichenbach’s principle of the common cause [24], stating that statistical dependences between random variables X and Y are always due to at least one of the following three types of causal links: (1) X is a cause of Y, or (2) vice versa, or (3) there ...

200 | Scientific Explanation and the Causal Structure of the World - Salmon - 1984

186 | The similarity metric
- Li, Chen, et al.
- 2003
Citation Context: ...is is in contrast to approaches from the literature to measure similarity versus differences of single objects that we briefly review now. To measure differences between single objects, e.g. pictures [22, 23], one defines the information distance E(x, y) between the two corresponding strings as the length of the shortest program that computes x from y and y from x. It can be shown [22] that E(x, y) log = ...
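In practice, the information distance is approximated by the normalized compression distance (NCD), with a real compressor in place of K. The sketch below is an illustration added to this summary (strings and names are ours); it shows that a lightly edited text lies closer to the original than random noise does.

```python
import random
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: a computable stand-in for the
    normalized information distance built from E(x, y)."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

text = b"the quick brown fox jumps over the lazy dog " * 40
related = text.replace(b"fox", b"cat")   # small edit of the same text

random.seed(1)
noise = bytes(random.randrange(256) for _ in range(len(text)))

assert ncd(text, related) < ncd(text, noise)  # similar objects lie closer
```

NCD values are only approximately in [0, 1] because real compressors are imperfect, but the ordering of distances is usually stable.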

142 | The Minimum Description Length Principle
- Grünwald
- 2007
Citation Context: ...x|z) + K(y|x, K(x|z), z) (5) The most important notion in this paper will be the algorithmic mutual information, measuring the amount of algorithmic information that two objects have in common. Following [21] we define: Definition 2 (algorithmic mutual information) Let x, y be two strings. Then the algorithmic mutual information of x, y is I(x : y) := K(y) − K(y|x*). The mutual information is the numb...
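A crude, computable estimate of this quantity can be obtained by compression. The sketch below is an illustration added here, not the paper's method: it uses the symmetric form K(x) + K(y) − K(x, y), which agrees with K(y) − K(y|x*) up to logarithmic terms, with zlib's compressed length in place of K.

```python
import random
import zlib

def C(s: bytes) -> int:
    """Compressed length: a computable upper-bound proxy for K(s)."""
    return len(zlib.compress(s, 9))

def algorithmic_mi(x: bytes, y: bytes) -> int:
    """Compression-based stand-in for I(x : y), via the symmetric form
    K(x) + K(y) - K(x, y)."""
    return C(x) + C(y) - C(x + y)

random.seed(2)
common = bytes(random.randrange(256) for _ in range(1500))   # shared material
x = common + bytes(random.randrange(256) for _ in range(200))
y = common + bytes(random.randrange(256) for _ in range(200))
z = bytes(random.randrange(256) for _ in range(1700))        # nothing in common with x

# Objects sharing material have large estimated mutual information.
assert algorithmic_mi(x, y) > algorithmic_mi(x, z)
```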

140 | Independence properties of directed Markov fields
- Lauritzen, Dawid, et al.
- 1990
Citation Context: ...ion that are known to coincide under some technical condition (see Lemma 1). We will first introduce the following version, which is sometimes referred to as the parental or the local Markov condition [3]. To this end, we introduce the following notations. PAj is the set of parents of Xj and NDj the set of non-descendants of Xj except itself. If S, T, R are sets of random variables, S ⊥ T |R means S...

80 | A Bayesian Approach to Causal Discovery
- Heckerman, Meek, et al.
- 1999
Citation Context: ...es” the Markov kernels for the different nodes independently according to some density on the parameter space. The above “zero Lebesgue measure argument” is close to the spirit of Bayesian approaches [6], where priors on the set of Markov kernels are specified for every possible hypothetical causal DAG and causal inference is performed by maximizing posterior probabilities for hypothetical DAGs, given t...

71 | The Philosophy of Space and Time - Reichenbach - 1958

68 | A new challenge for compression algorithms: Genetic sequences
- Grumbach, Tahi
- 1994
Citation Context: ...tions we have to develop practical inference rules. Compression algorithms have already been developed that are intended to approximate, for instance, the algorithmic information of genetic sequences [37, 38]. Chen et al. [38] constructed a “conditional compression scheme” to approximate conditional Kolmogorov complexity and applied it to the estimation of the algorithmic mutual information between two genet...

67 | A compression algorithm for DNA sequences and its applications in genome comparison
- Chen, Kwong, et al.
- 2000
Citation Context: ...tions we have to develop practical inference rules. Compression algorithms have already been developed that are intended to approximate, for instance, the algorithmic information of genetic sequences [37, 38]. Chen et al. [38] constructed a “conditional compression scheme” to approximate conditional Kolmogorov complexity and applied it to the estimation of the algorithmic mutual information between two ge...

49 | A preliminary report on a general theory of inductive inference
- Solomonoff
- 1960

49 | Algorithmic statistics - Gács, Tromp, et al.

49 | Strong completeness and faithfulness in Bayesian networks
- Meek
- 1995
Citation Context: ...atural parameterization of the set of Markovian distributions in a finite dimensional space. The non-faithful distributions form a submanifold of lower dimension, i.e., a set of Lebesgue measure zero [5]. They therefore almost surely don’t occur if we assume that “nature chooses” the Markov kernels for the different nodes independently according to some density on the parameter space. The above “zero...

49 | Entropy and the complexity of the trajectories of a dynamical system - Brudno - 1983

48 | Logical Depth and Physical Complexity
- Bennett
- 1988
Citation Context: ...ded complexity could be a challenge for the future. There are several reasons to believe that taking into account computational complexity can provide additional hints on the causal structure: Bennett [39, 40, 41], for instance, has argued that the logical depth of an object echoes in some sense its history. The former is, roughly speaking, defined as follows. Let x be a string that describes the object and s b...

30 | Chain letters and evolutionary histories
- Bennett, Li, et al.
- 2003
Citation Context: ...1] is that the former is also about causal relations among single objects and does not necessarily require sampling. Assume, for instance, that the comparison of two texts shows similarities (see e.g. [11]) such that the author of the text that appeared later is accused of having copied it from the other, or both are accused of having copied from a third one. The statement that the texts are similar could...

30 | Kernel methods for measuring independence
- Gretton, Herbrich, et al.
- 2005
Citation Context: ...me previously defined pair, it is well-justified to reject independence. The same holds if such correlations are detected for f, g in some previously defined, sufficiently small set of functions (cf. [36]). However, if this is not the case, we can never be sure that there is not some pair of arbitrarily complex functions f, g that are correlated with respect to the true distribution. Likewise, if we h...

24 | How to define complexity in physics, and why
- Bennett
- 1990
Citation Context: ...ded complexity could be a challenge for the future. There are several reasons to believe that taking into account computational complexity can provide additional hints on the causal structure: Bennett [39, 40, 41], for instance, has argued that the logical depth of an object echoes in some sense its history. The former is, roughly speaking, defined as follows. Let x be a string that describes the object and s b...

22 | On universal prediction and Bayesian confirmation - Hutter - 2007

21 | Toward a mathematical definition of life - Chaitin - 1979

14 | Discovery by minimal length encoding: a case study in molecular evolution - Milosavljevic, Jurka - 1993

14 | On the nature and origin of complexity in discrete, homogeneous, locally-interacting systems
- Bennett
- 1986
Citation Context: ...ed complexity could be a challenge for the future. There are several reasons to believe that taking into account computational complexity can provide additional hints on the causal structure: Bennett [39, 40, 41], for instance, has argued that the logical depth of an object echoes in some sense its history. The former is, roughly speaking, defined as follows. Let x be a string that describes the object and s ...

13 | Causal inference using nonnormality
- Kano, Shimizu
- 2003
Citation Context: ...probability densities. There is also another recent proposal for new inference rules that refers to a related simplicity assumption, though formally quite different from the ones above. The authors of [10] observe that there are joint distributions of X1, ..., Xn that can be explained by a linear model with additive non-Gaussian noise for one causal direction but require non-linear causal influence for t...

11 | On Modeling INDCCA Security in Cryptographic Protocols. Cryptology ePrint Archive, Report 2003/024
- Hofheinz, Muller-Quade, et al.
- 2003
Citation Context: ...ROR-CCA-secure (Real or Random under Chosen Ciphertext Attacks) if there is no efficient method to decide whether a text is random or the encrypted version of some known text without knowing the key [12]. Given that there are ROR-CCA-secure schemes (which is unknown but believed by cryptographers), we have a causal relation leading to similarities that are not detected by any kind of simple counting st...

11 | Causal inference by choosing graphs with most plausible Markov kernels
- Sun, Janzing, et al.
- 2006
Citation Context: ...inference described in this paper. The differences, however, which are the main subject of Sections 3 to 4, can hardly be represented in the table. 1.2 Seeking for new statistical inference rules In [8] and [9] we have proposed causal inference rules that are based on the idea that the factorization of P(cause, effect) into P(effect|cause) and P(cause) typically leads to simpler terms than the “arti...

10 | Kolmogorov complexity and information theory (with an interpretation in terms of questions and answers) - Grünwald, Vitányi

8 | Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/∼jan
- Lemeire, Dirkx
- 2006
Citation Context: ...equivalent to saying that the shortest description of P(X1, ..., Xn) is given by concatenating the descriptions of the Markov kernels, a postulate that has already been formulated by Lemeire and Dirkx [29]: Postulate 7 (algorithmic independence of statistical properties) A causal hypothesis G (i.e., a DAG) is only acceptable if the shortest description of the joint density P is given by a concatenat...

8 | On causally asymmetric versions of Occam’s Razor and their relation to thermodynamics
- Janzing
- 2007
Citation Context: ...plexity of the latter can exceed the complexity of the former only by a constant term. We now focus on non-stationary time series. To motivate the general idea we first present an example described in [31]. Consider a random walk of a particle on Z starting at z ∈ Z. In every time step the probability is q to move one site to the right and (1 − q) to move to the left. Let Xj with j = 0, 1, ... be the random v...
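The random walk in this context is easy to reproduce numerically. The sketch below is an illustration added to this summary: after j steps the position is z + (2B − j) with B ~ Binomial(j, q), so the sample mean should match E[Xj] = z + j(2q − 1).

```python
import random
from collections import Counter

# Simulate the walk: start at z, step +1 with probability q, else -1.
random.seed(0)
z, q, j, n = 0, 0.7, 20, 100_000

def walk() -> int:
    x = z
    for _ in range(j):
        x += 1 if random.random() < q else -1
    return x

positions = Counter(walk() for _ in range(n))
mean = sum(k * c for k, c in positions.items()) / n

assert abs(mean - (z + j * (2 * q - 1))) < 0.1  # E[X_j] = z + j(2q - 1) = 8
```

The marginal of Xj is a simple shifted binomial, whereas describing the walk against the direction of time mixes this marginal into the backward kernels, which is the asymmetry the cited example exploits.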

8 | Distinguishing between cause and effect - Mooij, Janzing - 2010

8 | Quasi-order of clocks and their synchronism and quantum bounds for copying timing information
- Janzing, Beth
Citation Context: ...t p(x) be a continuously differentiable probability density of P(X) on R. Then the Fisher information is defined as F(P(X)) := ∫ (d/dx ln p(x))² p(x) dx. Then we have the following Lemma (see Lemma 1 in [33], showing the statement in a more general setting that involves also quantum stochastic maps): Lemma 11 (monotonicity under covariant maps) Let P(X, Y ) be a joint distribution such that P(Y |X) is tra...

7 | Detecting the direction of causal time series - Peters, Janzing, et al. - 2009

3 | Causal reasoning by evaluating the complexity of conditional densities with kernel methods
- Sun, Janzing, et al.
- 2008
Citation Context: ...ce described in this paper. The differences, however, which are the main subject of Sections 3 to 4, can hardly be represented in the table. 1.2 Seeking for new statistical inference rules In [8] and [9] we have proposed causal inference rules that are based on the idea that the factorization of P(cause, effect) into P(effect|cause) and P(cause) typically leads to simpler terms than the “artificial” ...

3 | On universal transfer learning - Mahmud - 2007

3 | Complementarity between extractable mechanical work, accessible entanglement, and ability to act as a reference frame, under arbitrary superselection rules. http://arxiv.org/abs/quant-ph/0501121
- Vaccaro, Anselmi, et al.
Citation Context: ...asure µ. Then the reference information is given by IG := H(∫_G P(X^g) dµ(g)) − ∫_G H(P(X^g)) dµ(g) = H(∫_G P(X^g) dµ(g)) − H(P(X)). The name “reference information” has been used in [34] in a slightly different context, where this information occurred as the value of a physical system to communicate a reference system (e.g. spatial or temporal), where G describes, for instance, transla...

2 | Quantum thermodynamics with missing reference frames: Decompositions of free energy into non-increasing components
- Janzing
Citation Context: ...n I(X : Z) if we introduce a G-valued random variable Z whose values indicate which transformation g has been applied. One can thus show that IG is non-increasing with respect to every G-covariant map [34, 35]. The following model describes a link between inferring causal directions by preferring covariant conditionals to preferring directions with algorithmically independent Markov kernels. Consider first...

2 | Inference of graphical causal models: Representing the meaningful information of probability distributions - Lemeire, Steenhaut - 2010

2 | On the entropy production of time series with unidirectional linearity - Janzing
