## Learning Markov Network Structure with Decision Trees

Citations: 7 (5 self)

### BibTeX

@MISC{Lowd_learningmarkov,
  author = {Daniel Lowd and Jesse Davis},
  title  = {Learning Markov Network Structure with Decision Trees},
  year   = {}
}


### Abstract

Traditional Markov network structure learning algorithms perform a search for globally useful features. However, these algorithms are often slow and prone to finding local optima due to the large space of possible structures. Ravikumar et al. [1] recently proposed the alternative idea of applying L1 logistic regression to learn a set of pairwise features for each variable, which are then combined into a global model. This paper presents the DTSL algorithm, which uses probabilistic decision trees as the local model. Our approach has two significant advantages: it is more efficient, and it is able to discover features that capture more complex interactions among the variables. Our approach can also be seen as a method for converting a dependency network into a consistent probabilistic model. In an extensive empirical evaluation on 13 datasets, our algorithm obtains comparable accuracy to three standard structure learning algorithms while running 1-4 orders of magnitude faster.

Keywords: Markov networks; structure learning; decision trees; probabilistic methods
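The abstract's core idea — learn a probabilistic decision tree per variable, then convert each tree into conjunctive features for a global Markov network — can be sketched in a few lines. This is a hypothetical illustration of the path-to-feature step; the data structures and names below are not the authors' actual code:

```python
# Hypothetical sketch of DTSL's tree-to-feature conversion: each
# root-to-leaf path in a probabilistic decision tree for a variable X_i
# becomes one conjunctive feature of the global log-linear model.

def tree_to_features(node, path=()):
    """Recursively collect one conjunctive feature per leaf."""
    if "split" not in node:                      # leaf: emit the path so far
        return [{"conditions": path, "prob": node["prob"]}]
    var = node["split"]
    feats = []
    feats += tree_to_features(node["true"],  path + ((var, 1),))
    feats += tree_to_features(node["false"], path + ((var, 0),))
    return feats

# Tiny made-up tree predicting X0 from splits on X1 and X2.
tree = {"split": "X1",
        "true":  {"prob": 0.9},
        "false": {"split": "X2",
                  "true":  {"prob": 0.4},
                  "false": {"prob": 0.1}}}

for f in tree_to_features(tree):
    print(f["conditions"], f["prob"])
```

Each emitted condition list is one conjunctive feature; the leaf probabilities would be discarded and replaced by weights learned globally (e.g. by pseudo-likelihood optimization).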

### Citations

2868 | UCI Repository of Machine Learning Databases - Merz, Murphy - 1996
Citation Context: ...rtificial deterministic dependencies. Table III describes the characteristics of each dataset. Datasets are listed in increasing order by number of variables. From the UCI machine learning repository [15] we used: KDDCup 2000 data, MSNBC anonymous web data, MSWeb anonymous web data and Plants domains. The KDD Cup 2000 clickstream prediction data set [16] consists of web session data taken from an onli...

553 | Inducing features of random fields - Pietra, Pietra, et al. - 1997
Citation Context: ...peed, we found DTSL to be an order of magnitude faster than L1 logistic regression, and 3-4 orders of magnitude faster than the global structure learning approaches of BLM [6] and Della Pietra et al. [2]. In terms of accuracy, DTSL is comparable to other approaches, placing first on 5 out of 13 datasets. Future work includes exploring other methods of learning... Table VII FEATURE CHARACTERI...

488 | On the limited memory BFGS method for large scale optimization - Liu, Nocedal - 1989
Citation Context: ...ic regression. The output of each structure learning algorithm is a set of conjunctive features. To learn weights, we optimized the pseudo-likelihood of the data via the limited-memory BFGS algorithm [12] since optimizing the likelihood of the data is prohibitively expensive for the domains we consider. Like Lee et al. [13], we evaluated our algorithm using test set conditional marginal log-likelihood...
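The weight-learning step described here — maximizing pseudo-likelihood with L-BFGS because exact likelihood is intractable — can be sketched on a toy pairwise model. This is an illustrative stand-in using SciPy's L-BFGS-B, not the paper's OCaml implementation; the model, features, and data are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Toy pairwise binary Markov network over (x1, x2) with features
# [x1, x2, x1*x2]; weights are fit by maximizing pseudo-likelihood
# (Eq. (3) in the paper) via L-BFGS. Data below is synthetic.

data = np.array([[1, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 0]])

def neg_pll(w):
    """Negative pseudo-log-likelihood: sum over examples and variables
    of -log P_w(x_i | rest), which only needs local normalization."""
    total = 0.0
    for x in data:
        for i in range(2):
            x1, x0 = x.copy(), x.copy()
            x1[i], x0[i] = 1, 0
            def energy(v):
                return w[0] * v[0] + w[1] * v[1] + w[2] * v[0] * v[1]
            # P(x_i = 1 | x_-i) is a logistic of the energy difference,
            # so the global partition function never appears.
            p1 = 1.0 / (1.0 + np.exp(energy(x0) - energy(x1)))
            total -= np.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

res = minimize(neg_pll, np.zeros(3), method="L-BFGS-B")
print(res.x)   # learned feature weights
```

The key property, matching the passage: each conditional normalizes over a single variable, so no partition function is computed.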

465 | Loopy belief propagation for approximate inference: An empirical study - Murphy, Weiss, et al. - 1999
Citation Context: ...ling. In contrast, the Markov networks learned by DTSL always represent consistent probability distributions and allow inference to be done by any standard technique, such as loopy belief propagation [8], mean field, or MCMC. We now describe each step of DTSL in more detail. A. Learning Trees A probabilistic decision tree represents a probability distribution over a target variable, Xi, given a set o...

291 | Mining high-speed data streams - Domingos, Hulten - 2000
Citation Context: ...data, the depth of the learned trees could be much less than log_k(m), leading to faster run times in practice. For large datasets or streaming data, we can apply the Hoeffding tree algorithm instead [10], which uses the Hoeffding bound to select decision tree splits after enough data has been seen to make a confident choice, rather than using all available data. IV. EMPIRICAL EVALUATION We evaluate o...
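The Hoeffding bound mentioned here is a one-line formula: after n examples of a statistic with range R, the true mean lies within ε of the observed mean with probability 1 − δ. A split is accepted once the observed gap between the best and second-best split exceeds ε. A minimal sketch:

```python
import math

# Hoeffding bound as used by Hoeffding-tree induction: the bound
# tightens as more examples arrive, so a split can be committed to
# as soon as the gap between candidate splits exceeds epsilon.

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# More data -> smaller epsilon -> earlier, still-confident splits.
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=1_000))
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=100_000))
```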

272 | Eigentaste: A constant time collaborative filtering algorithm - Goldberg, Roeder, et al.
Citation Context: ...ovie 5 is a collaborative filtering dataset in which users rate movies they have seen. We focused on the 500 most-rated movies, and reduced each variable to “rated” or “not rated”. The Jester dataset [19] consists of users’ real-valued ratings for 100 jokes. For Jester we... (Footnotes: 2. http://lib.stat.cmu.edu/datasets/ 3. http://web.ist.utl.pt/~acardoso/datasets/ 4. http://www-etud.iro.umontreal.ca/~bergstrj/au...)

267 | Statistical analysis of non-lattice data - Besag - 1975
Citation Context: ...islead weight learning algorithms. A more efficient alternative, widely used in areas such as spatial statistics, social network modeling and language processing, is to optimize the pseudo-likelihood [5] instead. Pseudo-likelihood is the product of the conditional probabilities of each variable given its Markov blanket: $\log P^{\bullet}_w(X\!=\!x) = \sum_{j=1}^{V} \sum_{i=1}^{N} \log P_w(X_{i,j}\!=\!x_{i,j} \mid MB_x(X_{i,j}))$ (3), where V is the n...

166 | A Bayesian approach to learning Bayesian networks with local structure - Chickering, Heckerman, et al. - 1997
Citation Context: ...eaf distributions with a Dirichlet prior (α = 1) for smoothing. In order to help avoid overfitting, we used a structure prior P(S) ∝ κ^p, where p is the number of parameters, as in Chickering et al. [9]. Pseudocode for the tree learning subroutine is in Table II. B. Generating Features While decision trees are not commonly thought of as a log-linear model, any decision tree can be converted to a set...

161 | Improving recommendation lists through topic diversification - Ziegler, McNee, et al. - 2005
Citation Context: ...uired by Last.fm. We focused on the 100 most listened-to artists. We used a random subset of the data and reduced the problem to “listened-to” or “did not listen-to.” The Book Crossing (Book) dataset [18] consists of a user’s rating of how much they liked a book. We considered the 500 most frequently rated books. We reduced the problem to “rated” or “not rated” and considered all people who rated more t...

156 | Dependency networks for inference, collaborative filtering, and data visualization - Heckerman, Chickering, et al. - 2000

123 | Scalable training of L1-regularized log-linear models - Andrew, Gao - 2007
Citation Context: ...ll 13 datasets. DTSL was implemented in OCaml. For both BLM and DP, we used the publicly available code of Davis and Domingos [6]. For Ravikumar et al.’s approach, we used the OWL-QN software package [11] for performing the L1 logistic regression. The output of each structure learning algorithm is a set of conjunctive features. To learn weights, we optimized the pseudo-likelihood of the data via the l...

102 | Efficient structure learning of Markov networks using L1-regularization - Lee, Ganapathi, et al. - 2007
Citation Context: ...imized the pseudo-likelihood of the data via the limited-memory BFGS algorithm [12] since optimizing the likelihood of the data is prohibitively expensive for the domains we consider. Like Lee et al. [13], we evaluated our algorithm using test set conditional marginal log-likelihood (CMLL). Calculating the CMLL required dividing the variables into a query set Q and an evidence set E. Then, for each te...
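The CMLL metric described here is just the sum, over query variables, of the log of the inferred marginal probability assigned to the true value given the evidence. A minimal sketch; the variable names and marginal values are made-up stand-ins for what an inference algorithm would return:

```python
import math

# Conditional marginal log-likelihood (CMLL), as in Lee et al.:
# variables are split into a query set Q and an evidence set E, and the
# score is sum over i in Q of log P(X_i = x_i | E).

def cmll(true_values, marginals):
    """true_values: {var: 0/1 truth}; marginals: {var: P(var = 1 | E)}."""
    total = 0.0
    for var, x in true_values.items():
        p = marginals[var] if x == 1 else 1.0 - marginals[var]
        total += math.log(p)
    return total

truth = {"X3": 1, "X4": 0}
marg  = {"X3": 0.8, "X4": 0.3}   # hypothetical inferred marginals
print(cmll(truth, marg))
```

Higher (less negative) CMLL means the model's marginals put more probability on the held-out true values.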

95 | Sound and efficient inference with probabilistic and deterministic dependencies - Poon, Domingos
Citation Context: ...ing three sets served as evidence. We repeated this procedure such that each set served as the query variables. We computed the conditional marginal probabilities using the MC-SAT inference algorithm [14]. For all three domains, we set the burn-in to 1,000 samples and then computed the probability using the next 10,000 samples. Table III DATA SET CHARACTERISTICS Data Set Train Tune Test Num. Density S...
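The burn-in/sampling protocol described here (discard the first samples, average marginals over the rest) is common to MCMC-style inference generally. The sketch below uses a plain Gibbs sampler rather than MC-SAT (which additionally handles deterministic dependencies via SAT-style sampling), on a made-up two-variable model, with the counts scaled down from the paper's 1,000/10,000:

```python
import math
import random

# Gibbs sampling with burn-in, standing in for MC-SAT: estimate the
# marginal P(x0 = 1) in the toy model P(x) ∝ exp(w * x0 * x1).

def gibbs_marginal(w, burn_in=100, samples=1000, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    count = 0
    for t in range(burn_in + samples):
        for i in range(2):
            other = x[1 - i]
            # Conditional P(x_i = 1 | x_other) is a logistic in w * x_other.
            p1 = 1.0 / (1.0 + math.exp(-w * other))
            x[i] = 1 if rng.random() < p1 else 0
        if t >= burn_in:            # discard burn-in, then accumulate
            count += x[0]
    return count / samples

print(gibbs_marginal(w=2.0))   # positive coupling pulls the marginal up
```

With w = 2 the exact marginal is (1 + e²)/(3 + e²) ≈ 0.81, which the sample average approaches.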

91 | KDD-Cup 2000 organizers’ report: Peeling the onion - Kohavi, Bradley, et al.
Citation Context: ...ables. From the UCI machine learning repository [15] we used: KDDCup 2000 data, MSNBC anonymous web data, MSWeb anonymous web data and Plants domains. The KDD Cup 2000 clickstream prediction data set [16] consists of web session data taken from an online retailer. Using the subset of Hulten and Domingos [17], each example initial... (Footnote 1: publicly available at http://alchemy.cs.washington.edu/papers/davis10a)

48 | Structured learning with approximate inference (NIPS) - Kulesza, Pereira - 2007
Citation Context: ...e log likelihood and its gradient in each iteration. This is typically intractable to compute exactly due to the partition function. Furthermore, an approximation may work poorly: Kulesza and Pereira [4] have shown that approximate inference can mislead weight learning algorithms. A more efficient alternative, widely used in areas such as spatial statistics, social network modeling and language proce...
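The intractability mentioned in this context is concrete: the partition function Z sums over all 2^V joint states, so exact likelihood computation doubles in cost with every added variable. A brute-force sketch on a made-up chain energy function makes the blow-up visible:

```python
import itertools
import math

# Brute-force partition function for a toy chain model: the energy
# rewards adjacent agreement with weight 0.5. The point is the cost:
# the sum ranges over all 2^V binary states.

def partition_function(V, w=0.5):
    Z, states = 0.0, 0
    for x in itertools.product([0, 1], repeat=V):
        score = sum(w for i in range(V - 1) if x[i] == x[i + 1])
        Z += math.exp(score)
        states += 1
    return Z, states

for V in (4, 8, 12):
    Z, n = partition_function(V)
    print(V, n, round(Z, 1))   # state count doubles with each variable
```

Pseudo-likelihood, discussed next in the source, sidesteps this by only ever normalizing one variable's conditional at a time.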

32 | SGD-QN: Careful quasi-Newton stochastic gradient descent - Bordes, Bottou, et al. - 2009
Citation Context: ...yed for all models. Since DTSL effectively solves the problem of slow structure learning, efficient weight learning becomes increasingly important. A stochastic optimization algorithm, such as SGD-QN [20], might yield significantly better performance. Since DTSL is always faster than the baseline algorithms and often more accurate, it is a good choice to consider when performing Markov network structu...

25 | High-dimensional Ising model selection using L1-regularized logistic regression - Ravikumar, Wainwright, et al. - 2009
Citation Context: ...ng algorithms perform a search for globally useful features. However, these algorithms are often slow and prone to finding local optima due to the large space of possible structures. Ravikumar et al. [1] recently proposed the alternative idea of applying L1 logistic regression to learn a set of pairwise features for each variable, which are then combined into a global model. This paper presents the D...
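Ravikumar et al.'s local step can be sketched with off-the-shelf tools: regress each variable on all the others with L1-regularized logistic regression, and read the estimated Markov blanket off the nonzero coefficients. The data here is synthetic (X0 is a noisy copy of X1; X2 is independent) and the regularization strength is an illustrative choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Neighborhood estimation in the style of Ravikumar et al.: L1 logistic
# regression of one variable on the rest; nonzero coefficients mark
# estimated neighbors. Synthetic data, illustrative hyperparameters.

rng = np.random.default_rng(0)
n = 2000
X1 = rng.integers(0, 2, n)
X2 = rng.integers(0, 2, n)                        # independent of X0
X0 = np.where(rng.random(n) < 0.9, X1, 1 - X1)    # X0 = noisy copy of X1

clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
clf.fit(np.column_stack([X1, X2]), X0)
print(clf.coef_)   # large weight on X1, (near-)zero weight on X2
```

Repeating this for every variable and taking the union (or intersection) of the recovered edges yields the global pairwise structure; DTSL replaces this local learner with a decision tree, which can express higher-order conjunctions.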

9 | Bottom-up learning of Markov network structure - Davis, Domingos - 2010
Citation Context: ...e log-likelihood. It adds the feature that results in the largest gain to the feature set. This procedure terminates when no candidate feature improves the model’s score. Recently, Davis and Domingos [6] proposed an alternative bottom-up approach, called BLM, for learning the structure of a Markov network. BLM starts by treating each complete example as a long feature in the Markov network. The algor...