## Data Perturbation for Escaping Local Maxima in Learning (2002)


Venue: AAAI

Citations: 35 (3 self)

### BibTeX

@INPROCEEDINGS{Elidan02dataperturbation,
  author    = {Gal Elidan and Matan Ninio and Nir Friedman and Dale Schuurmans},
  title     = {Data Perturbation for Escaping Local Maxima in Learning},
  booktitle = {AAAI},
  year      = {2002},
  pages     = {132--139}
}


### Abstract

Almost all machine learning algorithms---be they for regression, classification or density estimation---seek hypotheses that optimize a score on training data. In most interesting cases, however, full global optimization is not feasible and local search techniques are used to discover reasonable solutions. Unfortunately,

### Citations

8919 | Maximum likelihood from incomplete data via the EM algorithm (with discussion) - Dempster, Laird, et al. - 1977 |

5290 | Neural networks for pattern recognition - Bishop - 1995 |
Citation context: ...en structure (e.g., neural network training). Therefore, most training algorithms employ local search techniques such as gradient descent or discrete hill climbing to find locally optimal hypotheses (Bishop 1995). The drawback is that local maxima are common and local search often yields poor results. A variety of techniques have been developed for escaping poor local maxima in general search, including rand...
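
The discrete hill climbing mentioned in this context can be sketched in a few lines; `score` and `neighbors` are hypothetical callbacks and the toy objective is ours, not anything from the paper:

```python
def hill_climb(score, neighbors, start, max_steps=1000):
    """Greedy discrete hill climbing: repeatedly move to the
    best-scoring neighbor until no neighbor improves the score."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # local maximum: no improving neighbor
        current = best
    return current

# Toy usage: maximize -(x - 3)^2 over the integers, starting at 10.
result = hill_climb(lambda x: -(x - 3) ** 2,
                    lambda x: [x - 1, x + 1],
                    start=10)  # climbs to the (here global) maximum at 3
```

On a unimodal objective like this one the greedy climb reaches the global maximum; the paper's point is precisely that on realistic training scores it gets stuck at a local one.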

1404 | On Information and Sufficiency - Kullback, Leibler - 1951 |
Citation context: ... outlined above, we need to anneal the weight vector toward a uniform distribution. Therefore we add a penalty for divergence between wt+1 and uniform weights w0. We use the Kullback-Leibler measure (Kullback & Leibler 1951) to evaluate the divergence between the new weights and the original weights. We heighten the importance of this term as time progresses and the temperature is cooled down...
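
The annealed divergence penalty quoted above can be illustrated with a small sketch. The function names, the temperature τ = 0.9, and the schedule β = 1/τ^t are our illustrative assumptions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between two discrete
    distributions given as probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def annealed_penalty(w_next, w0, tau, t):
    """Penalty beta * KL(w_next || w0) with beta = 1 / tau**t: as the
    temperature tau**t cools (tau < 1), beta grows, so the penalty pulls
    the weights ever harder back toward the original weights w0."""
    beta = 1.0 / tau ** t
    return beta * kl(w_next, w0)

m = 4
w0 = np.full(m, 1.0 / m)               # original uniform weights
w = np.array([0.4, 0.3, 0.2, 0.1])     # a perturbed weight vector
early = annealed_penalty(w, w0, tau=0.9, t=1)
late = annealed_penalty(w, w0, tau=0.9, t=20)  # same divergence, larger penalty
```

The divergence itself is unchanged between the two calls; only the multiplier β grows, which is what enforces the anneal toward uniform weights.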

948 | Learning bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

895 | A tutorial on learning with Bayesian Networks - Heckerman - 1995 |
Citation context: ...er X given by: P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Ui). Given a training set D = {x[1], ..., x[M]} we want to learn a Bayesian network B that best matches D, for each of the above scenarios. (See (Heckerman 1998) for a comprehensive overview of Bayesian network learning.) We explore each of the problems noted above in more detail in the subsequent sections. Perturbing Structure Search Structure Scores In thi...

731 | Improved boosting algorithms using confidence-rated predictions - Schapire, Singer - 1999 |
Citation context: ...lution since it benefits from superior guidance. Relation to ensemble reweighting There are interesting relationships between these reweighting techniques and ensemble learning methods like boosting (Schapire & Singer 1999). However, our techniques are not attempting to build an ensemble, and although they are similar to boosting on the surface, there are some fundamental differences. On an intuitive level, one differe...

659 | Tabu Search - Glover, Laguna - 1997 |
Citation context: ...cal maxima are common and local search often yields poor results. A variety of techniques have been developed for escaping poor local maxima in general search, including random restarts, TABU search (Glover & Laguna 1993) and simulated annealing (Kirkpatrick, Gelatt, & Vecchi 1994). However, these techniques do not exploit the particular nature of the training problem encountered in machine learning. Instead, they alt...
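
Simulated annealing, one of the general escape techniques listed here, hinges on the Metropolis acceptance rule; a minimal sketch (the seed and the sample values are illustrative, not from the paper):

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def accept(delta, tau):
    """Metropolis rule: always accept a move that improves the score
    (delta >= 0); accept a worsening move with probability
    exp(delta / tau), which shrinks toward zero as tau cools."""
    return delta >= 0 or random.random() < math.exp(delta / tau)
```

At high temperature nearly any move is accepted, which is what lets the search walk out of a poor local maximum; as τ → 0 the rule degenerates into plain hill climbing.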

587 | Theory of statistics - Schervish - 1995 |
Citation context: ... be a probability distribution over the M data instances. Thus, we sample with a Dirichlet distribution with parameter β, so that P(W = w) ∝ ∏_m w_m^{β−1} for legal probability vectors (see, for example, (DeGroot 1989)). When β grows larger, this distribution peaks around the uniform distribution. Thus, if we use β = 1/τ^t the randomly chosen distributions will anneal toward the uniform distribution, since the tem...
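
The Dirichlet sampling scheme in this context is straightforward to sketch with NumPy; the temperature τ = 0.8, the seed, and the function name are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(m, tau, t):
    """Draw example weights from a symmetric Dirichlet with
    concentration beta = 1 / tau**t.  For tau < 1, beta grows with t,
    so successive draws anneal toward the uniform distribution."""
    beta = 1.0 / tau ** t
    return rng.dirichlet(np.full(m, beta))

m = 10
early = sample_weights(m, tau=0.8, t=1)   # small beta: diffuse, far from uniform
late = sample_weights(m, tau=0.8, t=40)   # large beta: concentrated near 1/m
```

Every draw is a legal probability vector over the m examples; only its spread around the uniform weights 1/m shrinks as the temperature cools.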

258 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems - Rose - 1998 |
Citation context: ...sarial strategy has the advantage of achieving similar performance in a single run. One class of approaches that might be related to the ones we describe here are the deterministic annealing methods (Rose 1998). These methods are similar in that they change the score by adding a “free energy” component. This component serves to smooth out the score landscape. Deterministic annealing proceeds by finding the...

257 | Exponentiated gradient versus gradient descent for linear predictors - Kivinen, Warmuth - 1997 |
Citation context: ...led down by evaluating β·KL(w_{t+1} ‖ w_0) where β ∝ 1/τ^{t+1}. Second, to maintain positive weight values we follow an exponential gradient strategy and derive a multiplicative update rule in the manner of (Kivinen & Warmuth 1997), where a penalty term for the KL-divergence between successive weight vectors w_{t+1} and w_t is added. All of these adaptations to our general schema lead us to use the following penalized score...
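
The multiplicative update alluded to above can be sketched as a one-step exponentiated-gradient rule; the step size `eta` and the toy gradient are hypothetical, not the paper's derived update:

```python
import numpy as np

def eg_update(w, grad, eta=0.1):
    """Exponentiated-gradient step in the style of Kivinen & Warmuth:
    a multiplicative update followed by renormalization, so every
    weight stays strictly positive and the vector remains a distribution."""
    w_new = w * np.exp(eta * grad)   # multiplicative: preserves positivity
    return w_new / w_new.sum()       # project back onto the simplex

w = np.full(4, 0.25)                    # start from uniform weights
grad = np.array([1.0, 0.0, -1.0, 0.0])  # toy gradient of the score
w = eg_update(w, grad)                  # weight 0 rises, weight 2 falls
```

Because the update multiplies by an exponential, no weight can ever cross zero, which is exactly the property the quoted passage needs.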

227 | The EM algorithm for graphical association models with missing data - Lauritzen - 1995 |
Citation context: ...cted score between P0 and the new model is a lower bound on the improvement between the (true) scores of the two models.2 The simplest application of EM in graphical models is for parameter learning (Lauritzen 1995; Heckerman 1998). In this case our maximization objective is just the likelihood of the model on the training data. To do this we maximize the expected likelihood at each iteration of the EM algorith...

224 | The Bayesian structural EM algorithm - Friedman - 1998 |
Citation context: ...ds on the score. It is true for the likelihood score (Dempster, Laird, & Rubin 1977; Lauritzen 1995) and for the MDL/BIC scores (Friedman 1997), and holds approximately for Bayesian structure scores (Friedman 1998). one iteration suffices to get to the global maximum). Thus, the expected score is biased. In general, this bias is toward models that are in some sense similar to the one with which we computed the...

166 | Inferring subnetworks from perturbed expression profiles - Pe’er, Regev, et al. |

146 | Functional gradient techniques for combining hypotheses - Mason, Baxter, et al. - 1999 |
Citation context: ...s a good score, whereas we are deliberately seeking a single hypothesis that attains this. On a technical level, boosting derives its weight updates by differentiating the loss of an entire ensemble (Mason et al. 2000), whereas our weight updates are derived by taking only the derivative of the score of the most recent hypothesis. Interestingly, although we do not exploit a large ensemble, we find that our methods...

132 | Learning equivalence classes of Bayesian-network structures - Chickering - 2002 |

131 | Learning Belief Networks in the presence of Missing Values and Hidden Variables - Friedman - 1997 |
Citation context: ... point of the true score (for otherwise [footnote 2: This statement of course depends on the score. It is true for the likelihood score (Dempster, Laird, & Rubin 1977; Lauritzen 1995) and for the MDL/BIC scores (Friedman 1997), and holds approximately for Bayesian structure scores (Friedman 1998).] one iteration suffices to get to the global maximum). Thus, the expected score is biased. In general, this bias is toward mode...

103 | Simulated annealing: Theory and applications - Laarhoven, Aarts - 1987 |
Citation context: ...lications, we need to tune the implementation to reduce the number of iterations. This can be done by incorporating more sophisticated cooling strategies from the simulated annealing literature (see (Laarhoven & Aarts 1987) for a review). It is also worth exploring improved ways to interleave the maximization and the reweighting steps. Finally, the empirical success of these methods raises the challenge of providing a...
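
A cooling strategy in this sense is just a function producing the temperature sequence; the textbook geometric schedule, as a sketch (α and the step count are illustrative defaults, not values from the paper):

```python
def geometric_schedule(tau0=1.0, alpha=0.95, steps=100):
    """Geometric cooling tau_t = tau0 * alpha**t, a standard schedule
    from the simulated annealing literature; 0 < alpha < 1 controls
    how quickly the temperature decays."""
    return [tau0 * alpha ** t for t in range(steps)]

sched = geometric_schedule()  # strictly decreasing temperatures
```

More sophisticated strategies (adaptive or reheating schedules) keep this interface and only change how each successive τ is chosen.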

95 | The parsimony ratchet, a new method for rapid parsimony analysis - Nixon - 1999 |
Citation context: ...om Reweighting The first approach we consider is a randomized method motivated by iterative local search methods in combinatorial optimization (Codenotti et al. 1996) and phylogenetic reconstruction (Nixon 1999). Instead of performing random steps in the hypothesis space, we perturb the score by randomly reweighting each training example around its original weight. Candidate hypotheses are then evaluated wi...
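
The random-reweighting strategy described in this context can be sketched as a perturb-and-reoptimize loop. Here `local_search` and `score` stand in for whatever learner and training score are being perturbed; all names, the Dirichlet schedule, and the round count are our assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def reweighted_search(data, local_search, score, rounds=20, tau=0.9):
    """Each round: draw a random weight vector over the training
    examples (annealed toward uniform as t grows), rerun local search
    against the reweighted score to escape the current local maximum,
    and keep whichever hypothesis scores best under the ORIGINAL
    (uniform) weights."""
    m = len(data)
    best, best_score, h = None, -np.inf, None
    for t in range(1, rounds + 1):
        beta = 1.0 / tau ** t                # concentration grows -> anneal
        w = rng.dirichlet(np.full(m, beta))  # perturbed example weights
        h = local_search(data, w, init=h)    # reoptimize from the last hypothesis
        s = score(h, data, np.full(m, 1.0 / m))
        if s > best_score:
            best, best_score = h, s
    return best
```

With, say, a weighted-mean "learner" and negative weighted squared error as the score, the loop returns a hypothesis near the uniform-weight optimum: early rounds explore under heavily perturbed weights, late rounds refine near the original objective.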

63 | Protos: An exemplar-based learning apprentice - Bareiss, Porter - 1987 |
Citation context: ...learning repository, we used the Soybean (Michalski & Chilausky 1980) disease database that contains 35 variables relevant for the diagnosis of 19 possible plant diseases, and the Audiology data set (Bareiss & Porter 1987) which explores illnesses relating to audiology dysfunctions. Both data sets have many missing values and comparatively few examples. We also used data from Rosetta’s compendium (Hughes et al. 2000),...

54 | A structural EM algorithm for phylogenetic inference - Friedman, Ninio, et al. - 2001 |

49 | A simple hyper-geometric approach for discovering putative transcription factor binding sites. Algor. Bioinformatics - Barash - 2001 |

45 | The exponentiated subgradient algorithm for heuristic boolean programming - Schuurmans, Southey, et al. - 2001 |

25 | The ALARM monitoring system - Beinlich, Suermondt, et al. - 1989 |
Citation context: ...appendix), and therefore both perturbation methods proposed above can be applied to this problem without change. Experimental Evaluation We start by evaluating methods on the synthetic Alarm network (Beinlich et al. 1989) where we can compare our results to the “golden” model that has the additional prior knowledge of the true structure. We compare our methods to a greedy hill-climbing procedure that is augmented wi...

22 | Perturbation: An efficient technique for the solution of very large instances of the Euclidean TSP - Codenotti, Manzini, et al. - 1996 |
Citation context: ...which we propose the following two main strategies. Random Reweighting The first approach we consider is a randomized method motivated by iterative local search methods in combinatorial optimization (Codenotti et al. 1996) and phylogenetic reconstruction (Nixon 1999). Instead of performing random steps in the hypothesis space, we perturb the score by randomly reweighting each training example around its original weigh...

10 | Learning by being told and learning from examples - Michalski, Chilausky - 1980 |
Citation context: ...e golden model performance. Experiments with real-life data Finally, we applied our perturbation methods to several real-life datasets: From the UCI machine learning repository, we used the Soybean (Michalski & Chilausky 1980) disease database that contains 35 variables relevant for the diagnosis of 19 possible plant diseases, and the Audiology data set (Bareiss & Porter 1987) which explores illnesses relating to audiolog...

6 | Learning the structure of complex dynamic systems - Boyen, Friedman, et al. - 1999 |

6 | Functional discovery via a compendium of expression profiles. Cell 102(1):109–26 - Hughes, Marton, et al. - 2000 |

4 | From promoter sequence to expression: a probabilistic framework - Segal, Barash, et al. - 2002 |
Citation context: ...e that has a large sum of letter weights is said to match the motif. We use the notation wi[x] to denote the weight of the letter x in the i’th position. Following (Barash, Bejerano, & Friedman 2001; Segal et al. 2002) we define the basic training problem in discriminative terms. Given N promoter sequences s1, ..., sN, where the n’th sequence consists of K letters sn,1, ..., sn,K, and a set of training l...

2 | Optimization by simulated annealing. Science 220:671–680 - Kirkpatrick, Gelatt, et al. - 1983 |