## Curiosity-Driven Optimization (2011)

Venue: IEEE Congress on Evolutionary Computation (CEC)

Citations: 3 (2 self)

### BibTeX

```bibtex
@inproceedings{Schaul11curiosity-drivenoptimization,
  author    = {Tom Schaul and Yi Sun and Daan Wierstra and Faustino Gomez and J\"urgen Schmidhuber},
  title     = {Curiosity-Driven Optimization},
  booktitle = {IEEE Congress on Evolutionary Computation (CEC)},
  year      = {2011}
}
```


### Abstract

The principle of artificial curiosity directs active exploration towards the most informative or most interesting data. We show its usefulness for global black box optimization when data point evaluations are expensive. Gaussian process regression is used to model the fitness function based on all available observations so far. For each candidate point this model estimates expected fitness reduction, and yields a novel closed-form expression of expected information gain. A new type of Pareto-front algorithm continually pushes the boundary of candidates not dominated by any other known data according to both criteria, using multi-objective evolutionary search. This makes the exploration-exploitation trade-off explicit, and permits maximally informed data selection. We illustrate the robustness of our approach in a number of experimental scenarios.
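The recipe the abstract describes can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' code: the RBF kernel, its length scale, the noise level, and the toy objective are all assumptions; only the overall pipeline (GP posterior, then expected improvement and expected information gain, then a non-dominated filter) follows the abstract.

```python
import math
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel k(x, x') on 1-D inputs (an assumed choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, ell=0.5, noise=1e-2):
    """Predictive mean/variance of a zero-mean GP with kernel k + sigma_n^2 * delta."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    Ks = rbf(X, Xs, ell)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = 1.0 - np.sum(Ks * v, axis=0)   # k(x, x) = 1 for this kernel
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """Closed-form EI for minimisation: (y* - mu) * Phi(z) + s * phi(z)."""
    s = np.sqrt(var)
    z = (y_best - mu) / s
    Phi = np.array([0.5 * (1 + math.erf(zi / math.sqrt(2))) for zi in z])
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (y_best - mu) * Phi + s * phi

def information_gain(var, noise=1e-2):
    """Expected information gain of observing x, monotone in predictive variance."""
    return 0.5 * np.log1p(var / noise)

def pareto_front(f1, f2):
    """Indices of candidates not dominated when maximising both f1 and f2."""
    front = []
    for i in range(len(f1)):
        dom = np.any((f1 > f1[i]) & (f2 >= f2[i]) | (f1 >= f1[i]) & (f2 > f2[i]))
        if not dom:
            front.append(i)
    return front

# Toy run: a handful of evaluations of f(x) = sin(3x).
X = np.array([-1.0, -0.2, 0.4, 1.0])
y = np.sin(3 * X)
Xs = np.linspace(-2, 2, 81)
mu, var = gp_posterior(X, y, Xs)
ei = expected_improvement(mu, var, y.min())
ig = information_gain(var)
front = pareto_front(ei, ig)
```

Note that the front's extremes recover the single-objective choices: the candidate maximising EI (pure exploitation) and the one maximising information gain (pure exploration) are both non-dominated.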

### Citations

832 | A fast and elitist multi-objective genetic algorithm
- Deb, Pratap, et al.
- 2002

Citation Context: ...t. expected improvement and expected information gain, which can be performed by any multi-objective optimization method, for example the Non-dominated Sorting Genetic Algorithm version II (NSGA-II; [21]) which is used in the experiments in section IV. All non-dominated candidates are considered “good” solutions, and therefore each should be assigned a probability of being chosen that favors those th...

529 | Active learning with statistical models
- Cohn, Ghahramani, et al.
- 1996

Citation Context: ...sociated with that prediction, and provide an analytical expression for computing information gain. One option would be to use a mixture of Gaussians on the joint parameter-cost function space (as in [22]). However, this approach has the drawback of being sensitive to the number of Gaussians used, as well as giving poor interpolation in regions with few sampled points. III. CURIOSITY-DRIVEN OPTIMIZATI...

360 | Theory of Optimal Experiments
- Fedorov
- 1972

Citation Context: ...become boring over time. To permit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. ...

329 | Information-Based Objective Functions for Active Data
- MacKay
- 1992

Citation Context: ...ng over time. To permit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. Formally, le...

291 | Gaussian Processes for Machine Learning
- Rasmussen, Williams
- 2006

Citation Context: ...tation of curiosity-driven optimization which satisfies all our criteria from the previous section by using Gaussian processes to model the cost function. A. Gaussian Processes Gaussian processes (GP, [23]) can be seen as a probability distribution over functions, as evaluated on an arbitrary but finite number of points. Given a number of observations, a Gaussian process associates a Gaussian probabili...

237 | Efficient global optimization of expensive black-box functions
- Jones, Schonlau, et al.
- 1998

Citation Context: ...ibly given in advance) and use them to model the cost function, which is useful for dimensionality reduction, visualization, assessing uncertainty, and ultimately determining good points to explore [4], [8]. In addition, a statistical model of the cost function allows expert knowledge to be incorporated in the form of a Bayesian prior. Our variant of curiosity-driven exploration uses such a memory-based...

178 | Bayesian experimental design: A review
- Chaloner, Verdinelli
- 1995

Citation Context: ...ermit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. Formally, let Yenv be the envi...

138 | Bayesian surprise attracts human attention
- Itti, Baldi
- 2006

Citation Context: ...oduced for reinforcement learning [1], [9], the curiosity framework has been used for active learning [10], [11], to explain certain patterns of human visual attention better than previous approaches [12], and to explain concepts such as beauty, attention and creativity [3], [13]. A. Formalizing Interestingness The interestingness of a new observation is the difference between the performance of an ad...

129 | Sparse on-line Gaussian processes
- Csato, Opper
- 2002

Citation Context: ...cales with O(n²). The computational complexity of Gaussian processes can be reduced e.g. by implementing them online and using a reduced base vector set, containing only the most informative points [26]. We have not implemented these yet, as computation time was not a major concern in our experiments. Gaussian process regression only gives reasonable results when the kernel hyperparameters are set p...

126 | Neural network exploration using optimal experiment design
- Cohn
- 1996

Citation Context: ...To permit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. Formally, let Yenv be th...

124 | A taxonomy of global optimization methods based on response surfaces
- Jones

Citation Context: ...condary, and only useful inasmuch as it facilitates locating optima more efficiently. Therefore, active learning cannot be used naively for optimization. Instead, the related response surface methods [6], [7] are the standard tool for global optimization. They store all available evaluations (some possibly given in advance) and use them to model the cost function, which is useful for dimensionality redu...

105 | Curious model-building control systems
- Schmidhuber
- 1991

Citation Context: ...next is a ubiquitous challenge in reinforcement learning and optimization. Inspired by the human drive to discover “interesting” parts of the world, one formal interpretation of artificial curiosity [1], [2], [3] defines momentary interestingness as the first derivative of the quality of an adaptive world model, where quality is measured in terms of how much the current model is able to compress the...

42 | Stochastic Optimization
- Schneider, Kirkpatrick
- 2006

Citation Context: ...ry, and only useful inasmuch as it facilitates locating optima more efficiently. Therefore, active learning cannot be used naively for optimization. Instead, the related response surface methods [6], [7] are the standard tool for global optimization. They store all available evaluations (some possibly given in advance) and use them to model the cost function, which is useful for dimensionality reduction...

37 | Query-based learning applied to partially trained multi-layer perceptrons
- Hwang, Choi, et al.
- 1991

Citation Context: ...e boring over time. To permit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. Formal...

37 | Accelerating evolutionary algorithms with Gaussian process fitness function models
- Buche, Schraudolph, et al.
- 2005

Citation Context: ...aussian processes are capable of modeling highly complex cost landscapes through the use of appropriate covariance (kernel) functions, and are commonly used for regression and function modeling [23], [24]. Formally, we consider the Gaussian process with zero mean and the kernel function k(x, x′) + σ_n² δ(x, x′), where δ(·, ·) is the Kronecker delta function. Thus, for any values y, y′ at x, x′...

34 | Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts
- Schmidhuber
- 2006

Citation Context: ...e next is a ubiquitous challenge in reinforcement learning and optimization. Inspired by the human drive to discover “interesting” parts of the world, one formal interpretation of artificial curiosity [21, 22, 24] defines momentary interestingness as the first derivative of the quality of an adaptive world model, where quality is measured in terms of how much the current model is able to compress the data obse...

28 | Nonmyopic Active Learning of Gaussian Processes: an Exploration-Exploitation Approach
- Krause, Guestrin
- 2007

Citation Context: ...oblems, even a small reduction in the required number of evaluations justifies a significant investment of computational resources. Expensive global optimization is closely related to active learning [5], in that candidate points to be evaluated must be chosen with care, but the goal is different: active learning is concerned with obtaining an accurate model of the data, while in optimization modelin...

25 | Gaussian process regression: Active data selection and test point rejection
- Seo, Wallat, et al.
- 2000

Citation Context: ...rm is constant, therefore there is a direct connection between the expected information gain and the predictive variance given the observation, which can be computed efficiently. Note that Seo et al. [25] found the predictive variance to be a useful criterion for exploration, without realizing that it is equivalent to information gain. C. Algorithm Choosing a Gaussian process to model the cost functio...
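The equivalence this context alludes to follows from a standard Gaussian entropy identity. A minimal sketch, with assumed notation (`pred_var` for the predictive variance and `noise_var` for the observation-noise variance, which are not necessarily the paper's symbols):

```python
import math

def info_gain(pred_var, noise_var):
    # Mutual information between the latent value f(x) and a noisy
    # observation y = f(x) + eps, eps ~ N(0, noise_var):
    #   I = 0.5 * log(1 + pred_var / noise_var).
    # This is strictly increasing in pred_var, so ranking candidates by
    # predictive variance and by expected information gain coincide.
    return 0.5 * math.log1p(pred_var / noise_var)

# Larger predictive variance yields strictly larger information gain.
gains = [info_gain(v, 0.01) for v in (0.001, 0.01, 0.1, 1.0)]
```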

25 | Natural evolution strategies
- Wierstra, Schaul, et al.

Citation Context: ...rly. With enough computation time available, we can periodically optimize the hyperparameters with respect to the marginal likelihood. For this we use the natural evolution strategies algorithm (NES, [27], [28]). Potentially, we could also employ diagnostic methods [8] to determine whether the model is appropriate. At each iteration, an inner multi-objective optimization algorithm is used, in our case...

22 | Neurobiology of Attention
- Itti, Rees, et al.
- 2005

Citation Context: ...ntroduced for reinforcement learning [21, 26], the curiosity framework has been used for active learning [8, 18], to explain certain patterns of human visual attention better than previous approaches [11], and to explain concepts such as beauty, attention and creativity [23, 24]. 2.1 Formalizing Interestingness The interestingness of a new observation is the difference between the performance of an ad...

21 | Optimization using surrogate objectives on a helicopter test example
- Booker, Jr, et al.
- 1998

Citation Context: ...of the model’s learning algorithm. Here we introduce a novel variant of this notion of artificial curiosity designed for optimization problems, such as assembly-line optimization or helicopter design [4], where function evaluations are very expensive. For these problems, even a small reduction in the required number of evaluations justifies a significant investment of computational resources. Expensi...

21 | Bayesian algorithms for one-dimensional global optimization
- Locatelli
- 1997

Citation Context: ...s will cover the search space densely, which is the only way to ensure that it will eventually find the global optimum. Optimization based on expected improvement has been shown to have this property [31]. It turns out that if we remove the information gain objective from CO-GP, the algorithms are equivalent. Therefore, as one extreme of the Pareto-front will always correspond to the point maximizing ...

20 | Learning Mackey-Glass from 25 examples, Plus or Minus 2
- Plutowski, Cottrell, et al.

Citation Context: ...r time. To permit the incorporation of a Bayesian prior, we will focus on probabilistic models and use a particular variant of the KL-based approach [9] to maximize information gain [14], [15], [16], [17], [18], [19]: The KL-divergence or relative entropy between prior and posterior (before and after seeing the new point) is invariant under any transformation of the parameter space. Formally, let Yenv...

14 | Exploration and exploitation in adaptive filtering based on Bayesian active learning
- Zhang, Xu, et al.
- 2003

Citation Context: ...sum of both objectives, where the weights are set manually, or tuned to the problem. Combining two objectives of different scale into a single utility measure is common practice [19], but problematic [20]. In fact, if the cost landscape is ill-shaped each objective can completely dominate in some regions while being dominated in others. Therefore we propose turning the problem around and only deciding...
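The failure mode of fixed weightings described in this context can be made concrete with a toy example (the three hypothetical candidates below are not from the paper): a point on a non-convex part of the Pareto front is never preferred by any convex weighting of the two objectives, yet a non-dominated filter keeps it.

```python
# Hypothetical (objective-1 score, objective-2 score) pairs; both maximised.
points = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}

def dominated(p, others):
    """True if some other point is at least as good in both objectives
    and strictly better in one."""
    return any(q != p and q[0] >= p[0] and q[1] >= p[1]
               and (q[0] > p[0] or q[1] > p[1]) for q in others)

# Sweep all convex weightings w*f1 + (1-w)*f2: C always scores 0.4,
# while the better of A and B scores max(w, 1-w) >= 0.5.
selected = {max(points, key=lambda k: w * points[k][0] + (1 - w) * points[k][1])
            for w in (i / 100 for i in range(101))}
```

Here `C` is Pareto-optimal but never appears in `selected`, which is exactly why the paper keeps the two criteria separate instead of scalarising them.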

11 | Simple algorithmic principles of discovery, subjective beauty, selective attention, curiosity & creativity
- Schmidhuber
- 2007

Citation Context: ...used for active learning [10], [11], to explain certain patterns of human visual attention better than previous approaches [12], and to explain concepts such as beauty, attention and creativity [3], [13]. A. Formalizing Interestingness The interestingness of a new observation is the difference between the performance of an adaptive model on the observation history before and after including the new p...

10 | Gaussian process dynamic programming (Neurocomputing)
- Deisenroth, Rasmussen, et al.
- 2009

Citation Context: ...improve the model’s predictions or explanations of what is going on in the world. Originally introduced for reinforcement learning [1], [9], the curiosity framework has been used for active learning [10], [11], to explain certain patterns of human visual attention better than previous approaches [12], and to explain concepts such as beauty, attention and creativity [3], [13]. A. Formalizing Interesti...

6 | Enhancing the performance of maximum-likelihood Gaussian EDAs using anticipated mean shift
- Bosman, Grahl, et al.
- 2008

Citation Context: ...h the number of points. Note that many well-known algorithms, such as Estimation of Distribution Algorithms, do not have this property, and instead rely either on correct initialization or heuristics [30]. In contrast, CO-GP does have this property, as our results on the linear function show (see Figure 1). B. Local optimization While designed primarily for multi-modal cost landscapes, we investigated...

3 | Bayesian active learning for sensitivity analysis
- Pfingsten
- 2006

Citation Context: ...ve the model’s predictions or explanations of what is going on in the world. Originally introduced for reinforcement learning [1], [9], the curiosity framework has been used for active learning [10], [11], to explain certain patterns of human visual attention better than previous approaches [12], and to explain concepts such as beauty, attention and creativity [3], [13]. A. Formalizing Interestingness...

2 | Reinforcement-driven information acquisition in non-deterministic environments
- Storck, Hochreiter, et al.
- 1995

Citation Context: ...ively explore the interesting regions in search space that most improve the model’s predictions or explanations of what is going on in the world. Originally introduced for reinforcement learning [1], [9], the curiosity framework has been used for active learning [10], [11], to explain certain patterns of human visual attention better than previous approaches [12], and to explain concepts such as beau...

1 | Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts
- Schmidhuber
- 2006

Citation Context: ...is a ubiquitous challenge in reinforcement learning and optimization. Inspired by the human drive to discover “interesting” parts of the world, one formal interpretation of artificial curiosity [1], [2], [3] defines momentary interestingness as the first derivative of the quality of an adaptive world model, where quality is measured in terms of how much the current model is able to compress the data...

1 | Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty
- Schmidhuber
- 2009

Citation Context: ...ubiquitous challenge in reinforcement learning and optimization. Inspired by the human drive to discover “interesting” parts of the world, one formal interpretation of artificial curiosity [1], [2], [3] defines momentary interestingness as the first derivative of the quality of an adaptive world model, where quality is measured in terms of how much the current model is able to compress the data...

1 | Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach
- Krause, Guestrin

Citation Context: ...e original version of this report was submitted to the NIPS conference in June 2009, but has since undergone a major revision. (Technical Report No. IDSIA-03-10) ...is closely related to active learning [14], in that candidate points to be evaluated must be chosen with care, but the goal is different: active learning is concerned with obtaining an accurate model of the data, while in optimization modelin...

1 | Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes (Anticipatory Behavior in Adaptive Learning Systems, from Sensorimotor to Higher-level Cognitive)
- Schmidhuber
- 2009

Citation Context: ...e next is a ubiquitous challenge in reinforcement learning and optimization. Inspired by the human drive to discover “interesting” parts of the world, one formal interpretation of artificial curiosity [21, 22, 24] defines momentary interestingness as the first derivative of the quality of an adaptive world model, where quality is measured in terms of how much the current model is able to compress the data obse...