## Online Independent Component Analysis with Local Learning Rate Adaptation (2000)

### Download Links

- [www.inf.ethz.ch]
- [sml.nicta.com.au]
- [cnl.salk.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Neural Information Processing Systems

Citations: 8 (2 self)

### BibTeX

```bibtex
@inproceedings{Schraudolph00onlineindependent,
  author    = {Nicol N. Schraudolph and Xavier Giannakopoulos},
  title     = {Online Independent Component Analysis with Local Learning Rate Adaptation},
  booktitle = {Neural Information Processing Systems},
  year      = {2000},
  pages     = {789--795},
  publisher = {MIT Press}
}
```

### Abstract

Stochastic meta-descent (SMD) is a new technique for online adaptation of local learning rates in arbitrary twice-differentiable systems.

### Citations

1070 | An Information-Maximization Approach to Blind Separation and Blind Deconvolution
- Bell, Sejnowski
- 1995
Citation Context: ...different rates in order to achieve good performance. We apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixture of 10 sources whose unknown mixing coefficients c...

640 | A direct adaptive method for faster backpropagation learning: The Rprop algorithm
- Riedmiller, Braun
- 1993
Citation Context: ...of a change in $\vec{p}_t$ on $\vec{w}_{t+1}$: declaring $\vec{w}_t$ and $\vec{\delta}_t$ in (1) to be independent of $\vec{p}_t$, one then quickly arrives at $\vec{v}_{t+1} \equiv \partial \vec{w}_{t+1} / \partial \ln \vec{p}_t = \vec{p}_t \cdot \vec{\delta}_t$ (4). However, this common approach [11, 12, 13, 14, 15] fails to take into account the incremental nature of gradient descent: a change in $\vec{p}$ affects not only the current update of $\vec{w}$, but also future ones. Some authors account for this by setting $\vec{v}$ to a...

510 | A New Learning Algorithm for Blind Signal Separation
- Amari, Cichocki, et al.
- 1996
Citation Context: ...achieve good performance. We apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixture of 10 sources whose unknown mixing coefficients change at different rates. ...

416 | A learning algorithm for continually running fully recurrent neural networks
- Williams, Zipser
Citation Context: ...trast, Sutton [17, 18] models the long-term effect of $\vec{p}$ on future weight updates in a linear system by carrying the relevant partials forward through time, as is done in real-time recurrent learning [19]. This results in an iterative update rule for $\vec{v}$, which we have extended to nonlinear systems [3, 4]. We define $\vec{v}$ as an exponential average of the effect of all past changes in $\vec{p}$ on the current weigh...
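The iterative rule for the trace $\vec{v}$ that this context refers to can be sketched as below. The form follows Schraudolph's SMD papers (decay factor `lam`, Hessian-vector product `Hv` supplied from outside); the variable names are mine, not from this page.

```python
import numpy as np

def smd_trace_update(v, p, delta, Hv, lam=0.99):
    """One SMD step for the gain trace v, an exponential average of the
    effect of past changes in ln p on the current weights:
        v_{t+1} = lam * v_t + p * (delta_t - lam * H_t v_t)
    With lam = 0 this collapses to the naive single-step rule
    v_{t+1} = p * delta_t of equation (4)."""
    return lam * v + p * (delta - lam * Hv)

# lam = 0 recovers the naive one-step effect p * delta:
p = np.array([0.1, 0.2])
delta = np.array([1.0, -1.0])
naive = smd_trace_update(np.zeros(2), p, delta, Hv=np.zeros(2), lam=0.0)
assert np.allclose(naive, p * delta)
```

The decay factor `lam` is what distinguishes this long-term accounting from the single-step approaches criticized in the surrounding contexts.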

339 | Increased rates of convergence through learning rate adaptation
- Jacobs
- 1988
Citation Context: ...of a change in $\vec{p}_t$ on $\vec{w}_{t+1}$: declaring $\vec{w}_t$ and $\vec{\delta}_t$ in (1) to be independent of $\vec{p}_t$, one then quickly arrives at $\vec{v}_{t+1} \equiv \partial \vec{w}_{t+1} / \partial \ln \vec{p}_t = \vec{p}_t \cdot \vec{\delta}_t$ (4). However, this common approach [11, 12, 13, 14, 15] fails to take into account the incremental nature of gradient descent: a change in $\vec{p}$ affects not only the current update of $\vec{w}$, but also future ones. Some authors account for this by setting $\vec{v}$ to a...

247 | Exponentiated gradient versus gradient descent for linear predictors
- KIVINEN, WARMUTH
- 1997
Citation Context: ...cent: $\vec{w}_{t+1} = \vec{w}_t + \vec{p}_t \cdot \vec{\delta}_t$, where $\vec{\delta}_t \equiv -\partial f_{\vec{w}_t}(\vec{x}_t) / \partial \vec{w}$ (1), and $\cdot$ denotes component-wise multiplication. The local learning rates $\vec{p}$ are best adapted by exponentiated gradient descent [8, 9], so that they can cover a wide dynamic range while staying strictly positive: $\ln \vec{p}_t = \ln \vec{p}_{t-1} - \mu \, \partial f_{\vec{w}_t}(\vec{x}_t) / \partial \ln \vec{p}$, i.e. $\vec{p}_t = \vec{p}_{t-1} \cdot \exp(\mu \, \vec{\delta}_t \cdot \vec{v}_t)$, where $\vec{v}_t \equiv \partial \vec{w}_t / \partial \ln \vec{p}$ (2)...

134 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context: ...cent: $\vec{w}_{t+1} = \vec{w}_t + \vec{p}_t \cdot \vec{\delta}_t$, where $\vec{\delta}_t \equiv -\partial f_{\vec{w}_t}(\vec{x}_t) / \partial \vec{w}$ (1), and $\cdot$ denotes component-wise multiplication. The local learning rates $\vec{p}$ are best adapted by exponentiated gradient descent [8, 9], so that they can cover a wide dynamic range while staying strictly positive: $\ln \vec{p}_t = \ln \vec{p}_{t-1} - \mu \, \partial f_{\vec{w}_t}(\vec{x}_t) / \partial \ln \vec{p}$, i.e. $\vec{p}_t = \vec{p}_{t-1} \cdot \exp(\mu \, \vec{\delta}_t \cdot \vec{v}_t)$, where $\vec{v}_t \equiv \partial \vec{w}_t / \partial \ln \vec{p}$ (2)...
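The exponentiated-gradient rate update of equation (2) amounts to a multiplicative, always-positive step per weight. A minimal NumPy sketch (the meta-learning rate `mu` and the concrete values are illustrative, not from this page):

```python
import numpy as np

def eg_rate_update(p, delta, v, mu=0.05):
    """Exponentiated-gradient step for local learning rates,
    p_t = p_{t-1} * exp(mu * delta_t * v_t).
    Multiplicative, so rates stay strictly positive and can
    span a wide dynamic range."""
    return p * np.exp(mu * delta * v)

p = np.full(3, 0.1)                  # per-weight learning rates
delta = np.array([1.0, -0.5, 0.0])   # current (negative) gradient
v = np.array([0.2, 0.2, 0.2])        # gain trace dw / d(ln p)
p_new = eg_rate_update(p, delta, v)
# rate grows where delta and v agree, shrinks where they disagree
assert np.all(p_new > 0)
```

Since the update multiplies by an exponential, no rate can ever cross zero, which is the point of working in $\ln \vec{p}$.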

115 | A Class of Neural Networks for Independent Component Analysis
- Karhunen, Oja, et al.
- 1997
Citation Context: ...ted around a mean of one revolution for every 6 000 data samples. All sources are supergaussian. The ICA-SMD algorithm was implemented with only online access to the data, including on-line whitening [21]. Whenever the condition number of the estimated whitening matrix exceeded a large threshold (set to 350 here), updates (16) and (17) were disabled to prevent the algorithm from diverging. Other param...
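The divergence safeguard described here is a condition-number check on the estimated whitening matrix. A sketch of such a guard (the function name is mine; the threshold 350 is the value quoted in the context):

```python
import numpy as np

def whitening_update_allowed(W, threshold=350.0):
    """Skip the unmixing/whitening updates whenever the estimated
    whitening matrix W is ill-conditioned, i.e. its condition number
    (ratio of largest to smallest singular value) exceeds threshold."""
    return np.linalg.cond(W) <= threshold

assert whitening_update_allowed(np.eye(4))                 # cond = 1
assert not whitening_update_allowed(np.diag([1e4, 1.0]))   # cond = 1e4
```

Gating updates this way trades a little tracking speed for protection against the whitening estimate blowing up on badly scaled data.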

75 | Adapting bias by gradient descent: an incremental version of delta-bar-delta
- Sutton
- 1992
Citation Context: ...approach [3]. While such averaging serves to reduce the stochasticity of the product $\vec{\delta}_t \cdot \vec{\delta}_{t-1}$ implied by (3) and (4), the average remains one of immediate, single-step effects. By contrast, Sutton [17, 18] models the long-term effect of $\vec{p}$ on future weight updates in a linear system by carrying the relevant partials forward through time, as is done in real-time recurrent learning [19]. This results in ...

69 | Fast exact multiplication by the Hessian
- Pearlmutter
- 1994
Citation Context: ...certain dependence on an appropriate choice of meta-learning rate $\mu$. Note that there is an efficient O(n) algorithm to calculate $H_t \vec{v}_t$ without ever having to compute or store the matrix $H_t$ itself [20]; we shall elaborate on this technique for the case of independent component analysis below. Meta-level conditioning. The gradient descent in $\vec{p}$ at the meta-level (2) may of course suffer from ill-con...
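The O(n) Hessian-vector product mentioned here is Pearlmutter's method [20], which differentiates the gradient computation itself. As an illustrative stand-in (my sketch, not the paper's code), a central finite difference of the gradient also yields $H\vec{v}$ in O(n) without ever forming $H$:

```python
import numpy as np

def hvp_fd(grad, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H v by a central
    difference of the gradient:  H v ~ (g(w + eps*v) - g(w - eps*v)) / (2 eps).
    Two gradient evaluations, no n-by-n matrix -- a cheap surrogate for
    Pearlmutter's exact method."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Check on f(w) = 0.5 * w^T A w, whose gradient is A w and Hessian is A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
grad = lambda w: A @ w
v = np.array([1.0, -1.0])
hv = hvp_fd(grad, np.zeros(2), v)
assert np.allclose(hv, A @ v)
```

For a quadratic the gradient is linear, so the finite difference is exact; in general it trades Pearlmutter's exactness for implementation simplicity.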

63 | SuperSAB: fast adaptive backpropagation with good scaling properties
- Tollenaere
- 1990
Citation Context: ...of a change in $\vec{p}_t$ on $\vec{w}_{t+1}$: declaring $\vec{w}_t$ and $\vec{\delta}_t$ in (1) to be independent of $\vec{p}_t$, one then quickly arrives at $\vec{v}_{t+1} \equiv \partial \vec{w}_{t+1} / \partial \ln \vec{p}_t = \vec{p}_t \cdot \vec{\delta}_t$ (4). However, this common approach [11, 12, 13, 14, 15] fails to take into account the incremental nature of gradient descent: a change in $\vec{p}$ affects not only the current update of $\vec{w}$, but also future ones. Some authors account for this by setting $\vec{v}$ to a...

58 | Local gain adaptation in stochastic gradient descent
- Schraudolph
- 1999
Citation Context: ...idual weights in the unmixing matrix must adapt at different rates in order to achieve good performance. We apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixtu...

42 | Speeding up backpropagation
- Silva, Almeida
- 1990

42 | Gain adaptation beats least squares
- Sutton
- 1992
Citation Context: ...approach [3]. While such averaging serves to reduce the stochasticity of the product $\vec{\delta}_t \cdot \vec{\delta}_{t-1}$ implied by (3) and (4), the average remains one of immediate, single-step effects. By contrast, Sutton [17, 18] models the long-term effect of $\vec{p}$ on future weight updates in a linear system by carrying the relevant partials forward through time, as is done in real-time recurrent learning [19]. This results in ...

27 | Adaptive on-line learning in changing environments
- Murata, Müller, et al.
- 1997
Citation Context: ...tation scheme. Adaptation of a single, global learning rate, however, facilitates the tracking only of sources whose mixing coefficients change at comparable rates [1], resp. switch all at the same time [2]. In cases where some sources move much faster than others, or switch at different times, individual weights in the unmixing matrix must adapt at different rates in order to achieve good performance. We...

20 | Multi-player residual advantage learning with general function approximation
- Harmon, Baird
- 1996
Citation Context: ...tal nature of gradient descent: a change in $\vec{p}$ affects not only the current update of $\vec{w}$, but also future ones. Some authors account for this by setting $\vec{v}$ to an exponential average of past gradients [2, 11, 16]; we found empirically that the method of Almeida et al. [15] can indeed be improved by this approach [3]. While such averaging serves to reduce the stochasticity of the product $\vec{\delta}_t \cdot \vec{\delta}_{t-1}$ implied by...

19 | A fast, compact approximation of the exponential function
- Schraudolph
- 1999
Citation Context: ...through the corresponding element of $\vec{w}$. With considerable variation, (2) forms the basis of most local rate adaptation methods found in the literature. In order to avoid an expensive exponentiation [10] for each weight update, we typically use the linearization $e^u \approx 1 + u$, valid for small $|u|$, giving $\vec{p}_t = \vec{p}_{t-1} \cdot \max(\varrho,\, 1 + \mu \, \vec{\delta}_t \cdot \vec{v}_t)$ (3), where we constrain the multiplier to be at least $\varrho$ (...
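The linearized update of equation (3) replaces the exponential with $1 + u$ and floors the multiplier so a large negative meta-gradient cannot drive a rate to zero or below. A sketch, with illustrative parameter names and values:

```python
import numpy as np

def linearized_rate_update(p, delta, v, mu=0.05, rho=0.5):
    """Cheap substitute for p * exp(mu * delta * v): use the
    linearization exp(u) ~ 1 + u, valid for small |u|, and floor the
    multiplier at rho so the rates remain strictly positive."""
    return p * np.maximum(rho, 1.0 + mu * delta * v)

p = np.array([0.1, 0.1])
delta = np.array([0.5, -100.0])   # second meta-gradient is large and negative
v = np.array([1.0, 1.0])
p_new = linearized_rate_update(p, delta, v)
# first rate: 0.1 * 1.025; second is clipped at 0.1 * rho = 0.05
assert np.all(p_new > 0)
```

The floor matters precisely in the regime where the linearization is invalid: for large negative $u$, $1 + u$ would go negative while $e^u$ merely gets small.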

15 | Parameter adaptation in stochastic optimization
- Almeida, Langlois, et al.
- 1999

12 | Generalised independent component analysis through unsupervised learning with emergent Bussgang properties
- Girolami, Fyfe
- 1997
Citation Context: ...apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixture of 10 sources whose unknown mixing coefficients change at different rates. S.A. Solla, T.K. Leen, and ...

8 | Source separation and tracking using nonlinear PCA criterion: a least-squares approach
- Karhunen, Pajunen, et al.
- 1997
Citation Context: ...ust be replaced by a learning rate adaptation scheme. Adaptation of a single, global learning rate, however, facilitates the tracking only of sources whose mixing coefficients change at comparable rates [1], resp. switch all at the same time [2]. In cases where some sources move much faster than others, or switch at different times, individual weights in the unmixing matrix must adapt at different rates i...

2 | Online learning with adaptive local step sizes
- Schraudolph
- 1999
Citation Context: ...idual weights in the unmixing matrix must adapt at different rates in order to achieve good performance. We apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixtu...
