## Algorithms for Non-negative Matrix Factorization (2001)

### Download Links

- [www.cs.cmu.edu]
- [hebb.mit.edu]
- CiteULike
- DBLP

### Other Repositories/Bibliography

Venue: NIPS

Citations: 734 (5 self)

### BibTeX

@INPROCEEDINGS{Lee01algorithmsfor,
  author    = {Daniel D. Lee and H. Sebastian Seung},
  title     = {Algorithms for Non-negative Matrix Factorization},
  booktitle = {NIPS},
  year      = {2001},
  pages     = {556--562},
  publisher = {MIT Press}
}

### Abstract

Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
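The least-squares variant described in the abstract can be stated concretely. The sketch below implements its multiplicative updates in NumPy; the function name, random initialization, and the small `eps` guard against division by zero are my own additions, while the update rules themselves are the ones the paper analyzes:

```python
import numpy as np

def nmf_least_squares(V, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (m x n) as W @ H, with
    W (m x r) and H (r x n) kept elementwise non-negative, using the
    multiplicative updates that monotonically decrease ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because the factors stay non-negative and the error is non-increasing at every step, stopping early simply yields a coarser factorization rather than an invalid one.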

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977

Citation Context: ...deed the case as shown in the next section. 6 Proofs of convergence. To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16]. Definition 1: $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions $G(h, h') \geq F(h)$ and $G(h, h) = F(h)$ (10) are satisfied. The auxiliary function is a useful concept because of the followi...

2791 | Eigenfaces for recognition
- Turk, Pentland
- 1991
Citation Context: ...epresentational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. We have previously shown that nonnegativity ...

2014 | Principal Component Analysis
- Jolliffe
- 2002

Citation Context: ...epresentational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. We have previously shown that nonnegativity ...

1869 | Numerical Recipes in C: The Art of Scientific Computing
- Press, Teukolsky, et al.
- 1992

Citation Context: ...rse, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the nonnegativity constraint makes much of this previous work inapplicable to the present case [8]. Here we discuss two algorithms for NMF based on iterative updates of W and H. Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very...

1650 | Vector Quantization and Signal Compression
- Gersho, Gray
- 1991

Citation Context: ...uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned a...

989 | Learning the parts of objects by non-negative matrix factorization
- Lee, Seung
- 1999

Citation Context: ...ustering the data into mutually exclusive prototypes [3]. We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in deta...

378 | What Is the Goal of Sensory Coding?
- Field
- 1994

Citation Context: ...earn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data. 2 Non-negative matrix factorization. We formally consider algorithms for solv...

257 |
Maximum likelihood reconstruction for emission tomography
- Shepp, Vardi
- 1982
(Show Context)
Citation Context ...eralize to different cost functions. Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images =-=[9, 10, 11, 12]-=-. At each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the q... |

192 | Bayesian-based Iterative Method of Image Restoration
- Richardson

Citation Context: ...eralize to different cost functions. Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12]. At each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the q...

172 | An iterative technique for the rectification of observed distributions
- Lucy

Citation Context: ...eralize to different cost functions. Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12]. At each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the q...

134 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997

Citation Context: ...truction is necessarily a fixed point of the update rules. 5 Multiplicative versus additive update rules. It is useful to contrast these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for H that reduces the squared distance can be written as $H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}\left[(W^T V)_{a\mu} - (W^T W H)_{a\mu}\right]$ (6). If $\eta_{a\mu}$ are all set equal to some small positive ...
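The excerpt above contrasts the additive gradient step (Eq. 6) with the paper's multiplicative rule. A quick numerical check, with arbitrary matrix shapes chosen purely for illustration, confirms the algebra: choosing the diagonally rescaled step size $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$ turns the additive update into the multiplicative one.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((8, 6))   # data matrix
W = rng.random((8, 3))   # current basis
H = rng.random((3, 6))   # current encoding

num = W.T @ V        # (W^T V)_{a mu}
den = W.T @ W @ H    # (W^T W H)_{a mu}

# Additive gradient step (Eq. 6) with the rescaled step size eta = H / den ...
eta = H / den
H_additive = H + eta * (num - den)

# ... coincides exactly with the multiplicative update H * num / den.
H_multiplicative = H * num / den

print(np.allclose(H_additive, H_multiplicative))  # True
```

Algebraically, H + (H/den)·(num − den) = H·num/den, which is why the paper can interpret the multiplicative rule as diagonally rescaled gradient descent.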

107 | A unified approach to statistical tomography using coordinate descent optimization
- Bouman, Sauer
- 1996

96 | Least squares formulation of robust non-negative factor analysis
- Paatero
- 1997

Citation Context: ...on. Such a cost function can be constructed using some measure of distance between two non-negative matrices A and B. One useful measure is simply the square of the Euclidean distance between A and B [13], $\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2$ (2). This is lower bounded by zero, and clearly vanishes if and only if A = B. Another useful measure is $D(A\|B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$ (3). Like the ...
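The two cost functions quoted in the excerpt above (Eqs. 2 and 3) are straightforward to compute. A minimal sketch, where the function names are my own and the KL variant assumes strictly positive entries so the logarithm is defined:

```python
import numpy as np

def euclidean_sq(A, B):
    # ||A - B||^2 = sum_ij (A_ij - B_ij)^2                     (Eq. 2)
    return np.sum((A - B) ** 2)

def generalized_kl(A, B):
    # D(A||B) = sum_ij (A_ij log(A_ij / B_ij) - A_ij + B_ij)   (Eq. 3)
    # Assumes A and B are strictly positive elementwise.
    return np.sum(A * np.log(A / B) - A + B)
```

Both measures are lower bounded by zero and vanish exactly when A = B; unlike the Euclidean distance, $D(A\|B)$ is not symmetric in its arguments, which is why the paper calls it a divergence rather than a distance.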

79 | Aggregate and Mixed-Order Markov Models for Statistical Language
- Saul, Pereira
- 1997
Citation Context: ...deed the case as shown in the next section. 6 Proofs of convergence. To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16]. Definition 1: $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions $G(h, h') \geq F(h)$ and $G(h, h) = F(h)$ (10) are satisfied. The auxiliary function is a useful concept because of the followi...

33 | Unsupervised learning by convex and conic coding. Advances in neural information processing systems
- Lee, Seung
- 1997
Citation Context: ...ustering the data into mutually exclusive prototypes [3]. We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in deta...

19 | Sparse Coding in the Primate Cortex, The Handbook of Brain Theory and Neural Networks
- Földiák, Young
- 1995

Citation Context: ...earn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data. 2 Non-negative matrix factorization. We formally consider algorithms for solv...