## Sound Source Separation in Monaural Music Signals (2006)

Citations: 26 (3 self)

### BibTeX

@TECHREPORT{Virtanen06soundsource,

author = {Tuomas Virtanen},

title = {Sound Source Separation in Monaural Music Signals},

institution = {},

year = {2006}

}

### Abstract

Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more efficiently and flexibly than polyphonic mixtures. This thesis deals with the separation of monaural, or one-channel, music recordings. We concentrate on separation methods where the sources to be separated are not known beforehand. Instead, the separation is enabled by utilizing the common properties of real-world sound sources, which are their continuity, sparseness, and repetition in time and frequency, and their harmonic spectral structures. One of the separation approaches taken here uses unsupervised learning, and the other uses model-based inference based on sinusoidal modeling. Most of the existing unsupervised separation algorithms are based on a linear instantaneous signal model, where each frame of the input mixture signal is

### Citations

2330 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2001 |

2196 | Numerical Recipes in C: The Art of Scientific Computing - Press, Teukolsky, et al. - 1994 |

2100 | Matrix computations - Golub, Van Loan - 1983 |

1697 | Independent Component Analysis
- Hyvärinen, Karhunen, et al.
- 2001
Citation Context ...23-31, 80-89]. The dependence between two variables can be measured in several ways. Mutual information is a measure of the information that given random variables have on some other random variables [86]. The dependence is also closely related to the Gaussianity of the distribution of the variables. According to the central limit theorem, the distribution of the sum of independent variables is more G... |

1094 |
Learning the parts of objects by non-negative matrix factorization
- Lee, Seung
- 1999
Citation Context ... 2.4 Non-Negative Matrix Factorization. As discussed in Section 2.2.2, it is reasonable to restrict frequency-domain basis functions and their gains to non-negative values. As noticed by Lee and Seung [113], the non-negativity restrictions can be efficient in learning representations where the whole is represented as a combination of parts which have an intuitive interpretation. The spectrograms of musi... |

912 |
Fundamentals of Statistical Signal Processing: Estimation Theory
- Kay
- 1993
Citation Context ...sinusoid as $\alpha_j = a_j \cos(\theta_j)$ and $\beta_j = -a_j \sin(\theta_j)$. Because the real and imaginary parts are orthogonal, the least-squares solutions for their parameters can be computed separately. The least-squares solution [97] for (6.6) is $\hat{\alpha} = (H_\Re^T H_\Re)^{-1} H_\Re^T x_\Re$, $\hat{\beta} = (H_\Im^T H_\Im)^{-1} H_\Im^T x_\Im$ (6.7). The rows of $H_\Re$ and $H_\Im$ are linearly independent if two or more sinusoids do not have equal frequencies. Since such cases were dete... |

837 | Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
- Davis
Citation Context ...h center frequency equal to the frequency of the $\lceil 1.5^{l-1} \rceil$th harmonic. Fixed frequency-warped cosine basis model: Representing the shape of the spectrum using Mel-frequency cepstral coefficients (MFCCs, [37]) is widely used in the classification of audio signals. MFCCs are computed by taking the cosine transform of the log-amplitude spectrum calculated at a Mel-frequency scale. A linear model can approxi... |

820 | Algorithms for non-negative matrix factorization
- Lee, Seung
Citation Context ... pitch value and the gains follow roughly the amplitude envelope of each tone. The undermost component models the attack transients of the tones. The components were estimated using the NMF algorithm [114, 166] and the divergence objective (explained in Section 2.4). [...] a time-varying gain has been adopted as a part of the MPEG-7 pattern recognition framework [29], where the basis functions and the gains ar... |

804 | Nonlinear Programming. Athena Scientific - Bertsekas - 1995 |

646 | Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision research
- Olshausen, Field
- 1997
Citation Context ...d, for example, using the NMF algorithms discussed in Section 2.4. 2.3 Sparse Coding. Sparse coding represents a mixture signal in terms of a small number of active elements chosen out of a larger set [130]. This is an efficient approach for learning structures and separating sources from mixed data. In the linear signal model (2.3), the sparseness restriction is usually applied on the gains G, which me... |

584 | Fast and robust fixed-point algorithms for independent component analysis
- Hyvärinen
- 1999
Citation Context ...g gives an exact model for the PCA-transformed observations but not necessarily for the original ones. There are several ICA algorithms, and some implementations are freely available, such as FastICA [54,84] and JADE [27]. Computationally quite efficient separation algorithms can be implemented based on FastICA, for example. 2.2.1 Independent Subspace Analysis. The idea of independent subspace analysis... |

579 | Solving least squares problems - Lawson, Hanson - 1974 |

553 |
Auditory scene analysis
- Bregman
- 1990
Citation Context ...cribed in Chapters 3 and 4. Psychoacoustically motivated methods: The cognitive ability of humans to perceive and recognize individual sound sources in a mixture is referred to as auditory scene analysis [22]. Computational models of this function typically consist of two main stages so that an incoming signal is first decomposed into its elementary time-frequency components and these are then organized t... |

494 | The Fourier Transform and its Applications - Bracewell - 1965 |

420 |
Speech analysis/synthesis based on a sinusoidal representation
- McAulay, Quatieri
- 1986
Citation Context ...ult of periodic vibration. The deterministic part of the model, which is called the sinusoidal model, has been used widely in audio signal processing, for example in speech coding by McAulay and Quatieri [120]. In music signal processing it became known through the work of Smith and Serra [157,167]. 5.1 Signal Model. The sinusoidal model for one frame $x(n)$, $n = 0, \ldots, N-1$, of a signal can be written as x(n) =... |

381 |
On the use of windows for harmonic analysis with the discrete Fourier transform
- Harris
- 1978
Citation Context ...e solved separately for the real and imaginary parts). The Fourier transform of a real-valued signal which is symmetric with respect to the origin is also real [21, pp. 14-15]. As suggested by Harris [74], the frequency transforms are here calculated so that the time index n = N/2 (N being the frame length in samples) is regarded as the origin. As a result of this, the window function and the cosine t... |

346 | Psychoacoustics Facts and Models - Zwicker, Fastl - 1999 |

325 | Non-negative matrix factorization with sparseness constraints
- Hoyer
- 2004
Citation Context ...sis functions. With a sparse prior and non-negativity restrictions, one can use, for example, projected steepest descent algorithms which are discussed, e.g., by Bertsekas in [16, pp. 203-224]. Hoyer [81, 82] proposed a non-negative sparse coding algorithm by combining NMF and sparse coding. His algorithm used a multiplicative rule to update B, and projected steepest descent to update G. In musical signal... |

310 |
Introduction to Spectral Analysis
- Stoica, Moses
- 1997
Citation Context ...noise, least-squares estimation is the maximum likelihood estimator for individual sinusoids. A nonlinear least-squares (NLS) algorithm can be used to estimate their frequencies, amplitudes, and phases [170]. Stoica and Nehorai found that in colored noise, the frequency estimates of NLS are the same as in the case of white noise [171]. In colored noise, the amplitudes have to be adjusted by bandwise noi... |

298 |
Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values
- Paatero, Tapper
- 1994
Citation Context ...NMF algorithm in which the weighted energy of the residual matrix X − BG was minimized by using a least-squares algorithm where B and G were alternatingly updated under non-negativity restrictions [133]. More recently, Lee and Seung [113,114] proposed NMF algorithms which have been used in several machine learning tasks since the algorithms are easy to implement and modify. Lee and Seung [114] propo... |

262 | Learning from data - Concepts, Theory, and Methods - Cherkassky, Mulier - 1998 |

216 | Blind Separation of Speech Mixtures via Time-Frequency Masking
- Yılmaz, Rickard
- 2004
Citation Context ...uality [176]. In the case of music signals, the one-channel separation principles have also been integrated into the multichannel separation framework, and this often increases the separation quality [59, 181, 196, 197, 205]. In the case of produced music, modeling the contribution of a source signal within a mixture by filtering with a fixed linear filter may not be valid, since at the production stage, nonlinear effect... |

210 | High-order contrasts for independent component analysis, Neural Computation
- Cardoso
Citation Context ...model for the PCA-transformed observations but not necessarily for the original ones. There are several ICA algorithms, and some implementations are freely available, such as FastICA [54,84] and JADE [27]. Computationally quite efficient separation algorithms can be implemented based on FastICA, for example. 2.2.1 Independent Subspace Analysis. The idea of independent subspace analysis (ISA) was ori... |

209 | Multidimensional independent component analysis
- Cardoso
- 1998
Citation Context ...ional ICA (explained below) is used to separate phase-invariant features into invariant feature subspaces, where each source is modeled as the sum of one or more components [85]. Multidimensional ICA [26] is based on the same linear generative model (2.6) as ICA, but the components are not assumed to be mutually independent. Instead, it is assumed that the components can be divided into disjoint sets,... |

205 | Introduction to Mathematical Statistics - Hogg, Craig - 1995 |

181 | The Physics of Musical Instruments - Fletcher, Rossing - 1998 |

177 | Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces
- Hyvärinen, Hoyer
- 2000
Citation Context ...tion algorithms can be implemented based on FastICA, for example. 2.2.1 Independent Subspace Analysis. The idea of independent subspace analysis (ISA) was originally proposed by Hyvärinen and Hoyer [85]. It combines multidimensional ICA with invariant feature extraction, both of which are briefly explained later in this section. After the work of Casey and Westner [30], the term ISA has been commonly us... |

166 | Prediction-driven computational auditory scene analysis
- Ellis
- 1996
Citation Context ...n offset, c) common amplitude modulation, d) common frequency modulation, e) equidirectional movement in the spectrum; 4. Spatial proximity. These association cues have been used by several researchers [34,47] to develop sound source separation algorithms. Later there has been criticism that the grouping rules can only describe the functioning of human hearing in simple cases [163, p. 17], and robust se... |

164 | Non-negative matrix factorization for polyphonic music transcription
- Smaragdis, Brown
- 2003
Citation Context ...c and percussive sounds. It has been successfully used in the transcription of drum patterns [58,136], in the pitch estimation of speech signals [159], and in the analysis of polyphonic music signals [4,18,30,115,166,178,184,189]. Fig. 2.1 shows an example signal which consists of a diatonic scale and a C major chord played by an acoustic guitar. The signal was separated into components using the NMF algorithm that will be de... |

159 |
Musical sound modeling with sinusoids plus noise
- Serra
- 1997
Citation Context ...produced by musical instruments is the sinusoids plus noise model, which represents the signal as a sum of deterministic and stochastic parts, or, as a sum of a set of sinusoids plus a noise residual [8, 158]. Sinusoidal components are produced by a vibrating system, and are usually harmonic, i.e. the frequencies are integer multiples of the fundamental frequency. The residual contains the energy produce... |

145 |
Sound onset detection by applying psychoacoustic knowledge
- Klapuri
- 1999
Citation Context ...e activity detection has to be done with a more robust method. Paulus and Virtanen [136] proposed an onset detection procedure that was derived from the psychoacoustically motivated method of Klapuri [99]. The gains of a component were compressed, differentiated, and low-pass filtered. In the resulting “accent curve”, all local maxima above a fixed threshold were considered as sound onsets. For percus... |

140 | Modern Spectral Estimation - Kay - 1988 |

138 |
Performance measurement in blind audio source separation
- Vincent, Gribonval, et al.
- 2006
Citation Context ...mental SDR has often been used to measure the subjective quality of speech. Additional low-level performance measures for audio source separation tasks have been discussed, e.g., by Vincent et al. in [182]. For example, they measured the interference from other sources by the correlation of the separated signal to the other references. Perceptual measures: Perceptual measures in general estimate the aud... |

123 |
Dictionary learning algorithms for sparse representation
- Kreutz-Delgado, Murray, et al.
- 2003
Citation Context ...n techniques based on steepest descent, covariant gradient, quasi-Newton, and active-set methods can be used. Different algorithms and objectives are discussed for example by Kreutz-Delgado et al. in [108]. Our proposed method is presented in Chapter 3. If B is fixed, more efficient optimization algorithms can be used. This can be the case for example when B is learned in advance from a training materi... |

121 |
Virtual pitch and phase sensitivity of a computer model of the auditory periphery
- Meddis, Hewitt
- 1991
Citation Context ...armonic structures within frequency bands. For example, it has been observed by Klapuri that for harmonic sounds where the overtones have uniform amplitudes within a frequency band the auditory model [121] produces representations with a large energy at the fundamental frequency [101, pp. 238-241]. Because of the above-mentioned acoustic properties and human sound perception, the amplitudes of overlapp... |

120 | Signal estimation from modified short-time Fourier transform
- Griffin, Lim
- 1984
Citation Context ...tic large values at frame boundaries, resulting in perceptually unpleasant discontinuities when the frames are combined using overlap-add. Also the phase generation method proposed by Griffin and Lim [71] has been used in the synthesis (see for example Casey [28]). The method finds phases so that the error between the separated magnitude spectrogram and the magnitude spectrogram of the resynthesized t... |

118 | Non-Negative Sparse Coding
- Hoyer
- 2002
Citation Context ...out affecting the first term by setting B ← Bθ and G ← G/θ, where the scalar θ → ∞. The scale of the basis functions can be fixed for example with an additional constraint ||bj|| = 1 as done by Hoyer [81], or the variance of the gains can be fixed. The minimization problem (2.15) is usually solved using iterative algorithms. If both B and G are unknown, the cost function may have several local mini... |

118 | One microphone source separation
- Roweis
- 2000
Citation Context ...(f) shows active time-frequency points of a sparse representation, where only 2% of the original coefficients are active. [...] the principles described above, i.e., psychoacoustic [83], machine learning [147], or model-based [52] approaches. The quality can be further improved by using soft masks [144]. In the case of music signals, however, different instruments are more likely to have non-zero coefficie... |

111 | A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition
- Serra
- 1989
Citation Context ...inusoidal model, has been used widely in audio signal processing, for example in speech coding by McAulay and Quatieri [120]. In music signal processing it became known by the work of Smith and Serra [157,167]. 5.1 Signal Model. The sinusoidal model for one frame $x(n)$, $n = 0, \ldots, N-1$, of a signal can be written as $x(n) = \sum_{h=1}^{H} a_h \cos(2\pi f_h n / f_s + \theta_h) + r(n)$, $n = 0, \ldots, N-1$ (5.1), where n is the time i... |

107 | RWC music database: Popular, classical, and jazz music database
- Goto, Hashiguchi
Citation Context ...Figure 6.10 (horizontal axis: time/s): A pianoroll-type representation of the notes separated from an excerpt of polyphonic music (Piece 18 from the RWC Jazz Music Database [69]). [...] illustrates the approximated loudness of the note. The interface allows editing the signal by moving individual notes in time and pitch, and by deleting and copying notes. A Matlab implementation of ... |

103 |
Separation of Mixed Audio Sources By Independent Subspace Analysis
- Casey, Westner
Citation Context ...c and percussive sounds. It has been successfully used in the transcription of drum patterns [58,136], in the pitch estimation of speech signals [159], and in the analysis of polyphonic music signals [4,18,30,115,166,178,184,189]. Fig. 2.1 shows an example signal which consists of a diatonic scale and a C major chord played by an acoustic guitar. The signal was separated into components using the NMF algorithm that will be de... |

103 | A computationally efficient multipitch analysis model
- Tolonen, Karjalainen
- 2000
Citation Context ...d from the autocorrelation function which is obtained by inverse Fourier transforming the power spectrum. In our experiments, the enhanced autocorrelation function proposed by Tolonen and Karjalainen [175] was found to produce good results. In practice, a component may represent more than one pitch. This happens especially when the pitches are always present simultaneously, as is the case in a chord fo... |

97 | Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria - Virtanen |

95 | Monaural speech segregation based on pitch tracking and amplitude modulation
- Hu, Wang
- 2004
Citation Context ...tory. The bottom panel (f) shows active time-frequency points of a sparse representation, where only 2% of the original coefficients are active. [...] the principles described above, i.e., psychoacoustic [83], machine learning [147], or model-based [52] approaches. The quality can be further improved by using soft masks [144]. In the case of music signals, however, different instruments are more likely to... |

81 |
Modelling auditory processing and organisation
- Cooke
- 1991
Citation Context ...n offset, c) common amplitude modulation, d) common frequency modulation, e) equidirectional movement in the spectrum; 4. Spatial proximity. These association cues have been used by several researchers [34,47] to develop sound source separation algorithms. Later there has been criticism that the grouping rules can only describe the functioning of human hearing in simple cases [163, p. 17], and robust se... |

80 | The MPEG-4 - Pereira, Ebrahimi - 2002 |

76 | Music-Listening Systems
- Scheirer
- 2000
Citation Context ... estimate phases of the signals. The human audio perception can also be modeled as a process, where features are extracted from each source within a mixture, or, as “understanding without separation” [153]. Different representations are discussed in more detail in Section 1.4. 1.2 Applications. In most audio applications, applying some processing only to a certain source within a polyphonic mixture is v... |

75 |
Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs
- Smaragdis
- 2004
Citation Context ...dels which extend the instantaneous model to allow either time-varying spectra or frequencies. In addition to our original work in [190], similar extensions were simultaneously published by Smaragdis [164, 165]. Estimation algorithms for the convolutive model were proposed which are based on the minimization of the reconstruction error between the observed magnitude spectrogram and the model while restricti... |

71 | Algorithms for non-negative independent component analysis
- Plumbley
- 2003
Citation Context ...y independent from each other. However, it has been proved that under certain conditions, the non-negativity restrictions are theoretically sufficient for separating statistically independent sources [138]. It has not been investigated whether musical signals fulfill these conditions, and whether NMF implements a suitable estimation algorithm. Currently, there is no comprehensive theoretical explanation... |

70 | Signal Processing Methods for Music Transcription - Klapuri, Davy, et al. - 2006 |
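
Several of the citation contexts above refer to the NMF algorithm of Lee and Seung with the divergence objective, in which a non-negative magnitude spectrogram X is approximated by the product BG of non-negative basis functions B and gains G. As an illustration only, the following is a minimal sketch of the standard multiplicative updates for that objective. It is not the thesis implementation: the function names `nmf_kl` and `kl_divergence`, the random initialization, the iteration count, and the small constant `eps` are all assumptions made for this example.

```python
import numpy as np


def kl_divergence(X, Y, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(X || Y) for non-negative arrays."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))


def nmf_kl(X, n_components, n_iter=200, eps=1e-12, seed=0):
    """Approximate X (frequencies x frames) as B @ G with B, G >= 0.

    Uses the multiplicative updates for the divergence objective.
    Initialization and stopping rule are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = X.shape
    B = rng.random((n_freq, n_components)) + eps    # basis spectra (columns)
    G = rng.random((n_components, n_frames)) + eps  # time-varying gains (rows)
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # Update the basis functions with the gains held fixed.
        ratio = X / (B @ G + eps)
        B *= (ratio @ G.T) / (ones @ G.T + eps)
        # Update the gains with the basis functions held fixed.
        ratio = X / (B @ G + eps)
        G *= (B.T @ ratio) / (B.T @ ones + eps)
    return B, G
```

Because every update multiplies by a non-negative ratio, B and G stay non-negative as long as they are initialized that way, which is why no explicit projection step appears in the sketch. Calling `nmf_kl(np.abs(S), n_components=4)` on a magnitude spectrogram `S` (a hypothetical input) would return one basis spectrum per column of `B` and the corresponding gain per row of `G`.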