## A Generative Model for Music Transcription (2005)

Citations: 42 (14 self)

### BibTeX

@MISC{Cemgil05agenerative,
    author = {Ali Taylan Cemgil and Bert Kappen and David Barber},
    title = {A Generative Model for Music Transcription},
    year = {2005}
}

### Abstract

In this paper we present a graphical model for polyphonic music transcription. Our model, formulated as a Dynamical Bayesian Network, embodies a transparent and computationally tractable approach to this acoustic analysis problem. An advantage of our approach is that it places emphasis on explicitly modelling the sound generation procedure. It provides a clear framework in which high-level (cognitive) prior information on music structure can be coupled with low-level (acoustic, physical) information in a principled manner to perform the analysis. The model is a special case of the generally intractable switching Kalman filter model. Where possible, we derive exact polynomial-time inference procedures, and otherwise efficient approximations. We argue that our generative-model-based approach is computationally feasible for many music applications and is readily extensible to more general auditory scene analysis scenarios.
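To make the idea of "explicitly modelling the sound generation procedure" concrete, the following is a minimal sketch of a single note generator as a switching linear dynamical system. It is not the authors' implementation. It assumes that each harmonic is a damped two-dimensional rotation, that a binary indicator `r` switches between "mute" and "sound", and that an onset re-excites the state with Gaussian noise. The function names, the damping values, the switching probability and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np


def rotation_block(omega, rho):
    """Damped 2x2 rotation: one oscillator at angular frequency omega (rad/sample)."""
    c, s = np.cos(omega), np.sin(omega)
    return rho * np.array([[c, -s], [s, c]])


def simulate_note(n_samples, omega, n_harmonics=3, rho_sound=0.999,
                  rho_mute=0.9, p_switch=0.01, noise_std=0.01, seed=0):
    """Sample an indicator sequence r (0 = mute, 1 = sound) and a signal y.

    All default parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    r = np.zeros(n_samples, dtype=int)
    y = np.zeros(n_samples)
    state = np.zeros(2 * n_harmonics)  # two state dimensions per harmonic
    prev = 0                           # the generator starts muted
    for t in range(n_samples):
        # Discrete switch: flip the indicator with probability p_switch.
        r_t = 1 - prev if rng.random() < p_switch else prev
        if r_t == 1 and prev == 0:
            # Onset (assumed behaviour): re-excite the oscillators.
            state = rng.normal(size=2 * n_harmonics)
        else:
            # Otherwise rotate and damp; higher harmonics decay faster.
            rho = rho_sound if r_t == 1 else rho_mute
            for h in range(n_harmonics):
                block = rotation_block((h + 1) * omega, rho ** (h + 1))
                state[2 * h:2 * h + 2] = block @ state[2 * h:2 * h + 2]
        # Observation: sum of the first coordinate of each harmonic plus noise.
        y[t] = state[0::2].sum() + noise_std * rng.normal()
        r[t] = r_t
        prev = r_t
    return r, y
```

For example, `r, y = simulate_note(8000, omega=2 * np.pi * 440 / 8000)` draws one second of a hypothetical 440 Hz generator at an 8 kHz sampling rate. Transcription in this setting amounts to inferring `r` from `y`; conditioned on a fixed `r` the model is linear and Gaussian, which is what connects it to Kalman filtering.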

### Citations

1159 | Information theory, inference and learning algorithms. Cambridge University Press - MacKay - 2003
Citation Context ...section. 1 In the simulations we have fixed the transition parameter p(r = mute | r = sound) = p(r = sound | r = mute) = 10 2 It is instructive to interpret (13) from a Bayesian model selection perspective =-=[34]-=-. In this interpretation, we view the set of all piano-rolls, indexed by configurations of discrete indicator variables r, as the set of all models among which we search for the best model r. In thi... |

520 | Forecasting, structural time series models and the Kalman filter - Harvey - 1989
Citation Context ...cause this provides a natural way to couple the signal model with the piano-roll representation. Similar formulations are used in the econometrics literature to model seasonal fluctuations, e.g., see =-=[31]-=- and [32]. Here we omit the transient component and focus on the periodic component. It is conceptually straightforward to include the transient component as this does not affect the complexity of our... |

484 | Auditory scene analysis - Bregman - 1990
Citation Context ... pitch tracking, switching Kalman filters. I. INTRODUCTION WHEN HUMANS listen to sound, they are able to associate acoustical signals generated by different mechanisms with individual symbolic events =-=[1]-=-. The study and computational modeling of this human ability forms the focus of computational auditory scene analysis (CASA) and machine listening [2]. Research in this area seeks solutions to a broad... |

378 | Speech analysis/synthesis based on sinusoidal representation - McAulay, Quatieri - 1986
Citation Context ...ansient attack characteristics [27]. It is common to model such signals as the sum of a periodic component and a transient nonperiodic component (see e.g., [28], [29], and [13]). The sinusoidal model =-=[30]-=- is often a good approximation that provides a compact representation for the periodic component. The transient component can be (1)CEMGIL et al.: GENERATIVE MODEL FOR MUSIC TRANSCRIPTION 681 Fig. 1.... |

309 | Expectation propagation for approximate Bayesian inference - Minka - 2005
Citation Context ...s derived from easy-to-compute features such as the energy spectrum. Alternatively, sequential Monte Carlo methods or deterministic message propagation algorithms such as Expectation propagation (EP) =-=[47]-=- could be also used. We have not yet tested our model for more general scenarios, such as music fragments containing percussive instruments or bell sounds with inharmonic spectra. Our simple periodic ... |

221 | Independent factor analysis - Attias - 1999
Citation Context ...around zero and has broad tails, indicating that most of the sources are muted and only a few are sounding. It is well known that such Gaussian mixture priors induce sparse representations, e.g., see =-=[42]-=-, [43] for applications in the context of source separation. A. Future Work Although our approach has many desirable features (automatically deducing number of correct notes, high temporal resolution ... |

213 | Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition - Serra, Smith - 1990
Citation Context ...s, albeit with strong damping effects and transient attack characteristics [27]. It is common to model such signals as the sum of a periodic component and a transient nonperiodic component (see e.g., =-=[28]-=-, [29], and [13]). The sinusoidal model [30] is often a good approximation that provides a compact representation for the periodic component. The transient component can be (1)CEMGIL et al.: GENERATI... |

164 | Physical modeling using digital waveguides - Smith - 1992
Citation Context ... state analytically. An alternative would be to formulate a nonlinear dynamical system that implements a nonlinear synthesis model (e.g., FM synthesis, waveshaping synthesis, or even a physical model =-=[45]-=-). Such an approach would reduce the dimensionality of the latent state space but force us to use approximate integration methods such as particle filters or EKF/UKF [46]. It remains an interesting op... |

158 | Computational auditory scene analysis - Brown, Cooke - 1994
Citation Context ...ferent mechanisms with individual symbolic events [1]. The study and computational modeling of this human ability forms the focus of computational auditory scene analysis (CASA) and machine listening =-=[2]-=-. Research in this area seeks solutions to a broad range of problems such as the cocktail party problem, (for example automatically separating voices of two or more simultaneously speaking persons, se... |

158 | The cognition of basic musical structures - Temperley - 2001 |

156 | Parameter estimation for linear dynamical systems - Ghahramani, Hinton - 1996 |

155 | The Physics of Musical Instruments - Fletcher, Rossing - 1998
Citation Context .... Modeling a Single Note Musical instruments tend to create oscillations with modes that are roughly related by integer ratios, albeit with strong damping effects and transient attack characteristics =-=[27]-=-. It is common to model such signals as the sum of a periodic component and a transient nonperiodic component (see e.g., [28], [29], and [13]). The sinusoidal model [30] is often a good approximation ... |

151 | Prediction-driven computational auditory scene analysis - Ellis - 1996
Citation Context ...such as the cocktail party problem, (for example automatically separating voices of two or more simultaneously speaking persons, see, e.g., [3] and [4]), identification of environmental sound objects =-=[5]-=- and musical scene analysis [6]. Traditionally, the focus of most research activities has been in speech applications. Recently, analysis of musical scenes is drawing increasingly more attention, prim... |

117 | Bayesian Forecasting and Dynamic Models, 2nd Edition - West, Harrison - 1997
Citation Context ...s provides a natural way to couple the signal model with the piano-roll representation. Similar formulations are used in the econometrics literature to model seasonal fluctuations, e.g., see [31] and =-=[32]-=-. Here we omit the transient component and focus on the periodic component. It is conceptually straightforward to include the transient component as this does not affect the complexity of our inferenc... |

110 | Propagation algorithms for Variational Bayesian learning - Ghahramani, Beal - 2001
Citation Context ...prior. Note that (14) becomes equivalent to (13), if we knew the “best” parameter , i.e., . Unfortunately, the integration on can not be calculated analytically and approximation methods must be used =-=[37]-=-. A crude but computationally cheap approximation replaces the integration on in (14) with maximization Essentially, this is a joint optimization problem on piano-rolls and parameters which we solve b... |

107 | One microphone source separation - Roweis - 2000
Citation Context ...s area seeks solutions to a broad range of problems such as the cocktail party problem, (for example automatically separating voices of two or more simultaneously speaking persons, see, e.g., [3] and =-=[4]-=-), identification of environmental sound objects [5] and musical scene analysis [6]. Traditionally, the focus of most research activities has been in speech applications. Recently, analysis of musical... |

106 | Pitch Determination of Speech Signals - Hess - 1983
Citation Context ...y and more recent work, respectively. In speech processing, the related task of tracking the pitch of a single speaker is a fundamental problem and methods proposed in the literature are well studied =-=[10]-=-. However, most current pitch detection algorithms are based largely on heuristics (e.g., picking high energy peaks of a spectrogram, correlogram, auditory filter bank, etc.) and their formulation usu... |

71 | Music-listening systems - Scheirer - 2000
Citation Context ...lem, (for example automatically separating voices of two or more simultaneously speaking persons, see, e.g., [3] and [4]), identification of environmental sound objects [5] and musical scene analysis =-=[6]-=-. Traditionally, the focus of most research activities has been in speech applications. Recently, analysis of musical scenes is drawing increasingly more attention, primarily because of the need for c... |

59 | Switching Kalman filters - Murphy - 1998 |

58 | Physical modeling of plucked string instruments with application to real-time sound synthesis - Välimäki, Huopaniemi, et al. - 1996
Citation Context ...nerator index j. The actual observed signal y is a superposition of the outputs of all generators. geometrically with respect to that of the fundamental frequency, i.e., higher harmonics decay faster =-=[33]-=-. is the transition matrix at time and encodes the physical properties of the sound generator as a first order Markov Process. The rotation angle can be made time dependent for modeling pitch drifts o... |

57 | Musical Sound Signals Analysis/Synthesis: Sinusoidal+Residual and Elementary Waveform Models - Rodet - 1997
Citation Context ...eit with strong damping effects and transient attack characteristics [27]. It is common to model such signals as the sum of a periodic component and a transient nonperiodic component (see e.g., [28], =-=[29]-=-, and [13]). The sinusoidal model [30] is often a good approximation that provides a compact representation for the periodic component. The transient component can be (1)CEMGIL et al.: GENERATIVE MOD... |

55 | Application of Bayesian probability network to music scene analysis - Kashino, Nakadai, et al. - 1995
Citation Context ...y propose the use of Laplace approximation around the predicted mean instead of the extended Kalman filter (EKF). For both methods, however, it is not obvious how to extend them to polyphony. Kashino =-=[17]-=- is, to our knowledge, the first author to apply graphical models explicitly to the problem of polyphonic music 1558-7916/$20.00 © 2006 IEEE680 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCES... |

53 | Monte Carlo methods for tempo tracking and rhythm quantization - Cemgil, Kappen - 2003
Citation Context ...ce is already a challenging task. On the other hand, automated generation of a human readable score includes nontrivial tasks such as tempo tracking, rhythm quantization, meter and key induction [21]–=-=[23]-=-. As also noted by other authors (e.g., [17], [24], and [25]), we believe that a model that integrates this higher level symbolic prior knowledge can guide and potentially improve the inferences, both... |

52 | Manipulation, Analysis and Retrieval Systems for Audio Signals - Tzanetakis - 2002
Citation Context ...en in speech applications. Recently, analysis of musical scenes is drawing increasingly more attention, primarily because of the need for content based retrieval in very large digital audio databases =-=[7]-=- and increasing interest in interactive music performance systems [8]. A. Music Transcription One of the hard problems in musical scene analysis is automatic music transcription, that is, the extracti... |

51 | Expectation propagation for approximate inference in dynamic Bayesian networks - Heskes, Zoeter - 1986
Citation Context ...ge that contribute less than a given fraction (e.g., 0.0001) to the total evidence. More sophisticated pruning methods with profound theoretical justification, such as resampling [23] or collapsation =-=[50]-=-, are viable alternatives but these are computationally more expensive. In our simulations, we observe that using a simple pruning method with the maximum number of components per message set to , we ... |

45 | Bayesian harmonic models for musical signal analysis - Davy, Godsill - 2002
Citation Context ...he model is that it makes no strong assumptions about the signal generation mechanism, and views the number of sources as well as the number of harmonics as unknown model parameters. Davy and Godsill =-=[20]-=- address some of the shortcomings of his model and allow changing amplitudes and frequency deviations. The reported results are encouraging, although the method is computationally very expensive. B. A... |

45 | Exact and efficient Bayesian inference for multiple change-point problems - Fearnhead - 2006
Citation Context ... grows exponentially with time step (i.e., one Gaussian for each of the exponentially many configurations ). Luckily, for the model we are considering here, the growth is polynomial in only. See also =-=[49]-=-. To see this, suppose we have the filtering density available at time as . The transition models can be organized also in a table where th row and th column correspond to Calculation of the predictiv... |

42 | Automatic Transcription of Piano Music - Raphael - 2002
Citation Context ... DISCUSSION We have presented a model driven approach where transcription is viewed as a Bayesian inference problem. In this respect, at least, our approach parallels the previous work of [19], [20], =-=[39]-=-. We believe, however, that our formulation, based on a switching state space model, has several advantages. We can remove the assumption of a frame based model and this enables us to analyze music on... |

42 | Learning sparse codes with a mixture-of-Gaussians prior - Olshausen, Millman - 2000
Citation Context ... zero and has broad tails, indicating that most of the sources are muted and only a few are sounding. It is well known that such Gaussian mixture priors induce sparse representations, e.g., see [42], =-=[43]-=- for applications in the context of source separation. A. Future Work Although our approach has many desirable features (automatically deducing number of correct notes, high temporal resolution e.t.c.... |

41 | A theory and computational model of auditory monaural sound separation - Weintraub - 1985
Citation Context ...h in this area seeks solutions to a broad range of problems such as the cocktail party problem, (for example automatically separating voices of two or more simultaneously speaking persons, see, e.g., =-=[3]-=- and [4]), identification of environmental sound objects [5] and musical scene analysis [6]. Traditionally, the focus of most research activities has been in speech applications. Recently, analysis of... |

36 | Robust Multipitch Estimation for the Analysis and Manipulation of Polyphonic Musical Signals - Klapuri, Virtanen, et al. - 2000
Citation Context ...ed generation of a human readable score includes nontrivial tasks such as tempo tracking, rhythm quantization, meter and key induction [21]–[23]. As also noted by other authors (e.g., [17], [24], and =-=[25]-=-), we believe that a model that integrates this higher level symbolic prior knowledge can guide and potentially improve the inferences, both in terms quality of a solution and computation time. There ... |

30 | Model-based segmentation of time–frequency images for musical transcription - Sterian - 1999
Citation Context ...al models explicitly to the problem of polyphonic music 1558-7916/$20.00 © 2006 IEEE680 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006 transcription. Sterian =-=[18]-=- described a system that viewed transcription as a model driven segmentation of a time-frequency image. Walmsley [19] treats transcription and source separation in a full Bayesian framework. He employ... |

29 | Harmonic analysis with probabilistic graphical models - Raphael, Stoddard - 2003
Citation Context ...ny, tempo or expression. Such structure can be captured by training probabilistic generative models on a corpus of compositions and performances by collecting statistics over selected features (e.g., =-=[44]-=-). One of the important advantages of our approach is that such prior knowledge about the musical structure can be formulated as an informative prior on a piano-roll; thus can be integrated in signal ... |

29 | Automatic music transcription and audio source separation - Plumbley, Abdallah, et al.
Citation Context ...in this paper is to consider a computational framework to move us closer to a practical solution of this problem. Music transcription has attracted significant research effort in the past—see [6] and =-=[9]-=- for a detailed review of early and more recent work, respectively. In speech processing, the related task of tracking the pitch of a single speaker is a fundamental problem and methods proposed in th... |

26 | The Estimation and Tracking of Frequency - Quinn, Hannan, et al. - 2001
Citation Context ...king of single or multiple sinusoids is a fundamental problem in many branches of applied sciences, so it is less surprising that the topic has also been deeply investigated in statistics, (e.g., see =-=[11]-=-). However, ideas from statistics seem to be not widely applied in the context of musical sound analysis, with only a few exceptions [12], [13] who present frequentist techniques for very detailed ana... |

26 | Pitch, Consonance and Harmony - Terhardt - 1974 |

23 | A hybrid graphical model for rhythmic parsing - Raphael - 2002
Citation Context ... source is already a challenging task. On the other hand, automated generation of a human readable score includes nontrivial tasks such as tempo tracking, rhythm quantization, meter and key induction =-=[21]-=-–[23]. As also noted by other authors (e.g., [17], [24], and [25]), we believe that a model that integrates this higher level symbolic prior knowledge can guide and potentially improve the inferences,... |

17 | Generative model based polyphonic music transcription - Cemgil, Kappen, et al. - 2003
Citation Context ...bolic prior knowledge can guide and potentially improve the inferences, both in terms quality of a solution and computation time. There are many different natural generative models for pianorolls. In =-=[26]-=-, we proposed a realistic hierarchical prior model. In this paper, we consider computationally simpler prior models and focus more on developing efficient inference techniques of a piano-roll represen... |

17 | Bayesian spectrum estimation of unevenly sampled nonstationary data - Qi, Minka, et al. - 2002
Citation Context ...odeling. Our analysis can be interpreted as a search procedure for a sparse representation on a set of basis vectors. In contrast to Fourier analysis, where the basis vectors are sinusoids (e.g., see =-=[41]-=- for a Bayesian treatment), we represent the observed signal implicitly using signals drawn from a stochastic process which typically generates decaying periodic oscillations (e.g., notes) with occasi... |

14 | Signal Separation of Musical Instruments - Walmsley - 2000
Citation Context ...SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006 transcription. Sterian [18] described a system that viewed transcription as a model driven segmentation of a time-frequency image. Walmsley =-=[19]-=- treats transcription and source separation in a full Bayesian framework. He employs a frame based generalized linear model (a sinusoidal model) and proposes inference by reversible-jump Markov Chain ... |

8 | Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch - Saul, Lee, et al. - 2002
Citation Context ...sis, with only a few exceptions [12], [13] who present frequentist techniques for very detailed analysis of musical sounds with particular focus on decomposition of periodic and transient components. =-=[14]-=- has presented real-time monophonic pitch tracking application based on a Laplace approximation to the posterior parameter distribution of an AR(2) model [15], [11, p. 19]. Their method outperforms se... |

8 | Approximate Kalman filtering for the harmonic plus noise model - Parra, Jain - 2001
Citation Context ... standard pitch tracking algorithms for speech, suggesting potential practical benefits of an approximate Bayesian treatment. For monophonic speech, a Kalman filter based pitch tracker is proposed by =-=[16]-=- that tracks parameters of a harmonic plus noise model (HNM). They propose the use of Laplace approximation around the predicted mean instead of the extended Kalman filter (EKF). For both methods, how... |

6 | Local harmonic estimation in musical sound signals - Irizarry - 2001
Citation Context ...ic has also been deeply investigated in statistics, (e.g., see [11]). However, ideas from statistics seem to be not widely applied in the context of musical sound analysis, with only a few exceptions =-=[12]-=-, [13] who present frequentist techniques for very detailed analysis of musical sounds with particular focus on decomposition of periodic and transient components. [14] has presented real-time monopho... |

4 | Sound-Source Recognition - Martin - 1999
Citation Context ...d, automated generation of a human readable score includes nontrivial tasks such as tempo tracking, rhythm quantization, meter and key induction [21]–[23]. As also noted by other authors (e.g., [17], =-=[24]-=-, and [25]), we believe that a model that integrates this higher level symbolic prior knowledge can guide and potentially improve the inferences, both in terms quality of a solution and computation ti... |

3 | Machine Musicianship - Rowe - 2001
Citation Context ...awing increasingly more attention, primarily because of the need for content based retrieval in very large digital audio databases [7] and increasing interest in interactive music performance systems =-=[8]-=-. A. Music Transcription One of the hard problems in musical scene analysis is automatic music transcription, that is, the extraction of a human readable and interpretable description from a recording... |

3 | Weighted estimation of harmonic components in a musical sound signal
Citation Context ... also been deeply investigated in statistics, (e.g., see [11]). However, ideas from statistics seem to be not widely applied in the context of musical sound analysis, with only a few exceptions [12], =-=[13]-=- who present frequentist techniques for very detailed analysis of musical sounds with particular focus on decomposition of periodic and transient components. [14] has presented real-time monophonic pi... |

2 | Parameter Estimation for Linear Dynamical Systems (CRG-TR-96-2) - Ghahramani, Hinton - 1996
Citation Context ...o parameter estimation in linear dynamical systems, for which no closed form solution is known. Nevertheless, this step can be calculated by an iterative expectation maximization (EM) algorithm [36], =-=[38]-=-. In practice, we observe that for realistic starting conditions , the688 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006 Fig. 12. Training the signal model wi... |

1 | A new approach to frequency analysis with amplified harmonics - Truong-Van - 1990
Citation Context ...ion of periodic and transient components. [14] has presented real-time monophonic pitch tracking application based on a Laplace approximation to the posterior parameter distribution of an AR(2) model =-=[15]-=-, [11, p. 19]. Their method outperforms several standard pitch tracking algorithms for speech, suggesting potential practical benefits of an approximate Bayesian treatment. For monophonic speech, a Ka... |

1 | Switching Kalman filters - Murphy - 1998
Citation Context ...ions is called the evidence. 2 Unfortunately, calculating this most likely piano-roll configuration is generally intractable, and is related to the difficulty of inference in Switching Kalman Filters =-=[35]-=-, [36]. We shall need to develop approximation schemes for this general case, to which we shall return in a later section. 1 In the simulations we have fixed the transition parameter p(r = mute | r = sou... |