A Real-time Music Scene Description System: Detecting Melody and Bass Lines in Audio Signals
- Speech Communication, 1999
Cited by 78 (25 self)
This paper describes a predominant-pitch estimation method that enables us to build a real-time system detecting melody and bass lines as a subsystem of our music scene description system. The purpose of this study is to build a real-time system that is practical from the engineering viewpoint, that gives suggestions to the modeling of music understanding, and that is useful in various applications. Most previous pitch-estimation methods assumed either a single-pitch sound with aperiodic noise or a few musical instruments, and had great difficulty dealing with complex audio signals sampled from compact discs, especially discs recording jazz or popular music with drum sounds. Our method can estimate the most predominant fundamental frequency (F0) in such signals containing sounds of various instruments because it does not rely on the F0's frequency component, which is often overlapped by other sounds' components, and instead estimates the F0 by using the Expectat...
Separation of Speech from Interfering Sounds Based on Oscillatory Correlation
- IEEE Transactions on Neural Networks, 1999
Cited by 67 (22 self)
A multistage neural model is proposed for an auditory scene analysis task---segregating speech from interfering sound sources. The core of the model is a two-layer oscillator network that performs stream segregation on the basis of oscillatory correlation. In the oscillatory correlation framework, a stream is represented by a population of synchronized relaxation oscillators, each of which corresponds to an auditory feature, and different streams are represented by desynchronized oscillator populations. Lateral connections between oscillators encode harmonicity, and proximity in frequency and time. Prior to the oscillator network are a model of the auditory periphery and a stage in which mid-level auditory representations are formed. The model has been systematically evaluated using a corpus of voiced speech mixed with interfering sounds, and produces improvements in terms of signal-to-noise ratio for every mixture. The performance of our model is compared with other studies on computa...
Sound-Source Recognition: A Theory and Computational Model
- 1999
Cited by 61 (0 self)
The ability of a normal human listener to recognize objects in the environment from only the sounds they produce is extraordinarily robust with regard to characteristics of the acoustic environment and of other competing sound sources. In contrast, computer systems designed to recognize sound sources function precariously, breaking down whenever the target sound is degraded by reverberation, noise, or competing sounds. Robust listening requires extensive contextual knowledge, but the potential contribution of sound-source recognition to the process of auditory scene analysis has largely been neglected by researchers building computational models of the scene analysis process. This thesis proposes a theory of sound-source recognition, casting recognition as a process of gathering information to enable the listener to make inferences about
Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations
- Proc. IEEE, 1998
Automatic Transcription of Simple Polyphonic Music: . . .
- 1996
Cited by 37 (1 self)
It is only very recently that systems have been developed that transcribe polyphonic music with more than two voices in even limited generality. Two of these systems [Kashino et al. 1995, Martin 1996] have been built within a blackboard framework, integrating front ends based on sinusoidal analysis with musical knowledge. These and other systems to date rely on instrument models for detecting octaves. Recent results have shown that an autocorrelation-based front end may make bottom-up detection of octaves possible, thereby improving system performance as well as reducing the distance between transcription models and human audition. This report outlines the blackboard approach to automatic transcription and presents a new system based on the log-lag correlogram of [Ellis 1996]. Preliminary results are presented, outlining the bottom-up detection of octaves and transcription of simple polyphonic music.
A Generative Model for Music Transcription
- 2005
Cited by 26 (7 self)
In this paper we present a graphical model for polyphonic music transcription. Our model, formulated as a Dynamical Bayesian Network, embodies a transparent and computationally tractable approach to this acoustic analysis problem. An advantage of our approach is that it places emphasis on explicitly modelling the sound generation procedure. It provides a clear framework in which high-level (cognitive) prior information on music structure can be coupled with low-level (acoustic, physical) information in a principled manner to perform the analysis. The model is a special case of the generally intractable switching Kalman filter model. Where possible, we derive exact polynomial-time inference procedures, and otherwise efficient approximations. We argue that our generative-model-based approach is computationally feasible for many music applications and is readily extensible to more general auditory scene analysis scenarios.
Automatic chord transcription with concurrent recognition of chord symbols and boundaries
- Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR04), 2004
Cited by 17 (1 self)
This paper describes a method that recognizes musical chords from real-world audio signals in compact-disc recordings. The automatic recognition of musical chords is necessary for music information retrieval (MIR) systems, since the chord sequences of musical pieces capture the characteristics of their accompaniments. None of the previous methods can accurately recognize musical chords from complex audio signals that contain vocal and drum sounds. The main problem is that the chord-boundary-detection and chord-symbol-identification processes are inseparable because of their mutual dependency. In order to solve this mutual dependency problem, our method generates hypotheses about tuples of chord symbols and chord boundaries, and outputs the most plausible one as the recognition result. The certainty of a hypothesis is evaluated based on three cues: acoustic features, chord progression patterns, and bass sounds. Experimental results show that our method successfully recognized chords in seven popular music songs; the average accuracy of the results was around 77%.
Personal communication with A. Agogino
- IEEE Trans. Audio, Speech, and Language Proc., 2007
Cited by 16 (4 self)
Although the process of analyzing an audio recording of a music performance is complex and difficult even for a human listener, there are limited forms of information that may be tractably extracted and yet still enable interesting applications. We discuss melody (roughly, the part a listener might whistle or hum) as one such reduced descriptor of music audio, and consider how to define it, and what use it might be. We go on to describe the results of full-scale evaluations of melody transcription systems conducted in 2004 and 2005, including an overview of the systems submitted, details of how the evaluations were conducted, and a discussion of the results. For our definition of melody, current systems can achieve around 70% correct transcription at the frame level, including distinguishing between the presence or absence of the melody. Melodies transcribed at this level are readily recognizable, and show promise for practical applications.

