Construction And Evaluation Of A Robust Multifeature Speech/music Discriminator
, 1997
Cited by 230 (4 self)
We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound. 1. OVERVIEW The problem of distinguishing speech signals from music signals has become increasingly important as automatic speech recognition (ASR) systems are applied to more and more "real-world" multimedia domains. If we wish to build systems that perform ASR on soundtrack data, for example, it is important to be able to distinguish which segments...
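The drop from frame-level to segment-level error reported in this abstract can be illustrated with a small sketch. It assumes that segment decisions are obtained by a simple majority vote over per-frame labels; the abstract does not say how the integration is actually done, so the function name `integrate_segments` and the voting rule are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def integrate_segments(frame_labels: Sequence[str], frames_per_segment: int) -> List[str]:
    """Return one label per non-overlapping segment by majority vote.

    Assumption: majority voting stands in for whatever integration
    the original system uses; it is not taken from the paper.
    """
    segments = []
    for start in range(0, len(frame_labels), frames_per_segment):
        window = frame_labels[start:start + frames_per_segment]
        # most_common(1) yields the label with the highest frame count
        segments.append(Counter(window).most_common(1)[0][0])
    return segments


# Ten frame decisions containing two isolated frame errors
frames = ["speech", "speech", "music", "speech", "speech",
          "music", "music", "speech", "music", "music"]
segments = integrate_segments(frames, frames_per_segment=5)
```

Here the isolated wrong frames are outvoted within each five-frame segment, which is the intuition behind a lower error rate on longer segments.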
Content-based representation and retrieval of visual media: A state-of-the-art review
- Multimedia Tools and Applications
, 1996
Cited by 117 (2 self)
This paper reviews a number of recently available techniques in content analysis of visual media and their application to the indexing, retrieval, abstracting, relevance assessment, interactive perception, annotation and re-use of visual documents. 1. Background A few years ago, the problems of representation and retrieval of visual media were confined to specialized image databases (geographical, medical, pilot experiments in computerized slide libraries), in the professional applications of the audiovisual industries (production, broadcasting and archives), and in computerized training or education. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive video, large distributed digital libraries, multimedia publishing. Though the most important investments have been targeted at the information infrastructure (networks, servers, coding and compression, delivery models, multimedia systems architecture), a growing number of researchers have realized that content processing will be a key asset in putting together successful applications. The need for content processing techniques has been made evident from a variety of angles, ranging from achieving better quality in compression, allowing user choice of programs in video-on-demand, achieving better productivity in video production, providing access to large still image databases or integrating still images and video in multimedia publishing and cooperative work. Content-based retrieval of visual media and representation of visual documents in human-computer interfaces are based on the availability of content representation data (time-structure for
SpeechSkimmer: A System for Interactively Skimming Recorded Speech
- ACM Transactions on Computer Human Interaction
, 1997
Cited by 85 (1 self)
Note that the text that appeared in printed journal contains very minor typographic and grammatical corrections that do not appear in this version. SpeechSkimmer:
Sound-Source Recognition: A Theory and Computational Model
, 1999
Cited by 61 (0 self)
The ability of a normal human listener to recognize objects in the environment from only the sounds they produce is extraordinarily robust with regard to characteristics of the acoustic environment and of other competing sound sources. In contrast, computer systems designed to recognize sound sources function precariously, breaking down whenever the target sound is degraded by reverberation, noise, or competing sounds. Robust listening requires extensive contextual knowledge, but the potential contribution of sound-source recognition to the process of auditory scene analysis has largely been neglected by researchers building computational models of the scene analysis process. This thesis proposes a theory of sound-source recognition, casting recognition as a process of gathering information to enable the listener to make inferences about
Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness
, 2003
Cited by 46 (5 self)
A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, where the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism in estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures ranging from one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise, and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound is estimated along with the fundamental frequency.
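The iterative estimate-and-subtract loop described in this abstract can be sketched on a toy spectrum. This is a simplified illustration, not the paper's method: it scores candidates by plain harmonic summation with ideal harmonicity, and it removes detected partials outright instead of applying the spectral smoothness principle. The function names, the dictionary spectrum representation and all parameter values are assumptions.

```python
from typing import Dict, Iterable, List


def harmonic_salience(spectrum: Dict[int, float], f0: int, n_harmonics: int) -> float:
    """Sum the magnitudes found at the first n_harmonics multiples of f0."""
    return sum(spectrum.get(h * f0, 0.0) for h in range(1, n_harmonics + 1))


def estimate_f0s(spectrum: Dict[int, float], candidates: Iterable[int],
                 n_harmonics: int = 5, max_sounds: int = 6,
                 min_salience: float = 1.0) -> List[int]:
    """Repeatedly pick the most salient F0, then remove its partials."""
    residual = dict(spectrum)      # work on a copy of the mixture
    candidates = list(candidates)
    found: List[int] = []
    for _ in range(max_sounds):
        best = max(candidates,
                   key=lambda f0: harmonic_salience(residual, f0, n_harmonics))
        if harmonic_salience(residual, best, n_harmonics) < min_salience:
            break                  # nothing prominent is left in the residual
        found.append(best)
        # Crude subtraction (assumption): delete every partial of the
        # detected sound, including any that overlap with other sounds.
        for h in range(1, n_harmonics + 1):
            residual.pop(h * best, None)
    return found


def toy_sound(f0: int, amplitude: float, n_harmonics: int = 5) -> Dict[int, float]:
    """Hypothetical test signal: equal-amplitude partials at multiples of f0 (Hz)."""
    return {h * f0: amplitude for h in range(1, n_harmonics + 1)}


# Mix a 100 Hz sound with a quieter 150 Hz sound; they share the 300 Hz partial
mixture: Dict[int, float] = {}
for sound in (toy_sound(100, 1.0), toy_sound(150, 0.6)):
    for freq, mag in sound.items():
        mixture[freq] = mixture.get(freq, 0.0) + mag

detected = estimate_f0s(mixture, candidates=range(50, 501))
```

Note that the crude subtraction also deletes the shared 300 Hz partial of the second sound. Damage of this kind to overlapping partials is what a more careful estimate of each sound's spectral envelope would aim to avoid.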
A Blackboard System For Automatic Transcription of . . .
, 1987
Cited by 46 (1 self)
A novel computational system has been constructed which is capable of transcribing piano performances of four-voice Bach chorales written in the style of 18th century counterpoint. The system
Media Streams: An iconic visual language for video annotation
- In Proc. IEEE Symposium on Visual Languages
, 1993
Cited by 37 (2 self)
In order to enable the search and retrieval of video from large archives, we need a representation language for video content. Although some aspects of video can be automatically parsed, a sufficient representation requires that video be annotated. We discuss the design of a video representation language with special attention to the issue of creating a global, reusable video archive. Our prototype system, Media Streams, enables users to create multi-layered, iconic annotations of streams of video data. Within Media Streams, the organization and categories of the Icon Space allow users to browse and compound over 3500 iconic primitives by means of a cascading hierarchical structure that supports compounding icons across branches of the hierarchy. A Media Time Line enables users to visualize, browse, annotate, and retrieve video content. The challenges of creating a representation of human action in video are discussed in detail, with focus on the effect of the syntax of video sequences on the semantics of video shots. 1 Introduction: The Need
Automatic Transcription of Simple Polyphonic Music: . . .
, 1996
Cited by 37 (1 self)
It is only very recently that systems have been developed that transcribe polyphonic music with more than two voices in even limited generality. Two of these systems [Kashino et al. 1995, Martin 1996] have been built within a blackboard framework, integrating front ends based on sinusoidal analysis with musical knowledge. These and other systems to date rely on instrument models for detecting octaves. Recent results have shown that an autocorrelation-based front end may make bottom-up detection of octaves possible, thereby improving system performance as well as reducing the distance between transcription models and human audition. This report outlines the blackboard approach to automatic transcription and presents a new system based on the log-lag correlogram of [Ellis 1996]. Preliminary results are presented, outlining the bottom-up detection of octaves and transcription of simple polyphonic music.
Automatic Transcription of Music
, 2001
Cited by 30 (0 self)
A system for the automatic transcription of music is described. Signal processing methods are introduced that solve different facets of the overall problem. Main emphasis is laid on finding the multiple pitches of concurrent musical sounds. Sound onset detection and musical meter estimation are described to some extent. Other topics discussed are noise robustness, estimation of the number of concurrent voices, sound separation, and musical instrument recognition. The presented system is evaluated using a database of musical sounds, synthesized MIDI-songs, and CD recordings. Also, the performance of the system is compared to that of human listeners. 1.
Musical Information Retrieval Using Musical Parameters
- In Proceedings of the 1998 International Computer Music Conference
, 1998
Cited by 25 (8 self)
The application domain for automatical retrieval of melodic excerpts in musical collections is wide; e.g. it would facilitate the work of music researcher trying to find specific features in music. In this paper we consider several parts of the retrieving process. We present our representation for musical data. This inner representation is converted and established from MIDI-files. For the matching we use a particular encoding (two dimensional relative code), which is formed out of the inner representation. This encoding can be interpreted differently depending on the way the key is given. Furthermore, in the matching phase we use an efficient indexing structure, well-known in string pattern matching, called suffix-trie. 1 Introduction In the earlier researches concerning musical data representation, researchers seemed to be rather sensible to the delicate details of different styles of music. One example of such a meticulous approach is Leo Plenckers encoding system for Spanish med...
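The idea of matching a relative encoding through a suffix trie can be sketched as follows. The sketch assumes a one-dimensional encoding (successive pitch differences in semitones) in place of the paper's two dimensional relative code, whose definition is not given in this abstract. The class and function names are hypothetical, and the trie is built naively in quadratic time.

```python
from typing import Dict, List, Sequence, Tuple


def intervals(pitches: Sequence[int]) -> Tuple[int, ...]:
    """Encode MIDI pitches as successive differences (transposition-invariant)."""
    return tuple(b - a for a, b in zip(pitches, pitches[1:]))


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}
        self.starts: List[int] = []   # start positions of suffixes passing through


class SuffixTrie:
    """Naive suffix trie: every suffix of the sequence is inserted symbol by symbol."""

    def __init__(self, symbols: Sequence[int]) -> None:
        self.root = Node()
        for start in range(len(symbols)):
            node = self.root
            for sym in symbols[start:]:
                node = node.children.setdefault(sym, Node())
                node.starts.append(start)

    def find(self, pattern: Sequence[int]) -> List[int]:
        """Return every start index where pattern occurs (empty list if none)."""
        node = self.root
        for sym in pattern:
            child = node.children.get(sym)
            if child is None:
                return []
            node = child
        return list(node.starts)


# Indexed melody (MIDI note numbers) and a query given in a different key
melody = [60, 62, 64, 60, 62, 64, 65]
trie = SuffixTrie(intervals(melody))
hits = trie.find(intervals([67, 69, 71]))
```

Because both the indexed melody and the query are reduced to intervals, the query matches even though it is transposed up a fifth. Each hit is an index into the interval sequence, which is also the index of the first note of the match in `melody`.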

