3-D Sound for Virtual Reality and Multimedia
, 2000
Cited by 177 (1 self)
This paper gives HRTF magnitude data in numerical form for 43 frequencies between 0.2 and 12 kHz, the average of 12 studies representing 100 different subjects. However, no phase data is included in the tables; group delay simulation would need to be included in order to account for ITD. In 3-D sound applications intended for many users, we might want to use HRTFs that represent the common features of a number of individuals. But another approach might be to use the features of a person who has desirable HRTFs, based on some criteria. (One can sense a future 3-D sound system where the pinnae of various famous musicians are simulated.) A set of HRTFs from a good localizer (discussed in Chapter 2) could be used if the criterion were localization performance. If the localization ability of the person is relatively accurate or more accurate than average, it might be reasonable to use these HRTF measurements for other individuals. The Convolvotron 3-D audio system (Wenzel, Wightman, and Foster, 1988) has used such sets particularly because elevation accuracy is affected negatively when listening through a bad localizer's ears (see Wenzel, et al., 1988). It is best when any single nonindividualized HRTF set is psychoacoustically validated using a statistical sample of the intended user population, as shown in Chapter 2. Otherwise, the use of one HRTF set over another is a purely subjective judgment based on criteria other than localization performance. The technique used by Wightman and Kistler (1989a) exemplifies a laboratory-based HRTF measurement procedure where accuracy and replicability of results were deemed crucial. A comparison of their techniques with those described in Blauert (1983), Shaw (1974), Mehrgardt and Mellert (1977), Middlebrooks, Makous, and Gree...
Automatic Extraction of Tempo and Beat from Expressive Performances
- Journal of New Music Research
, 2001
Cited by 121 (18 self)
We describe a computer program which is able to estimate the tempo and the times of musical beats in expressively performed music. The input data may be either digital audio or a symbolic representation of music such as MIDI. The data is processed off-line to detect the salient rhythmic events, and the timing of these events is analysed to generate hypotheses of the tempo at various metrical levels. Based on these tempo hypotheses, a multiple hypothesis search finds the sequence of beat times which has the best fit to the rhythmic events. We show that estimating the perceptual salience of rhythmic events significantly improves the results. No prior knowledge of the tempo, meter or musical style is assumed; all required information is derived from the data. Results are presented for a range of different musical styles, including classical, jazz, and popular works with a variety of tempi and meters. The system calculates the tempo correctly in most cases, the most common error being a doubling or halving of the tempo. The calculation of beat times is also robust. When errors are made concerning the phase of the beat, the system recovers quickly to resume correct beat tracking, despite the fact that there is no high level musical knowledge encoded in the system.
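As a rough illustration of how the timing of rhythmic events can yield tempo hypotheses at several metrical levels, the sketch below groups inter-onset intervals into clusters of similar length and ranks them by how often they occur. This is not the paper's algorithm: the function names, the `tolerance` and `max_ioi` parameters, and the simple running-mean clustering are all assumptions made for the example, and salience weighting and the beat-time search are omitted.

```python
def ioi_clusters(onsets, tolerance=0.025, max_ioi=2.0):
    """Group inter-onset intervals (in seconds) into clusters of similar length.

    `onsets` must be sorted in ascending order. All names and defaults here
    are hypothetical choices for illustration.
    """
    clusters = []  # each cluster is [sum_of_intervals, count]
    for i, t0 in enumerate(onsets):
        for t1 in onsets[i + 1:]:
            ioi = t1 - t0
            if ioi > max_ioi:
                break  # later onsets are even further away
            for c in clusters:
                # Join the first cluster whose mean is close enough.
                if abs(c[0] / c[1] - ioi) <= tolerance:
                    c[0] += ioi
                    c[1] += 1
                    break
            else:
                clusters.append([ioi, 1])
    # Most frequent interval first; report (mean interval, count).
    ranked = sorted(clusters, key=lambda c: c[1], reverse=True)
    return [(c[0] / c[1], c[1]) for c in ranked]


def tempo_hypotheses(onsets, **kwargs):
    """Convert ranked interval clusters to (beats per minute, count) pairs."""
    return [(60.0 / mean, count) for mean, count in ioi_clusters(onsets, **kwargs)]


hypotheses = tempo_hypotheses([0.0, 0.5, 1.0, 1.5, 2.0])
```

For evenly spaced onsets the top two hypotheses differ by a factor of two, which mirrors the doubling and halving errors the abstract mentions.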
A Review of The Cocktail Party Effect
- JOURNAL OF THE AMERICAN VOICE I/O SOCIETY
, 1992
Cited by 74 (3 self)
The "cocktail party effect", the ability to focus one's listening attention on a single talker among a cacophony of conversations and background noise, has been recognized for some time. This specialized listening ability may be due to characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing. This paper investigates the literature on what is known about the effect, from the original technical descriptions through current research in the areas of auditory streams and spatial display systems. The underlying goal of the paper is to analyze the components of this effect to uncover relevant attributes of the speech production and perception chain that could be exploited in future speech communication systems. The motivation is to build a system that can simultaneously present multiple streams of speech information such that a user can focus on one stream, yet easily shift attention to the others. A set of speech appli...
Unsupervised Clustering Of Ambulatory Audio And Video
Cited by 72 (9 self)
A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audio/visual input. We find our system can (without any prior labeling of data) cluster the audio/visual data into events, such as passing through doors and crossing the street. Also, we hierarchically cluster these events into scenes and get clusters that correlate with visiting the supermarket, or walking down a busy street.
The Link Between Brain Learning, Attention, And Consciousness
, 1998
Cited by 65 (28 self)
The processes whereby our brains continue to learn about a changing world in a stable fashion throughout life are proposed to lead to conscious experiences. These processes include the learning of top-down expectations, the matching of these expectations against bottom-up data, the focusing of attention upon the expected clusters of information, and the development of resonant states between bottom-up and top-down processes as they reach an attentive consensus between what is expected and what is there in the outside world. It is suggested that all conscious states in the brain are resonant states, and that these resonant states trigger learning of sensory and cognitive representations. The models which summarize these concepts are therefore called Adaptive Resonance Theory, or ART, models. Psychophysical and neurobiological data in support of ART are presented from early vision, visual object recognition, auditory streaming, variable-rate speech perception, somatosensory perception, a...
The Attentive Brain
, 1995
Cited by 63 (25 self)
in's face (A) is seen through small apertures (B), its meaning as a face is greatly degraded. ears sequentially. To process a pattern of sounds as a whole, it must be "recoded". Such a recoding, or processing stage, is often called a working memory, which stores short-term-memory traces. To identify familiar events, the brain compares short-term traces with stored categories. These categories are accessed using long-term-memory traces, which represent previous experiences that have been acquired through learning. Somehow, we can rapidly learn new facts, placing them in long-term memory, without being forced just as rapidly to forget others. How does brain processing keep old memories stable and still maintain enough plasticity to learn new things? What I call the stability-plasticity dilemma must be solved by every brain system that attempts to learn about the flood of external signals. I shall examine several challenging examples of visual and auditory data that suggest how the brain mi
Video Scene Segmentation Via Continuous Video Coherence
- Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition
, 1998
Cited by 58 (5 self)
In extended video sequences, individual frames are grouped into shots which are defined as a sequence taken by a single camera, and related shots are grouped into scenes which are defined by a single dramatic event taken by a small number of related cameras. This hierarchical structure is deliberately constructed, dictated by the limitations and preferences of the human visual and memory systems. We present three novel high-level segmentation results derived from these considerations, some of which are analogous to those involved in the perception of the structure of music. First and primarily, we derive and demonstrate a method for measuring probable scene boundaries, by calculating a short term memory-based model of shot-to-shot "coherence". The detection of local minima in this continuous measure permits robust and flexible segmentation of the video into scenes, without the necessity for first aggregating shots into clusters. Second, and independently of the first, we then derive and demonstrate a one-pass on-the-fly shot clustering algorithm. Third, we demonstrate partially successful results on the application of these two new methods to the next higher, "theme", level of video structure.
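The boundary-detection step described in this abstract (finding local minima in a continuous coherence measure) can be sketched as follows. The coherence values are taken as given, since the abstract does not specify how the memory-based model computes them; the function name and the optional `threshold` parameter are assumptions for illustration only.

```python
def local_minima(values, threshold=None):
    """Return indices where a value is strictly lower than both neighbours.

    `values` stands in for a per-shot-boundary coherence score (hypothetical
    input). If `threshold` is given, only minima at or below it are kept.
    """
    minima = []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            if threshold is None or values[i] <= threshold:
                minima.append(i)
    return minima


# Made-up coherence scores between consecutive shots.
coherence = [0.9, 0.8, 0.2, 0.7, 0.9, 0.6, 0.8]
candidate_boundaries = local_minima(coherence)
strong_boundaries = local_minima(coherence, threshold=0.3)
```

Adding a threshold is one simple way to trade off how many scene boundaries are reported; the paper may use a different criterion.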
Visual indexes, preconceptual objects, and situated vision
- Cognition
, 2001
Cited by 53 (12 self)
This paper argues that a theory of situated vision, suited for the dual purposes of object recognition and the control of action, will have to provide something more than a system that constructs a conceptual representation from visual stimuli: it will also need to provide a special kind of direct (preconceptual, unmediated) connection between elements of a visual representation and certain elements in the world. Like natural language demonstratives (such as 'this' or 'that'), this direct connection allows entities to be referred to without being categorized or conceptualized. Several reasons are given for why we need such a preconceptual mechanism which individuates and keeps track of several individual objects in the world. One is that early vision must pick out and compute the relation among several individual objects while ignoring their properties. Another is that incrementally computing and updating representations of a dynamic scene requires keeping track of token individuals despite changes in their properties or locations. It is then noted that a mechanism meeting these requirements has already been proposed in order to account for a number of disparate empirical phenomena, including subitizing, search-subset selection and multiple object tracking
Hyperspeech: Navigating in Speech-Only Hypermedia
- In Hypertext '91
, 1991
Cited by 51 (11 self)
Most hypermedia systems emphasize the integration of graphics, images, video, and audio into a traditional hypertext framework. The hyperspeech system described in this paper, a speech-only hypermedia application, explores issues of navigation and system architecture in an audio environment without a visual display. The system under development uses speech recognition to maneuver in a database of digitally recorded speech segments; synthetic speech is used for control information and user feedback. In this research prototype, recorded audio interviews were segmented by topic, and hypertext-style links were added to connect logically related comments and ideas. The software architecture is data driven, with all knowledge embedded in the links and nodes, allowing the software that traverses through the network to be straightforward and concise. Several user interfaces were prototyped, emphasizing different styles of speech interaction and feedback between the user and machine. In additio...
Audio content analysis for online audiovisual data segmentation and classification
- IEEE SIGNAL PROCESSING MAGAZINE MARCH 2004
, 2001
Cited by 46 (2 self)
While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure is proposed to segment and classify audio signals and is built upon morphological and statistical analysis of the time-varying functions of these audio features. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification. Index Terms: audio analysis, audio indexing, audio segmentation, audiovisual content parsing, information filtering and retrieval, multimedia database management.
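Two of the features named in this abstract, the energy function and the zero-crossing rate, are cheap to compute per frame, which is why they suit real-time use. The sketch below shows one common textbook formulation; the frame length, hop size, and function names are assumptions for the example and are not taken from the paper, which may define the features differently.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_features(signal, frame_len=256, hop=128):
    """Return (energy, zero-crossing rate) for each overlapping frame.

    `frame_len` and `hop` are illustrative defaults, not values from the paper.
    """
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        features.append((short_time_energy(frame), zero_crossing_rate(frame)))
    return features


# A signal that alternates sign on every sample crosses zero at every step.
alternating = [1.0, -1.0] * 256
features = frame_features(alternating)
```

A rule-based classifier of the kind the abstract describes would then compare such per-frame values, or statistics over them, against thresholds; those rules are not reproduced here.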

