Results 1 - 10 of 44
3-D Sound for Virtual Reality and Multimedia, 2000
Cited by 177 (1 self)
This paper gives HRTF magnitude data in numerical form for 43 frequencies between 0.2 and 12 kHz, the average of 12 studies representing 100 different subjects. However, no phase data is included in the tables; group delay simulation would need to be included in order to account for ITD. In 3-D sound applications intended for many users, we might want to use HRTFs that represent the common features of a number of individuals. But another approach might be to use the features of a person who has desirable HRTFs, based on some criteria. (One can sense a future 3-D sound system where the pinnae of various famous musicians are simulated.) A set of HRTFs from a good localizer (discussed in Chapter 2) could be used if the criterion were localization performance. If the localization ability of the person is relatively accurate or more accurate than average, it might be reasonable to use these HRTF measurements for other individuals. The Convolvotron 3-D audio system (Wenzel, Wightman, and Foster, 1988) has used such sets particularly because elevation accuracy is affected negatively when listening through a bad localizer's ears (see Wenzel, et al., 1988). It is best when any single nonindividualized HRTF set is psychoacoustically validated using a statistical sample of the intended user population, as shown in Chapter 2. Otherwise, the use of one HRTF set over another is a purely subjective judgment based on criteria other than localization performance. The technique used by Wightman and Kistler (1989a) exemplifies a laboratory-based HRTF measurement procedure where accuracy and replicability of results were deemed crucial. A comparison of their techniques with those described in Blauert (1983), Shaw (1974), Mehrgardt and Mellert (1977), Middlebrooks, Makous, and Gree...
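The abstract notes that magnitude-only HRTF tables carry no phase, so an interaural time difference (ITD) has to be simulated separately. A minimal sketch of one way this might be done, assuming a frequency-independent delay rounded to whole samples; the function name `apply_itd`, the 0.5 ms delay, and the 44.1 kHz rate are illustrative assumptions, not values from the paper, and the magnitude filtering itself is not shown:

```python
def apply_itd(mono, itd_seconds, sample_rate):
    """Return (left, right) with the far ear delayed by a whole number of samples.

    Assumption: a positive itd_seconds means the source is on the left,
    so the right ear is delayed; a negative value delays the left ear.
    Rounding to integer samples is a simplification of a true group delay.
    """
    delay = round(abs(itd_seconds) * sample_rate)
    delayed = [0.0] * delay + list(mono)   # far ear: leading silence
    direct = list(mono) + [0.0] * delay    # near ear: padded to equal length
    if itd_seconds >= 0:
        return direct, delayed
    return delayed, direct


# Hypothetical usage: a three-sample click with a 0.5 ms ITD at 44.1 kHz.
left, right = apply_itd([1.0, 0.5, 0.25], itd_seconds=0.0005, sample_rate=44100)
```

A constant delay is the crudest form of group delay simulation; a fuller treatment would let the delay vary with frequency and source direction.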
Multimodal System Processing in Mobile Environments, 2000
Cited by 31 (5 self)
One major goal of multimodal system design is to support more robust performance than can be achieved with a unimodal recognition technology, such as a spoken language system. In recent years, the multimodal literatures on speech and pen input and speech and lip movements have begun developing relevant performance criteria and demonstrating a reliability advantage for multimodal architectures. In the present studies, over 2,600 utterances processed by a multimodal pen/voice system were collected during both mobile and stationary use. A new data collection infrastructure was developed, including instrumentation worn by the user while roaming, a researcher field station, and a multimodal data logger and analysis tool tailored for mobile research. Although speech recognition as a stand-alone technology failed more often during mobile system use, the results confirmed that a more stable multimodal architecture decreased this error rate by 19-35%. Furthermore, these findings were replicated across dif...
Uncertainty decoding for noise robust speech recognition - in Proc. Interspeech, 2004
Cited by 26 (8 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Effect of speaking style on LVCSR performance - In ICSLP, 1996
Cited by 17 (0 self)
SRI collected a corpus to study how spontaneous speech differs from other types of speech. The corpus was collected in two parts: (1) a spontaneous Switchboard-style conversation on an assigned topic, and (2) a reading session in which participants read transcripts of their conversation from part 1. Experiments were conducted on sentences with identical transcripts that varied in speaking style. The word-error rates varied from 29% (careful dictation) to 53% (spontaneous conversation) depending on the speaking style. These experiments show that speaking style is a dominant factor in determining the performance of large-vocabulary conversational speech recognition (LVCSR) systems.
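The abstract reports word-error rates but does not say how they were scored. A minimal sketch, assuming the conventional definition (substitutions + deletions + insertions, divided by the number of reference words) computed with word-level edit distance; the function name and the example sentences are illustrative, not drawn from the corpus:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical usage: one substitution and one deletion over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, a rate computed this way can exceed 100% when the hypothesis is much longer than the reference.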
Pronunciation Adaptation At the Lexical Level - Proceedings ISCA ITRW Workshop Adaptation Methods for Speech Recognition, Sophia Antipolis, France [on CD-ROM], 2001
Cited by 14 (8 self)
There are various kinds of adaptation which can be used to enhance the performance of automatic speech recognizers. This paper is about pronunciation adaptation at the lexical level, i.e. about modeling pronunciation variation at the lexical level. In the early years of automatic speech recognition (ASR) research, the amount of pronunciation variation was limited by using isolated words. Since the focus gradually shifted from isolated words to conversational speech, the amount of pronunciation variation present in the speech signals has increased, as has the need to model it. This is reflected by the growing attention to this topic. In this paper, an overview of the studies on lexicon adaptation is presented. Furthermore, many examples are mentioned of situations in which lexicon adaptation is likely to improve the performance of speech recognizers. Finally, it is argued that some assumptions made in current standard ASR systems are not in line with the properties of the speech signals. Consequently, the problem of pronunciation variation at the lexical level probably cannot be solved by simply adding new transcriptions to the lexicon, as is generally done at the moment.
Linguistic adaptations during spoken and multimodal error resolution - Language and Speech. Special issue on Prosody and Conversation, 1998
Cited by 14 (1 self)
error resolution, hyperarticulation, linguistic contrast, multimodal interaction, spiral errors, spoken and multimodal interaction
Toward adaptive conversational interfaces: Modeling speech convergence with animated personas - ACM TRANS. ON CHI, 2004
Cited by 10 (0 self)
The design of robust interfaces that process conversational speech is a challenging research direction largely because users’ spoken language is so variable. This research explored a new dimension of speaker stylistic variation by examining whether users’ speech converges systematically with the text-to-speech (TTS) heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7-to-10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children’s amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech 10-50%, with the largest adaptations involving utterance pause structure and amplitude. Children’s speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when introduced to new partners, and generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70-95% of children converging with their partner’s TTS, although individual differences in magnitude of adaptation were evident. In the design of future conversational systems, users’ spontaneous convergence could be exploited to guide their speech within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new
Multimodal interface research: A science without borders, 2000
Cited by 8 (0 self)
Multimodal research represents "Science without Borders" because it requires combining expertise from different component technologies, academic disciplines, and cultural/international perspectives. It also is rapidly erasing borders as it promotes the increased accessibility of computing for diverse and non-specialist users, and for field and mobile usage environments. This paper reviews two studies that highlight recent advances within the field. It also draws parallels between the multimodal areas of speech/pen and speech/lip movement research. Finally, it indicates new research challenges that will require additional bold "border crossings" in the near future. In the medical community, there is an international group called Physicians without Borders that many of you undoubtedly are familiar with (URL: http://www.dwb.org/). Physicians without Borders/Médecins Sans Frontières is an organization of volunteer medical personnel who respond to medical needs and emergencies around the w...
The Effect of Cue-Enhancement on the Intelligibility of Nonsense Word and Sentence Materials Presented in Noise, 1998
Cited by 8 (1 self)
Two sets of experiments were performed to test the perceptual benefits of enhancing consonantal regions which contain a high density of acoustic cues to phonemic contrasts. In the first set, hand-annotated consonantal regions of natural vowel-consonant-vowel (VCV) stimuli were amplified to increase their salience, and filtered to stylise the cues they contained. In the second set, corresponding regions in natural semantically-unpredictable sentence (SUS) material were annotated and enhanced in the same way. Both sets of stimuli were combined with speech-shaped noise and presented to normally-hearing listeners. The VCV experiments showed statistically significant improvements in intelligibility as a result of enhancement; significant improvements were also obtained for sentence material after some adjustments in enhancement strategies and levels. These results demonstrate the benefits gained from enhancement techniques which use knowledge of acoustic cues to phonetic contrasts to improve the intelligibility of speech in the presence of background noise. © 1998 Elsevier Science B.V. All rights reserved.
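The enhancement pipeline described above (amplify annotated regions, then combine with noise) can be roughed out as follows. This is only a sketch under stated assumptions: the region format (half-open sample index pairs), the gain in dB, and the target signal-to-noise ratio are all hypothetical, the abstract gives none of these values, and the filtering step that stylises the cues is omitted:

```python
import math


def amplify_regions(samples, regions, gain_db):
    """Scale samples inside each half-open [start, end) region by gain_db."""
    gain = 10 ** (gain_db / 20)  # dB to linear amplitude ratio
    out = list(samples)
    for start, end in regions:
        for i in range(start, min(end, len(out))):
            out[i] *= gain
    return out


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mean-power ratio equals snr_db."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Hypothetical usage: boost samples 1-2 by 6 dB, then mix with noise at 0 dB SNR.
enhanced = amplify_regions([1.0, 1.0, 1.0, 1.0], [(1, 3)], gain_db=6.0)
mixed = mix_at_snr([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5], snr_db=0.0)
```

The noise here is an arbitrary sequence; the study used speech-shaped noise, which would need spectral shaping that this sketch does not attempt.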

