Speaker recognition: A tutorial
Cited by 121 (1 self)
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person’s claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
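The two operating modes named in this abstract can be illustrated with a minimal sketch. It assumes each enrolled speaker already has a similarity score for the test utterance; the score dictionary, the speaker names, and the threshold are hypothetical and are not taken from the tutorial.

```python
def identify(scores):
    """Identification: return the enrolled speaker whose model scores highest."""
    return max(scores, key=scores.get)


def verify(scores, claimed_id, threshold):
    """Verification: accept the claimed identity only if its score reaches the threshold."""
    return scores[claimed_id] >= threshold


# Hypothetical similarity scores of one utterance against two enrolled models.
scores = {"alice": 0.82, "bob": 0.35}
best_match = identify(scores)
claim_accepted = verify(scores, "bob", threshold=0.5)
```

Identification compares the utterance against every enrolled model, whereas verification tests a single claimed identity against a decision threshold.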
Evaluating Audio and Video Quality in Low-Cost Multimedia Conferencing Systems
, 1996
Cited by 54 (12 self)
Real-time audio and video transmission over shared packet networks, such as the Internet, has become possible thanks to efficient data compression schemes and the provision of high-speed networks. Low-cost multimedia conferencing technology could benefit many users in different areas, such as remote collaboration, distance education and healthcare. It is likely that diverse tasks performed by users in different application domains will require different levels of audio and video quality. Established methods of rating audio and video quality in the broadcast and telephony world cannot be applied to digital, lower-quality images and sound. The providers of networks and services are looking to HCI to provide means of assessing audio and video quality. This paper describes two different approaches to assessing audio and video of desktop conferencing systems: a controlled experimental study and an informal field trial. We discuss the advantages and disadvantages of both approach...
Natural-Sounding Speech Synthesis Using Variable-Length Units
, 1998
Cited by 33 (4 self)
The goal of this work was to develop a speech synthesis system which concatenates variable-length units to create natural-sounding speech. Our initial work in this area showed that by careful design of system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with word- and phrase-level concatenation. In order to extend the flexibility of this framework, we focused on the problem of generating novel words from a corpus of sub-word units. The design of the sub-word units was motivated by perceptual studies that investigated where speech could be spliced with minimal audible distortion and what contextual constraints were necessary to maintain in order to produce natural-sounding speech. The sub-word corpus is searched during synthesis using a Viterbi search which selects a sequence of units based on how well they individually match the input specification and on how well they sound as an ensemble. This concatenative speech synthesis system, ENVOICE, has been used in a conversational information retrieval system in two application domains to convert meaning representations into speech waveforms.
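The Viterbi search described in this abstract can be sketched generically as a dynamic program over candidate units. The sketch below is not the ENVOICE implementation: the function name select_units, the candidate lists, and both cost functions are hypothetical stand-ins for "how well a unit matches the input specification" (target cost) and "how well adjacent units sound together" (join cost).

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position so that total target + join cost is minimal.

    candidates[i] lists the units available for position i.
    target_cost(i, unit) scores how well a unit fits position i.
    join_cost(prev_unit, unit) scores how well two adjacent units concatenate.
    """
    # best[i][j] = (cheapest cost of any path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), prev))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example: units whose second character matches join without penalty.
toy_target = {"a1": 0, "a2": 1, "b1": 2, "b2": 0}
path, total = select_units(
    [["a1", "a2"], ["b1", "b2"]],
    target_cost=lambda i, u: toy_target[u],
    join_cost=lambda p, u: 0 if p[1] == u[1] else 5,
)
```

In the toy example the locally best first unit ("a1") is not chosen, because joining it to the best second unit is expensive; the search trades individual fit against how the units work as an ensemble.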
Integration of acoustic and visual speech signals using neural networks
- IEEE Communications Magazine
, 1989
Cited by 27 (0 self)
rely almost exclusively on the acoustic speech signal and, consequently, these systems often perform poorly in noisy environments [1]. Attempts to clean up the acoustic input have had limited success [2]. Another approach is to use other sources of speech information, such as visual speech signals. The perception of acoustic speech by humans can be affected by the visible speech signals [3-5]. Specifically, when the acoustic signal is degraded by noise, the visual signal can provide supplemental speech information that improves speech perception [6-8]. When no acoustic signal is available, as for the profoundly deaf, the visual signal alone can provide speech information through lip reading [9-11]. Here we answer two questions: Can the speech information conveyed by visual speech signals be extracted automatically? How can this information be combined with information from the acoustic signal to improve automat
Of packets and people: A User-centered Approach to Quality of Service
, 2001
Cited by 19 (0 self)
Multimedia communication has gained increasing attention, both from the application side and the network provider side. While resource provisioning for QoS support in packet-switched networks has led to the design and development of sophisticated QoS architectures, notably ATM, IntServ or DiffServ, research has not exactly been user or application-context centered. In the course of the evolution of QoS architectures, the integrated service network approach has lost momentum, and with it, the notion of QoS guarantees. Differentiation of QoS classes within the DiffServ framework is based on the definition of various per-hop behaviors. What is currently missing is a technique for specification and mapping of application and user QoS preferences onto evolving service profiles. In addition, adaptation of applications (and users) is becoming increasingly important in the face of dominating weak QoS-assurance paradigms, both in wireline and wireless environments. As a prerequisite, this paper...
A Modeling Framework for Speech Motor Development and Kinematic Articulator Control
, 1995
Cited by 17 (11 self)
This paper presents three hypotheses that are central to a computational model of speech production: (1) Sound targets take the form of regions, rather than points, in a planning reference frame. (2) The planning frame is more acoustic-like than the frames used in most recent models. (3) A direction-to-direction mapping transforms planned trajectories into articulator movements. These hypotheses are supported by experimental data and simulation results.
1. INTRODUCTION: REFERENCE FRAMES AND MAPPINGS
It is useful to think of speech production as the process of formulating a trajectory within a planning reference frame to pass through a sequence of targets, each corresponding to a different phoneme in the string being produced. This trajectory can then be mapped into a set of articulator movements that carry out the planned trajectory. The articulator movements are defined within an articulatory reference frame that relates closely to the musculature or primary movement degrees of free...
Digital Audio Compression
- Digital Technical Journal
, 1993
Cited by 16 (2 self)
Compared to most digital data types, with the exception of digital video, the data rates associated with uncompressed digital audio are substantial. Digital audio compression enables more efficient storage and transmission of audio data. The many forms of audio compression techniques offer a range of encoder and decoder complexity, compressed audio quality, and differing amounts of data compression. The μ-law transformation and ADPCM coder are simple approaches with low-complexity, low-compression, and medium audio quality algorithms. The MPEG/audio standard is a high-complexity, high-compression, and high audio quality algorithm. These techniques apply to general audio signals and are not specifically tuned for speech signals.
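As an illustration of the simplest technique named in this abstract, the standard continuous μ-law companding formula can be sketched as follows. This is the textbook formula applied to samples normalized to [-1, 1], not code from the article; the function names are hypothetical, and μ = 255 is assumed because it is the value used in 8-bit telephony.

```python
import math

MU = 255  # assumed value; the one used in 8-bit telephony


def mu_law_compress(x, mu=MU):
    """Logarithmically compand a sample x in [-1, 1] into [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress, recovering the original sample."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


quiet = mu_law_compress(0.01)   # small amplitudes are boosted before quantization
restored = mu_law_expand(quiet)
```

Because quiet samples are expanded toward larger values before quantization, a fixed number of bits covers low-level detail more finely than uniform quantization would.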
Extraction of Vocal-Tract System Characteristics from Speech Signals
Cited by 15 (2 self)
We propose methods to track natural variations in the characteristics of the vocal-tract system from speech signals. We are especially interested in the cases where these characteristics vary over time, as happens in dynamic sounds such as consonant-vowel transitions. We show that the selection of appropriate analysis segments is crucial in these methods and we propose a selection based on estimated instants of significant excitation. These instants are obtained by a method based on the average group-delay property of minimum-phase signals. In voiced speech they correspond to the instants of glottal closure. The vocal-tract system is characterized by its formant parameters, which are extracted from the analysis segments. Because the segments are always at the same relative position in each pitch period, in voiced speech the extracted formants are consistent across successive pitch periods. We demonstrate the results of the analysis for several difficult cases of speech signals. I. Int...
Pitch Extraction and Fundamental Frequency: History and Current Techniques
, 2003
Cited by 13 (0 self)
Pitch extraction (also called fundamental frequency estimation) has been a popular topic in many fields of research since the age of computers. Yet in the course of some 50 years of study, current techniques are still not to a desired level of accuracy and robustness. When
Estimation of Glottal Closure Instants in Voiced Speech using the DYPSA Algorithm
- IEEE Trans. Speech Audio Processing
, 2007
Cited by 12 (8 self)
Phase-Slope Algorithm (DYPSA) for automatic estimation of glottal closure instants (GCIs) in voiced speech. Accurate estimation of GCIs is an important tool that can be applied to a wide range of speech processing tasks including speech analysis, synthesis and coding. DYPSA is automatic and operates using the speech signal alone without the need for an EGG signal. The algorithm employs the phase-slope function and a novel phase-slope projection technique for estimating GCI candidates from the speech signal. The most likely candidates are then selected using a dynamic programming technique to minimize a cost function that we define. We review and evaluate three existing methods of GCI estimation and compare the new DYPSA algorithm to them. Results are presented for the APLAWD and SAM databases for which 95.7% and 93.1% of GCIs are correctly identified. Index Terms: Closed-phase, glottal closure, speech processing, speech segmentation.

