Survey of the State of the Art in Human Language Technology
, 1995
Cited by 47 (0 self)
Contents
1 Spoken Language Input (Ron Cole & Victor Zue, chapter editors)
1.1 Overview (Victor Zue & Ron Cole)
1.2 Speech Recognition (Victor Zue, Ron Cole, & Wayne Ward)
1.3 Signal Representation (Melvyn J. Hunt)
1.4 Robust Speech Recognition (Richard M. Stern)
1.5 HMM Methods in Speech Recognition (Renato De Mori & Fabio Brugnara)
1.6 Language Representation (Salim Roukos)
1.7 Speaker Recognition ...
Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness
, 2003
Cited by 46 (5 self)
A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, where the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism in estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures ranging from one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise, and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
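The iterative estimate-and-subtract loop described in this abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the estimation stage here is a plain weighted harmonic summation that assumes ideal harmonicity, and the subtraction stage only loosely imitates the spectral smoothness idea by capping each harmonic amplitude at a three-point moving average across harmonics. The function names, the candidate grid, the number of harmonics, and the bin half-widths are all assumptions made for illustration.

```python
import numpy as np

def harmonic_bins(f0, n_harm, df, n_bins, half_width):
    """Bin ranges (start, stop) around each harmonic of f0 that lies below Nyquist."""
    ranges = []
    for h in range(1, n_harm + 1):
        centre = int(round(h * f0 / df))
        if centre >= n_bins:
            break
        ranges.append((max(centre - half_width, 0),
                       min(centre + half_width + 1, n_bins)))
    return ranges

def estimate_f0s(mag, df, n_sounds, candidates, n_harm=8):
    """Toy iterative multi-F0 estimation on one magnitude spectrum."""
    residual = mag.copy()
    n_bins = len(residual)
    found = []
    for _ in range(n_sounds):
        # Estimation stage (assumed stand-in): harmonic summation with 1/h weights.
        scores = [
            sum(residual[a:b].max() / (h + 1)
                for h, (a, b) in enumerate(
                    harmonic_bins(f0, n_harm, df, n_bins, 1)))
            for f0 in candidates
        ]
        f0 = float(candidates[int(np.argmax(scores))])
        found.append(f0)

        # Subtraction stage: smooth the harmonic amplitudes across harmonic
        # index and remove at most the smoothed amount from each partial.
        ranges = harmonic_bins(f0, n_harm, df, n_bins, 2)
        amps = np.array([residual[a:b].max() for a, b in ranges])
        smooth = np.convolve(np.pad(amps, 1, mode="edge"),
                             np.ones(3) / 3, mode="valid")
        removed = np.minimum(amps, smooth)
        for (a, b), amp, rem in zip(ranges, amps, removed):
            if amp > 0:
                residual[a:b] *= 1.0 - rem / amp
    return found

# Synthetic mixture of two harmonic tones (200 Hz and 310 Hz), one frame.
sr, n = 8000, 4096
t = np.arange(n) / sr

def tone(f0, gain):
    return gain * sum(np.sin(2 * np.pi * h * f0 * t) / h for h in range(1, 9))

mix = tone(200.0, 1.0) + tone(310.0, 0.7)
mag = np.abs(np.fft.rfft(mix * np.hanning(n)))
estimates = estimate_f0s(mag, sr / n, 2, np.arange(80.0, 450.0, 1.0))
```

Capping the removed amplitude at the smoothed value means a partial that sticks out above its neighbours, for example because another sound shares that frequency, is only partly removed and remains available to later iterations.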
Lexical Modeling in a Speaker Independent Speech Understanding System
, 1993
Cited by 39 (8 self)
Over the past 40 years, significant progress has been made in the fields of speech recognition and speech understanding. Current state-of-the-art speech recognition systems are capable of achieving word-level accuracies of 90% to 95% on continuous speech recognition tasks using 5000 words. Even larger systems, capable of recognizing 20,000 words, are just now being developed. Speech understanding systems have recently been developed that perform fairly well within a restricted domain. While the size and performance of modern speech recognition and understanding systems are impressive, it is evident to anyone who has used these systems that the technology is primitive compared to our own human ability to understand speech. Some of the difficulties hampering progress in the fields of speech recognition and understanding stem from the many sources of variation that occur during human communication. One of these sources of variation is the different ways that words can be pronounced. There are many causes of pronunciation variation, such as: the phonetic environment in which the word occurs, the dialect of the speaker,
The challenge of spoken language systems: Research directions for the nineties
- IEEE Transactions on Speech and Audio Processing
, 1995
Cited by 34 (5 self)
This article is based on a February 1992 workshop sponsored by the National Science
Speech Enhancement Based On Temporal Processing
- in Proc. Int. Conf. on Acoust., Speech, Signal Processing
, 1995
Cited by 15 (7 self)
Finite Impulse Response (FIR) Wiener-like filters are applied to time trajectories of the cubic-root compressed short-term power spectrum of noisy speech recorded over cellular telephone communications. Informal listening tests indicate that the technique brings a noticeable improvement to the quality of processed noisy speech while not causing any significant degradation to clean speech. Alternative filter structures are being investigated as well as other potential applications in cellular channel compensation and narrowband to wideband speech mapping.
1. INTRODUCTION
The need for enhancement of noisy speech in telecommunications increases with the spread of cellular telephony. Calls may originate from rather noisy environments such as moving cars or crowded public places. The corrupting noise is often relatively stationary, or at least changing rather slowly. This leads us to investigate the use of RelAtive SpecTrAl (RASTA) processing of speech (Hermansky and Morgan, et al., 1991, 1994, ...
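The processing chain named in this abstract (cubic-root compression of the short-term power spectrum, FIR filtering of each band's time trajectory, then expansion back to the power domain) can be sketched as follows. The function name, the array layout (frames by bands), the half-wave rectification before cubing, and the filter coefficients are assumptions for illustration. In particular, `PLACEHOLDER_FIR` is a simple zero-DC-gain band-pass kernel, not the Wiener-like filters the paper derives.

```python
import numpy as np

# Hypothetical coefficients with zero DC gain; the paper's Wiener-like
# filters are designed from data and are NOT reproduced here.
PLACEHOLDER_FIR = np.array([0.2, 0.1, 0.0, -0.1, -0.2])

def filter_trajectories(power_spec, fir):
    """Filter the time trajectory of every band in the cubic-root domain.

    power_spec: array of shape (n_frames, n_bands), short-term power spectrum.
    fir:        1-D array of causal FIR coefficients applied along time.
    """
    power_spec = np.asarray(power_spec, dtype=float)
    fir = np.asarray(fir, dtype=float)
    compressed = np.cbrt(power_spec)          # cubic-root compression
    n_frames = compressed.shape[0]
    filtered = np.apply_along_axis(
        lambda traj: np.convolve(traj, fir)[:n_frames], 0, compressed)
    # Negative values have no meaning as power; clip before expanding.
    return np.maximum(filtered, 0.0) ** 3

# A perfectly stationary "noise floor" is removed by a zero-DC-gain filter
# once the filter has filled with samples.
stationary = np.full((50, 4), 2.0)
suppressed = filter_trajectories(stationary, PLACEHOLDER_FIR)

# A single-tap filter [1.0] leaves the power spectrum unchanged.
rng = np.random.default_rng(0)
speech_like = rng.uniform(0.1, 5.0, size=(50, 4))
unchanged = filter_trajectories(speech_like, [1.0])
```

Filtering along time in a compressed domain is what lets slowly varying or stationary components be attenuated separately from the faster modulations that carry speech.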
Spoken-Language Access to Multimedia (SLAM): Masters Thesis
Cited by 11 (0 self)
Introduction
1.1 The problem
The World-Wide Web (WWW) (CERN, 1994) is a network-based standard for hypermedia documents that combines documents prepared in HyperText Markup Language (HTML) (NCSA, 1994a) with an extensible set of multimedia resources. The most popular WWW browser with available source code is Mosaic (NCSA, 1994b), a cross-platform program developed and distributed by NCSA, now running in X11-based Unix, Macintosh and PC-Windows environments. As a hypermedia viewer, Mosaic combines the flexibility and navigability of hypermedia with multimedia outputs such as audio and GIF images. The World-Wide Web, especially as viewed with Mosaic, is phenomenally popular. By mid-Spring of 1994, Internet traffic was doubling about every six months. Of this growth, the World-Wide Web's proportional usage was doubling approximately every four months. In absolute volume of traffic, use of the WWW was doubling every two and a half months (Wallach, 1994). Much of the popu
Robust Feature-Estimation and Objective Quality Assessment for Noisy Speech Recognition using the Credit Card Corpus
, 1994
Cited by 7 (2 self)
It is well known that the introduction of acoustic background distortion into speech causes recognition algorithms to fail. In order to improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm, which was previously formulated for speech enhancement, is considered and shown to produce improved feature characterization in a variety of actual noise conditions such as computer fan, large crowd, and voice communications channel noise. In addition, an objective-measure-based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered are (i) NIST SNR, (ii) Itakura-Saito log-likelihood, (iii) log-area-ratio, and (iv) the weighted-spectral slope measure. A continuous-distribution, monophone-based, hidden Markov model recognition algorithm is used for objective-measure-based MAP estimator analysis and reco...
Development of an Approach to Language Identification Based on Language-dependent Phone Recognition
, 1995
Cited by 4 (1 self)
1 Introduction
1.1 Background
1.1.1 Nature of the Problem
1.1.2 The Difficulties: Challenges to LID
1.2 Related Work
1.2.1 Early Work: 1973-1992
1.2.2 Current Activities: 1992-present
1.2.3 The Problems
1.3 An Approach to Language Identification based on language-dependent phone recognition
1.3.1 Finding a Good Modeling Unit
1.3.2 The Baseline System
1.3.3 Contributions: Methods Proposed to Improve the Baseline...
Desired Characteristics Of Modulation Spectrum For Robust Automatic Speech Recognition
- In: Proc. ICASSP’98
, 1998
Cited by 4 (0 self)
We report on the effect of band-pass filtering of the time trajectories of spectral envelopes on speech recognition. Several types of filter (linear-phase FIR, DCT, DFT) are studied. Results indicate the relative importance of different components of the modulation spectrum of speech for ASR. General conclusions are: (1) most of the useful linguistic information is in modulation frequency components in the range between 1 and 16 Hz, with the dominant component at around 4 Hz, (2) it is important to preserve the phase information in the modulation frequency domain, (3) features that include components at around 4 Hz in the modulation spectrum outperform the conventional delta features, (4) features that represent several modulation frequency bands with appropriate center frequency and bandwidth increase recognition performance.
1. INTRODUCTION
Temporal processing of time trajectories in the logarithmic spectrum domain is becoming a common procedure in automatic speech recogni...
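As a rough illustration of DFT-based band-pass filtering of a spectral-envelope trajectory, the sketch below keeps only modulation frequencies between 1 and 16 Hz (the range singled out in conclusion 1) by masking DFT bins. The function name and the 100 Hz frame rate (10 ms frame step) are assumptions, and the paper's actual filter designs are not reproduced. A real-valued 0/1 mask leaves the phase of the retained components untouched, in line with conclusion 2.

```python
import numpy as np

def bandpass_modulation(trajectory, frame_rate, low_hz=1.0, high_hz=16.0):
    """Keep only modulation components in [low_hz, high_hz] of one trajectory.

    trajectory: 1-D array, e.g. the log energy of one spectral band per frame.
    frame_rate: frames per second (assumed 100 Hz in the demo below).
    """
    trajectory = np.asarray(trajectory, dtype=float)
    spectrum = np.fft.rfft(trajectory)
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / frame_rate)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    # The mask is real, so magnitudes are selected and phases are preserved.
    return np.fft.irfft(spectrum * keep, n=len(trajectory))

# Demo: a constant offset, a 4 Hz modulation, and a 30 Hz modulation.
frame_rate = 100.0
t = np.arange(200) / frame_rate            # 2 seconds of frames
slow = np.sin(2 * np.pi * 4.0 * t)         # inside the 1-16 Hz band
fast = 0.5 * np.sin(2 * np.pi * 30.0 * t)  # outside the band
trajectory = 5.0 + slow + fast
filtered = bandpass_modulation(trajectory, frame_rate)
```

In this demo the offset and the 30 Hz component fall outside the pass band, so only the 4 Hz component survives. Both test frequencies sit exactly on DFT bins; with off-bin components, a hard mask like this would show leakage.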
Computations and Evaluations of an Optimal Feature-set for an HMM-based Recognizer
, 1996
Cited by 3 (1 self)
The benefits of a speech recognition machine would be many, resulting in the improvement of the quality of life for people. The design of a speech recognition system can be divided into two parts, commonly known as the front-end and back-end. The front-end deals with the conversion of the analog speech signal into features for classification. This thesis investigates optimal feature-sets for speech recognition. The objectives for an optimal feature-set are improved recognition performance, noise robustness, talker insensitivity and efficiency. Three problems that make it difficult to find optimal features are: 1) the amount of resources (time and computations) required to evaluate the performance of a feature-set, 2) the size of the feature space, and 3) the dependence of features upon some words in t...

