Results 1 - 10
of
65
Speech Analysis
, 1998
"... Contents 1 Introduction 4 1.1 What is Speech Analysis? . . . . . . . . . . . . . . . . . . . . 4 1.1.1 So what is an acoustic vector? . . . . . . . . . . . . . . 4 1.2 Why Speech Analysis? . . . . . . . . . . . . . . . . . . . . . . 4 1.3 The problems of speech analysis . . . . . . . . . . . . . . ..."
Abstract
-
Cited by 134 (0 self)
- Add to MetaCart
Contents 1 Introduction 4 1.1 What is Speech Analysis? . . . . . . . . . . . . . . . . . . . . 4 1.1.1 So what is an acoustic vector? . . . . . . . . . . . . . . 4 1.2 Why Speech Analysis? . . . . . . . . . . . . . . . . . . . . . . 4 1.3 The problems of speech analysis . . . . . . . . . . . . . . . . . 7 1.4 Standard references for this course . . . . . . . . . . . . . . . 7 2 Background 7 2.1 Sampling theory . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Sampling frequency . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Sampling resolution . . . . . . . . . . . . . . . . . . . . 8 2.2 Linear filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Finite Impulse Response filters . . . . . . . . . . . . . 8 2.2.2 Infinite Impulse Response filters . . . . . . . . . . . . . 11 2.3 The source filter model of speech . . . . . . . . . . . . . . . . 12 3 Filter bank Analysis 12 3.1 Spectrograms . . . . . . . . .
Towards the Digital Music Library: Tune Retrieval From . . .
, 1996
"... Music is traditionally retrieved by title, composer or subject classification. It is possible, with current technology, to retrieve music from a database on the basis of a few notes sung or hummed into a microphone. This paper describes the implementation of such a system, and discusses several issu ..."
Abstract
-
Cited by 99 (11 self)
- Add to MetaCart
Music is traditionally retrieved by title, composer or subject classification. It is possible, with current technology, to retrieve music from a database on the basis of a few notes sung or hummed into a microphone. This paper describes the implementation of such a system, and discusses several issues pertaining to music retrieval. We first describe an interface that transcribes acoustic input into standard music notation. We then analyze string matching requirements for ranked retrieval of music and present the results of an experiment which tests how accurately people sing well known melodies. The performance of several string matching criteria are analyzed using two folk song databases. Finally, we describe a prototype system which has been developed for retrieval of tunes from acoustic input.
Signal modeling techniques in speech recognition
- PROCEEDINGS OF THE IEEE
, 1993
"... We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similariry transform techniques, often used to norm ..."
Abstract
-
Cited by 99 (5 self)
- Add to MetaCart
We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similariry transform techniques, often used to normalize and decor-relate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal’s spectrum can be estimated in a closed-loop manner. In this paper, we review the signal processing components of these algorithms. These al-gorithms are presented as part of a unified view of the signal parameterization problem in which there are three major tasks: measurement, transformation, and statistical modeling. This paper is by no means a comprehensive survey of all possible techniques of signal modeling in speech recognition. There are far too many algorithms in use today to make an exhaustive survey feasible (and cohesive). Instead, this paper is meant to serve as a tutorial on signal processing in state-of-the-art speech recognition systems and to review those techniques most commonly used. In keeping with this goal, a complete mathematical description of each algorithm has been included in the paper.
Audio Feature Extraction And Analysis For Scene Segmentation And Classification
- Journal of VLSI Signal Processing System
, 1998
"... Understanding of the scene content of a video sequence is very important for content-based indexing and retrieval of multimedia databases. Research in this area in the past several years has focused on the use of speech recognition and image analysis techniques. As a complimentary effort to the prio ..."
Abstract
-
Cited by 79 (6 self)
- Add to MetaCart
Understanding of the scene content of a video sequence is very important for content-based indexing and retrieval of multimedia databases. Research in this area in the past several years has focused on the use of speech recognition and image analysis techniques. As a complimentary effort to the prior work, we have focused on using the associated audio information (mainly the nonspeech portion) for video scene analysis. As an example, we consider the problem of discriminating five types of TV programs, namely commercials, basketball games, football games, news reports, and weather forecasts. A set of low-level audio features are proposed for characterizing semantic contents of short audio clips. The linear separability of different classes under the proposed feature space is examined using a clustering analysis. The effective features are identified by evaluating the intracluster and intercluster scattering matrices of the feature space. Using these features, a neural net classifier was...
Multimedia Content Analysis Using Both Audio and Visual Cues
, 2000
"... : Including all the scenes/shots that contain special events may generate too long an abstract. Also, simply staggering them together may not be visually or aurally appealing. In the MoCA project, it was determined that only 50% of the abstract should contain special events. The remaining part shoul ..."
Abstract
-
Cited by 70 (0 self)
- Add to MetaCart
: Including all the scenes/shots that contain special events may generate too long an abstract. Also, simply staggering them together may not be visually or aurally appealing. In the MoCA project, it was determined that only 50% of the abstract should contain special events. The remaining part should be left for filler clips. The special event clips to be included are chosen uniformly and randomly from different types of events. The selection of a short clip from a scene is subject to some additional criteria, such as the amount of action and the similarity to the overall color composition of the movie. Closeness to the desired AV characteristics of certain scene types are also considered. The filler clips are chosen so that they do not overlap with the content of chosen special event clips, to ensure a good coverage of all parts of a movie. MPEG-7 Standard for Multimedia Content Description Interface MPEG-7 is an on-going standardization effort for content description of AV documen...
Analysis and Synthesis of Intonation using the Tilt Model
- Journal of the Acoustical Society of America
"... This paper introduces the tilt intonational model and describes how this model can be used to automatically analyse and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterised by continuo ..."
Abstract
-
Cited by 68 (3 self)
- Add to MetaCart
This paper introduces the tilt intonational model and describes how this model can be used to automatically analyse and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterised by continuous parameters representing amplitude, duration and tilt (a measure of the shape of the event). The paper describes a event detector, in effect an intonational recognition system, which produces a transcription of an utterance's intonation. The features and parameters of the event detector are discussed and performance figures are shown on a variety of read and spontaneous speaker independent conversational speech databases. Given the event locations, algorithms are described which produce an automatic analysis of each event in terms of the Tilt parameters. Synthesis algorithms are also presented which generate F0 contours from Tilt representations. The accuracy of these is shown by comparing...
A Multi-Pitch Tracking Algorithm for Noisy Speech
- IEEE Transactions on Speech and Audio Processing
, 2002
"... We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency channels, and a hidden Markov model (HMM) for forming continuous p ..."
Abstract
-
Cited by 49 (11 self)
- Add to MetaCart
We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency channels, and a hidden Markov model (HMM) for forming continuous pitch tracks, and as a result, our algorithm can reliably track single and double pitch tracks in a noisy environment. The proposed algorithm is evaluated on a database of speech utterances mixed with various interferences and the results show that our algorithm outperforms existing algorithms significantly.
Multiple fundamental frequency estimation by summing harmonic amplitudes
- in ISMIR
, 2006
"... This paper proposes a conceptually simple and computationally efficient fundamental frequency (F0) estimator for polyphonic music signals. The studied class of estimators calculate the salience, or strength, of a F0 candidate as a weighted sum of the amplitudes of its harmonic partials. A mapping fr ..."
Abstract
-
Cited by 32 (5 self)
- Add to MetaCart
This paper proposes a conceptually simple and computationally efficient fundamental frequency (F0) estimator for polyphonic music signals. The studied class of estimators calculate the salience, or strength, of a F0 candidate as a weighted sum of the amplitudes of its harmonic partials. A mapping from the Fourier spectrum to a “F0 salience spectrum” is found by optimization using generated training material. Based on the resulting function, three different estimators are proposed: a “direct ” method, an iterative estimation and cancellation method, and a method that estimates multiple F0s jointly. The latter two performed as well as a considerably more complex reference method. The number
The Rise/Fall/Connection Model of Intonation
, 1994
"... This paper describes a new model of intonation for English. The paper proposes that intonation can be described using a sequence of rise, fall and connection elements. Pitch accents and boundary rises are described using rise and fall elements, and connection elements are used to describe everything ..."
Abstract
-
Cited by 30 (6 self)
- Add to MetaCart
This paper describes a new model of intonation for English. The paper proposes that intonation can be described using a sequence of rise, fall and connection elements. Pitch accents and boundary rises are described using rise and fall elements, and connection elements are used to describe everything else. Equations can be used to synthesize fundamental frequency (F 0 ) contours from these elements. An automatic labelling system is described which can derive a rise/fall/connection description from any utterance without using prior knowledge or top-down processing. Synthesis and analysis experiments are described using utterances from six speakers of various English accents. An analysis/resynthesis experiment is described which shows that the contours produced by the model are similar to within 3.6 to 7.3 Hz of the originals. An assessment of the automatic labeller shows 72% to 92% agreement between automatic and hand labels. The paper concludes with a comparison between this model and o...

