Results 1 - 10 of 98
Speech Analysis
, 1998
Cited by 134 (0 self)
Contents
1 Introduction
1.1 What is Speech Analysis?
1.1.1 So what is an acoustic vector?
1.2 Why Speech Analysis?
1.3 The problems of speech analysis
1.4 Standard references for this course
2 Background
2.1 Sampling theory
2.1.1 Sampling frequency
2.1.2 Sampling resolution
2.2 Linear filters
2.2.1 Finite Impulse Response filters
2.2.2 Infinite Impulse Response filters
2.3 The source filter model of speech
3 Filter bank Analysis
3.1 Spectrograms ...
Large Margin Hierarchical Classification
- In Proceedings of the Twenty-First International Conference on Machine Learning
Cited by 52 (7 self)
We present an algorithmic framework for supervised classification learning where the set of labels is organized in a predefined hierarchical structure. This structure is encoded by a rooted tree which induces a metric over the label set.
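As a small illustration of how a rooted tree can induce a metric over labels, the sketch below measures the distance between two labels as the number of edges on the path joining them. The label names and the `PARENT` table are hypothetical, and the edge-count distance is an assumption about the kind of metric meant, since the abstract does not spell it out.

```python
# Hypothetical label hierarchy, stored as child -> parent (the root maps to None).
PARENT = {
    "root": None,
    "vowel": "root",
    "consonant": "root",
    "front": "vowel",
    "back": "vowel",
    "stop": "consonant",
}


def path_to_root(label, parent):
    """Return the list of nodes from `label` up to the root, inclusive."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path


def tree_distance(u, v, parent):
    """Number of edges on the unique tree path between labels u and v."""
    steps_from_u = {node: i for i, node in enumerate(path_to_root(u, parent))}
    for j, node in enumerate(path_to_root(v, parent)):
        if node in steps_from_u:
            # The first shared ancestor met while climbing from v
            # is the lowest common ancestor of u and v.
            return steps_from_u[node] + j
    raise ValueError("labels are not in the same tree")
```

Under this distance, confusing two sibling labels costs less than confusing labels from different branches, which is the intuition behind penalizing errors according to the hierarchy.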
Analysis of Dynamic Spectra in Ferret Primary Auditory Cortex: I. Characteristics of single unit responses to moving ripple spectra
Cited by 48 (9 self)
this article relate most directly to a specific hypothesis on the nature of this representation - the so-called `ripple analysis model' (Shamma and Versnel 1995; Shamma et al. 1995; Versnel et al. 1995). Briefly, the model postulates that the acoustic spectrum is encoded in AI at varying degrees of resolution by the activity of units with a range of response area bandwidths, asymmetries and best frequencies (BFs). Furthermore, it is assumed that this multi-scale decomposition can be characterized to a very good approximation as a linear process. Thus, if a complex spectral profile is decomposed into a weighted sum of simpler spectra, then linearity implies that responses to the complex profile can be predicted from a weighted superposition of the responses to the simpler spectra. Note further that if the basic set of simple spectra is taken to be sinusoidally modulated envelopes or ripples, then the decomposition of an arbitrary profile into ripples with different amplitudes, phases, and densities corresponds simply to a Fourier decomposition of the spectral profile. The above postulates were extensively investigated and validated for stationary spectra in the ferret AI (Shamma et al. 1995; Shamma and Versnel 1995; Versnel et al. 1995; in cat: Schreiner and Calhoun 1995). For instance, it was shown that an AI unit could be fully characterized by its responses to ripples with a range of ripple frequencies and ripple phases, that is by its ripple transfer function. It was also shown that inverse Fourier transforming this function generates a response field (RF) - a function that is analogous to the response area of the unit obtained with single tones. The RFs of AI units exhibited a range of bandwidths and asymmetries, as required by the multi-scal...
Audio content analysis for online audiovisual data segmentation and classification
- 62 IEEE SIGNAL PROCESSING MAGAZINE MARCH 2004
, 2001
Cited by 46 (2 self)
While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of these audio features, is proposed to segment and classify audio signals. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification. Index Terms: Audio analysis, audio indexing, audio segmentation, audiovisual content parsing, information filtering and retrieval, multimedia database management.
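Two of the features named in this abstract, the energy function and the average zero-crossing rate, are cheap to compute per frame. The following is a minimal sketch of the textbook definitions, not the authors' implementation; the frame length, hop size, sampling rate and toy signal are all illustrative assumptions.

```python
import math


def frames(signal, frame_len, hop):
    """Split samples into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Toy input (assumed values): 0.1 s of a 100 Hz tone, then 0.1 s of silence, at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr // 10)]
silence = [0.0] * (sr // 10)
signal = tone + silence

# One (energy, zero-crossing rate) pair per frame.
features = [(short_time_energy(f), zero_crossing_rate(f))
            for f in frames(signal, frame_len=200, hop=100)]
```

Frames covering the tone show clearly non-zero energy and a low zero-crossing rate, while the silent frames have zero energy; a rule-based classifier of the kind the abstract describes would threshold time-varying curves like these.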
A theoretical investigation of reference frames for the planning of speech movements
- Psychological Review
, 1998
Cited by 39 (21 self)
Running title: Speech reference frames. Does the speech motor control system utilize invariant vocal tract shape targets of any kind when producing phonemes? We present a four-part theoretical treatment favoring models whose only invariant targets are auditory perceptual targets over models that posit invariant constriction targets. When combined with earlier theoretical and experimental results (Guenther, 1995a,b; Perkell et al., 1993; Savariaux et al., 1995a,b), our hypothesis is that, for vowels and semi-vowels at least, the only invariant targets of the speech production process are multidimensional regions in auditory perceptual space. These auditory perceptual target regions are hypothesized to arise during development as an emergent property of neural map formation in the auditory system. Furthermore, speech movements are planned as trajectories in auditory perceptual space. These trajectories are then mapped into articulator movements through a neural mapping that allows motor equivalent variability in constriction locations and degrees when needed, but maintains approximate constriction invariance for a given sound in most instances. These hypotheses are illustrated and substantiated using computer simulations of the DIVA model of speech acquisition and production. Finally, we pose several difficult challenges to proponents of constriction theories based on this theoretical treatment.
Generalized Stochastic Subdivision
- ACM Transactions on Graphics
, 1987
Cited by 34 (2 self)
This paper describes the basis for techniques such as stochastic subdivision in the theory of random processes and estimation theory. The popular stochastic subdivision construction is then generalized to provide control of the autocorrelation and spectral properties of the synthesized random functions. The generalized construction is suitable for generating a variety of perceptually distinct high-quality random functions, including those with non-fractal spectra and directional or oscillatory characteristics. It is argued that a spectral modeling approach provides a more powerful and somewhat more intuitive perceptual characterization of random processes than does the fractal model. Synthetic textures and terrains are presented as a means of visually evaluating the generalized subdivision technique. Categories and Subject Descriptors: I.3.3 [Computer Graphics]: Picture/Image Generation; I.3.7 [Computer Graphics]: Three Dimensional Graphics and Realism.
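For orientation, the basic stochastic subdivision construction that this paper starts from can be sketched in one dimension as midpoint displacement. This is the simple, well-known variant only, not the generalized construction the abstract describes; the `roughness` and `seed` parameters and the flat starting segment are assumptions made for the example.

```python
import random


def midpoint_displacement(levels, roughness=0.5, seed=0):
    """1-D stochastic subdivision: repeatedly insert randomly displaced midpoints.

    `roughness` scales the noise amplitude from one level to the next
    (an illustrative parameter, not taken from the paper).
    """
    rng = random.Random(seed)
    heights = [0.0, 0.0]  # endpoints of the initial segment
    scale = 1.0
    for _ in range(levels):
        refined = []
        for left, right in zip(heights, heights[1:]):
            # New point = average of its two neighbours plus Gaussian noise.
            mid = 0.5 * (left + right) + rng.gauss(0.0, scale)
            refined.extend([left, mid])
        refined.append(heights[-1])
        heights = refined
        scale *= roughness  # finer levels receive smaller displacements
    return heights


profile = midpoint_displacement(levels=4)
```

Each level roughly doubles the resolution, so four levels turn the two endpoints into 17 samples. In this basic form the only control is how fast the noise amplitude shrinks, which is the limitation the abstract's spectral viewpoint is meant to address.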
Natural-Sounding Speech Synthesis Using Variable-Length Units
, 1998
Cited by 33 (4 self)
The goal of this work was to develop a speech synthesis system which concatenates variable-length units to create natural-sounding speech. Our initial work in this area showed that by careful design of system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with word- and phrase-level concatenation. In order to extend the flexibility of this framework, we focused on the problem of generating novel words from a corpus of sub-word units. The design of the sub-word units was motivated by perceptual studies that investigated where speech could be spliced with minimal audible distortion and what contextual constraints had to be maintained in order to produce natural-sounding speech. The sub-word corpus is searched during synthesis using a Viterbi search which selects a sequence of units based on how well they individually match the input specification and on how well they sound as an ensemble. This concatenative speech synthesis system, ENVOICE, has been used in a conversational information retrieval system in two application domains to convert meaning representations into speech waveforms.
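The Viterbi search described in this abstract balances two costs: how well each unit matches the input specification, and how well consecutive units sound together. A generic sketch of such a search is given below. The function names, the cost functions and the toy candidate units are hypothetical, since the abstract does not say how ENVOICE defines its costs.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """Pick one unit per position, minimizing total target cost plus join cost.

    candidates[t] lists the units available at position t.
    target_cost(t, unit): how poorly `unit` matches the specification at t.
    join_cost(prev, unit): how poorly `prev` followed by `unit` sounds.
    """
    # best[t][i] = (cost of the cheapest path ending in candidates[t][i], back-pointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for t in range(1, len(candidates)):
        column = []
        for u in candidates[t]:
            cost, back = min(
                (best[t - 1][i][0] + join_cost(p, u), i)
                for i, p in enumerate(candidates[t - 1])
            )
            column.append((cost + target_cost(t, u), back))
        best.append(column)

    # Trace back from the cheapest final state.
    i = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][i][0]
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][i])
        i = best[t][i][1]
    return path[::-1], total


# Toy units (assumed): (phone, word the unit was recorded in).
candidates = [
    [("k", "cat"), ("k", "kit")],
    [("ae", "cat"), ("ae", "bat")],
    [("t", "bat"), ("t", "cat")],
]


def target(t, unit):
    return 0.0  # every candidate matches the specification equally well here


def join(prev, unit):
    return 0.0 if prev[1] == unit[1] else 1.0  # splicing across source words costs 1


units, cost = viterbi_select(candidates, target, join)
```

With these toy costs the search prefers the three units cut from the same recording of "cat", because that sequence needs no splices between different source words.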
Social Signal Processing: Survey of an Emerging Domain
, 2008
Cited by 32 (10 self)
The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence (the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement) in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and the like, design and development of automated systems for Social Signal Processing (SSP) are rather difficult. This paper surveys past efforts to solve these problems by computer, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially-aware computing.
Statistical Trajectory Models for Phonetic Recognition
, 1994
Cited by 27 (3 self)
The main goal of this work is to develop an alternative methodology for acoustic-phonetic modelling of speech sounds. The approach utilizes a segment-based framework to capture the dynamical behavior and statistical dependencies of the acoustic attributes used to represent the speech waveform. Temporal behavior is modelled explicitly by creating dynamic tracks of the acoustic attributes used to represent the waveform, and by estimating the spatio-temporal correlation structure of the resulting errors. The tracks serve as templates from which synthetic segments of the acoustic attributes are generated. Scoring of a hypothesized phonetic segment is then based on the error between the measured acoustic attributes and the synthetic segments generated for each phonetic model.
Melody description and extraction in the context of music content processing
- Journal of New Music Research
, 2003
Cited by 26 (5 self)
A huge amount of audio data is accessible to everyone through on-line or off-line information services, and it is necessary to develop techniques to automatically describe and deal with this data in a meaningful way. In the particular context of music content processing it is important to take into account the melodic aspects of the sound. The goal of this article is to review the different techniques proposed for melodic description and extraction. Some ideas around the concept of melody are first presented. Then, an overview of the different ways of describing melody is given. As a third step, the methods proposed for melody extraction are analyzed, including pitch detection algorithms. Finally, techniques for melodic pattern induction and matching are also studied, and some useful melodic transformations are reviewed.

