Fast Transcription of Unstructured Audio Recordings
Cited by 14 (10 self)
We introduce a new method for human-machine collaborative speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing algorithms robustly detect speech in audio recordings and split it into short, easy-to-transcribe segments. Sequences of speech segments are loaded into a transcription interface that enables a human transcriber to simply listen and type, obviating the need to manually find and segment speech or explicitly control audio playback. As a result, playback stays synchronized to the transcriber’s speed of transcription. In evaluations using naturalistic audio recordings made in everyday home situations, the new method is up to 6 times faster than other popular transcription tools while preserving transcription quality. Index Terms: speech transcription, speech corpora
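The detect-and-split step described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how speech is detected, so this sketch swaps in a simple per-frame energy threshold as a stand-in; the function name `speech_segments` and the parameters `threshold` and `max_frames` are hypothetical and are not taken from the paper.

```python
def speech_segments(frame_energies, threshold, max_frames):
    """Return half-open (start, end) frame ranges judged to contain speech.

    Assumption: a frame counts as speech when its energy is at or above
    `threshold`. Runs longer than `max_frames` are split so that every
    segment stays short enough to transcribe in one pass.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            elif i - start == max_frames:
                # The current run reached the length limit: close it
                # and begin a new segment at this frame.
                segments.append((start, i))
                start = i
        elif start is not None:
            # Silence ends the current run.
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments


# Hypothetical frame energies: a short burst, a pause, then a long burst.
energies = [0, 5, 6, 0, 0, 7, 7, 7, 7, 7, 0]
segments = speech_segments(energies, threshold=1, max_frames=3)
```

In an interface like the one the abstract describes, each returned range would be played to the transcriber in turn, so no manual seeking or segment marking is required.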
TotalRecall: Visualization and Semi-Automatic Annotation of Very Large Audio-Visual Corpora
Cited by 11 (2 self)
We introduce a system for visualizing, annotating, and analyzing very large collections of longitudinal audio and video recordings. The system, TotalRecall, is designed to address the requirements of projects like the Human Speechome Project [18], for which more than 100,000 hours of multitrack audio and video have been collected over a twenty-two-month period. Our goal in this project is to transcribe speech in over 10,000 hours of audio recordings, and to annotate the position and head orientation of multiple people in the 10,000 hours of corresponding video. Higher-level behavioral analysis of the corpus will be based on these and other annotations. To cope efficiently with this huge corpus, we are developing semi-automatic data coding methods that are integrated into TotalRecall. Ultimately, this system and the underlying methodology may enable new forms of multimodal behavioral analysis grounded in ultradense longitudinal data.
Using context and sensory data to learn first and second person pronouns
- in Human-Robot Interaction 2006
, 2006
Cited by 6 (5 self)
We present a method of grounded word learning that can learn the meanings of first and second person pronouns. The model selectively associates new words with agents in the environment by using already understood words to establish context. The method uses chi-square tests to find significant associations between the new words and attributes of the relevant agents. We show that this model can learn from a transcript of a parent-child interaction that “I” refers to the person who is speaking. With the additional information that questions about wants refer to the person being asked about them, the system learns that “you” refers to the person being addressed. We show that an incorrect assumption about the subject of “want” questions can lead to pronoun reversal, a linguistic error most commonly found in autistic and congenitally blind children. Finally, we present results from a physical implementation on a robot that runs in real time. Categories and Subject Descriptors I.2.7 [Natural Language Processing]: [speech recognition
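The chi-square association test mentioned in the abstract above can be illustrated with a minimal sketch. It computes the standard Pearson statistic for a 2x2 table of word presence against an agent attribute. The counts below are invented for illustration only, and the names `chi_square_2x2` and `is_significant` are hypothetical; the paper's actual features and thresholds are not given in the abstract.

```python
# Critical value of the chi-square distribution with 1 degree of
# freedom at the p = 0.05 significance level.
CRITICAL_1DF_P05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: word present, attribute holds    b: word present, attribute absent
    c: word absent, attribute holds     d: word absent, attribute absent
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def is_significant(a, b, c, d):
    """True when the association is significant at p = 0.05."""
    return chi_square_2x2(a, b, c, d) > CRITICAL_1DF_P05


# Invented counts: utterances with and without the word "I", crossed with
# whether the referent is the current speaker.
associated = is_significant(18, 2, 5, 25)
```

A learner along these lines would keep only the word-attribute pairs that pass the test, which is how a word such as “I” could come to be tied to the attribute "is the speaker".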
Strengths and weaknesses of software architectures for the rapid creation of tangible and multimodal interfaces
Isolated Word Recognition Using a Liquid State Machine
- In ESANN’05, European Symposium on Artificial Neural Networks
, 2005
Cited by 4 (3 self)
An implementation of the recently proposed concept of the Liquid State Machine using a Spiking Neural Network (SNN) is trained to perform isolated word recognition. We investigate two different speech front ends and different ways of coding the inputs into spike trains. The robustness against noise added to the speech is also briefly investigated. It turns out that a biologically realistic configuration of the LSM gives the best result, and that it performs very well for the task of speech recognition.
Simulating Spoken Dialogue With a Focus on Realistic Turn-Taking
- To appear in Proceedings of the 13th ESSLLI Summerschool
, 2008
Cited by 4 (2 self)
We present a system for testing turn-taking strategies in a simulation environment, in which artificial dialogue participants exchange audio streams in real time, unlike earlier turn-taking simulations, which interchanged unambiguous symbolic messages. Dialogue participants autonomously determine their turn-taking behaviour, based on their analysis of the incoming audio. We use machine-learning methods to classify the continuous audio signal into symbolic turn-taking states. We experiment with various rule sets and show how simple, local management rules can create realistic behavioural patterns.
Interacting with our Environment through Sentient Mobile Phones
- 2nd Int. Workshop on Ubiquitous Computing (IWUC 2005)
, 2005
Cited by 3 (2 self)
The latest mobile phones offer more multimedia features, better communication capabilities (Bluetooth, GPRS, 3G) and are far more easily programmable (extensible) than ever before. So far, the “killer apps” to exploit these new capabilities have been presented in the form of MMS (Multimedia Messaging), video conferencing and multimedia-on-demand services. We believe that a promising new application domain for the latest Smart Phones is their use as intermediaries between us and our surrounding environment. Thus, our mobiles will behave as personal butlers who assist us in our daily tasks, taking advantage of the computational services provided in our working or living environments. For this to happen, a key element is to add senses to our mobiles: the capability to see (camera), hear (microphone), and notice (Bluetooth) the objects and devices offering computational services. In this paper, we present a solution to this issue, the MobileSense system. We illustrate its use in two scenarios: (1) making mobiles more accessible to people with disabilities and (2) enabling the mobiles as guiding devices within a museum.
POMDP Models for Assistive Technology
, 2005
Cited by 3 (3 self)
This paper presents a general decision theoretic model of interactions between users and cognitive assistive technologies for various tasks of importance to the elderly population. The model is a partially observable Markov decision process (POMDP) whose goal is to work in conjunction with a user towards the completion of a given activity or task. This requires the model to monitor and assist the user, to maintain indicators of overall user health, and to adapt to changes. The key strengths of the POMDP model are that it is able to deal with uncertainty, it is easy to specify, it can be applied to different tasks with little modification, and it is able to learn and adapt to changing tasks and situations.
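The way a POMDP deals with uncertainty can be illustrated with a minimal sketch of the standard belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The two states, the single action, and all probabilities below are invented for illustration and are not taken from the paper; `belief_update`, `T` and `O` are hypothetical names.

```python
def belief_update(belief, action, observation, T, O):
    """Bayesian belief update for a discrete POMDP.

    belief: dict mapping state -> probability
    T: dict mapping (state, action, next_state) -> transition probability
    O: dict mapping (next_state, action, observation) -> observation probability
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction step: probability of reaching s_next under the action.
        predicted = sum(T[(s, action, s_next)] * belief[s] for s in states)
        # Correction step: weight by the likelihood of the observation.
        unnormalized[s_next] = O[(s_next, action, observation)] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Invented two-state model of a user working through a task.
T = {
    ("on_task", "prompt", "on_task"): 0.9,
    ("on_task", "prompt", "stuck"): 0.1,
    ("stuck", "prompt", "on_task"): 0.6,
    ("stuck", "prompt", "stuck"): 0.4,
}
O = {
    ("on_task", "prompt", "progress"): 0.8,
    ("on_task", "prompt", "no_progress"): 0.2,
    ("stuck", "prompt", "progress"): 0.1,
    ("stuck", "prompt", "no_progress"): 0.9,
}
prior = {"on_task": 0.5, "stuck": 0.5}
posterior = belief_update(prior, "prompt", "no_progress", T, O)
```

With these invented numbers, seeing no progress after a prompt shifts the belief toward the "stuck" state, and a policy defined over beliefs could respond with a different kind of assistance.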
The RWTH Aachen University Open Source Speech Recognition System
Cited by 3 (1 self)
We announce the public availability of the RWTH Aachen University speech recognition toolkit. The toolkit includes state-of-the-art speech recognition technology for acoustic model training and decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient tree search decoder are notable components. Comprehensive documentation, example setups for training and recognition, and a tutorial are provided to support newcomers. Index Terms: speech recognition, LVCSR, software
Context-sensitive utterance planning for CCG
- In European Workshop on Natural Language Generation
, 2005
Cited by 2 (1 self)
The paper presents an approach to utterance planning that can dynamically use context information about the environment in which a dialogue is situated. The approach is functional in nature, using systemic networks to specify its planning grammar. The planner takes a description of a communicative goal as input, and produces one or more logical forms that can express that goal in a contextually appropriate way. Both the goal and the resulting logical forms are expressed in a single formalism as ontologically rich, relational structures. To realize the logical forms, OpenCCG is used. The paper focuses primarily on the implementation, but also discusses how the planning grammar can be based on the grammar used in OpenCCG, and trained on (parseable) data.

