Combined Intention, Activity, and Motion Recognition for a Humanoid Household Robot
Cited by 1 (1 self)
In this paper, a multi-level approach to intention, activity, and motion recognition for a humanoid robot is proposed. Our system processes images from a monocular camera and combines this information with domain knowledge. The recognition works on-line and in real time; it is independent of the test person but limited to predefined viewpoints. The main contributions of this paper are the extensible, multi-level modeling of the robot’s vision system, the efficient activity and motion recognition, and the asynchronous information fusion based on generic processing of mid-level recognition results. Because activity and motion recognition complement each other, the approach is robust against misclassifications. Experimental results on a real-world data set of complex kitchen tasks, e.g., Prepare Cereals or Lay Table, demonstrate the performance and robustness of the multi-level recognition approach.
Issues in Meeting Transcription -- The ISL Meeting Transcription System
, 2004
This paper describes the Interactive Systems Lab's Meeting transcription system, which performs segmentation, speaker clustering, and transcription of conversational meeting speech. The system described here was evaluated in NIST's RT04S "Meeting" speech evaluation and reached the lowest word error rates for the distant microphone conditions.
The ISL RT04 Mandarin Broadcast News Evaluation System
- in EARS Rich Transcription Workshop, Palisades
, 2004
This paper describes our effort in developing a Mandarin Broadcast News system for the RT-04f (Rich Transcription) evaluation. Starting from a legacy system, we revisited all the issues including partitioning, acoustic modeling, language modeling, decoding and system combination strategies. We have achieved a sizable improvement, from 21.2% to 5.2% on the development set, from 42.7% to 22.4% measured on the RT-04f evaluation set, over a period of three months.
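The before and after figures above can also be read as relative reductions. The following is a minimal sketch of that arithmetic, not something taken from the paper: the helper name `relative_reduction` is hypothetical, and the abstract does not say which error metric the percentages refer to.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction, in percent, going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Figures quoted in the abstract (percent; the metric is not named there).
dev = relative_reduction(21.2, 5.2)
eval_ = relative_reduction(42.7, 22.4)

print(f"development set: {dev:.1f}% relative reduction")
print(f"RT-04f evaluation set: {eval_:.1f}% relative reduction")
```

Under these assumptions the development-set figure corresponds to a relative reduction of roughly 75%, and the evaluation-set figure to roughly 48%.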
Rapid Development of an Afrikaans-English Speech-to-Speech Translator
, 2005
In this paper we investigate the rapid deployment of a two-way Afrikaans to English Speech-to-Speech Translation system. We discuss the approaches and the amount of work involved in porting a system to a new language pair, i.e. the steps required to rapidly adapt the ASR, MT and TTS components to Afrikaans under limited time and data constraints. The resulting system represents the first fully functional prototype built for Afrikaans to English speech translation.
Pronunciation Modeling for Dialectal Arabic Speech Recognition
Short vowels in Arabic are normally omitted in written text, which leads to ambiguity in pronunciation. This is even more pronounced for dialectal Arabic, where a single word can be pronounced quite differently depending on the speaker’s nationality, level of education, social class and religion. In this paper we focus on pronunciation modeling for Iraqi-Arabic speech. We introduce multiple pronunciations into the Iraqi speech recognition lexicon and compare performance when weights computed via forced alignment are assigned to the different pronunciations of a word. Incorporating multiple pronunciations improved recognition accuracy compared to a single-pronunciation baseline, and introducing pronunciation weights further improved performance. Using these techniques, an absolute reduction in word error rate of 2.4% was obtained compared to the baseline system.
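One common way to derive pronunciation weights from forced alignment is to count how often the aligner picks each variant of a word and normalize per word. The sketch below illustrates that idea only. The abstract does not state how the weights were actually computed, so relative-frequency estimation is an assumption here, and the function name, words, and phone strings are all invented for illustration.

```python
from collections import Counter, defaultdict


def pronunciation_weights(aligned):
    """Estimate per-word pronunciation weights as relative frequencies.

    `aligned` is an iterable of (word, pronunciation) pairs, one pair for each
    word token, giving the variant the forced alignment selected for it.
    Returns {word: {pronunciation: weight}} with weights summing to 1 per word.
    """
    counts = defaultdict(Counter)
    for word, pron in aligned:
        counts[word][pron] += 1

    weights = {}
    for word, pron_counts in counts.items():
        total = sum(pron_counts.values())
        weights[word] = {p: n / total for p, n in pron_counts.items()}
    return weights


# Hypothetical toy alignment output (not data from the paper).
toy = [
    ("kitab", "k i t a: b"),
    ("kitab", "k i t a: b"),
    ("kitab", "k t a: b"),
    ("bab", "b a: b"),
]
weights = pronunciation_weights(toy)
for word, prons in weights.items():
    print(word, prons)
```

A lexicon built this way would give the more frequently aligned variant of a word the larger weight, while words with a single observed variant keep a weight of 1.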
Selecting Relevant Features for Human Motion Recognition
Recently, there has been growing interest in automatic recognition of human motion for applications such as humanoid robots, human activity monitoring, and surveillance. In this paper we investigate motion recognition based on joint angle trajectories derived from marker-based video recordings. The goal of this paper is to improve the generalization and robustness of human motion recognition even if only a limited amount of training data is available. We achieve this goal by significantly reducing the number of input features. We leverage recent studies in neuroscience which indicate that human motions display only a few independent degrees of freedom (DOF). We examine which DOF are relevant for recognizing upper body human motions and to what extent the dimensionality of the feature vectors can be reduced in order to simplify the data acquisition and improve the robustness of the recognition process. Our final results indicate that careful feature selection reduces the number of features by a factor of up to 3 while at the same time significantly improving the recognition performance.
Modeling Pronunciation Variation for Bi-Lingual Mandarin/Taiwanese Speech Recognition
In this paper, a bi-lingual large vocabulary speech recognition experiment based on the idea of modeling pronunciation variations is described. The two languages under study are Mandarin Chinese and Taiwanese (Min-nan). These two languages are basically mutually unintelligible, yet they have many words with the same Chinese characters and the same meanings, although they are pronounced differently. Observing the bi-lingual corpus, we found five types of pronunciation variations for Chinese characters. A one-pass, three-layer recognizer was developed that includes a combination of bi-lingual acoustic models, an integrated pronunciation model, and a tree-structure based searching net. The recognizer’s performance was evaluated under three different pronunciation models. The results showed that the character error rate with integrated pronunciation models was better than that with pronunciation models using either the knowledge-based or the data-driven approach. The relative frequency ratio was also used as a measure to choose the best number of pronunciation variations for each Chinese character. Finally, the best character error rates in Mandarin and Taiwanese testing sets were found to be 16.2% and 15.0%, respectively, when the average number of pronunciations for one Chinese character was 3.9. Keywords: Bi-lingual, One-pass ASR, Pronunciation Modeling
Informedia @ TRECVID 2010
The Informedia group participated in four tasks this year: Semantic indexing, Known-item search, Surveillance event detection, and the Event detection in Internet multimedia pilot. For semantic indexing, in addition to training traditional SVM classifiers for each high-level feature using different low-level features, we trained a cascade classifier consisting of four layers, each with different visual features. For the Known-item search task, we built a text-based video retrieval system and a visual-based video retrieval system, and then used query-class dependent late fusion to combine the runs from these two systems. For surveillance event detection, we focused on analyzing motion and humans in videos. We detected the events through three channels. First, we adopted a robust new descriptor called MoSIFT, which explicitly encodes appearance features together with motion information, and then trained event classifiers in sliding windows using a bag-of-video-word approach. Second, we used human detection and tracking algorithms to detect and track the regions of human,
The 2010 CMU GALE Speech-to-Text System
- in INTERSPEECH 2010
This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation (“GALE”) domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottle-neck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques. Index Terms: speech recognition, discriminative training, bottle-neck features
In Language and Information Technologies
This thesis describes MultiSphinx, a concurrent architecture for scalable, low-latency automatic speech recognition. We first consider the problem of constructing a universal “core” speech recognizer on top of which domain- and task-specific adaptation layers can be constructed. We then show that when this problem is restricted to that of expanding the search space from a “core” vocabulary to a superset of this vocabulary across multiple passes of search, it allows us to effectively “factor” a recognizer into components of roughly equal complexity. We present simple but effective algorithms for constructing the reduced vocabulary and associated statistical language model from an existing system. Finally, we describe the MultiSphinx decoder architecture, which allows multiple passes of recognition to operate concurrently and incrementally, either in multiple threads in the same process or across multiple processes on separate machines, and which allows the best possible partial results, including confidence scores, to be obtained at any time during the recognition process.
Acknowledgments: This thesis would not be possible without the support of all the friends, family, faculty, and colleagues who steered, encouraged, and supported me all the way. First and foremost, I am deeply grateful to my advisor, Dr. Alexander I. Rudnicky, for his support through the MLT and PhD programs here at CMU, and

