Subword-based Approaches for Spoken Document Retrieval (2000)

by K Ng

Results 11 - 20 of 32

R&D status of ERIC-7 and MADIS: two systems for MPEG-7 indexing/search of audio-visual content

by L. Gagnon - In Proc. SPIE Conference on Multimedia Systems and Applications VIII (SPIE #6015), 2005
Cited by 2 (2 self)
We present the research and development status of two MPEG-7 indexing/search systems under development at the Computer Research Institute of Montreal (CRIM). The first (called ERIC-7) targets content-based encoding of still images and is mainly designed to experiment with the various aspects of the visual MPEG-7/XML schema with the help of analysis and exploration tools. The interface allows navigating graphically among the various descriptors in the XML files and through interactive UML graphics. The second (called MADIS) aims at providing a practical audiovisual MPEG-7 indexing/retrieval tool, within the framework of a light architecture. MADIS is designed to (1) be fully MPEG-7 compliant, (2) address both encoding and search, (3) combine audio, speech and visual modalities and (4) have search capability on the Internet. MADIS currently targets content-based indexing of documentary films. Keywords: MPEG-7, video processing, multimedia systems, content-based image retrieval, audio-visual indexing

Multimodal redundancy across handwriting and speech during computer mediated human-human interactions

by Edward C. Kaiser, Paulo Barthelmess, Ice Erdmann, Phil Cohen - In CHI ’07, 2007
Cited by 2 (0 self)
Lecturers, presenters and meeting participants often say what they publicly handwrite. In this paper, we report on three empirical explorations of such multimodal redundancy — during whiteboard presentations, during a spontaneous brainstorming meeting, and during the informal annotation and discussion of photographs. We show that redundantly presented words, compared to other words used during a presentation or meeting, tend to be topic specific and thus are likely to be out-of-vocabulary. We also show that they have significantly higher tf-idf (term frequency–inverse document frequency) weights than other words, which we argue supports the hypothesis that they are dialogue-critical words. We frame the import of these empirical findings by describing SHACER, our recently introduced Speech and HAndwriting reCognizER, which can combine information from instances of redundant handwriting and speech to dynamically learn new vocabulary.
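
The tf-idf weighting mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which tf-idf variant the authors used, so the formula below (relative term frequency times the natural log of inverse document frequency) is an assumption, and the toy documents are hypothetical.

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Hypothetical toy collection, not data from the paper.
docs = [
    ["the", "lattice", "index", "the"],
    ["the", "meeting", "notes"],
    ["the", "budget", "meeting"],
]
weights = tf_idf(docs)
```

In this toy collection a word that occurs in every document ("the") gets weight zero, while a topic-specific word that occurs in one document ("lattice") gets the largest weight, which is the kind of contrast the abstract reports for redundantly presented words.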

Analytical comparison between position specific posterior lattices and confusion networks based on words and subword units for spoken document indexing

by Yi-cheng Pan, Hung-lin Chang, Lin-shan Lee - In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007
Cited by 2 (0 self)
In this paper we analytically compare the two widely accepted approaches of spoken document indexing, Position Specific Posterior Lattices (PSPL) and Confusion Network (CN), in terms of retrieval accuracy and index size. The fundamental distinctions between these two approaches in terms of construction units, posterior probabilities, number of clusters, indexing coverage and space requirements are discussed in detail. A new approach to approximate subword posterior probability in a word lattice is also incorporated in PSPL/CN to handle OOV/rare word problems, which were unaddressed in the original PSPL and CN approaches. Extensive experimental results on Chinese broadcast news segments indicate that PSPL offers higher accuracy than CN but requires much more disk space, while subword-based PSPL turns out to be very attractive because it lowers the storage cost while offering even higher accuracy. Index Terms: PSPL, S-PSPL, Spoken Document Retrieval

Combining LVCSR and Vocabulary-Independent Ranked Utterance Retrieval for Robust Speech Search

by J. Scott Olsson
Cited by 2 (2 self)
Large-vocabulary continuous speech recognition (LVCSR) has been shown to generally be more effective than vocabulary-independent techniques for ranked retrieval of spoken content when one or the other approach is used alone. Tuning LVCSR systems to a topic domain can be costly, however, and the experiments in this paper show that Out-Of-Vocabulary (OOV) query terms can significantly reduce retrieval effectiveness when that tuning is not performed. Further experiments demonstrate, however, that retrieval effectiveness for queries with OOV terms can be substantially improved by combining evidence from LVCSR with additional evidence from vocabulary-independent Ranked Utterance Retrieval (RUR). The combination is performed by using relevance judgments from held-out topics to learn generic (i.e., topic-independent), smooth, non-decreasing transformations from LVCSR and RUR system scores to probabilities of topical relevance. Evaluated using a CLEF collection that includes topics, spontaneous conversational speech audio, and relevance judgments, the system recovers 57% of the mean uninterpolated average precision that could have been obtained through LVCSR domain tuning for very short queries (or 41% for longer queries).

Phone-Based Spoken Document Retrieval in Conformance with the MPEG-7 Standard

by Nicolas Moreau, Hyoung Gook Kim, Thomas Sikora, 2004
Cited by 1 (0 self)
This paper presents a phone-based approach to spoken document retrieval, developed in the framework of the emerging MPEG-7 standard. The audio part of MPEG-7 includes a SpokenContent tool that provides a standardized description of the content of spoken documents. In the context of MPEG-7, we propose an indexing and retrieval method that uses phonetic information only and a vector space IR model. Experiments are conducted on a database of German spoken documents, with 10 city name queries. Two phone-based retrieval approaches are presented and combined. The first one is based on the combination of phone N-grams of different lengths used as indexing terms. The other consists of expanding the document representation by means of phone confusion probabilities.
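
The first approach in this abstract (phone N-grams of several lengths as indexing terms in a vector space model) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the raw-count weighting, the choice of N = 1 to 3, the cosine measure, and the phone strings are all assumptions.

```python
from collections import Counter
from math import sqrt

def phone_ngrams(phones, n_values=(1, 2, 3)):
    """Count overlapping phone N-grams for every length in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical phone transcriptions (illustrative symbols, not from the paper).
query = phone_ngrams(["b", "E", "r", "l", "i", "n"])
doc_hit = phone_ngrams(["i", "n", "b", "E", "r", "l", "i", "n", "a"])
doc_miss = phone_ngrams(["h", "a", "m", "b", "U", "r", "k"])

# Rank documents by similarity to the query, best first.
ranking = sorted(
    [("doc_hit", cosine(query, doc_hit)), ("doc_miss", cosine(query, doc_miss))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Mixing N-gram lengths lets short units tolerate recognition errors while longer units reward exact phone sequences; how the lengths are weighted against each other is left open here.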

A Comparison of Query-by-Example Methods for Spoken Term Detection

by Wade Shen, Christopher M. White, Timothy J. Hazen
Cited by 1 (1 self)
In this paper we examine an alternative interface for phonetic search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods. We develop three methods that compare query lattices derived from example audio against a standard n-gram-based phonetic index and we analyze factors affecting the performance of these systems. We show that the best systems under this paradigm are able to achieve 77% precision when retrieving utterances from conversational telephone speech and returning 10 results from a single query (performance that is better than a similar dictionary-based approach), suggesting significant utility for applications requiring high precision. We also show that these systems can be further improved using relevance feedback: by incorporating four additional queries the precision of the best system can be improved by 13.7% relative. Our systems perform well despite high phone recognition error rates (> 40%) and make use of no pronunciation or letter-to-sound resources.

Hash Table Sizes for Storing N-Grams for Text Processing

by Zhong Gu, Daniel Berleant
Cited by 1 (0 self)
N-grams have been widely investigated for a number of text processing tasks. However, n-gram-based systems often labor under the large memory requirements of naive storage of the large vectors that describe the many n-grams that could potentially appear in documents. This problem becomes more severe as the number of documents (and hence the number of vectors to store and process) rises. A natural approach to reducing vector size is to hash the large number of possible n-grams into a smaller vector. We address this problem by identifying good and bad hash table sizes over a wide range of sizes. We show that English, French, and German n-grams behave similarly when hashed, and that this is unlike the behavior of randomly generated n-grams. Therefore the difference in behavior is due to properties of the languages themselves. We then investigate different table sizes and identify which sizes are particularly good when hashing n-grams during processing of these languages.
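
Hashing n-grams into a smaller fixed-size vector, as this abstract describes, might look like the sketch below. The hash function (CRC-32 from the standard library), the character trigram unit, and the table sizes are assumptions chosen for illustration; the abstract does not state which hash function or sizes the authors studied.

```python
import zlib

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hashed_vector(text, table_size, n=3):
    """Fold n-gram counts into a vector with table_size slots."""
    vec = [0] * table_size
    for gram in char_ngrams(text, n):
        # CRC-32 is an assumed stand-in for whatever hash a real system uses.
        slot = zlib.crc32(gram.encode("utf-8")) % table_size
        vec[slot] += 1
    return vec

def occupied_slots(vec):
    """Number of slots that received at least one n-gram."""
    return sum(1 for count in vec if count)

# Compare slot usage for two hypothetical table sizes on the same text.
text = "spoken document retrieval"
usage = {size: occupied_slots(hashed_vector(text, size)) for size in (16, 17)}
```

Fewer occupied slots than distinct n-grams means collisions occurred; running the same comparison over a real corpus and a range of sizes is one way to look for the good and bad table sizes the abstract refers to.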

A comparison of sub-word indexing methods for information retrieval

by Johannes Leveling
Cited by 1 (0 self)
This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams, and 4) Knuth’s algorithm for hyphenation. The performance and effects of sub-word extraction on search time and index size and time are reported for English and German retrieval experiments. The main results are: For English, indexing sub-words does not outperform the baseline using standard retrieval on stemmed word forms (-8% mean average precision (MAP), -11% geometric MAP (GMAP), +1% relevant and retrieved documents (rel ret) for the best experiment). For German, with the exception of n-grams, all methods for indexing sub-words achieve a higher performance than the stemming baseline. The best performing sub-word indexing methods are to use consonant-vowel-consonant sequences and index them together with word stems (+17% MAP, +37% GMAP, +14% rel ret compared to the baseline), or to index syllable-like sub-words obtained from the hyphenation algorithm together with stems (+9% MAP, +23% GMAP, +11% rel ret).
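
Of the four methods listed in this abstract, dictionary-based decompounding is perhaps the easiest to sketch. The version below is a simplified assumption, not the paper's algorithm: it takes the first split whose parts are all in a lexicon, ignores German linking elements, and uses a hypothetical three-word lexicon.

```python
def decompound(word, lexicon, min_len=3):
    """Split word into lexicon entries; return None if no split exists.

    Simplified sketch: first matching split wins, no linking elements.
    """
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return None

# Hypothetical lexicon for illustration.
lexicon = {"daten", "bank", "system"}
parts = decompound("datenbanksystem", lexicon)  # ["daten", "bank", "system"]
```

The resulting parts could then be indexed alongside the word stem, in the spirit of the combined indexing the abstract reports for German.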

Phrase-Based Query Degradation Modeling for Vocabulary-Independent Ranked Utterance Retrieval

by J. Scott Olsson, Douglas W. Oard - In NAACL-HLT ’09, 2009
Cited by 1 (1 self)
This paper introduces a new approach to ranking speech utterances by a system’s confidence that they contain a spoken word. Multiple alternate pronunciations, or degradations, of a query word’s phoneme sequence are hypothesized and incorporated into the ranking function. We consider two methods for hypothesizing these degradations, the best of which is constructed using factored phrase-based statistical machine translation. We show that this approach is able to significantly improve upon a state-of-the-art baseline technique in an evaluation on held-out speech. We evaluate our systems using three different methods for indexing the speech utterances (using phoneme, phoneme multigram, and word recognition), and find that degradation modeling shows particular promise for locating out-of-vocabulary words when the underlying indexing system is constructed with standard word-based speech recognition.

SEARCHING THE AUDIO NOTEBOOK: KEYWORD

by Peng Yu, Kaijiang Chen, Lie Lu, Frank Seide
MIT’s Audio Notebook added great value to the note-taking process by retaining audio recordings, e.g. during lectures or interviews. The key was to provide users ways to quickly and easily access portions of interest in a recording. Several non-speech-recognition based techniques were employed. In this paper we present a system to search the audio recordings directly by key phrases. We have identified the user requirements as accurate ranking of phrase matches, domain independence, and reasonable response time. We address these requirements by a hybrid word/phoneme search in lattices, and a supporting indexing scheme. We will introduce the ranking criterion, a unified hybrid posterior-lattice representation, and the indexing algorithm for hybrid lattices. We present results for five different recording sets, including meetings, telephone conversations, and interviews. Our results show an average search accuracy of 84%, which is dramatically better than a direct search in speech recognition transcripts (less than 40% search accuracy).