Exploiting Syntactic Structure for Natural Language Modeling (2000)

by Ciprian Chelba

Results 1 - 10 of 20

The SuperARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources

by Wen Wang, Mary P. Harper - in Proceedings of Conference of Empirical Methods in Natural Language Processing, 2002
Cited by 21 (5 self)
A new almost-parsing language model incorporating multiple knowledge sources that is based upon the concept of Constraint Dependency Grammars is presented in this paper. Lexical features and syntactic constraints are tightly integrated into a uniform linguistic structure called a SuperARV that is associated with a word in the lexicon. The SuperARV language model reduces perplexity and word error rate compared to trigram, part-of-speech-based, and parser-based language models. The relative contributions of the various knowledge sources to the strength of our model are also investigated by using constraint relaxation at the level of the knowledge sources. We have found that although each knowledge source contributes to language model quality, lexical features are an outstanding contributor when they are tightly integrated with word identity and syntactic constraints. Our investigation also suggests possible reasons for the reported poor performance of several probabilistic dependency grammar models in the literature.

Combining Semantic And Syntactic Structure For Language Modeling

by Rens Bod - Proceedings ICSLP-2000, 2000
Cited by 16 (6 self)
Structured language models for speech recognition have been shown to remedy the weaknesses of n-gram models. All current structured language models, however, are limited in that they do not take into account dependencies between non-headwords. We show that non-headword dependencies contribute significantly to improved word error rate, and that a data-oriented parsing model trained on semantically and syntactically annotated data can exploit these dependencies. This paper contains the first published experiments with a data-oriented parsing model trained by means of a maximum likelihood reestimation procedure. 1. INTRODUCTION Structured language models for speech recognition have recently gained a considerable interest. They have been shown to outperform the 3-gram language model on various domains and they can be efficiently parsed in a left-to-right manner (Chelba & Jelinek 1998; Chelba 2000). Although it has been reported that higher order n-gram models perform as well as structur...

Structural Event Detection for Rich Transcription of Speech

by Yang Liu, 2004
Cited by 12 (5 self)
Abstract not found

Techniques for modelling Phonological Processes in Automatic Speech Recognition

by Harriet Jane Nock, 2001
Cited by 6 (0 self)
Declaration: This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices does not exceed 29,500 words and includes no more than 40 figures. Systems which automatically transcribe carefully dictated speech are now commercially available, but their performance degrades dramatically when the speaking style of users becomes more relaxed or conversational. This dissertation focuses on techniques that aim to improve the robustness of statistical speech transcription systems to conversational speaking styles. The dissertation shows first that the performance degradation occurring as speech becomes more conversational is severe and is partially attributable to differences in the acoustic realizations of sentences. Hypothesizing that the quantifiably wider range of

The Robustness of an Almost-Parsing Language Model Given Errorful Training Data

by Wen Wang, Mary P. Harper, Andreas Stolcke, 2003
Cited by 4 (4 self)
An almost-parsing language model has been developed [1] that provides a framework for tightly integrating multiple knowledge sources. Lexical features and syntactic constraints are integrated into a uniform linguistic structure (called a SuperARV) that is associated with words in the lexicon. The SuperARV language model has been found able to reduce perplexity and word error rate (WER) compared to trigram, part-of-speech-based, and parser-based language models on the DARPA Wall Street Journal (WSJ) CSR task. In this paper we further investigate the robustness of the language model to possibly inconsistent and flawed training data, as well as its ability to scale up to sophisticated LVCSR tasks by comparing performance on the DARPA WSJ and Hub4 (Broadcast News) CSR tasks.

Implementation Testing of a Hybrid Symbolic/Statistical Multimodal Architecture

by Edward C. Kaiser, Philip R. Cohen, 2002
Cited by 3 (2 self)
The design and implementation of hybrid symbolic/statistical architectures is a major area of interest in current multimodal system development. Such an architecture attempts to improve multimodal recognition and disambiguation rates by using corpus-based statistics to weight the contributions from various input streams. This is in contrast to current architectures that assume independence between input streams, and combine un-weighted posterior probabilities simply by taking their cross product.

A Statistical Information Extraction System for Turkish

by Gökhan Tür, 2000
Cited by 3 (1 self)
This thesis presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. We have successfully applied statistical methods using both the lexical and morphological information to the following tasks: The Turkish Text Deasciifier task aims to convert the ASCII characters in a Turkish text into the corresponding non-ASCII Turkish characters (i.e., "ü", "ö", "ç", "ş", "ğ", "ı", and their upper cases).

A Syntactified Direct Translation Model with Linear-time Decoding

by Hany Hassan, Andy Way
Cited by 3 (0 self)
Recent syntactic extensions of statistical translation models work with a synchronous context-free or tree-substitution grammar extracted from an automatically parsed parallel corpus. The decoders accompanying these extensions typically exceed quadratic time complexity. This paper extends the Direct Translation Model 2 (DTM2) with syntax while maintaining linear-time decoding. We employ a linear-time parsing algorithm based on an eager, incremental interpretation of Combinatory Categorial Grammar (CCG). As every input word is processed, the local parsing decisions resolve ambiguity eagerly, by selecting a single supertag–operator pair for extending the dependency parse incrementally. Alongside translation features extracted from the derived parse tree, we explore syntactic features extracted from the incremental derivation process. Our empirical experiments show that our model significantly outperforms the state-of-the-art DTM2 system.

A Semantically Structured Language Model

by Alex Acero, Ye-yi Wang, Kuansan Wang - in Special Workshop in Maui (SWIM), 2004
Cited by 2 (0 self)
In this paper we propose a semantically structured language model (SSLM) that significantly reduces the authoring load required over the traditional manually derived grammar when developing a spoken language system. At the same time, the SSLM results in an understanding error rate which is roughly half as large as that of the manually authored grammar. The proposed model combines the advantages of both statistical word n-grams and context-free grammars. When the SSLM directly acts as the recognizer’s language model there’s a significant reduction in understanding error rate over the case where it is applied only at the output of a recognizer driven by a word n-gram language model.

Automatic Sentence Structure Annotation for Spoken Language Processing

by Dustin Lundring Hillard, 2008
Cited by 1 (0 self)
Increasing amounts of easily available electronic data are precipitating a need for automatic processing that can aid humans in digesting large amounts of data. Speech and video are becoming an increasingly significant portion of on-line information, from news and television broadcasts, to oral histories, on-line lectures, or user generated content. Automatic processing of audio and video sources requires automatic speech recognition (ASR) in order to provide transcripts. Typical ASR generates only words, without punctuation, capitalization, or further structure. Many techniques available from natural language processing therefore suffer when applied to speech recognition output, because they assume the presence of reliable punctuation and structure. In addition, errors from automatic transcription also degrade the performance of downstream processing such as machine translation, name detection, or information retrieval. We develop approaches for automatically annotating structure in speech, including sentence and sub-sentence segmentation, and then turn towards optimizing ASR and annotation for downstream applications.