A study in machine learning from imbalanced data for sentence boundary detection in speech. Computer Speech & Language 20(4) (2006)

by Y Liu, N V Chawla, M P Harper, E Shriberg, A Stolcke

Results 1 - 5 of 5

Classifying and Filtering Blind Feedback Terms to Improve Information Retrieval Effectiveness

by Johannes Leveling
Cited by 2 (1 self)
The classification of blind relevance feedback (BRF) terms described in this paper aims at increasing precision or recall by determining which terms decrease, increase, or do not change the corresponding information retrieval (IR) performance metric. Classification and IR experiments are performed on the German and English GIRT data, using the BM25 retrieval model. Several basic memory-based classifiers are trained on different feature sets, grouping together features from different query expansion (QE) approaches. Combined classifiers employ the results of the basic classifiers and correctness predictions as features. The best combined classifiers for German (English) yield 22.9% (26.4%) and 5.8% (1.9%) improvement for term classification with respect to precision and recall compared to the best basic classifiers. IR experiments based on this term classification have also been performed. Filtering out different types of BRF terms shows that selecting feedback terms predicted to increase precision improves the average precision significantly compared to experiments without BRF. MAP is improved by +19.8% compared to the best standard BRF experiment (+11% for German). BRF term classification also increases the number of relevant and retrieved documents, geometric MAP, and P@10 in comparison to standard BRF. Experiments based on an optimal classification show that there is potential for improving IR effectiveness even more.

Genre Effects on Automatic Sentence Segmentation of Speech: A Comparison of Broadcast News and Broadcast Conversations

by Yang Liu, Elizabeth Shriberg
We investigate genre effects on the task of automatic sentence segmentation, focusing on two important domains: broadcast news (BN) and broadcast conversation (BC). We employ an HMM based on textual and prosodic information and analyze differences in segmentation accuracy and feature usage between the two genres using both manual and automatic speech transcripts. Experiments are evaluated using Czech broadcast corpora annotated for sentence-like units (SUs). Prosodic features capture information about pause, duration, pitch, and energy patterns. Textual knowledge sources include words, part-of-speech, and automatically induced classes. We also analyze effects of using additional textual data that is not annotated for SUs. Feature analysis reveals significant differences in both textual and prosodic feature usage patterns between the two genres. The analysis is important for building automatic understanding systems when limited matched-genre data are available, or for designing eventual genre-independent systems. Index Terms: spoken language understanding, sentence segmentation, broadcast news, broadcast conversations, prosody

Comparing and Combining Modeling Techniques for Sentence Segmentation of Spoken Czech Using Textual and Prosodic Information

by Yang Liu
This paper deals with automatic sentence boundary detection in spoken Czech using both textual and prosodic information. This task is important to make automatic speech recognition (ASR) output more readable and easier for downstream language processing modules. We compare and combine three statistical models: hidden Markov model, maximum entropy, and adaptive boosting. We evaluate these methods on two Czech corpora, broadcast news and broadcast conversations, using both manual and ASR transcripts. Our results show that superior results are achieved when all three models are combined via posterior probability interpolation, and that there are substantial differences among the three methods when using different knowledge sources, as well as in different genres. Feature analysis also reveals significant differences in prosodic feature usage patterns between the two genres. Index Terms: sentence segmentation, prosody, HMM, maximum entropy, boosting
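
The abstract above does not spell out how the posteriors are interpolated. As a rough illustration only, the following sketch assumes a simple linear interpolation of per-word boundary posteriors from three models. The function names, the weights, the 0.5 threshold, and the example probabilities are all hypothetical and are not taken from the paper.

```python
def interpolate_posteriors(posteriors, weights):
    """Linearly interpolate boundary posteriors from several models.

    posteriors: one list per model, each holding P(boundary) for every
                inter-word position (hypothetical input layout).
    weights:    one non-negative weight per model; normalized here.
    """
    if len(posteriors) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(posteriors[0])
    if any(len(p) != n for p in posteriors):
        raise ValueError("all models must score the same positions")
    norm = [w / total for w in weights]
    # Weighted average of the model posteriors at each position.
    return [sum(w * p[i] for w, p in zip(norm, posteriors)) for i in range(n)]


def detect_boundaries(combined, threshold=0.5):
    """Mark a position as a sentence boundary when its posterior is high enough."""
    return [p >= threshold for p in combined]


# Made-up posteriors for three inter-word positions.
hmm_post = [0.9, 0.2, 0.6]
maxent_post = [0.8, 0.1, 0.3]
boost_post = [0.7, 0.3, 0.4]

combined = interpolate_posteriors(
    [hmm_post, maxent_post, boost_post], weights=[1.0, 1.0, 1.0]
)
boundaries = detect_boundaries(combined)
```

In practice the interpolation weights would typically be tuned on held-out data; equal weights are used here only to keep the sketch short.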

Extraction of Definitions in Portuguese: An Imbalanced Data Set Problem

by Rosa Del Gaudio, António Branco
Definition extraction is an important task in the NLP and IR fields, in the context of, e.g., question answering, ontology learning, and dictionary and glossary construction. When addressed with learning algorithms, it turns out to be a challenging task due to the structure of the data set: definition-bearing sentences are far fewer than sentences that are non-definitions. In this paper, we present results from experiments that seek to obtain optimal solutions for this problem by using a corpus written in the Portuguese language. Our results show an improvement of 29 points regarding the AUC metric and more than 60 points when considering the F-measure. Key words: automatic definition extraction, machine learning, imbalanced data set.

Language Independent System for Definition Extraction: First Results Using Learning Algorithms

by Rosa Del Gaudio, António Branco
In this paper we report on the performance of different learning algorithms and different sampling techniques applied to a definition extraction task, using data sets in different languages. We compare our results with those obtained by handcrafted rules to extract definitions. When definition extraction is handled with machine learning algorithms, two different issues arise. On the one hand, in most cases the data set used to extract definitions is unbalanced, which means that it is necessary to deal with this characteristic using specific techniques. On the other hand, it is possible to use the same methods to extract definitions from documents in different corpora, making the classifier language independent. Keywords: machine learning, imbalanced data set, language independent, definition extraction
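
The abstract above does not say which sampling techniques were applied. Purely as an illustration of the general idea, the sketch below shows random undersampling of the majority class, one common way to rebalance a data set. The function name, the `ratio` parameter, and the toy sentences are hypothetical and do not come from the paper.

```python
import random


def undersample(examples, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class (label 0) items.

    Keeps every positive (label 1) item and at most ratio * #positives
    negatives. Assumes binary labels; original order is preserved.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, keep_neg))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 2 definition-bearing sentences among 10 (made up).
sentences = ["s%d" % i for i in range(10)]
is_definition = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

balanced_x, balanced_y = undersample(sentences, is_definition, ratio=1.0)
```

With `ratio=1.0` the result has as many non-definitions as definitions; a larger ratio keeps more of the majority class.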