Results 1 -
2 of
2
Topic-Based Mixture Language Modelling
, 2000
"... This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. U ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (...
Thematic Indexing of Spoken Documents by Using Self-Organizing Maps
- RR 00-5, IDIAP
, 2000
"... A method is presented to provide a useful searchable index for spoken audio documents. The task differs from the traditional (text) document indexing, because large audio databases are decoded by automatic speech recognition and decoding errors occur frequently. The idea in this paper is to take adv ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
A method is presented to provide a useful searchable index for spoken audio documents. The task differs from the traditional (text) document indexing, because large audio databases are decoded by automatic speech recognition and decoding errors occur frequently. The idea in this paper is to take advantage of the large size of the database and select the best index terms for each document with the help of the other documents close to it using a semantic vector space. First, the audio stream is converted into a text stream by a speech recognizer. Then the text of each story is represented by a document vector which is the normalized sum of the word vectors in the story. A large collection of document vectors is used to train a self-organizing map to find the clusters and latent semantic structures in the collection. Because the news stories are quite short and include speech recognition errors, the idea of smoothing the document vectors using the thematic clusters determined by the self-...

