An efficient, probabilistically sound algorithm for segmentation and word discovery
- Machine Learning, 1999
Cited by 103 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
Self-supervised Chinese Word Segmentation
- In F. Hoffmann et al. (Eds.): Advances in Intelligent Data Analysis, Proceedings of the Fourth International Conference (IDA-01), LNCS 2189, 2001
Cited by 22 (7 self)
We propose a new unsupervised training method for acquiring...
Leading up the lexical garden-path: Segmentation and ambiguity in spoken word recognition
- Journal of Experimental Psychology: Human Perception and Performance, 2002
Cited by 18 (3 self)
Two gating studies, a forced-choice identification study and 2 series of cross-modal repetition priming experiments, traced the time course of recognition of words with onset embeddings (captain) and short words in contexts that match (cap tucked) or mismatch (cap looking) with longer words. Results suggest that acoustic differences in embedded syllables assist the perceptual system in discriminating short words from the start of longer words. The ambiguity created by embedded words is therefore not as severe as predicted by models of spoken word recognition based on phonemic representations. These additional acoustic cues combine with post-offset information in identifying onset-embedded words in connected speech. An important problem in the perception of connected speech is segmentation: how listeners divide the speech stream into individual lexical units or words. Words in fluent speech are not separated by silence in the same way that printed words are divided by blank spaces, yet connected speech is perceived as a sequence of individual words. This perceptual experience clearly reflects acquired language-specific knowledge, because listeners do not have the
Unsupervised Lexical Learning as Inductive Inference
- 2000
Cited by 8 (4 self)
To learn a language, learners must first learn its words, the essential building blocks for utterances. The difficulty in learning words lies in the unavailability of explicit word boundaries in speech input. Learners have to infer lexical items with some innately endowed learning mechanism(s) for regularity detection: regularities in the speech normally indicate word patterns. With respect to Zipf's least-effort principle and Chomsky's thoughts on the minimality of grammar for human language, we hypothesise a cognitive mechanism underlying language learning that seeks the least-effort representation for input data. Accordingly, lexical learning is to infer the minimal-cost representation for the input under the constraint of permissible representation for lexical items. The main theme of this thesis is to examine how far this learning mechanism can go in unsupervised lexical learning from real language data without any pre-defined (e.g., prosodic and phonotactic) cues, resting entirely on statistical induction of structural patterns for the most economical representation of the data. We first review
Using Self-Supervised Word Segmentation in Chinese Information Retrieval
Cited by 3 (3 self)
We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.
A Day in the Life of a Spoken Word
Cited by 3 (1 self)
Two experiments tracked the emergence of lexical competition effects for newly learnt spoken words (e.g., "cathedruke"). Experiment 1 compared form-only learning with learning in semantically rich sentence contexts. In both cases, although immediate explicit recognition of the novel words was good, lexical competition effects (e.g., "cathedruke-cathedral") emerged only after a delay of at least 24 hours. Experiment 2 evaluated the time course of learning in more detail and used embedding (rather than cohort) new competitors (e.g., "shadowks"). Again results showed no evidence of lexicalization immediately after exposure, but clear lexical competition effects after 24 hours. Furthermore, recognition and free recall improved over time. These results are interpreted in terms of a consolidation process that integrates words into the mental lexicon over a relatively protracted period of time.
Lexical segmentation in spoken word recognition
- Birkbeck College, University of London, 2000
Cited by 1 (0 self)
This thesis examines an important issue in spoken word recognition: how the perceptual system segments connected speech into lexical units or words. Research on this topic has investigated the role of different sources of information in dividing up the speech stream: acoustic cues in the speech signal, statistical regularities in the structure of the language, or the identification of individual lexical items. This research focuses on cases in which the location of word boundaries may be ambiguous for one or more of these segmentation mechanisms, using words embedded at the onset of longer words (such as cap in captain). The ambiguities proposed for onset-embedded words have motivated accounts of segmentation based on competition between alternative parses of speech into words. In these accounts, the recognition of embedded words is delayed until after their offset, when subsequent input rules out longer competitors. In this thesis it is demonstrated that training a simple recurrent network to activate a representation of all the words in a sequence allows a connectionist network to learn the appropriate delay to allow the identification of onset-embedded words without requiring directly implemented competition between words. Both lexical competition and recurrent network models assume ambiguity between onset-embedded words and equivalent syllables in longer competitors. Acoustic analysis carried out in this thesis confirms the presence of reliable acoustic differences between syllables in short and long words. A series of experiments using gating and cross-modal priming suggests that the perceptual system uses these acoustic differences to discriminate embedded words from the onset of longer competitors and that match or mismatch with longer competitors may be less important for the identification of onset-embedded words. These results are interpreted within a revised version of the recurrent network model, incorporating input representing the acoustic differences between syllables in short and long words.
Applying Machine Learning to Text Segmentation for Information Retrieval
- 2002
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give comparable (in some cases even better) retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach.
Phonotactics, parsing and productivity
This paper argues that parsing and productivity are causally related: the more an affix is prone to parsing in speech perception, the more productive it is likely to be. We support this claim by demonstrating a strong relationship between junctural phonotactics and affix productivity. Affixes which tend to create phonotactic junctures which facilitate parsing also tend to be more productive. We show that there is a strong statistical relationship between factors relating to phonotactics, and those relating to productivity. We further show that factors relating to productivity are themselves highly inter-correlated. A Principal Components Analysis reveals that affixes can be assessed on two dimensions, which we label parsability and usefulness. Both of these dimensions substantially contribute to overall productivity.

