Unsupervised Learning of Probabilistic Context-Free Grammar using Iterative
Cited by 4 (1 self)
This paper presents PCFG-BCL, an unsupervised algorithm that learns a probabilistic context-free grammar (PCFG) from positive samples. The algorithm acquires rules of an unknown PCFG through iterative biclustering of bigrams in the training corpus. Our analysis shows that this procedure uses a greedy approach to adding rules such that each set of rules that is added to the grammar results in the largest increase in the posterior of the grammar given the training corpus. Results of our experiments on several benchmark datasets show that PCFG-BCL is competitive with existing methods for unsupervised CFG learning.
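As a rough illustration of the data that bigram biclustering operates on, the sketch below tabulates adjacent symbol pairs over a tokenized corpus. It covers only this counting step; the biclustering itself and the posterior-gain computation described in the abstract are not shown. The function name `bigram_table` and the toy corpus are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def bigram_table(corpus):
    """Count adjacent symbol pairs over a list of tokenized sentences.

    Hypothetical helper: the paper's own data structures are not given
    in the abstract, so a plain Counter keyed by (left, right) is assumed.
    """
    counts = Counter()
    for sentence in corpus:
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(sentence, sentence[1:]))
    return counts


# Toy corpus, invented for this example
corpus = [
    ["the", "dog", "runs"],
    ["the", "cat", "runs"],
    ["a", "dog", "sleeps"],
]
table = bigram_table(corpus)
```

Viewed as a matrix with left symbols as rows and right symbols as columns, a table like this is the kind of object in which a biclustering step would look for dense sub-blocks.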
Characterizing Motherese: On the Computational Structure of Child-Directed Language
Cited by 3 (2 self)
We report a quantitative analysis of the cross-utterance coordination observed in child-directed language, where successive utterances often overlap in a manner that makes their constituent structure more prominent, and describe the application of a recently published unsupervised algorithm for grammar induction to the largest available corpus of such language, producing a grammar capable of accepting and generating novel well-formed sentences. We also introduce a new corpus-based method for assessing the precision and recall of an automatically acquired generative grammar without recourse to human judgment. The present work sets the stage for the eventual development of more powerful unsupervised algorithms for language acquisition, which would make use of the coordination structures present in natural child-directed speech.
Structure induction by lossless graph compression
Cited by 3 (0 self)
This work is motivated by the necessity to automate the discovery of structure in vast and ever-growing collections of relational data commonly represented as graphs, for example genomic networks. A novel algorithm, dubbed Graphitour, for structure induction by lossless graph compression is presented and illustrated by a clear and broadly known case of nested structure in a DNA molecule. This work extends to graphs some well-established approaches to grammatical inference previously applied only to strings. The bottom-up graph compression problem is related to the (non-bipartite) maximum cardinality matching problem. The algorithm accepts a variety of graph types, including directed graphs and graphs with labeled nodes and arcs. The resulting structure could be used for representation and classification of graphs.
Unsupervised language acquisition: syntax from plain corpus
2004
Cited by 2 (0 self)
We describe results of a novel algorithm for grammar induction from a large corpus. The ADIOS (Automatic DIstillation of Structure) algorithm searches for significant patterns, chosen according to context-dependent statistical criteria, and builds a hierarchy of such patterns according to a set of rules leading to structured generalization. The corpus is thus generalized into a context-free grammar (CFG), composed of patterns, equivalence classes and words of the initial lexicon. We have evaluated our method both on corpora generated by CFGs and on natural language corpora. The performance of ADIOS is judged by both recall (acceptance of correct novel sentences) and precision (production of correct novel sentences). The results are very encouraging.
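The two evaluation measures named in this abstract can be sketched as follows, under the simplifying assumption that each grammar is available as an acceptance predicate. Here plain sets of strings stand in for the target and learned grammars; the function names, the sets, and the sentences are all hypothetical and are not taken from the ADIOS paper.

```python
def recall(learned_accepts, target_sentences):
    """Fraction of correct (target-language) sentences the learned grammar accepts."""
    hits = sum(1 for s in target_sentences if learned_accepts(s))
    return hits / len(target_sentences)


def precision(target_accepts, generated_sentences):
    """Fraction of sentences produced by the learned grammar that are correct."""
    hits = sum(1 for s in generated_sentences if target_accepts(s))
    return hits / len(generated_sentences)


# Toy stand-ins for the two languages (assumed for illustration only)
target_language = {"a b", "a b b", "a a b"}
learned_language = {"a b", "a b b", "b a"}

r = recall(lambda s: s in learned_language, sorted(target_language))
p = precision(lambda s: s in target_language, sorted(learned_language))
```

In a real evaluation the sentences would be novel ones held out from training, and the membership tests would be replaced by parsing with the respective grammars.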
Recursive data mining for role identification
In 5th International Conference on Soft Computing as Transdisciplinary Science and Technology, 2008
Cited by 2 (2 self)
We present a text mining approach that extends the standard authorship assessment problem (in which the author of a text needs to be established) to role identification in communications within an Internet community. More precisely, we want to recognize a group of authors communicating in a specific role within such a community rather than a single author. The challenge here is that the same author may participate in different roles in communications within the group, with different authors as peers in each role. An additional challenge is the length of communications. Each individual exchange in our intended domain, communications within an Internet community, is relatively short, on the order of several dozen words, so standard text mining approaches may fail. An example of such a problem is recognizing roles in a collection of emails from an organization in which middle-level managers communicate with both superiors and subordinates. To validate our approach we use the Enron email dataset, which is such a collection. Our approach is based on discovering patterns at varying degrees of abstraction in a hierarchical fashion. Such a discovery process allows for a certain degree of approximation in matching patterns, which is necessary for capturing nontrivial structures in realistic datasets. The discovered patterns are used as features to build efficient classifiers. Due to the nature of the pattern discovery process, we call our approach Recursive Data Mining. The results show that a classifier that uses the dominant patterns discovered by Recursive Data Mining performs well in role detection.
The Power and Perils of MDL
Cited by 2 (0 self)
We point out a potential weakness in the application of the celebrated Minimum Description Length (MDL) principle for model selection. Specifically, it is shown that, although the index of the model class which actually minimizes a two-part code has many desirable properties, a model which has a shorter two-part code length than another is not necessarily better (unless, of course, it achieves the global minimum). This is illustrated by an application to inferring a grammar (DFA) from positive examples. We also analyze computability issues and robustness under recoding of the data. Generally, the classical approach is inadequate to express the goodness-of-fit of individual models for individual data sets. In practice, however, this is precisely what we are interested in: both to express the goodness of a procedure and where and how it can fail. To achieve this practical goal, we paradoxically have to use the supposedly impractical vehicle of Kolmogorov complexity.
Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
Cited by 2 (0 self)
We consider a new subproblem of unsupervised parsing from raw text, unsupervised partial parsing—the unsupervised version of text chunking. We show that addressing this task directly, using probabilistic finite-state methods, produces better results than relying on the local predictions of a current best unsupervised parser, Seginer’s (2007) CCL. These finite-state models are combined in a cascade to produce more general (full-sentence) constituent structures; doing so outperforms CCL by a wide margin in unlabeled PARSEVAL scores for English, German and Chinese. Finally, we address the use of phrasal punctuation
From exemplar to grammar: Integrating analogy and probability in language learning
2008
Cited by 1 (1 self)
We present a new model of language learning which is based on the following idea: if a language learner does not know which phrase-structure trees should be assigned to initial sentences, s/he allows (implicitly) for all possible trees and lets linguistic experience decide which is the ‘best’ tree for each sentence. The best tree is obtained by maximizing ‘structural analogy’ between a sentence and previous sentences, which is formalized by the most probable shortest combination of subtrees from all trees of previous sentences. Corpus-based experiments with this model on the Penn Treebank and the CHILDES database indicate that it can learn both exemplar-based and rule-based aspects of language, ranging from phrasal verbs to auxiliary fronting. By having learned the syntactic structures of sentences, we have also learned the grammar implicit in these structures, which can in turn be used to produce new sentences. We show that our model mimics children’s language development from item-based constructions to abstract constructions, and that the model can simulate some of the errors made by children in producing complex questions.
Learning Automata on Protein Sequences
Cited by 1 (0 self)
Pattern discovery is limited to position-specific characterizations such as Prosite’s patterns or profile-HMMs, which are unable to handle, for instance, dependencies between amino acids that are distant in the sequence of a protein but close in its three-dimensional structure. To overcome these limitations, we propose to learn automata on proteins. Inspired by grammatical inference and multiple alignment techniques, we introduce a sequence-driven approach based on the idea of merging ordered partial local multiple alignments (PLMA) under preservation or consistency constraints and on an identification of informative positions with respect to physico-chemical properties. The quality of the characterization is assessed experimentally on two difficult sets of proteins by a comparison with (semi-)manually designed Prosite patterns and with state-of-the-art pattern discovery algorithms. Further leave-one-out experiments show that learning more precise automata improves accuracy by increasing the classification margins.

