Abstract:
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing each repeated phrase with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that are linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real-world sequences.

1. Introduction

Many sequences of discrete symbols exhibit natural hierarchical structure. Text is made up of paragraphs, sentences, phrases, and words. Music is composed from major sections, motifs, bars, and notes. Records of ...
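The core idea in the abstract, replacing repeated phrases with rules and repeating the process on the result, can be illustrated with a small sketch. This is not the incremental, linear-time SEQUITUR algorithm itself: it is an offline, greedy pair-replacement scheme (in the style of Re-Pair) chosen because it fits in a few lines. It only approximates the two constraints, which the full paper calls digram uniqueness and rule utility. The function names `count_digrams`, `infer_grammar`, `enforce_rule_utility`, and `expand` are hypothetical, as is the convention that terminals are single-character strings and nonterminals are integers.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the previous match, e.g. the middle of "aaa"
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace each left-to-right, non-overlapping occurrence of pair."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def enforce_rule_utility(seq, rules):
    """Inline every rule that is referenced only once (assumed reading of
    'rule utility'); returns the updated start sequence."""
    while True:
        uses = Counter(
            s for body in [seq, *rules.values()] for s in body
            if isinstance(s, int)
        )
        once = [r for r in rules if uses[r] == 1]
        if not once:
            return seq
        r = once[0]
        body = rules.pop(r)

        def inline(symbols):
            return [x for s in symbols for x in (body if s == r else [s])]

        seq = inline(seq)
        for key in rules:
            rules[key] = inline(rules[key])


def infer_grammar(text):
    """Offline sketch: repeatedly turn the most frequent repeated pair
    into a rule, then inline rules that ended up used only once."""
    seq = list(text)   # terminals: single-character strings
    rules = {}         # nonterminal (int) -> list of symbols
    next_id = 1
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break      # no pair repeats any more
        rules[next_id] = list(pair)
        seq = replace_pair(seq, pair, next_id)
        next_id += 1
    seq = enforce_rule_utility(seq, rules)
    return seq, rules


def expand(symbols, rules):
    """Regenerate the original string from a symbol list."""
    return "".join(
        expand(rules[s], rules) if isinstance(s, int) else s
        for s in symbols
    )


if __name__ == "__main__":
    start, rules = infer_grammar("abcabdabcabd")
    print("S ->", start)
    for name, body in rules.items():
        print(name, "->", body)
```

Each pass rescans the whole sequence, so this sketch is not linear in the input size; the incremental operation and linear bounds claimed in the abstract depend on the bookkeeping of the actual algorithm, which is not reproduced here.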