Results 1 - 10
of
61
An efficient, probabilistically sound algorithm for segmentation and word discovery
- MACHINE LEARNING
, 1999
"... This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstract ..."
Abstract
-
Cited by 103 (2 self)
- Add to MetaCart
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
Compressing XML with Multiplexed Hierarchical PPM Models
- In Data Compression Conference
, 2001
"... this paper, we will describe alternative approaches to XML compression that illustrate other tradeos between speed and eectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: a ..."
Abstract
-
Cited by 71 (3 self)
- Add to MetaCart
this paper, we will describe alternative approaches to XML compression that illustrate other tradeos between speed and eectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) [5] called Multiplexed Hierarchical Modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.
Pattern Processing in Melodic Sequences: Challenges, Caveats and Prospects
- In Proceedings of the AISB'99 Convention (Arti Intelligence and Simulation of Behaviour
, 1999
"... In this paper a number of issues relating to the application of string processing techniques on musical sequences are discussed. A brief survey of some musical string processing algorithms is given and some issues of melodic representation, abstraction, segmentation and categorisation are presented. ..."
Abstract
-
Cited by 24 (9 self)
- Add to MetaCart
In this paper a number of issues relating to the application of string processing techniques on musical sequences are discussed. A brief survey of some musical string processing algorithms is given and some issues of melodic representation, abstraction, segmentation and categorisation are presented. This paper is not intended towards providing solutions to string processing problems but rather towards highlighting possible stumbling-block areas and raising awareness of primarily music-related particularities that can cause problems in matching applications. 1.
Structures of String Matching and Data Compression
, 1999
"... This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations. We study the suffix tree data st ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations. We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words. By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source
A Scalable System for Identifying Co-Derivative Documents
- In Proceedings of the Symposium on String Processing and Information Retrieval
, 2004
"... Abstract. Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is do ..."
Abstract
-
Cited by 22 (9 self)
- Add to MetaCart
Abstract. Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying coderivative clusters, and describe deco, a prototype system which makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach. 1
Motivation for variable length intervals and hierarchical phase behavior
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
"... Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior tec ..."
Abstract
-
Cited by 21 (6 self)
- Add to MetaCart
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint. 1
Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform -- Part One: Without Context Models
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2000
"... A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. In ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. In this paper, a greedy grammar transform is first presented; this grammar transform constructs sequentially a sequence of irreducible grammars from which the original data sequence can be recovered incrementally. Based on this grammar transform, three universal lossless data compression algorithms, a sequential algorithm, an improved sequential algorithm, and a hierarchical algorithm, are then developed. These algorithms combine the power of arithmetic coding with that of string matching. It is shown that these algorithms are all universal in the sense that they can achieve asymptotically the entropy rate of any stationary, ergodic source. Moreover, it is proved that their worst case redundancies among all individual sequences of length are upper-bounded by �� � �� � �� � , where is a constant. Simulation results show that the proposed algorithms outperform the Unix Compress and Gzip algorithms, which are based on LZ78 and LZ77, respectively.
The Smallest Grammar Problem
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
"... This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addi ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem’s inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address short-comings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569 unless P = NP. 8568 We then bound approximation ratios for several of the bestknown grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n 1/2). We finish by presenting two novel algorithms with exponentially better ratios of O(log 3 n) and O(log(n/m ∗)), where m ∗ is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
Contentful Mental States for Robot Baby
, 2002
"... In this paper we claim that meaningful representations can be learned by programs, although today they are almost always designed by skilled engineers. We discuss several kinds of meaning that representations might have, and focus on a functional notion of meaning as appropriate for programs to ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
In this paper we claim that meaningful representations can be learned by programs, although today they are almost always designed by skilled engineers. We discuss several kinds of meaning that representations might have, and focus on a functional notion of meaning as appropriate for programs to learn. Specifically, a representation is meaningful if it incorporates an indicator of external conditions and if the indicator relation informs action. We survey methods for inducing kinds of representations we call structural abstractions. Prototypes of sensory time series are one kind of structural abstraction, and though they are not denoting or compositional, they do support planning. Deictic representations of objects and prototype representations of words enable a program to learn the denotational meanings of words. Finally, we discuss two algorithms designed to find the macroscopic structure of episodes in a domain-independent way.
Grammar Based Codes: A New Class of Universal Lossless Source Codes
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2000
"... We investigate a type of lossless source code called a grammar based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar Gx representing x in the sense that x is the unique string belonging to the language generated by Gx. Lossless compres ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
We investigate a type of lossless source code called a grammar based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar Gx representing x in the sense that x is the unique string belonging to the language generated by Gx. Lossless compression of x takes place indirectly via compression of the production rules of the grammar Gx. It is shown that, subject to some mild restrictions, a grammar based code is a universal code with respect to the family of finite state information sources over the finite alphabet. Redundancy bounds for grammar based codes are established. Reduction rules for designing grammar based codes are presented.

