Results 1-10 of 69
An efficient, probabilistically sound algorithm for segmentation and word discovery
 MACHINE LEARNING
, 1999
Cited by 142 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
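To make the task in this abstract concrete, the following is a minimal sketch of recovering word boundaries from unspaced text. It is not the algorithm of the paper: it assumes a fixed, known lexicon with per-word probabilities (the hypothetical `word_probs` argument) and uses a plain unigram dynamic program, whereas the abstract describes a model with an unknown lexicon that does not estimate a distribution on words. The function name `segment` and the `max_len` parameter are likewise illustrative assumptions.

```python
import math


def segment(text, word_probs, max_len=10):
    """Return the most probable split of text into lexicon words, or None.

    Assumption: word_probs maps each known word to a probability in (0, 1].
    This is a unigram Viterbi-style search, used only as an illustration.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start] > -math.inf:
                score = best[start] + math.log(word_probs[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # no segmentation into lexicon words exists
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With a toy lexicon such as `{"the": 0.3, "dog": 0.2, "barks": 0.2}`, calling `segment("thedogbarks", ...)` splits the string at the two word boundaries. The harder problem addressed by the paper is doing this when the lexicon itself must be discovered.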
Compressing XML with Multiplexed Hierarchical PPM Models
 In Data Compression Conference
, 2001
Cited by 82 (3 self)
In this paper, we will describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) [5] called Multiplexed Hierarchical Modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.
Pattern Processing in Melodic Sequences: Challenges, Caveats and Prospects
 In Proceedings of the AISB'99 Convention (Artificial Intelligence and Simulation of Behaviour)
, 1999
Cited by 30 (11 self)
In this paper a number of issues relating to the application of string processing techniques on musical sequences are discussed. A brief survey of some musical string processing algorithms is given and some issues of melodic representation, abstraction, segmentation and categorisation are presented. This paper is not intended towards providing solutions to string processing problems but rather towards highlighting possible stumbling-block areas and raising awareness of primarily music-related particularities that can cause problems in matching applications.
A Scalable System for Identifying Co-Derivative Documents
 In Proceedings of the Symposium on String Processing and Information Retrieval
, 2004
Cited by 29 (9 self)
Abstract. Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system which makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
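The fingerprinting idea mentioned in this abstract (matching documents on hash values of chunks) can be sketched as follows. This is a simplified illustration, not spex or deco: it assumes chunks are runs of `n` consecutive words, hashes every chunk rather than a selected subset, and uses MD5 purely as a convenient hash. All function names here are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunks(text, n=3):
    """Yield every run of n consecutive words (a 'chunk') in text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def fingerprint(text, n=3):
    """Return the set of hash values of all n-word chunks in text."""
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks(text, n)}


def shared_chunk_counts(docs, n=3):
    """Map each pair of document ids to the number of chunk hashes they share.

    docs is assumed to be a dict of document id -> text.
    """
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for h in fingerprint(text, n):
            postings[h].add(doc_id)
    counts = defaultdict(int)
    for ids in postings.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                counts[(a, b)] += 1
    return dict(counts)
```

Pairs with a high shared-chunk count are candidates for being co-derivative. Hashing every chunk, as this sketch does, does not scale to large collections, which is the kind of cost that selecting only useful chunks is meant to avoid.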
Structures of String Matching and Data Compression
, 1999
Cited by 29 (0 self)
This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations. We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words. By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ77 algorithm. Furthermore, considering predictive source
Motivation for variable length intervals and hierarchical phase behavior
 In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
Cited by 26 (6 self)
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint.
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Cited by 24 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem’s inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and REPAIR. Among these, the best upper bound we show is O(n^{1/2}). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
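As an illustration of what a grammar that generates exactly one string looks like, here is a minimal sketch of greedy pair replacement in the spirit of the REPAIR family named above. It is an assumption-laden toy, not the algorithm analysed in the paper: pair occurrences are counted naively (overlapping occurrences such as those in "aaa" are over-counted), nonterminals are represented as hypothetical `("N", k)` tuples, and grammar size is taken to be the total number of symbols on all right-hand sides.

```python
from collections import Counter


def repair(text):
    """Greedy pair replacement; returns (start_sequence, rules).

    rules maps each nonterminal ("N", k) to the pair of symbols it expands to.
    """
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so stop
        name = ("N", next_id)
        next_id += 1
        rules[name] = pair
        out = []
        i = 0
        while i < len(seq):
            # Replace non-overlapping occurrences, scanning left to right.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the string it derives."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol


def grammar_size(seq, rules):
    """Total number of symbols on the right-hand sides of all rules."""
    return len(seq) + sum(len(rhs) for rhs in rules.values())
```

Expanding every symbol of the start sequence reproduces the input, so the grammar derives exactly one string. The approximation ratio discussed in the abstract compares the size of such a grammar with the size of the smallest possible one for the same input.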
Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform  Part One: Without Context Models
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2000
Cited by 21 (4 self)
A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. In this paper, a greedy grammar transform is first presented; this grammar transform constructs sequentially a sequence of irreducible grammars from which the original data sequence can be recovered incrementally. Based on this grammar transform, three universal lossless data compression algorithms, a sequential algorithm, an improved sequential algorithm, and a hierarchical algorithm, are then developed. These algorithms combine the power of arithmetic coding with that of string matching. It is shown that these algorithms are all universal in the sense that they can achieve asymptotically the entropy rate of any stationary, ergodic source. Moreover, it is proved that their worst case redundancies among all individual sequences of length n are upper-bounded by c log log n / log n, where c is a constant. Simulation results show that the proposed algorithms outperform the Unix Compress and Gzip algorithms, which are based on LZ78 and LZ77, respectively.
Contentful Mental States for Robot Baby
, 2002
Cited by 18 (3 self)
In this paper we claim that meaningful representations can be learned by programs, although today they are almost always designed by skilled engineers. We discuss several kinds of meaning that representations might have, and focus on a functional notion of meaning as appropriate for programs to learn. Specifically, a representation is meaningful if it incorporates an indicator of external conditions and if the indicator relation informs action. We survey methods for inducing kinds of representations we call structural abstractions. Prototypes of sensory time series are one kind of structural abstraction, and though they are not denoting or compositional, they do support planning. Deictic representations of objects and prototype representations of words enable a program to learn the denotational meanings of words. Finally, we discuss two algorithms designed to find the macroscopic structure of episodes in a domain-independent way.
Detecting software theft via whole program path birthmarks
 In: Information Security
, 2004
Cited by 17 (0 self)
Abstract. A software birthmark is a unique characteristic of a program that can be used as a software theft detection technique. In this paper we present and empirically evaluate a novel birthmarking technique — Whole Program Path Birthmarking — which uniquely identifies a program based on a complete control flow trace of its execution. To evaluate the strength of the proposed technique we examine two important properties: credibility and tolerance against program transformations such as optimization and obfuscation. Our evaluation demonstrates that, for the detection of theft of an entire program, Whole Program Path birthmarks are more resilient to attack than previously proposed techniques. In addition, we illustrate several instances where a birthmark can be used to identify program theft even when an embedded watermark was destroyed by program transformation.