Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
- ACM Transactions on Information Systems, 1989
Cited by 21 (3 self)
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
Is Huffman Coding Dead?
- Computing, 1993
Cited by 16 (3 self)
In recent publications about data compression, arithmetic codes are often suggested as the state of the art, rather than the more popular Huffman codes. While it is true that Huffman codes are not optimal in all situations, we show that the advantage of arithmetic codes in compression performance is often negligible. Referring also to other criteria, we conclude that for many applications, Huffman codes should still remain a competitive choice. 1. Introduction It is paradoxical that, as the technology for storing and transmitting information has gotten cheaper and more effective, interest in data compression has increased. There are many explanations, but most conspicuous is that improvements in media have expanded our sense of what we wish to store. For example, CD-ROM technology allows us to store whole libraries instead of records describing individual items; but the requirements of storing full text easily exceed the capabilities even of the optical format. Similarly, there is ...
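The comparison in this abstract can be illustrated with a small sketch. It is not taken from the paper: the two probability distributions are made up, and the entropy is used only as a stand-in for the rate an ideal arithmetic coder could approach. The function names are hypothetical.

```python
import heapq
import math
from itertools import count


def huffman_code_lengths(probs):
    """Return the Huffman codeword length of each symbol, in input order."""
    if len(probs) == 1:
        return [1]
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every symbol below the merged node moves one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average_length(probs, lengths):
    """Expected codeword length in bits per symbol."""
    return sum(p * n for p, n in zip(probs, lengths))


# Assumed example distributions (not from the paper).
dyadic = [0.5, 0.25, 0.125, 0.125]   # probabilities are powers of 1/2
skewed = [0.9, 0.1]                  # one very likely symbol

for probs in (dyadic, skewed):
    lengths = huffman_code_lengths(probs)
    gap = average_length(probs, lengths) - entropy(probs)
    print(lengths, round(gap, 3))
```

For the dyadic distribution the Huffman code already meets the entropy, so there is nothing left to gain. For the skewed two-symbol case Huffman must spend a full bit per symbol, which is the kind of situation where an arithmetic code can do noticeably better.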
Bounding the Depth of Search Trees
- The Computer Journal, 1993
Cited by 15 (5 self)
For an ordered sequence of n weights, Huffman's algorithm constructs in time and space O(n) a search tree with minimum average path length, or, which is equivalent, a minimum redundancy code. However, if an upper bound B is imposed on the length of the codewords, the best known algorithms for the construction of an optimal code have time and space complexities O(Bn^2). A new algorithm is presented, which yields sub-optimal codes, but in time O(n log n) and space O(n). Under certain conditions, these codes are shown to be close to optimal, and extensive experiments suggest that in many practical applications, the deviation from the optimum is negligible. 1. Motivation and Introduction We consider the set B(n; b) of extended binary trees with n leaves, labelled 1 to n, and with depth b, henceforth called b-restricted trees. An extended binary tree is a binary tree in which every internal node has two sons (here, and in what follows, we use the terminology of Knuth [16, pp. 399--...
Skeleton Trees for the Efficient Decoding of Huffman Encoded Texts
- Information Retrieval, 1997
Cited by 10 (4 self)
A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes. The storage requirements are much lower than for conventional Huffman trees, O(log^2 n) for trees of depth O(log n), and decoding is faster, because a part of the bit-comparisons necessary for the decoding may be saved. Empirical results on large real-life distributions show a reduction of up to 50% and more in the number of bit operations. The basic idea is then generalized, yielding further savings. This is an extended version of a paper which has been presented at the 8th Annual Symposium on Combinatorial Pattern Matching (CPM'97), and appeared in its proceedings, pp. 65-75.
Hebrew computational linguistics: Past and future
- Artificial Intelligence Review, 2004
Cited by 8 (2 self)
This paper reviews the current state of the art in Natural Language Processing for Hebrew, both theoretical and practical. The Hebrew language, like other Semitic languages, poses special challenges for developers of programs for natural language processing: the writing system, rich morphology, unique word formation process of roots and patterns, lack of linguistic corpora that document language usage, all contribute to making computational approaches to Hebrew challenging. The paper briefly reviews the field of computational linguistics and the problems it addresses, describes the special difficulties inherent to Hebrew (as well as to other Semitic languages), surveys a wide variety of past and ongoing works and attempts to characterize future needs and possible solutions.
Robust Universal Complete Codes for Transmission and Compression
- Discrete Applied Mathematics, 1996
Cited by 8 (4 self)
Several measures are defined and investigated, which allow the comparison of codes as to their robustness against errors. Then new universal and complete sequences of variable-length codewords are proposed, based on representing the integers in a binary Fibonacci numeration system. Each sequence is constant and need not be generated for every probability distribution. These codes can be used as alternatives to Huffman codes when the optimal compression of the latter is not required, and simplicity, faster processing and robustness are preferred. The codes are compared on several "real-life" examples. 1. Motivation and Introduction Let A = {A_1, A_2, ..., A_n} be a finite set of elements, called cleartext elements, to be encoded by a static uniquely decipherable (UD) code. For notational ease, we use the term 'code' as abbreviation for 'set of codewords'; the corresponding encoding and decoding algorithms are always either given or clear from the context. A code i...
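To make the idea of codewords built on a Fibonacci numeration system concrete, here is a minimal sketch of the classical Fibonacci code, in which every integer is written as a sum of non-consecutive Fibonacci numbers and terminated by an extra 1-bit. This is an assumption about the general flavor of such codes, not a reproduction of the specific sequences proposed in the paper; the function names are hypothetical.

```python
def fibonacci_encode(n):
    """Encode a positive integer as a Fibonacci codeword (string of bits).

    Bits are listed from the smallest Fibonacci number (1, 2, 3, 5, ...)
    upward, followed by a terminating '1', so every codeword ends in '11'.
    """
    if n < 1:
        raise ValueError("only positive integers can be encoded")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n
    bits = []
    remainder = n
    # Greedy choice from the largest term down never picks two neighbors.
    for f in reversed(fibs):
        if f <= remainder:
            bits.append("1")
            remainder -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"


def fibonacci_decode(bitstring):
    """Decode a concatenation of Fibonacci codewords into a list of integers."""
    values = []
    fibs = [1, 2]
    total, index, previous = 0, 0, "0"
    for bit in bitstring:
        if bit == "1" and previous == "1":
            # Two adjacent 1-bits mark the end of a codeword.
            values.append(total)
            total, index, previous = 0, 0, "0"
            continue
        if index >= len(fibs):
            fibs.append(fibs[-1] + fibs[-2])
        if bit == "1":
            total += fibs[index]
        index += 1
        previous = bit
    return values


encoded = "".join(fibonacci_encode(n) for n in [1, 2, 4])
print(encoded)
print(fibonacci_decode(encoded))
```

The codeword set is fixed in advance, so nothing has to be built for a given probability distribution. Because each codeword ends at the first pair of adjacent 1-bits, a decoder can find codeword boundaries again after a corrupted bit, which is one intuition behind the robustness such codes aim for.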
Using Bitmaps for Medium Sized Information Retrieval Systems
- Information Processing & Management, 1990
Cited by 6 (3 self)
We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, efficient and, relative to the customary concordance approach, inexpensive in storage costs. 1. Introduction Our ability to control textual information is being strongly influenced by a variety of technological advances. These include new means of storing and sharing information that make possible and realistic an information system model in which large bodies of full text are compactly stored, widely distributed, and shared by a large number of interested persons. Such changes require a careful search for techniques that promise convenient and effective access to such textual databases. The research that is required in this environment differs from that traditional in Information Retrieval (IR)...
Information Retrieval from Annotated Texts
- J. Am. Soc. Inf. Sci, 1998
Cited by 3 (0 self)
Methods for the correct and efficient handling of annotations in a full-text retrieval system are investigated. The problem with annotations is that they cannot be treated as regular text, since this would disrupt proximity searches, but on the other hand, they cannot be ignored, as they may carry important information. Moreover, in some cases, a user may wish to restrict a search to prespecified subsets of annotations. We suggest a new way of processing the database to overcome the above dilemma. Keywords: Full-text information retrieval systems, annotations, access methods, inverted files, concordances, proximity searches.
The Responsa Storage and Retrieval System - Whither?
- 1996
p. 173). We did develop such a tool [CCDFS1971]. As each of these methods has certain advantages and disadvantages, we ended up by merging them into a joint analysis-synthesis method; a global analysis of all words in the database is done, but without prepositions (otiyot shimush), in order to end up with a database of manageable size; the prepositions are left to the synthesis phase. See [AFCS1972] for full details. I also set up a "Committee for the Mechanization in Jewish Law Research" whose first members were, I think, Dr. Choueka, Mr. Asa Kasher, later professor of Philosophy at Tel Aviv University, Mr. Joseph Dueck, a young lawyer and research assistant at the IRJL, who served as their representative, and assistants, to formulate procedures for preediting and postediting texts to be inputted, and various algorithms needed for the work. (Many other persons, such as Mr. Reuven Mirkin of the Academy of the Hebrew Language, and research students, joined later.) I also felt ...
Modeling Word Occurrences for the Compression of Concordances
An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the 1-bits represent the occurrence of given terms. Hidden Markov models (HMM) are used to describe the clustering of the 1-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph theoretic reduction and complementation operations are defined among the various models, and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Fr...
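The bitmap view of a concordance described in this abstract can be sketched in a few lines. The toy documents, the whitespace tokenization, and the function names are all assumptions made for illustration; the sketch shows only the representation and the runs of 1-bits that clustering produces, not the Markov models or the compression method of the paper.

```python
def build_bitmaps(documents):
    """Map each term to an integer bitmap; bit i is set if document i contains it."""
    bitmaps = {}
    for doc_id, text in enumerate(documents):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps


def docs_with_all(bitmaps, terms):
    """Return the ids of documents that contain every term (bitwise AND)."""
    if not terms:
        return []
    result = bitmaps.get(terms[0], 0)
    for term in terms[1:]:
        result &= bitmaps.get(term, 0)
    return [i for i in range(result.bit_length()) if (result >> i) & 1]


def one_runs(bitmap, n_docs):
    """Lengths of maximal runs of 1-bits, scanning documents 0 .. n_docs - 1."""
    runs, current = [], 0
    for i in range(n_docs):
        if (bitmap >> i) & 1:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Assumed toy collection, one string per document.
documents = [
    "in the beginning",
    "the earth was without form",
    "and the spirit moved",
    "let there be light",
]
bitmaps = build_bitmaps(documents)
print(bin(bitmaps["the"]))
print(docs_with_all(bitmaps, ["the", "was"]))
print(one_runs(bitmaps["the"], len(documents)))
```

Long runs of 1-bits, such as the single run for "the" in this toy collection, are the clustering that an independence assumption ignores and that a Markov-style model can take into account.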

