@MISC{Brisaboa_databaselab.,, author = {Nieves R. Brisaboa and Antonio Fariña and Susana Ladra and Gonzalo Navarro}, title = {Database Lab.,}, year = {} }
Abstract
Recent research has demonstrated beyond doubt the benefits of compressing natural language texts using word-based statistical semistatic compression. Not only does it achieve extremely competitive compression rates, but direct search on the compressed text can also be carried out faster than on the original text; indexing based on inverted lists benefits from compression as well. Such compression methods assign a variable-length codeword to each distinct text word. Some coding methods (Plain Huffman and Restricted Prefix Byte Codes) do not clearly mark codeword boundaries, and hence can neither be accessed at random positions nor searched with the fastest text search algorithms. Other coding methods (Tagged Huffman, End-Tagged Dense Code, or (s, c)-Dense Code) do mark