Results 1  10
of
12
Suffix arrays: A new method for online string searches
 SIAM J. Comput
, 1993
"... Abstract. A new and conceptually simple data structure, called a suffix array, for online string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix tree ..."
Abstract

Cited by 642 (1 self)
 Add to MetaCart
Abstract. A new and conceptually simple data structure, called a suffix array, for online string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit online string searches of the type, "Is W a substring of A? " to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O (N) time in the worst case, versus O (N log N) time for suffix arrays. However, an augmented algorithm is given that, regardless of the alphabet size, constructs suffix arrays in O (N) expected time, albeit with lesser.space efficiency. It is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 401 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
Complete inverted files for efficient text retrieval and analysis
 Journal of the ACM
, 1987
"... Abstract. Given a finite set of texts S = (wi, *.., wk) over some fixed finite alphabet 2, a complete inverted tile for S is an abstract data type that provides the functionsfind ( which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of t ..."
Abstract

Cited by 59 (1 self)
 Add to MetaCart
Abstract. Given a finite set of texts S = (wi, *.., wk) over some fixed finite alphabet 2, a complete inverted tile for S is an abstract data type that provides the functionsfind ( which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of times w occurs in S, and locations(w), which returns the set of positions where w occurs in S. A data structure. that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniformcost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the s&ix tree.
Suffix Trees and their Applications in String Algorithms
, 1993
"... : The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
: The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". y Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
A Linear Time, Constant Space Differencing Algorithm
 In Performance, Computing, and Communication Conference (IPCCC
, 1997
"... An efficient differencing algorithm can be used to compress version of files for both transmission over low bandwidth channels and compact storage. This can greatly reduce network traffic and execution time for distributed applications which include software distribution, source code control, file s ..."
Abstract

Cited by 15 (4 self)
 Add to MetaCart
An efficient differencing algorithm can be used to compress version of files for both transmission over low bandwidth channels and compact storage. This can greatly reduce network traffic and execution time for distributed applications which include software distribution, source code control, file system replication, and data backup and restore. An algorithm for such applications needs to be both general and efficient; able to compress binary inputs in linear time. We present such an algorithm for differencing files at the granularity of a byte. The algorithm uses constant memory and handles arbitrarily large input files. While the algorithm makes minor sacrifices in compression to attain linear runtime performance, it outperforms the bytewise differencing algorithms that we have encountered in the literature on all inputs. I. INTRODUCTION Differencing algorithms compress data by taking advantage of statistical correlations between different versions of the same data sets. Strictly ...
Differential Compression: A Generalized Solution For Binary Files
, 1996
"... Differential Compression: A Generalized Solution for Binary Files by Randal C. Burns This work presents the development and analysis of a family of algorithms for generating differentially compressed output from binary sources. The algorithms all perform the same fundamental task: given two versi ..."
Abstract

Cited by 15 (0 self)
 Add to MetaCart
Differential Compression: A Generalized Solution for Binary Files by Randal C. Burns This work presents the development and analysis of a family of algorithms for generating differentially compressed output from binary sources. The algorithms all perform the same fundamental task: given two versions of the same data as input streams, generate and output a compact encoding of one of the input streams by representing it as a set of changes with respect to the other input stream. Differential compression provides a computationally efficient compression technique for applications that generate versioned data and we often expect differencing to produce a significantly more compact file than more traditional compression techniques. The greedy algorithm for file differencing is presented and this algorithm is proven to produce the optimally compressed differential output. However, this algorithm requires execution time quadratic in the size of the input files. We next present an algorithm...
PROSIMA: Protein Similarity Algorithm
"... In this article we present a novel algorithm for measuring protein similarity based on their three dimensional structure (protein tertiary structure). The PROSIMA algorithm using suffix tress for discovering common parts of mainchains of all proteins appearing in current NCSB Protein Data Bank (PDB ..."
Abstract
 Add to MetaCart
In this article we present a novel algorithm for measuring protein similarity based on their three dimensional structure (protein tertiary structure). The PROSIMA algorithm using suffix tress for discovering common parts of mainchains of all proteins appearing in current NCSB Protein Data Bank (PDB). By identifying these common parts we build a vector model and next use classical information retrieval tasks based on the vector model to measure the similarity between proteins all to all protein similarity. For the calculation of protein similarity we are using tfidf term weighing schema and cosine similarity measure. The goal of this work to use the whole current PDB database (downloaded on June 2009) of known proteins, not just some kinds of selections of this database, which have been studied in other works. We have chose the SCOP database for verification of precision of our algorithm because it is maintained primarily by humans. The next success of this work is to be able to determine protein SCOP categories of proteins not included in the latest version of the SCOP database (v. 1.75) with nearly 100 % precision. 1
unknown title
"... This article presents a novel method for measuring protein similarity based on their tertiary structure. The new method deals with suffix trees and classical information retrieval tasks, such as the vector space model, using tfidf term weighing schema or using various types of similarity measures. ..."
Abstract
 Add to MetaCart
This article presents a novel method for measuring protein similarity based on their tertiary structure. The new method deals with suffix trees and classical information retrieval tasks, such as the vector space model, using tfidf term weighing schema or using various types of similarity measures. Our goal to use the whole PDB database of known proteins, not just some kinds of selections, which have been studied in other works. For verification of our algorithm we are using comparisons with the SCOP database, which is maintained primarily by humans. The next goal is to be able to categorize proteins not included in the latest version of the SCOP database with nearly 100 % accuracy. 1
YAPS: Yet Another Protein Similarity
 INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION
, 2009
"... In this article we present a novel method for measuring protein similarity based on their tertiary structure. Our new method deals with suffix trees and classical information retrieval tasks, such as the vector space model, using tfidf term weighing schema or using various types of similarity measu ..."
Abstract
 Add to MetaCart
In this article we present a novel method for measuring protein similarity based on their tertiary structure. Our new method deals with suffix trees and classical information retrieval tasks, such as the vector space model, using tfidf term weighing schema or using various types of similarity measures. Our goal to use the whole PDB database of known proteins, not just some kinds of selections, which have been studied in other works. For verification of our algorithm we are using comparisons with the SCOP database which is maintained primarily by humans. The next goal is to be able to categorize proteins not included in the latest version of the SCOP database with nearly 100 % accuracy.