Results 1–10 of 30
Fast Algorithms for Sorting and Searching Strings
, 1997
"... We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching a ..."
Abstract

Cited by 148 (0 self)
 Add to MetaCart
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.

1. Introduction
Section 2 briefly reviews Hoare's [9] Quicksort and binary search trees. We emphasize a well-known isomorphism relating the two, and summarize other basic facts. The multikey algorithms and data structures are presented in Section 3. Multikey Quicksort orders a set of n vectors with k components each. Like regular Quicksort, it partitions its input into...
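The core of multikey Quicksort can be sketched in a few lines (an illustrative Python rendering; the paper derives tuned C code, and the naive pivot choice and list copies here are simplifications of this sketch, not the paper's method):

```python
def multikey_sort(strings, d=0):
    """Sort strings by three-way partitioning on the character at depth d."""
    if len(strings) <= 1:
        return strings
    # Character at depth d of the first string serves as the pivot;
    # end-of-string is represented by the empty sentinel "".
    pivot = strings[0][d] if d < len(strings[0]) else ""
    lt, eq, gt = [], [], []
    for s in strings:
        c = s[d] if d < len(s) else ""
        (lt if c < pivot else gt if c > pivot else eq).append(s)
    # Strings equal on character d recurse at depth d + 1; strings
    # that have ended (pivot == "") are already in their final place.
    middle = eq if pivot == "" else multikey_sort(eq, d + 1)
    return multikey_sort(lt, d) + middle + multikey_sort(gt, d)

print(multikey_sort(["banana", "band", "ban", "apple", "bandana"]))
# ['apple', 'ban', 'banana', 'band', 'bandana']
```

Unlike regular Quicksort, equal characters are not re-examined: the middle partition advances to the next character, which is what makes the blend with radix sort pay off on long shared prefixes.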
Fast Text Searching for Regular Expressions or Automaton Searching on Tries
"... We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in subline ..."
Abstract

Cited by 49 (6 self)
 Add to MetaCart
We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in sublinear expected time for any regular expression. This is the first such algorithm to be found with this complexity.
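The automaton-on-trie idea can be sketched as follows (a toy Python version: a hand-built DFA and an uncompressed suffix trie stand in for the paper's automaton construction and Patricia tree):

```python
def build_suffix_trie(text):
    """Trie of all suffixes of `text` (a toy stand-in for a Patricia tree)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def automaton_search(trie, delta, start, accepting):
    """DFS down the trie carrying the automaton state; dead states prune
    whole subtrees, which is the source of the sublinear behaviour."""
    hits = []
    def visit(node, state, prefix):
        if state in accepting:
            hits.append(prefix)          # some suffix begins with a match
            return
        for ch, child in node.items():
            nxt = delta.get((state, ch))
            if nxt is not None:          # prune branches the DFA rejects
                visit(child, nxt, prefix + ch)
    visit(trie, start, "")
    return hits

# Hypothetical DFA for the regex a(a|b)*c: state 2 is accepting.
delta = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1, (1, "c"): 2}
print(sorted(automaton_search(build_suffix_trie("abacbc"), delta, 0, {2})))
# ['abac', 'ac']
```

Each reported prefix is a word of the regular language that occurs in the text; because many suffixes share trie paths, the automaton is run once per distinct prefix rather than once per text position.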
Text Retrieval: Theory and Practice
 In 12th IFIP World Computer Congress, volume I
, 1992
"... We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main obse ..."
Abstract

Cited by 46 (14 self)
 Add to MetaCart
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.

1597 Shaks. Lover's Compl. 2: From off a hill whose concaue wombe reworded / A plaintfull story from a sistring vale. (OED2, reword, sistering)

1 Introduction
Full text retrieval systems are becoming a popular way of providing support for online text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of online documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...
On the use of Regular Expressions for Searching Text
 ACM Transactions on Programming Languages and Systems
, 1995
"... The use of regular expressions to search text is well known and understood as a useful technique. It is then surprising that the standard techniques and tools prove to be of limited use for searching text formatted with SGML or other similar markup languages. Experience with structured text search h ..."
Abstract

Cited by 38 (3 self)
 Add to MetaCart
The use of regular expressions to search text is well known and understood as a useful technique. It is then surprising that the standard techniques and tools prove to be of limited use for searching text formatted with SGML or other similar markup languages. Experience with structured text search has caused us to carefully re-examine the current practice. The generally accepted rule of "leftmost longest match" is an unfortunate choice and is at the root of the difficulties. We instead propose a rule which is semantically cleaner and is incidentally simpler and more efficient to implement. This rule is generally applicable to any text search application.

1 Introduction
Regular expressions are widely regarded as a precise, succinct notation for specifying a text search, with a straightforward, efficient implementation. Many people routinely use regular expressions to specify searches in text editors and with standalone search tools such as the Unix grep utility. A regular expression ...
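Python's `re` module can illustrate why longest match is awkward for markup (the non-greedy operator here is only a familiar stand-in, not the rule the paper proposes):

```python
import re

text = "<em>first</em> and <em>second</em>"

# Greedy, longest-match behaviour overshoots across elements,
# swallowing everything between the first open and last close tag:
print(re.findall(r"<em>.*</em>", text))
# ['<em>first</em> and <em>second</em>']

# A shortest-match (non-greedy) rule recovers the individual elements:
print(re.findall(r"<em>.*?</em>", text))
# ['<em>first</em>', '<em>second</em>']
```

The overshoot is exactly the difficulty with structured text: the intuitively "matching" unit is the smallest well-formed element, not the longest substring the expression can cover.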
Lexicographical Indices for Text: Inverted files vs. PAT trees
, 1991
"... We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. to appear in Information Retrieval: Data Structures and Algori ..."
Abstract

Cited by 23 (0 self)
 Add to MetaCart
We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of a word and does not need to know the structure of the text. To appear in Information Retrieval: Data Structures and Algorithms, R.A. Baeza-Yates and W. Frakes, eds., Prentice-Hall.

1 Introduction
Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing (for example, signature files [FC87]). In this report we discuss lexicographical indices, in particular two main data structures: inverted files and Pat trees. Our aim is to build an index for the text of size similar to or smaller than the text. Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated to each keyword. This ...
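A Pat/suffix array and its binary-search lookup can be sketched as follows (naive illustrative Python: real constructions do not materialize the suffixes, and practical lookups compare against the text in place):

```python
import bisect

def suffix_array(text):
    """Pat array: starting positions of all suffixes, in lexicographic
    order of the suffixes (naive construction for illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the sorted suffixes for those starting with pattern;
    every occurrence of pattern is the prefix of exactly one suffix."""
    suffixes = [text[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

text = "sistring"
sa = suffix_array(text)
print(find_occurrences(text, sa, "s"))
# [0, 2]
```

Because the index is just positions into the text, no word boundaries or document structure are needed, which is the point of the "new model of text".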
Approximate Matching of Network Expressions with Spacers
 Journal of Computational Biology
, 1992
"... Two algorithmic results are presented that are pertinent to the matching of patterns of interest in macromolecular sequences. The first result is an output sensitive algorithm for approximately matching network expressions, i.e., regular expressions without Kleene closure. This result generalizes th ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
Two algorithmic results are presented that are pertinent to the matching of patterns of interest in macromolecular sequences. The first result is an output-sensitive algorithm for approximately matching network expressions, i.e., regular expressions without Kleene closure. This result generalizes the O(kn) expected-time algorithm of Ukkonen for approximately matching keywords [Ukk85]. The second result concerns the problem of matching a pattern that is a network expression whose elements are approximate matches to network expressions interspersed with specifiable distance ranges. For this class of patterns, it is shown how to determine a backtracking procedure whose order of evaluation is optimal in the sense that its expected time is minimal over all such procedures.

Key words: Approximate Match, Backtracking, Network Expression, Proximity Search

January 16, 1992. Department of Computer Science, The University of Arizona, Tucson, Arizona 85721. *This work was supported in part by the ...
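Approximate keyword matching, the special case the first result generalizes, can be sketched with the classical dynamic program (illustrative Python; plain O(mn) column recurrence without the expected-time cutoff that gives Ukkonen's O(kn) bound):

```python
def approx_find(pattern, text, k):
    """Report end positions in `text` where some substring matches
    `pattern` with at most k edit errors (insert/delete/substitute)."""
    m = len(pattern)
    col = list(range(m + 1))      # distances against the empty prefix
    ends = []
    for j, ch in enumerate(text):
        prev = col[0]
        col[0] = 0                # a match may start anywhere in text
        for i in range(1, m + 1):
            cur = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(col[i] + 1,      # insertion into pattern
                         col[i - 1] + 1,  # deletion from pattern
                         prev + cost)     # substitution or match
            prev = cur
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_find("TATA", "xTATAyTCTAz", 1))
# [3, 4, 5, 9]
```

Ukkonen's refinement tracks only the prefix of each column whose entries can still be at most k, so the expected work per text character drops from m to O(k).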
Efficient Discovery of Optimal WordAssociation Patterns in Large Text Databases
 New Generation Computing
, 2000
"... We study efficient discovery of proximity wordassociation patterns, defined by a sequence of strings and a proximity gap, from a collection of texts with the positive and the negative labels. We present an algorithm that finds all dstrings kproximity wordassociation patterns that maximize the nu ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
We study efficient discovery of proximity word-association patterns, defined by a sequence of strings and a proximity gap, from a collection of texts with positive and negative labels. We present an algorithm that finds all d-string k-proximity word-association patterns that maximize the number of texts whose matching agrees with their labels.
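A toy Python reading of a proximity pattern match (the gap semantics and names here are assumptions of this sketch, not the paper's definitions: each later string must begin within k characters of the end of the previous match):

```python
def matches(text, words, k):
    """Does `text` contain the words in order, each starting within k
    characters after the end of the previous word's match?"""
    pos = 0
    for idx, w in enumerate(words):
        # First word may occur anywhere; later words within gap k.
        limit = len(text) if idx == 0 else pos + k + len(w)
        i = text.find(w, pos, limit)
        if i < 0:
            return False
        pos = i + len(w)
    return True

print(matches("the quick brown fox", ["quick", "fox"], 10))  # True
print(matches("the quick brown fox", ["quick", "fox"], 3))   # False
```

The discovery problem then searches the space of such (strings, gap) patterns for those whose match/non-match behaviour best separates the positively and negatively labelled texts.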
Matching a Set of Strings with Variable Length Don’t Cares
 Theoretical Computer Science 178
, 1997
"... Given an alphabet A, a pattern p is a sequence (vl,...,vm) of words from A * called keywords. We represent p as a single word ..."
Abstract

Cited by 17 (4 self)
 Add to MetaCart
Given an alphabet A, a pattern p is a sequence (v1,...,vm) of words from A* called keywords. We represent p as a single word
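Existence matching of such a pattern against a single word, with variable-length don't cares between the keywords, reduces to greedy leftmost keyword search (an illustrative Python sketch, not the paper's algorithm):

```python
def match_vldc(keywords, word):
    """Match the pattern *v1*v2*...*vm* against `word`, where * is a
    variable-length don't care. Greedy leftmost placement of each
    keyword is safe: it leaves maximal room for the remaining ones."""
    pos = 0
    for v in keywords:
        i = word.find(v, pos)
        if i < 0:
            return False
        pos = i + len(v)
    return True

print(match_vldc(["ab", "ba"], "xabyybaz"))  # True: ab ... ba in order
print(match_vldc(["ba", "ab"], "xabyybaz"))  # False: no ab after ba
```

The paper's contribution is the harder set version: matching many such patterns simultaneously, in the spirit of Aho-Corasick for plain keyword sets.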
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases
 In Proc. the 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501
"... . We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A twowords association pattern is an expression such as (TATA, 30, AGGAGGT) ) C that expresses a rule that if a text contains a subword TATA followed by another subwor ..."
Abstract

Cited by 14 (10 self)
 Add to MetaCart
We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C that expresses the rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters, then a property C will hold with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on the development of more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m and n are the number and the total length of classification example...
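Evaluating one two-words pattern and its confidence can be sketched as follows (illustrative Python; representing the label C as 1 for positive texts and 0 otherwise is an assumption of this sketch):

```python
def pattern_holds(text, alpha, k, beta):
    """True if `text` contains alpha followed by beta, with beta
    starting at most k letters after the end of alpha's occurrence."""
    i = text.find(alpha)
    while i >= 0:
        end = i + len(alpha)
        if text.find(beta, end, end + k + len(beta)) >= 0:
            return True
        i = text.find(alpha, i + 1)   # try later occurrences of alpha
    return False

def confidence(texts, alpha, k, beta):
    """Fraction of pattern-matching texts that carry the label C = 1."""
    matched = [label for t, label in texts if pattern_holds(t, alpha, k, beta)]
    return sum(matched) / len(matched) if matched else 0.0

texts = [("ccTATAggAGGAGGTcc", 1), ("TATAcc", 0), ("AGGAGGTcc", 0)]
print(confidence(texts, "TATA", 30, "AGGAGGT"))
# 1.0
```

The naive discovery algorithm runs this check for every candidate (α, k, β) drawn from subwords of the texts; the paper's contribution is avoiding that full enumeration.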