Approximate String Matching with Lempel-Ziv Compressed Indexes
, 2007
Cited by 5 (4 self)
A compressed full-text self-index for a text T is a data structure that requires reduced space and is capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in, for example, computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.
Finding patterns in given intervals
 of Lecture Notes in Computer Science
, 2007
Cited by 5 (2 self)
In this paper, we study the pattern matching problem in given intervals. Depending on whether the intervals are given a priori for preprocessing, during the query along with the pattern, or in both cases, we develop solutions for different variants of this problem. In particular, we present efficient indexing schemes for each of the above variants of the problem.
Maximal intersection queries in randomized graph models
 In CSR’07
, 2007
Cited by 4 (3 self)
Consider a family of sets and a single set, called the query set. How can one quickly find a member of the family that has a maximal intersection with the query set? Strict time constraints on the query and on a possible preprocessing of the set family make this problem challenging. Such maximal intersection queries arise in a wide range of applications, including web search, recommendation systems, and distributing online advertisements. In general, maximal intersection queries are computationally expensive. Therefore, one needs to add some assumptions about the input in order to get an efficient solution. We investigate two well-motivated distributions over all families of sets and propose an algorithm for each of them. We show that with very high probability an almost optimal solution is found in time logarithmic in the size of the family. In particular, we point out a threshold phenomenon on the probabilities of intersecting sets in each of our two input models, which leads to the efficient algorithms mentioned above.
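To make the query in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm: it scans the whole family, so it takes time linear in the family size rather than logarithmic, and the function name `max_intersection_index` is an illustrative assumption.

```python
from typing import AbstractSet, Hashable, Sequence


def max_intersection_index(
    family: Sequence[AbstractSet[Hashable]], query: AbstractSet[Hashable]
) -> int:
    """Return the index of a family member whose intersection with `query` is largest.

    Brute-force baseline (hypothetical helper, not the paper's method):
    every member of the family is compared against the query set.
    Ties are resolved in favour of the earliest member.
    """
    if not family:
        raise ValueError("family must contain at least one set")
    return max(range(len(family)), key=lambda i: len(family[i] & query))


family = [frozenset({1, 2, 3}), frozenset({2, 3, 4, 5}), frozenset({9})]
best = max_intersection_index(family, frozenset({3, 4, 5}))
```

A baseline like this is mainly useful for checking the answers of a faster, probabilistic method on small inputs.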
Property matching and weighted matching
 In CPM
, 2006
Cited by 4 (3 self)
In many pattern matching applications the text has some properties attached to various parts of it. Pattern Matching with Properties (Property Matching, for short) involves a string matching between the pattern and the text, and the requirement that the text part satisfies some property. Some immediate examples come from molecular biology, where it has long been a practice to consider special areas in the genome by their structure. It is straightforward to do sequential matching in a text with properties. However, indexing in a text with properties becomes difficult if we desire the time to be output dependent. We present an algorithm for indexing a text with properties in O(n log |Σ| + n log log n) time for preprocessing and O(|P| log |Σ| + tocc_π) per query, where n is the length of the text, P is the sought pattern, and tocc_π is the number of occurrences of the pattern that satisfy some property π. As a practical use of Property Matching we show how to solve Weighted Matching problems using techniques from Property Matching. Weighted sequences have been recently introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, where we are given the probability of every symbol's occurrence at every text location. Weighted matching problems are pattern matching problems where the given text is weighted. We present a reduction from Weighted Matching to Property Matching that allows off-the-shelf solutions to numerous weighted matching problems, including indexing (which is non-trivial without this reduction). Assuming that one seeks the occurrence of pattern P with probability ε in weighted text T of length n, we reduce the problem to a property matching problem of pattern P in text T′ of length O(n (1/ε)^2 log(1/ε)).
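The "straightforward" sequential case mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's indexing algorithm, and it assumes that the property π is modelled as a list of inclusive text intervals, with an occurrence counting only if it lies entirely inside one interval. That interval model and the name `property_match` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def property_match(
    text: str, pattern: str, intervals: Sequence[Tuple[int, int]]
) -> List[int]:
    """Return start positions of `pattern` in `text` that satisfy the property.

    Assumption: the property is a collection of inclusive intervals (s, f),
    and an occurrence text[i .. i+m-1] satisfies it when s <= i and
    i+m-1 <= f for at least one interval.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    hits = []
    start = text.find(pattern)
    while start != -1:
        end = start + m - 1
        if any(s <= start and end <= f for s, f in intervals):
            hits.append(start)
        # Continue one position later so overlapping occurrences are found.
        start = text.find(pattern, start + 1)
    return hits


hits = property_match("abracadabra", "abra", [(0, 5)])
```

Here "abra" occurs at positions 0 and 7, but only the first occurrence lies inside the interval (0, 5). The scan visits every occurrence, which is exactly why an output-dependent index is the harder problem.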
Probabilistic Management of OCR Data using an RDBMS
, 2011
Cited by 4 (0 self)
The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then to store the resulting ASCII text in a relational database. The OCR problem is challenging, and so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. Only when the results of OCR are stored in the database do these approaches discard the uncertainty. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book blows up to 2GB from 400kB as ASCII). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that are in between these two extremes of ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties, and describe how we integrate our scheme with standard RDBMS text indexing.
Dotted suffix trees: a structure for approximate text indexing
, 2006
Cited by 3 (1 self)
In this work, the problem we address is text indexing for approximate matching. Given a text T which undergoes some preprocessing to generate an index, we can later query this index to identify the places where a string occurs up to a certain number of errors k (edit distance). The indexing structure occupies space O(n log^k n) in the average case, independent of alphabet size. This structure can be used to report the existence of a match with k errors in O(3^k m^(k+1)) time and to report the occurrences in O(3^k m^(k+1) + ed) time, where m is the length of the pattern and ed is the number of matching edit scripts. The construction of the structure takes time bounded by O(kN|Σ|), where N is the number of nodes in the index and |Σ| is the alphabet size.
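For readers unfamiliar with matching "up to k errors", the sketch below shows the classical sequential dynamic-programming approach (often attributed to Sellers). It is not the dotted suffix tree described in this abstract: it uses no index and runs in O(nm) time. The name `approx_match_ends` is illustrative.

```python
from typing import List


def approx_match_ends(text: str, pattern: str, k: int) -> List[int]:
    """Return every text index j such that some substring ending at j
    is within edit distance k of `pattern`.

    Classic column-by-column dynamic programming, shown only as a
    non-indexed baseline for the problem the abstract describes.
    """
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best-matching
    # suffix of the text read so far; col[0] stays 0 because a match
    # may start at any text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra character in the text
                col[i - 1] + 1,    # pattern character left unmatched
            )
            prev_diag = col[i]
            col[i] = new_value
        if col[m] <= k:
            ends.append(j)
    return ends


ends_exact = approx_match_ends("abcdefg", "cde", 0)
ends_one_error = approx_match_ends("abcdefg", "cde", 1)
```

With k = 1 the pattern "cde" is reported at the end positions of "cd", "cde" and "cdef", each of which is within one edit of the pattern.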
Efficient approximate dictionary lookup over small alphabets
, 2005
Cited by 3 (1 self)
Given a dictionary W consisting of n binary strings of length m each, a d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 [10] as a challenge to data structure design. Efficient solutions have been developed only for the special case when d = 1 (the 1-query problem). We assume the standard RAM model of computation, and consider the case of the problem when the alphabet size is arbitrary but finite, and d is small. We preprocess the dictionary and construct an edge-labelled tree with bounded branching factor and height. We present an algorithm to answer dictionary lookup within a given distance d of a given query string q. The algorithm is efficient when the alphabet size is small, or the dictionary is sparse. In particular, for the d-query problem the algorithm takes time O(m (log_{4/3} n − 1)^d (log_2 n)^(d+1)). This is an improvement over previously known algorithms for the d-query problem when d > 1. We also generalize the results for the case of the problem when edit distances are used. The algorithm can be modified such that it allows for words of different lengths as well as different lengths of query strings.
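The d-query defined in this abstract can be stated directly in code. The sketch below is only the naive O(nm) scan over the dictionary, included to pin down the definition. It is not the tree-based algorithm the paper proposes, and the names `hamming` and `d_query` are illustrative.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def d_query(dictionary: Iterable[str], q: str, d: int) -> bool:
    """Return True if some dictionary string is within Hamming distance d of q.

    Naive baseline: every dictionary string is compared against q.
    """
    return any(hamming(w, q) <= d for w in dictionary)


W = ["0000", "1111", "1010"]
near = d_query(W, "0001", 1)
far = d_query(W, "0110", 1)
```

In this example "0001" is one flip away from "0000", while "0110" is at distance 2 from every string in W, so only the first query succeeds for d = 1.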
On the Suffix Automaton with mismatches
Cited by 2 (1 self)
In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of the automaton S_k in order to accept, in an efficient way, the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.
Compressed indexes for approximate string matching
 In Proceedings of the European Symposium on Algorithms
, 2006
Cited by 2 (0 self)
We revisit the problem of indexing a string S[1..n] to support searching all substrings in S that match a given pattern P[1..m] with at most k errors. Previous solutions either require an index of size exponential in k or need Ω(m^k) time for searching. Motivated by the indexing of DNA sequences, we investigate space-efficient indexes that occupy only O(n) space. For k = 1, we give an index to support matching in O(m + occ + log n log log n) time. The previously best solution achieving this time complexity requires an index of size O(n log n). This new index can be used to improve existing indexes for k ≥ 2 errors. Among others, it can support matching with k = 2 errors in O(m log n log log n + occ) time.