Results 1 - 10 of 33
A New Approach to Text Searching
Abstract - Cited by 293 (15 self)
We introduce a family of simple and fast algorithms for solving the classical string matching problem, string matching with classes of symbols, don't-care symbols and complement symbols, and multiple patterns. In addition, we solve the same problems allowing up to k mismatches. Among the features of these algorithms are that they do not need to buffer the input, they are real-time algorithms (for constant-size patterns), and they are suitable for implementation in hardware.

1 Introduction

String searching is a very important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation. Recent surveys of string searching can be found in [17, 4]. The string matching problem consists of finding all occurrences of a pattern of length m in a text of length n. We generalize the problem by allowing "don't care" symbols, the complement of a symbol, and any finite class of symbols. We solve this problem for one or more patterns, with or without mismatches. Fo...
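The bit-parallel style of these algorithms can be illustrated with a minimal Shift-Or matcher. This is a generic sketch of the technique, not the paper's exact formulation; classes of symbols and don't cares fit in naturally, since a class simply clears the bit for its position in every member character's mask.

```python
def shift_or_search(pattern, text):
    """Yield start positions of exact occurrences of pattern in text."""
    m = len(pattern)
    # One bitmask per character: bit i is 0 iff pattern[i] == c.
    # For a class of symbols, clear bit i in the mask of every class member.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, ~0) & ~(1 << i)
    state = ~0  # all bits set: no prefix currently matched
    for pos, c in enumerate(text):
        state = (state << 1) | masks.get(c, ~0)
        if state & (1 << (m - 1)) == 0:  # bit m-1 clear: full match ends here
            yield pos - m + 1

matches = list(shift_or_search("abc", "xabcabc"))  # [1, 4]
```

Each text character costs one shift and one OR regardless of pattern structure, which is why such algorithms need no input buffering and run in real time for constant-size patterns.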
A PATTERN MATCHING MODEL FOR MISUSE INTRUSION DETECTION
Abstract - Cited by 193 (7 self)
This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. Start and final states, and the paths between them, define the set of event sequences matched by the net. Partial-order matching can also be specified in this model. The main benefits of the model are its generality, portability, and flexibility.
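A drastically simplified, hypothetical miniature of guarded signature matching over an event stream might look like the following; the state names, event fields, and guards are invented, and the paper's Colored Petri Net model (colored tokens, partial orders, full nets) is far more general.

```python
def match_signature(events, transitions, start, final):
    """True once some subsequence of events drives start to final."""
    tokens = {start}
    for event in events:
        # Guards see the pre-update token set, so one event advances
        # each signature by at most one step.
        tokens |= {dst for src, guard, dst in transitions
                   if src in tokens and guard(event)}
        if final in tokens:
            return True
    return False

# Toy signature: a failed login eventually followed by a write under /etc.
transitions = [
    ("s0", lambda e: e["type"] == "login" and not e.get("ok"), "s1"),
    ("s1", lambda e: e["type"] == "write"
                     and e.get("path", "").startswith("/etc"), "s2"),
]
events = [
    {"type": "login", "ok": False},
    {"type": "read", "path": "/tmp/x"},
    {"type": "write", "path": "/etc/passwd"},
]
alarm = match_signature(events, transitions, "s0", "s2")  # True
```

The lambdas play the role of guards; the set of live states stands in for the net's token marking.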
Text Retrieval: Theory and Practice
- In 12th IFIP World Computer Congress, volume I
, 1992
Abstract - Cited by 51 (14 self)
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and the issues involved. We survey recently published results for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.

1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. (OED2, reword, sistering 1)

1 Introduction

Full text retrieval systems are becoming a popular way of providing support for on-line text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user's point of view, full text searching of on-line documents is appealing because any word or sentence of the document is a valid query. However, when the desired answer cannot be obtained with a simple query, the user must perform his or her own semantic processing to guess w...
Approximate Pattern Matching with Samples
- In Proc. of ISAAC'94
, 1994
Abstract - Cited by 27 (1 self)
In this paper we simplify the algorithm of Chang and Lawler for the approximate string matching problem by adopting the concept of sampling. We give a more general analysis of expected time for the simplified algorithm in the one-dimensional case under a non-uniform probability distribution, and we show that our method can easily be generalized to the two-dimensional approximate pattern matching problem with sublinear expected time.

1 Introduction

Since the inaugural papers on string matching algorithms were published by Knuth, Morris and Pratt [11] and Boyer and Moore [5], the problem has diversified in various directions. Let us call string matching one-dimensional pattern matching. One direction is two-dimensional pattern matching, and another is approximate pattern matching, where up to k differences are allowed for a match. Yet another theme is two-dimensional approximate pattern matching. There are numerous papers in these new research areas. We cite just a few of them to compare...
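Chang and Lawler's sampling analysis does not fit in a few lines, but the filter-then-verify idea behind such k-mismatch algorithms can be illustrated with the standard k+1-block partition: if the pattern occurs with at most k mismatches, at least one of k+1 disjoint pattern blocks must occur exactly, so exact block hits select candidate positions for verification. The helper names below are my own, not the paper's.

```python
def hamming_le(a, b, k):
    """True if a and b (equal length) differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True

def k_mismatch_search(pattern, text, k):
    """Start positions where pattern matches text with <= k mismatches.
    Assumes len(pattern) > k so every block is non-empty."""
    m = len(pattern)
    block = m // (k + 1)
    candidates = set()
    for b in range(k + 1):
        piece = pattern[b * block : (b + 1) * block]
        start = text.find(piece)
        while start != -1:          # each exact block hit suggests a start
            cand = start - b * block
            if 0 <= cand <= len(text) - m:
                candidates.add(cand)
            start = text.find(piece, start + 1)
    # Verify each candidate with a direct mismatch count.
    return sorted(p for p in candidates if hamming_le(text[p:p + m], pattern, k))

hits = k_mismatch_search("abcdef", "xxabzdefyy", 1)  # [2]
```

On random text, block hits are rare, so most positions are filtered out without verification, which is the intuition behind sublinear expected time.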
An expert system for automatically correcting OCR output
- In Proceedings of the IS&T/SPIE 1994 International Symposium on Electronic Imaging Science and Technology
, 1994
Abstract - Cited by 26 (6 self)
This paper describes a new expert system for automatically correcting errors made by optical character recognition (OCR) devices. The system, which we call the post-processing system, is designed to improve the quality of text produced by an OCR device in preparation for subsequent retrieval from an information system. The system is composed of several parts: an information retrieval system, an English dictionary, a domain-specific dictionary, and a collection of algorithms and heuristics designed to correct as many OCR errors as possible. The remaining errors that cannot be corrected are passed on to a user-level editing program. This post-processing system can be viewed as part of a larger system that streamlines the steps of taking a document from its hard-copy form to its usable electronic form, or it can be considered a stand-alone system for OCR error correction. An earlier version of this system has been used to process approximately 10,000 pages of OCR gen...
Post-editing through approximation and global correction
- International Journal of Pattern Recognition and Artificial Intelligence
, 1993
Abstract - Cited by 19 (10 self)
This paper describes a new automatic spelling correction program to deal with OCR-generated errors. The method used here is based on three principles:
1. Approximate string matching between the misspellings and the terms occurring in the database, as opposed to the entire dictionary
2. Local information obtained from the individual documents
3. The use of a confusion matrix, which contains information inherently specific to the nature of errors caused by the particular OCR device
This system is then used to process approximately 10,000 pages of OCR-generated documents. Of the misspellings discovered by this algorithm, about 87% were corrected.
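The confusion-matrix principle can be sketched as a weighted edit distance in which substitutions the OCR device commonly makes are cheap. The toy matrix and costs below are invented for illustration; the paper's matrix is derived from the error behavior of the actual device.

```python
def weighted_edit_distance(source, target, sub_cost):
    """Levenshtein distance where each substitution cost comes from a
    confusion matrix (default 1.0); insertions and deletions cost 1.0."""
    m, n = len(source), len(target)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s, t = source[i - 1], target[j - 1]
            sub = 0.0 if s == t else sub_cost.get((s, t), 1.0)
            dp[i][j] = min(dp[i - 1][j] + 1.0,      # delete from source
                           dp[i][j - 1] + 1.0,      # insert into source
                           dp[i - 1][j - 1] + sub)  # substitute s -> t
    return dp[m][n]

# OCR devices often read 'l' as '1' and 'O' as '0': make those cheap.
confusions = {("1", "l"): 0.1, ("0", "O"): 0.1}
terms = ["role", "rule", "mole"]  # terms occurring in the database
best = min(terms, key=lambda t: weighted_edit_distance("ro1e", t, confusions))
# best == "role": the device-specific confusion outranks generic edits
```

Restricting candidates to database terms, as principle 1 prescribes, keeps this comparison loop small.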
An Evaluation of Phonetic Spell Checkers
- Mechanisms of Radiation Effects in Electronic Materials
, 2001
Abstract - Cited by 14 (0 self)
In the work reported here, we describe a phonetic spell-checking algorithm, Phonetex, which integrates aspects of Soundex and its extension Phonix. It is designed to provide a phonetic component for an existing typographic spell checker. We increase the number of letter codes compared to Soundex and Phonix. We also integrate phonetic rules, but use far fewer than Phonix, which was designed for South African name matching, or Rogers and Willett's Phonix extension, which was designed for 17th-century spellings, as these include many rules that are redundant in a contemporary word-based domain. We evaluate our algorithm by comparing it to the phonetic spell checkers Soundex and Editex and four benchmark spell checkers (Agrep, MS Word 97 & 2000, and UNIX `ispell') using a list of phonetic spelling errors. We find that our approach has superior recall (accuracy) to the alternative approaches, although the higher recall comes at the expense of precision (the number of possible matches retrieved). We intend to integrate it into an existing spell checker, where the integration will improve precision; thus high recall is the aim for our approach in this paper. Keywords: Data Cleaning, Phonetic Spell Checker, Phonetic Code Generation.
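For reference, classic Soundex (the baseline that Phonetex extends) fits in a few lines. This is the standard four-character coding, not Phonetex itself, which adds letter codes and phonetic rewrite rules.

```python
def soundex(word):
    """Classic Soundex: first letter plus up to three digit codes."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = code.get(word[0], "")   # adjacent same-code letters collapse
    for c in word[1:]:
        d = code.get(c, "")
        if d and d != prev:
            digits.append(d)
        if c not in "hw":          # h and w are transparent: prev survives them
            prev = d
    return (first + "".join(digits) + "000")[:4]

codes = (soundex("Robert"), soundex("Rupert"))  # ('R163', 'R163')
```

Two words collide under Soundex exactly when their codes are equal, which is how a phonetic checker retrieves sound-alike candidates for a misspelling.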
Differential Synchronization
Abstract - Cited by 8 (0 self)
This paper describes the Differential Synchronization (DS) method for keeping documents synchronized. The key feature of DS is that it is simple and well suited for use in both novel and existing state-based applications without requiring application redesign. DS uses deltas to make efficient use of bandwidth, and is fault-tolerant, allowing copies to converge in spite of occasional errors. We consider practical implementation of DS and describe some techniques to improve its performance in a browser environment.
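One half-cycle of a DS-style sync can be sketched with Python's difflib, assuming a single client and server that share a shadow copy; real DS adds version numbers, fuzzy patching, and backup shadows for the fault tolerance the abstract mentions.

```python
import difflib

def make_delta(old, new):
    """Encode old -> new as (tag, i1, i2, replacement) edits against old."""
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [(tag, i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def apply_delta(text, delta):
    """Apply edits right to left so earlier offsets stay valid."""
    for tag, i1, i2, repl in reversed(delta):
        text = text[:i1] + repl + text[i2:]
    return text

client_text = server_text = shadow = "the quick brown fox"

client_text = "the quick red fox"          # local edit on the client
delta = make_delta(shadow, client_text)    # diff against the shadow copy
shadow = client_text                       # shadow catches up to the client
server_text = apply_delta(server_text, delta)
```

Because only the delta crosses the wire, bandwidth scales with the size of the edit rather than the document, which is the efficiency claim in the abstract.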
High-Speed String Searching against Large Dictionaries on the Cell/B.E. Processor
Abstract - Cited by 7 (0 self)
Our digital universe is growing, creating exploding amounts of data which need to be searched, protected and filtered. String searching is at the core of the tools we use to curb this explosion, such as search engines, network intrusion detection systems, spam filters, and anti-virus programs. But as communication speed grows, our capability to perform string searching in real time seems to fall behind. Multi-core architectures promise enough computational power to cope with the incoming challenge, but it is still unclear which algorithms and programming models to use to unleash this power. We have parallelized a popular string searching algorithm, Aho-Corasick, on the IBM Cell/B.E. processor, with the goal of performing exact string matching against large dictionaries. In this article we propose a novel approach to fully exploit the DMA-based communication mechanisms of the Cell/B.E. to provide an unprecedented level of aggregate performance with irregular access patterns. We have discovered that memory congestion plays a crucial role in determining the performance of this algorithm. We discuss three aspects of congestion: memory pressure, layout issues, and hot spots, and we present a collection of algorithmic solutions to alleviate these problems and achieve quasi-optimal performance. The implementation of our algorithm provides a worst-case throughput of 2.5 Gbps, and a typical throughput between 3.3 and 4.4 Gbps, measured on realistic scenarios with a two-processor Cell/B.E. system.
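A compact serial Aho-Corasick, the algorithm the article parallelizes, can be sketched as follows; the Cell/B.E.-specific DMA scheduling and memory-layout optimizations are of course beyond a few lines of Python.

```python
from collections import deque

class AhoCorasick:
    """Textbook Aho-Corasick: trie plus failure links, matching all
    dictionary words in a single left-to-right scan of the text."""

    def __init__(self, words):
        self.goto = [{}]   # child transitions per node
        self.fail = [0]    # failure links
        self.out = [[]]    # dictionary words recognized at each node
        for w in words:    # build the trie
            node = 0
            for c in w:
                if c not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][c] = len(self.goto) - 1
                node = self.goto[node][c]
            self.out[node].append(w)
        queue = deque(self.goto[0].values())   # depth-1 nodes fail to root
        while queue:                           # BFS to set failure links
            node = queue.popleft()
            for c, nxt in self.goto[node].items():
                queue.append(nxt)
                f = self.fail[node]
                while f and c not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(c, 0)
                # Inherit outputs reachable through the failure chain.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start, word) for every dictionary hit in text."""
        node, hits = 0, []
        for i, c in enumerate(text):
            while node and c not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(c, 0)
            for w in self.out[node]:
                hits.append((i - len(w) + 1, w))
        return hits

ac = AhoCorasick(["he", "she", "his", "hers"])
hits = ac.search("ushers")  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The per-character loop follows at most a few failure links, so the scan is effectively linear in the text, independent of dictionary size; the article's contribution is making that scan fast under the Cell/B.E.'s irregular memory-access constraints.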