Results 1–10 of 14
Fast String Correction with Levenshtein Automata
International Journal of Document Analysis and Recognition, 2002
Abstract

Cited by 33 (5 self)
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein distance where further primitive edit operations (transpositions, merges and splits) may be used.
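The lexicon-filtering idea in the abstract (generate exactly the dictionary words within a fixed edit-distance bound of the input) can be sketched without the paper's automaton construction, by carrying one dynamic-programming row down each trie edge and pruning when the row minimum exceeds the bound. A minimal illustration only; the class and function names here are hypothetical, not from the paper:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    """Add a dictionary word to the trie."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def search(root, query, bound):
    """Return all dictionary words within edit distance `bound` of `query`."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from empty prefix
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, ch, bound, results)
    return results

def _walk(node, ch, query, prev_row, prefix, bound, results):
    # Compute the next DP row for the prefix extended by `ch`.
    row = [prev_row[0] + 1]
    for c in range(1, len(query) + 1):
        insert_cost = row[c - 1] + 1
        delete_cost = prev_row[c] + 1
        replace_cost = prev_row[c - 1] + (query[c - 1] != ch)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if node.is_word and row[-1] <= bound:
        results.append(prefix)
    # Prune: if every cell exceeds the bound, no extension can recover.
    if min(row) <= bound:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, prefix + next_ch, bound, results)
```

Unlike the precomputed deterministic automaton the paper describes, this recomputes a DP row per trie edge, but the set of words generated for a given bound is the same.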
Improved Website Fingerprinting on Tor
Abstract

Cited by 11 (3 self)
In this paper, we propose new website fingerprinting techniques that achieve a higher classification accuracy on Tor than previous works. We describe our novel methodology for gathering data on Tor; this methodology is essential for accurate classifier comparison and analysis. We offer new ways to interpret the data by using the more fundamental Tor cells as a unit of data rather than TCP/IP packets. We demonstrate an experimental method to remove Tor SENDMEs, which are control cells that provide no useful data, in order to improve accuracy. We also propose a new set of metrics to describe the similarity between two traffic instances; they are derived from observations on how a site is loaded. Using our new metrics we achieve a higher success rate than previous authors. We conduct a thorough analysis and comparison between our new algorithms and the previous best algorithm. To identify the potential power of website fingerprinting on Tor, we perform open-world experiments; we achieve a recall rate over 95% and a false positive rate under 0.2% for several potentially monitored sites, which far exceeds previously reported recall rates. In the closed-world experiments, our accuracy is 91%, as compared to 86–87% from the best previous classifier on the same data.
SemGrAM: Integrating Semantic Graphs into Association Rule Mining
In AusDM, volume 70 of CRPIT, 2007
Abstract

Cited by 4 (1 self)
To date, most association rule mining algorithms have assumed that the domains of items are either discrete or, in a limited number of cases, hierarchical, categorical or linear. This constrains the search for interesting rules to those that satisfy the specified quality metrics as independent values or as higher-level concepts of those values. However, in many cases the determination of a single hierarchy is not practicable and, for many datasets, an item’s value may be taken from a domain that is more conveniently structured as a graph with weights indicating semantic (or conceptual) distance. Research in the development of algorithms that generate disjunctive association rules has allowed the production of rules such as Radios ∨ TVs → Cables. In many cases there is little semantic relationship between the disjunctive terms, and arguably less readable rules such as Radios ∨ Tuesday → Cables can result. This paper describes two association rule mining algorithms, SemGrAM-G and SemGrAM-P, that accommodate conceptual distance information contained in a semantic graph. The SemGrAM algorithms permit the discovery of rules that include an association between sets of cognate groups of item values. The paper discusses the algorithms, the design decisions made during their development and some experimental results.
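The "semantic (or conceptual) distance" over a weighted graph mentioned above is naturally a shortest-path quantity: closely related item values (Radios, TVs) are a short walk apart, unrelated ones (Radios, Tuesday) are far apart or unreachable. A minimal sketch using Dijkstra's algorithm; the graph and all names are illustrative, not taken from the paper:

```python
import heapq

def semantic_distance(graph, source, target):
    """Shortest weighted path as a conceptual distance between item values.

    graph: adjacency map {node: [(neighbour, weight), ...]}.
    Returns float('inf') when the values share no semantic connection.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return float("inf")
```

A mining algorithm could then group item values whose pairwise distance stays under a threshold, so disjunctive terms remain semantically cognate.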
Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions
In ICSC, 1994
Abstract

Cited by 2 (0 self)
We consider a problem which can greatly enhance the areas of cursive script recognition and the recognition of printed character sequences. This problem involves recognizing words/strings by processing their noisy subsequences. Let X* be any unknown word from a finite dictionary H. Let U be any arbitrary subsequence of X*. We study the problem of estimating X* by processing Y, a noisy version of U. Y contains substitution, insertion, deletion and generalized transposition errors, the latter occurring when transposed characters are themselves subsequently substituted. We solve the noisy subsequence recognition problem by defining and using the constrained edit distance between X ∈ H and Y subject to any arbitrary edit constraint involving the number and type of edit operations to be performed. An algorithm to compute this constrained edit distance is presented. Using these algorithms we present a syntactic Pattern Recognition (PR) scheme which corrects noisy tex...
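For contrast with the generalized transpositions above, the standard adjacent-transposition case is the restricted Damerau-Levenshtein (optimal string alignment) recurrence. A textbook sketch of that simpler distance; the paper's generalized variant, where transposed characters may additionally be substituted, and its constrained form require a more elaborate recurrence not shown here:

```python
def damerau_levenshtein(x, y):
    """Edit distance with substitutions, insertions, deletions and
    adjacent transpositions (restricted / optimal-string-alignment variant)."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```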
A Character Recognition Approach using Freeman Chain Code and Approximate String Matching
Abstract
This paper deals with a syntactic approach to character recognition using approximate string matching and chain coding of characters. Here we deal only with the classification of characters, not with the other phases of the character recognition process in an Optical Character Recognition (OCR) system. The character image is first normalized to a specified size; then, by a boundary detection process, we detect the boundary of the character image. The character is thus converted to a boundary-curve representation. The curve is then encoded as a sequence of numbers using Freeman chain coding; the coding scheme yields digits ranging from 0 to 7, so each character becomes a string. For the training set we obtain a set of strings, which is stored in a trie. An extracted unclassified character is likewise converted to a string and searched for in the trie. Because characters may appear in different orientations, the search uses approximate string matching so that noisy characters of differing orientation are still supported. For approximate string matching we use a Look-Ahead Branch and Bound scheme to prune paths and make the approximation accurate and efficient. Because we use a trie data structure, lookup takes uniform time and does not depend on the size of the input. In our experiments with noiseless (printed) characters, the method successfully recognized all characters; when tested with different variations of the characters, it detected most of them, except for some noisy characters.
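Freeman chain coding, as used above, replaces a boundary curve by the sequence of 8-neighbour directions between successive boundary pixels. A minimal sketch (the direction table and function are illustrative; the paper's preprocessing pipeline is not reproduced):

```python
# Freeman 8-direction codes: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE.
# Image coordinates: y grows downward, so "N" means dy = -1.
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def freeman_chain_code(points):
    """Encode a boundary given as successive 8-connected (x, y) pixels."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes
```

The resulting digit string is what gets stored in (and matched against) the trie.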
This thesis is accepted, 2002
Abstract
But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep. – Robert Frost, “Stopping by Woods on a Snowy Evening” To my wife, Rachel, to our children, Conrad and Martina, to our parents, and, last but not least, to Kaboose. The Evolving Transformations Systems (ETS) model is a new inductive learning model proposed by Goldfarb in 1990. The development of the ETS model was motivated by the need to unify the two competing approaches that model learning, numeric (vector space) and symbolic. This model provides a new method for describing classes (or concepts) and also a framework for learning classes from
Pattern Recognition of Strings Containing Traditional and Generalized Transposition Errors
Abstract
We study the problem of recognizing a string Y which is the noisy version of some unknown string X* chosen from a finite dictionary, H. The traditional case which has been extensively studied in the literature is the one in which Y contains substitution, insertion and deletion (SID) errors. Although some work has been done to extend the traditional set of edit operations to include the straightforward transposition of adjacent characters [LW75], the problem is unsolved when the transposed characters are themselves subsequently substituted, as is typical in cursive and typewritten script, in molecular biology and in noisy chain-coded boundaries. In this paper we present the first reported solution to the analytic problem of editing one string X into another, Y, using these four edit operations. A scheme for obtaining the optimal edit operations is also given. Both these solutions are optimal for the infinite alphabet case. Using these algorithms we present a syntactic pattern reco...
Precise and Efficient Text Correction using Levenshtein Automata, Dynamic Web Dictionaries and Optimized Correction Models
Abstract
Despite the high quality of commercial tools for optical character recognition (OCR), the number of OCR errors in scanned documents remains intolerable for many applications. We describe an approach to lexical post-correction of OCR results developed in our groups at the universities of Munich and Sofia in the framework of two research projects. Some characteristic features are the following: (1) On the dictionary side, very large dictionaries for languages such as German, Bulgarian, English, Russian etc. are enriched with special dictionaries for proper names, geographic names and acronyms. For post-correction of texts in a specific thematic area we also compute “dynamic” dictionaries via analysis of web pages that fit the given thematic area. (2) Given a joint background dictionary for post-correction, we have developed very fast methods for selecting a suitable set of correction candidates for a garbled word of the OCR output text. (3) In a second step, correction candidates are ranked. Our ranking mechanism is based on a number of parameters that determine the influence of features of correction suggestions such as word frequency, edit distance and others. A complex tool has been developed for optimizing these parameters on the basis of ground truth data. Our evaluation results cover a variety of corpora and show that post-correction improves the quality even for scanned texts with a very small number of OCR errors.
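Step (3), ranking correction candidates by weighted features such as word frequency and edit distance, can be sketched as a simple linear score whose weights would be tuned on ground-truth data. A hypothetical illustration only; the actual feature set, weighting scheme and parameter optimizer are not described in this abstract:

```python
import math

def rank_candidates(candidates, weights):
    """Rank OCR correction candidates; higher score is better.

    candidates: list of (word, edit_distance, corpus_frequency) tuples.
    weights:    (w_dist, w_freq) penalty/bonus weights, assumed tuned
                on ground-truth data.
    Score = w_freq * log(1 + frequency) - w_dist * edit_distance.
    """
    w_dist, w_freq = weights
    scored = [(w_freq * math.log1p(freq) - w_dist * dist, word)
              for word, dist, freq in candidates]
    return [word for _, word in sorted(scored, reverse=True)]
```

The log on frequency keeps very common words from drowning out close matches, one plausible reason frequency and edit distance are balanced rather than applied lexicographically.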
A Scene Learning and Recognition Framework, 2005
Abstract
As multi-agent systems grow in complexity and diversity, they become increasingly difficult to design. Agents are described in terms of their behaviour, typically trained by an expert who prepares knowledge representations or training data for supervised machine learning. To reduce development time, agents could learn by observing the behaviour of other agents. This thesis describes an effort to train a RoboCup soccer agent by capturing data from existing players, generating a knowledge representation, and using a real-time scene recognition system. The trained agent later exhibits behaviour traits similar to the observed agent and can appear to completely imitate the behaviour of the original; the process requires little human intervention. Experiments are performed using three agents of varying complexity. The “scene” knowledge description format and simple scene matching algorithm are limited to imitation of stateless and deterministic agent behaviours. Future work includes improving