Results 1 – 9 of 9
Fast String Correction with Levenshtein Automata
 INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION, 2002
Abstract

Cited by 28 (5 self)
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein distance where further primitive edit operations (transpositions, merges and splits) may be used.
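The trie-controlled lexicon search described above can be sketched as follows. This is a hedged illustration, not the paper's automaton construction: the classic dynamic-programming row for the Levenshtein distance is extended character by character while walking down a trie, and a branch is pruned as soon as every entry of its row exceeds the bound n. All function and variable names are illustrative.

```python
def levenshtein(a, b):
    """Plain DP edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution/match
        prev = cur
    return prev[-1]

def build_trie(words):
    """Nested-dict trie; '$' marks a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def search(trie, word, n):
    """All dictionary words within Levenshtein distance n of `word`."""
    results = []
    first_row = list(range(len(word) + 1))

    def walk(node, row):
        for ch, child in node.items():
            if ch == "$":
                if row[-1] <= n:            # full word within the bound
                    results.append((child, row[-1]))
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(word) + 1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (word[j - 1] != ch)))
            if min(new_row) <= n:           # prune hopeless branches
                walk(child, new_row)

    walk(trie, first_row)
    return results
```

Pruning on `min(new_row)` is what keeps the search fast: no extension of a branch can ever lower the minimum of its DP row.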
SemGrAM: integrating semantic graphs into association rule mining
 In AusDM, volume 70 of CRPIT
, 2007
Abstract

Cited by 2 (1 self)
To date, most association rule mining algorithms have assumed that the domains of items are either discrete or, in a limited number of cases, hierarchical, categorical or linear. This constrains the search for interesting rules to those that satisfy the specified quality metrics as independent values or as higher level concepts of those values. However, in many cases the determination of a single hierarchy is not practicable and, for many datasets, an item’s value may be taken from a domain that is more conveniently structured as a graph with weights indicating semantic (or conceptual) distance. Research in the development of algorithms that generate disjunctive association rules has allowed the production of rules such as Radios ∨ TVs → Cables. In many cases there is little semantic relationship between the disjunctive terms and arguably less readable rules such as Radios ∨ Tuesday → Cables can result. This paper describes two association rule mining algorithms, SemGrAMG and SemGrAMP, that accommodate conceptual distance information contained in a semantic graph. The SemGrAM algorithms permit the discovery of rules that include an association between sets of cognate groups of item values. The paper discusses the algorithms, the design decisions made during their development and some experimental results.
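As a loose illustration of grouping item values by semantic distance (not the paper's SemGrAM algorithms, whose details are not given here), one simple scheme forms "cognate groups" as connected components of the semantic graph under a distance threshold. The graph representation, the threshold rule and all names below are assumptions.

```python
def cognate_groups(edges, threshold):
    """edges: {(u, v): weight}. Returns groups of nodes connected by
    edges whose semantic distance is <= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (u, v), w in edges.items():
        find(u); find(v)                    # register both endpoints
        if w <= threshold:
            union(u, v)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(map(sorted, groups.values()))
```

With semantically close items such as Radio/TV/Cable linked by small weights and an unrelated item like Tuesday linked only by a large weight, the grouping separates them, matching the paper's motivation for avoiding rules like Radios ∨ Tuesday → Cables.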
Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions
 In ICSC
, 1994
Abstract

Cited by 1 (0 self)
We consider a problem which can greatly enhance the areas of cursive script recognition and the recognition of printed character sequences. This problem involves recognizing words/strings by processing their noisy subsequences. Let X* be any unknown word from a finite dictionary H. Let U be any arbitrary subsequence of X*. We study the problem of estimating X* by processing Y, a noisy version of U. Y contains substitution, insertion, deletion and generalized transposition errors, the latter occurring when transposed characters are themselves subsequently substituted. We solve the noisy subsequence recognition problem by defining and using the constrained edit distance between X ∈ H and Y subject to any arbitrary edit constraint involving the number and type of edit operations to be performed. An algorithm to compute this constrained edit distance is presented. Using these algorithms we present a syntactic Pattern Recognition (PR) scheme which corrects noisy tex...
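For orientation, the unconstrained, plain-transposition relative of the distance studied here is the "optimal string alignment" recurrence: substitutions, insertions, deletions, plus a swap of adjacent characters. This sketch is a simpler stand-in, not the paper's constrained, generalized-transposition algorithm.

```python
def osa_distance(x, y):
    """Optimal string alignment distance: SID errors plus a unit-cost
    transposition of adjacent characters."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
            if (i > 1 and j > 1 and x[i - 1] == y[j - 2]
                    and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```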
Pattern Recognition of Strings Containing Traditional and Generalized Transposition Errors
Abstract
We study the problem of recognizing a string Y which is the noisy version of some unknown string X* chosen from a finite dictionary, H. The traditional case which has been extensively studied in the literature is the one in which Y contains substitution, insertion and deletion (SID) errors. Although some work has been done to extend the traditional set of edit operations to include the straightforward transposition of adjacent characters [LW75], the problem is unsolved when the transposed characters are themselves subsequently substituted, as is typical in cursive and typewritten script, in molecular biology and in noisy chain-coded boundaries. In this paper we present the first reported solution to the analytic problem of editing one string, X, to another, Y, using these four edit operations. A scheme for obtaining the optimal edit operations has also been given. Both these solutions are optimal for the infinite alphabet case. Using these algorithms we present a syntactic pattern reco...
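One way the transposition recurrence can be extended so that the two swapped characters may themselves be substituted afterwards (a "generalized transposition") is sketched below. The unit-cost model (1 for the swap plus 1 per subsequent substitution) is an illustrative assumption, not the paper's exact cost formulation.

```python
def gt_distance(x, y):
    """Edit distance with SID errors plus generalized transpositions:
    swap x[i-2]x[i-1], then optionally substitute each swapped char."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
            if i > 1 and j > 1:
                # swap x[i-2]x[i-1] -> x[i-1]x[i-2], then substitute the
                # swapped characters toward y[j-2]y[j-1] where they differ
                gt = 1 + (x[i - 1] != y[j - 2]) + (x[i - 2] != y[j - 1])
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + gt)
    return d[m][n]
```

For example, turning "ab" into "ca" can be modelled as a swap ("ba") followed by one substitution (b → c), total cost 2 under this model.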
Thue Systems for Pattern Recognition
, 2003
Abstract
This report presents a synoptic overview of Thue Systems. Thue Systems were introduced in the early 1900s by the Norwegian mathematician and logician Axel Thue. In this report the author suggests ways in which such systems can be used in pattern recognition.
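To make the formalism concrete, here is a toy semi-Thue (string-rewriting) system: rules are pairs (lhs, rhs), and one rewriting step replaces a single occurrence of lhs by rhs. The rules used in the test are invented examples, not taken from the report.

```python
def rewrite_once(s, rules):
    """All strings reachable from s in exactly one rewriting step."""
    out = set()
    for lhs, rhs in rules:
        start = s.find(lhs)
        while start != -1:
            out.add(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return out

def reachable(s, rules, max_steps=5):
    """Strings derivable from s in at most max_steps steps."""
    seen, frontier = {s}, {s}
    for _ in range(max_steps):
        frontier = {t for u in frontier for t in rewrite_once(u, rules)} - seen
        if not frontier:
            break
        seen |= frontier
    return seen
```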
This thesis is accepted
, 2002
Abstract
But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep. – Robert Frost, Stopping by Woods on a Snowy Evening. To my wife, Rachel, to our children, Conrad and Martina, to our parents, and, last but not least, to Kaboose. The Evolving Transformations Systems (ETS) model is a new inductive learning model proposed by Goldfarb in 1990. The development of the ETS model was motivated by the need for the unification of the two competing approaches that model learning – numeric (vector space) and symbolic. This model provides a new method for describing classes (or concepts) and also a framework for learning classes from
An Efficient Algorithm for Deduplication of Demographic Data
Abstract
Abstract. This paper proposes an efficient algorithm for deduplication based on demographic information consisting of two name strings, viz. GivenName and Surname, of individuals. The algorithm consists of two stages: enrolment and deduplication. In both stages, all name strings are reduced to generic name strings with the help of phonetic-based reduction rules. Thus there may be several name strings having the same generic name, and there may also be many individuals having the same name. The generic name, together with all its name strings and their Ids, forms a bin. At the enrolment stage, a database with demographic information is efficiently created as an array of bins, where each bin is a singly linked list. At the deduplication stage, name strings are reduced and all neighbouring bins of the reduced name strings are used to determine the top k best matches. To assess the performance of the proposed algorithm, we considered a large demographic database of 485,136 individuals. It has been observed that the phonetic reduction rules reduce both name strings by more than 90%. Experimental results reveal a very high hit rate against a low penetration rate.
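The paper's specific phonetic reduction rules are not reproduced here, so this sketch uses classic Soundex as a stand-in to show the binning idea: many surface spellings collapse to one generic code, and a bin maps that code to the names (and Ids) that share it.

```python
from collections import defaultdict

_CODES = {c: d for d, letters in
          {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
           "4": "l", "5": "mn", "6": "r"}.items()
          for c in letters}

def soundex(name):
    """Classic 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    name = name.lower()
    out, prev = name[0].upper(), _CODES.get(name[0], "")
    for ch in name[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":          # h/w do not break a run of equal codes
            prev = code
    return (out + "000")[:4]

def build_bins(records):
    """records: iterable of (person_id, name).
    Returns {generic_code: [(person_id, name), ...]} -- one bin per code."""
    bins = defaultdict(list)
    for pid, name in records:
        bins[soundex(name)].append((pid, name))
    return bins
```

At deduplication time a query name is reduced the same way and only its bin (and neighbouring bins) need be scanned, which is what keeps the penetration rate low.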
Improved Website Fingerprinting on Tor
Abstract
In this paper, we propose new website fingerprinting techniques that achieve a higher classification accuracy on Tor than previous works. We describe our novel methodology for gathering data on Tor; this methodology is essential for accurate classifier comparison and analysis. We offer new ways to interpret the data by using the more fundamental Tor cells as a unit of data rather than TCP/IP packets. We demonstrate an experimental method to remove Tor SENDMEs, which are control cells that provide no useful data, in order to improve accuracy. We also propose a new set of metrics to describe the similarity between two traffic instances; they are derived from observations on how a site is loaded. Using our new metrics we achieve a higher success rate than previous authors. We conduct a thorough analysis and comparison between our new algorithms and the previous best algorithm. To identify the potential power of website fingerprinting on Tor, we perform open-world experiments; we achieve a recall rate over 95% and a false positive rate under 0.2% for several potentially monitored sites, which far exceeds previously reported recall rates. In the closed-world experiments, our accuracy is 91%, as compared to 86–87% from the best previous classifier on the same data.
A Character Recognition Approach using Freeman Chain Code and Approximate String Matching
Abstract
This paper deals with a syntactic approach to character recognition using approximate string matching and chain coding of characters. We deal only with the classification of characters, not with the other phases of the character recognition process in optical character recognition. The character image is first normalized to a specified size, and a boundary detection process then extracts the boundary of the character image, giving a boundary-curve representation of the character. The curve is encoded as a sequence of numbers using Freeman chain coding; the coding scheme yields a sequence of digits ranging from 0 to 7, so each character is now represented as a string. The strings obtained from the training set are stored in a trie. An extracted, unclassified character is likewise converted to a string and searched for in the trie. Since characters may appear in different orientations, the search uses approximate string matching to tolerate noisy characters of differing orientation. For approximate string matching we use a look-ahead branch-and-bound scheme to prune paths and make the approximation accurate and efficient. Because a trie data structure is used, lookup time is largely uniform and does not depend on the number of stored strings. In our experiments, all noiseless (printed) characters were recognized successfully. When tested with different variations of the characters, most were detected, except for some noisy characters.
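The chain-coding step described above can be sketched as follows: each move between successive boundary pixels maps to a digit 0–7 (0 = east, then counter-clockwise). Boundary extraction and size normalization are omitted here; the input is assumed to be an ordered list of pixel (x, y) coordinates with y increasing upward.

```python
# Freeman 8-direction codes: 0 = E, 1 = NE, 2 = N, 3 = NW,
#                            4 = W, 5 = SW, 6 = S, 7 = SE
_DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
         (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(boundary):
    """Freeman chain code of an ordered boundary pixel sequence."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in _DIRS:
            raise ValueError(f"non-adjacent boundary pixels: {step}")
        codes.append(_DIRS[step])
    return codes
```

Tracing a unit square counter-clockwise from the origin yields the code [0, 2, 4, 6]; the resulting digit string is what gets stored in, and matched against, the trie.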