## N-gram similarity and distance (2005)

Venue: Proc. Twelfth Int’l Conf. on String Processing and Information Retrieval

Citations: 15 (0 self)

### BibTeX

@INPROCEEDINGS{Kondrak05n-gramsimilarity,

author = {Grzegorz Kondrak},

title = {N-gram similarity and distance},

booktitle = {Proc. Twelfth Int’l Conf. on String Processing and Information Retrieval},

year = {2005},

pages = {115--126}

}

### Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.

### Citations

656 | The string-to-string correction problem
- Wagner, Fischer
- 1974
Citation Context ... s: $s(X,Y) = s(\Gamma_{k,l}) = \max\big(s(\Gamma_{k-1,l}),\ s(\Gamma_{k,l-1}),\ s(\Gamma_{k-1,l-1}) + s(x_k, y_l)\big)$ The alternative formulation directly yields the well-known efficient dynamic-programming algorithm for computing the length of the LCS [14]. 2.5 Beyond Unigram Similarity The main weakness of the LCS length as a measure of string similarity is its insensitivity to context. The problem is illustrated in Figure 2. The two word pairs on the...
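The recurrence quoted in this context can be sketched as a short dynamic program. This is a minimal illustration, not the paper's own code: the function name `lcs_length` is hypothetical, and it assumes that the single-symbol score s(x_k, y_l) is 1 when the two symbols are identical and 0 otherwise, a base case the excerpt does not show.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y.

    Implements s(G[k,l]) = max(s(G[k-1,l]), s(G[k,l-1]), s(G[k-1,l-1]) + s(x_k, y_l)),
    assuming s(x_k, y_l) is 1 for identical symbols and 0 otherwise.
    """
    # prev[j] holds the score for the prefix pair (x[:i-1], y[:j]).
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]  # an empty prefix of y has similarity 0
        for j, yj in enumerate(y, start=1):
            match = 1 if xi == yj else 0
            curr.append(max(prev[j], curr[j - 1], prev[j - 1] + match))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Only two rows of the table are kept, so memory use is linear in the length of the second string while running time stays proportional to the product of the two lengths.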

382 | Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison
- Sankoff, Kruskal
- 1983
Citation Context ...ral. The problem of measuring string similarity occurs in a variety of fields, including bioinformatics, speech recognition, information retrieval, machine translation, lexicography, and dialectology [9]. A related issue of computing the similarity of texts as strings of words has also been studied. Numerous string similarity measures have been proposed. A particularly widely-used method is edit dist...

153 | Approximate string matching with q-grams and maximal matches
- Ukkonen
- 1992
Citation Context ...evelop a notion of n-gram similarity and distance.¹ We show that edit distance and the length of the LCS are special cases of n-gram [Footnote 1: This is a different concept from the q-gram similarity/distance [12], which is simply the number of common/distinct q-grams (n-grams) between two strings.] M. Consens and G. Navarro (Eds.): SPIRE 2005, LNCS 3772, pp. 115–126, 2005. © Springer-Verlag Berlin Heidelberg ...

80 | Bitext maps and alignment via pattern recognition
- Melamed
- 1999
Citation Context ...normalized variant of the LCS is usually preferred. The longest common subsequence ratio (LCSR) is computed by dividing the length of the longest common subsequence by the length of the longer string [8]. Edit distance is often normalized in a similar way, i.e. by the length of the longer string (e.g., [5]). However, Marzal and Vidal [6] propose instead to normalize by the length of the editing path ...
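The LCSR described in this context is the LCS length divided by the length of the longer string. The sketch below is illustrative only: the names `lcs_length` and `lcsr` are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid division by zero, not something the excerpt specifies.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            match = 1 if xi == yj else 0
            curr.append(max(prev[j], curr[j - 1], prev[j - 1] + match))
        prev = curr
    return prev[-1]


def lcsr(x: str, y: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer string's length."""
    longer = max(len(x), len(y))
    if longer == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return lcs_length(x, y) / longer


if __name__ == "__main__":
    print(lcsr("colour", "color"))
```

Dividing by the longer length keeps the score in the range 0 to 1, so word pairs of different lengths can be compared on a common scale.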

65 | Computation of normalized edit distance and applications
- Marzal, Vidal
- 1993
Citation Context ...e longest common subsequence by the length of the longer string [8]. Edit distance is often normalized in a similar way, i.e. by the length of the longer string (e.g., [5]). However, Marzal and Vidal [6] propose instead to normalize by the length of the editing path between strings, which requires a somewhat more complex algorithm. We refer to these two variants of Normalized Edit Distance as NED1 an...
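The simpler of the two normalizations mentioned here, dividing edit distance by the length of the longer string, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, unit costs for insertion, deletion, and substitution are an assumption, and the editing-path normalization of Marzal and Vidal is deliberately not attempted because the excerpt does not describe its algorithm.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit costs (an assumption) for insert, delete, substitute."""
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # deleting i symbols of x to reach the empty prefix of y
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_normalized_edit_distance(x: str, y: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longer = max(len(x), len(y))
    if longer == 0:
        return 0.0  # assumption: two empty strings are at distance zero
    return edit_distance(x, y) / longer


if __name__ == "__main__":
    print(length_normalized_edit_distance("kitten", "sitting"))
```

With unit costs the distance never exceeds the longer string's length, so the normalized value falls between 0 and 1.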

52 | Longest common subsequences of two random sequences
- Chvatal, Sankoff
- 1975
Citation Context ...s our focus here, only the length of the LCS is important; the actual longest common subsequence is irrelevant. The length of the LCS as a function of two strings is an interesting function in itself [2]. 2.2 Recursive Definition We propose the following formal, recursive definition of the function $s(X,Y)$, which is equivalent to the length of the LCS. Let $X = x_1 \ldots x_k$ and $Y = y_1 \ldots y_l$ be strings of ...

42 | Manual annotation of translational equivalence: The Blinker project
- Melamed
- 1998
Citation Context ...gnate identification in bitext-related tasks (e.g., [8]). In the second experiment, we used Blinker, a word-aligned French-English bitext containing translations of 250 sentences taken from the Bible [7]. For the evaluation, we manually identified all cognate pairs in the bitext, using word alignment links as clues. The candidate set of pairs was generated by taking a Cartesian product of words in co...

33 | Word-pair extraction for lexicography
- Brew, McKelvie
- 1996
Citation Context ...s are a stronger indication of similarity than identity matches that are separated by unmatched symbols. A family of similarity measures that do take context into account is based on Dice coefficient [1]. The measures are defined as the ratio of the number of n-grams that are shared by two strings and the total number of n-grams in both strings: 2 × |n-grams(X) ∩ n-grams(Y)| / (|n-grams(X)| + |n-grams(Y...
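The Dice-based measure described in this context can be sketched in a few lines. This is an illustration under stated assumptions rather than the cited authors' implementation: the names `ngrams` and `dice` are hypothetical, n-grams are counted as a multiset (so repeated n-grams count separately), no padding symbols are added at word boundaries, and strings too short to contain any n-gram score 0.0.

```python
from collections import Counter


def ngrams(s: str, n: int) -> list[str]:
    """All contiguous substrings of length n, in order (no boundary padding)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def dice(x: str, y: str, n: int = 2) -> float:
    """Twice the number of shared n-grams over the total number of n-grams."""
    gx, gy = Counter(ngrams(x, n)), Counter(ngrams(y, n))
    total = sum(gx.values()) + sum(gy.values())
    if total == 0:
        return 0.0  # assumption: no n-grams on either side means no evidence of similarity
    shared = sum((gx & gy).values())  # multiset intersection
    return 2 * shared / total


if __name__ == "__main__":
    print(dice("night", "nacht"))
```

Because the measure only counts shared n-grams, it ignores where in each string they occur, which is one way it differs from alignment-based measures such as the LCS length.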

28 | Computing Patterns in Strings
- Smyth
- 2003
Citation Context ... $y_1 \ldots y_l$ be strings of length $k$ and $l$, respectively, composed of symbols of a finite alphabet. In order to simplify the formulas, we introduce the following notational shorthand, borrowed from Smyth [10]. Let $\Gamma_{i,j} = (x_1 \ldots x_i,\ y_1 \ldots y_j)$ be a pair of prefixes of $X$ and $Y$, and $\Gamma^{*}_{i,j} = (x_{i+1} \ldots x_k,\ y_{j+1} \ldots y_l)$ a pair of suffixes of $X$ and $Y$. For strings of length one or less...

8 | A Computer-Generated Dictionary of Proto-Algonquian. Canadian Museum of Civilization
- Hewson
- 1993
Citation Context ...ty an important clue for their identification. In the first experiment, we extracted all nouns from two machine-readable word lists that had been used to produce an Algonquian etymological dictionary [4]. The two sets contain 1628 Cree nouns and 1023 Ojibwa nouns, respectively. The set C of candidate pairs was created by generating all possible Cree-Ojibwa pairs (a Cartesian product). An electronic v...

5 | Similarity as a risk factor in drug-name confusion errors: The look-alike (orthographic) and sound-alike (phonetic) model. Med Care
- BL, Lin, et al.
- 1999
Citation Context ...ted by dividing the length of the longest common subsequence by the length of the longer string [8]. Edit distance is often normalized in a similar way, i.e. by the length of the longer string (e.g., [5]). However, Marzal and Vidal [6] propose instead to normalize by the length of the editing path between strings, which requires a somewhat more complex algorithm. We refer to these two variants of Nor...