## N-gram similarity and distance (2005)

Venue: Proc. Twelfth Int’l Conf. on String Processing and Information Retrieval

Citations: 15 (0 self)

### BibTeX

@INPROCEEDINGS{Kondrak05n-gramsimilarity,

author = {Grzegorz Kondrak},

title = {N-gram similarity and distance},

booktitle = {Proc. Twelfth Int’l Conf. on String Processing and Information Retrieval},

year = {2005},

pages = {115--126}

}

### Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
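For orientation, the two unigram measures that the abstract names as special cases can be sketched with standard dynamic programming. This is a minimal illustration of textbook edit distance and longest-common-subsequence length only; it is not the paper's n-gram algorithm, and the function names `edit_distance` and `lcs_length` are chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit (Levenshtein) distance between x and y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("kitten", "sitting"))
```

Both functions keep only two rows of the table, so they use memory linear in the length of the second string while running in time proportional to the product of the two lengths.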

