Results 1 - 3 of 3
The LikeIt Intelligent String Comparison Facility
NEC Research Institute, 1997
Cited by 16 (0 self)
A highly efficient ANSI-C facility is described for intelligently comparing a query string with a series of database strings. The bipartite weighted matching approach taken tolerates ordering violations that are problematic for simple automaton or string edit distance methods, yet common in practice. The method is character and polygraph based and does not require that words be properly formed in a query. Database characters are processed at a rate of approximately 2.5 million per second on a 200MHz Pentium Pro processor. A subroutine-level API is described along with a simple executable utility supporting both command-line and Web interfaces. An optimized Web interface is also reported, consisting of a daemon that preloads multiple databases and a corresponding CGI stub. The daemon may be initiated manually or via inetd. Keywords: String Comparison/Similarity, Text/Database Search/Retrieval, Bipartite Matching/Assignment, Edit Distance. Both authors are with the NEC Research I...
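To illustrate why a polygraph-based comparison tolerates ordering violations, here is a minimal Python sketch. It is not the LikeIt algorithm: it swaps the bipartite weighted matching described above for a much simpler overlap score (a Dice coefficient over character bigram multisets). The function names `polygraphs`, `polygraph_similarity`, and `rank` are hypothetical and do not come from the paper's API.

```python
from collections import Counter


def polygraphs(text, n=2):
    """Return the multiset of character n-grams (polygraphs) in text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def polygraph_similarity(query, candidate, n=2):
    """Dice coefficient over shared n-grams (a stand-in for weighted matching)."""
    q = polygraphs(query, n)
    c = polygraphs(candidate, n)
    total = sum(q.values()) + sum(c.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((q & c).values())
    return 2.0 * shared / total


def rank(query, database, n=2):
    """Return database strings ordered from most to least similar to query."""
    return sorted(
        database,
        key=lambda s: polygraph_similarity(query, s, n),
        reverse=True,
    )
```

Because the score depends only on which polygraphs occur, not where, a query such as "smith john" still ranks "john smith" above unrelated strings, even though swapping the two words costs a plain edit distance many operations.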
Association-Based Similarity Testing and Its Applications
2003
Cited by 1 (0 self)
This paper proposes a new similarity measure between basket datasets based on associations. The new measure is calculated from support counts using a formula inspired by information entropy. Experiments on both real and synthetic datasets show the effectiveness of the measure. The paper then investigates applications of the similarity measure. It first studies the problem of finding a mapping between categorical database attribute sets using similarity measures, and proposes a generic approach for identifying such a mapping. The approach is implemented using the similarity measure proposed in the paper, and its performance has been evaluated and validated. The paper also explores using the similarity measure to mine distributed datasets.
Testing the Model By Multivariate Analysis
1986
We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By "categorical data," we mean tables with fields that cannot be naturally ordered by a metric (e.g., the names of producers of automobiles, or the names of products offered by a manufacturer). Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems. We discuss experiments on a variety of tables of synthetic and real data; we find that our iterative methods converge quickly to prominently correlated values of various categorical fields.

1 Introduction
Much of the data in data warehouses is categorical: fields in tables whose attributes cannot naturally be ordered as numerical values can....
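The iterative weight propagation described in the abstract above can be sketched roughly as follows. The abstract does not give the update rule, so every specific here is an assumption: uniform starting weights, a plain sum to combine the weights of co-occurring values, and L2 normalization over all values after each pass. The function name `propagate_weights` is hypothetical.

```python
import math


def propagate_weights(rows, iterations=20):
    """Propagate weights among categorical values that co-occur in rows.

    rows: list of tuples; position j holds the value of field j.
    Returns a dict mapping (field_index, value) -> weight.
    """
    nodes = {(j, v) for row in rows for j, v in enumerate(row)}
    weights = {node: 1.0 for node in nodes}  # uniform start (assumption)
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for row in rows:
            keys = [(j, v) for j, v in enumerate(row)]
            row_total = sum(weights[k] for k in keys)
            for k in keys:
                # Each value receives the summed weight of the other
                # values in the same row (additive combiner, assumption).
                updated[k] += row_total - weights[k]
        norm = math.sqrt(sum(w * w for w in updated.values()))
        if norm == 0.0:
            return weights
        # Rescale so the squared weights sum to 1 (assumption).
        weights = {k: w / norm for k, w in updated.items()}
    return weights
```

With the rows `("Honda", "Civic")`, `("Honda", "Accord")`, and `("Ford", "Focus")`, the value `Honda` ends up with a larger weight than `Ford` because it co-occurs with more values; values that reinforce each other through shared rows accumulate weight together.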

