## Similarity of Attributes by External Probes (1997)


Venue: | In Knowledge Discovery and Data Mining |

Citations: | 37 - 7 self |

### BibTeX

@INPROCEEDINGS{Das97similarityof,

author = {Gautam Das and Heikki Mannila and Pirjo Ronkainen},

title = {Similarity of Attributes by External Probes},

booktitle = {In Knowledge Discovery and Data Mining},

year = {1997},

pages = {23--29},

publisher = {AAAI Press}

}

### Abstract

In data mining, similarity or distance between attributes is one of the central notions. Such a notion can be used to build attribute hierarchies etc. Similarity metrics can be user-defined, but an important problem is defining similarity on the basis of data. Several methods based on statistical techniques exist. For defining the similarity between two attributes A and B they typically consider only the values of A and B, not the other attributes. We describe how a similarity notion between attributes can be defined by considering the values of other attributes. The basic idea is that in a 0/1 relation r, two attributes A and B are similar if the subrelations $\sigma_{A=1}(r)$ and $\sigma_{B=1}(r)$ are similar. Similarity between the two relations is defined by considering the marginal frequencies of a selected subset of other attributes. We show that the framework produces natural notions of similarity. Empirical results on the Reuters-21578 document dataset show, for example, how natural classif...
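The idea in the abstract can be illustrated with a minimal Python sketch. It assumes a 0/1 relation stored as a list of dictionaries, and it aggregates the probe frequencies by summing absolute differences; that aggregation, the function names, and the toy data are illustrative assumptions, not details stated in the abstract.

```python
from typing import Dict, List, Sequence

Row = Dict[str, int]  # one row of a 0/1 relation: attribute name -> 0 or 1


def subrelation(rows: List[Row], attr: str) -> List[Row]:
    """Rows of r where attr = 1, i.e. the selection sigma_{attr=1}(r)."""
    return [t for t in rows if t[attr] == 1]


def frequency(rows: List[Row], attr: str) -> float:
    """Marginal frequency of attr = 1 in the given rows (0.0 if there are no rows)."""
    if not rows:
        return 0.0
    return sum(t[attr] for t in rows) / len(rows)


def external_distance(rows: List[Row], a: str, b: str, probes: Sequence[str]) -> float:
    """Assumed aggregation: sum over probes of |fr(r_A, D) - fr(r_B, D)|."""
    r_a = subrelation(rows, a)
    r_b = subrelation(rows, b)
    return sum(abs(frequency(r_a, d) - frequency(r_b, d)) for d in probes)


# Hypothetical toy relation, invented for this sketch.
ROWS = [
    {"Pepsi": 1, "Coke": 0, "Beer": 0, "Chips": 1, "Mustard": 0},
    {"Pepsi": 0, "Coke": 1, "Beer": 0, "Chips": 1, "Mustard": 0},
    {"Pepsi": 1, "Coke": 0, "Beer": 0, "Chips": 1, "Mustard": 1},
    {"Pepsi": 0, "Coke": 1, "Beer": 0, "Chips": 1, "Mustard": 1},
    {"Pepsi": 0, "Coke": 0, "Beer": 1, "Chips": 0, "Mustard": 1},
    {"Pepsi": 0, "Coke": 0, "Beer": 1, "Chips": 0, "Mustard": 1},
]

PROBES = ["Chips", "Mustard"]
print(external_distance(ROWS, "Pepsi", "Coke", PROBES))
print(external_distance(ROWS, "Pepsi", "Beer", PROBES))
```

In this toy data Pepsi and Coke never occur in the same row, yet their buyers have identical probe frequencies, so the external distance between them is zero. A measure that looks only at the two columns themselves would not capture this.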

### Citations

2430 | Mining Association Rules between Sets of Items in Large Databases - Agrawal, Imielinski, et al. |

2143 | Algorithms for Clustering Data - Jain, Dubes - 1988 |

Citation Context: ... external distances. To further illustrate the behavior of the external and internal distance functions, we clustered the 14 countries using a standard agglomerative hierarchical clustering algorithm (Jain & Dubes 1988; Kaufman & Rousseeuw 1990). As a distance between clusters, we used the minimum distance between the elements of the clusters. Figure 4 shows two clusterings produced by using $d_{fr,P}$, as well as a ... |

1324 | Finding Groups in Data: An Introduction to Cluster Analysis - Kaufman, Rousseeuw - 1990 |

Citation Context: ... To further illustrate the behavior of the external and internal distance functions, we clustered the 14 countries using a standard agglomerative hierarchical clustering algorithm (Jain & Dubes 1988; Kaufman & Rousseeuw 1990). As a distance between clusters, we used the minimum distance between the elements of the clusters. Figure 4 shows two clusterings produced by using $d_{fr,P}$, as well as a clustering produced by usi... |
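The clustering step described in this context, agglomerative clustering with the minimum element-to-element distance between clusters (single linkage), can be sketched as follows. This is a naive illustration under assumed names (`single_linkage`, `dist`, `num_clusters`); the stopping rule based on a target number of clusters is an assumption, since the context does not say how the hierarchy was cut.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Iterable, List


def single_linkage(
    items: Iterable[Hashable],
    dist: Callable[[Hashable, Hashable], float],
    num_clusters: int,
) -> List[FrozenSet[Hashable]]:
    """Merge the two closest clusters until num_clusters remain.

    The distance between two clusters is the minimum distance between
    any pair of their elements.
    """
    clusters = [frozenset([x]) for x in items]

    def cluster_distance(c1: FrozenSet, c2: FrozenSet) -> float:
        return min(dist(x, y) for x in c1 for y in c2)

    while len(clusters) > max(num_clusters, 1):
        # Find the closest pair of clusters (first such pair on ties).
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 | c2)
    return clusters


# Hypothetical usage with points on a line and absolute difference as distance.
result = single_linkage([0, 1, 5, 6, 20], lambda x, y: abs(x - y), 3)
print(sorted(sorted(c) for c in result))
```

Any attribute distance, such as an external distance computed from probe attributes, could be passed as `dist` in place of the absolute difference used here.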

490 | Beyond market basket: generalizing association rules to correlations - Brin, Motwani, et al. |

472 | Fast discovery of association rules - Agrawal, Mannila, et al. - 1996 |

Citation Context: ... to compute all the necessary counts. In fact, computing these counts is a special case of computing all the frequent sets that arises in association rule discovery (Agrawal, Imielinski, & Swami 1993; Agrawal et al. 1996). If we are not interested in probe attributes of small frequency, we can use variations of the Apriori (Agrawal et al. 1996) algorithm. This method is fast and scales nicely to very large data sets.... |
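As a rough illustration of why such counts are cheap to obtain, the sketch below gathers single-attribute and pairwise co-occurrence counts in one scan of a 0/1 relation and derives the frequency of a probe D within the rows where A = 1. It is not Apriori and applies no frequency threshold; the function names and the row representation are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Row = Dict[str, int]  # attribute name -> 0 or 1


def count_singles_and_pairs(rows: List[Row]) -> Tuple[Counter, Counter]:
    """One scan over the rows: counts of single attributes and of attribute pairs."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for t in rows:
        present = sorted(a for a, v in t.items() if v == 1)
        single.update(present)
        pair.update(combinations(present, 2))  # pairs are stored in sorted order
    return single, pair


def frequency_in_subrelation(single: Counter, pair: Counter, a: str, d: str) -> float:
    """Fraction of rows with a = 1 that also have d = 1, from the counts alone."""
    if single[a] == 0:
        return 0.0
    key = tuple(sorted((a, d)))
    return pair[key] / single[a]


# Hypothetical usage.
rows = [{"A": 1, "D": 1}, {"A": 1, "D": 0}, {"A": 0, "D": 1}]
single, pair = count_singles_and_pairs(rows)
print(frequency_in_subrelation(single, pair, "A", "D"))
```

Counting every pair is quadratic in the number of attributes present in a row, which is where pruning infrequent attributes, as Apriori-style methods do, would help.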

453 | Mining generalized association rules - Agrawal, Srikant - 1995 |

Citation Context: ...s or clusters of attributes. A hierarchy can itself give useful insight into the structure of the data, and hierarchies can also be used to produce more abstract rules etc. (Han, Cai, & Cercone 1992; Srikant & Agrawal 1995). Typically, one assumes that the hierarchy is given by a domain expert. This is indeed a good solution, but in several cases the data can be such that no domain expert is available, and hence there ... |

413 | Efficient similarity search in sequence databases - Agrawal, Faloutsos, et al. - 1993 |

372 | Empirical Methods for Artificial Intelligence - Cohen - 1995 |

Citation Context: ... difference between $f_i(s_A)$ and $f_i(s_B)$ is at least as big as the difference between $f_i(r_A)$ and $f_i(r_B)$. Such randomization methodology is currently fairly widely used in statistics (see [6] for a very readable introduction). Here we choose the functions $f_i$ to measure the frequency of a certain probe variable. That is, we select a set $P \subseteq R$ of probe attributes, and measure for each $D \in$ ... |

300 | Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery - Mannila, Toivonen, et al. - 1997 |

286 | Fast discovery of association rules, Advances in Knowledge Discovery and Data Mining - Agrawal, Mannila, et al. - 1996 |

Citation Context: ...database suffices to compute all the necessary counts. In fact, computing these counts is a special case of the problem of computing all the frequent sets that arises in association rule discovery [2, 4]. If we are not interested in probe sets of small frequency, we can use variations of the Apriori [4] algorithm for the computations needed. This method is fast and scales nicely to very large data se... |

198 | Fast similarity search in the presence of noise, scaling, and translation in time-series databases - Agrawal, Lin, et al. - 1995 |

Citation Context: ...tly, there has been considerable interest into defining intuitive and easily computable measures of similarity between complex objects and into using abstract similarity notions in querying databases [1, 3, 8, 11, 14, 19, 21]. Two typical examples of data sets are shown in Figure 1.

| Row ID | Chips | Mustard | Sausage | Pepsi | Coca-Cola | Miller | Bud |
|---|---|---|---|---|---|---|---|
| $t_1$ | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| $t_2$ | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| $t_3$ | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| $t_4$ | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| $t_5$ | 0 | 1 | 1 | 1 | 0 | ... | ... |

169 | Measures of Association for Cross Classification - Goodman, Kruskal - 1954 |

Citation Context: ...ations, we can express the sufficient statistics for any internal measure by the familiar 2-by-2 contingency table. We can measure the strength of association between A and B in numerous ways; see (Goodman & Kruskal 1979) for a compendium of methods. Possibilities include the $\chi^2$ test statistic, which measures the deviation of the observed values from the expected values under the assumption of independence. There exis... |

151 | Knowledge discovery in databases: An attribute oriented approach - Han, Cai, et al. - 1992 |

137 | On similarity-based queries for time series data - Rafiei - 1999 |

Citation Context: ...nto using abstract similarity notions in querying databases (Agrawal, Faloutsos, & Swami 1993; Agrawal et al. 1995; Goldin & Kanellakis 1995; Jagadish, Mendelzon, & Milo 1995; Knobbe & Adriaans 1996; Rafiei & Mendelzon 1997; White & Jain 1996). A typical data set is shown in Figure 1. In this example, market basket data, the data objects represent customers in the supermarket, and the columns represent different product... |

103 | Distance measures for signal processing and pattern recognition - Basseville - 1989 |

Citation Context: ...frequency of $x$ in the relation $r_A$. One widely used distance notion between distributions is the Kullback-Leibler distance (also known as relative entropy or cross entropy) (Kullback & Leibler 1951; Basseville 1989): $re(g_A, g_B) = \sum_x g_A(x) \log \frac{g_A(x)}{g_B(x)}$, or the symmetrized version of it: $re(g_A, g_B) + re(g_B, g_A)$. The problem with this measure is that the sum has $2^{|P|}$ elements, so direct compu... |
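The relative entropy in this context can be computed directly when the probe set is small. The sketch below builds the empirical distribution over probe value combinations for a set of rows and evaluates the relative entropy and its symmetrized version. Returning infinity when a combination has zero frequency in the second distribution is a convention chosen for this example, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, int]
Dist = Dict[Tuple[int, ...], float]


def distribution(rows: List[Row], probes: Sequence[str]) -> Dist:
    """Relative frequency of each observed combination of probe values."""
    counts = Counter(tuple(t[d] for d in probes) for t in rows)
    n = len(rows)
    return {x: c / n for x, c in counts.items()}


def relative_entropy(g_a: Dist, g_b: Dist) -> float:
    """Sum over x of g_a(x) * log(g_a(x) / g_b(x)), with natural logarithms."""
    total = 0.0
    for x, p in g_a.items():
        q = g_b.get(x, 0.0)
        if q == 0.0:
            return math.inf  # assumed convention: undefined ratio treated as infinite
        total += p * math.log(p / q)
    return total


def symmetrized(g_a: Dist, g_b: Dist) -> float:
    """Symmetrized version: re(g_a, g_b) + re(g_b, g_a)."""
    return relative_entropy(g_a, g_b) + relative_entropy(g_b, g_a)


# Hypothetical usage with a single probe attribute.
g_a = distribution([{"D": 1}, {"D": 0}], ["D"])
g_b = distribution([{"D": 1}, {"D": 1}, {"D": 1}, {"D": 0}], ["D"])
print(relative_entropy(g_a, g_b), symmetrized(g_a, g_b))
```

Only combinations that actually occur are stored, but with many probes most of the $2^{|P|}$ possible combinations may be unobserved in one of the two subrelations, which makes the ratio undefined.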

99 | On similarity queries for time-series data: constraint specification and implementation - Goldin, Kanellakis - 1995 |

Citation Context: ...intuitive and easily computable measures of similarity between complex objects and into using abstract similarity notions in querying databases (Agrawal, Faloutsos, & Swami 1993; Agrawal et al. 1995; Goldin & Kanellakis 1995; Jagadish, Mendelzon, & Milo 1995; Knobbe & Adriaans 1996; Rafiei & Mendelzon 1997; White & Jain 1996). A typical data set is shown in Figure 1. In this example, market basket data, the data objects ... |

70 | Algorithms and strategies for similarity retrieval - White, Jain - 1996 |

Citation Context: ...tly, there has been considerable interest into defining intuitive and easily computable measures of similarity between complex objects and into using abstract similarity notions in querying databases [1, 3, 8, 11, 14, 19, 21]. Two typical examples of data sets are shown in Figure 1.

| Row ID | Chips | Mustard | Sausage | Pepsi | Coca-Cola | Miller | Bud |
|---|---|---|---|---|---|---|---|
| $t_1$ | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| $t_2$ | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| $t_3$ | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| $t_4$ | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| $t_5$ | 0 | 1 | 1 | 1 | 0 | ... | ... |

66 | Similarity-Based Queries - Jagadish, Mendelzon, et al. - 1995 |

49 | Knowledge discovery from telecommunication network alarm databases - Hätönen, Klemettinen, et al. - 1996 |

Citation Context: ... [Figure 5: Clustering of the courses produced by the minimum distance clustering criterion.] ... alarm sequences is quite important. In this application, data mining methods have been shown to be useful (Hätönen et al. 1996). Here we describe how our external distance measures can be used to detect similarities between alarms. We analyzed a sequence of 58616 alarms with associated occurrence times from a telephone excha... |

20 | Relaxing the triangle inequality in pattern matching - Fagin, Stockmeyer - 1998 |

Citation Context: ...iangle inequality holds to within a constant multiplicative factor. There also are some interesting applications in which the normal triangle inequality does not hold, but the weakened form does hold [7]. Let $0 < e < 1$. Generate a new attribute $B_e$, and for a row $t \in r$ assign a value to $t(B_e)$ as follows. If $t(A) = 1$, then $t(B_e) = 1$ with probability $e$, and $t(B_e) = 0$ otherwise; if $t(A) = 0$, th... |

8 | Linkage disequilibrium measures for fine-scale mapping: a comparison - Guo - 1997 |

Citation Context: ...nt that $\chi^2$, or some simple function of it, is an appropriate measure of degree of association." One well-known problem with the $\chi^2$ measure is that it is very sensitive to cells with small counts; see (Guo 1997) for a discussion of the same problems in the context of medical genetics. An alternative is to use the term from the relative entropy formula: $F(A, B, D) = fr(r_A, D) \log(fr(r_A, D) / fr(r_B, D \ldots$ |

8 | Theoretical Epidemiology - Miettinen - 1985 |

Citation Context: ... D). We can measure how different the frequency of D is in relations $r_A$ and $r_B$. A simple test for this is to use the $\chi^2$ test statistic for two proportions, as is widely done in, e.g., epidemiology (Miettinen 1985), and also in data mining (Brin, Motwani, & Silverstein 1997). Given a probe variable D, the value of the test statistic $F(A, B, D)$ is after some simplifications $(fr(r_A, D) - fr(r_B, D))^2 \ldots$ |
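Because the simplified formula is cut off in this excerpt, the sketch below uses the standard textbook form of the $\chi^2$ statistic for comparing two proportions (squared difference over the pooled variance estimate). This is offered as an assumption about the kind of test meant, not as the paper's exact expression; the function and parameter names are hypothetical.

```python
def chi_square_two_proportions(k_a: int, n_a: int, k_b: int, n_b: int) -> float:
    """Pooled chi-square statistic for two proportions k_a/n_a and k_b/n_b.

    k_a: rows with D = 1 in the subrelation for A; n_a: its total row count
    (likewise k_b, n_b for B). Textbook form, assumed here for illustration.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both subrelations must be non-empty")
    p_a = k_a / n_a
    p_b = k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        return 0.0  # D is always 0 or always 1 in both subrelations
    return (p_a - p_b) ** 2 / variance


# Hypothetical counts: probe D occurs in 30 of 100 rows with A = 1
# and in 50 of 100 rows with B = 1.
print(chi_square_two_proportions(30, 100, 50, 100))
```

The result is compared against a $\chi^2$ distribution with one degree of freedom; larger values indicate that the probe's frequency differs more between the two subrelations.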

7 | Analysing binary associations - Knobbe, Adriaans - 1996 |

Citation Context: ...en complex objects and into using abstract similarity notions in querying databases (Agrawal, Faloutsos, & Swami 1993; Agrawal et al. 1995; Goldin & Kanellakis 1995; Jagadish, Mendelzon, & Milo 1995; Knobbe & Adriaans 1996; Rafiei & Mendelzon 1997; White & Jain 1996). A typical data set is shown in Figure 1. In this example, market basket data, the data objects represent customers in the supermarket, and the columns re... |

5 | On information and sufficiency - Kullback, Leibler - 1951 |

Citation Context: ...e $g_A(x)$ is the relative frequency of $x$ in the relation $r_A$. One widely used distance notion between distributions is the Kullback-Leibler distance (also known as relative entropy or cross entropy) (Kullback & Leibler 1951; Basseville 1989): $re(g_A, g_B) = \sum_x g_A(x) \log \frac{g_A(x)}{g_B(x)}$, or the symmetrized version of it: $re(g_A, g_B) + re(g_B, g_A)$. The problem with this measure is that the sum has $2^{|P|}$ elements... |

2 | The Reuters-21578, distribution 1.0 - Lewis - 1997 |

Citation Context: ...n. Thus we can continue the clustering without having to look at the original data again. Experimental results We have used three data sets in our experiments: the so-called Reuters-21578 collection (Lewis 1997) of newswire articles, a database about students and courses at the Computer Science Department of the University of Helsinki, and telecommunication alarm sequence data. Documents and keywords The da... |