Results 1-10 of 74
Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation
IEEE Transactions on Knowledge and Data Engineering, 2006
Cited by 116 (14 self)
Abstract: This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted and undirected graph. It is based on a Markov-chain model of random walk through the database. More precisely, we compute quantities (the average commute time, the pseudoinverse of the Laplacian matrix of the graph, etc.) that provide similarities between any pair of nodes, having the nice property of increasing when the number of paths connecting those elements increases and when the “length” of paths decreases. It turns out that the square root of the average commute time is a Euclidean distance and that the pseudoinverse of the Laplacian matrix is a kernel matrix (its elements are inner products closely related to commute times). A principal component analysis (PCA) of the graph is introduced for computing the subspace projection of the node vectors in a manner that preserves as much variance as possible in terms of the Euclidean commute-time distance. This graph PCA provides a nice interpretation to the “Fiedler vector,” widely used for graph partitioning. The model is evaluated on a collaborative-recommendation task where suggestions are made about which movies people should watch based upon what they watched in the past. Experimental results on the MovieLens database show that the Laplacian-based similarities perform well in comparison with other methods. The model, which nicely fits into the so-called “statistical relational learning” framework, could also be used to compute document or word similarities, and, more generally, it could be applied to machine-learning and pattern-recognition tasks involving a relational database. Index Terms: Graph analysis, graph and database mining, collaborative recommendation, graph kernels, spectral clustering, Fiedler vector, proximity measures, statistical relational learning.
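As a rough illustration of the quantities named in this abstract, the sketch below computes the average commute time between every pair of nodes from the pseudoinverse of the graph Laplacian. It is not the authors' code. It assumes a connected, undirected graph given as a symmetric weight matrix without self-loops, and it relies on two standard identities: the pseudoinverse equals (L + ee^T/n)^-1 - ee^T/n for a connected graph, and the commute time between i and j is the graph volume times (l+_ii + l+_jj - 2 l+_ij). The function names `invert`, `laplacian_pseudoinverse`, and `commute_times` are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; is the graph connected?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian_pseudoinverse(weights):
    """Pseudoinverse of L = D - W, assuming the graph is connected."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    # Shift L by ee^T / n so that it becomes invertible.
    shifted = [[(degrees[i] if i == j else 0.0) - weights[i][j] + 1.0 / n
                for j in range(n)] for i in range(n)]
    inverse = invert(shifted)
    return [[inverse[i][j] - 1.0 / n for j in range(n)] for i in range(n)]


def commute_times(weights):
    """Average commute time between every pair of nodes."""
    n = len(weights)
    volume = sum(sum(row) for row in weights)  # sum of all node degrees
    lp = laplacian_pseudoinverse(weights)
    return [[volume * (lp[i][i] + lp[j][j] - 2.0 * lp[i][j]) for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2 with unit weights.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(commute_times(path)[0][2])
```

For this three-node path, the commute time between the two end nodes should come out close to 8: the volume is 4 and the effective resistance between the ends is 2. The square root of each entry gives the Euclidean commute-time distance the abstract refers to.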
Generalized Kernel Approach to Dissimilarity-based Classification
Journal of Machine Learning Research, 2001
Cited by 53 (2 self)
Usually, objects to be classified are represented by features. In this paper, we discuss an alternative object representation based on dissimilarity values. If such distances separate the classes well, the nearest neighbor method offers a good solution. However, dissimilarities used in practice are usually far from ideal, and the performance of the nearest neighbor rule suffers from its sensitivity to noisy examples. We show that other, more global classification techniques are preferable to the nearest neighbor rule in such cases. For classification purposes, two different ways of using generalized dissimilarity kernels are considered. In the first one, distances are isometrically embedded in a pseudo-Euclidean space and the classification task is performed there. In the second approach, classifiers are built directly on distance kernels. Both approaches are described theoretically and then compared using experiments with different dissimilarity measures and datasets, including degraded data simulating the problem of missing values.
Multidimensional Scaling
Handbook of Statistics, 2001
Cited by 33 (2 self)
... reflecting the importance or precision of dissimilarity δ_ij. 1. SOURCES OF DISTANCE DATA. Dissimilarity information about a set of objects can arise in many different ways. We review some of the more important ones, organized by scientific discipline. 1.1. Geodesy. The most obvious application, perhaps, is in sciences in which distance is measured directly, although generally with error. This happens, for instance, in triangulation in geodesy. We have measurements which are approximately equal to distances, either Euclidean or spherical, depending on the scale of the experiment. In other examples, measured distances are less directly related to physical distances. For example, we could measure airplane or road or train travel distances between different cities. Physical distance is usually not the only factor determining these types of dissimilarities.
Using the Triangle Inequality to Reduce the Number of Comparisons Required for Similarity-Based Retrieval
Proc. of SPIE/IS&T Conf. on Storage and Retrieval for Image and Video Databases IV, 1996
Cited by 27 (1 self)
Dissimilarity measures, the basis of similarity-based retrieval, can be viewed as a distance and a similarity-based search as a nearest neighbor search. Though there has been extensive research on data structures and search methods to support nearest-neighbor searching, these indexing and dimension-reduction methods are generally not applicable to non-coordinate data and non-Euclidean distance measures. In this paper we re-examine and extend previous work of other researchers on best match searching based on the triangle inequality. These methods can be used to organize both non-coordinate data and non-Euclidean metric similarity measures. The effectiveness of the indexes depends on the actual dimensionality of the feature set, data, and similarity metric used. We show that these methods provide significant performance improvements and may be of practical value in real-world databases. Keywords: image database indexing, similarity-based retrieval, best match searching, triangle inequali...
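The general idea of pruning comparisons with the triangle inequality can be sketched as follows. This is a minimal single-pivot variant chosen for illustration, not the indexing scheme from the paper: distances from one reference object (the pivot) to every database item are computed ahead of time, and at query time the bound |d(q, p) - d(p, x)| <= d(q, x) lets the search skip any item x that cannot beat the best match found so far. The function name `nearest_with_pivot` and the sample points are hypothetical, and `math.dist` (Python 3.8+) stands in for whatever metric a real system would use.

```python
import math


def nearest_with_pivot(query, items, pivot, pivot_dists, dist):
    """Nearest-neighbor search that skips items ruled out by the triangle inequality.

    pivot_dists[k] must equal dist(pivot, items[k]) and is assumed precomputed.
    Returns (best_item, best_distance, number_of_distance_computations).
    """
    d_qp = dist(query, pivot)
    computed = 1  # the query-to-pivot distance
    best_item, best_dist = None, float("inf")
    for item, d_px in zip(items, pivot_dists):
        # Lower bound on dist(query, item); valid for any metric.
        if abs(d_qp - d_px) >= best_dist:
            continue  # this item cannot be strictly closer than the current best
        d = dist(query, item)
        computed += 1
        if d < best_dist:
            best_item, best_dist = item, d
    return best_item, best_dist, computed


items = [(0, 0), (1, 5), (4, 4), (9, 1), (10, 10)]
pivot = (0, 0)
pivot_dists = [math.dist(pivot, x) for x in items]  # done once, offline

best, best_dist, computed = nearest_with_pivot((9, 2), items, pivot, pivot_dists, math.dist)
print(best, best_dist, computed)
```

The pruning is only sound when `dist` is a true metric, which is why the abstract restricts the approach to metric similarity measures. How many comparisons are saved depends on the data, the pivot, and the order in which items are visited.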
A shared task involving multi-label classification of clinical free text
In: Proceedings of ACL BioNLP, 2007
Cited by 23 (0 self)
This paper reports on a shared task involving the assignment of ICD-9-CM codes to radiology reports. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in the first freely distributable corpus of fully anonymized clinical text. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large and commercially significant set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.
GMD@CSB.DB: the Golm Metabolome Database
Bioinformatics, 2005
Cited by 21 (7 self)
Summary: Metabolomics, in particular gas chromatography–mass spectrometry (GC–MS) based metabolite profiling of biological extracts, is rapidly becoming one of the cornerstones of functional genomics and systems biology. Metabolite profiling has profound applications in discovering the mode of action of drugs or herbicides, and in unravelling the effect of altered gene expression on metabolism and organism performance in biotechnological applications. As such, the technology needs to be available to many laboratories. For this, an open exchange of information is required, like that already achieved for transcript and protein data. One of the key steps in metabolite profiling is the unambiguous identification of metabolites in highly complex metabolite preparations from biological samples. Collections of mass spectra, which comprise frequently observed metabolites of either known or unknown exact chemical structure, represent the most
Learning Metrics via Discriminant Kernels and Multidimensional Scaling: Toward Expected . . .
In Proceedings of the Twentieth International Conference on Machine Learning, 2003
Cited by 21 (2 self)
Distance-based methods in machine learning and pattern recognition have to rely on a metric distance between points in the input space. Instead of specifying a metric a priori, we seek to learn the metric from data via kernel methods and multidimensional scaling (MDS) techniques. Under the classification setting, we define discriminant kernels on the joint space of input and output spaces and present a specific family of discriminant kernels. This family of discriminant kernels is attractive because the induced metrics are Euclidean and Fisher separable, and MDS techniques can be used to find the low-dimensional Euclidean representations (also called feature vectors) of the induced metrics. Since the
Parametric Distance Metric Learning with Label Information
In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003
Cited by 21 (4 self)
Distance-based methods in pattern recognition and machine learning have to rely on a similarity or dissimilarity measure between patterns in the input space. For many applications, Euclidean distance in the input space is not a good choice and hence more complicated distance metrics have to be used. In this paper, we propose a parametric method for distance metric learning based on class label information. We first define a dissimilarity measure that can be proved to be a metric. It has the favorable property that between-class dissimilarity is always larger than within-class dissimilarity. We then perform parametric learning to find a regression mapping from the input space to a Euclidean feature space, such that the dissimilarity between patterns in the input space is approximated by the Euclidean distance between points in the feature space. Parametric learning is performed using the iterative majorization algorithm. Our method has been tested on some synthetic and real-world benchmark datasets. Experimental results show that this approach is promising.
Selectivity estimation for fuzzy string predicates in large data sets
In VLDB, 2005
Cited by 18 (6 self)
Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 4120964.” Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called Sepia, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of nonuniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.
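To make the estimated quantity concrete, the sketch below computes the exact selectivity of an edit-distance predicate by brute force. It is not Sepia: the clustering and histogram structures from the abstract are not reproduced here, only the edit distance function and the fraction that such an estimator would try to approximate without scanning every record. The function names `edit_distance` and `exact_selectivity`, the threshold parameter, and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def exact_selectivity(query, strings, threshold):
    """Fraction of strings within the given edit distance of the query."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= threshold)
    return matches / len(strings)


names = ["smith", "smyth", "smithe", "jones"]
print(exact_selectivity("smith", names, threshold=1))
```

The full scan costs one distance computation per record, which is the expense a query optimizer wants to avoid and the reason an estimator is worth building.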
The Proximity of an Individual to a Population With Applications in Discriminant Analysis
1995
Cited by 18 (10 self)
We develop a proximity function between an individual and a population from a distance between multivariate observations. We study some properties of this construction and apply it to a distance-based discrimination rule, which contains the classic linear discriminant function as a particular case. Additionally, this rule can be used advantageously for categorical or mixed variables, or in problems where a probabilistic model is not well determined. This approach is illustrated and compared with other classic procedures using four real data sets. Keywords: Categorical and mixed data; Distances between observations; Multidimensional scaling; Discrimination; Classification rules. AMS Subject Classification: 62H30. The authors thank M. Abrahamowicz, J. C. Gower and M. Greenacre for their helpful comments, and W. J. Krzanowski for providing us with a data set and his quadratic location model program. Work supported in part by CGYCIT grant PB930784. Authors' address: Departam...