Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation
- IEEE Transactions on Knowledge and Data Engineering
, 2006
Cited by 54 (12 self)
Abstract: This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted and undirected graph. It is based on a Markov-chain model of random walk through the database. More precisely, we compute quantities (the average commute time, the pseudoinverse of the Laplacian matrix of the graph, etc.) that provide similarities between any pair of nodes, having the nice property of increasing when the number of paths connecting those elements increases and when the “length” of paths decreases. It turns out that the square root of the average commute time is a Euclidean distance and that the pseudoinverse of the Laplacian matrix is a kernel matrix (its elements are inner products closely related to commute times). A principal component analysis (PCA) of the graph is introduced for computing the subspace projection of the node vectors in a manner that preserves as much variance as possible in terms of the Euclidean commute-time distance. This graph PCA provides a nice interpretation of the “Fiedler vector,” widely used for graph partitioning. The model is evaluated on a collaborative-recommendation task where suggestions are made about which movies people should watch based upon what they watched in the past. Experimental results on the MovieLens database show that the Laplacian-based similarities perform well in comparison with other methods. The model, which nicely fits into the so-called “statistical relational learning” framework, could also be used to compute document or word similarities, and, more generally, it could be applied to machine-learning and pattern-recognition tasks involving a relational database. Index Terms: Graph analysis, graph and database mining, collaborative recommendation, graph kernels, spectral clustering, Fiedler vector, proximity measures, statistical relational learning.
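The link between the Laplacian pseudoinverse and the average commute time mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the dense list-of-lists graph representation, and the Gauss-Jordan inversion are assumptions chosen to keep the example self-contained. It relies on the standard identity n(i, j) = V_G (l+_ii + l+_jj - 2 l+_ij), where V_G is the sum of all node degrees, and on the fact that for a connected graph L+ = (L + J/n)^-1 - J/n, with J the all-ones matrix.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix; the graph may be disconnected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian_pseudoinverse(weights):
    """Pseudoinverse of L = D - A for a connected, undirected, weighted graph.

    Assumes `weights` is a symmetric adjacency matrix (hypothetical input format).
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    shifted = [[(degree[i] if i == j else 0.0) - weights[i][j] + 1.0 / n
                for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    return [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]


def commute_times(weights):
    """Matrix of average commute times between every pair of nodes."""
    n = len(weights)
    volume = sum(sum(row) for row in weights)  # sum of all node degrees
    lp = laplacian_pseudoinverse(weights)
    return [[volume * (lp[i][i] + lp[j][j] - 2.0 * lp[i][j]) for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2 with unit weights.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
ct = commute_times(path)
```

On the three-node path graph, the end nodes are farther apart in commute time than adjacent nodes, which matches the abstract's remark that the measure grows with path length.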
ADMIT: Anomaly-based Data Mining for Intrusions
Cited by 31 (1 self)
Security of computer systems is essential to their acceptance and utility. Computer security analysts use intrusion detection systems to assist them in maintaining computer system security. This paper deals with the problem of differentiating between masqueraders and the true user of a computer terminal. Prior efficient solutions are less suited to real-time application, often require all training data to be labeled, and do not inherently provide an intuitive idea of what the data model means. Our system, called ADMIT, relaxes these constraints by creating user profiles using semi-incremental techniques. It is a real-time intrusion detection system with host-based data collection and processing. Our method also suggests ideas for dealing with concept drift and affords a detection rate as high as 80.3% and a false positive rate as low as 15.3%.
Spectral Imaging Target Development Based on Hierarchical Cluster Analysis
- in Proceedings of the 12th Color Imaging Conference
, 2004
Cited by 4 (2 self)
Agglomerative hierarchical cluster analysis was used to group similar spectra from a large database of samples. Based on angles between reflectance vectors of members of a cluster, a reflectance vector was selected as representative of that cluster. Representative samples were grouped together and stored as new calibration targets. Simulated wide-band imaging with glass filters was performed using these new calibration targets, and a transformation matrix from digital signals to reflectance was derived. Different verification targets were reconstructed using the transformation matrix; the spectral and colorimetric accuracy of the reconstruction was evaluated. It was shown that beyond a threshold number of samples in the calibration target, the performance of reconstruction became independent of the number of samples used in the calculation. The average spectral RMS for a calibration target consisting of 24 samples selected based on clustering was found to be less than 3.2% for
Preface
, 2006
Effective prediction of highway travel time is essential to many advanced traveler information and transportation management systems. This thesis proposes three different schemes for predicting highway travel time on a certain stretch of highway in Denmark: a linear model where the coefficients vary as smooth functions of the departure time, principal components regression, and partial least squares regression. The methods are straightforward to implement and applicable to different circumstances.
Testing Homogeneity in a Mixture Distribution via the L² Distance Between Competing Models
- Journal of the American Statistical Society
, 2004
Ascertaining the number of components in a mixture distribution is an interesting and challenging problem for statisticians. Chen, Chen, and Kalbfleisch (2001) recently proposed a modified likelihood ratio test (MLRT), which is distribution-free and locally most powerful, asymptotically. In this paper we present a new method for testing whether a finite mixture distribution is homogeneous. Our method, the D-test, is based on the L² distance between a fitted homogeneous model and a fitted heterogeneous model. For mixture components from standard distributions, our D-test statistic has closed-form expressions in terms of parameter estimates, whereas likelihood ratio-type test statistics do not. Thus, our test has potential for data mining applications. The convergence rate of the D-test statistic under a null hypothesis of homogeneity is established. The D-test is shown to be competitive with the MLRT when the mixture components are normal. The MLRT performs better for small sample sizes when the mixture components are exponential, but in this case there is little visual separation and, hence, little L² separation between the homogeneous and heterogeneous models. Thus, we propose that the measure underlying the L² distance be modified according to a suitable weight function, which is equivalent to transforming the data before applying the D-test. Such a modification produces a generalized D-test that is competitive in the aforementioned case. After applying our method to a data set in which the observations are measurements of firms' financial performances, we conclude with discussion and remarks.
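To make the "closed-form expressions" remark concrete for normal components, the following sketch computes the squared L² distance between two normal mixtures directly from their parameters. It is an illustration under stated assumptions, not the paper's D-test statistic: any scaling or normalization the authors apply is not reproduced, and the function names and the (weight, mean, variance) tuple format are hypothetical. The only fact used is that the integral of the product of two normal densities is itself a normal density evaluated at the difference of the means, with the variances summed.

```python
import math


def normal_product_integral(mu1, var1, mu2, var2):
    """Integral over x of N(x; mu1, var1) * N(x; mu2, var2).

    Equals the N(0, var1 + var2) density evaluated at mu1 - mu2.
    """
    s = var1 + var2
    return math.exp(-(mu1 - mu2) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)


def mixture_inner_product(f, g):
    """L2 inner product of two normal mixtures.

    Each mixture is a list of (weight, mean, variance) tuples (assumed format).
    """
    return sum(pf * pg * normal_product_integral(mf, vf, mg, vg)
               for pf, mf, vf in f
               for pg, mg, vg in g)


def l2_distance_squared(f, g):
    """Squared L2 distance between two normal mixtures, in closed form."""
    return (mixture_inner_product(f, f)
            - 2.0 * mixture_inner_product(f, g)
            + mixture_inner_product(g, g))


# A fitted homogeneous model versus a fitted two-component model
# (parameter values are made up for illustration).
homogeneous = [(1.0, 0.0, 1.0)]
heterogeneous = [(0.5, -1.5, 1.0), (0.5, 1.5, 1.0)]
separation = l2_distance_squared(homogeneous, heterogeneous)
```

No numerical integration is needed, which is the practical advantage the abstract points to; a weighted variant would replace the underlying measure and is not attempted here.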
Ranking and Selecting Clustering Algorithms Using a Meta-Learning Approach
Abstract: We present a novel framework that applies a meta-learning approach to clustering algorithms. Given a dataset, our meta-learning approach provides a ranking for the candidate algorithms that could be used with that dataset. This ranking could, among other things, support non-expert users in the algorithm selection task. In order to evaluate the proposed framework, we implement a prototype that employs regression support vector machines as the meta-learner. Our case study is developed in the context of cancer gene expression microarray datasets.
A Concurrent Object-Oriented Approach to the Eigenproblem Treatment in Shared Memory Multicore Environments
Abstract. This work presents an object-oriented approach to the concurrent computation of eigenvalues and eigenvectors of real symmetric and Hermitian matrices on present-day shared-memory multicore systems. This can be considered the lower-level step in a general framework for dealing with large eigenproblems, where the matrices are factorized to a small enough size. The results show that the proposed parallelization achieves a good speedup in actual systems with up to four cores. Also, it is observed that the limiting performance factor is the number of threads rather than the size of the matrix. We also find that a reasonable upper limit for a “small” dense matrix to be treated in actual processors is in the range 10000-30000.

