Results 11–20 of 348
Graph mining: Laws, generators, and algorithms
 ACM Computing Surveys
, 2006
"... How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation i ..."
Abstract

Cited by 70 (6 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: "How can we generate synthetic but realistic graphs?" To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
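The survey's central question, generating synthetic but realistic graphs, is classically answered with preferential-attachment models, which reproduce the heavy-tailed degree distributions observed in real graphs. A minimal standard-library sketch (the function name and parameters are illustrative, not taken from the survey):

```python
# Minimal preferential-attachment generator (Barabasi-Albert style):
# each new node attaches m edges, preferring high-degree targets.
import random
from collections import Counter

def preferential_attachment(n, m=2, seed=0):
    rng = random.Random(seed)
    repeated = []   # node ids repeated in proportion to their degree
    edges = []
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            # Sample a target proportionally to degree; fall back to a
            # uniform pick among existing nodes before any edges exist.
            pick = rng.choice(repeated) if repeated else rng.randrange(new)
            chosen.add(pick)
        for t in chosen:
            edges.append((new, t))
            repeated += [new, t]
    return edges

edges = preferential_attachment(1000, m=2)
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
# Heavy tail: a few hubs, many low-degree nodes.
```

Sampling from the `repeated` list is the standard trick for degree-proportional selection without recomputing degree weights on every step.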
XClust: Clustering XML Schemas for Effective Integration
, 2002
"... It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for the integration of large numbers of Document Type Definitions (DTDs) of XML sources would be to first find clusters of DTDs that are similar in structure ..."
Abstract

Cited by 58 (1 self)
It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for the integration of large numbers of Document Type Definitions (DTDs) of XML sources would be to first find clusters of DTDs that are similar in structure and semantics. Reconciling similar DTDs within such a cluster will be an easier task than reconciling DTDs that are different in structure and semantics, as the latter would involve more restructuring. We introduce XClust, a novel integration strategy that involves the clustering of DTDs. A matching algorithm based on the semantics, immediate descendants, and leaf-context similarity of DTD elements is developed. Our experiments to integrate real-world DTDs demonstrate the effectiveness of the XClust approach.
Concept Formation in Structured Domains
, 1991
"... ions are made over the structural information (relations) ..."
Abstract

Cited by 52 (2 self)
... ions are made over the structural information (relations) ...
Combining multiple clusterings using evidence accumulation
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
"... We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: (1) applying differ ..."
Abstract

Cited by 51 (5 self)
We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble, a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: (1) applying different clustering algorithms, and (2) applying the same clustering algorithm with different values of parameters or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization; individual data partitions are combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm to this matrix. We have developed a theoretical framework for the analysis and evaluation of the proposed clustering combination strategy, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split-and-merge strategy based on the K-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.
Parametric and Nonparametric Unsupervised Cluster Analysis
 Pattern Recognition
, 1996
"... Much work has been published on methods for assessing the probable number of clusters or structures within unknown data sets. This paper aims to look in more detail at two methods, a broad parametric method, based around the assumption of Gaussian clusters and the other a nonparametric method which ..."
Abstract

Cited by 50 (6 self)
Much work has been published on methods for assessing the probable number of clusters or structures within unknown data sets. This paper aims to look in more detail at two methods: a broad parametric method based around the assumption of Gaussian clusters, and a nonparametric method which utilises scale-space filtering to extract robust structures within a data set. It is shown that, whilst both methods are capable of determining cluster validity for data sets in which clusters tend towards a multivariate Gaussian distribution, the parametric method inevitably fails for clusters which have a non-Gaussian structure, whilst the scale-space method is more robust.
Key words: cluster analysis, maximum likelihood methods, scale-space filtering, probability density estimation.
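The parametric route under the Gaussian-cluster assumption is commonly realised by fitting Gaussian mixtures for a range of candidate cluster counts and scoring each by an information criterion such as BIC, a standard choice rather than necessarily this paper's exact procedure (sketch assumes scikit-learn):

```python
# Parametric cluster-count assessment: fit a Gaussian mixture for each
# candidate k and prefer the model with the lowest BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=1)

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 8)}
best_k = min(bic, key=bic.get)  # lowest BIC wins
```

As the abstract warns, this kind of likelihood-based assessment degrades when the true clusters are strongly non-Gaussian, which is the scenario where the scale-space alternative is claimed to be more robust.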
Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis
 Journal of Machine Learning Research
, 2007
"... Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in highdimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a ..."
Abstract

Cited by 48 (11 self)
Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality-preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation, and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to nonlinear dimensionality reduction scenarios by applying the kernel trick.
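The generalized eigenvalue problem mentioned in the abstract can be sketched compactly: build affinity-weighted local within-class and between-class scatter matrices, then solve for the top eigenvectors. This is a minimal NumPy/SciPy sketch; the local-scaling affinity and the small ridge regulariser are implementation choices for this illustration, not prescribed by the abstract.

```python
# Minimal LFDA-style sketch: local scatter matrices + generalized eigenproblem.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda(X, y, n_components=2, k=7):
    n, d = X.shape
    D = cdist(X, X)
    # Local-scaling affinity: sigma_i = distance to the k-th neighbour.
    sigma = np.sort(D, axis=1)[:, min(k, n - 1)] + 1e-12
    A = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))

    Ww = np.zeros((n, n))              # local within-class weights
    Wb = np.full((n, n), 1.0 / n)      # between-class baseline
    for c in np.unique(y):
        idx = y == c
        nc = idx.sum()
        Ww[np.ix_(idx, idx)] = A[np.ix_(idx, idx)] / nc
        Wb[np.ix_(idx, idx)] = A[np.ix_(idx, idx)] * (1.0 / n - 1.0 / nc)

    def scatter(W):
        # X^T (diag(W 1) - W) X == 1/2 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T
        L = np.diag(W.sum(axis=1)) - W
        return X.T @ L @ X

    Sw, Sb = scatter(Ww), scatter(Wb)
    # Generalized eigenproblem Sb v = lambda Sw v (small ridge for stability).
    vals, vecs = eigh(Sb, Sw + 1e-9 * np.eye(d))
    T = vecs[:, ::-1][:, :n_components]  # top eigenvectors
    return X @ T

# Two classes, one of them bimodal: the case where plain FDA struggles.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 3) - 4, rng.randn(20, 3) + 4, rng.randn(20, 3)])
y = np.array([0] * 40 + [1] * 20)
Z = lfda(X, y, n_components=2)
```

The affinity weights are what makes the scatter matrices "local": distant same-class pairs (e.g. two modes of one class) are only weakly pulled together, which is how the method avoids FDA's collapse of multimodal classes.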
Cluster Validation Techniques for Genome Expression Data
 Signal Processing
, 2002
"... Several clustering algorithms have been suggested to analyse genome expression data, but fewer solutions have been implemented to guide the design of clusteringbased experiments and assess the quality of their outcomes. A cluster validity framework provides insights into the problem of predicting th ..."
Abstract

Cited by 46 (10 self)
Several clustering algorithms have been suggested to analyse genome expression data, but fewer solutions have been implemented to guide the design of clustering-based experiments and assess the quality of their outcomes. A cluster validity framework provides insights into the problem of predicting the correct number of clusters. This paper presents several validation techniques for gene expression data analysis. Normalisation and validity aggregation strategies are proposed to improve the prediction of the number of relevant clusters. The results obtained indicate that this systematic evaluation approach may significantly support genome expression analyses for knowledge discovery applications.
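A simple instance of the validity-framework idea, scoring each candidate cluster count with an index and keeping the best, can be sketched with the silhouette index. This is one common index chosen for illustration; the paper's normalisation and aggregation strategies are not reproduced here (assumes scikit-learn):

```python
# Predict the number of clusters by maximising a validity index
# (silhouette) over candidate values of k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=2)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # highest silhouette wins
```

Aggregating several such indices, as the abstract proposes, guards against any single index's bias toward a particular cluster shape or size.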
Hierarchical Latent Class Models for Cluster Analysis
 Journal of Machine Learning Research
, 2002
"... Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is ..."
Abstract

Cited by 46 (12 self)
Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.
Mining of concurrent text and time series
 In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Workshop on Text Mining
, 2000
"... ..."