Results 1 - 7 of 7
clValid , an R package for cluster validation
, 2008
Cited by 3 (1 self)
The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available, “internal”, “stability”, and “biological”. The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class clValid, which has summary, plot, print, and additional methods which allow the user to display the optimal validation scores and extract clustering results.
Building on the arules infrastructure for analyzing transaction data with R
- Advances in Data Analysis, Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Freie Universität Berlin, March 8–10, 2006, Studies in Classification, Data Analysis, and Knowledge Organization
Cited by 2 (2 self)
Abstract. The free and extensible statistical computing environment R with its enormous number of extension packages already provides many state-of-the-art techniques for data analysis. Support for association rule mining, a popular exploratory method which can be used, among other purposes, for uncovering cross-selling opportunities in market baskets, has become available recently with the R extension package arules. After a brief introduction to transaction data and association rules, we present the formal framework implemented in arules and demonstrate how clustering and association rule mining can be applied together using a market basket data set from a typical retailer. This paper shows that implementing a basic infrastructure with formal classes in R provides an extensible basis which can very efficiently be employed for developing new applications (such as clustering transactions) in addition to association rule mining.
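For readers unfamiliar with association rules, the two standard quantities behind them, support and confidence, can be sketched in a few lines of Python. This is only an illustration of the textbook definitions, not the arules implementation; the toy transactions and the function names `support` and `confidence` are assumptions made for this example.

```python
# Hypothetical toy market-basket data (not taken from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / support(lhs, transactions)


# Support of {bread, milk} and confidence of the rule {bread} -> {milk}.
bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_conf)
```

Here two of the four baskets contain both bread and milk, and two of the three bread baskets also contain milk, so the rule has support 0.5 and confidence 2/3.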
Harvard Medical School
This introduction to the R package clues is a (slightly) modified version of Chang et al. (2010), published in the Journal of Statistical Software. Determining the optimal number of clusters appears to be a persistent and controversial issue in cluster analysis. Most existing R packages targeting clustering require the user to specify the number of clusters in advance. However, if this subjectively chosen number is far from optimal, clustering may produce seriously misleading results. In order to address this vexing problem, we develop the R package clues to automate and evaluate the selection of an optimal number of clusters, which is widely applicable in the field of clustering analysis. Package clues uses two main procedures, shrinking and partitioning, to estimate an optimal number of clusters by maximizing an index function, either the CH index or the Silhouette index, rather than relying on guessing a pre-specified number. Five agreement indices (Rand index, Hubert and Arabie’s adjusted Rand index, Morey and Agresti’s adjusted Rand index, Fowlkes and Mallows index and Jaccard index), which measure the degree of agreement between any two partitions, are also provided in clues. In addition to numerical evidence, clues also supplies a deeper insight into the partitioning process with trajectory plots. Keywords: agreement index, cluster analysis, dissimilarity measure, K-nearest neighbor.
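Two of the agreement indices named above, the Rand index and the Jaccard index, follow directly from counting pairs of observations. The sketch below uses the standard pair-counting definitions in plain Python; it is not the clues code, and the function names are assumptions made for this example.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count pairs of points by whether each partition puts them together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same points")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Share of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)


a = [0, 0, 1, 1]
b = [0, 0, 1, 2]
print(rand_index(a, b), jaccard_index(a, b))
```

For these two partitions, five of the six pairs are treated consistently, giving a Rand index of 5/6, while the Jaccard index is 1/2. Note that `jaccard_index` is undefined (division by zero) when neither partition places any pair together.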
A Generic Registry Infrastructure for R
, 2009
More and more, R packages are offering dynamic functionality, allowing users to extend a “repository” of initial features or data. For example, the proxy package (Meyer and Buchta, 2008) provides an enhanced dist() function for computing dissimilarity matrices, allowing users to choose among several proximity measures stored in a registry. Each entry is composed of a small workhorse function
A COMBINATORICS-BASED DATA-MINING APPROACH TO TIME-SERIES MICROARRAY ALIGNMENT
- VESTNIK VOGIS (INFORMATION BULLETIN OF VAVILOV SOCIETY FOR GENETICISTS AND BREEDING SCIENTISTS)
, 2008
One of the biological issues in understanding bovine embryo development involves the analysis of proliferation and differentiation processes. An easy way to do so is to use published data to collect information about genes interacting with a target gene of interest, from which we can extract pieces of information from the literature. Using published data from other species (mouse, human), we used a classical two-step clustering approach. The first step runs a k-means clustering for each chip individually. The second step runs a fuzzy consensus clustering to merge a few clusters (i.e., megaclusters) between microarrays. Hence we build temporal gene profiles (i.e., a matrix) based on the gene expression of megaclusters, using the symbolic time properties of simultaneity and precedence. Finally, with the help of a Jaccard coefficient between temporal gene profiles across species, we extract a list of genes revealing a similarity with a target gene of interest. Depending on the species or target gene, this list of genes differed in size and content, thus highlighting the interest of such cross-species comparisons to gain insights from different literature contexts.
Isotone Optimization in R: . . .
This introduction to the R package isotone is a (slightly) modified version of de Leeuw et al. (2009), published in the Journal of Statistical Software. In this paper we give a general framework for isotone optimization. First we discuss a generalized version of the pool-adjacent-violators algorithm (PAVA) to minimize a separable convex function with simple chain constraints. Besides general convex functions, we extend existing PAVA implementations in terms of observation weights, approaches for tie handling, and responses from repeated measurement designs. Since isotone optimization problems can be formulated as convex programming problems with linear constraints, we then develop a primal active set method to solve such problems. This methodology is applied to specific loss functions relevant in statistics. Both approaches are implemented in the R package isotone.
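As a point of reference, the classical least-squares special case of PAVA with observation weights can be sketched as follows. This is not the generalized algorithm of the isotone package: it assumes a weighted squared-error loss, strictly positive weights, and a simple chain (non-decreasing) constraint, and the function name `pava` is chosen for this example only.

```python
def pava(y, w=None):
    """Weighted least-squares non-decreasing fit by pooling adjacent violators."""
    if w is None:
        w = [1.0] * len(y)
    if len(w) != len(y):
        raise ValueError("y and w must have the same length")
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand each block back to one fitted value per original observation.
    fit = []
    for mean, _, n in blocks:
        fit.extend([mean] * n)
    return fit


print(pava([1, 3, 2, 4]))  # the violating pair (3, 2) is pooled to its mean
```

Each pooled block is replaced by its weighted mean, which is the minimizer of the squared-error loss over that block; other convex losses would need a different block solver (for example, a weighted median for absolute error).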
Truecluster matching Truecluster
, 705
Cluster matching by permuting cluster labels is important in many clustering contexts such as cluster validation and cluster ensemble techniques. The classic approach is to minimize the Euclidean distance between two cluster solutions which induces inappropriate stability in certain settings. Therefore, we present the truematch algorithm that introduces two improvements best explained in the crisp case. First, instead of maximizing the trace of the cluster crosstable, we propose to maximize a χ²-transformation of this crosstable. Thus, the trace will not be dominated by the cells with the largest counts but by the cells with the most non-random observations, taking into account the marginals. Second, we suggest a probabilistic component in order to break ties and to make the matching algorithm truly random on random data. The truematch algorithm is designed as a building block of the truecluster framework and scales in polynomial time. First simulation results confirm that the truematch algorithm gives more consistent truecluster results for unequal cluster sizes. Free R software is available.
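To make the baseline concrete, the classic crisp approach mentioned above (relabeling one solution so that the trace of the cluster crosstable is maximal) can be sketched as below. This is not the truematch algorithm: it omits the χ²-transformation and the probabilistic tie-breaking, and it searches all permutations by brute force, which is only feasible for a handful of clusters. The function names and the assumption that labels are integers 0..k-1 are made for this example.

```python
from itertools import permutations


def crosstable(labels_a, labels_b, k):
    """k x k table of counts: rows are labels in A, columns are labels in B."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table


def match_by_trace(labels_a, labels_b, k):
    """Relabel B so that the trace of the A-by-B crosstable is maximal."""
    table = crosstable(labels_a, labels_b, k)
    # perm[i] is the B label matched to A label i.
    best_perm = max(
        permutations(range(k)),
        key=lambda perm: sum(table[i][perm[i]] for i in range(k)),
    )
    # Invert the matching to translate B labels into A labels.
    relabel = {b: a for a, b in enumerate(best_perm)}
    return [relabel[b] for b in labels_b]


labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]
matched_b = match_by_trace(labels_a, labels_b, 3)
print(matched_b)
```

In this example the two solutions describe the same partition under different labels, so the relabeled `labels_b` coincides with `labels_a`. When several permutations reach the same trace, `max` simply returns the first one it encounters, which is the kind of deterministic tie-breaking the abstract argues against.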

