Data Clustering: 50 Years Beyond K-Means
2008
Cited by 75 (3 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks (domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
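As a rough illustration of the K-means procedure this abstract refers to, the following sketch alternates an assignment step and a center-update step (the Lloyd-style iteration). It is a minimal version under stated assumptions, not code from the paper: the function names, the toy data, and the choice to seed the centers with the first k points are all hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=100):
    """Lloyd-style K-means on a list of equal-length numeric tuples."""
    # Assumption for reproducibility: the first k points seed the centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, k=2)
```

The result depends on the initial centers, which is one reason practical implementations usually restart from several initializations.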
Stability criteria for switched and hybrid systems
SIAM Review
2007
Cited by 34 (4 self)
The study of the stability properties of switched and hybrid systems gives rise to a number of interesting and challenging mathematical problems. The objective of this paper is to outline some of these problems, to review progress made in solving these problems in a number of diverse communities, and to review some problems that remain open. An important contribution of our work is to bring together material from several areas of research and to present results in a unified manner. We begin our review by relating the stability problem for switched linear systems and a class of linear differential inclusions. Closely related to the concept of stability are the notions of exponential growth rates and converse Lyapunov theorems, both of which are discussed in detail. In particular, results on common quadratic Lyapunov functions and piecewise linear Lyapunov functions are presented, as they represent constructive methods for proving stability, and also represent problems in which significant progress has been made. We also comment on the inherent difficulty of determining stability of switched systems in general, which is exemplified by NP-hardness and undecidability results. We then proceed by considering the stability of switched systems in which there are constraints on the switching rules, through both dwell-time requirements and state-dependent switching laws. Also in this case the theory of Lyapunov functions and the existence of converse theorems are reviewed. We briefly comment on the classical Lur’e problem and on the theory of stability radii, both of which contain many of the features of switched systems and are rich sources of practical results on the topic. Finally, we present a list of questions and open problems which provide motivation for continued research in this area.
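To make the notion of a common quadratic Lyapunov function concrete, the sketch below checks the standard sufficient condition for a switched linear system with modes A_i: a symmetric positive definite P for which every A_i^T P + P A_i is negative definite. It is an illustration only, restricted to 2x2 matrices; the function names and example matrices are hypothetical and do not come from the paper.

```python
def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def lyapunov_lhs(A, P):
    """Return A^T P + P A."""
    n = len(A)
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    AtP, PA = matmul(At, P), matmul(P, A)
    return [[AtP[i][j] + PA[i][j] for j in range(n)] for i in range(n)]


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def is_positive_definite_2x2(M):
    # Sylvester's criterion; M is assumed symmetric.
    return M[0][0] > 0 and det2(M) > 0


def is_negative_definite_2x2(M):
    # M is negative definite exactly when -M is positive definite.
    return M[0][0] < 0 and det2(M) > 0


def is_common_quadratic_lyapunov(P, modes):
    """True if V(x) = x^T P x decreases along every mode x' = A x."""
    return is_positive_definite_2x2(P) and all(
        is_negative_definite_2x2(lyapunov_lhs(A, P)) for A in modes
    )


# Hypothetical example modes; each one is individually stable.
identity = [[1, 0], [0, 1]]
A1 = [[-1, 0], [0, -2]]
A2 = [[-2, 1], [0, -1]]
A3 = [[-1, 10], [0, -1]]

shared = is_common_quadratic_lyapunov(identity, [A1, A2])
not_shared = is_common_quadratic_lyapunov(identity, [A1, A3])
```

Here the identity matrix works for A1 and A2 but fails for A3. That failure rules out only this particular P; it does not show that no common quadratic Lyapunov function exists for A1 and A3.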
Rearrangement clustering: Pitfalls, remedies, and applications
Journal of Machine Learning Research
2006
Cited by 6 (0 self)
Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites.
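The objective described in this abstract, the sum of similarities between adjacent rows, can be sketched on a toy matrix as follows. This is not the paper's TSP-based algorithm: it enumerates every row ordering by brute force, which is feasible only for a handful of rows. The similarity measure (negative squared Euclidean distance), the function names, and the data are assumptions made for illustration.

```python
from itertools import permutations


def similarity(a, b):
    # Assumed similarity: negative squared Euclidean distance between rows.
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def adjacency_score(rows, order):
    """Sum of similarities between rows that are adjacent in the given order."""
    return sum(similarity(rows[i], rows[j]) for i, j in zip(order, order[1:]))


def best_order(rows):
    """Brute-force search over all row orderings (tiny inputs only)."""
    return max(permutations(range(len(rows))), key=lambda o: adjacency_score(rows, o))


# Toy matrix (hypothetical): rows 0 and 2 are alike, as are rows 1 and 3.
rows = [(0, 0), (5, 5), (0, 1), (5, 6)]
order = best_order(rows)
score = adjacency_score(rows, order)
```

In the best ordering the similar rows end up next to each other. The number of orderings grows factorially with the number of rows, which is why formulations based on the Traveling Salesman Problem are used for realistic sizes.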
Automated abstraction methodology for genetic regulatory networks
[Online]. Available: http://www.async.ece.utah.edu/publications/TCSB06.pdf
2006
Cited by 5 (4 self)
In order to efficiently analyze the complicated regulatory systems often encountered in biological settings, abstraction is essential. This paper presents an automated abstraction methodology that systematically reduces the small-scale complexity found in genetic regulatory network models, while broadly preserving the large-scale system behavior. Our method first reduces the number of reactions by using rapid equilibrium and quasi-steady-state approximations as well as a number of other stoichiometry-simplifying techniques, which together result in substantially shortened simulation time. To further reduce analysis time, our method can represent the molecular state of the system by a set of scaled Boolean (or n-ary) discrete levels. This results in a chemical master equation that is approximated by a Markov chain with a much smaller state space, providing significant analysis time acceleration and computability gains.
Clustering, Dimensionality Reduction, and Side Information
2006
Cited by 4 (0 self)
Recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes. The purpose of this thesis is to study some of the open problems in two main areas of unsupervised learning, namely clustering and (unsupervised) dimensionality reduction. Instance-level constraints on objects, an example of side-information, are also considered to improve the clustering results. Our first contribution is a modification to the isometric feature mapping (ISOMAP) algorithm when the input data, instead of being all available simultaneously, arrive sequentially from a data stream. ISOMAP is representative of a class of nonlinear dimensionality reduction algorithms that are based on the notion of a manifold. Both the standard ISOMAP and the landmark version of ISOMAP are considered. Experimental results on synthetic data as well as real-world images demonstrate that the modified algorithm can maintain an accurate low-dimensional representation of the data in an efficient manner. We study the problem of feature selection in model-based clustering when the number of clusters
Detection and normalization of biases present in spotted cDNA microarray data: a composite method addressing dye, intensity-dependent, spatially-dependent, and print-order biases
DNA Res
2005
Cited by 2 (0 self)
Microarrays are often used to identify target genes that trigger specific diseases, to elucidate the mechanisms of drug effects, and to check SNPs. However, data from microarray experiments are well known to contain biases resulting from the experimental protocols. Therefore, in order to extract biological knowledge from the data, systematic biases arising from these protocols must be removed prior to any data analysis. To remove these biases, researchers use many normalization methods. However, not all biases are eliminated from the microarray data because not all types of errors from experimental protocols are known. In this paper, we report an effective way of removing various types of biases by treating each microarray dataset independently to detect the biases present in that dataset. After the biases contained in each dataset were identified, a combination of normalization methods specifically chosen for each dataset was applied to remove the biases one at a time. Key words: cDNA microarray; normalization; print-order bias
Genome identification and classification by short oligo arrays
In: Proceedings of the Fourth Annual Workshop on Algorithms in Bioinformatics
2004
Cited by 2 (1 self)
We explore the problem of designing oligonucleotides that help locate organisms along a known phylogenetic tree. We develop a suffix-tree-based algorithm to find such short sequences efficiently. Our algorithm requires O(Nm) time and O(N) space in the worst case, where m is the number of genomes classified by the phylogeny and N is their total length. We implemented our algorithm and used it to find these discriminating sequences in both small and large phylogenies. We believe our algorithm will have wide applications, including high-throughput classification and identification, oligo array design optimally differentiating genes in gene families, and markers for closely related strains and populations. It will also have scientific significance as a new way to assess the confidence in a given classification.
Abstracted stochastic analysis of type 1 pili expression in E. coli
In The 2006 International Conference on Bioinformatics and Computational Biology
2006
Cited by 2 (2 self)
With the aid of model abstractions, biochemical networks can be analyzed at different levels of resolution: from low-level quantitative models to high-level qualitative ones. Furthermore, an ability to change the level of abstraction can be very useful when dealing with many biological systems, including gene regulatory networks. These systems typically have too many components and states to be practically studied using all-inclusive low-level models, yet they often manifest enough dynamical and functional complexity to make an entirely high-level qualitative representation similarly inadequate, thus necessitating a search for some intermediate level of abstraction. Finally, while most abstractions used in modeling of biochemical networks have traditionally been performed manually, doing so accurately in a large system is a tedious and time-consuming process that is highly susceptible to errors during model transformation. To address these issues, we have developed a methodology and implemented an automated modeling and analysis tool with variable abstraction level capabilities. In this paper, we use it for the analysis of switching in Type 1 pili expression dynamics and, in particular, for the problem of estimating the effect of H-NS and Lrp regulatory protein levels on phase variation rates in E. coli. Such behavior is notoriously difficult to study due to the size of the associated gene regulatory network and the characteristically stochastic dynamics involved, which result in very high analytical and computational demands. Here, we show how, by using our system, we are able to automatically abstract the switch network and accurately predict E. coli afimbriation rates while, at the same time, accelerating the required computations by up to two orders of magnitude.
Extracting and explaining biological knowledge in microarray data
In Pacific Asia Knowledge Discovery and Data Mining Conference (PAKDD2004)
2004
Cited by 1 (1 self)
High-throughput technologies produce large biological datasets that may lead to greater understanding of the biological mechanisms behind diseases such as cancer. However, progress has been slow in extracting meaningful information from these datasets. We describe a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to assist the explanation of other datasets associated with ontologies.
Genetic Programming based DNA Microarray Analysis for Classification of Cancer
Cited by 1 (0 self)
In this study, the advantages of statistical gene selection are combined with the power of Genetic Programming (GP) to build classifiers for assigning gene expression microarray data samples to categories characteristic of certain cell states. To that end, we implemented different statistical measures in a program called GENEACTIVATOR and tested their applicability to gene selection. Subsequently, we used the general-purpose GP system DISCIPULUS to train classifiers. We applied our approach to four different publicly available human cancer gene expression datasets, including multiclass sets. The results indicate that using gene selection and GP as implemented in DISCIPULUS is an appropriate method for gene expression data analysis.