## An Algorithm for Clustering cDNAs for Gene Expression Analysis (1999)

Venue: | In RECOMB99: Proceedings of the Third Annual International Conference on Computational Molecular Biology |

Citations: | 45 - 4 self |

### BibTeX

@INPROCEEDINGS{Hartuv99analgorithm,

author = {Erez Hartuv and Armin Schmitt and Jörg Lange and Sebastian Meier-Ewert and Hans Lehrach and Ron Shamir},

title = {An Algorithm for Clustering cDNAs for Gene Expression Analysis},

booktitle = {RECOMB99: Proceedings of the Third Annual International Conference on Computational Molecular Biology},

year = {1999},

pages = {188--197},

publisher = {ACM Press}

}

### Abstract

We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a clustering with some provably good properties. The application that motivated this study was gene expression analysis, where a collection of cDNAs must be clustered based on their oligonucleotide fingerprints. The algorithm has been tested intensively on simulated libraries and was shown to outperform extant methods. It demonstrated robustness to high noise levels. In a blind test on real cDNA fingerprint data the algorithm obtained very good results. Utilizing the results of the algorithm would have saved over 70% of the cDNA sequencing cost on that data set.

1 Introduction Cluster analysis seeks grouping of data elements into subsets, so that elements in the same subset are in some sense more cl...
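The abstract's idea of treating highly connected subgraphs of a similarity graph as clusters can be illustrated with a minimal sketch. It is not the authors' implementation (the citation contexts say theirs was written in C++ with LEDA). It assumes the definition commonly given for the HCS approach, that a graph on n vertices is highly connected when its edge connectivity exceeds n/2, and it uses a plain Stoer-Wagner minimum cut. The function names `stoer_wagner` and `hcs` and the toy graph are hypothetical.

```python
from itertools import combinations


def stoer_wagner(nodes, edges):
    """Return (cut_size, side) for a global minimum cut of an unweighted graph."""
    # w[u][v] = number of original edges between merged super-vertices u and v
    w = {u: {} for u in nodes}
    for u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    members = {u: {u} for u in nodes}
    best_cut, best_side = float("inf"), set()
    while len(w) > 1:
        # One minimum-cut phase: add vertices in maximum-adjacency order.
        start = next(iter(w))
        added = [start]
        weights = {v: w[start].get(v, 0) for v in w if v != start}
        while weights:
            nxt = max(weights, key=weights.get)
            cut_of_phase = weights.pop(nxt)
            added.append(nxt)
            for v, c in w[nxt].items():
                if v in weights:
                    weights[v] += c
        s, t = added[-2], added[-1]
        if cut_of_phase < best_cut:
            best_cut, best_side = cut_of_phase, set(members[t])
        # Merge the last vertex t into the second-to-last vertex s.
        members[s] |= members.pop(t)
        for v, c in w.pop(t).items():
            del w[v][t]
            if v != s:
                w[s][v] = w[s].get(v, 0) + c
                w[v][s] = w[v].get(s, 0) + c
    return best_cut, best_side


def hcs(nodes, edges):
    """Split recursively along minimum cuts until each part is highly connected.

    Assumption: "highly connected" means edge connectivity > n/2.
    Singletons are returned as one-element sets.
    """
    nodes = set(nodes)
    if len(nodes) < 2:
        return [nodes]
    cut, side = stoer_wagner(nodes, edges)
    if cut > len(nodes) / 2:
        return [nodes]
    other = nodes - side

    def inside(part):
        return [(u, v) for u, v in edges if u in part and v in part]

    return hcs(side, inside(side)) + hcs(other, inside(other))


# Toy similarity graph: two 4-cliques joined by a single spurious edge.
toy_edges = (list(combinations(range(0, 4), 2))
             + list(combinations(range(4, 8), 2))
             + [(3, 4)])
clusters = hcs(range(8), toy_edges)
```

In the toy graph the single edge between the two cliques plays the role of a false positive similarity; the recursion removes it because the whole graph has a minimum cut of size 1, well below the n/2 threshold.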

### Citations

11201 |
Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context ...iques. First, all maximal cliques are computed and later maximal cliques with sufficiently large overlap are merged into a single cluster. Computing all maximal cliques is computationally intractable =-=[6]-=-. Moreover, a high false negative rate may break large clusters into many maximal cliques with a complicated and hard-to-detect overlap structure. Milosavljevic et al. [24] build clusters using a gr... |

1586 |
A k-means clustering algorithm
- Hartigan, Wong
- 1979
Citation Context ...show simulation results that demonstrate the superiority of our algorithm over the greedy algorithm. 2 The Highly Connected Subgraphs Algorithm We start with some basic definitions on clustering (cf. =-=[18, 8, 7]-=-) and graph theory (cf. [2, 5, 17]). In the clustering problem, one has a sample L of n elements and an $n \times p$ real data matrix of measurements D, which contains p measurements (or characteristi... |

343 |
Graph Algorithms
- Even
- 1979
Citation Context ...strate the superiority of our algorithm over the greedy algorithm. 2 The Highly Connected Subgraphs Algorithm We start with some basic definitions on clustering (cf. [18, 8, 7]) and graph theory (cf. =-=[2, 5, 17]-=-). In the clustering problem, one has a sample L of n elements and an $n \times p$ real data matrix of measurements D, which contains p measurements (or characteristics) on each of the n entities. Fro... |

280 |
An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation
- Wu, Leahy
- 1993
Citation Context ...ected graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are more similar to ours were proposed by Matula [18, 17, 16, 15] and by Wu and Leahy =-=[32]-=-. Both of these algorithms lack our important stopping criterion, with the ensuing provable results on the clustering quality. In particular these algorithms do not guarantee that clusters have diamet... |

181 |
Convexity in graphs
- Harary, Nieminen
- 1981
Citation Context ...strate the superiority of our algorithm over the greedy algorithm. 2 The Highly Connected Subgraphs Algorithm We start with some basic definitions on clustering (cf. [18, 8, 7]) and graph theory (cf. =-=[2, 5, 17]-=-). In the clustering problem, one has a sample L of n elements and an $n \times p$ real data matrix of measurements D, which contains p measurements (or characteristics) on each of the n entities. Fro... |

140 |
Mathematical classification and clustering
- Mirkin
- 1996
Citation Context ...ood results. Using the algorithm for clustering that data would have saved over 70% of the cDNA sequencing cost. Several graph theoretic approaches to cluster analysis have been suggested (see, e.g., =-=[25, 17, 7]-=-). Those include finding connected components, strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are... |

131 |
A simple min-cut algorithm
- Stoer, Wagner
- 1997
Citation Context ... high thresholds give a large number of false negative edges. Both of these situations hinder the clustering algorithm. The optimal threshold value is about 45, but good results are achieved when $\ell \in [30, 70]$, so finding an optimal $\ell$ is not crucial. The effective range of the number of probes (Figure 4.3) is 100-300. Increasing the false positive parameter $\beta$ up to 0.001-0.0015 has negligible effect on th... |

81 |
A model for taxonomy
- Jardine, Sibson
- 1968
Citation Context ... , where $C_{ij} = 1$ iff $i$ and $j$ belong to the same cluster. Given matrix representations of the true clustering $T$ and any clustering $C$ of the same data set, the Minkowski measure for the quality of $C$ (cf. =-=[29, 10]-=-) is the normalized distance between the two matrices, $\|T - C\| / \|T\|$, where $\|T\| = \sqrt{\sum_i \sum_j T_{i,j}^2}$. An example of the performance of the HCS clustering algorithm on simulated cDNA oligo f... |
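The Minkowski measure described in this context compares two clusterings through their binary co-membership matrices, and lower values mean closer agreement with the true clustering. A minimal sketch follows; the label-list input format and the function names are assumptions made for illustration, and the diagonal entries (each element paired with itself) are included in the sums, which the excerpt does not specify.

```python
from math import sqrt


def comembership(labels):
    """Binary matrix M with M[i][j] = 1 iff elements i and j share a cluster label."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def minkowski(true_labels, found_labels):
    """Normalized distance ||T - C|| / ||T|| between co-membership matrices.

    Assumption: the norm is the square root of the sum of squared entries,
    taken over all (i, j) pairs including the diagonal.
    """
    T = comembership(true_labels)
    C = comembership(found_labels)
    n = len(true_labels)
    diff = sum((T[i][j] - C[i][j]) ** 2 for i in range(n) for j in range(n))
    norm = sum(T[i][j] ** 2 for i in range(n) for j in range(n))
    return sqrt(diff / norm)


# A perfect clustering scores 0; splitting every element into its own
# cluster is penalized for each missed within-cluster pair.
perfect = minkowski([0, 0, 1, 1], [5, 5, 9, 9])
all_singletons = minkowski([0, 0, 1, 1], [0, 1, 2, 3])
```

Because only co-membership matters, the label values themselves are arbitrary: relabeling the clusters of either input leaves the score unchanged.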

76 |
Cluster analysis and mathematical programming
- Hansen, Jaumard
- 1997
Citation Context ...ood results. Using the algorithm for clustering that data would have saved over 70% of the cDNA sequencing cost. Several graph theoretic approaches to cluster analysis have been suggested (see, e.g., =-=[25, 17, 7]-=-). Those include finding connected components, strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are... |

74 |
Minimum cuts in near-linear time
- Karger
- 1996
Citation Context ...es. The results on the real dataset also demonstrate that the effect of this problem is not large. Additional improvements to the algorithm can be achieved by using a faster minimum cut algorithm (e.g. =-=[11]-=-) and by attempting to find maximal highly connected subgraphs (e.g. using Matula's cohesiveness function [17]). Using a weighted minimum cut algorithm may also improve the results. Further theoretica... |

55 |
LEDA: a platform for combinatorial and geometric computing
- Mehlhorn, Näher
- 1995
Citation Context ...iving high quality clustering results according to our experiments. The simulation algorithm was written in MATLAB, while the clustering algorithm was written in C++ within the LEDA 3.4.1 environment =-=[21]-=-. (The minimum cut algorithm implemented in LEDA has an $O(nm + n^2 \log n)$ time complexity [30].) Average elapsed times on a 194 MHz SGI Challenge L machine with 32Kb instruction cache and 1024 main me... |

29 |
K-components, clusters, and slicing in graphs
- Matula
- 1972
Citation Context ...ood results. Using the algorithm for clustering that data would have saved over 70% of the cDNA sequencing cost. Several graph theoretic approaches to cluster analysis have been suggested (see, e.g., =-=[25, 17, 7]-=-). Those include finding connected components, strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are... |

27 |
Computing edge connectivity in multigraphs and capacitated graphs
- Nagamochi, Ibaraki
- 1992
Citation Context ...(n, m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges, and N is the number of clusters. Typically $N \ll n$. Currently, the best time bound for $f(n, m)$ is $O(nm)$ =-=[19, 26]-=-. Properties of HCS clustering: We now present several properties of the solutions produced by the HCS algorithm. These properties will imply the appropriateness of the solutions for cluster analysis. Du... |

21 |
Graph theoretic techniques for cluster analysis algorithms
- Matula
- 1987
Citation Context ... strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are more similar to ours were proposed by Matula =-=[18, 17, 16, 15]-=- and by Wu and Leahy [32]. Both of these algorithms lack our important stopping criterion, with the ensuing provable results on the clustering quality. In particular these algorithms do not guarantee ... |

19 |
Gene-representing cDNA clusters defined by hybridization of 57,419 clones from infant brain libraries with short oligonucleotide probes. Genomics 37:29–40
- Drmanac, NA, et al.
- 1996
Citation Context ...sentation of low-abundance genes must be 100,000 or more. Sequencing all 100,000 cDNAs in a sample is slow, wasteful and prohibitively expensive. An alternative method was proposed about a decade ago =-=[14, 3, 31, 4, 28, 22, 24]-=-. It is based on spotting the cDNAs on high density filters (about 31,000 different cDNAs can be spotted currently in duplicates on one filter [28]). Short synthetic DNA sequences, typically 7-12 base... |

16 |
Determining edge connectivity in O(nm)
- Matula
- 1987
Citation Context ...(n, m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges, and N is the number of clusters. Typically $N \ll n$. Currently, the best time bound for $f(n, m)$ is $O(nm)$ =-=[19, 26]-=-. Properties of HCS clustering: We now present several properties of the solutions produced by the HCS algorithm. These properties will imply the appropriateness of the solutions for cluster analysis. Du... |

16 |
Molecular approach to genome analysis: a strategy for the construction of ordered overlapping clone libraries
- Michiels, G, et al.
- 1987
Citation Context ...d deviation $\sigma = (C_b - C_a)/6$. The number of probes (i.e., oligos) is $p$. Probes are assumed to occur along a gene with a Poisson distribution of a given rate. This assumption originally was suggested in =-=[23]-=- and was adopted by other researchers [1, 27, 20]. The probability that an oligo occurrence did not register (false negative probability) is $\alpha$. False positive hybridizations are assumed to have Poisson distribution with rate $\beta$. All probe occurrenc... |

13 |
Clustering and classification: Background and current directions
- Sokal
- 1977
Citation Context ... , where $C_{ij} = 1$ iff $i$ and $j$ belong to the same cluster. Given matrix representations of the true clustering $T$ and any clustering $C$ of the same data set, the Minkowski measure for the quality of $C$ (cf. =-=[29, 10]-=-) is the normalized distance between the two matrices, $\|T - C\| / \|T\|$, where $\|T\| = \sqrt{\sum_i \sum_j T_{i,j}^2}$. An example of the performance of the HCS clustering algorithm on simulated cDNA oligo f... |

11 |
Drmanac R. Processing of cDNA and genomic kilobase-size clones for massive screening, mapping and sequencing by hybridization. Biotechniques
- Drmanac
- 1994
Citation Context ...sentation of low-abundance genes must be 100,000 or more. Sequencing all 100,000 cDNAs in a sample is slow, wasteful and prohibitively expensive. An alternative method was proposed about a decade ago =-=[14, 3, 31, 4, 28, 22, 24]-=-. It is based on spotting the cDNAs on high density filters (about 31,000 different cDNAs can be spotted currently in duplicates on one filter [28]). Short synthetic DNA sequences, typically 7-12 base... |

10 |
Construction of physical maps from oligonucleotide fingerprints data
- Mayraz, Shamir
- 1999
Citation Context ...ber of probes (i.e., oligos) is $p$. Probes are assumed to occur along a gene with a Poisson distribution of a given rate. This assumption originally was suggested in [23] and was adopted by other researchers =-=[1, 27, 20]-=-. The probability that an oligo occurrence did not register (false negative probability) is $\alpha$. False positive hybridizations are assumed to have Poisson distribution with rate $\beta$. All probe occurrenc... |

9 |
Cluster analysis via graph theoretic techniques
- Matula
- 1970
Citation Context ...n suggested (see, e.g., [25, 17, 7]). Those include finding connected components, strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see =-=[16]-=-. Two other approaches that are more similar to ours were proposed by Matula [18, 17, 16, 15] and by Wu and Leahy [32]. Both of these algorithms lack our important stopping criterion, with the ensuing... |

6 |
Clone clustering by hybridization
- Milosavljevic, Strezoska, et al.
- 1995
Citation Context ...sentation of low-abundance genes must be 100,000 or more. Sequencing all 100,000 cDNAs in a sample is slow, wasteful and prohibitively expensive. An alternative method was proposed about a decade ago =-=[14, 3, 31, 4, 28, 22, 24]-=-. It is based on spotting the cDNAs on high density filters (about 31,000 different cDNAs can be spotted currently in duplicates on one filter [28]). Short synthetic DNA sequences, typically 7-12 base... |

5 |
Lehrach H: Hybridization analysis of arrayed cDNA libraries
- Lennon
- 1991

4 |
Physical mapping of chromosomes: A combinatorial problem in molecular biology
- Alizadeh, Karp, et al.
- 1995
Citation Context ...ber of probes (i.e., oligos) is $p$. Probes are assumed to occur along a gene with a Poisson distribution of a given rate. This assumption originally was suggested in [23] and was adopted by other researchers =-=[1, 27, 20]-=-. The probability that an oligo occurrence did not register (false negative probability) is $\alpha$. False positive hybridizations are assumed to have Poisson distribution with rate $\beta$. All probe occurrenc... |

4 |
Partial sequencing by oligohybridization: Concept and applications in genome analysis
- Crkvenjakov, Lehrach
- 1991

4 |
Comparison of clone-ordering algorithms used in physical mapping
- Platt, Dix
- 1997
Citation Context ...ber of probes (i.e., oligos) is $p$. Probes are assumed to occur along a gene with a Poisson distribution of a given rate. This assumption originally was suggested in [23] and was adopted by other researchers =-=[1, 27, 20]-=-. The probability that an oligo occurrence did not register (false negative probability) is $\alpha$. False positive hybridizations are assumed to have Poisson distribution with rate $\beta$. All probe occurrenc... |

3 |
Cluster Analysis by Highly Connected Subgraphs with Applications to cDNA Clustering
- Hartuv
- 1998
Citation Context ...vable results on the clustering quality. In particular these algorithms do not guarantee that clusters have diameter two, a key property of our clustering. For other limitations of these methods, see =-=[9]-=-. For the specific problem of clustering cDNA fingerprints, several approaches were suggested previously: Drmanac et al. [28] build clusters around connected components in the similarity graph. In tha... |

3 |
Sequencing by hybridization: Towards an automated sequencing of one million M13 clones arrayed on membranes. Electrophoresis 13: 566–573
- Vicentic, Gemmell
- 1992

2 |
Gene expression profiles in normal and cancer cells
- Zhang, et al.
- 1997
Citation Context ...cates the relative expression level of their genes. Out of about 100,000 different genes that are present in human DNA, the number of different genes active in a human cell at any time is over 10,000 =-=[12]-=-. The relative abundance of cDNAs of different genes may vary by a factor of 10,000. This clearly implies that the size of the sample of cDNAs that must be extracted from a cell in order to obtain ade... |

2 |
The cohesive strength of graphs. In “The Many Facets of Graph Theory”
- Matula
- 1969
Citation Context ... strongly connected components in directed graphs, cliques and maximal cliques. For a critique of these approaches see [16]. Two other approaches that are more similar to ours were proposed by Matula =-=[18, 17, 16, 15]-=- and by Wu and Leahy [32]. Both of these algorithms lack our important stopping criterion, with the ensuing provable results on the clustering quality. In particular these algorithms do not guarantee ... |

1 |
Comparative gene expression profiling by oligonucleotide fingerprinting
- Lehrach
- 1998
Citation Context ...length, so even in the errorless case their fingerprint will differ. Luckily, some cDNA extraction techniques guarantee that all cDNAs from the same gene will end at a common endpoint in the sequence =-=[13]-=-, and therefore they all share a common sequence suffix. Alternative technologies such as DNA microchips have the advantage of being able to determine the expression levels of thousands of genes in p... |

1 |
Gene identification by oligonucleotide fingerprinting -- a pilot study
- Meier-Ewert, Mott, et al.
- 1995