Clustering by compression
- IEEE Transactions on Information Theory
, 2005
Cited by 120 (12 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms: Heterogenous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
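The distance described in this abstract can be illustrated with a minimal sketch. It assumes the commonly stated form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses Python's zlib as a stand-in compressor; the helper names and the sample inputs are illustrative and are not taken from the authors' software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance from single and concatenated lengths."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a repetitive text and an unrelated byte pattern.
text = b"the quick brown fox jumps over the lazy dog " * 50
other = bytes(range(256)) * 8

# An object should be closer to itself than to unrelated data.
print(ncd(text, text), ncd(text, other))
```

Any real compressor only approximates the ideal, so the self-distance is small but generally not exactly zero.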
A New Quartet Tree Heuristic for Hierarchical Clustering
- EU-PASCAL Statistics and Optimization of Clustering Workshop, 5-6 July 2005
, 2006
Cited by 6 (3 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
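The count of quartet topologies mentioned in this abstract can be checked with a short sketch: each set of four objects can be split into two pairs in exactly three ways, which gives 3 * C(n, 4) topologies. The cost used here, the sum of the two within-pair distances, is an assumption made for illustration, since the abstract does not spell out the weighting; the function names and the toy matrix are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(labels):
    """Yield every topology ab|cd for every 4-subset of the labels."""
    for a, b, c, d in combinations(labels, 4):
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


def topology_cost(dist, topology):
    """Assumed cost: sum of the two within-pair distances (illustrative)."""
    (u, v), (w, x) = topology
    return dist[u][v] + dist[w][x]


# Toy symmetric distance matrix: objects 0,1 are close, and 2,3 are close.
dist = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]

topologies = list(quartet_topologies(range(4)))
cheapest = min(topologies, key=lambda t: topology_cost(dist, t))
print(len(topologies), cheapest)
```

Under this assumed cost, the cheapest topology pairs the objects that are closest to each other.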
Clustering
, 2009
Cited by 2 (0 self)
The problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used for general hierarchical clustering of nontreelike (non-phylogeny) data in various domains and across domains with heterogenous data, and is implemented and available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package on genomic data for which the latter are optimized.
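The general shape of randomized hill climbing, the technique this abstract names, can be sketched as follows. This is a generic toy and not the CompLearn implementation: the state is a permutation rather than a tree, and the `mutate` and `score` functions are hypothetical stand-ins. What it does show is the monotonic behaviour described above: a random change is kept only when it improves the score.

```python
import random


def randomized_hill_climb(initial, mutate, score, steps=2000, seed=0):
    """Keep a random mutation only if it strictly improves the score."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    history = [best_score]  # best score seen after each step
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, best_score, history


def swap_two(perm, rng):
    """Toy mutation: swap two randomly chosen positions."""
    i, j = rng.sample(range(len(perm)), 2)
    out = list(perm)
    out[i], out[j] = out[j], out[i]
    return out


def smoothness(perm):
    """Toy score: higher when neighbouring values are close together."""
    return -sum(abs(a - b) for a, b in zip(perm, perm[1:]))


start = [5, 0, 9, 2, 7, 1, 8, 3, 6, 4]
best, best_score, history = randomized_hill_climb(start, swap_two, smoothness)
print(best, best_score)
```

Because only improvements are accepted, the recorded score never decreases, although a plain hill climber of this kind can still stop in a local optimum.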
Preclinical Evaluation of Two Real-Time, Reverse Transcription-PCR Assays for Detection of the Severe Acute Respiratory Syndrome Coronavirus
, 2003
We verified the analytical performance characteristics of a previously described real-time reverse transcription-PCR (RT-PCR) assay targeting the open reading frame (ORF) 1b region of the severe acute respiratory syndrome coronavirus (SARS-CoV) with RNA transcripts. We then compared it to a novel nucleocapsid gene real-time RT-PCR assay with genomic RNA. The assays differed only in the primer and probe sequences and final concentrations. A commercially available armored RNA (Ambion, Austin, Tex.) was evaluated as positive control for the ORF 1b assay. The analytical sensitivity, reproducibility, amplification efficiency, and dynamic range of the assays were similar. Both were specific for SARS-CoV as determined by testing against human CoV 229E and OC43, specimens from patients without SARS, and by BLAST searches of GenBank for primer and probe sequence homology. The armored RNA was found to be a suitable positive control for the ORF 1b assay that could be reliably recovered and amplified from a variety of clinical specimens. In November 2002, an outbreak of atypical pneumonia characterized by fever, respiratory compromise, and a high fatality rate emerged in Guangdong province, China. It subsequently was termed severe acute respiratory syndrome (SARS) and rapidly spread across five continents. A novel coronavirus
Journal of Theoretical Biology 232 (2005) 71–81 Network theory and SARS: predicting outbreak diversity
, 2004
International Journal of Scientometrics, Informetrics and Bibliometrics
LOTKA, a computer program for fitting a power law distribution such as Lotka's, is presented. It basically follows Nicholls' methodology: using a maximum likelihood approach to estimate parameters, and a Kolmogorov-Smirnov test for goodness-of-fit. When input data are converted (from rank-frequency to size-frequency) this program can also be used to test Zipf's law. It can be downloaded here. Permission to download or copy LOTKA is granted without fee, provided this is not done for profit or commercial advantage. Modifications may only be applied with written consent of the authors. Keywords: computer program, power law distribution, Lotka's law, maximum likelihood estimation, Nicholls' methodology, Kolmogorov-Smirnov test, Zipf
Initial SARS Coronavirus Genome Sequence Analysis Using a Bioinformatics Platform
A dedicated anti-SARS bioinformatics web site was set up in April 2003 at the Centre of Bioinformatics (CBI), Peking University (http://antisars.cbi.pku.edu.cn/). A special bioinformatics platform was constructed to analyse the sequence and structure data of SARS coronavirus and other viruses. A total of 32 SARS coronavirus genome sequences were retrieved from GenBank, and multiple sequence alignment revealed mismatches at 30 sites. The SARS coronavirus genome sequences can be divided into three groups based on the phylogenetic analysis using the data set constructed from the sequence mismatches.
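Finding mismatch sites in an alignment, as this abstract describes, amounts to scanning the alignment column by column. The sketch below assumes the sequences are already aligned to equal length; the function name and the toy sequences are illustrative and are not GenBank data or the platform's own code.

```python
def mismatch_sites(aligned):
    """Return 0-based column indices where the aligned sequences disagree."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    # A column is a mismatch site if it holds more than one distinct symbol.
    return [
        col
        for col in range(length)
        if len({seq[col] for seq in aligned}) > 1
    ]


# Toy alignment (hypothetical sequences).
toy = ["ACGT", "ACGA", "ACCT"]
print(mismatch_sites(toy))
```

The columns found this way form the kind of reduced data set that a later grouping or phylogenetic step could work on.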
Scholarly Paper Optimized Probe Selection for Pan-genomic DNA Microarrays
Motivation: Array comparative genomic hybridization is a quick and cheap method for detecting and genotyping unknown microbial isolates. However, there are a fixed number of probes per array, and therefore the number of loci that can be targeted by a single array is limited. For accurate strain genotyping, an array must query a fully representative set of genes from the species' pan-genome. Prior genotyping arrays have only targeted a single strain or the conserved sequences of gene families. Results: This paper presents a new probe selection algorithm (PanArray) that can target multiple whole genomes in a minimal number of probes. Unlike arrays built on clustered gene families, PanArray guarantees that every subsequence of the genomes is independently targeted by a full complement of probes, increasing the flexibility and accuracy of the associated comparative analysis and genotyping. The viability of the algorithm is demonstrated by the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains at greater than two-fold coverage. Availability and Implementation: The PanArray design software is implemented in C++, and the PanArray source code and the L. monocytogenes array design are freely available upon request. Contact:
SARS outbreaks in Ontario, Hong Kong and Singapore: the role of diagnosis and isolation as a control mechanism
, 2003
In this article we use global and regional data from the SARS epidemic in conjunction with a model of susceptible, exposed, infective, diagnosed, and recovered classes of people ("SEIJR") to extract average properties and rate constants for those populations. The model is fitted to data from the outbreaks in Ontario (Toronto, Canada), Hong Kong (China), and Singapore, and predictions are made based on various assumptions and observations, including the current effect of isolating individuals diagnosed with SARS. The epidemic dynamics for Hong Kong and Singapore appear to be different from the dynamics in Toronto, Ontario. Toronto shows a very rapid increase in the number of cases between March 31st and April 6th, followed by a significant slowing in the number of new cases. We explain this as the result of an increase in the diagnostic rate and in the effectiveness of patient isolation after March 26th. Our best estimates are consistent with SARS eventually being contained in Toronto, although the time of containment is sensitive to the parameters in our model. It is shown that despite the empirically modeled heterogeneity in transmission, SARS' average reproductive number is 1.2, a value quite similar to that computed for some strains of influenza (J. Math. Biol. 27 (1989) 233). Although it would not be surprising to see levels of SARS infection higher than 10% in some regions of the world (if unchecked), lack of data and the observed heterogeneity and sensitivity of parameters prevent us from predicting the long-term impact of SARS. The possibility that 10 or more percent of the world population at risk could eventually be infected with the virus in conjunction with a mortality rate of 3–7% or more, and indications of significant improvement in Toronto support the
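The compartment structure named in this abstract (susceptible, exposed, infective, diagnosed, recovered) can be illustrated with a deliberately simplified sketch. It assumes a single susceptible class, no deaths, reduced transmission from diagnosed (isolated) individuals via a factor `q`, and forward-Euler time stepping. These equations and all parameter values are illustrative assumptions, not the fitted model or estimates from the article.

```python
def simulate_seijr(beta, k, alpha, gamma1, gamma2, q,
                   s0, e0, i0, days, dt=0.1):
    """Euler integration of a simplified SEIJR model (illustrative only)."""
    s, e, i, j, r = float(s0), float(e0), float(i0), 0.0, 0.0
    n = s + e + i  # total population, constant because deaths are ignored
    trajectory = [(0.0, s, e, i, j, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        # Diagnosed individuals transmit at a reduced rate q (isolation).
        new_infections = beta * s * (i + q * j) / n
        ds = -new_infections
        de = new_infections - k * e
        di = k * e - (alpha + gamma1) * i   # alpha: diagnosis rate
        dj = alpha * i - gamma2 * j
        dr = gamma1 * i + gamma2 * j
        s, e, i, j, r = (s + ds * dt, e + de * dt, i + di * dt,
                         j + dj * dt, r + dr * dt)
        trajectory.append((step * dt, s, e, i, j, r))
    return trajectory


# Hypothetical parameter values chosen only to make the sketch run.
traj = simulate_seijr(beta=0.75, k=1 / 3, alpha=1 / 3, gamma1=1 / 8,
                      gamma2=1 / 5, q=0.1, s0=9999, e0=0, i0=1, days=60)
t_end, s_end, e_end, i_end, j_end, r_end = traj[-1]
print(round(s_end), round(r_end))
```

In a sketch of this form, raising the diagnosis rate `alpha` or lowering `q` weakens transmission, which is the kind of control effect the abstract attributes to diagnosis and isolation.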
A Trimerizing GxxxG Motif Is Uniquely Inserted in the Severe Acute Respiratory Syndrome (SARS) Coronavirus Spike Protein Transmembrane Domain
, 2006
ABSTRACT: In an attempt to understand what distinguishes severe acute respiratory syndrome (SARS) coronavirus (SCoV) from other members of the coronaviridae, we searched for elements that are unique to its proteins and not present in any other family member. We identified an insertion of two glycine residues, forming the GxxxG motif, in the SCoV spike protein transmembrane domain (TMD), which is not found in any other coronavirus. This surprising finding raises an “oligomerization riddle”: the GxxxG motif is a known dimerization signal, while the SCoV spike protein is known to be trimeric. Using an in vivo assay, we found that the SCoV spike protein TMD is oligomeric and that this oligomerization is driven by the GxxxG motif. We also found that the GxxxG motif contributes toward the trimerization of the entire spike protein, in that mutations in the GxxxG motif decrease trimerization of the full-length protein expressed in mammalian cells. Using molecular modeling, we show that the SCoV spike protein TMD adopts a distinct and unique structure as opposed to all other coronaviruses. In this unique structure, the glycine residues of the GxxxG motif are facing each other, enhancing helix-helix interactions by allowing for the close positioning of the helices. This unique orientation of the glycine residues also stabilizes the trimeric bundle during multi-nanosecond molecular dynamics simulation in a hydrated lipid bilayer. To the best of our knowledge, this is the first demonstration that the GxxxG motif can potentiate

