P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering arXiv:cs/0606048
, 2006
Abstract

Cited by 13 (3 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
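To make the count of $3\binom{n}{4}$ quartet topologies concrete, the following minimal Python sketch enumerates them for a small distance matrix. The abstract does not state how weights are derived, so the cost used here (the sum of the two within-pair distances) is an assumption chosen for illustration, and the function name `quartet_costs` and the sample matrix are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_costs(dist):
    """Map each quartet topology ((u, v), (w, x)) to an illustrative cost.

    Assumption: the cost of topology uv|wx is dist[u][v] + dist[w][x].
    The abstract only says weights are derived from a distance matrix.
    """
    costs = {}
    for a, b, c, d in combinations(range(len(dist)), 4):
        # Four leaves can be split into two pairs in exactly three ways.
        splits = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for (u, v), (w, x) in splits:
            costs[((u, v), (w, x))] = dist[u][v] + dist[w][x]
    return costs


# Made-up symmetric distance matrix on five objects.
dist = [
    [0.00, 0.10, 0.90, 0.80, 0.70],
    [0.10, 0.00, 0.85, 0.75, 0.65],
    [0.90, 0.85, 0.00, 0.20, 0.60],
    [0.80, 0.75, 0.20, 0.00, 0.50],
    [0.70, 0.65, 0.60, 0.50, 0.00],
]

costs = quartet_costs(dist)
print(len(costs), 3 * comb(len(dist), 4))
```

Because the number of topologies grows as $n^4$, exhaustive enumeration like this is only practical for small $n$.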
Automatic Extraction of Meaning from the Web
 IEEE International Symposium on Information Theory
, 2006
Abstract

Cited by 11 (2 self)
We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
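As a rough illustration of a distance built from page counts, the sketch below computes the formula commonly given for the normalized Google distance. The abstract itself does not spell out the formula, so treating it as the one intended here is an assumption. The function name `ngd`, its parameter names, and the counts in the example are hypothetical; no search engine is queried.

```python
from math import log


def ngd(fx, fy, fxy, n):
    """Distance between two search terms from page counts.

    fx, fy: pages containing each term alone (assumed positive).
    fxy:    pages containing both terms.
    n:      assumed total number of indexed pages.
    """
    if fxy == 0:
        # The terms never co-occur, so treat them as maximally distant.
        return float("inf")
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))


# Made-up counts, purely for illustration.
print(ngd(1000, 100, 10, 10**6))
```

Terms that always appear together get a distance of 0, and the value grows as co-occurrence becomes rarer relative to the individual counts.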
Similarity of objects and the meaning of words
 In Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of LNCS
, 2006
Abstract

Cited by 6 (0 self)
We survey the emerging area of compression-based, parameter-free similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
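A compression-based distance of the kind surveyed here can be sketched in a few lines of Python. The version below uses the formula usually given for the normalized compression distance, with `zlib` standing in for the compressor. Both choices are assumptions for illustration: the abstract names neither a specific formula nor a compressor, and the helper names `clen` and `ncd` are hypothetical.

```python
import zlib


def clen(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based distance: small when x and y share structure."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 50
other = bytes(range(256)) * 8  # content unrelated to the text

print(ncd(text, text), ncd(text, other))
```

With a real compressor the distance of an object to itself is small but usually not exactly 0, which is one reason results depend on the compressor chosen.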
IP Covert Channel Detection
, 2008
Abstract

Cited by 5 (0 self)
A covert channel can occur when an attacker finds and exploits a shared resource that is not designed to be a communication mechanism. A network covert channel operates by altering the timing of otherwise legitimate network traffic so that the arrival times of packets encode confidential data that an attacker wants to exfiltrate from a secure area from which she has no other means of communication. In this paper, we present the first public implementation of an IP covert channel, discuss the subtle issues that arose in its design, and discuss its efficacy. We then show that an IP covert channel can be differentiated from legitimate channels and present new detection measures that provide detection rates over 95%. We next take the simple step an attacker would take: adding noise to the channel to attempt to conceal the covert communication. For these noisy IP covert timing channels, we show that our online detection measures can fail to identify the covert channel for noise levels higher than 10%. We then provide effective offline search mechanisms that identify the noisy channels.
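The idea that packet timing can carry data is easy to see in a toy simulation. The sketch below is not the paper's implementation: the encoding scheme (a short or long gap per bit), the delay values, the jitter range, and the function names `encode` and `decode` are all hypothetical, and nothing is sent over a network.

```python
import random


def encode(bits, short=0.02, long=0.08):
    """Turn bits into inter-packet delays in seconds (hypothetical scheme)."""
    return [long if b else short for b in bits]


def decode(delays, threshold=0.05):
    """Recover bits by thresholding the observed inter-packet delays."""
    return [1 if d > threshold else 0 for d in delays]


rng = random.Random(0)
message = [1, 0, 1, 1, 0, 0, 1, 0]

# Simulate network jitter of at most 10 ms on each delay.
observed = [d + rng.uniform(-0.01, 0.01) for d in encode(message)]

print(decode(observed) == message)
```

In this toy setup the jitter stays below the gap between the two delay values, so decoding always succeeds. The two tightly clustered delay values are also the kind of regularity a detector could look for.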
Universal similarity
 in Proc. IEEE ISOC ITW2005 on Coding and Complexity
, 2005
Abstract

Cited by 3 (0 self)
We survey a new area of parameter-free similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
Optimizing Quartet Trees Through Monte Carlo Methods
Abstract
The problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used, and is implemented and available, as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package. Index Terms: evolutionary tree, global optimization, Monte Carlo method, quartet method, randomized hill climbing,