Results 1–6 of 6
On Clusterings: Good, Bad and Spectral (2000)
Abstract

Cited by 254 (12 self)
We motivate and develop a natural bicriteria measure for assessing the quality of a clustering which avoids the drawbacks of existing measures. A simple recursive heuristic has polylogarithmic worst-case guarantees under the new measure. The main result of the paper is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees.
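One common spectral-clustering variant splits a graph by the sign pattern of the Fiedler vector (the Laplacian's second eigenvector). The sketch below, a pure-Python illustration and not necessarily the exact variant the paper analyzes, finds that vector by power iteration on a shifted Laplacian; the function name `fiedler_split` and the toy graph are assumptions for illustration.

```python
import math

def fiedler_split(adj):
    """Split a graph in two by the sign of the Fiedler vector
    (2nd eigenvector of the Laplacian L = D - A), found by power
    iteration on c*I - L while staying orthogonal to the all-ones
    vector (L's eigenvector for eigenvalue 0)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2 * max(deg) + 1  # shift so the Fiedler vector is the dominant direction
    x = [math.sin(i + 1) for i in range(n)]  # generic starting vector
    mean = sum(x) / n
    x = [v - mean for v in x]
    for _ in range(500):
        # y = (c*I - L) x = (c*I - D + A) x
        y = [(c - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]  # re-project orthogonal to the ones vector
        norm = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / norm for v in y]
    return ([i for i in range(n) if x[i] >= 0],
            [i for i in range(n) if x[i] < 0])

# Two triangles joined by a single edge: the spectral cut separates them.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
a, b = fiedler_split(adj)
print(sorted(a), sorted(b))
```

On this toy graph the sign split recovers the two triangles, cutting only the bridge edge.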
Sublinear Time Algorithms for Metric Space Problems
Abstract

Cited by 80 (2 self)
In this paper we give approximation algorithms for the following problems on metric spaces: Furthest Pair, k-median, Minimum Routing Cost Spanning Tree, Multiple Sequence Alignment, Maximum Traveling Salesman Problem, Maximum Spanning Tree and Average Distance. The key property of our algorithms is that their running time is linear in the number of metric space points. As the full specification of an n-point metric space is of size Θ(n²), the complexity of our algorithms is sublinear with respect to the input size. All previous algorithms (exact or approximate) for the problems we consider have running time Ω(n²). We believe that our techniques can be applied to get similar bounds for other problems.
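The abstract's central point is that one can answer global questions about a metric space while reading far fewer than the Θ(n²) distances that specify it. The sketch below is a naive illustration of that idea for the Average Distance problem, estimating the mean pairwise distance from a few thousand sampled pairs; the paper's actual algorithms and guarantees are more refined, and `avg_distance_estimate` and its parameters are assumptions for illustration.

```python
import random

def avg_distance_estimate(points, dist, samples=2000, seed=0):
    """Estimate the average pairwise distance of a metric space by
    averaging over uniformly sampled pairs: O(samples) distance
    evaluations instead of the Theta(n^2) entries of the full matrix."""
    rng = random.Random(seed)
    n = len(points)
    total = 0.0
    for _ in range(samples):
        i, j = rng.sample(range(n), 2)  # a uniformly random distinct pair
        total += dist(points[i], points[j])
    return total / samples

# 1000 points on a line; true average |a - b| over distinct pairs is
# (n + 1) / 3, roughly 333.7 here.
pts = list(range(1000))
d = lambda a, b: abs(a - b)
print(avg_distance_estimate(pts, d))  # close to the true mean
```

The number of samples needed for a given accuracy depends on the variance of the distances, not on n, which is what makes the running time sublinear in the input size.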
Hierarchical Reliable Multicast: performance analysis and placement of proxies (2000)
Abstract

Cited by 12 (0 self)
The use of proxies for local error recovery and congestion control is a scalable technique used to overcome a number of well-known problems in Reliable Multicast (RM). The idea is that the multicast delivery tree is partitioned into subgroups that form a hierarchy rooted at the source, hence the term Hierarchical Reliable Multicast (HRM). For each subgroup, there is a designated node, the proxy, which is responsible for collecting the feedback from the subgroup and for locally retransmitting the lost packets. The performance of any RM protocol is affected by the underlying multicast routing tree and its loss characteristics. Furthermore, the performance of the HRM approach, in particular, strongly depends on the appropriate partitioning of the tree and the selection of proxies. In this paper, we first model the HRM problem, then define and compute appropriate performance metrics, and finally give insights on the optimal location of proxies.
Keywords: Performance analysis, reliable mult...
Using Compression For Source Based Classification Of Text (2001)
Abstract

Cited by 3 (0 self)
This thesis addresses the problem of source-based text classification. In a nutshell, this problem involves classifying documents according to "where they came from" instead of the usual "what they contain". Viewed from a machine learning perspective, this can be looked upon as a learning problem and can be classified into two categories: supervised and unsupervised learning. In the former case, the classifier is presented with known examples of documents and their sources during the training phase. In the testing phase, the classifier is given a document whose source is unknown, and the goal of the classifier is to find the most likely one from the category of known sources. In the latter case, the classifier is just presented with samples of text, and its goal is to detect regularities in the data set. One such goal could be a clustering of the documents based on common authorship. In order to perform these classification tasks, we intend to use compression as the underlying technique. Compression can be viewed as a predict-encode process where the prediction of upcoming tokens is done by adaptively building a model from the text seen so far. This source modelling feature of compression algorithms allows for classification by purely statistical means.
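The predict-encode idea behind supervised source classification can be sketched in a few lines: a candidate source "predicts" a document well if appending the document to a sample of that source compresses cheaply. The sketch below uses `zlib` as a stand-in compressor; the thesis itself need not use zlib, and `classify_by_source`, the toy corpora, and the parameter choices are assumptions for illustration.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Size of the zlib-compressed data at maximum compression level.
    return len(zlib.compress(data, 9))

def classify_by_source(doc: str, sources: dict) -> str:
    """Pick the source whose sample best predicts doc: the one with
    the smallest *extra* cost of compressing doc appended to it."""
    doc_b = doc.encode()
    best, best_cost = None, float("inf")
    for name, sample in sources.items():
        sample_b = sample.encode()
        cost = compressed_size(sample_b + doc_b) - compressed_size(sample_b)
        if cost < best_cost:
            best, best_cost = name, cost
    return best

sources = {
    "weather": "rain sun cloud wind storm rain sun cloud " * 20,
    "sports":  "goal ball team score match goal ball team " * 20,
}
print(classify_by_source("storm wind rain cloud", sources))  # prints: weather
```

A document sharing vocabulary with a source compresses mostly into back-references against that source's sample, so its incremental cost is low; against an unrelated sample it must be encoded largely from scratch.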
Abstract
In this literature review, we survey graph-based clustering and its application in coreference resolution. We state that the methodology of graph-based clustering can be described by a five-part story: (1) hypothesis, which hypothesizes that a graph can be partitioned into densely connected subgraphs that are sparsely connected to each other; (2) modeling, which deals with the problem of transforming data into a graph; (3) measure, which is an objective function that rates the quality of a clustering; (4) algorithm, which aims to optimize the measure; (5) evaluation, which evaluates the performance of a system clustering relative to a ground-truth clustering. We then survey coreference resolution, which is further split into two problems: entity coreference resolution and event coreference resolution. We focus on discussing how the graph-based clustering methodology has been applied in solving these two problems.