Results 1–10 of 64
An Efficient k-Means Clustering Algorithm: Analysis and Implementation
, 2000
Abstract

Cited by 208 (3 self)
k-means clustering is a very popular clustering technique, which is used in numerous applications. Given a set of n data points in R^d and an integer k, the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is very easy to implement. It differs from most other approaches in that it precomputes a kd-tree data structure for the data points rather than the center points. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time. Second, we have implemented the algorithm and performed a number of empirical studies, both on synthetically generated data and on real...
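The filtering algorithm itself requires a kd-tree; the Lloyd iteration it accelerates can be sketched in a few lines of plain Python. This is a minimal hypothetical baseline with a naive first-k initialization, not the paper's implementation:

```python
def lloyd_kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm: alternate nearest-center assignment and
    centroid update until the centers stop moving."""
    # Naive deterministic init: the first k points (k-means++ would be better).
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the centroid of its cluster.
        new_centers = []
        for j, cl in enumerate(clusters):
            if cl:
                d = len(cl[0])
                new_centers.append(tuple(sum(p[i] for p in cl) / len(cl)
                                         for i in range(d)))
            else:
                new_centers.append(centers[j])  # keep a center with no points
        if new_centers == centers:
            break  # a local minimum of the mean squared distance is reached
        centers = new_centers
    return centers

# Two well-separated blobs; the centers settle on the blob means.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(lloyd_kmeans(data, 2))
```

The filtering idea replaces the brute-force assignment step above with a kd-tree traversal that prunes centers that cannot be nearest to any point in a cell.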
A local search approximation algorithm for k-means clustering
, 2004
Abstract

Cited by 71 (1 self)
In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with
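The single-swap version of such a heuristic is easy to state: restrict centers to data points, and repeatedly replace one center by one candidate point whenever that strictly lowers the distortion. A toy sketch (hypothetical, without the paper's approximation machinery):

```python
from itertools import product

def distortion(points, centers):
    """Objective: total squared distance from each point to its nearest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

def swap_local_search(points, k):
    """Single-swap local search: start from an arbitrary k-subset of the data
    points, then keep swapping one center for one candidate point for as long
    as the distortion strictly improves."""
    centers = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i, cand in product(range(k), points):
            if cand in centers:
                continue
            trial = centers[:i] + [cand] + centers[i + 1:]
            if distortion(points, trial) < distortion(points, centers):
                centers, improved = trial, True
                break  # restart the scan from the improved solution
    return centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best = swap_local_search(data, 2)
```

The paper's analysis concerns exactly this kind of procedure: it bounds how bad a swap-local-optimum can be relative to the global optimum.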
Robustness and fragility of Boolean models for genetic regulatory networks
 J. Theoretical Biology
, 2005
Abstract

Cited by 47 (10 self)
Interactions between genes and gene products give rise to complex circuits that enable cells to process information and respond to external signals. Theoretical studies often describe these interactions using continuous, stochastic, or logical approaches. We propose a new modeling framework for gene regulatory networks that combines the intuitive appeal of a qualitative description of gene states with a high flexibility in incorporating stochasticity in the duration of cellular processes. We apply our methods to the regulatory network of the segment polarity genes, thus gaining novel insights into the development of gene expression patterns. For example, we show that very short synthesis and decay times can perturb the wild-type pattern. On the other hand, separation of time scales between pre- and post-translational processes and a minimal pre-pattern ensure convergence to the wild-type expression pattern regardless of fluctuations.
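In the simplest synchronous, deterministic Boolean setting that this framework generalizes, a network is just a tuple of update rules iterated in lockstep. A toy three-gene negative feedback loop (hypothetical, not the segment polarity network, and without the paper's stochastic process durations) illustrates the idea:

```python
def step(state, rules):
    """One synchronous update: every gene recomputes its state from the
    current states of its regulators."""
    return tuple(bool(rule(state)) for rule in rules)

# Toy circuit: gene 0 is repressed by gene 2; genes 1 and 2 relay activation.
rules = [
    lambda s: not s[2],  # g0 = NOT g2
    lambda s: s[0],      # g1 = g0
    lambda s: s[1],      # g2 = g1
]

def trajectory(state, rules, steps=12):
    """Iterate the network, recording each successive expression pattern."""
    traj = [state]
    for _ in range(steps):
        state = step(state, rules)
        traj.append(state)
    return traj

# The single negative feedback loop oscillates instead of reaching a fixed point.
traj = trajectory((True, False, False), rules)
```

The paper's point is that conclusions drawn from such synchronous updates can be fragile: randomizing the relative durations of the processes can change which attractors are reached.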
Importance Sampling for Portfolio Credit Risk
 Management Science
, 2003
Abstract

Cited by 45 (7 self)
Monte Carlo simulation is widely used to measure the credit risk in portfolios of loans, corporate bonds, and other instruments subject to possible default. The accurate measurement of credit risk is often a rare-event simulation problem, because default probabilities are low for highly rated obligors and because risk management is particularly concerned with rare but significant losses resulting from a large number of defaults. This makes importance sampling (IS) potentially attractive. But the application of IS is complicated by the mechanisms used to model dependence between obligors, and capturing this dependence is essential to a portfolio view of credit risk. This paper provides an IS procedure for the widely used normal copula model of portfolio credit risk. The procedure has two parts: one applies IS conditional on a set of common factors affecting multiple obligors, the other applies IS to the factors themselves. The relative importance of the two parts of the procedure is determined by the strength of the dependence between obligors. We provide both theoretical and numerical support for the method.
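The core IS mechanism, shifting the sampling distribution onto the rare region and reweighting by the likelihood ratio, can be shown on a one-dimensional Gaussian tail. This is a toy sketch of the technique, not the paper's two-part factor procedure or the normal copula model:

```python
import math
import random

def tail_prob_is(threshold, n=100_000, seed=1):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling:
    draw from N(threshold, 1) so the rare region is hit often, then weight
    each hit by the density ratio N(0,1)(z) / N(threshold,1)(z)."""
    rng = random.Random(seed)
    mu = threshold  # shift the sampling mean onto the rare region
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu, 1.0)
        if z > threshold:
            total += math.exp(-mu * z + 0.5 * mu * mu)  # likelihood ratio
    return total / n

# P(Z > 4) is about 3.17e-5; naive Monte Carlo with 100,000 samples would see
# only a handful of exceedances, while the shifted sampler hits about half the time.
est = tail_prob_is(4.0)
```

In the portfolio setting the same trick is applied twice: to the default indicators conditional on the common factors, and to the factors themselves.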
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
 Journal of Parallel and Distributed Computing
, 2001
Abstract

Cited by 33 (0 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.
Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.
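The simplest member of this model family is the classic expected-runtime formula for exponential failures, which already exposes the checkpoint-interval trade-off. This is a hypothetical simplification ignoring restart cost and processor allocation, not the paper's Markov model:

```python
import math

def expected_runtime(work, interval, ckpt_cost, mtbf):
    """Expected completion time of `work` seconds of useful computation when
    a checkpoint costing `ckpt_cost` seconds is taken every `interval`
    seconds and failures arrive with exponential mean `mtbf`. Each segment
    of length interval + ckpt_cost must complete failure-free; a failure-free
    stretch of length t costs mtbf * (e^(t/mtbf) - 1) expected seconds,
    counting all rework after failures."""
    seg = interval + ckpt_cost
    segments = work / interval
    return segments * mtbf * math.expm1(seg / mtbf)

# Young's rule of thumb: the optimal interval is about sqrt(2 * cost * MTBF).
mtbf, cost, work = 24 * 3600.0, 60.0, 7 * 24 * 3600.0
good = expected_runtime(work, math.sqrt(2 * cost * mtbf), cost, mtbf)
bad = expected_runtime(work, 10 * math.sqrt(2 * cost * mtbf), cost, mtbf)
# Checkpointing near Young's interval beats a 10x longer interval.
```

Selecting the other runtime parameters the abstract mentions, such as how many processors to allocate, amounts to adding dimensions to this optimization.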
Piecewise-linear diffusion processes
 Advances in Queueing
, 1995
Abstract

Cited by 27 (8 self)
Diffusion processes are often regarded as among the more abstruse stochastic processes, but diffusion processes are actually relatively elementary, and thus are natural first candidates to consider in queueing applications. To help demonstrate the advantages of diffusion processes, we show that there is a large class of one-dimensional diffusion processes for which it is possible to give convenient explicit expressions for the steady-state distribution, without writing down any partial differential equations or performing any numerical integration. We call these tractable diffusion processes piecewise-linear; the drift function is piecewise linear, while the diffusion coefficient is piecewise constant. The explicit expressions for steady-state distributions in turn yield explicit expressions for long-run average costs in optimization problems, which can be analyzed with the aid of symbolic mathematics packages. Since diffusion processes have continuous sample paths, approximation is required when they are used to model discrete-valued processes. We also discuss strategies for performing this approximation, and we investigate when this approximation is good for the steady-state distribution of birth-and-death processes. We show that the diffusion approximation tends to be good when the differences between the birth and death rates are small compared to the death rates.
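For a one-dimensional diffusion with drift μ(x) and diffusion coefficient σ²(x), the stationary density is proportional to exp(2∫ μ(u)/σ²(u) du) / σ²(x); with piecewise-linear μ and piecewise-constant σ² the exponent is piecewise quadratic, which is why closed forms exist. A numerical sketch with a hypothetical example drift, not one of the paper's queueing models:

```python
import math

def stationary_density(mu, sigma2, lo, hi, n=4000):
    """Tabulate p(x) ∝ exp(2 ∫ mu/sigma2) / sigma2 on [lo, hi] with the
    trapezoid rule, then normalize so the density integrates to 1."""
    h = (hi - lo) / n
    xs = [lo + h * i for i in range(n + 1)]
    vals, integ = [], 0.0
    prev = 2 * mu(xs[0]) / sigma2(xs[0])
    for idx, x in enumerate(xs):
        cur = 2 * mu(x) / sigma2(x)
        if idx > 0:
            integ += 0.5 * (prev + cur) * h  # cumulative integral of 2*mu/sigma2
        prev = cur
        vals.append(math.exp(integ) / sigma2(x))
    z = sum(0.5 * (vals[i] + vals[i + 1]) * h for i in range(n))
    return xs, [v / z for v in vals]

# Piecewise-linear drift toward 0, pulling twice as hard from the right, with
# constant diffusion coefficient: a two-piece Gaussian-shaped density results.
mu = lambda x: -x if x < 0 else -2.0 * x
sigma2 = lambda x: 1.0
xs, dens = stationary_density(mu, sigma2, -5.0, 5.0)
```

The closed-form expressions the abstract refers to would replace this numerical tabulation with explicit piecewise Gaussian and exponential terms.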
Model-based diagnostics and probabilistic assumption-based reasoning
 Artificial Intelligence
, 1998
Abstract

Cited by 22 (16 self)
The mathematical foundations of model-based diagnostics, or diagnosis from first principles, have been laid by Reiter [31]. In this paper we extend Reiter's ideas of model-based diagnostics by introducing probabilities into Reiter's framework. This is done in a mathematically sound and precise way which allows one to compute the posterior probability that a certain component is not working correctly given some observations of the system. A straightforward computation of these probabilities is not efficient, and in this paper we propose a new method to solve this problem. Our method is logic-based and borrows ideas from assumption-based reasoning and the ATMS. We show how it is possible to determine arguments in favor of the hypothesis that a certain group of components is not working correctly. These arguments represent the symbolic or qualitative aspect of the diagnosis process. They are then used to derive a quantitative or numerical aspect represented by the posterior probabilities. Using two new theorems about the relation between Reiter's notion of conflict and our notion of argument, we prove that our so-called degree of support is nothing but the posterior probability that we are looking for. Furthermore, a model where each component may have more than two different operating modes is discussed, and a new algorithm to compute posterior probabilities in this case is presented.
Key words: Model-based diagnostics; Assumption-based reasoning; ATMS;
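The "straightforward computation" the abstract calls inefficient can be written down directly for a tiny system: enumerate all health assignments, keep those consistent with the observations, and renormalize the prior weights. A hypothetical two-inverter example (two operating modes per component, broken behavior unconstrained), not the paper's argument-based method, which exists precisely because this enumeration is exponential:

```python
from itertools import product

def posterior_faulty(components, prior, consistent):
    """Exact posterior P(component faulty | observations) by enumerating all
    health assignments. `consistent(health)` encodes the system description
    plus observations; `prior[c]` is component c's prior fault probability."""
    weights, total = {}, 0.0
    for health in product([True, False], repeat=len(components)):
        w = 1.0
        for c, ok in zip(components, health):
            w *= (1 - prior[c]) if ok else prior[c]
        if consistent(dict(zip(components, health))):
            total += w
            weights[health] = w
    return {c: sum(w for h, w in weights.items() if not h[i]) / total
            for i, c in enumerate(components)}

# Two inverters in series, input 0, observed output 1 (0 was expected), so at
# least one inverter must be broken: a "conflict" in Reiter's sense.
def consistent(h):
    # A broken inverter may output anything, so any state with at least one
    # fault can explain output 1; the all-healthy state predicts 0 and cannot.
    return not (h["inv1"] and h["inv2"])

post = posterior_faulty(["inv1", "inv2"], {"inv1": 0.1, "inv2": 0.1}, consistent)
```

With equal priors of 0.1, conditioning on the conflict raises each inverter's fault probability from 0.1 to 0.1/0.19 ≈ 0.53.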
A Probabilistic Approach to Navigation in Hypertext
 Information Sciences
, 1999
Abstract

Cited by 21 (9 self)
One of the main unsolved problems confronting Hypertext is the navigation problem, namely the problem of knowing where you are in the database graph representing the structure of a Hypertext database, and knowing how to get to some other place you are searching for in the database graph. Previously we formalised a Hypertext database in terms of a directed graph whose nodes represent pages of information. The notion of a trail, which is a path in the database graph describing some logical association amongst the pages in the trail, is central to our model. We defined a Hypertext Query Language, HQL, over Hypertext databases and showed that in general the navigation problem, i.e. the problem of finding a trail that satisfies an HQL query (technically known as the model checking problem), is NP-complete. Herein we present a preliminary investigation of using a probabilistic approach in order to enhance the efficiency of model checking. The flavour of our investigation is that if we h...
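In the simplest instance of this trail-finding view, a query is a sequence of page labels and a trail satisfies it if the labels along the path spell out the query. A toy depth-first sketch (hypothetical graph and labels; exhaustive search like this is exponential in the worst case, which is the NP-completeness the abstract refers to):

```python
def find_trail(graph, labels, query, start):
    """Depth-first search for a trail from `start` whose successive page
    labels spell out `query`. Returns one such trail, or None.
    `graph` maps node -> successor nodes; `labels` maps node -> label."""
    def dfs(node, i, path):
        if labels[node] != query[i]:
            return None
        if i == len(query) - 1:
            return path + [node]
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep the trail simple
                found = dfs(nxt, i + 1, path + [node])
                if found:
                    return found
        return None
    return dfs(start, 0, [])

# Tiny hypertext graph; labels model page topics.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
labels = {"a": "home", "b": "news", "c": "docs", "d": "contact"}
trail = find_trail(graph, labels, ["home", "docs", "contact"], "a")
```

The probabilistic approach the abstract proposes aims to guide this search so that expensive backtracking (here, the dead end through "b") is rarely needed.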
MAC vs. PC: Determinism and Randomness as Complementary Approaches to Robotic Exploration of Continuous Unknown Domains
, 2000
Abstract

Cited by 17 (1 self)
Three methods are described for exploring a continuous unknown planar region by a group of robots having limited sensors and no explicit communication. We formalize the problem, prove that its offline version is NP-hard, and show a lower bound on the length of any solution. Then a deterministic mark-and-cover (MAC) algorithm is described for the online problem, using short-lived navigational markers as a means of navigation and indirect communication. The convergence of the algorithm is proved, and its cover time is shown to be the asymptotically optimal O(A/a), where A is the total area and a is the area covered by the robot in a single step. The MAC algorithm is tested against an alternative randomized probabilistic covering (PC) method, which does not rely on sensors but is still able to cover an unknown region in an expected time that depends polynomially on the dimensions of the region. Both algorithms enable cooperation of several robots to achieve faster coverage. Finally, we show...
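The flavor of marker-based coverage can be sketched on a discretized region: the robot stamps a counter on each cell it leaves and always moves to the neighbor with the oldest stamp, so unvisited cells (stamp 0) are always preferred. This least-recently-visited rule is a hypothetical stand-in for the paper's MAC algorithm, which uses decaying physical markers and comes with the O(A/a) cover-time proof:

```python
def cover(grid_w, grid_h, start=(0, 0), max_steps=10_000):
    """Marker-based coverage on a grid: stamp an increasing counter on each
    cell when leaving it, and always move to the neighbor with the smallest
    stamp (unvisited cells count as 0; ties broken deterministically)."""
    stamp = {}          # cell -> counter value when the robot last left it
    pos, t = start, 0
    visited = {start}
    while len(visited) < grid_w * grid_h and t < max_steps:
        t += 1
        stamp[pos] = t  # drop a marker on the cell being left
        x, y = pos
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < grid_w and 0 <= y + dy < grid_h]
        pos = min(neighbors, key=lambda c: (stamp.get(c, 0), c))
        visited.add(pos)
    return visited, t

visited, steps = cover(4, 4)
```

Because markers stand in for communication, several robots sharing the same stamps would divide the region among themselves without ever exchanging messages directly.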
Nonparametric estimation in random coefficients binary choice models
 Unpublished manuscript
, 2010
Abstract

Cited by 14 (1 self)
This paper considers random coefficients binary choice models. The main goal is to estimate the density of the random coefficients nonparametrically. This is an ill-posed inverse problem characterized by an integral transform. A new density estimator for the random coefficients is developed, utilizing Fourier–Laplace series on spheres. This approach offers clear insight into the identification problem. More importantly, it leads to a closed-form estimator formula that yields a simple plug-in procedure requiring no numerical optimization. The new estimator, therefore, is easy to implement in empirical applications, while being flexible about the treatment of unobserved heterogeneity. Extensions including treatments of nonrandom coefficients and models with endogeneity are discussed.