Results 1–10 of 96
An Efficient k-Means Clustering Algorithm: Analysis and Implementation
, 2000
Abstract

Cited by 405 (4 self)
K-means clustering is a very popular clustering technique, which is used in numerous applications. Given a set of n data points in R^d and an integer k, the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is very easy to implement. It differs from most other approaches in that it precomputes a kd-tree data structure for the data points rather than the center points. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time. Second, we have implemented the algorithm and performed a number of empirical studies, both on synthetically generated data and on real...
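The Lloyd iteration that the filtering algorithm accelerates can be sketched in a few lines. This is a minimal pure-Python version with a naive linear-scan assignment step, not the paper's kd-tree filtering variant; the function name and data layout are illustrative:

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-center assignment and
    centroid update. (The filtering algorithm speeds up the assignment
    step with a kd-tree; this sketch scans all k centers per point.)"""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        for j, c in enumerate(clusters):
            if c:  # leave a center in place if its cluster went empty
                centers[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centers
```

The per-iteration cost of this naive version is O(nk); the paper's contribution is making the assignment step data-sensitive.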
A local search approximation algorithm for k-means clustering
, 2004
Abstract

Cited by 105 (1 self)
In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with...
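The swap-based local improvement idea can be illustrated as follows. This is a simplified single-swap sketch over a finite candidate set, not the paper's analyzed algorithm; function names and the greedy acceptance rule are assumptions for illustration:

```python
def kmeans_cost(points, centers):
    # Sum over points of the squared distance to the nearest center.
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

def single_swap_heuristic(points, candidates, k):
    """Single-swap local search in the spirit of the paper: start from the
    first k candidates, then repeatedly swap one chosen center for one
    unused candidate whenever that strictly lowers the k-means cost."""
    centers = list(candidates[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for c in candidates:
                if c in centers:
                    continue
                trial = centers[:i] + [c] + centers[i + 1:]
                if kmeans_cost(points, trial) < kmeans_cost(points, centers):
                    centers, improved = trial, True
    return centers
```

Because the cost strictly decreases with each accepted swap and the candidate set is finite, the loop terminates at a local optimum; the paper's (9 + ε) bound concerns the quality of such local optima.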
RANDOM SAMPLING IN CUT, FLOW, AND NETWORK DESIGN PROBLEMS
, 1999
Abstract

Cited by 102 (12 self)
We use random sampling as a tool for solving undirected graph problems. We show that the sparse graph, or skeleton, that arises when we randomly sample a graph’s edges will accurately approximate the value of all cuts in the original graph with high probability. This makes sampling effective for problems involving cuts in graphs. We present fast randomized (Monte Carlo and Las Vegas) algorithms for approximating and exactly finding minimum cuts and maximum flows in unweighted, undirected graphs. Our cut-approximation algorithms extend unchanged to weighted graphs, while our weighted-graph flow algorithms are somewhat slower. Our approach gives a general paradigm with potential applications to any packing problem. It has since been used in a near-linear time algorithm for finding minimum cuts, as well as faster cut and flow algorithms. Our sampling theorems also yield faster algorithms for several other cut-based problems, including approximating the best balanced cut of a graph, finding a k-connected orientation of a 2k-connected graph, and finding integral multicommodity flows in graphs with a great deal of excess capacity. Our methods also improve the efficiency of some parallel cut and flow algorithms. Our methods also apply to the network design problem, where we wish to build a network satisfying certain connectivity requirements between vertices. We can purchase edges of various costs and wish to satisfy the requirements at minimum total cost. Since our sampling theorems apply even when the sampling probabilities are different for different edges, we can apply randomized rounding to solve network design problems. This gives approximation algorithms that guarantee much better approximations than previous algorithms whenever the minimum connectivity requirement is large. As a particular example, we improve the best approximation bound for the minimum k-connected subgraph problem from 1.85 to 1 + O(√(log n)/k).
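The skeleton construction is simple to state in code: keep each edge independently with probability p, and rescale observed cut values by 1/p. A minimal sketch (function names are illustrative; the paper's contribution is the concentration guarantee over all cuts simultaneously, which this demo does not prove):

```python
import random

def sample_skeleton(edges, p, seed=0):
    """Keep each edge independently with probability p. A cut of value v
    in the original graph then has expected value p*v in the skeleton."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]

def cut_value(edges, side):
    # Number of edges with exactly one endpoint in the vertex set `side`.
    return sum((u in side) != (v in side) for u, v in edges)
```

For example, on the complete graph K_40 the cut separating the first 20 vertices has value 400, and the rescaled skeleton value concentrates near 400.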
Robustness and fragility of Boolean models for genetic regulatory networks
 J. Theoretical Biology
, 2005
Abstract

Cited by 70 (10 self)
Interactions between genes and gene products give rise to complex circuits that enable cells to process information and respond to external signals. Theoretical studies often describe these interactions using continuous, stochastic, or logical approaches. We propose a new modeling framework for gene regulatory networks that combines the intuitive appeal of a qualitative description of gene states with high flexibility in incorporating stochasticity in the duration of cellular processes. We apply our methods to the regulatory network of the segment polarity genes, thus gaining novel insights into the development of gene expression patterns. For example, we show that very short synthesis and decay times can perturb the wild-type pattern. On the other hand, separation of timescales between pre- and post-translational processes and a minimal prepattern ensure convergence to the wild-type expression pattern regardless of fluctuations.
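The logical layer underneath such models is a Boolean network: each gene's next state is a Boolean function of the current states. A toy sketch (the paper's framework updates genes asynchronously with random durations; this synchronous update and the two-gene mutual-repression example are illustrative only):

```python
def step(state, rules):
    """One synchronous update of a Boolean network: every gene's next
    value is its Boolean rule applied to the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

# Illustrative two-gene mutual repression: each gene is ON iff the other is OFF.
mutual_repression = {
    'A': lambda s: not s['B'],
    'B': lambda s: not s['A'],
}
```

Under this rule set, {A: ON, B: OFF} is a fixed point, while {A: ON, B: ON} flips to {A: OFF, B: OFF}; which attractor is reached in the stochastic-duration setting depends on update timing, which is the paper's focus.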
Importance sampling for portfolio credit risk
 Management Science
Abstract

Cited by 67 (7 self)
Monte Carlo simulation is widely used to measure the credit risk in portfolios of loans, corporate bonds, and other instruments subject to possible default. The accurate measurement of credit risk is often a rare-event simulation problem because default probabilities are low for highly rated obligors and because risk management is particularly concerned with rare but significant losses resulting from a large number of defaults. This makes importance sampling (IS) potentially attractive. But the application of IS is complicated by the mechanisms used to model dependence between obligors, and capturing this dependence is essential to a portfolio view of credit risk. This paper provides an IS procedure for the widely used normal copula model of portfolio credit risk. The procedure has two parts: one applies IS conditional on a set of common factors affecting multiple obligors; the other applies IS to the factors themselves. The relative importance of the two parts is determined by the strength of the dependence between obligors. We provide both theoretical and numerical support for the method.
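The core IS idea, applied here to the common factors, is a mean shift with likelihood-ratio reweighting. A minimal one-dimensional sketch, estimating a Gaussian tail probability rather than a portfolio loss (the function name and the choice of shifting the mean to the threshold are illustrative assumptions, not the paper's exact procedure):

```python
import math
import random

def is_tail_prob(threshold, n=20000, seed=1):
    """Estimate P(Z > threshold) for standard normal Z by sampling from
    N(threshold, 1) and reweighting each hit by the likelihood ratio
    phi(x) / phi(x - threshold) = exp(-threshold*x + threshold**2 / 2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # importance distribution N(threshold, 1)
        if x > threshold:
            total += math.exp(-threshold * x + threshold ** 2 / 2.0)
    return total / n
```

Plain Monte Carlo would need millions of samples to see events at the 3e-5 level; under the shifted measure roughly half the samples land in the tail, and the weights correct the bias.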
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
 Journal of Parallel and Distributed Computing
, 2001
Abstract

Cited by 48 (0 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters. Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.
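For context on the "optimal checkpointing interval" that prior models focus on, the classical first-order answer is Young's approximation, which is far simpler than the paper's Markov-chain model but shows the shape of the trade-off:

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation to the optimal checkpoint
    interval: minimizing overhead(tau) = C/tau + tau/(2*MTBF), the cost
    of checkpoints plus the expected rework lost to failures, gives
    tau* = sqrt(2 * C * MTBF). (A simpler model than the paper's.)"""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

For example, a 50-second checkpoint cost and a 10,000-second MTBF give an interval of 1,000 seconds; checkpointing much more or much less often raises the overhead on both sides of this optimum.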
Piecewise-linear diffusion processes
 Advances in Queueing
, 1995
Abstract

Cited by 45 (10 self)
Diffusion processes are often regarded as among the more abstruse stochastic processes, but they are actually relatively elementary, and thus are natural first candidates to consider in queueing applications. To help demonstrate the advantages of diffusion processes, we show that there is a large class of one-dimensional diffusion processes for which it is possible to give convenient explicit expressions for the steady-state distribution, without writing down any partial differential equations or performing any numerical integration. We call these tractable diffusion processes piecewise-linear: the drift function is piecewise linear, while the diffusion coefficient is piecewise constant. The explicit expressions for steady-state distributions in turn yield explicit expressions for long-run average costs in optimization problems, which can be analyzed with the aid of symbolic mathematics packages. Since diffusion processes have continuous sample paths, approximation is required when they are used to model discrete-valued processes. We also discuss strategies for performing this approximation, and we investigate when the approximation is good for the steady-state distribution of birth-and-death processes. We show that the diffusion approximation tends to be good when the differences between the birth and death rates are small compared to the death rates.
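The steady-state density in question is, up to normalization, pi(x) ∝ exp(2 ∫ mu(u)/sigma²(u) du) / sigma²(x) for drift mu and diffusion coefficient sigma². A generic numeric sketch of this formula (the paper derives closed forms for the piecewise-linear case, so no numerics are needed there; this grid-based evaluation and its function names are illustrative):

```python
import math

def steady_state_density(mu, sigma2, xs):
    """Unnormalised steady-state density pi(x) of a one-dimensional
    diffusion, pi(x) ∝ exp(2 * int_{xs[0]}^{x} mu(u)/sigma2(u) du) / sigma2(x),
    evaluated on the grid xs by trapezoidal integration."""
    vals, integral = [], 0.0
    prev_x, prev_f = xs[0], mu(xs[0]) / sigma2(xs[0])
    for x in xs:
        f = mu(x) / sigma2(x)
        integral += 0.5 * (f + prev_f) * (x - prev_x)  # zero on the first point
        vals.append(math.exp(2.0 * integral) / sigma2(x))
        prev_x, prev_f = x, f
    return vals
```

As a sanity check, constant drift −1 and diffusion coefficient 2 (reflected Brownian motion on [0, ∞)) give the exponential density with rate 2·1/2 = 1, so the density at x = 1 is e^{-1} times its value at 0.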
Nonparametric estimation in random coefficients binary choice models
 Unpublished manuscript
, 2010
Abstract

Cited by 29 (3 self)
This paper considers random coefficients binary choice models. The main goal is to estimate the density of the random coefficients nonparametrically. This is an ill-posed inverse problem characterized by an integral transform. A new density estimator for the random coefficients is developed, utilizing Fourier–Laplace series on spheres. This approach offers clear insight into the identification problem. More importantly, it leads to a closed-form estimator formula that yields a simple plug-in procedure requiring no numerical optimization. The new estimator is therefore easy to implement in empirical applications, while being flexible about the treatment of unobserved heterogeneity. Extensions, including treatments of non-random coefficients and models with endogeneity, are discussed.
Proactive algorithms for job shop scheduling with probabilistic durations
 Journal of Artificial Intelligence Research
Abstract

Cited by 24 (2 self)
Most classical scheduling formulations assume a fixed and known duration for each activity. In this paper, we weaken this assumption, requiring instead only that each duration can be represented by an independent random variable with a known mean and variance. The best solutions are ones which have a high probability of achieving a good makespan. We first create a theoretical framework, formally showing how Monte Carlo simulation can be combined with deterministic scheduling algorithms to solve this problem. We propose an associated deterministic scheduling problem whose solution is proved, under certain conditions, to be a lower bound for the probabilistic problem. We then propose and investigate a number of techniques for solving such problems based on combinations of Monte Carlo simulation, solutions to the associated deterministic problem, and either constraint programming or tabu search. Our empirical results demonstrate that a combination of the associated deterministic problem and Monte Carlo simulation yields algorithms that scale best both in terms of problem size and uncertainty. Further experiments point to the correlation between the quality of the deterministic solution and the quality of the probabilistic solution as a major factor responsible for this success.
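The Monte Carlo evaluation step can be sketched for a fixed precedence graph: sample activity durations, propagate finish times through the predecessors, and collect the makespan distribution. This sketch assumes exponential durations for concreteness (the paper requires only known mean and variance), and the function name and data layout are illustrative:

```python
import random

def simulate_makespan(dur_means, preds, n=5000, seed=0):
    """Monte Carlo makespan evaluation for a fixed activity ordering:
    sample each activity's duration (exponential with the given mean),
    start it when all its predecessors have finished, and return the
    sampled makespans."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        finish = {}
        for a in dur_means:  # dict preserves insertion (topological) order
            start = max((finish[p] for p in preds.get(a, [])), default=0.0)
            finish[a] = start + rng.expovariate(1.0 / dur_means[a])
        samples.append(max(finish.values()))
    return samples
```

From the samples one can estimate, e.g., the probability that the makespan stays under a deadline, which is the quality measure the paper optimizes.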
Model-based diagnostics and probabilistic assumption-based reasoning
 Artificial Intelligence
, 1998
Abstract

Cited by 23 (17 self)
The mathematical foundations of model-based diagnostics, or diagnosis from first principles, were laid by Reiter [31]. In this paper we extend Reiter’s ideas of model-based diagnostics by introducing probabilities into Reiter’s framework. This is done in a mathematically sound and precise way which allows one to compute the posterior probability that a certain component is not working correctly given some observations of the system. A straightforward computation of these probabilities is not efficient, and in this paper we propose a new method to solve this problem. Our method is logic-based and borrows ideas from assumption-based reasoning and the ATMS. We show how it is possible to determine arguments in favor of the hypothesis that a certain group of components is not working correctly. These arguments represent the symbolic or qualitative aspect of the diagnosis process. They are then used to derive a quantitative or numerical aspect represented by the posterior probabilities. Using two new theorems about the relation between Reiter’s notion of conflict and our notion of argument, we prove that our so-called degree of support is nothing but the posterior probability that we are looking for. Furthermore, a model where each component may have more than two different operating modes is discussed, and a new algorithm to compute posterior probabilities in this case is presented. Key words: Model-based diagnostics; Assumption-based reasoning; ATMS
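The symbolic layer this builds on is Reiter's: a diagnosis is a minimal hitting set of the conflict sets. A brute-force sketch for small systems (the enumeration strategy and function name are illustrative; Reiter's HS-tree and the paper's probabilistic layer are more refined):

```python
from itertools import combinations

def diagnoses(components, conflicts):
    """Reiter-style diagnosis sketch: enumerate candidate component sets
    by increasing size and keep those that hit every conflict set and
    contain no previously found (hence smaller) diagnosis."""
    found = []
    for r in range(len(components) + 1):
        for cand in combinations(sorted(components), r):
            s = set(cand)
            hits_all = all(s & set(c) for c in conflicts)
            is_minimal = not any(set(f) <= s for f in found)
            if hits_all and is_minimal:
                found.append(cand)
    return found
```

For conflicts {A, B} and {B, C}, the minimal diagnoses are {B} and {A, C}: either B alone is faulty, or both A and C are.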