Results 1–10 of 32
Bayesian Statistics
In WWW', Computing Science and Statistics, 1989
Cited by 32 (1 self)
Abstract
∗ Signatures are on file in the Graduate School. This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection, and the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model
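The jackknife-style model scoring described in this abstract can be sketched in a few lines. The toy below is illustrative only, not the dissertation's actual estimator: it scores a hypothetical Gaussian model by evaluating each point under parameters refit with that point left out, which reduces the optimistic bias of the in-sample log-likelihood.

```python
import math

def fit_normal(data):
    """Maximum-likelihood (mean, std) estimates for a toy Gaussian model."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

def log_lik_normal(data, mu, sigma):
    """Total Gaussian log-likelihood of `data` under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def jackknife_score(data):
    """Leave-one-out (jackknife-style) model score: each point is evaluated
    under parameters estimated without it, reducing the optimistic bias of
    the in-sample fit as an estimate of out-of-sample log-likelihood."""
    return sum(
        log_lik_normal([data[i]], *fit_normal(data[:i] + data[i + 1:]))
        for i in range(len(data)))
```

A model whose jackknife score falls well below its in-sample log-likelihood is overfitting; comparing such bias-reduced scores across candidate models approximates choosing the model of minimum Kullback-Leibler distance to the truth.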
An Overview of Heterogeneous High Performance and Grid Computing
In Engineering the Grid, 2006
Cited by 12 (2 self)
Abstract
This paper is an overview of the ongoing academic research, development, and uses of heterogeneous parallel and distributed computing. This work is placed in the context of scientific computing. The simulation of very large systems often requires computational capabilities which cannot be satisfied by a single processing system. A possible way to solve this problem is to couple different computational resources, perhaps distributed geographically. Heterogeneous distributed computing is a means to overcome the limitations of single computing systems.
The large-scale structure of the Universe
2006
Cited by 7 (0 self)
Abstract
Research over the past 25 years has led to the view that the rich tapestry of present-day cosmic structure arose during the first instants of creation, where weak ripples were imposed on the otherwise uniform and rapidly expanding primordial soup. Over 14 billion years of evolution, these ripples have been amplified to enormous proportions by gravitational forces, producing ever-growing concentrations of dark matter in which ordinary gases cool, condense and fragment to make galaxies. This process can be faithfully mimicked in large computer simulations, and tested by observations that probe the history of the Universe starting from just 400,000 years after the Big Bang. The past two and a half decades have seen enormous advances in the study of cosmic structure, both in our knowledge of how it is manifest in the large-scale matter distribution, and in our understanding of its origin. A new generation of galaxy surveys – the 2-degree Field Galaxy Redshift Survey (2dFGRS) and the Sloan Digital Sky Survey (SDSS) – have quantified the distribution of galaxies in the local Universe with a level of detail and on length scales that were unthinkable just a few years ago. Surveys of quasar absorption and of gravitational lensing have produced qualitatively new data on the distributions of diffuse
A distributed kernel summation framework for general-dimension machine learning
In SIAM International Conference on Data Mining, 2012
Cited by 6 (2 self)
Abstract
Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide the first distributed implementation of a kernel summation framework that can utilize: 1) various types of deterministic and probabilistic approximations that may be suitable for low- and high-dimensional problems with a large number of data points; 2) any multidimensional binary tree using both distributed memory and shared memory parallelism; 3) a dynamic load balancing scheme to adjust work imbalances during the computation. Our hybrid MPI/OpenMP codebase has wide applicability in providing a general framework to accelerate the computation of many popular machine learning methods. Our experiments show scalability results for kernel density estimation on a synthetic ten-dimensional dataset containing over one billion points and a subset of the Sloan Digital Sky Survey Data, up to 6,144 cores.
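The kernel summation bottleneck this abstract refers to is, in its simplest one-dimensional form, the loop below. This is a hedged illustrative sketch (hypothetical names, not the paper's API): the exact O(N·M) sum is precisely what the paper's tree-based approximations and MPI/OpenMP parallelism are designed to accelerate.

```python
import math

def gaussian_kde_naive(queries, refs, bandwidth):
    """Exact Gaussian kernel density estimate at each query point.
    Every query touches every reference point, the O(N*M) cost that
    tree-based and distributed approximations attack."""
    norm = 1.0 / (len(refs) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((q - r) / bandwidth) ** 2)
                       for r in refs)
            for q in queries]
```

A query near the mass of the reference set gets a high density estimate; one far away gets a value near zero.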
Terascale turbulence computation on BG/L using the FLASH3 code
2006
Cited by 5 (5 self)
Abstract
Understanding the nature of turbulent flows remains one of the outstanding questions in classical physics. Significant progress has been made recently using computer simulation as an aid to our understanding of the rich physics of turbulence. Here we present both the computer science and scientific features of a unique terascale simulation of a weakly compressible turbulent flow, including tracer particles. The simulation was performed on the world's fastest supercomputer as of March 2007, the Lawrence Livermore National Laboratory IBM BG/L, using version 3 of the FLASH code. FLASH3 is a modular, publicly available code, designed primarily for astrophysical simulations, which scales well to massively parallel environments. We discuss issues related to the analysis and visualization of such a massive simulation, and present initial scientific results. We also discuss the opening of the dataset and challenges related to its public release. We suggest that widespread adoption of an open dataset model of computing is likely to result in significant gains for the scientific computing community in the near future, in much the same way that the widespread adoption of open source software has produced similar gains in the last decade.
Browsing Large Scale Cheminformatics Data with Dimension Reduction
Cited by 2 (1 self)
Abstract
Visualization of large-scale, high-dimensional data is highly valuable for scientific discovery in many fields. We present PubChemBrowse, a customized visualization tool for cheminformatics research. It provides a novel 3D data point browser that displays complex properties of massive data on commodity clients. As in GIS browsers for Earth and Environment data, chemical compounds with similar properties are nearby in the browser. PubChemBrowse is built around in-house high-performance parallel MDS (Multi-Dimensional Scaling) and GTM (Generative Topographic Mapping) services and supports fast interaction with an external property database. These properties can be overlaid on the 3D mapped compound space or queried for individual points. We prototype its use with the Chem2Bio2RDF system, using the SPARQL query language to access over 20 publicly accessible bioinformatics databases. We describe our design and implementation of the integrated PubChemBrowse application and outline its use in drug discovery. The same core technologies can be used to develop similar high-dimensional browsers in other scientific areas.
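The core of an MDS service like the one described can be sketched as stress minimization. The toy below is plain gradient descent, not the authors' parallel implementation: it embeds items so that pairwise Euclidean distances approximate a given dissimilarity matrix.

```python
import math
import random

def stress(X, dist):
    """Sum of squared differences between embedded and target distances."""
    n = len(X)
    return sum((math.dist(X[i], X[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_descent(dist, dim=2, steps=2000, lr=0.02, seed=0):
    """Gradient descent on the MDS stress objective from a random start."""
    rng = random.Random(seed)
    n = len(dist)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(X[i], X[j]) or 1e-9
                g = 2.0 * (d - dist[i][j]) / d  # d/dX[i] of (d - t)^2
                for k in range(dim):
                    X[i][k] -= lr * g * (X[i][k] - X[j][k])
    return X
```

For a few hundred compounds this loop suffices; the paper's contribution is making the same objective tractable for massive collections via parallelism.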
Delineating the Citation Impact of Scientific Discoveries
Cited by 2 (1 self)
Abstract
Identifying the significance of specific concepts in the diffusion of scientific knowledge is a challenging issue concerning many theoretical and practical areas. We introduce an innovative visual analytic approach to integrate microscopic and macroscopic perspectives of a rapidly growing scientific knowledge domain. Specifically, our approach focuses on statistically unexpected phrases extracted from unstructured text of titles and abstracts at the microscopic level, in association with the magnitude and timeliness of their citation impact at the macroscopic level. The H-index, originally defined to measure individual scientists' productivity in terms of their citation profiles, is extended in two ways: 1) to papers and terms, as a means of dividing these items into two groups so as to replace the less optimal threshold-based divisions, and 2) to take into account the timeliness of the impact
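The H-index generalization mentioned in this abstract rests on the standard definition, which takes only a few lines; the paper applies the same rule to papers and terms so that items split into a high-impact core and a tail without an arbitrary threshold.

```python
def h_index(citations):
    """Largest h such that at least h items have at least h citations."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h
```

Items with at least `h` citations form the core group; the remainder form the tail, replacing a hand-picked citation-count cutoff.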
In STATISTICS
Cited by 2 (0 self)
Abstract
A recent flood of astronomical data has created much demand for sophisticated statistical and machine learning tools that can rapidly draw accurate inferences from large databases of high-dimensional data. In this Ph.D. thesis, methods for statistical inference in such databases will be proposed, studied, and applied to real data. I use methods for low-dimensional parametrization of complex, high-dimensional data that are based on the notion of preserving the connectivity of data points in the context of a Markov random walk over the data set. I show how this simple parametrization of the data can be exploited to: define appropriate prototypes for use in complex mixture models, determine data-driven eigenfunctions for accurate nonparametric regression, and find a set of suitable features to use in a statistical classifier. In this thesis, methods for each of these tasks are built up from simple principles, compared to existing methods in the literature, and applied to data from astronomical all-sky surveys. I examine several important problems in astrophysics, such as estimation of star formation history parameters for galaxies, prediction of redshifts of galaxies using photometric data, and classification of different types of supernovae based on their photometric light curves. Fast methods for high-dimensional data analysis are crucial in each of these problems because they all involve the analysis of complicated high-dimensional data in large, all-sky surveys.
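The Markov-random-walk construction underlying this parametrization can be illustrated in one step: turn pairwise Gaussian affinities into a row-stochastic transition matrix. This is a sketch of the standard construction only; the thesis builds further on eigenfunctions of such a walk.

```python
import math

def transition_matrix(points, eps):
    """One-step transition probabilities of a random walk over the data:
    Gaussian affinities between points, normalized so each row sums to 1.
    Nearby points get high transition probability, which is how the walk
    preserves the connectivity of the data set."""
    n = len(points)
    W = [[math.exp(-math.dist(points[i], points[j]) ** 2 / eps)
          for j in range(n)] for i in range(n)]
    return [[w / sum(row) for w in row] for row in W]
```

Each row of the result is a probability distribution over possible next points; close neighbours dominate, so the walk rarely jumps across gaps in the data.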
Robust Visual Mining of Data with Error Information
Cited by 1 (1 self)
Abstract
Recent results on robust density-based clustering have indicated that the uncertainty associated with the actual measurements can be exploited to locate objects that are atypical for a reason unrelated to measurement errors. In this paper, we develop a constrained robust mixture model which, in addition, is able to nonlinearly map such data for visual exploration. Our robust visual mining approach aims to combine statistically sound density-based analysis with visual presentation of the density structure, and to provide visual support for the identification and exploration of 'genuine' peculiar objects of interest that are not due to the measurement errors. In this model, exact inference is not possible despite the latent space being discretised, and we resort to employing a structured variational EM. We present results on synthetic data as well as a real application, for visualising peculiar quasars from an astrophysical survey, given photometric measurements with errors.
A Fast Algorithm for Robust Mixtures in the Presence of Measurement Errors
Cited by 1 (0 self)
Abstract
In experimental and observational sciences, detecting atypical, peculiar data from large sets of measurements has the potential of highlighting candidates of interesting new types of objects that deserve more detailed domain-specific follow-up study. However, measurement data is nearly never free of measurement errors. These errors can generate false outliers that are not truly interesting. Although many approaches exist for finding outliers, they have no means to tell to what extent the peculiarity is not simply due to measurement errors. To address this issue, we have developed a model-based approach to infer genuine outliers from multivariate data sets when measurement error information is available. This is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational expectation-maximization algorithm. Here, we further develop an algorithmic enhancement to address the scalability of this approach, in order to make it applicable to large data sets, via a K-dimensional-tree-based partitioning of the variational posterior assignments. This creates a non-trivial trade-off between a more detailed noise model that enhances the detection accuracy and a coarsened posterior representation that yields a computational speedup. Hence, we conduct extensive experimental validation to study the accuracy/speed trade-offs achievable in a variety of data conditions. We find that, at low-to-moderate error levels, a speedup factor that is at least linear in the number of data points can be achieved without significantly sacrificing the detection accuracy. The benefits of including measurement error information in the modeling are evident in all situations, and the gain roughly recovers the loss incurred by the speedup procedure in large-error conditions. We analyze and discuss in detail the characteristics of our algorithm based on results obtained on appropriately designed synthetic data experiments, and we also demonstrate its working in a real application example.
Index Terms—K-dimensional (KD) tree, measurement errors, outlier detection, robust mixture modeling, variational expectation-maximization (EM) algorithm.
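The K-dimensional-tree partitioning that drives the speedup can be sketched as a recursive median split. This is an illustrative sketch, not the paper's implementation: points are divided along cycling coordinates until each leaf is small, and per-point posterior assignments are then coarsened to per-leaf ones.

```python
def kd_partition(points, leaf_size):
    """Recursively split points at the median along cycling coordinates,
    returning the leaf buckets of a K-dimensional (KD) tree. Replacing
    per-point quantities with per-leaf quantities trades detection
    accuracy for computational speed."""
    def split(pts, depth):
        if len(pts) <= leaf_size:
            return [pts]
        axis = depth % len(pts[0])
        pts = sorted(pts, key=lambda p: p[axis])
        mid = len(pts) // 2
        return split(pts[:mid], depth + 1) + split(pts[mid:], depth + 1)
    return split(points, 0)
```

Larger `leaf_size` gives a coarser posterior representation and a bigger speedup; smaller leaves recover the detailed noise model at higher cost, which is exactly the trade-off the experiments quantify.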