Results 1–7 of 7
A FAST MULTISCALE FRAMEWORK FOR DATA IN HIGH DIMENSIONS: MEASURE ESTIMATION, ANOMALY DETECTION, AND COMPRESSIVE MEASUREMENTS
"... Data sets are often modeled as samples from some probability distribution lying in a very high dimensional space. In practice, they tend to exhibit low intrinsic dimensionality, which enables both fast construction of efficient data representations and solving statistical tasks such as regression of ..."
Abstract

Cited by 3 (0 self)
Data sets are often modeled as samples from some probability distribution lying in a very high-dimensional space. In practice, they tend to exhibit low intrinsic dimensionality, which enables both the fast construction of efficient data representations and the solution of statistical tasks such as regression of functions on the data, or even estimation of the probability distribution from which the data are generated. In this paper we introduce a novel multiscale density estimator for high-dimensional data and apply it to the problem of detecting changes in the distribution of dynamic data, or in a time series of data sets. We also show that our data representations, which are not standard sparse linear expansions, are amenable to compressed measurements. Finally, we test our algorithms on both synthetic data and a real data set consisting of a time series of hyperspectral images, and demonstrate their high accuracy in the detection of anomalies.
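The density-based anomaly-detection idea can be illustrated with a generic sketch. This is not the paper's multiscale (GMRA-based) estimator; it substitutes a plain Gaussian kernel density estimate averaged over several bandwidths, and the `bandwidths` and `quantile` parameters are illustrative assumptions:

```python
import numpy as np

def multiscale_density_scores(train, test, bandwidths):
    """Average Gaussian kernel density estimates over several scales.

    A generic multiscale KDE stand-in, not the paper's estimator;
    `bandwidths` is a hypothetical list of scales to average over."""
    n, d = train.shape
    scores = np.zeros(len(test))
    for h in bandwidths:
        # squared distances between every test point and every train point
        sq = ((test[:, None, :] - train[None, :, :]) ** 2).sum(-1)
        dens = np.exp(-sq / (2 * h ** 2)).mean(axis=1)
        dens /= (2 * np.pi * h ** 2) ** (d / 2)  # Gaussian normalization
        scores += dens / len(bandwidths)
    return scores

def flag_anomalies(train, test, bandwidths, quantile=0.05):
    # points whose averaged density falls below a low quantile of the
    # training set's own density scores are flagged as anomalous
    ref = multiscale_density_scores(train, train, bandwidths)
    thresh = np.quantile(ref, quantile)
    return multiscale_density_scores(train, test, bandwidths) < thresh
```

A point far from the training cloud receives essentially zero density at every scale and is flagged, while points near the bulk of the data are not.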
Non-asymptotic analysis of tangent space perturbation
, 2014
"... Constructing an efficient parameterization of a large, noisy data set of points lying close to a smooth manifold in high dimension remains a fundamental problem. One approach consists in recovering a local parameterization using the local tangent plane. Principal component analysis (PCA) is often th ..."
Abstract

Cited by 1 (0 self)
Constructing an efficient parameterization of a large, noisy data set of points lying close to a smooth manifold in high dimension remains a fundamental problem. One approach consists of recovering a local parameterization using the local tangent plane. Principal component analysis (PCA) is often the tool of choice, as it returns an optimal basis in the case of noise-free samples from a linear subspace. To process noisy data samples from a nonlinear manifold, PCA must be applied locally, at a scale small enough that the manifold is approximately linear, but large enough that structure can be discerned from noise. Using eigenspace perturbation theory and non-asymptotic random matrix theory, we study the stability of the subspace estimated by PCA as a function of scale, and bound (with high probability) the angle it forms with the true tangent space. By adaptively selecting the scale that minimizes this bound, our analysis reveals an appropriate scale for local tangent plane recovery. We also introduce a geometric uncertainty principle quantifying the limits of noise–curvature perturbation for stable recovery. With the aim of providing perturbation bounds that can be used in practice, we propose plug-in estimates that make it possible to apply the theoretical results directly to real data sets.
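The local-PCA step described above can be sketched directly; this minimal version fixes the neighborhood radius by hand rather than reproducing the paper's adaptive, bound-minimizing scale selection, and the radius value is an assumption:

```python
import numpy as np

def local_pca_tangent(points, center, radius, dim=1):
    """Estimate the tangent subspace at `center` by PCA on the
    neighbors within `radius` (a simplified sketch of local PCA;
    the paper's adaptive scale selection is not reproduced here)."""
    nbrs = points[np.linalg.norm(points - center, axis=1) <= radius]
    nbrs = nbrs - nbrs.mean(axis=0)         # center the neighborhood
    _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
    return vt[:dim]                         # top principal directions

def subspace_angle(u, v):
    # largest principal angle between two orthonormal row bases
    s = np.linalg.svd(u @ v.T, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))
```

On a noisy circle, the estimated tangent at the point (1, 0) should nearly align with the vertical direction at a moderate radius; at a much smaller radius, noise dominates, and at a much larger one, curvature does — the trade-off the abstract's bound makes precise.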
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework
"... When dealing with datasets comprising highdimensional points, it is usually advantageous to discover some data structure. A fundamental information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually calle ..."
Abstract
When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some structure in the data. A fundamental piece of information needed for this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, the concept of intrinsic dimension has gained considerable attention in the scientific community over the last decades, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher-dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators.
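One representative of the estimator families such surveys cover is the maximum-likelihood estimator of Levina and Bickel, based on nearest-neighbor distances. A minimal brute-force sketch (fixed k, no averaging over a range of k, which practical implementations usually add):

```python
import numpy as np

def mle_intrinsic_dimension(X, k=10):
    """Maximum-likelihood intrinsic dimension estimate in the style
    of Levina & Bickel, using a single fixed k and brute-force
    nearest-neighbor distances; a minimal illustrative sketch."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbor
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k])  # k nearest distances T_1..T_k
    # per-point MLE: m(x) = (k-1) / sum_{j<k} log(T_k / T_j)
    logs = np.log(knn[:, -1:] / knn[:, :-1])
    m = (k - 1) / logs.sum(axis=1)
    return m.mean()
```

For data lying on a 2-dimensional plane embedded in a 5-dimensional ambient space, the estimate comes out near 2, independent of the ambient dimension.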
Office: 293 Physics Bldg.
"... Synopsis of course content We will cover some basic materials in random matrix theory (with applications to compressed sensing and signal processing), nonparametric statistical estimation and machine learning, and problems about the geometry of high dimensional data sets. • Random matrices. Basic th ..."
Abstract
Synopsis of course content: We will cover some basic material in random matrix theory (with applications to compressed sensing and signal processing), nonparametric statistical estimation and machine learning, and problems about the geometry of high-dimensional data sets.
• Random matrices. Basic theory of random matrices, following [8]: basic concentration inequalities, subgaussian random variables, singular values of random matrices. Applications: compressed sensing theory; numerical linear algebra (a.k.a. how to quickly compute highly accurate low-rank approximate singular value decompositions, with high probability [9]).
• Nonparametric estimation. Basic results in nonparametric density estimation and nonparametric regression (e.g., following the first chapter of [7]) in low dimensions. Obstructions in the high-dimensional setting; the curse of dimensionality. Applications: denoising of signals (the classic Donoho–Johnstone paper [4]) and compressed sensing results.
• Approximation theory. A primer on nonlinear approximation of functions [3], especially wavelets and other multiscale approximations. Multiscale approximation of functions in high dimensions [2]. Attacking the curse of dimensionality.
• Multiscale analysis in high dimensions. Multiscale geometric constructions in metric spaces, associated algorithms, and applications. Multiscale SVD and Geometric Multiresolution Analysis, and their applications to dictionary learning, regression, manifold learning, and compressive sensing [6, 1, 5].
• Optimal transport. A primer on optimal transport theory and Wasserstein metrics between distributions. Current research: multiscale approximation theory in the space of probability measures with respect to Wasserstein metrics.
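The randomized low-rank SVD mentioned under the random-matrices item can be sketched in a few lines, in the spirit of the Halko–Martinsson–Tropp algorithm; this minimal version omits power iterations, and the oversampling amount is an illustrative choice:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Randomized low-rank SVD sketch: project A onto the range of a
    Gaussian test matrix, then take an exact SVD of the small
    projected matrix. No power iterations (fine for rapidly
    decaying spectra; real implementations usually add them)."""
    rng = np.random.default_rng(seed)
    # sketch the range of A with a Gaussian test matrix
    Y = A @ rng.normal(size=(A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(Y)                  # orthonormal basis for the range
    # SVD of the small matrix Q^T A recovers the factors
    Ub, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :rank], S[:rank], Vt[:rank]
```

When A has exact rank r and the sketch size exceeds r, the Gaussian sketch captures the range of A with probability one and the reconstruction is exact up to floating-point error.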
Landmarking Manifolds with Gaussian Processes
"... We present an algorithm for finding landmarks along a manifold. These landmarks provide a small set of locations spaced out along the manifold such that they capture the lowdimensional nonlinear structure of the data embedded in the highdimensional space. The approach does not select points direc ..."
Abstract
We present an algorithm for finding landmarks along a manifold. These landmarks provide a small set of locations spaced out along the manifold such that they capture the low-dimensional nonlinear structure of the data embedded in the high-dimensional space. The approach does not select points directly from the dataset; instead, we optimize each landmark by moving along the continuous manifold space (as approximated by the data) according to the gradient of an objective function. We borrow ideas from active learning with Gaussian processes to define the objective, which has the property that a new landmark is "repelled" by those currently selected, allowing for exploration of the manifold. We derive a stochastic algorithm for learning with large datasets and show results on several datasets, including the Million Song Dataset and articles from the New York Times.
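The "repulsion" intuition can be illustrated with a much simpler classical stand-in, greedy farthest-point sampling, where each new landmark is the data point farthest from those already chosen. Unlike the paper's method, this selects from the data rather than moving landmarks continuously, and it has no Gaussian-process objective:

```python
import numpy as np

def farthest_point_landmarks(X, m, seed=0):
    """Greedy farthest-point sampling: each new landmark is the point
    farthest from all landmarks chosen so far. A simple stand-in for
    the paper's GP-based repulsive objective."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    # distance from every point to its nearest selected landmark
    mind = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(mind.argmax())            # most "repelled" point so far
        idx.append(nxt)
        mind = np.minimum(mind, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)
```

On points along a line segment, the first two landmarks end up far apart (near opposite ends), which is the spacing behavior landmarking aims for.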
Dictionary Learning and Non-Asymptotic Bounds for the Geometric Multi-Resolution Analysis
"... Abstract: Highdimensional data sets arising in a wide variety of applications often exhibit inherently lowdimensional structure. Detecting, measuring, and exploiting such low intrinsic dimensionality has been the focus of much research in the past decade, with implications and applications in many ..."
Abstract
High-dimensional data sets arising in a wide variety of applications often exhibit inherently low-dimensional structure. Detecting, measuring, and exploiting such low intrinsic dimensionality has been the focus of much research in the past decade, with implications and applications in many fields, including high-dimensional statistics, machine learning, and signal processing. In this vein, active and compelling research in machine learning explores the topic of manifold learning, where the low-dimensional sets manifest as an unknown manifold structure that must be learned from the sampled data. Manifold learning seems quite distinct from the comparably popular subject of dictionary learning, where the low-dimensional structure is the set of sparse (or compressible) linear combinations of vectors from a finite linear dictionary. However, Geometric Multi-Resolution Analysis (GMRA) [2] was introduced as a method for producing, in a robust multiscale fashion, an approximation to a low-dimensional manifold structure (should it exist), while simultaneously providing a dictionary for sparse representation of the data, thereby creating a connection between these two problems. In this work, we prove non-asymptotic probabilistic bounds for the GMRA approximation error under certain assumptions on the geometry of the underlying distribution. In particular, our results imply that if the data is supported near a low-dimensional manifold, the proposed sparse representations result in an error primarily dependent upon the intrinsic dimension of the manifold, and independent of the ambient dimension.
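The core idea of approximating a manifold by a union of local PCA planes can be sketched at a single scale. This is a crude stand-in, not GMRA: it uses one flat partition from randomly chosen centers instead of GMRA's multiscale tree, and the number of cells is an illustrative parameter:

```python
import numpy as np

def piecewise_pca_approximation(X, n_cells=16, dim=1, seed=0):
    """Approximate data by a union of low-dimensional affine pieces:
    assign each point to the nearest of a few centers, then project
    each cell onto its top `dim` principal directions. A crude
    single-scale stand-in for GMRA's tree of local PCA planes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_cells, replace=False)]
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    Xhat = np.empty_like(X)
    for c in range(n_cells):
        cell = X[labels == c]
        if len(cell) == 0:
            continue
        mu = cell.mean(axis=0)
        _, _, vt = np.linalg.svd(cell - mu, full_matrices=False)
        basis = vt[:dim]                    # local tangent directions
        Xhat[labels == c] = mu + (cell - mu) @ basis.T @ basis
    return Xhat
```

For points on a circle (intrinsic dimension 1) the one-dimensional pieces already give a small reconstruction error, consistent with the abstract's point that the error is governed by the intrinsic rather than the ambient dimension.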
, 2014
"... When simulating multiscale stochastic differential equations (SDEs) in highdimensions, separation of timescales and highdimensionality can make simulations expensive. The computational cost is dictated by microscale properties and interactions of many variables, while interesting behavior often oc ..."
Abstract
When simulating multiscale stochastic differential equations (SDEs) in high dimensions, separation of time scales and high dimensionality can make simulations expensive. The computational cost is dictated by microscale properties and interactions of many variables, while interesting behavior often occurs on the macroscale with few important degrees of freedom. For many problems, bridging the gap between the microscale and macroscale by direct simulation is computationally infeasible, and one would like to learn a fast macroscale simulator. In this paper we present an unsupervised learning algorithm that uses short, parallelizable microscale simulations to learn provably accurate macroscale SDE models. The learning algorithm takes as input the microscale simulator, a local distance function, and a homogenization scale. The learned macroscale model can then be used for fast computation and storage of long simulations. I will discuss various examples, both low- and high-dimensional, as well as results about the accuracy of the fast simulators we construct and their dependence on the number of short paths requested from the microscale simulator.
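A toy version of "learn a macroscale SDE from many short bursts" can be written for a one-dimensional Ornstein–Uhlenbeck model dX = −aX dt + s dW; this assumes the linear-drift form and a single time step per burst, far simpler than the thesis's general unsupervised learner:

```python
import numpy as np

def estimate_drift_diffusion(x0, x1, dt):
    """Estimate a linear drift b(x) = -a*x and constant diffusion s
    from many one-step microscale bursts (x0 -> x1 over time dt),
    assuming a 1-D Ornstein-Uhlenbeck form dX = -a X dt + s dW.
    A toy stand-in for learning a macroscale SDE from short paths."""
    dx = x1 - x0
    # least-squares regression of the increments on x0*dt gives -a
    a = -np.sum(dx * x0) / np.sum(x0 ** 2 * dt)
    # diffusion from the variance of the drift-corrected residuals
    resid = dx + a * x0 * dt
    s = np.sqrt(np.mean(resid ** 2) / dt)
    return a, s
```

With many short bursts the estimates concentrate around the true parameters, and their accuracy improves with the number of paths, mirroring the dependence on the number of short paths discussed above.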