Estimating the Support of a High-Dimensional Distribution
, 1999
Cited by 510 (32 self)
Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori specified value ν between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a preliminary theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled d...
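As a rough illustration of the kind of decision function described above, the sketch below scores a point with a Gaussian kernel expansion over the training data and treats it as lying outside the estimated set when the score falls below a threshold. It is not the quadratic program from the abstract: the expansion coefficients are simply uniform, and the threshold `rho` is set to an empirical quantile of the training scores so that roughly a fraction `nu` of the training points falls outside. The names `gaussian_kernel`, `score`, `fit_support`, and the bandwidth `gamma` are assumptions made for this example.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def score(x, train, gamma=0.5):
    """Uniform-weight kernel expansion: mean kernel value against the training set."""
    return sum(gaussian_kernel(x, t, gamma) for t in train) / len(train)

def fit_support(train, nu=0.1, gamma=0.5):
    """Return f with f(x) >= 0 inside the estimated set S and f(x) < 0 outside.

    Assumption: rho is the empirical nu-quantile of the training scores,
    a simplification that stands in for solving the quadratic program.
    """
    scores = sorted(score(t, train, gamma) for t in train)
    k = int(nu * len(train))  # number of training points allowed outside S
    rho = scores[k]
    return lambda x: score(x, train, gamma) - rho

train = [(0.1 * i, 0.0) for i in range(-10, 11)]  # 21 points on a segment
f = fit_support(train, nu=0.1)
inside = f((0.0, 0.0)) >= 0    # centre of the data
outside = f((5.0, 5.0)) >= 0   # far from every training point
frac_out = sum(f(t) < 0 for t in train) / len(train)
```

With uniform coefficients every training point contributes to the expansion, whereas the method in the abstract is described as using a potentially small subset of the training data.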
Smooth Discrimination Analysis
 Ann. Statist
, 1998
Cited by 103 (3 self)
Discriminant analysis for two data sets in R^d with probability densities f and g can be based on the estimation of the set G = {x : f(x) ≥ g(x)}. We consider applications where it is appropriate to assume that the region G has a smooth boundary. In particular, this assumption makes sense if discriminant analysis is used as a data analytic tool. We discuss optimal rates for estimation of G. 1991 AMS: primary 62G05, secondary 62G20. Keywords and phrases: discrimination analysis, minimax rates, Bayes risk. Short title: Smooth discrimination analysis. This research was supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 373 "Quantifikation und Simulation ökonomischer Prozesse", Humboldt-Universität zu Berlin. 1 Introduction. Assume that one observes two independent samples X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_m) of R^d-valued i.i.d. observations with densities f or g, respectively. The densities f and g are unknown. An additional random variabl...
A classification framework for anomaly detection
 J. Machine Learning Research
, 2005
Cited by 47 (6 self)
One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data-generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Based on the above interpretation we then propose a support vector machine (SVM) for anomaly detection. Finally, we establish universal consistency for this SVM and report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM.
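The idea of scoring an anomaly detector by a classification risk can be sketched as follows: test points from the data distribution are treated as the positive class, points drawn from a reference distribution as the negative class, and the detector is charged for each misclassification. The weights `1/(1+rho)` and `rho/(1+rho)`, the uniform reference sample, and the two threshold detectors are assumptions chosen for this illustration rather than details taken from the abstract.

```python
import random

def empirical_classification_risk(detector, data, reference, rho):
    """Weighted misclassification rate of a detector (True = normal, False = anomaly).

    data      : points drawn from the data distribution, treated as label +1
    reference : points drawn from the reference measure, treated as label -1
    rho       : density level; the weights 1/(1+rho), rho/(1+rho) are an assumption
    """
    miss_data = sum(not detector(x) for x in data) / len(data)
    miss_ref = sum(detector(x) for x in reference) / len(reference)
    return miss_data / (1 + rho) + rho * miss_ref / (1 + rho)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
reference = [rng.uniform(-6.0, 6.0) for _ in range(2000)]

def tight(x):
    """Accept only points within two standard deviations of the mean."""
    return abs(x) < 2.0

def loose(x):
    """Accept essentially everything in the sampled range."""
    return abs(x) < 6.0

risk_tight = empirical_classification_risk(tight, data, reference, rho=1.0)
risk_loose = empirical_classification_risk(loose, data, reference, rho=1.0)
```

Because both risks are computed on the same held-out samples, two detectors can be ranked directly; here the detector that accepts nearly everything should come out with the higher risk.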
Learning minimum volume sets
 J. Machine Learning Res
, 2006
Cited by 27 (9 self)
Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.
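A minimal sketch of a histogram-based estimate, assuming one-dimensional data, Lebesgue measure on an interval as the reference measure, and equal-width cells: among unions of cells, the smallest-volume set with empirical mass at least `alpha` is obtained by taking the fullest cells first. The function name `histogram_mv_set` and its parameters are hypothetical, and the tolerance and complexity-penalty terms that a full empirical or structural risk minimization rule would include are left out.

```python
from collections import Counter

def histogram_mv_set(samples, alpha, lo=0.0, hi=1.0, bins=10):
    """Pick equal-width cells on [lo, hi) with empirical mass >= alpha.

    With equal-volume cells, adding the fullest cells first uses the fewest
    cells and therefore the smallest total volume.
    Returns (sorted cell indices, empirical mass covered).
    """
    width = (hi - lo) / bins
    counts = Counter()
    for x in samples:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), bins - 1)] += 1
    n = len(samples)
    chosen, covered = [], 0
    for cell, c in counts.most_common():  # cells in order of decreasing count
        if covered >= alpha * n:
            break
        chosen.append(cell)
        covered += c
    return sorted(chosen), covered / n

samples = [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.52, 0.55, 0.58, 0.05]
cells, mass = histogram_mv_set(samples, alpha=0.8)
```

In this toy sample the two fullest cells already cover the requested mass, so the sparse cell containing the isolated point is left out of the estimated set.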
Fast learning rates in statistical inference through aggregation
 Submitted to the Annals of Statistics
, 2008
Cited by 23 (5 self)
We develop minimax optimal risk bounds for the general learning task consisting in predicting as well as the best function in a reference set G up to the smallest possible additive term, called the convergence rate. When the reference set is finite and when n denotes the size of the training data, we provide minimax convergence rates of the form C (log|G| / n)^v with tight evaluation of the positive constant C and with exact 0 < v ≤ 1, the latter value depending on the convexity of the loss function and on the level of noise in the output distribution. The risk upper bounds are based on a sequential randomized algorithm, which at each step concentrates on functions having both low risk and low variance with respect to the previous step prediction function. Our analysis puts forward the links between the probabilistic and worst-case viewpoints, and allows one to obtain risk bounds unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the prediction function space instead of using them with respect to the distribution generating the data. The risk lower bounds are based on refinements of the Assouad lemma taking particularly into account the properties of the loss function. Our key example to illustrate the upper and lower bounds is to consider the L_q-regression setting for which an exhaustive analysis of the convergence rates is given while q ranges in [1; +∞[.
How to compare different loss functions and their risks
, 2006
Cited by 16 (2 self)
Many learning problems are described by a risk functional which in turn is defined by a loss function, and a straightforward and widely known approach to learn such problems is to minimize a (modified) empirical version of this risk functional. However, in many cases this approach suffers from substantial problems such as computational requirements in classification or robustness concerns in regression. In order to resolve these issues many successful learning algorithms try to minimize a (modified) empirical risk of a surrogate loss function, instead. Of course, such a surrogate loss must be “reasonably related” to the original loss function since otherwise this approach cannot work well. For classification good surrogate loss functions have been recently identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described. However, beyond the classification problem little is known on good surrogate loss functions up to now. In this work we establish a general theory that provides powerful tools for comparing excess risks of different loss functions. We then apply this theory to several learning problems including (cost-sensitive) classification, regression, density estimation, and density level detection.
Generalization error bounds in semi-supervised classification under the cluster assumption
, 2007
How to Divide a Territory? A New Simple Differential Formalism for Optimization of Set Functions
 International Journal of Intelligent Systems
, 1999
Cited by 12 (8 self)
In many practical problems, we must optimize a set function, i.e., find a set A for which f(A) → max, where f is a function defined on the class of sets. Such problems appear in design, in image processing, in game theory, etc. Most optimization problems can be solved (or at least simplified) by using the fact that small deviations from an optimal solution can only decrease the value of the objective function; as a result, some derivative must be equal to 0. This approach has been successfully used, e.g., for set functions in which the desired set A is a shape, i.e., a smooth (or piecewise smooth) surface. In some real-life problems, in particular, in the territorial division problem, the existing methods are not directly applicable. For such problems, we design a new simple differential formalism for optimizing set functions. 1 Introduction: Optimization of Set Functions is a Practically Important but Difficult Problem. Optimization is important. In most application problems, we h...
Estimating the Number of Clusters
, 2000
Cited by 10 (0 self)
Hartigan (1975) defines the number q of clusters in a d-variate statistical population as the number of connected components of the set {f>c}, where f denotes the underlying density function on R^d and c is a given constant. Some usual cluster algorithms treat q as an input which must be given in advance. The authors propose a method for estimating this parameter which is based on the computation of the number of connected components of an estimate of {f>c}. This set estimator is constructed as a union of balls with centres at an appropriate subsample which is selected via a nonparametric density estimator of f. The asymptotic behaviour of the proposed method is analyzed. A simulation study and an example with real data are also included.
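The construction described above can be sketched in one dimension: keep the sample points whose estimated density exceeds c, place a ball of fixed radius around each, and count the connected components of the union. The Gaussian kernel density estimate, the bandwidth `h`, the ball `radius`, and the function names are assumptions for this illustration; the abstract does not specify these choices, and the setting there is d-variate.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x for one-dimensional data."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / norm

def count_clusters(data, c, h, radius):
    """Count connected components of the union of balls around high-density points."""
    centres = [x for x in data if kde(x, data, h) > c]
    parent = list(range(len(centres)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centres)):
        for j in range(i + 1, len(centres)):
            # Two balls of equal radius intersect when their centres
            # are at most 2 * radius apart.
            if abs(centres[i] - centres[j]) <= 2 * radius:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(centres))})

data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 20.0]
q_hat = count_clusters(data, c=0.2, h=0.3, radius=0.2)
```

In this toy sample the isolated point at 20.0 has an estimated density below c, so it contributes no ball and the two dense groups remain as the only components.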