Online Feature Selection with Streaming Features
, 2012
Abstract

Cited by 8 (4 self)
We propose a new online feature selection framework for applications with streaming features, where the full feature space is unknown in advance. We define streaming features as features that flow in one by one over time while the number of training examples remains fixed. This is in contrast with traditional online learning methods, which deal only with sequentially added observations and pay little attention to streaming features. The critical challenges for online streaming feature selection include (1) the continuous growth of feature volumes over time; (2) a large feature space, possibly of unknown or infinite size; and (3) the unavailability of the entire feature set before learning starts. In this paper, we present a novel Online Streaming Feature Selection (OSFS) method to select strongly relevant and non-redundant features on the fly. An efficient Fast-OSFS algorithm is proposed to improve feature selection performance. The proposed algorithms are evaluated extensively on high-dimensional datasets and in a real-world case study on impact crater detection. Experimental results demonstrate that the algorithms achieve better compactness and higher prediction accuracy than existing streaming feature selection algorithms.
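As a loose illustration of the streaming-feature setting described above (not the paper's OSFS algorithm: the Pearson correlation measure and the `rel`/`red` thresholds below are arbitrary stand-ins for its relevance and redundancy tests), features can be screened one by one as they arrive:

```python
# Minimal sketch of streaming feature selection: features arrive one at a
# time while the sample size stays fixed. A feature is kept if its
# |correlation| with the label passes a relevance threshold; previously
# kept features made redundant by the newcomer are dropped.
import math

def corr(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def stream_select(feature_stream, y, rel=0.3, red=0.95):
    """feature_stream: iterable of (name, column); y: label column."""
    selected = {}
    for name, col in feature_stream:          # features flow in one by one
        if abs(corr(col, y)) < rel:
            continue                          # not relevant: discard on the fly
        # drop kept features that the newcomer makes redundant
        selected = {k: v for k, v in selected.items()
                    if abs(corr(v, col)) < red}
        selected[name] = col
    return sorted(selected)
```

Note that the whole feature space is never materialized: each feature is inspected once against the labels and the currently selected set, then either kept or discarded.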
Beyond Fano’s Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
Abstract

Cited by 5 (0 self)
Fano’s inequality lower-bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different costs; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper and lower bounds on the optimal F-score, BER, and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation that avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune the performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes the error rate; we derive similar optimal thresholds for F-score and BER.
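The threshold-tuning practice the abstract refers to can be sketched as follows (a generic empirical illustration, not the paper's derived bounds): sweep candidate thresholds over posterior scores and pick the F1-maximizing one, which in general differs from the error-rate-optimal 0.5:

```python
# Empirical threshold tuning on posterior scores: the F1-optimal threshold
# typically sits below 0.5 when positives are scarce, because F1 trades
# precision against recall rather than counting raw errors.

def f1_at(scores, labels, t):
    """F1 of the classifier 'predict positive iff score >= t'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_f1_threshold(scores, labels):
    # candidate thresholds: the observed scores themselves
    return max(set(scores), key=lambda t: f1_at(scores, labels, t))
```

On a small example with scores [0.1, 0.2, 0.3, 0.4, 0.6, 0.9] and labels [0, 0, 1, 0, 1, 1], the F1-optimal threshold is 0.3, not 0.5.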
A Bootstrap-Based Neyman–Pearson Test for Identifying Variable Importance
 IEEE Transactions on Neural Networks and Learning Systems
, 2014
Abstract

Cited by 4 (2 self)
Abstract—Selection of the most informative features, leading to a small loss on future data, is arguably one of the most important steps in classification, data analysis, and model selection. Several feature selection algorithms are available; however, due to the noise present in any data set, feature selection algorithms are typically accompanied by an appropriate cross-validation scheme. In this work, we propose a statistical hypothesis test derived from the Neyman–Pearson lemma for determining whether a feature is statistically relevant. The proposed approach can be applied as a wrapper to any feature selection algorithm, regardless of the selection criterion that algorithm uses, to determine whether a feature belongs in the relevant set. Perhaps more importantly, the procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.
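A minimal sketch of wrapping a feature score in a bootstrap decision (illustrative only; this does not reproduce the paper's Neyman–Pearson construction): resample the per-sample gains obtained by including a candidate feature and declare the feature relevant only if the lower bootstrap confidence bound on the mean gain exceeds zero:

```python
# Hedged sketch of a bootstrap relevance decision wrapped around an
# arbitrary feature score. diffs[i] is assumed to be the per-sample gain
# (e.g. loss reduction) from including the candidate feature.
import random

def bootstrap_relevant(diffs, alpha=0.05, n_boot=2000, seed=0):
    """True iff the one-sided lower bootstrap bound on mean(diffs) is > 0."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lower = means[int(alpha * n_boot)]   # empirical (alpha)-quantile
    return lower > 0.0
```

Because the decision is a hypothesis test rather than a raw score comparison, noise-level gains are rejected instead of accumulating spurious features.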
Informative Priors for Markov Blanket Discovery
Abstract

Cited by 3 (1 self)
We present a novel interpretation of information-theoretic feature selection as optimization of a discriminative model. We show that this formulation coincides with a group of mutual-information-based filter heuristics in the literature, and show how our probabilistic framework gives a well-founded extension for informative priors. We then derive a particular sparsity prior that recovers the well-known IAMB algorithm (Tsamardinos & Aliferis, 2003) and extend it to create a novel algorithm, IAMB-IP, that incorporates domain-knowledge priors. In empirical evaluations, we find the new algorithm improves Markov blanket recovery even under a misspecified prior in which half the prior knowledge was incorrect.
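The grow/shrink structure of the IAMB algorithm mentioned above can be sketched for discrete data as follows (a rough illustration using a fixed conditional-mutual-information threshold in place of a proper statistical independence test; the IAMB-IP prior machinery is not reproduced):

```python
# Sketch of the IAMB grow/shrink loop for discrete data.
from collections import Counter
from math import log

def cmi(xs, ys, zs):
    """Conditional mutual information I(X; Y | Z); zs are hashable tuples."""
    n = len(xs)
    pxyz = Counter(zip(xs, ys, zs))
    pxz, pyz, pz = Counter(zip(xs, zs)), Counter(zip(ys, zs)), Counter(zs)
    return sum(c / n * log(c * pz[z] / (pxz[x, z] * pyz[y, z]))
               for (x, y, z), c in pxyz.items())

def iamb(data, target, threshold=0.01):
    """data: dict name -> discrete column; returns an estimated Markov blanket."""
    y, mb = data[target], []
    while True:   # grow phase: greedily add the most informative feature
        z = [tuple(data[m][i] for m in mb) for i in range(len(y))]
        cand = [(cmi(data[f], y, z), f)
                for f in data if f != target and f not in mb]
        if not cand:
            break
        gain, best = max(cand)
        if gain < threshold:
            break
        mb.append(best)
    for f in list(mb):  # shrink phase: drop features now conditionally independent
        others = [m for m in mb if m != f]
        z = [tuple(data[m][i] for m in others) for i in range(len(y))]
        if cmi(data[f], y, z) < threshold:
            mb.remove(f)
    return sorted(mb)
```

The shrink phase is what removes features that looked informative early but became redundant once the rest of the blanket was found.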
Statistical Hypothesis Testing in Positive Unlabelled Data
Abstract

Cited by 3 (3 self)
Abstract. We propose a set of novel methodologies that enable valid statistical hypothesis testing when we have only positive and unlabelled (PU) examples. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative is sufficient for independence testing, but not for power-analysis activities; (2) a new methodology that compensates for this and enables power analysis, allowing sample-size determination for observing an effect with a desired power; and finally (3) a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest the tools will also be useful for information-theoretic feature selection and Bayesian network structure learning.
Information Theoretic Feature Selection for High Dimensional Metagenomic Data
Abstract

Cited by 2 (2 self)
Abstract—Extremely high-dimensional data sets are common in genomic classification scenarios, and they are particularly prevalent in metagenomic studies that represent samples as abundances of taxonomic units. Furthermore, the data dimensionality is typically much larger than the number of observations collected for each instance, a phenomenon known as the curse of dimensionality, which is particularly challenging for most machine learning algorithms. The biologists collecting and analyzing the data need efficient methods to determine relationships between the classes in a data set and the variables capable of differentiating between multiple groups in a study. The most common methods of metagenomic data analysis are α- and β-diversity tests; however, neither of these tests allows scientists to identify the organisms most responsible for differentiating between categories in a study. In this paper, we present an analysis of information-theoretic feature selection methods for improving classification accuracy with metagenomic data.
A semidefinite programming based search strategy for feature selection with mutual information measure
 IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract

Cited by 1 (1 self)
Abstract—Feature subset selection, as a special case of the general subset selection problem, has been the topic of a considerable number of studies due to the growing importance of data-mining applications. In the feature subset selection problem, two main issues need to be addressed: (i) finding an appropriate measure function that can be computed fast and robustly for high-dimensional data, and (ii) a search strategy that optimizes the measure over the subset space in a reasonable amount of time. In this article, the mutual information between features and class labels is taken as the measure function. Two series expansions for mutual information are proposed, and it is shown that most heuristic criteria suggested in the literature are truncated approximations of these expansions. It is well known that searching the whole subset space is an NP-hard problem. Here, instead of conventional sequential search algorithms, we suggest a parallel search strategy based on semidefinite programming (SDP) that can search the subset space in polynomial time. By exploiting the similarities between the proposed algorithm and an instance of the maximum-cut problem in graph theory, the approximation ratio of the algorithm is derived and compared with that of the backward elimination method. The experiments show that it can be misleading to judge the quality of a measure solely by classification accuracy, without taking the effect of a non-optimum search strategy into account.
Recognition of Complex Events in Open-Source Web-Scale Videos
Abstract

Cited by 1 (0 self)
Recognition of complex events in unconstrained Internet videos is a challenging research problem. In this symposium proposal, we present a systematic decomposition of complex events into hierarchical components and make an in-depth analysis of how existing research is being used to cater to the various levels of this hierarchy. We also identify three key stages where we make novel contributions, which are necessary not only to improve overall recognition performance but also to develop a richer understanding of these events. At the lowest level, our contributions include (a) compact covariance descriptors of appearance and motion features, used in a sparse coding framework to recognize realistic actions and gestures, and (b) a Lie-algebra-based representation of the dominant camera motion present in video shots, which can be used as a complementary feature for video analysis. At the next level, we propose (c) an efficient maximum-likelihood-estimate-based representation computed from low-level video features, which demonstrates state-of-the-art performance in large-scale visual concept detection. Finally, we propose to (d) model temporal interactions between concepts detected in video shots through two new discriminative feature spaces derived from linear dynamical systems, which boosts event recognition performance. In all cases, we conduct thorough experiments to demonstrate promising performance gains over some prominent approaches.
Can High-Order Dependencies Improve Mutual Information based Feature Selection?
, 2015
Abstract

Cited by 1 (0 self)
Mutual information (MI) based approaches are a popular paradigm for feature selection. Most previous methods have made use of low-dimensional MI quantities that are only effective at detecting low-order dependencies between variables. Several works have considered the use of higher-dimensional mutual information, but the theoretical underpinning of these approaches is not yet comprehensive. To fill this gap, in this paper we systematically investigate the issues involved in employing high-order dependencies for mutual-information-based feature selection. We first identify a set of assumptions under which the original high-dimensional mutual-information-based criterion can be decomposed into a set of low-dimensional MI quantities. By relaxing these assumptions, we arrive at a principled approach for constructing higher-dimensional MI-based feature selection methods that take higher-order feature interactions into account. Our extensive experimental evaluation on real data sets provides concrete evidence that methodological inclusion of high-order dependencies improves MI-based feature selection.
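The gap between low-order and high-order dependency detection is visible in the standard XOR example (a textbook illustration of the phenomenon the abstract addresses, not the paper's method): each parent of an XOR target has zero pairwise MI with it, while the two parents jointly determine it:

```python
# XOR: low-dimensional MI terms see nothing, the joint term sees everything.
from collections import Counter
from math import log

def mi(xs, ys):
    """I(X; Y) in nats for discrete sequences of hashable values."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

f1 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
y  = [a ^ b for a, b in zip(f1, f2)]   # y = f1 XOR f2

print(mi(f1, y))                 # 0.0: f1 alone looks irrelevant
print(mi(f2, y))                 # 0.0: so does f2
print(mi(list(zip(f1, f2)), y))  # ln 2: jointly they determine y
```

Any criterion built purely from pairwise terms would discard both parents here, which is exactly why higher-dimensional MI quantities are of interest.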
Parallel Feature Selection inspired by Group Testing
Abstract

Cited by 1 (0 self)
This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, to which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation-extraction task from a very large data set that has both redundant features and a sample size in the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedups when used as a preprocessing step for most other existing methods.
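The group-testing idea of randomized, independently scorable subset tests can be sketched as follows (a loose illustration only; the subset scorer is a hypothetical placeholder supplied by the caller, and the paper's recovery guarantees are not reproduced):

```python
# Sketch of group-testing-style feature ranking: score many random feature
# subsets (the "tests", which are independent and hence parallelizable),
# then rank each feature by the average score of the tests that included it.
import random

def group_test_rank(features, score_subset, n_tests=200, subset_size=3, seed=0):
    """features: list of names; score_subset: callable on a list of names."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    counts = {f: 0 for f in features}
    for _ in range(n_tests):                   # each test is independent
        subset = rng.sample(features, subset_size)
        s = score_subset(subset)               # one randomized subset test
        for f in subset:
            totals[f] += s
            counts[f] += 1
    return sorted(features,
                  key=lambda f: totals[f] / max(counts[f], 1), reverse=True)
```

Truly relevant features inflate the score of every test they appear in, so they float to the top of the aggregated ranking even though no test examines them in isolation.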