Results 1-10 of 666
Kernel-Based Object Tracking
, 2003
Abstract
Cited by 900 (4 self)
A new approach toward target representation and localization, the central component in visual tracking of nonrigid objects, is proposed. The feature histogram-based target representations are regularized by spatial masking with an isotropic kernel. The masking induces spatially-smooth similarity functions suitable for gradient-based optimization; hence, the target localization problem can be formulated using the basin of attraction of the local maxima. We employ a metric derived from the Bhattacharyya coefficient as similarity measure, and use the mean shift procedure to perform the optimization. In the presented tracking examples the new method successfully coped with camera motion, partial occlusions, clutter, and target scale variations. Integration with motion filters and data association techniques is also discussed. We describe only a few of the potential applications: exploitation of background information, Kalman tracking using motion models, and face tracking. Keywords: nonrigid object tracking; target localization and representation; spatially-smooth similarity function; Bhattacharyya coefficient; face tracking.
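The mechanics behind this abstract can be illustrated with a toy sketch: a normalized feature histogram for the target model q and the candidate p, the Bhattacharyya coefficient as similarity, and one mean-shift step driven by the weights sqrt(q/p). Everything here (1-D positions, two bins, a uniform kernel in place of the paper's isotropic kernel, made-up data) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def histogram(bins_of, n_bins):
    # Normalized feature histogram from per-sample bin indices.
    h = np.bincount(bins_of, minlength=n_bins).astype(float)
    return h / h.sum()

def bhattacharyya(p, q):
    # Similarity of two discrete distributions; equals 1 iff identical.
    return float(np.sum(np.sqrt(p * q)))

def mean_shift_step(positions, bins_of, p, q):
    # One localization step: samples whose feature bins are under-represented
    # in the candidate p relative to the target model q pull the window harder.
    w = np.sqrt(q[bins_of] / np.maximum(p[bins_of], 1e-12))
    return float(np.sum(w * positions) / np.sum(w))

# Target model q favours bin 1; only the sample at position 3.0 carries it.
q = np.array([0.1, 0.9])
positions = np.array([0.0, 1.0, 2.0, 3.0])
bins_of = np.array([0, 0, 0, 1])
p = histogram(bins_of, 2)
rho = bhattacharyya(p, q)                           # similarity before the step
center = mean_shift_step(positions, bins_of, p, q)  # shifts toward 3.0
```

The update moves the window center toward the samples carrying the target's colors, which is exactly the gradient-ascent behavior the smooth similarity function makes possible.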
Toward Optimal Active Learning through Sampling Estimation of Error Reduction
 In Proc. 18th International Conf. on Machine Learning
, 2001
Abstract
Cited by 353 (2 self)
This paper presents an active learning method that directly optimizes expected future error. This is in contrast to many other popular techniques that instead aim to reduce version space size. These other methods are popular because for many learning models, closed-form calculation of the expected future error is intractable. Our approach is made feasible by taking a sampling approach to estimating the expected reduction in error due to the labeling of a query. In experimental results on two real-world data sets we reach high accuracy very quickly, sometimes with four times fewer labeled examples than competing methods.
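The sampling estimate described above can be sketched as follows: for each candidate query and each possible label (weighted by the current posterior), retrain and measure the estimated error on the remaining unlabeled pool, then query the candidate that minimizes the expectation. The classifier here is a hypothetical soft nearest-centroid model on 1-D data, chosen only to keep the sketch self-contained; the paper uses other learners.

```python
import numpy as np

def posterior(X_lab, y_lab, X):
    # Hypothetical soft classifier: softmax over negative distances to the
    # centroid of each labeled class.
    classes = np.unique(y_lab)
    d = np.stack([np.abs(X - X_lab[y_lab == c].mean()) for c in classes], axis=1)
    w = np.exp(-d)
    return classes, w / w.sum(axis=1, keepdims=True)

def expected_error(X_lab, y_lab, pool):
    # Sampling estimate of future error: mean (1 - max posterior) over the pool.
    _, p = posterior(X_lab, y_lab, pool)
    return float(np.mean(1.0 - p.max(axis=1)))

def select_query(X_lab, y_lab, pool):
    # For each candidate x and each label y (weighted by the current posterior),
    # retrain and score the remaining pool; query the candidate that yields the
    # lowest expected error after labeling.
    classes, p_now = posterior(X_lab, y_lab, pool)
    scores = []
    for i, x in enumerate(pool):
        rest = np.delete(pool, i)
        s = sum(p_now[i, k] * expected_error(np.append(X_lab, x),
                                             np.append(y_lab, c), rest)
                for k, c in enumerate(classes))
        scores.append(s)
    return int(np.argmin(scores))

X_lab, y_lab = np.array([0.0, 10.0]), np.array([0, 1])
pool = np.array([1.0, 5.0, 9.0])
best = select_query(X_lab, y_lab, pool)   # picks the ambiguous midpoint
```

Note how the selected point is the one whose label most reduces uncertainty over the rest of the pool, not merely the point the current model is least sure about.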
Measures of Distributional Similarity
 In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics
, 1999
Abstract
Cited by 297 (2 self)
We study distributional similarity measures for the purpose of improving probability estimation for unseen co-occurrences. Our contributions are threefold: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
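Two representative measures from this family can be sketched directly: the Jensen-Shannon divergence, and an alpha-skewed variant of the KL divergence that stays finite even when the two distributions have mismatched supports. The toy distributions below are assumptions for illustration.

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q); assumes q > 0 wherever p > 0.
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def jensen_shannon(p, q):
    # Symmetric and always finite: average divergence to the mixture.
    mid = 0.5 * (p + q)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def skew_divergence(q, r, alpha=0.99):
    # KL of r against q smoothed with a little of r itself, so the value
    # stays finite even when the supports of q and r differ.
    return kl(r, alpha * q + (1 - alpha) * r)

# Toy distributions with mismatched supports, where plain KL(r || q) diverges.
p = np.array([0.7, 0.3, 0.0])
q = np.array([0.0, 0.3, 0.7])
js = jensen_shannon(p, q)
sd = skew_divergence(q, p)
```

The support mismatch is exactly the situation that arises with unseen co-occurrences, which is why measures that remain finite there matter for this task.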
An Information Statistics Approach to Data Stream and Communication Complexity
, 2003
Abstract
Cited by 228 (8 self)
We present a new method for proving strong lower bounds in communication complexity.
Weak pairwise correlations imply strongly correlated network states in a neural population.
, 2006
Abstract
Cited by 191 (4 self)
Biological networks have so many possible states that exhaustive sampling is impossible. Successful analysis thus depends on simplifying hypotheses, but experiments on many systems hint that complicated, higher-order interactions among large groups of elements have an important role. Here we show, in the vertebrate retina, that weak correlations between pairs of neurons coexist with strongly collective behaviour in the responses of ten or more neurons. We find that this collective behaviour is described quantitatively by models that capture the observed pairwise correlations but assume no higher-order interactions. These maximum entropy models are equivalent to Ising models, and predict that larger networks are completely dominated by correlation effects. This suggests that the neural code has associative or error-correcting properties, and we provide preliminary evidence for such behaviour. As a first test for the generality of these ideas, we show that similar results are obtained from networks of cultured cortical neurons.

Much of what we know about biological networks has been learned by studying one element at a time: recording the electrical activity of single neurons, the expression levels of single genes or the concentrations of individual metabolites. On the other hand, important aspects of biological function must be shared among many elements. Here we address these questions in the context of the vertebrate retina, where it is possible to make long, stable recordings from many neurons simultaneously as the system responds to complex, naturalistic inputs.

The scale of correlations
Throughout the nervous system, individual elements communicate by generating discrete pulses termed action potentials or spikes. The small values of the correlation coefficients suggest an approximation in which the cells are completely independent. For most pairs, this is true with a precision of a few per cent, but if we extrapolate this approximation to the whole population of 40 cells, it fails disastrously.
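The central claim, that weak pairwise couplings alone produce strongly correlated population states, can be checked exactly for a small group of model neurons, where the pairwise maximum entropy (Ising) distribution can be enumerated over all 2^N states. The field and coupling values below are illustrative assumptions, not fitted to retinal data.

```python
import itertools
import numpy as np

def ising_distribution(h, J):
    # Exact pairwise maximum entropy (Ising) distribution over all binary
    # states s in {-1, +1}^N: P(s) proportional to exp(h.s + 0.5 * s.J.s).
    # Feasible only for small N, since it enumerates 2^N states.
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    log_w = states @ h + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
    p = np.exp(log_w)
    return states, p / p.sum()

n = 5
h = np.zeros(n)                      # unbiased cells
J = np.full((n, n), 0.2)             # weak, uniform pairwise couplings
np.fill_diagonal(J, 0.0)
states, p = ising_distribution(h, J)

k = (states == 1).sum(axis=1)                 # cells active in each state
p_all_on = float(p[k == n].sum())             # P(whole group fires together)
p_one_on = float((states[:, 0] == 1) @ p)     # single-cell marginal
p_indep = p_one_on ** n                       # independent-cells prediction
# Weak couplings, strong collective effect: p_all_on far exceeds p_indep.
```

Even with couplings this small, the probability of the fully active state is several times the independent-model prediction, mirroring the extrapolation failure described above.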
Document Clustering using Word Clusters via the Information Bottleneck Method
 In ACM SIGIR 2000
, 2000
Abstract
Cited by 181 (17 self)
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x; y), we first cluster the words, Y, so that the obtained word clusters, Y_hat, maximally preserve the information on the documents. The resulting joint distribution, p(X; Y_hat), contains most of the original information about the documents, I(X; Y_hat) ~= I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20 Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
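The quantity both clustering stages try to preserve is the mutual information of the word-document joint distribution. A minimal sketch (toy 3x2 joint table, assumed for illustration) shows the property the bottleneck exploits: merging two words whose conditional distributions over documents are identical loses no information.

```python
import numpy as np

def mutual_information(pxy):
    # I(X; Y) in nats from a joint probability table p(x, y).
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask])))

def merge_rows(pxy, i, j):
    # Cluster words i and j (i < j): sum their rows of the joint table.
    merged = np.delete(pxy, j, axis=0)
    merged[i] += pxy[j]
    return merged

# Rows 0 and 1 are proportional, i.e. the two words have identical p(y|x).
pxy = np.array([[0.20, 0.05],
                [0.40, 0.10],
                [0.05, 0.20]])
before = mutual_information(pxy)
after = mutual_information(merge_rows(pxy, 0, 1))   # no information lost
```

When the conditionals differ, the same merge strictly decreases I(X; Y); good word clusters are those for which the decrease is as small as possible.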
Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool Based on Information Theory
 J. Mol. Biol
, 2002
Abstract
Cited by 147 (4 self)
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is of the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity.
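An information-theoretic column score of the kind underlying such profile-profile comparison can be sketched with the Jensen-Shannon divergence between two profile columns. This is a simplified assumption (a 4-letter toy alphabet instead of 20 amino acids, and no significance weighting or gap handling), not the tool's actual scoring function.

```python
import numpy as np

def kl2(p, q):
    # KL divergence in bits (log base 2).
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

def js_divergence(p, q):
    # Jensen-Shannon divergence between two profile columns, i.e. two
    # probability distributions over residues; bounded in [0, 1] in bits.
    mid = 0.5 * (p + q)
    return 0.5 * kl2(p, mid) + 0.5 * kl2(q, mid)

def column_similarity(p, q):
    # Simple similarity in [0, 1]: identical columns score exactly 1.
    return 1.0 - js_divergence(p, q)

# Toy 4-letter columns standing in for 20-dimensional amino-acid profiles.
col_a = np.array([0.6, 0.3, 0.1, 0.0])
col_b = np.array([0.5, 0.3, 0.1, 0.1])
sim = column_similarity(col_a, col_b)
```

Column scores of this kind can then feed a standard gapped dynamic-programming alignment, which is where the agreement with structural alignments is measured.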
A divisive information-theoretic feature clustering algorithm for text classification
 Journal of Machine Learning Research
, 2003
Abstract
Cited by 138 (15 self)
High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such “distributional clustering” of words are agglomerative in nature and result in (i) suboptimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the “within-cluster Jensen-Shannon divergence” while simultaneously maximizing the “between-cluster Jensen-Shannon divergence”. In comparison to the previously proposed agglomerative strategies our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
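The divisive reassignment step can be sketched in the k-means style the abstract describes: each word moves to the cluster whose prior-weighted mean class-conditional distribution is KL-closest, which cannot increase the weighted within-cluster objective. The toy distributions and priors below are assumptions for illustration.

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q); assumes q > 0 wherever p > 0.
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def cluster_means(pc_w, prior, assign, k):
    # Prior-weighted mean of the class-conditionals p(C|w) in each cluster.
    means = np.zeros((k, pc_w.shape[1]))
    for j in range(k):
        idx = assign == j
        means[j] = (prior[idx, None] * pc_w[idx]).sum(axis=0) / prior[idx].sum()
    return means

def objective(pc_w, prior, assign, k):
    # Weighted within-cluster divergence the algorithm monotonically decreases.
    means = cluster_means(pc_w, prior, assign, k)
    return sum(prior[i] * kl(pc_w[i], means[assign[i]])
               for i in range(len(pc_w)))

def divisive_step(pc_w, prior, assign, k):
    # One reassignment pass: each word joins the KL-closest cluster mean.
    means = cluster_means(pc_w, prior, assign, k)
    return np.array([np.argmin([kl(p, m) for m in means]) for p in pc_w])

# Four words, two classes, a deliberately poor initial assignment.
pc_w = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
prior = np.full(4, 0.25)
assign = np.array([0, 1, 0, 1])
before = objective(pc_w, prior, assign, 2)
assign = divisive_step(pc_w, prior, assign, 2)
after = objective(pc_w, prior, assign, 2)
```

One pass already groups the words by their class profiles and strictly lowers the objective, which is the monotonic-decrease property claimed for the full algorithm.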
Similarity-based models of word co-occurrence probabilities
 Machine Learning
, 1999
Abstract
Cited by 114 (0 self)
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a backoff language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against backoff and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
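The similarity-based estimate can be sketched as a weighted average of the conditional distributions of distributionally similar history words, giving an unseen bigram nonzero probability. The exponential weighting over total-variation distance here is a hypothetical stand-in for the paper's similarity functions, and the counts are invented.

```python
import numpy as np

def ml_bigram(counts):
    # Maximum-likelihood P(w2 | w1) from a bigram count matrix.
    return counts / counts.sum(axis=1, keepdims=True)

def similarity_weights(p, i, beta=5.0):
    # Hypothetical weighting: histories with closer conditional distributions
    # (smaller total-variation distance) get exponentially more weight;
    # the history itself is excluded.
    d = 0.5 * np.abs(p - p[i]).sum(axis=1)
    w = np.exp(-beta * d)
    w[i] = 0.0
    return w / w.sum()

def similarity_estimate(p, i, j):
    # P_sim(w_j | w_i): average the ML estimates of distributionally similar
    # history words -- nonzero even when count(w_i, w_j) == 0.
    return float(similarity_weights(p, i) @ p[:, j])

counts = np.array([[8.0, 2.0, 0.0],    # history 0 never precedes word 2
                   [7.0, 2.0, 1.0],    # a similar history that has seen it
                   [1.0, 1.0, 8.0]])   # a dissimilar history
p = ml_bigram(counts)
est = similarity_estimate(p, 0, 2)     # the unseen bigram gets mass > 0
```

In a backoff model this estimate would replace the unigram fallback only for bigrams with zero count, which is where the perplexity gains reported above come from.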
An algorithm for data-driven bandwidth selection
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2003
Abstract
Cited by 106 (7 self)
The analysis of a feature space that exhibits multiscale patterns often requires kernel estimation techniques with locally adaptive bandwidths, such as the variable-bandwidth mean shift. Proper selection of the kernel bandwidth is, however, a critical step for superior space analysis and partitioning. This paper presents a mean shift-based approach for local bandwidth selection in the multimodal, multivariate case. Our method is based on a fundamental property of normal distributions regarding the bias of the normalized density gradient. We demonstrate that, within the large sample approximation, the local covariance is estimated by the matrix that maximizes the magnitude of the normalized mean shift vector. Using this property, we develop a reliable algorithm which takes into account the stability of local bandwidth estimates across scales. The validity of our theoretical results is proven in various space partitioning experiments involving the variable-bandwidth mean shift. Index Terms—Variable-bandwidth mean shift, bandwidth selection, multiscale analysis, Jensen-Shannon divergence, feature space.
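A minimal sketch of the underlying mean shift iteration with per-point bandwidths (the 1-D sample-point form, with weights proportional to h^-(d+2) for d = 1): the data, the bandwidth assignment, and the starting point are all illustrative assumptions, not the paper's data-driven selection procedure itself.

```python
import numpy as np

def mean_shift_mode(x, data, h, iters=50):
    # Gaussian mean shift with per-point bandwidths h_i: repeatedly move x to
    # the weighted mean of the data, with weights
    # h_i^-3 * exp(-(x - x_i)^2 / (2 h_i^2)) for 1-D data.
    for _ in range(iters):
        w = np.exp(-0.5 * ((data - x) / h) ** 2) / h ** 3
        x = float(np.sum(w * data) / np.sum(w))
    return x

rng = np.random.default_rng(0)
# Two modes at different scales: a tight cluster near 0, a broad one near 6.
data = np.concatenate([rng.normal(0.0, 0.5, 200), rng.normal(6.0, 2.0, 200)])
h = np.where(data < 3.0, 0.5, 2.0)    # narrower bandwidth in the tight mode
mode = mean_shift_mode(1.0, data, h)  # starts between the modes, climbs left
```

Matching the bandwidth to the local scale is what lets a single procedure resolve both the tight and the broad structure; choosing those bandwidths automatically is the problem the paper addresses.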