Results 1–10 of 21
Diffusion Kernels on Statistical Manifolds
2004
Cited by 87 (6 self)
A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the Gaussian and linear kernels that have been standard for text classification.
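As a concrete illustration of the multinomial special case: the Fisher geodesic distance between two multinomials p and q is d(p, q) = 2 arccos Σᵢ √(pᵢqᵢ), and the diffusion kernel is approximated by substituting this distance into the Gaussian heat-kernel form. A minimal sketch, in which the function name and the diffusion-time parameter t are our own illustrative choices:

```python
import numpy as np

def diffusion_kernel(p, q, t=0.5):
    """Approximate diffusion (heat) kernel between two multinomial
    distributions p and q, i.e. points on the probability simplex.

    Uses the geodesic distance under the Fisher information metric,
    d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)),
    plugged into the Gaussian-like heat-kernel form exp(-d^2 / (4t)).
    """
    bc = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))  # Bhattacharyya coefficient
    d = 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))          # geodesic distance
    return np.exp(-d * d / (4.0 * t))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(diffusion_kernel(p, p))  # identical distributions give the maximal value 1.0
print(diffusion_kernel(p, q))  # strictly smaller for distinct distributions
```

Because the distance is geodesic rather than Euclidean, the kernel respects the curved geometry of the simplex, which is the source of the improvement over plain Gaussian kernels reported above.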
Evaluation of Simple Performance Measures for Tuning SVM Hyperparameters
Neurocomputing, 2003
Composite Kernels for Hypertext Categorisation
In Proceedings of the International Conference on Machine Learning (ICML), 2001
Cited by 49 (0 self)
Kernels are problem-specific functions that act as an interface between the learning system and the data. While it is well-known when the combination of two kernels is again a valid kernel, it is an open question whether the resulting kernel will perform well. In particular, in which situations can a combination of kernels be expected to perform better than its components considered separately? We investigate this problem by looking at the task of designing kernels for hypertext classification, where both word and link information can be exploited. We provide sufficient conditions that indicate when an improvement can be expected, highlighting and formalising the notion of "independent kernels". Experimental results confirm the predictions of the theory in the hypertext domain.
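The validity half of this story is mechanical: any non-negative weighted sum of valid (positive semi-definite) kernels is again a valid kernel; whether the combination also performs better is the question the paper studies. A hedged sketch of such a combination, where the linear base kernels and the weight alpha are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def linear_kernel(X, Y):
    """Plain inner-product kernel between the rows of X and the rows of Y."""
    return X @ Y.T

def composite_kernel(Xw, Yw, Xl, Yl, alpha=0.5):
    """Convex combination of a word-feature kernel and a link-feature kernel.

    Xw/Yw hold word features, Xl/Yl hold link features for the same documents.
    A non-negative sum of valid kernels is itself a valid kernel, so the
    resulting Gram matrix stays positive semi-definite.
    """
    return alpha * linear_kernel(Xw, Yw) + (1.0 - alpha) * linear_kernel(Xl, Yl)
```

The Gram matrix of `composite_kernel` on any document set is symmetric and positive semi-definite by construction, so it can be handed directly to any kernel method.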
A Scalability Analysis of Classifiers in Text Categorization
In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003
Cited by 38 (3 self)
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbors, ridge regression, linear least-squares fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation of the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported as concrete examples.
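The power-law ingredient of the analysis can be sketched as follows: category sizes over a large taxonomy are modeled as Zipf-like, nₖ ∝ k^(−β), and per-category costs are then summed under that distribution. The function and parameter names below are illustrative, not from the paper:

```python
import numpy as np

def power_law_sizes(num_categories, total_docs, beta=1.0):
    """Approximate per-category training-set sizes with a Zipf-like power
    law n_k proportional to k**(-beta), normalized so that the category
    sizes sum to the corpus size (the approximation used above to bound
    training complexity over a taxonomy)."""
    ranks = np.arange(1, num_categories + 1, dtype=float)
    weights = ranks ** (-beta)
    return total_docs * weights / weights.sum()

sizes = power_law_sizes(num_categories=1000, total_docs=100_000)
# sizes[0] is the largest category; sizes decay monotonically with rank.
```

Plugging such sizes into each classifier's per-category training cost is what yields the tight complexity bounds the abstract refers to.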
Email Classification with Co-Training
2002
Cited by 37 (0 self)
The main problems in text classification are the lack of labeled data and the cost of labeling unlabeled data. We address these problems by exploring co-training, an algorithm that uses unlabeled data along with a few labeled examples to boost the performance of a classifier. We experiment with co-training on the email domain. Our results show that the performance of co-training depends on the learning algorithm it uses. In particular, Support Vector Machines significantly outperform Naive Bayes on email classification.
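The co-training loop itself can be sketched in a few lines: two feature views of the same examples, one classifier per view, and each round the most confidently predicted unlabeled examples are added to the shared labeled pool. The tiny nearest-centroid classifier below is only a stand-in so the sketch runs self-contained; the paper's point is precisely that the choice of base learner (e.g. SVM vs. Naive Bayes) matters:

```python
import numpy as np

class CentroidClassifier:
    """Stand-in base learner: nearest class centroid. Any probabilistic
    classifier (Naive Bayes, SVM with calibrated outputs, ...) can be
    substituted here."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def _dists(self, X):
        return ((X[:, None, :] - self.centroids_[None]) ** 2).sum(-1)

    def predict(self, X):
        return self.classes_[self._dists(X).argmin(1)]

    def confidence(self, X):
        # Margin-style confidence: gap between the two nearest centroids.
        s = np.sort(self._dists(X), axis=1)
        return s[:, 1] - s[:, 0]

def co_train(Xa, Xb, y, labeled, rounds=5, per_round=2):
    """Co-training sketch. Xa, Xb are the two views; `labeled` is a boolean
    mask; unlabeled entries of y are ignored until a classifier fills them.
    Each round, each view's classifier self-labels its most confident
    unlabeled examples and adds them to the shared labeled pool."""
    y, labeled = y.copy(), labeled.copy()
    for _ in range(rounds):
        for X in (Xa, Xb):
            unl = np.where(~labeled)[0]
            if len(unl) == 0:
                return y, labeled
            clf = CentroidClassifier().fit(X[labeled], y[labeled])
            pick = unl[np.argsort(-clf.confidence(X[unl]))[:per_round]]
            y[pick] = clf.predict(X[pick])
            labeled[pick] = True
    return y, labeled
```

On well-separated synthetic views with only one labeled seed per class, the loop labels the whole pool correctly; on real email the outcome depends on the base learner, as the abstract reports.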
Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data
In Proceedings of the WebDB Workshop, 2003
Cited by 28 (5 self)
This paper investigates how to automatically classify schema-less XML data into a user-defined topic directory. The main focus is on constructing appropriate feature spaces on which a classifier operates. In addition to the usual text-based term frequency vectors, we study XML twigs and tag paths as extended features that can be combined with text term occurrences in XML elements. Moreover, we show how to leverage ontological background information, more specifically the WordNet thesaurus, for the construction of more expressive feature spaces. For efficiency, our implementation computes features incrementally and caches ontology entries. Our experiments demonstrate the improved accuracy of automatic classification based on the enhanced feature spaces.
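A simplified sketch of the tag-path features described above, using only the standard-library XML parser (twig features, which additionally pair sibling tags, are omitted; the function name is ours):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_path_features(xml_text):
    """Count root-to-node tag paths in an XML document; each path string
    becomes one dimension of the feature space, alongside ordinary term
    frequency features."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts

doc = "<article><title>XML classification</title><sec><p>text</p><p>more</p></sec></article>"
print(tag_path_features(doc))
# Counter({'/article/sec/p': 2, '/article': 1, '/article/title': 1, '/article/sec': 1})
```

Such path counts can simply be concatenated with the term-frequency vector of the element text before training the classifier.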
Metric Learning for Text Documents
IEEE Transactions on Pattern Analysis and Machine Intelligence
Cited by 26 (2 self)
Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family based on maximizing the inverse volume of a given data set of points. From a statistical perspective, it is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex, where the metric candidates are pullback metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the tf-idf cosine similarity measure.
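A sketch of the two ingredients, under stated assumptions: the Fisher geodesic distance on the simplex, and a pullback distance through a coordinate-wise reweighting of the simplex, F_lam(x) = (lam * x) / sum(lam * x), with lam playing a role loosely analogous to learned idf weights. The helper names are ours:

```python
import numpy as np

def fisher_geodesic(p, q):
    """Geodesic distance under the Fisher information metric on the
    multinomial simplex: d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i))."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return 2.0 * np.arccos(bc)

def pullback_distance(p, q, lam):
    """Distance after pulling the Fisher metric back through the simplex
    reweighting F_lam(x) = (lam * x) / sum(lam * x); lam acts as a vector
    of learned per-term weights."""
    fp = lam * p / np.sum(lam * p)
    fq = lam * q / np.sum(lam * q)
    return fisher_geodesic(fp, fq)
```

With lam equal to all ones the pullback reduces to the plain Fisher geodesic; learning lam from data is what deforms the geometry toward, and past, tf-idf behavior.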
A Statistical Learning Model of Text Classification for Support Vector Machines
In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001
Cited by 23 (0 self)
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of an SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can Support Vector Machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
Concept Drift and the Importance of Examples
In Text Mining – Theoretical Aspects and Applications, 2002
Cited by 11 (4 self)
For many learning tasks where data is collected over an extended period of time, its underlying distribution is likely to change. A typical example is information filtering, i.e. the adaptive classification of documents with respect to a particular user interest. Both the interest of the user and the document content change over time. A filtering system should be able to adapt to such concept changes.
The Locally Weighted Bag of Words Framework for Document Representation
Cited by 9 (1 self)
The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex, thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium- and long-range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks.
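The local-smoothing idea can be sketched as follows: for each sampled location in the (length-normalized) document, build a word histogram whose counts are weighted by a smoothing kernel over token positions, giving one simplex point per location and hence a curve through the simplex. The function name and the Gaussian smoothing kernel are illustrative choices; the framework allows other kernels:

```python
import numpy as np

def lowbow(doc_ids, vocab_size, locations, sigma=0.1):
    """Locally weighted bag of words sketch.

    doc_ids: integer word ids of the document tokens, in order.
    locations: sample points mu in [0, 1] along the normalized document.
    Returns one normalized histogram (a simplex point) per location.
    """
    n = len(doc_ids)
    positions = (np.arange(n) + 0.5) / n            # token positions in [0, 1]
    curve = []
    for mu in locations:
        w = np.exp(-((positions - mu) ** 2) / (2 * sigma ** 2))
        hist = np.zeros(vocab_size)
        np.add.at(hist, doc_ids, w)                 # kernel-weighted word counts
        curve.append(hist / hist.sum())             # normalize onto the simplex
    return np.array(curve)

ids = np.array([0, 1, 2, 1, 0])
curve = lowbow(ids, vocab_size=3, locations=[0.25, 0.75], sigma=0.2)
```

Averaging the curve over all locations recovers an ordinary (smoothed) bag of words, while the curve itself retains where in the document each word mass occurs.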