Diffusion Kernels on Statistical Manifolds
, 2004
Cited by 63 (5 self)
A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over Gaussian or linear kernels, which have been the standard for text classification.
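The abstract above gives no formula, so the following is only a minimal sketch of what a heat-kernel-style similarity on the multinomial simplex could look like. It assumes the standard fact that the Fisher geodesic distance between two multinomial parameter vectors is 2 arccos(sum_i sqrt(theta_i theta'_i)), and it uses a leading-order Gaussian-like approximation exp(-d^2 / 4t) with the normalizing constant dropped. The function names, the time parameter `t`, and the L1 normalization of term counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l1_normalize(counts):
    """Map a vector of non-negative term counts onto the simplex (assumed embedding)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def multinomial_geodesic(theta, theta_prime):
    """Fisher geodesic distance between two points on the multinomial simplex."""
    # Sum of sqrt(theta_i * theta'_i); clipped to guard against rounding error.
    affinity = np.sum(np.sqrt(theta * theta_prime))
    return 2.0 * np.arccos(np.clip(affinity, 0.0, 1.0))


def diffusion_kernel(theta, theta_prime, t=0.5):
    """Leading-order heat kernel approximation, unnormalized (an assumption)."""
    d = multinomial_geodesic(theta, theta_prime)
    return np.exp(-d ** 2 / (4.0 * t))


# Hypothetical term-count vectors for three short documents.
doc_a = l1_normalize([3, 1, 0, 0])
doc_b = l1_normalize([2, 2, 0, 0])
doc_c = l1_normalize([0, 0, 1, 4])

k_ab = diffusion_kernel(doc_a, doc_b)
k_ac = diffusion_kernel(doc_a, doc_c)
```

Documents with overlapping vocabulary (`doc_a`, `doc_b`) receive a higher similarity than documents with disjoint vocabulary (`doc_a`, `doc_c`), for which the geodesic distance reaches its maximum of pi.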
Composite Kernels for Hypertext Categorisation
- In Proceedings of the International Conference on Machine Learning (ICML)
, 2001
Cited by 42 (0 self)
Kernels are problem-specific functions that act as an interface between the learning system and the data. While it is well-known when the combination of two kernels is again a valid kernel, it is an open question if the resulting kernel will perform well. In particular, in which situations can a combination of kernels be expected to perform better than its components considered separately? We investigate this problem by looking at the task of designing kernels for hypertext classification, where both word and link information can be exploited. We provide sufficient conditions that indicate when an improvement can be expected, highlighting and formalising the notion of "independent kernels". Experimental results confirm the predictions of the theory in the hypertext domain.
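As a rough illustration of the well-known part of this statement (that certain combinations of kernels are again valid kernels), the sketch below forms a convex combination of two Gram matrices and checks that the result is still symmetric and positive semidefinite. The linear kernels, the random feature matrices standing in for word and link features, and the mixing weight are all assumptions for illustration; the paper's actual kernels and its notion of "independent kernels" are not reproduced here.

```python
import numpy as np


def linear_gram(features):
    """Gram matrix of the linear kernel for row-wise feature vectors."""
    features = np.asarray(features, dtype=float)
    return features @ features.T


def combine_grams(k_words, k_links, weight=0.5):
    """Convex combination of two Gram matrices; weight is assumed to lie in [0, 1]."""
    return weight * k_words + (1.0 - weight) * k_links


# Hypothetical data: 5 pages, 8 word features and 3 link features each.
rng = np.random.default_rng(0)
word_features = rng.random((5, 8))
link_features = rng.random((5, 3))

k_combined = combine_grams(
    linear_gram(word_features), linear_gram(link_features), weight=0.7
)

# A valid kernel matrix is symmetric with no negative eigenvalues
# (up to floating-point error).
min_eigenvalue = np.linalg.eigvalsh(k_combined).min()
```

Validity of the combined kernel says nothing about whether it classifies better than its components, which is the open question the abstract addresses.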
Evaluation of Simple Performance Measures for Tuning SVM Hyperparameters
- Neurocomputing
, 2003
A Scalability Analysis of Classifiers in Text Categorization
- In Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval, ACM
, 2003
Cited by 27 (1 self)
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least squares fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation of the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported as concrete examples.
Email Classification with Co-Training
, 2002
Cited by 27 (0 self)
The main problems in text classification are the lack of labeled data and the cost of labeling the unlabeled data. We address these problems by exploring co-training, an algorithm that uses unlabeled data along with a few labeled examples to boost the performance of a classifier. We experiment with co-training on the email domain. Our results show that the performance of co-training depends on the learning algorithm it uses. In particular, Support Vector Machines significantly outperform Naive Bayes on email classification.
Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data
- In Proc. of the WebDB Workshop
, 2003
Cited by 25 (3 self)
This paper investigates how to automatically classify schemaless XML data into a user-defined topic directory. The main focus is on constructing appropriate feature spaces on which a classifier operates. In addition to the usual text-based term frequency vectors, we study XML twigs and tag paths as extended features that can be combined with text term occurrences in XML elements. Moreover, we show how to leverage ontological background information, more specifically, the WordNet thesaurus, for the construction of more expressive feature spaces. For efficiency our implementation computes features incrementally and caches ontology entries. Our experiments demonstrate the improved accuracy of automatic classification based on the enhanced feature spaces.
A Statistical Learning Model of Text Classification for Support Vector Machines
- In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2001
Cited by 18 (0 self)
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of an SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
Concept Drift and the Importance of Examples
- Text Mining – Theoretical Aspects and Applications
, 2002
Cited by 10 (4 self)
For many learning tasks where data is collected over an extended period of time, its underlying distribution is likely to change. A typical example is information filtering, i.e. the adaptive classification of documents with respect to a particular user interest. Both the interest of the user and the document content change over time. A filtering system should be able to adapt to such concept changes.
The Locally Weighted Bag of Words Framework for Document Representation
Cited by 5 (1 self)
The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks.
Hyperplane Margin Classifiers on the Multinomial Manifold
- In Proc. of the 21st International Conference on Machine Learning
, 2004
Cited by 4 (1 self)
The assumptions behind linear classifiers for categorical data are examined and reformulated in the context of the multinomial manifold, the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information.

