Results 1–10 of 33
A modified finite Newton method for fast solution of large scale linear SVMs
 Journal of Machine Learning Research
, 2005
Abstract

Cited by 109 (8 self)
This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight, SMO and BSVM (e.g., 4–100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber’s loss function and the L1 loss function, and also for solving ordinal regression.
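As a rough illustration of the approach this abstract describes, the following is a minimal numpy sketch of training a linear L2-loss SVM by Newton steps over the current active set. It is a toy under stated assumptions (no line search, dense solve, invented function name `l2_svm_newton`), not the paper's optimized algorithm, which adds refinements such as careful line search and iterative solvers.

```python
import numpy as np

def l2_svm_newton(X, y, lam=1.0, max_iter=50, tol=1e-6):
    """Minimise 0.5*lam*||w||^2 + 0.5*sum_i max(0, 1 - y_i w.x_i)^2
    by repeatedly solving the Newton system restricted to the active
    set (examples with non-zero loss).  Illustrative sketch only."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = y * (X @ w)
        active = margins < 1.0                 # examples contributing loss
        Xa, ya = X[active], y[active]
        # On a fixed active set the objective is quadratic, so the
        # Newton step solves (lam*I + Xa^T Xa) w_new = Xa^T ya exactly.
        H = lam * np.eye(d) + Xa.T @ Xa
        w_new = np.linalg.solve(H, Xa.T @ ya)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

On linearly separable toy data this converges in a couple of iterations; the active set shrinks to the examples near the margin, mirroring how the method scales with large example counts.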
Feature selection using linear classifier weights: interaction with classification models
 In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR 2004)
, 2004
Abstract

Cited by 48 (1 self)
This paper explores feature scoring and selection based on weights from linear classification models. It investigates how these methods combine with various learning models. Our comparative analysis includes three learning algorithms: Naïve Bayes, Perceptron, and Support Vector Machines (SVM) in combination with three feature weighting methods: Odds Ratio, Information Gain, and weights from linear models, the linear SVM and Perceptron. Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms. The results support the conjecture that it is the sophistication of the feature weighting method rather than its apparent compatibility with the learning algorithm that improves classification performance.
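The weight-based feature scoring described above can be sketched in a few lines: train any linear model (here a plain perceptron, one of the classifiers the paper studies) and rank features by the magnitude of their learned weights. The function names and data below are illustrative, not from the paper.

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Plain perceptron with labels in {-1, +1}; one of the linear
    models whose weight vector can serve as a feature scorer."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified: update
                w += yi * xi
    return w

def select_by_weights(w, k):
    """Score feature j by |w_j| and return the indices of the
    top-k features, as in weight-based feature selection."""
    return np.argsort(-np.abs(w))[:k]
```

The same `select_by_weights` ranking applies unchanged to a linear SVM's weight vector, which is the variant the abstract reports working best.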
A two-stage linear discriminant analysis via QR-decomposition
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Abstract

Cited by 46 (0 self)
Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as image and text classification. An intrinsic limitation of classical LDA is the so-called singularity problem; that is, it fails when all scatter matrices are singular. Many LDA extensions were proposed in the past to overcome the singularity problem. Among these extensions, PCA+LDA, a two-stage method, received relatively more attention. In PCA+LDA, the LDA stage is preceded by an intermediate dimension reduction stage using Principal Component Analysis (PCA). Most previous LDA extensions are computationally expensive, and not scalable, due to the use of Singular Value Decomposition or Generalized Singular Value Decomposition. In this paper, we propose a two-stage LDA method, namely LDA/QR, which aims to overcome the singularity problem of classical LDA while achieving efficiency and scalability simultaneously. The key difference between LDA/QR and PCA+LDA lies in the first stage, where LDA/QR applies QR decomposition to a small matrix involving the class centroids, while PCA+LDA applies PCA to the total scatter matrix involving all training data points. We further justify the proposed algorithm by showing the relationship between LDA/QR and previous LDA methods. Extensive experiments on face images and text documents are presented to show the effectiveness of the proposed algorithm.
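A minimal sketch of the two-stage idea, assuming the standard LDA scatter definitions: stage one QR-factorizes the small d × k class-centroid matrix, and stage two solves a k-dimensional discriminant problem in that subspace. The regularization constant and function name are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def lda_qr(X, labels):
    """Two-stage sketch: (1) QR on the d x k centroid matrix,
    (2) a small LDA problem in the subspace spanned by Q."""
    classes = np.unique(labels)
    k = len(classes)
    # stage 1: orthonormal basis for the span of the k class centroids
    C = np.stack([X[labels == c].mean(axis=0) for c in classes], axis=1)  # d x k
    Q, _ = np.linalg.qr(C)
    Z = X @ Q                                   # data reduced to k dims
    # stage 2: between/within scatter in the reduced space
    mean = Z.mean(axis=0)
    Sb = np.zeros((k, k))
    Sw = np.zeros((k, k))
    for c in classes:
        Zc = Z[labels == c]
        mc = Zc.mean(axis=0)
        Sb += len(Zc) * np.outer(mc - mean, mc - mean)
        Sw += (Zc - mc).T @ (Zc - mc)
    eps = 1e-6                                  # illustrative regularizer
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + eps * np.eye(k), Sb))
    order = np.argsort(-vals.real)
    return Q @ vecs.real[:, order]              # d x k transformation
```

The factorized matrix in stage one is only d × k (k = number of classes), which is what makes the approach cheap compared with PCA on the full total-scatter matrix.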
IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition
 IEEE Trans. on Knowledge and Data Engineering
, 2004
Abstract

Cited by 24 (1 self)
Dimension reduction is critical for many database and data mining applications, such as efficient storage and retrieval of high-dimensional data. In the literature, a well-known dimension reduction scheme is Linear Discriminant Analysis (LDA). The common aspect of previously proposed LDA-based algorithms is the use of Singular Value Decomposition (SVD). Due to the difficulty of designing an incremental solution for the eigenvalue problem on the product of scatter matrices in LDA, there is little work on designing incremental LDA algorithms. In this paper, we propose an LDA-based incremental dimension reduction algorithm, called IDR/QR, which applies QR Decomposition rather than SVD. Unlike other LDA-based algorithms, this algorithm does not require the whole data matrix in main memory. This is desirable for large data sets. More importantly, with the insertion of new data items, the IDR/QR algorithm can constrain the computational cost by applying efficient QR-updating techniques. Finally, we evaluate the effectiveness of the IDR/QR algorithm in terms of classification accuracy on the reduced dimensional space. Our experiments on several real-world data sets reveal that the accuracy achieved by the IDR/QR algorithm is very close to the best possible accuracy achieved by other LDA-based algorithms. However, the IDR/QR algorithm has much less computational cost, especially when new data items are dynamically inserted.
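The following toy sketch only illustrates why incremental maintenance is cheap when centroids, rather than the full data matrix, drive the factorization: per-class sums are updated in O(d) per insertion, and only the small d × k centroid matrix is re-factorized. It is not the paper's QR-updating scheme; the class name and interface are invented.

```python
import numpy as np

class IncrementalCentroidQR:
    """Keep per-class sums/counts so a new example updates its class
    centroid in O(d); the QR step then touches only a d x k matrix."""

    def __init__(self, d, k):
        self.sums = np.zeros((k, d))
        self.counts = np.zeros(k, dtype=int)

    def insert(self, x, c):
        """Absorb one new example x with class label c."""
        self.sums[c] += x
        self.counts[c] += 1

    def basis(self):
        """Orthonormal basis for the current centroid span.
        Assumes every class has received at least one example."""
        centroids = (self.sums / self.counts[:, None]).T   # d x k
        Q, _ = np.linalg.qr(centroids)
        return Q                                           # d x k
```

Real QR-updating schemes avoid even this small re-factorization by applying rank-one updates to the existing factors; `scipy.linalg.qr_update` offers that primitive.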
A Review of Machine Learning Algorithms for Text-Documents Classification
 Journal of Advances In Information Technology, VOL
, 2010
Abstract

Cited by 14 (0 self)
 Add to MetaCart
With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic categorization of documents has become a key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, emails and digital libraries requires text mining, machine learning and natural language processing techniques to extract meaningful knowledge. The aim of this paper is to highlight the important techniques and methodologies that are employed in text document classification, while at the same time raising awareness of some of the interesting challenges that remain to be solved, focused mainly on text representation and machine learning techniques. This paper provides a review of the theory and methods of document classification and text mining, focusing on the existing literature.
Training text classifiers with SVM on very few positive examples
 Microsoft Research
, 2003
Abstract

Cited by 13 (0 self)
Text categorization is the problem of automatically assigning text documents into one or more categories. Typically, an amount of labelled data, positive and negative examples for a category, is available for training automatic classifiers. We are particularly concerned with text classification when the training data is highly imbalanced, i.e., the number of positive examples is very small. We show that the linear support vector machine (SVM) learning algorithm is adversely affected by imbalance in the training data. While the resulting hyperplane has a reasonable orientation, the proposed score threshold (parameter b) is too conservative. In our experiments we demonstrate that the SVM-specific cost-learning approach is not effective in dealing with imbalanced classes. We obtained better results with methods that directly modify the score threshold. We propose a method based on the conditional class distributions for SVM scores that works well when very few training examples are available to the learner.
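Direct score-threshold modification, which the abstract reports working better than cost-based learning, can be sketched as follows: pick the threshold that maximizes F1 on held-out decision scores instead of using the default offset. This is a simpler stand-in for the paper's conditional-class-distribution method, and all names here are hypothetical.

```python
import numpy as np

def tune_threshold(scores, y):
    """Return the decision-score threshold maximising F1 on held-out
    data (y in {0, 1}); a sketch of direct threshold adjustment for
    imbalanced classes, replacing the default SVM cutoff of 0."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):            # candidate thresholds
        pred = scores >= t
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == 0))
        fn = np.sum(~pred & (y == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

When all positive examples score below zero, as can happen with very few positives, the default cutoff predicts everything negative; a tuned threshold recovers them.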
Efficient multi-way text categorization via generalized discriminant analysis
 In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM 2003)
, 2003
Abstract

Cited by 11 (8 self)
Text categorization is an important research area and has been receiving much attention due to the growth of online information and the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much of the previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, the training time and scaling are also important concerns. On the other hand, other techniques naturally extensible to handle multi-class classification are generally not as accurate as SVM. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity of the data. While most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using Generalized Singular Value Decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.
Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources
 Proceedings of the Fifth ACM/IEEE Joint Conference on Digital Libraries
, 2005
Abstract

Cited by 8 (0 self)
This paper describes the development of practical automatic metadata assignment tools to support automatic record creation for virtual libraries, metadata repositories and digital libraries, with particular reference to library-standard metadata. The development process is incremental in nature, and depends upon an automatic metadata evaluation tool to objectively measure its progress. The evaluation tool is based on and informed by the metadata created and maintained by librarian experts at the INFOMINE Project, and uses different metrics to evaluate different metadata fields. In this paper, we describe the form and function of common metadata fields, and identify appropriate performance measures for these fields. The automatic metadata assignment tools in the iVia virtual library software are described, and their performance is measured. Finally, we discuss the limitations of automatic metadata evaluation, and cases where we choose to ignore its evidence in favor of human judgment.
A SURVEY OF TEXT CLASSIFICATION ALGORITHMS
Abstract

Cited by 7 (0 self)
The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities with applications in a number of diverse domains, such as target marketing, medical diagnosis, news group filtering, and document organization. In this paper we will provide a survey of a wide variety of text classification algorithms.