## Large Margin Winnow Methods for Text Categorization (2000)

Citations: 5 (0 self)

### BibTeX

@MISC{Zhang00largemargin,
  author = {Tong Zhang},
  title = {Large Margin Winnow Methods for Text Categorization},
  year = {2000}
}

### Abstract

The SNoW (Sparse Network of Winnows) architecture has recently been successfully applied to a number of natural language processing (NLP) problems. In this paper, we propose large margin versions of the Winnow algorithms, which we argue can potentially enhance the performance of the basic Winnows (and hence the SNoW architecture). We demonstrate that the resulting methods achieve performance comparable with support vector machines for text categorization applications. We also explain why both large margin Winnows and SVMs can be suitable for NLP tasks.

1. INTRODUCTION Recently there has been considerable interest in applying machine learning techniques to problems in natural language processing. One method that has had great success in many applications is the SNoW architecture [5, 12]. This architecture is based on the Winnow algorithm [15], which in theory is suitable for problems with many irrelevant attributes. The success of SNoW is then attributed to the argument that typical NLP tasks ...

### Citations

3909 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...lect 10000 features, where we replace the entropy scoring in the IG criterion by the Gini-index scoring. Note that both entropy and Gini-index are useful for finding relevant attributes as described in [2]. We use the Gini-index instead of entropy mainly because we do not use stop word removal. Our experience seems to suggest that the Gini-index is more capable of picking up non-stop-words, although the d...
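The Gini-index scoring mentioned in this context can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and toy data are hypothetical, and it scores each term by the purity of the class distribution over the documents containing it, $\text{score}(t) = \sum_c P(c \mid t)^2$, keeping the top-$k$ terms (the paper keeps 10000).

```python
from collections import Counter

def gini_scores(docs, labels):
    """Score each term by the Gini purity of the class distribution
    over the documents that contain it: score(t) = sum_c P(c|t)^2.
    A score of 1.0 means the term occurs in only one class."""
    term_class = {}            # term -> Counter of class counts
    for doc, label in zip(docs, labels):
        for term in set(doc):  # presence, not frequency
            term_class.setdefault(term, Counter())[label] += 1
    scores = {}
    for term, counts in term_class.items():
        total = sum(counts.values())
        scores[term] = sum((n / total) ** 2 for n in counts.values())
    return scores

def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = gini_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Because the scoring uses document presence rather than term frequency, a term that never occurs in a stop-word-like uniform pattern across classes scores high, which matches the snippet's rationale for preferring it over entropy without stop word removal.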

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ...n to the following quadratic programming problem: $\min_w \tfrac{1}{2}\|w\|_2^2$ s.t. $w^T x_i y_i \ge 1$ for $i = 1, \ldots, n$. In reality, not every problem is linearly separable. For such problems, as proposed in [3], one can introduce a slack variable $\xi_i$ for each data point $(x_i, y_i)$ $(i = 1, \ldots, n)$, and compute a weight vector $w(C)$ that solves $\min_{w,\xi} \tfrac{1}{2} w^T w + C \sum_i \xi_i$ s.t. $w^T x_i y_i \ge 1 - \xi_i$, $\xi_i \ge 0$ ...
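The slack-variable program in this context is equivalent to unconstrained minimization of the hinge loss, $\tfrac{1}{2} w^T w + C \sum_i \max(0,\, 1 - y_i w^T x_i)$. A minimal subgradient-descent sketch of that objective (toy code under that equivalence, not the QP solver used in the paper or in [3]):

```python
import numpy as np

def softmargin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*w.w + C * sum_i max(0, 1 - y_i * w.x_i)
    by plain subgradient descent (a sketch, not a QP solver)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1                          # points with nonzero slack
        grad = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w
```

On linearly separable toy data the learned $w$ classifies every point correctly once the step size and epoch count are adequate; for a real task one would use a dedicated solver.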

1697 | Text Categorization with Support Vector Machines: Learning with Many Relevant Features
- Joachims
- 1998
Citation Context ...features are irrelevant. On the other hand, recently there have been many developments on large margin Perceptron algorithms (SVMs), leading to state-of-the-art performance on text categorization [11, 6]. In [11], Joachims argued that the success of SVM is due to many relevant features rather than irrelevant features for text categorization problems. Such relevant features can be picked up by an SVM...

1552 | An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge
- Cristianini, Shawe-Taylor
- 2000
Citation Context ... is the probability that the resulting classifier will have a small classification error. This style of analysis is often called PAC analysis. For an SVM, many such results can be found in chapter 4 of [4] and references therein. We list a variant of Theorem 4.19 in [4]: Theorem 3.1. If the data is 2-norm bounded as $\|x\|_2 \le b$, then consider the family of hyperplanes $w$ such that $\|w\|_2 \le a$. Denote by err(w...

672 | Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm - Littlestone - 1988

504 | Inductive learning algorithms and representations for text categorization
- Dumais, Platt, et al.
- 1998
Citation Context ...features are irrelevant. On the other hand, recently there have been many developments on large margin Perceptron algorithms (SVMs), leading to state-of-the-art performance on text categorization [11, 6]. In [11], Joachims argued that the success of SVM is due to many relevant features rather than irrelevant features for text categorization problems. Such relevant features can be picked up by an SVM...

193 | A discriminative framework for detecting remote protein homologies
- Jaakkola, Diekhans
- 2000
Citation Context ...ith $\theta = 0$ can be more amenable to theoretical analysis. For an SVM, a fixed threshold also allows a simple Perceptron-like numerical algorithm as described in chapter 12 of [18], as well as in [16] and [9]. Note that although more complex, a non-fixed $\theta$ does not introduce any fundamental difficulty. The paper is organized as follows. In Section 2, we review the Perceptron and the Winnows. Based on the deri...

134 |
Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context ...eter, and the initial weight vector can be taken as $w_j = \mu_j > 0$. The Winnow algorithm belongs to a general family of algorithms called exponentiated gradient descent with unnormalized weights (EGU) [14]. There can be a number of variants. One modification is to normalize the one-norm of the weight $w$ so that $\sum_j w_j = W$, which leads to the normalized Winnow. Another variant is called balanced Winnow,...
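The multiplicative update behind this context can be sketched as follows. This is a toy, mistake-driven rendering under stated assumptions: the initial weights $w_j = \mu > 0$ and the exponentiated-gradient form $w_j \leftarrow w_j e^{\eta y x_j}$ follow the snippet, while the threshold-at-zero prediction rule and the data are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def winnow_train(X, y, eta=0.5, mu=1.0, W=None, epochs=100):
    """Mistake-driven Winnow: on an error, w_j *= exp(eta * y * x_j).
    If W is given, rescale so that sum_j w_j == W (normalized Winnow)."""
    n, d = X.shape
    w = np.full(d, mu)                    # initial weights w_j = mu > 0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:    # mistake (threshold at 0)
                w *= np.exp(eta * y[i] * X[i])
                if W is not None:
                    w *= W / w.sum()      # normalized Winnow
                mistakes += 1
        if mistakes == 0:
            break
    return w
```

With signed features this single positive weight vector plays the role that the balanced Winnow variant achieves with two weight vectors; the normalized variant simply keeps $\|w\|_1 = W$ after every update.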

122 | Maximum entropy discrimination
- Jaakkola, Meila, et al.
- 1999
Citation Context ...natural to all exponentiated gradient methods [14], as can be observed from the theoretical results in [14]. The regularized normalized Winnow is closely related to the maximum entropy discrimination [10] (the two methods are almost identical for linearly separable problems). However, in the framework of maximum entropy discrimination, the Winnow connection is non-obvious. Note also that the SNoW arch...

98 | Mistakedriven learning in text categorization
- Dagan, Karov, et al.
- 1997
Citation Context ...re has been considerable interest in applying machine learning techniques to problems in natural language processing. One method that has had great success in many applications is the SNoW architecture [5, 12]. This architecture is based on the Winnow algorithm [15], which in theory is suitable for problems with many irrelevant attributes. The success of SNoW is then attributed to the argument that typical...

81 | General convergence results for linear discriminant updates
- Grove, Littlestone, et al.
- 2001
Citation Context ... $M \le \bigl(\sum_j w_j \ln \frac{w_j \|\mu\|_1}{\mu_j \|w\|_1}\bigr) \max_i \|x_i\|_\infty^2 / \gamma^2$, where $0 < \gamma \le \min_i w^T x_i y_i$, $W \ge \|w\|_1$, and the learning rate is $\eta = \gamma / (W \max_i \|x_i\|_\infty^2)$. The technique for deriving the above bound was developed in [7] (also see [15] for earlier results). The detailed proof of this specific bound can be found in [24], which employed techniques in [7]. Note that unlike the Perceptron mistake bound, the above bound is ...

66 | D.R.: Successive overrelaxation for support vector machines
- Mangasarian, Musicant
- 1999
Citation Context ...ulation with $\theta = 0$ can be more amenable to theoretical analysis. For an SVM, a fixed threshold also allows a simple Perceptron-like numerical algorithm as described in chapter 12 of [18], as well as in [16] and [9]. Note that although more complex, a non-fixed $\theta$ does not introduce any fundamental difficulty. The paper is organized as follows. In Section 2, we review the Perceptron and the Winnows. Based on ...

54 |
The adatron: an adaptive perceptron algorithm
- Anlauf, Biehl
- 1989
Citation Context ...$M \le \|w\|_2^2 \max_i \|x_i\|_2^2 / (\min_i w^T x_i y_i)^2$. The weight vector $w$ that minimizes the right-hand side of the bound is called the optimal hyperplane in [20] or the optimal stability hyperplane in [1, 13, 17]. This optimal hyperplane is the solution to the following quadratic programming problem: $\min_w \tfrac{1}{2}\|w\|_2^2$ s.t. $w^T x_i y_i \ge 1$ for $i = 1, \ldots, n$. In reality, not every problem is linearly separable. ...
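The Perceptron mistake bound quoted in this context can be checked numerically. A hedged sketch (toy data and function names are illustrative): run the classical Perceptron, count its mistakes, and compare against $\|w^*\|_2^2 \max_i \|x_i\|_2^2 / (\min_i y_i\, w^{*T} x_i)^2$ for any separating $w^*$.

```python
import numpy as np

def perceptron_mistakes(X, y, epochs=100):
    """Classical Perceptron (w += y_i * x_i on each mistake);
    returns the learned weights and the total mistake count."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        clean = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
                clean = False
        if clean:
            break
    return w, mistakes

def mistake_bound(X, y, w_star):
    """||w*||^2 * max_i ||x_i||^2 / (min_i y_i * w*.x_i)^2 for a
    separating reference vector w_star."""
    gamma = min(yi * (w_star @ xi) for xi, yi in zip(X, y))
    R2 = max(float(xi @ xi) for xi in X)
    return float(w_star @ w_star) * R2 / gamma ** 2
```

The bound is minimized over $w^*$ by the optimal stability hyperplane described in the snippet, which is exactly why that hyperplane is the natural target for the large margin algorithms.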

28 | Relational learning for NLP using linear threshold elements - Khardon, Roth, et al. - 1999

21 | Linear concepts and hidden variables
- Grove, Roth
- 2001
Citation Context ...um entropy discrimination, the Winnow connection is non-obvious. Note also that the SNoW architecture for NLP problems employs a heuristic for a margin version of unnormalized Winnow as described in [5, 8]. However, the algorithm was purely mistake-driven without dual variables $\alpha_i$ (therefore the algorithm does not automatically compute an optimal stability hyperplane for the Winnow mistake bound). In ...

11 | Statistical Mechanics of the Perceptron with Maximal Stability
- Kinzel
Citation Context ...$M \le \|w\|_2^2 \max_i \|x_i\|_2^2 / (\min_i w^T x_i y_i)^2$. The weight vector $w$ that minimizes the right-hand side of the bound is called the optimal hyperplane in [20] or the optimal stability hyperplane in [1, 13, 17]. This optimal hyperplane is the solution to the following quadratic programming problem: $\min_w \tfrac{1}{2}\|w\|_2^2$ s.t. $w^T x_i y_i \ge 1$ for $i = 1, \ldots, n$. In reality, not every problem is linearly separable. ...