## Confidence-weighted linear classification (2008)


### Download Links

- [icml2008.cs.helsinki.fi]
- [www.dredze.com]
- [www.cs.jhu.edu]
- [webee.technion.ac.il]
- [static.googleusercontent.com]
- [cs.jhu.edu]
- DBLP

### Other Repositories/Bibliography

Venue: ICML ’08: Proceedings of the 25th International Conference on Machine Learning

Citations: 56 (10 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Dredze08confidence-weightedlinear,
  author    = {Mark Dredze and Koby Crammer},
  title     = {Confidence-weighted linear classification},
  booktitle = {ICML '08: Proceedings of the 25th International Conference on Machine Learning},
  year      = {2008},
  pages     = {264--271},
  publisher = {ACM}
}
```


### Abstract

We introduce confidence-weighted linear classifiers, which add parameter confidence information to linear classifiers. Online learners in this setting update both the classifier parameters and the estimate of their confidence. The particular online algorithms we study here maintain a Gaussian distribution over parameter vectors and update the mean and covariance of the distribution with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state-of-the-art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.
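The abstract's Gaussian-over-weights idea can be sketched in a few lines. The following is a minimal, unofficial rendering of a CW-style "Variance" update: the closed-form step size follows the quadratic constraint described in the paper, but the function name and the `phi` confidence parameter are reconstructions for illustration, not the authors' reference code.

```python
import numpy as np

def cw_variance_update(mu, Sigma, x, y, phi=1.0):
    """One online CW-style update of the weight distribution N(mu, Sigma).

    mu    : (d,) mean weight vector
    Sigma : (d, d) covariance over weights (confidence per feature)
    x     : (d,) feature vector; y in {-1, +1}
    phi   : confidence parameter (larger = more aggressive updates)
    """
    m = y * mu.dot(x)           # signed margin under the mean weights
    v = x.dot(Sigma).dot(x)     # variance of the margin under Sigma (assumed > 0)
    # Closed-form step size from the probabilistic margin constraint;
    # the discriminant is nonnegative whenever the constraint is feasible.
    disc = (1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v)
    alpha = max(0.0, (-(1 + 2 * phi * m) + np.sqrt(disc)) / (4 * phi * v))
    if alpha > 0:
        # Mean moves toward classifying (x, y), scaled by per-feature confidence.
        mu = mu + alpha * y * Sigma.dot(x)
        # Rank-one covariance shrink along x (Woodbury form of
        # (Sigma^{-1} + 2*alpha*phi*x x^T)^{-1}): confident directions shrink more.
        Sx = Sigma.dot(x)
        Sigma = Sigma - (2 * alpha * phi / (1 + 2 * alpha * phi * v)) * np.outer(Sx, Sx)
    return mu, Sigma
```

Frequently seen features accumulate many such rank-one shrinks, so their variance (and thus their step size on later updates) drops — the mechanism the abstract credits for faster online learning.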

### Citations

3666 | Convex Optimization - Boyd, Vandenberghe - 2004

3436 | LIBSVM: A Library for Support Vector Machines - Chang, Lin - 2001. Software available at www.csie.ntu.edu.tw/~cjlin/libsvm
Citation Context: ...our new online algorithm (Variance) against two standard batch algorithms: maxent classification (default configuration of the maxent learner in McCallum (2002)) and support vector machines (LibSVM (Chang & Lin, 2001)). We also include stochastic gradient descent (SGD) (Blitzer et al., 2007), which performs well for NLP tasks. Classifier parameters (Gaussian prior for maxent, C for SVM and the learning rate for S...

788 | The perceptron: a probabilistic model for information storage and organization in the brain - Rosenblatt - 1958
Citation Context: ...e tasks. This type of feature distribution can have a detrimental effect on learning. With typical linear classifier training algorithms, such as the perceptron or passive-aggressive (PA) algorithms (Rosenblatt, 1958; Crammer et al., 2006), the parameters of binary features are only updated when the features occur. Therefore, frequent features typically receive more updates. Similarly, features that occur early i...

436 | RCV1: A new benchmark collection for text categorization research - Lewis, Yang, et al. - 2004
Citation Context: ...as represented as a binary bag-of-words. For each problem we selected 1800 instances. Reuters: The Reuters Corpus Volume 1 (RCV1-v2/LYRL2004) contains over 800,000 manually categorized newswire stories (Lewis et al., 2004). Each article contains one or more labels describing its general topic...

323 | MALLET: A Machine Learning for Language Toolkit - McCallum - 2002

292 | Online passive-aggressive algorithms - Crammer, Dekel, et al.
Citation Context: ...of feature distribution can have a detrimental effect on learning. With typical linear classifier training algorithms, such as the perceptron or passive-aggressive (PA) algorithms (Rosenblatt, 1958; Crammer et al., 2006), the parameters of binary features are only updated when the features occur. Therefore, frequent features typically receive more updates. Similarly, features that occur early in the data stream take...

281 | Gaussian processes in machine learning - Rasmussen - 2004
Citation Context: ...PC) maintains a Gaussian distribution over weight vectors (primal) or over regressor values (dual). Our algorithm uses a different update criterion than the standard Bayesian updates used in GPC (Rasmussen & Williams, 2006, Ch. 3), avoiding the challenging issues in approximating posteriors in GPC. Bayes point machines (Herbrich et al., 2001) maintain a collection of weight vectors consistent with the training data, an...

136 | Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification - Blitzer, Dredze, et al. - 2007
Citation Context: ...ms: maxent classification (default configuration of the maxent learner in McCallum (2002)) and support vector machines (LibSVM (Chang & Lin, 2001)). We also include stochastic gradient descent (SGD) (Blitzer et al., 2007), which performs well for NLP tasks. Classifier parameters (Gaussian prior for maxent, C for SVM and the learning rate for SGD) were tuned as for the online methods. Results for batch learning are sh...

68 | Bayes point machines - Herbrich, Graepel, et al. - 2001
Citation Context: ...he training data, and use the single linear classifier which best represents the collection. Conceptually, the collection is a non-parametric distribution over the weight vectors. Its online version (Harrington et al., 2003) maintains a finite number of weight vectors updated simultaneously. Finally, with the growth of available data there is an increasing need for algorithms that process training data very efficiently...

61 | The Matrix Cookbook - Petersen, Pedersen - 2008
Citation Context: ...we must also have

$$\frac{\partial \mathcal{L}}{\partial \Sigma} = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma_i^{-1} + \phi\alpha\, x_i x_i^\top = 0\,.$$

Solving for $\Sigma^{-1}$ we obtain

$$\Sigma_{i+1}^{-1} = \Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top\,. \quad (12)$$

Finally, we compute the inverse of (12) using the Woodbury identity (Petersen & Pedersen, 2007, Eq. 135) and get

$$\Sigma_{i+1} = \left(\Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top\right)^{-1} = \Sigma_i - \Sigma_i x_i \left(\frac{1}{2\alpha\phi} + x_i^\top \Sigma_i x_i\right)^{-1} x_i^\top \Sigma_i = \Sigma_i - \frac{2\alpha\phi\, \Sigma_i x_i x_i^\top \Sigma_i}{1 + 2\alpha\phi\, x_i^\top \Sigma_i x_i}\,. \quad (13)$$

The KKT conditions for the optimization imply that the...
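The Woodbury step in this context is easy to sanity-check numerically: inverting $\Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top$ directly must agree with the rank-one closed form, which costs $O(d^2)$ instead of $O(d^3)$. This is an illustrative check, not code from the paper; `alpha` and `phi` are arbitrary positive values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)      # an arbitrary SPD covariance Sigma_i
x = rng.standard_normal(d)
alpha, phi = 0.3, 0.7                # arbitrary positive step/confidence values

# Direct inversion of Eq. (12): Sigma_{i+1} = (Sigma_i^{-1} + 2*alpha*phi*x x^T)^{-1}
direct = np.linalg.inv(np.linalg.inv(Sigma) + 2 * alpha * phi * np.outer(x, x))

# Rank-one Woodbury form of Eq. (13): no d x d inversion needed
Sx = Sigma @ x
woodbury = Sigma - (2 * alpha * phi / (1 + 2 * alpha * phi * (x @ Sx))) * np.outer(Sx, Sx)

assert np.allclose(direct, woodbury)
```

Avoiding the explicit inverse is what keeps the per-instance covariance update cheap enough for online learning over high-dimensional NLP feature spaces.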

57 | A second-order perceptron algorithm - Cesa-Bianchi, Conconi, et al. - 2005
Citation Context: ...learning (Sutton, 1992), although we do not know of a previous model that specifically models confidence in a way that takes into account the frequency of features. The second-order perceptron (SOP) (Cesa-Bianchi et al., 2005) is perhaps the closest to our CW algorithm. Both are online algorithms that maintain a weight vector and some statistics about previous examples. While the SOP models certainty with feature counts...

19 | Single-pass online learning: performance, voting schemes and online feature selection - Carvalho, Cohen - 2006
Citation Context: ...ity and ability to operate on extremely large datasets. In the batch setting, these algorithms are run several times over the training data, which yields slower performance than single-pass learning (Carvalho & Cohen, 2006). Our algorithm improves on both accuracy and learning speed by requiring fewer iterations over the training data. Such behavior can be seen on the “talk” dataset in Figure 1, which shows accuracy on...

13 | The Huller: a simple and efficient online SVM - Bordes, Bottou - 2005
Citation Context: ...y. Finally, with the growth of available data there is an increasing need for algorithms that process training data very efficiently. A similar approach to ours is to train classifiers incrementally (Bordes & Bottou, 2005). The extreme case is to use each example once, without repetitions, as in the multiplicative update method of Carvalho and Cohen (2006). Conclusion: We have presented confidence-weighted linear class...

12 | Ellipsoidal kernel machines - Shivaswamy, Jebara - 2007

1 | Bayes point machines - Herbrich, Graepel, Campbell - 2001