## A framework for learning predictive structures from multiple tasks and unlabeled data (2005)

Venue: Journal of Machine Learning Research

Citations: 319 (3 self)

### BibTeX

```bibtex
@article{Ando05aframework,
  author  = {Rie Kubota Ando and Tong Zhang},
  title   = {A framework for learning predictive structures from multiple tasks and unlabeled data},
  journal = {Journal of Machine Learning Research},
  year    = {2005},
  volume  = {6},
  pages   = {1817--1853}
}
```

### Abstract

One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods have been proposed, we still lack a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically, we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, we propose algorithms for structural learning, investigate the associated computational issues, and report experiments demonstrating the effectiveness of the proposed algorithms in the semi-supervised learning setting.
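The abstract's "structural learning" idea is, in its best-known instantiation (the SVD-based alternating structure optimization described in the paper), to train many auxiliary linear predictors and extract the low-dimensional subspace their weight vectors share. A minimal NumPy sketch under that reading; the function name, dimensions, and the random stand-in weights are all hypothetical:

```python
import numpy as np

def shared_structure(W, h):
    """Extract an h-dimensional shared structure from the p x m matrix W
    whose columns are weight vectors of m auxiliary linear predictors.
    The top-h left singular vectors span the directions most useful
    across tasks (a sketch of the SVD step only, not the paper's full
    alternating optimization)."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T  # h x p projection matrix Theta, orthonormal rows

# Hypothetical setup: 50 features, 10 auxiliary tasks, 3 shared directions.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 10))
theta = shared_structure(W, 3)

# A target task can then use the original features augmented with the
# learned low-dimensional projection theta @ x.
x = rng.normal(size=50)
x_aug = np.concatenate([x, theta @ x])
```

The rows of `theta` are orthonormal, so `theta @ theta.T` is the identity, matching the orthonormality constraint usually placed on the shared structure.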

### Citations

8980 | Statistical Learning Theory - Vapnik - 1998

Citation context: ...and unlabeled data to train a classifier. Although a number of methods have been proposed, their effectiveness is not always clear. For example, Vapnik introduced the notion of transductive inference (Vapnik, 1998), which may be regarded as an approach to semi-supervised learning. Although some success has been reported (e.g., see Joachims, 1999), there has also been criticism pointing out that this method may...

2046 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2001

Citation context: ...However, the framework developed in this paper is under the frequentist setting, and the most relevant statistical studies are shrinkage methods in multiple-output linear models (see Section 3.4.6 of Hastie et al., 2001). In particular, the algorithm proposed in Section 3 has a form similar to a shrinkage method proposed by Breiman and Friedman (1997). However, the framework presented here (as well as the specific a...

1245 | Combining Labeled and Unlabeled Data with Co-training - Blum, Mitchell - 1998

Citation context: ...1999), there has also been criticism pointing out that this method may not behave well under some circumstances (Zhang and Oles, 2000). Another popular semi-supervised learning method is co-training (Blum and Mitchell, 1998), which is related to the bootstrap method used in some NLP applications (Yarowsky, 1995) and to EM (Nigam et al., 2000). The basic idea is to label part of unlabeled data using a high precision clas...

803 | Text classification from labeled and unlabeled documents using EM - Nigam, McCallum, et al. - 2000

Citation context: ...es, 2000). Another popular semi-supervised learning method is co-training (Blum and Mitchell, 1998), which is related to the bootstrap method used in some NLP applications (Yarowsky, 1995) and to EM (Nigam et al., 2000). The basic idea is to label part of unlabeled data using a high precision classifier, and then put the “automatically labeled” data into the trai...

492 | Unsupervised word sense disambiguation rivaling supervised methods - Yarowsky - 1995

Citation context: ...circumstances (Zhang and Oles, 2000). Another popular semi-supervised learning method is co-training (Blum and Mitchell, 1998), which is related to the bootstrap method used in some NLP applications (Yarowsky, 1995) and to EM (Nigam et al., 2000). The basic idea is to label part of unlabeled data using a high precision classifier, and then put the “automatically labeled” data into the trai...

467 | Multitask learning - Caruana - 1997

Citation context: ...al than the earlier statistical studies. In the machine learning literature, related work is sometimes referred to as multi-task learning, for example, see (Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1997; Evegniou and Pontil, 2004; Micchelli and Pontil, 2005) and references therein. We shall call our procedure structural learning since it is a more accurate description of what our method does in the s...

434 | Learning with local and global consistency - Zhou, Bousquet, et al. - 2004

Citation context: ...arning procedure. An example of this approach is to use unlabeled data to create a data manifold (graph structure), on which proper smooth function classes can be defined (Szummer and Jaakkola, 2002; Zhou et al., 2004; Zhu et al., 2003). If such smooth functions can characterize the underlying classifier very well, then one is able to improve the classification performance. It is worth pointing out that smooth fun...

389 | On the method of bounded differences - McDiarmid - 1989

Citation context: ...easy to verify that $Q(S) - Q(\bar{S}) \le \sup_{\theta} \sup_{f \in \mathcal{H}_{\theta}} \frac{1}{mn} \bigl| L(f(X_{\bar{i}}^{\bar{\ell}}), Y_{\bar{i}}^{\bar{\ell}}) - L(f(\bar{X}_{\bar{i}}^{\bar{\ell}}), \bar{Y}_{\bar{i}}^{\bar{\ell}}) \bigr| \le \frac{M}{mn}$. The lemma is a direct consequence of McDiarmid's concentration inequality (McDiarmid, 1989). We are now ready to prove the main theorem. Consider a sequence of binary random variables $\sigma = \{\sigma_i^{\ell}\}$ such that each $\sigma_i^{\ell} = \pm 1$ is independent with probability 1/2. The ..., under the empirical sample $S$, is given by ...
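For reference, the concentration inequality the lemma invokes (McDiarmid's bounded-differences inequality) can be stated as:

```latex
% McDiarmid (1989): if changing the i-th argument of f changes its
% value by at most c_i, then for independent X_1, ..., X_n and t > 0,
\Pr\bigl( f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \ge t \bigr)
  \le \exp\!\left( - \frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \right)
```

In the lemma's application, each of the $mn$ sample points changes $Q$ by at most $M/(mn)$, so $\sum_i c_i^2 = M^2/(mn)$ and the bound gives concentration at scale $M/\sqrt{mn}$.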

341 | Weak Convergence and Empirical Processes - van der Vaart, Wellner - 1996

278 | Probability in Banach Spaces - Ledoux, Talagrand - 1991

Citation context: ...2 follows from a simple estimate of the Rademacher complexity of the sub function class in $\mathcal{H}_{\Theta}$ corresponding to $w^{T} \Phi(x)$ as $A/\sqrt{n}$, and a straightforward application of Sudakov's minoration (e.g., see Ledoux and Talagrand, 1991, Chapter 12). The two $\ln(1 + B/\varepsilon)$ terms can be obtained by explicit discretization of the corresponding finite-dimensional parameter spaces: one for the $h$-dimensional sub function class in $\mathcal{H}_{\Theta}$ correspo...

156 | Semi-supervised learning on Riemannian manifolds - Belkin, Niyogi - 2004

Citation context: ...number of iterations.

| # of labeled examples | BN04 best (manifold) | ASO-semi |
|---|---|---|
| 100 | 39.8 | 54.1 |
| 200 | – | 61.6 |
| 500 | 59.9 | 68.5 |
| 1000 | 64.0 | 72.3 |

Figure 6: Comparison with similar settings in BN04 (Belkin and Niyogi, 2004) on 20 newsgroups. ...outperforms the best co-training performance in all of the four settings by up to 8.4%. In Figure 5, we plot the co-training performance versus co-training iterations in typical run...

153 | Kernels for Multitask Learning - Micchelli, Pontil - 2005

Citation context: ...n the machine learning literature, related work is sometimes referred to as multi-task learning, for example, see (Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1997; Evegniou and Pontil, 2004; Micchelli and Pontil, 2005) and references therein. We shall call our procedure structural learning since it is a more accurate description of what our method does in the semi-supervised learning setting. That is, we transfer ...

144 | A Model of Inductive Bias Learning - Baxter - 2000

Citation context: ...ecific algorithm in Section 3) is more general than the earlier statistical studies. In the machine learning literature, related work is sometimes referred to as multi-task learning, for example, see (Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1997; Evegniou and Pontil, 2004; Micchelli and Pontil, 2005) and references therein. We shall call our procedure structural learning since it is a more accurate...

89 | Exploiting Task Relatedness for Multiple Task Learning - Ben-David, Schuller - 2003

Citation context: ...hm in Section 3) is more general than the earlier statistical studies. In the machine learning literature, related work is sometimes referred to as multi-task learning, for example, see (Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1997; Evegniou and Pontil, 2004; Micchelli and Pontil, 2005) and references therein. We shall call our procedure structural learning since it is a more accurate description of what our metho...

89 | A probability analysis on the value of unlabeled data for classification problems - Zhang, Oles - 2000

Citation context: ...semi-supervised learning. Although some success has been reported (e.g., see Joachims, 1999), there has also been criticism pointing out that this method may not behave well under some circumstances (Zhang and Oles, 2000). Another popular semi-supervised learning method is co-training (Blum and Mitchell, 1998), which is related to the bootstrap method used in some NLP applications (Yarowsky, 1995) and to EM (Nigam et...

76 | Limitations of co-training for natural language learning from large datasets - Pierce, Cardie - 2001

66 | A high-performance semi-supervised learning method for text chunking - Ando, Zhang - 2005

Citation context: ...ructure to semi-supervised learning, and demonstrate the effectiveness of the proposed method in this context. A short version of this paper, mainly reporting some empirical results, appeared in ACL (Ando and Zhang, 2005). This version includes a more complete derivation of the proposed algorithm, with theoretical analysis and several additional experimental results. In Section 2, we formally introduce the structural...

56 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang - 2004

Citation context: ...mploy stochastic gradient descent (SGD), widely used in the neural networks literature. It was recently argued that this simple method can also work well for large scale convex learning formulations (Zhang, 2004). In the following, we consider a special case of (4) which has a simple iterative SVD solution. Let $\Phi(x) = \Psi(x) = x \in \mathbb{R}^p$ with square regularization of weight vectors. Then we have $[\{\hat{w}_{\ell}, \hat{v}_{\ell}\}, \ldots$ ...
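The SGD recipe referred to in this context can be sketched for a generic L2-regularized least-squares linear model (a minimal illustration of large-scale convex training by per-example gradient steps; the toy data and hyperparameters are hypothetical, and this is not the paper's exact joint formulation):

```python
import numpy as np

def sgd_linear(X, y, reg=0.1, lr=0.02, epochs=50, seed=0):
    """Plain stochastic gradient descent on the per-example objective
    0.5*(x.w - y)^2 + 0.5*reg*||w||^2, sweeping the data in a random
    order each epoch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Gradient of the regularized squared loss at example i.
            grad = (X[i] @ w - y[i]) * X[i] + reg * w
            w -= lr * grad
    return w

# Toy problem: the label is the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])
w = sgd_linear(X, y)
```

After training, the weight on the first feature dominates and the linear rule recovers the labels on most of the toy examples; shuffling per epoch and a small constant step size are the standard choices for this kind of convex problem.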

23 | A robust risk minimization based named entity recognition system - Zhang, Johnson - 2003

Citation context: ...al Text Corpus – as the training/development/test sets but disjoint from them. 5.3.1 Feature Representation: Our feature representation is a slight modification of a simpler configuration reported in (Zhang and Johnson, 2003), which uses: token strings, parts-of-speech, character types, several characters at the beginning and the ending of the tokens, in a 5-token window around the current position; token strings in a 3-...

10 | Named entity recognition: A maximum entropy approach using global information - Chieu, Ng - 2002

8 | Regularized multi-task learning - Evegniou, Pontil - 2004

Citation context: ...lier statistical studies. In the machine learning literature, related work is sometimes referred to as multi-task learning, for example, see (Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1997; Evegniou and Pontil, 2004; Micchelli and Pontil, 2005) and references therein. We shall call our procedure structural learning since it is a more accurate description of what our method does in the semi-supervised learning set...

3 | Named entity recognition through classifier combination - Florian, Ittycheriah, et al. - 2003

Citation context: ...ng studies on NLP tasks (e.g., Pierce and Cardie, 2001). 5.3.4 Comparison with Previous Best Results:

English test set:

| System | F-measure | Additional resources |
|---|---|---|
| semi:word+top-2 | 89.31 | unlabeled data |
| FIJZ03 (Florian et al., 2003) | 88.76 | gazetteers; 1.7M-word labeled data |
| CN03 (Chieu and Ng, 2003) | 88.31 | gazetteers (also very elaborated features) |
| KSNM03 (Klein et al., 2003) | 86.31 | rule-based post processing |

German test set: Systems...

1 | Predicting multivariate responses in multiple linear regression (with discussion) - Breiman, Friedman - 1997
