## Regret bounds for hierarchical classification with linear-threshold functions (2004)

Venue: Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004)

Citations: 8 (3 self)

### BibTeX

@INPROCEEDINGS{Cesa-bianchi04regretbounds,
  author    = {Nicolò Cesa-Bianchi and Alex Conconi and Claudio Gentile},
  title     = {Regret bounds for hierarchical classification with linear-threshold functions},
  booktitle = {Proceedings of the 17th Annual Conference on Learning Theory},
  year      = {2004},
  pages     = {93--108},
  publisher = {Springer}
}


### Abstract

We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce an incremental algorithm using a linear-threshold classifier at each node of the taxonomy. These classifiers are trained and evaluated in a hierarchical top-down fashion. We then define a hierarchical and parametric data model and prove a bound on the probability that our algorithm guesses the wrong multilabel for a random instance compared to the same probability when the true model parameters are known. Our bound decreases exponentially with the number of training examples and depends in a detailed way on the interaction between the process parameters and the taxonomy structure. Preliminary experiments on real-world data provide support to our theoretical results.
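
The hierarchical top-down evaluation described in the abstract admits a compact sketch: each node runs its own linear-threshold classifier, and a node may be labeled positive only if its parent was (so partial paths are allowed). The taxonomy encoding, weight dictionary, and function names below are illustrative, not the paper's.

```python
import numpy as np

def subtree(tree, i):
    """All proper descendants of node i in the taxonomy."""
    out, stack = [], list(tree.get(i, []))
    while stack:
        j = stack.pop()
        out.append(j)
        stack.extend(tree.get(j, []))
    return out

def predict_multilabel(x, tree, weights, roots):
    """Top-down evaluation: a node may be labeled 1 only if its parent was."""
    y = {}
    stack = list(roots)
    while stack:
        i = stack.pop()
        y[i] = 1 if weights[i] @ x >= 0 else 0
        if y[i] == 1:
            stack.extend(tree.get(i, []))   # descend through positive nodes
        else:
            for j in subtree(tree, i):      # a negative node cuts off its subtree
                y[j] = 0
    return y
```

With `tree = {parent: [children]}`, a negative prediction at an inner node forces zeros on its whole subtree, which is the top-down discipline the paper's classifiers are trained and evaluated under.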

### Citations

4660 | Matrix analysis - Horn, Johnson - 1986 |

Citation Context: ...$e^{-\Delta_{i,t}^2(2+\hat\lambda_{i,n})/8}\,P_{i,t} \le 2\,\{A_{i,t}\}\,e^{-\gamma_i^2 \lambda_i n/16}$. Thus, integrating out the conditioning, we get that (8) is upper bounded by $2P(A_{i,t})\sum_{n=M}^{t-1} e^{-\gamma_i^2 \lambda_i n/16} \le 2P(A_{i,t})\,t\,e^{-\gamma_i^2 \lambda_i M/16}$. (9) Since the process at each node $i$ is i.i.d., we can bound (9) through the concentration result contained in Lemma 1. Choosing $M \ge 96d/\lambda_i^2$, we get $P_{i,t}(A_{i,t},\ \hat\lambda_{i,n} < \lambda_i n/2) = \{A$...

1486 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963 |

Citation Context: ...$r(i)$ and $P_{i,t}$ denote $P(\,\cdot \mid (X_1, V_{j,1}), \ldots, (X_{t-1}, V_{j,t-1}), X_t)$. Notice that $V_{i,i_1}, \ldots, V_{i,i_{N(i,t-1)}}$ are independent w.r.t. $P_{i,t}$. We bound (8) by combining Chernoff-Hoeffding inequalities [8] with Lemma 3: $P_{i,t}(|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ A_{i,t},\ \hat\lambda_{i,n} \ge \lambda_i n/2) = \{A_{i,t}\} \times \{\hat\lambda_{i,n} \ge \lambda_i n/2\} \times P_{i,t}(|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2) \le \{A_{i,t}\} \times \{\hat\lambda_{i,n} \ge \lambda_i n/2\} \times 2e^{-\Delta_{i,t}^2(2+\ldots}$...
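
The Chernoff-Hoeffding inequality invoked in this excerpt bounds deviations of averages of bounded independent variables. A quick Monte Carlo check of the two-sided bound $P(|\bar X - \mu| \ge \varepsilon) \le 2e^{-2n\varepsilon^2}$ for $[0,1]$-valued variables (sample size and deviation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 50, 0.2, 20000
mu = 0.5                                   # mean of a Uniform[0, 1] variable

# empirical probability that the sample mean of n uniforms deviates by eps
means = rng.random((trials, n)).mean(axis=1)
p_emp = np.mean(np.abs(means - mu) >= eps)

# two-sided Hoeffding bound for [0, 1]-valued variables
p_bound = 2 * np.exp(-2 * n * eps**2)
print(p_emp, p_bound)
```

The empirical frequency is far below the bound here, as expected: Hoeffding is distribution-free and hence loose for any particular distribution.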

1165 | Factor graphs and the sum-product algorithm - Kschischang, Frey, et al. - 2001 |

Citation Context: ..., the Bayes optimal predictor would use the maximum likelihood multilabel assignment given $G$ and $u_1, \ldots, u_c$ (this assignment is easily computable using a special case of the sum-product algorithm [10]). Finding a good algorithm to approximate the maximum-likelihood assignment has proven to be a difficult task. Proof (of Theorem 1). We first observe that $\{\exists i : \hat y_{i,t} \ne V_{i,t}\} \le \{\exists i : y_{i,t} \ne V_{i,t}\} + \ldots$

990 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996 |

Citation Context: ...columns are $X_1, \ldots, X_s$, $C = S S^\top$ is the associated empirical correlation matrix, and $\hat\lambda_s \ge 0$ is the smallest eigenvalue of $C$, then $P(\hat\lambda_s < \lambda s/2) \le 2(s+1)\,e^{-s\lambda^2/304}$ provided $s \ge 96d/\lambda^2$. (4) We now state our main result. Theorem 1. Consider a taxonomy $G$ with $c$ nodes of depths $h_1, \ldots, h_c$ and fix an arbitrary choice of parameters $u_1, \ldots, u_c \in \mathbb{R}^d$, such that $\|u_i\| = 1$, $i = 1, \ldots$...
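
The quantity this excerpt controls, the smallest eigenvalue of the empirical correlation matrix $S S^\top$ built from unit-norm instances, is easy to compute directly. A small numpy sketch with synthetic isotropic data (dimension and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 5, 2000

# unit-norm random instances X_1, ..., X_s as the columns of S
S = rng.normal(size=(d, s))
S /= np.linalg.norm(S, axis=0)

C = S @ S.T                               # empirical correlation matrix S S^T
lam_hat = np.linalg.eigvalsh(C).min()     # smallest empirical eigenvalue

# for isotropic unit vectors E[X X^T] = I/d, so lam_hat/s should be near 1/d
print(lam_hat / s)
```

For this isotropic choice the smallest eigenvalue of the population correlation matrix is $\lambda = 1/d$, and $\hat\lambda_s/s$ concentrates near it as $s$ grows, which is exactly the regime where the tail bound (4) becomes meaningful.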

782 | The perceptron: A probabilistic model for information storage and organization in the brain - Rosenblatt - 1958 |

Citation Context: ...error of 46.6% (recall that an instance is considered mistaken if at least one out of 102 labels is guessed wrong). For comparison, if we replace our estimator with the standard Perceptron algorithm [16, 14] (without touching the rest of the algorithm) the test error goes up to 65.8%, and this performance does not change significantly if we train the Perceptron algorithm at each node with all the example...
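
The baseline in this comparison, the standard Perceptron, fits in a few lines; a minimal sketch (the interface and training data below are illustrative):

```python
import numpy as np

def perceptron(examples, d, epochs=10):
    """Classic Perceptron: update the weight vector only on mistakes."""
    w = np.zeros(d)
    for _ in range(epochs):
        for x, y in examples:            # labels y in {-1, +1}
            if y * (w @ x) <= 0:         # mistake (or on the decision boundary)
                w = w + y * x
    return w
```

On linearly separable data the mistake count is bounded by the classic margin argument; the paper's point is that, plugged into the same hierarchical scheme, this first-order update underperforms their regularized least-squares estimator.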

420 | Hierarchically classifying documents using very few words - Koller, Sahami - 1997 |

Citation Context: ...$P(A_{i,t},\ N < M) = P(A_{i,t})\,P(N < M) \le P(A_{i,t})\,e^{-(t-1)\mu_i/10} \le P(A_{i,t})\exp\!\left(-\frac{(t-1)P(A_{i,t})}{10 \cdot 2^{h_i}}\right)$, (11) where we used Bernstein's inequality (see, e.g., [4, Ch. 8]) and our choice of $M$ to prove (11). Piecing together, overapproximating, and using in the bounds for (8) and (9) the conditions on $t$, along with $M \ge (t-1)P(A_{i,t})/2^{h_i+1} - 1$, results in $P(\exists i : \hat y_{i,t} \ne V_{i,t}) - P(\exists i : y_{i,t} \ne \ldots$

266 | Hierarchical classification of Web content - Dumais, Chen - 2000 |

Citation Context: ...for instance, if we just focus on the probability values $P(A_{i,t})$, we see that the regret bound is essentially the sum over all nodes $i$ in the taxonomy of terms of the form $P(A_{i,t}) \exp(-k_i\, P(A_{i,t})\, t)$, (5) where the $k_i$'s are positive constants. Clearly, $P(A_{i,t})$ decreases as we descend along a path. Hence, if node $i$ is a root then $P(A_{i,t})$ tends to be relatively large, whereas if $i$ is a leaf node then $P($...
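
The per-node terms $P(A)\exp(-k\,P(A)\,t)$ in (5) have a simple shape that is easy to see numerically: for fixed $t$ the term peaks at $P(A) \approx 1/(k t)$, so both frequently reached nodes (the exponential kills the term) and almost-never reached nodes (the leading factor does) contribute little. Illustrative values only:

```python
import numpy as np

k, t = 1.0, 1000
p = np.array([0.5, 0.1, 0.01, 0.001, 1e-5])   # P(A_{i,t}) shrinks down the taxonomy
terms = p * np.exp(-k * p * t)

# f(p) = p * exp(-k p t) is maximized at p = 1/(k t) = 0.001 here
print(dict(zip(p, terms)))
```

This matches the discussion in the excerpt: the dominant contribution to the regret bound comes from nodes reached rarely enough that their estimators are still poorly trained at time $t$, but not so rarely that they almost never matter.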

236 | Improving text classification by shrinkage in a hierarchy of classes - McCallum, Rosenfeld, et al. - 1998 |

115 | Relative loss bounds for on-line density estimation with the exponential family of distributions - Azoury, Warmuth - 2001 |

Citation Context: ...labels observed by node $i$. This estimator is a slight variant of regularized least squares for classification [2, 15] where we include the current instance $x_t$ in the computation of $W_{i,t}$ (see, e.g., [1, 20] for analyses of similar algorithms in different contexts). Efficient incremental computations of the inverse matrix and dual variable formulations of the algorithm are extensively discussed in [2, 15...

100 | Hierarchical text classification and evaluation - Sun, Lim - 2001 |

72 | Hierarchical text categorization using neural networks - Ruiz, Srinivasan - 2001 |

63 | Competitive on-line statistics - Vovk - 2001 |

Citation Context: ...labels observed by node $i$. This estimator is a slight variant of regularized least squares for classification [2, 15] where we include the current instance $x_t$ in the computation of $W_{i,t}$ (see, e.g., [1, 20] for analyses of similar algorithms in different contexts). Efficient incremental computations of the inverse matrix and dual variable formulations of the algorithm are extensively discussed in [2, 15...

60 | Regularized Least Squares Classification - Rifkin, Yeo, et al. - 2003 |

Citation Context: ...$(V_{i,i_1}, \ldots, V_{i,i_{N(i,t-1)}})^\top$ is the $N(i, t-1)$-dimensional vector of the corresponding labels observed by node $i$. This estimator is a slight variant of regularized least squares for classification [2, 15] where we include the current instance $x_t$ in the computation of $W_{i,t}$ (see, e.g., [1, 20] for analyses of similar algorithms in different contexts). Efficient incremental computations of the inverse m...

57 | A second-order perceptron algorithm - Cesa-Bianchi, Conconi, et al. - 2005 |

Citation Context: ...$(V_{i,i_1}, \ldots, V_{i,i_{N(i,t-1)}})^\top$ is the $N(i, t-1)$-dimensional vector of the corresponding labels observed by node $i$. This estimator is a slight variant of regularized least squares for classification [2, 15] where we include the current instance $x_t$ in the computation of $W_{i,t}$ (see, e.g., [1, 20] for analyses of similar algorithms in different contexts). Efficient incremental computations of the inverse m...
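
The "efficient incremental computation of the inverse matrix" mentioned in these excerpts is typically a rank-one (Sherman-Morrison) update. Below is a sketch of a generic per-node regularized least-squares estimator maintained that way; the class name and interface are ours, and this is plain RLS, not the paper's exact variant (which also folds the current instance into the estimate):

```python
import numpy as np

class NodeRLS:
    """Regularized least squares kept incrementally:
    w = (a I + X^T X)^{-1} X^T y, with the inverse updated per example."""
    def __init__(self, d, a=1.0):
        self.Ainv = np.eye(d) / a        # inverse of (a I) initially
        self.b = np.zeros(d)             # running sum of y_t * x_t

    def update(self, x, y):
        # Sherman-Morrison:
        # (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        Ax = self.Ainv @ x
        self.Ainv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += y * x

    def predict(self, x):
        # linear-threshold prediction with the current RLS weight vector
        return 1 if x @ (self.Ainv @ self.b) >= 0 else -1
```

Each update costs $O(d^2)$ instead of the $O(d^3)$ of refactoring the matrix from scratch, which is what makes maintaining one such estimator per taxonomy node practical.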

49 | Turning Yahoo into an Automatic Web-Page Classifier - Mladenic - 1998 |

25 | Learning probabilistic linear-threshold classifiers via selective sampling - Cesa-Bianchi, Conconi, et al. - 2003 |

Citation Context: ...standard approach in statistical learning theory, we assume that data are generated by a parametric and hierarchical stochastic process associated with the given taxonomy. Building on the techniques from [3], we design and analyze an algorithm for estimating the parameters of the process. Our algorithm is based on a hierarchy of regularized least-squares estimators which are incrementally updated as more...

23 | Learning with taxonomies: Classifying documents and words - Hofmann, Cai, et al. - 2003 |

Citation Context: ...$\{|\hat\Delta_{i,t,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|,\ A_{i,t}\} \le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}| - |B_{i,N}|,\ A_{i,t}\} \le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ A_{i,t}\} + \{|B_{i,N}| \ge |\Delta_{i,t}|/2,\ A_{i,t}\}$. We can bound the two terms of (7) separately. Let $M < t$ be an integer constant to be specified later. For the first term we obtain $\{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ A_{i,t}\} \le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ A_{i,t},\ N \ge M,\ \ldots$

14 | On the eigenspectrum of the gram matrix and its relationship to the operator eigenspectrum - Shawe-Taylor, Williams, et al. - 2002 |

Citation Context: ...ly depend on the convergence of the smallest empirical eigenvalue of the process at each node $i$, and the next result is the key to keeping this convergence under control. Lemma 1 (Shawe-Taylor et al. [18]). Let $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ be a random vector such that $\|X\| = 1$ with probability 1, and let $\lambda \ge 0$ be the smallest eigenvalue of the correlation matrix $\{E[X_i X_j]\}_{i,j=1}^d$. If $X_1, \ldots, X_s$ are...

12 | On convergence proofs on perceptrons - Novikoff - 1962 |

Citation Context: ...error of 46.6% (recall that an instance is considered mistaken if at least one out of 102 labels is guessed wrong). For comparison, if we replace our estimator with the standard Perceptron algorithm [16, 14] (without touching the rest of the algorithm) the test error goes up to 65.8%, and this performance does not change significantly if we train the Perceptron algorithm at each node with all the example...

10 | Hierarchical text classification using methods from machine learning - Granitzer - 2003 |

Citation Context: ...lt task. Proof (of Theorem 1). We first observe that $\{\exists i : \hat y_{i,t} \ne V_{i,t}\} \le \{\exists i : y_{i,t} \ne V_{i,t}\} + \{\exists i : \hat y_{i,t} \ne y_{i,t}\} = \{\exists i : y_{i,t} \ne V_{i,t}\} + \sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \hat y_{j,t} = y_{j,t},\ j = 1, \ldots, i-1\}$. (6) Without loss of generality we can assume that the nodes in the taxonomy are assigned numbers such that if node $i$ is a child of node $j$ then $i > j$. The regret (6) can then be upper bounded as $\sum^{c} \{\hat y$...