## Treelets — An Adaptive Multi-Scale Basis for Sparse Unordered Data

Citations: 9 (2 self)

### BibTeX

@MISC{Lee_treelets,
  author = {Ann B. Lee and Boaz Nadler and Larry Wasserman},
  title = {Treelets — An Adaptive Multi-Scale Basis for Sparse Unordered Data},
  year = {}
}


### Abstract

In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered — with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper, we present treelets — a novel construction of multi-scale bases that extends wavelets to non-smooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
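The construction summarized in the abstract can be sketched in a few lines. The following is a minimal illustrative implementation of the pairwise-merging idea behind treelets (repeated 2×2 principal component rotations applied to the most correlated pair of still-active variables), not the authors' reference code; the function name, stopping rule, and demo data are my own assumptions.

```python
import numpy as np

def treelet_transform(X, n_levels):
    """Illustrative treelet sketch: at each level, rotate the most
    correlated pair of active variables into decorrelated 'sum' and
    'difference' coordinates, keep the sum active, retire the difference."""
    _, p = X.shape
    X = X.astype(float).copy()
    active = list(range(p))
    basis = np.eye(p)              # accumulated orthonormal basis (product of rotations)
    merges = []
    for _ in range(n_levels):
        C = np.cov(X, rowvar=False)
        sd = np.sqrt(np.diag(C))
        R = C / np.outer(sd, sd)   # correlation matrix
        # find the most correlated active pair
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: abs(R[ab[0], ab[1]]))
        # Jacobi rotation angle that zeroes the (i, j) covariance entry
        theta = 0.5 * np.arctan2(2 * C[i, j], C[i, i] - C[j, j])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(p)
        J[i, i], J[j, j] = c, c
        J[j, i], J[i, j] = s, -s
        X = X @ J                  # rotate the data
        basis = basis @ J          # and the basis
        # keep the higher-variance ('sum') coordinate active
        drop = j if X[:, i].var() >= X[:, j].var() else i
        active.remove(drop)
        merges.append((i, j))
    return X, basis, merges

# Toy demo: three nearly collinear columns plus one independent column
# (synthetic data; parameters are arbitrary).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
              + [rng.normal(size=(200, 1))])
Xt, B, merges = treelet_transform(X, n_levels=2)
# B is orthonormal, and the first merge joins two of the collinear columns.
```

The sequence of merges is the hierarchical tree; the accumulated rotation matrix `B` is the orthonormal basis the abstract refers to.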

### Citations

2292 | The Elements of Statistical Learning
- Hastie
- 2001
Citation Context: ...st kind of sparsity, two key tools are graphical models (Whittaker, 2001) that assume statistical dependence between specific variables, and regularization methods that penalize non-sparse solutions (Hastie et al., 2001). Examples of such regularization methods are the lasso (Tibshirani, 1996), regularized covariance estimation methods (Bickel and Levina, 2007; Levina and Zhu, 2007), and variable selection in high-d...

2287 | A wavelet tour of signal processing
- Mallat
- 1999
Citation Context: ...y clusters or groupings of variables, but also functions on the data. More specifically, we construct a multi-scale orthonormal basis on a hierarchical tree. As in standard multi-resolution analysis (Mallat, 1998), the treelet algorithm provides a set of “scaling functions” defined on nested subspaces V_0 ⊃ V_1 ⊃ ... ⊃ V_L, and a set of orthogonal “detail functions” defined on residual spaces {W_ℓ}_{ℓ=1}^{L} wher...
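Written out, the nested-subspace structure in this snippet is the standard multiresolution decomposition; the coefficient and basis-function symbols below are the generic MRA notation, not necessarily the paper's.

```latex
% Nested scaling spaces and their residual (detail) complements:
V_0 \supset V_1 \supset \cdots \supset V_L, \qquad
V_{\ell-1} = V_\ell \oplus W_\ell, \quad \ell = 1,\dots,L,
% so any x \in V_0 expands in scaling functions \varphi and detail functions \psi:
x = \sum_k s_{L,k}\,\varphi_{L,k} \;+\; \sum_{\ell=1}^{L} \sum_k d_{\ell,k}\,\psi_{\ell,k}.
```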

2266 | Principal Component Analysis
- Jolliffe
- 1986
Citation Context: ...n, 2006). For the second type of sparsity, where the goal is to find a new set of coordinates or features of the data, two standard “variable transformation” methods are principal component analysis (Jolliffe, 2002) and wavelets (Ogden, 1997). Each of these two methods has its own strengths and weaknesses which we briefly discuss here. PCA has gained much popularity due to its simplicity and the unique property...

2096 | Cluster analysis and display of genome-wide expression patterns
- Eisen, Spellman, et al.
- 1998
Citation Context: ...l when sample sizes are small. Other methods also relate to treelets. In recent years, hierarchical clustering methods have been widely used for identifying diseases and groups of co-expressed genes (Eisen et al., 1998; Tibshirani et al., 1999). Many researchers are also developing algorithms that combine gene selection and gene grouping; see e.g. Hastie et al. (2001); Dettling and Bühlmann (2004); Zou and Hastie (...

2041 | Regression shrinkage and selection via the Lasso
- Tibshirani
- 1996
Citation Context: ...various quantities related either to the learning problem at hand or to the representation of the data in the original given variables. Examples include a sparse regression or classification vector (Tibshirani, 1996), and a sparse structure to the covariance or inverse covariance matrix of the given variables (Bickel and Levina, 2007). The second notion is sparsity of the data themselves. Here we are referring t...

1405 | Data clustering: a review
- Jain, Murty, et al.
- 1999
Citation Context: ...adler, 2007). 2 The Treelet Transform In many modern data sets the data are not only high-dimensional but also redundant with many variables related to each other. Hierarchical clustering algorithms (Jain et al., 1999; Xu and Wunsch, 2005) are often used for the organization and grouping of the variables of such data sets. These methods offer an easily interpretable description of the data structure in terms of a...

1339 | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring - Golub, Slonim, et al. - 1999 |

747 | Adapting to Unknown Smoothness via Wavelet Shrinkage
- Donoho, Johnstone
- 1995
Citation Context: ...r mixture error-in-variables model. In contrast to PCA, wavelet methods describe the data in terms of localized basis functions. The representations are multi-scale, and for smooth data, also sparse (Donoho and Johnstone, 1995). Wavelets are used in many non-parametric statistics tasks, including regression and density estimation. In recent years, wavelet expansions have also been combined with regularization methods to...

741 | Gene selection for cancer classification using support vector machines
- Guyon, Weston, et al.
- 2002
Citation Context: ...ation error (Fig. 9, left). Table 2: Leukemia misclassification rates; courtesy of Zou and Hastie (2005). Method Ten-fold CV error Test error Golub et al. (1999) 3/38 4/34 Support vector machines (Guyon et al., 2002) 2/38 1/34 Nearest shrunken centroids (Tibshirani et al., 2002) 2/38 2/34 Penalized logistic regression (Zhu and Hastie, 2004) 2/38 1/34 Elastic nets (Zou and Hastie, 2005) 3/38 0/34 LDA on treelet f...

531 | Entropy-based algorithms for best-basis selection
- Coifman, Wickerhauser
- 1992
Citation Context: ...reshold value for the similarity measure, or use hypothesis testing for homogeneity of clusters, etc.). In this work, we propose a rather different method that is inspired by the best basis paradigm (Coifman and Wickerhauser, 1992; Saito and Coifman, 1995) in wavelet signal processing. This approach directly addresses the question of how well one can capture information in the data. Consider IID data x_1, ..., x_n, where...

448 | The Dantzig selector: Statistical estimation when p is much larger than n
- Candès, Tao
- 2007
Citation Context: ...ression and density estimation. In recent years, wavelet expansions have also been combined with regularization methods to find regression vectors which are sparse in an a priori known wavelet basis (Candès and Tao, 2005; Donoho and Elad, 2003). The main limitation of wavelets is the implicit assumption of smoothness of the (noiseless) data as a function of its variables. In other words, standard wavelets are not sui...

408 | Regularization and variable selection via the elastic net
- Zou, Hastie
- 2005
Citation Context: ...4/34 Support vector machines (Guyon et al., 2002) 2/38 1/34 Nearest shrunken centroids (Tibshirani et al., 2002) 2/38 2/34 Penalized logistic regression (Zhu and Hastie, 2004) 2/38 1/34 Elastic nets (Zou and Hastie, 2005) 3/38 0/34 LDA on treelet features 2/38 3/34 Two-way treelet decomposition 0/38 1/34 In the second case, we classify the data using a novel two-way treelet decomposition scheme: we first compute tree...

407 | High-dimensional graphs and variable selection with the Lasso
- Meinshausen, Bühlmann
- 2006
Citation Context: ...regularization methods are the lasso (Tibshirani, 1996), regularized covariance estimation methods (Bickel and Levina, 2007; Levina and Zhu, 2007), and variable selection in high-dimensional graphs (Meinshausen and Bühlmann, 2006). For the second type of sparsity, where the goal is to find a new set of coordinates or features of the data, two standard “variable transformation” methods are principal component analysis (Jolliff...

396 | Diagnosis of multiple cancer types by shrunken centroids of gene expression
- Tibshirani, Hastie, et al.
- 2002
Citation Context: ...fication rates; courtesy of Zou and Hastie (2005). Method Ten-fold CV error Test error Golub et al. (1999) 3/38 4/34 Support vector machines (Guyon et al., 2002) 2/38 1/34 Nearest shrunken centroids (Tibshirani et al., 2002) 2/38 2/34 Penalized logistic regression (Zhu and Hastie, 2004) 2/38 1/34 Elastic nets (Zou and Hastie, 2005) 3/38 0/34 LDA on treelet features 2/38 3/34 Two-way treelet decomposition 0/38 1/34 In th...

313 | Model-based clustering, discriminant analysis, and density estimation - Fraley, Raftery - 2002 |

265 | Survey of clustering algorithms
- Xu, Wunsch, et al.
- 2005
Citation Context: ...Treelet Transform In many modern data sets the data are not only high-dimensional but also redundant with many variables related to each other. Hierarchical clustering algorithms (Jain et al., 1999; Xu and Wunsch, 2005) are often used for the organization and grouping of the variables of such data sets. These methods offer an easily interpretable description of the data structure in terms of a dendrogram, and only...

212 | On the distribution of the largest eigenvalue in principal components analysis
- Johnstone
- 2001
Citation Context: ...three blocks C_kk = σ_k² 1_{p_k×p_k} (k = 1, 2, 3), and the last diagonal block having all entries equal to zero. Assume that σ_k ≫ σ for k = 1, 2, 3. This is a specific example of a spiked covariance model (Johnstone, 2001); the three components corresponding to distinct large eigenvalues or “spikes” of a model with background noise. As n → ∞ with p fixed, PCA recovers the hidden vectors v_1, v_2, and v_3, since these thr...
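The blocked covariance structure in this snippet is easy to reproduce numerically. A minimal sketch, with block sizes and strengths chosen by me for illustration (not the paper's values): three constant blocks C_kk = σ_k² 1_{p_k×p_k} plus background noise σ² on the diagonal, whose three "spike" eigenvalues p_k σ_k² + σ² stand well above the noise floor.

```python
import numpy as np

# Illustrative block covariance model (parameter values are my own choices).
p_blocks = [10, 10, 10]          # sizes of the three correlated blocks
sigma_k = [3.0, 2.5, 2.0]        # block strengths, sigma_k >> sigma
sigma = 0.5                      # background noise level
p_noise = 20                     # unstructured variables

p = sum(p_blocks) + p_noise
C = sigma**2 * np.eye(p)         # noise variance on the diagonal
start = 0
for pk, sk in zip(p_blocks, sigma_k):
    C[start:start + pk, start:start + pk] += sk**2   # constant p_k x p_k block
    start += pk

eigvals = np.linalg.eigvalsh(C)[::-1]                # descending order
# Three spikes p_k*sigma_k^2 + sigma^2 dominate; the rest sit at sigma^2.
print(eigvals[:4])   # roughly [90.25, 62.75, 40.25, 0.25]
```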

184 | The Grand Tour: A Tool for Viewing Multidimensional Data
- Asimov
- 1985
Citation Context: ...ilar Variables The treelet algorithm is inspired by the classical Jacobi method for computing eigenvalues of a matrix (Golub and van Loan, 1996). There are also some similarities with the Grand Tour (Asimov, 1985), a visualization tool for viewing multidimensional data through a sequence of orthogonal projections. The main difference from Jacobi’s method — and the reason why the treelet transform, in general,...

161 | Semi-supervised learning on Riemannian manifolds
- Belkin, Niyogi
- 2004
Citation Context: ...rlying structures that can be represented by only a few features. Examples include data where many variables are approximately collinear or highly related, and data that lie on a non-linear manifold (Belkin and Niyogi, 2005; Coifman et al., 2005)¹. While the two notions of sparsity are different, they are clearly related. In fact, a low intrinsic dimensionality of the data typically implies, for example, sparse regres...

158 | Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps
- Coifman, Lafon, et al.
- 2005
Citation Context: ...n be represented by only a few features. Examples include data where many variables are approximately collinear or highly related, and data that lie on a non-linear manifold (Belkin and Niyogi, 2005; Coifman et al., 2005)¹. While the two notions of sparsity are different, they are clearly related. In fact, a low intrinsic dimensionality of the data typically implies, for example, sparse regression or classification...

99 | Essential Wavelets for Statistical Applications and Data Analysis
- Ogden
- 1997
Citation Context: ...of sparsity, where the goal is to find a new set of coordinates or features of the data, two standard “variable transformation” methods are principal component analysis (Jolliffe, 2002) and wavelets (Ogden, 1997). Each of these two methods has its own strengths and weaknesses which we briefly discuss here. PCA has gained much popularity due to its simplicity and the unique property of providing a sequence of...

61 | Prediction by Supervised Principal Components
- Bair, Hastie, et al.
- 2006
Citation Context: ...= σ_1² 1_{p_1×p_2} and C_23 = σ_2² 1_{p_2×p_3}. Consider a numerical example with n = 100 observations, p = 500 variables, and noise level σ = 0.5. We choose the same form for the components u_1, u_2, u_3 as in (Bair et al., 2006), but associate the first two components with overlapping loading vectors v_1 and v_2. Specifically, the components are given by u_1 = ±0.5 with equal probability, u_2 = I(U_2 < 0.4), and u_3 = I(U_3 < 0.3)...

51 | Learning to remove internet advertisements
- Kushmerick
- 1999
Citation Context: ...mselves contain information about structure in data.) 5.2 A Classification Example with an Internet Advertisement Data Set Here we study an internet advertisement data set from the UCI ML repository (Kushmerick, 1999). This is an example of an unordered data set of high dimension where many variables are collinear. After removal of the first three continuous variables, this set contains 1555 binary variables and...

50 | Estimating high-dimensional directed acyclic graphs with the PC-algorithm - Kalisch, Bühlmann - 2007 |

48 | Geometric Representation of High Dimension, Low Sample Size Data
- Hall, Marron, et al.
- 2005
Citation Context: ...metimes becomes evident only in a different basis representation of the data. ¹ A referee pointed out that another issue with sparsity is that very high-dimensional spaces have very simple structure (Hall et al., 2005; Murtagh, 2004; Ahn and Marron, 2004). In either case, to take advantage of sparsity, one constrains the set of possible parameters of the problem. For the first kind of sparsity, two key tools are...

47 | Classification of gene microarrays by penalized logistic regression
- Zhu, Hastie
- 2004
Citation Context: ...CV error Test error Golub et al. (1999) 3/38 4/34 Support vector machines (Guyon et al., 2002) 2/38 1/34 Nearest shrunken centroids (Tibshirani et al., 2002) 2/38 2/34 Penalized logistic regression (Zhu and Hastie, 2004) 2/38 1/34 Elastic nets (Zou and Hastie, 2005) 3/38 0/34 LDA on treelet features 2/38 3/34 Two-way treelet decomposition 0/38 1/34 In the second case, we classify the data using a novel two-way treel...

41 | Clustering methods for the analysis of DNA microarray data
- Tibshirani, Hastie, et al.
- 1999
Citation Context: ...are small. Other methods also relate to treelets. In recent years, hierarchical clustering methods have been widely used for identifying diseases and groups of co-expressed genes (Eisen et al., 1998; Tibshirani et al., 1999). Many researchers are also developing algorithms that combine gene selection and gene grouping; see e.g. Hastie et al. (2001); Dettling and Bühlmann (2004); Zou and Hastie (2005) among others, and s...

38 | Supervised harvesting of expression trees
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...st kind of sparsity, two key tools are graphical models (Whittaker, 2001) that assume statistical dependence between specific variables, and regularization methods that penalize non-sparse solutions (Hastie et al., 2001). Examples of such regularization methods are the lasso (Tibshirani, 1996), regularized covariance estimation methods (Bickel and Levina, 2007; Levina and Zhu, 2007), and variable selection in high-d...

37 | Sparse principal component analysis
- Johnstone, Lu
- 2004
Citation Context: ...ariables p is much larger than the number of observations n, the true underlying principal factors may be masked by the noise, yielding an inconsistent estimator in the joint limit p, n → ∞, p/n → c (Johnstone and Lu, 2004). Even for a finite sample size n, this property of PCA and other global methods (such as partial least squares and ridge regression) can lead to large prediction errors in regression and classificat...
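The inconsistency phenomenon this snippet describes is easy to see in simulation. Below is a self-contained illustration in a single-spike model x = u·v + noise; the dimensions, noise level, and sparse spike direction are my own illustrative choices, not values from the paper. When p/n is large, the sample leading eigenvector is nearly orthogonal to the true direction; when p/n is small, it aligns well.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 500
v = np.zeros(p)
v[:10] = 1 / np.sqrt(10)          # true unit-norm principal direction (assumed)

def leading_pc_alignment(n):
    """|cos angle| between the sample leading PC and the true direction v."""
    U = rng.normal(size=(n, 1))                      # latent component scores
    X = U @ v[None, :] + rng.normal(size=(n, p))     # unit-variance noise
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return abs(Vt[0] @ v)

print(leading_pc_alignment(n=50))     # p/n = 10: estimate is nearly random
print(leading_pc_alignment(n=5000))   # p/n = 0.1: estimate aligns with v
```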

31 | Maximal sparsity representation via l1 minimization
- Donoho, Elad
- 2003
Citation Context: ...timation. In recent years, wavelet expansions have also been combined with regularization methods to find regression vectors which are sparse in an a priori known wavelet basis (Candès and Tao, 2005; Donoho and Elad, 2003). The main limitation of wavelets is the implicit assumption of smoothness of the (noiseless) data as a function of its variables. In other words, standard wavelets are not suited for the analysis of...

30 | Improved Linear Discrimination Using Time-frequency Dictionaries
- Buckheit, Donoho
- 1995
Citation Context: ...en for a finite sample size n, this property of PCA and other global methods (such as partial least squares and ridge regression) can lead to large prediction errors in regression and classification (Buckheit and Donoho, 1995; Nadler and Coifman, 2005b). Eq. 25 in our paper, for example, gives an estimate of the finite-n regression error for a linear mixture error-in-variables model. In contrast to PCA, wavelet methods de...

30 | Finding predictive gene groups from microarray data - Dettling, Bühlmann - 2004 |

30 | Constructions of local orthonormal bases for classification and regression
- Coifman, Saito
- 1994
Citation Context: ...measure, or use hypothesis testing for homogeneity of clusters, etc.). In this work, we propose a rather different method that is inspired by the best basis paradigm (Coifman and Wickerhauser, 1992; Saito and Coifman, 1995) in wavelet signal processing. This approach directly addresses the question of how well one can capture information in the data. Consider IID data x_1, ..., x_n, where x_i ∈ R^p is a p-dimensi...


26 | Finite sample approximation results for principal component analysis: A matrix perturbation approach
- Nadler
- 2008
Citation Context: ...in Sec. 5, we apply our method to classification of hyperspectral data, internet advertisements, and gene expression arrays. A preliminary version of this paper was presented at AISTATS-07 (Lee and Nadler, 2007). 2 The Treelet Transform In many modern data sets the data are not only high-dimensional but also redundant with many variables related to each other. Hierarchical clustering algorithms (Jain et al....

25 | Discriminant feature extraction using empirical probability density and a local basis library
- Saito, Coifman, et al.
- 2002
Citation Context: ...7 shows the average CV error rate as a function of the number of local discriminant features. (As a comparison, we show similar results for Haar-Walsh wavelet packets and a local discriminant basis (Saito et al., 2002) which use the same discriminant score to search through a library of orthonormal wavelet bases.) The straight line represents the error rate if we apply a Gaussian classifier directly to the 28 comp...

24 | Bootstrap tests and confidence regions for functions of a covariance matrix - Beran, Srivastava - 1985 |

22 | On ultrametricity, data coding, and computation
- Murtagh
- 2004
Citation Context: ...dent only in a different basis representation of the data. ¹ A referee pointed out that another issue with sparsity is that very high-dimensional spaces have very simple structure (Hall et al., 2005; Murtagh, 2004; Ahn and Marron, 2004). In either case, to take advantage of sparsity, one constrains the set of possible parameters of the problem. For the first kind of sparsity, two key tools are graphical mode...

17 | Regularized estimation of large covariance matrices
- Bickel, Levina
- 2008
Citation Context: ...iginal given variables. Examples include a sparse regression or classification vector (Tibshirani, 1996), and a sparse structure to the covariance or inverse covariance matrix of the given variables (Bickel and Levina, 2007). The second notion is sparsity of the data themselves. Here we are referring to a situation where the data, despite their apparent high dimensionality, are highly redundant with underlying structure...

16 | The Haar wavelet transform of a dendrogram
- Murtagh
- 2007
Citation Context: ...ds, standard wavelets are not suited for the analysis of unordered data. Thus, some work suggests first sorting the data, and then applying fixed wavelets to the reordered data (Murtagh et al., 2000; Murtagh, 2007). In this paper, we propose an adaptive method for multi-scale representation and eigenanalysis of data where the variables can occur in any given order. We call the construction treelets, as the met...

16 | Searching for Interacting Features
- Zhao, Liu
- 2007
Citation Context: ...s shown in the right column of Table 1, treelets can further increase the predictive performance on this data set, yielding results competitive with other feature selection methods in the literature (Zhao and Liu, 2007). All results in Table 1 are averaged over 25 different simulations. As in Sec. 4.2, the results are achieved at a level L < p−1, by projecting the data onto the treelet scaling functions, i.e. by on...

14 | The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration - Nadler, Coifman |

10 | Overcoming the curse of dimensionality in clustering by means of the wavelet transform
- Murtagh, Berry
- 2000
Citation Context: ...ariables. In other words, standard wavelets are not suited for the analysis of unordered data. Thus, some work suggests first sorting the data, and then applying fixed wavelets to the reordered data (Murtagh et al., 2000; Murtagh, 2007). In this paper, we propose an adaptive method for multi-scale representation and eigenanalysis of data where the variables can occur in any given order. We call the construction treel...

9 | Net analyte signal calculation in multivariate calibration
- Lorber, Faber, et al.
- 1997
Citation Context: ...cted. For example, in the case of only two components, we have that v_y = v_1 − ((v_1 · v_2)/‖v_2‖²) v_2 (Eq. 23). The vector v_y plays a central role in chemometrics, where it is known as the net analyte signal (Lorber et al., 1997; Nadler and Coifman, 2005a). Using this vector for regression yields a mean squared error of prediction E{(ŷ − y)²} = σ²/‖v_y‖². We remark that similar to shrinkage in point estimation, there exist...
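The two-component formula in this snippet, v_y = v_1 − ((v_1·v_2)/‖v_2‖²) v_2 (as reconstructed from the garbled extraction), is just the component of v_1 orthogonal to the interfering direction v_2. A minimal sketch with made-up vectors; the function name and the least-squares generalization to several interferents are my own illustration, not the paper's code.

```python
import numpy as np

def net_analyte_signal(v1, others):
    """Component of v1 orthogonal to the span of the interfering
    directions in `others` (the net analyte signal)."""
    V = np.column_stack(others)
    coef, *_ = np.linalg.lstsq(V, v1, rcond=None)   # least-squares projection
    return v1 - V @ coef                            # residual: orthogonal part

# Two-component case: reduces exactly to v1 - (v1.v2 / ||v2||^2) v2.
v1 = np.array([1.0, 1.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0])
vy = net_analyte_signal(v1, [v2])
print(vy)        # ≈ [1.0, 0.5, -0.5]
print(vy @ v2)   # ≈ 0: orthogonal to the interferent
```

A larger ‖v_y‖ then means a smaller prediction error σ²/‖v_y‖², matching the snippet.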


7 | Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators
- Gruber
- 1998
Citation Context: ...his vector for regression yields a mean squared error of prediction E{(ŷ − y)²} = σ²/‖v_y‖². We remark that similar to shrinkage in point estimation, there exist biased estimators with smaller MSE (Gruber, 1998; Nadler and Coifman, 2005b), but for large signal to noise ratios (σ/‖v_y‖ ≪ 1), such shrinkage is negligible. Many regression methods (including multivariate least squares, partial least squares (PLS...

7 | Partial least squares, Beer's law and the net analyte signal: statistical modeling and analysis - Nadler - 2005 |

4 | Detection of malignancy in cytology specimens using spectral-spatial analysis - Angeletti, Harvey, et al. - 2005 |


3 | The local Karhunen–Loève basis
- Coifman, Saito
- 1996
Citation Context: ...fully adaptive nature of the treelet algorithm — a property that sets treelets apart from methods that use fixed wavelets on a dendrogram (Murtagh, 2007), or adaptive basis functions on fixed trees (Coifman and Saito, 1996); see Remark 2 for a concrete example. Lemma 1 Assume that x = (x_1, x_2, ..., x_p)^T is a random vector with distribution F, mean 0 and covariance matrix Σ = σ² 1_{p×p}, where 1_{p×p} denotes a p...