## Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition (2003)

Venue: SIAM Journal on Matrix Analysis and Applications

Citations: 43 (19 self)

### BibTeX

```bibtex
@ARTICLE{Howland03structurepreserving,
  author  = {Peg Howland and Moongu Jeon and Haesun Park},
  title   = {Structure preserving dimension reduction for clustered text data
             based on the generalized singular value decomposition},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  year    = {2003},
  volume  = {25},
  pages   = {165--179}
}
```

### Abstract

In today's vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower-dimensional representation must be a good approximation of the full document set. To that end, we adapt and extend the discriminant analysis projection used in pattern recognition. This projection preserves cluster structure by maximizing the scatter between clusters while minimizing the scatter within clusters. A common limitation of trace optimization in discriminant analysis is that one of the scatter matrices must be nonsingular, which restricts its application to document sets in which the number of terms does not exceed the number of documents. We show that by using the generalized singular value decomposition (GSVD), we can achieve the same goal regardless of the relative dimensions of the term-document matrix. In addition, applying the GSVD allows us to avoid the explicit formation of the scatter matrices in favor of working directly with the data matrix, thus improving the numerical properties of the approach. Finally, we present experimental results that confirm the effectiveness of our approach.
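The criterion the abstract describes can be made concrete with a small sketch. This is not the paper's LDA/GSVD algorithm but the classical discriminant projection it generalizes: it forms the scatter matrices explicitly and solves an eigenproblem on $S_w^{-1} S_b$, so it requires $S_w$ to be nonsingular (more documents than terms), which is exactly the limitation the GSVD approach removes. All data and names here are illustrative:

```python
import numpy as np

def scatter_matrices(A, labels):
    """Within- and between-cluster scatter of the columns of A."""
    m, _ = A.shape
    c = A.mean(axis=1, keepdims=True)           # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for i in np.unique(labels):
        Ai = A[:, labels == i]
        ci = Ai.mean(axis=1, keepdims=True)     # cluster centroid
        Sw += (Ai - ci) @ (Ai - ci).T
        Sb += Ai.shape[1] * (ci - c) @ (ci - c).T
    return Sw, Sb

def classical_lda(A, labels, dim):
    """Leading eigenvectors of Sw^{-1} Sb; needs Sw nonsingular."""
    Sw, Sb = scatter_matrices(A, labels)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:dim]].real            # columns span the projection

rng = np.random.default_rng(0)
# 5 terms, 60 documents in k = 3 clusters (n > m, so Sw is nonsingular)
A = np.hstack([rng.normal(loc=i, size=(5, 20)) for i in range(3)])
labels = np.repeat([0, 1, 2], 20)
G = classical_lda(A, labels, dim=2)             # k - 1 = 2 directions
print(G.shape)
```

When the number of terms exceeds the number of documents, as in the 7519 x 200 MEDLINE matrix discussed below, `np.linalg.solve(Sw, Sb)` fails because $S_w$ is singular; the paper's GSVD formulation handles that case.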

### Citations

3423 |
Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...n, text classification AMS subject classifications. 15A09, 68T10, 62H30, 65F15, 15A18 PII. S0895479801393666 1. Introduction. The vector space–based information retrieval system, originated by Salton [13, 14], represents documents as vectors in a vector space. The document set comprises an $m \times n$ term-document matrix $A = (a_{ij})$, in which each column represents a document and each entry $a_{ij}$ represents the wei...
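The snippet above defines the term-document matrix $A = (a_{ij})$. A minimal, purely illustrative construction using raw term-frequency weights (real systems use weighted frequencies such as tf-idf; the example documents are invented):

```python
from collections import Counter

docs = ["matrix decomposition of sparse matrix data",
        "cluster structure of text data",
        "singular value decomposition"]

# m terms (rows), n documents (columns); a_ij = frequency of term i in doc j
vocab = sorted({w for d in docs for w in d.split()})
A = [[Counter(d.split())[t] for d in docs] for t in vocab]
print(len(A), len(A[0]))    # m x n
```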

2928 |
Introduction to Statistical Pattern Recognition, Electrical Science Series
- Fukunaga
- 1972
Citation Context ...ace that best approximates the document collection in the full space [8, 12]. The specific method we present in this paper is based on the discriminant analysis projection used in pattern recognition [4, 15]. Its goal is to find the mapping that transforms each column of A into a column in the lower-dimensional space, while preserving the cluster structure of the full data matrix. This is accomplished by...

2334 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...the matrix $S_b$ is defined as $S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c^{(i)} - c)(c^{(i)} - c)^T = \sum_{i=1}^{k} n_i (c^{(i)} - c)(c^{(i)} - c)^T$. Finally, the mixture scatter matrix is defined as $S_m = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T$. It is easy to show [7] that the scatter matrices have the relationship (4) $S_m = S_w + S_b$. Writing $a_j - c = a_j - c^{(i)} + c^{(i)} - c$ for $j \in N_i$, we have (5)-(7) $S_m = \ldots$ ...
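The identity (4) quoted above, $S_m = S_w + S_b$, is easy to check numerically. A small sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 30))            # 4 terms, 30 documents
labels = np.repeat([0, 1, 2], 10)       # k = 3 clusters N_1, N_2, N_3
c = A.mean(axis=1, keepdims=True)       # global centroid

Sw = np.zeros((4, 4))
Sb = np.zeros((4, 4))
for i in range(3):
    Ai = A[:, labels == i]
    ci = Ai.mean(axis=1, keepdims=True)
    Sw += (Ai - ci) @ (Ai - ci).T                 # within-cluster scatter
    Sb += Ai.shape[1] * (ci - c) @ (ci - c).T     # between-cluster scatter
Sm = (A - c) @ (A - c).T                          # mixture scatter

print(np.allclose(Sm, Sw + Sb))         # the identity S_m = S_w + S_b
```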

666 |
Numerical Methods for Least Squares Problems
- Björck
- 1996
Citation Context ...rical properties of the approach by working with the data matrix directly rather than forming the scatter matrices explicitly. Our algorithm follows the generalized singular value decomposition (GSVD) [2, 5, 16] as formulated by Paige and Saunders [11]. For a data matrix with k clusters, we can limit our computation to the generalized right singular vectors that correspond to the k − 1 largest generalized sing...

579 |
Solving least squares problems
- Lawson, Hanson
- 1974
Citation Context ...$> \alpha_{r+1} \ge \cdots \ge \alpha_{r+s} > 0$, $0 < \beta_{r+1} \le \cdots \le \beta_{r+s} < 1$, $\alpha_i^2 + \beta_i^2 = 1$ for $i = r+1, \ldots, r+s$. Paige and Saunders gave a constructive proof of Theorem 2, which starts with the complete orthogonal decomposition [5, 2, 10] of $K$, or (21) $P^T K Q = \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix}$, where $P$ and $Q$ are orthogonal and $R$ is nonsingular with the same rank as $K$. The construction proceeds by exploiting the SVDs of submatrices of P. Partitioning P a...
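The complete orthogonal decomposition quoted above, $P^T K Q$ in block form [R 0; 0 0] with $R$ nonsingular, can be computed several ways; the cited proof's exact construction is not reproduced here. One standard route, sketched under the assumption that SciPy's pivoted QR is available, is QR with column pivoting followed by a second QR that triangularizes the leading rows:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
K = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))    # 6 x 5, rank 2

P, R1, piv = qr(K, pivoting=True)        # K[:, piv] = P @ R1
r = int(np.sum(np.abs(np.diag(R1)) > 1e-10 * np.abs(R1[0, 0])))  # rank
Q2, _ = qr(R1[:r].T)                     # triangularize the leading r rows
Pi = np.eye(K.shape[1])[:, piv]          # column permutation as a matrix
Q = Pi @ Q2                              # fold the pivoting into Q

M = P.T @ K @ Q                          # block form [R 0; 0 0]
print(r, np.allclose(M[r:, :], 0), np.allclose(M[:r, r:], 0))
```

Here the leading $r \times r$ block of $M$ is (lower) triangular and nonsingular, which is all the Paige-Saunders construction requires of $R$.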

574 | Using linear algebra for intelligent information retrieval
- Berry, Dumais, et al.
- 1996
Citation Context ...ts a document and each entry $a_{ij}$ represents the weighted frequency of term i in document j. A major benefit of this representation is that the algebraic structure of the vector space can be exploited [1]. Modern document sets are huge [3], so we need to find a lower-dimensional representation of the data. To achieve higher efficiency in manipulating the data, it is often necessary to reduce the dimen...

512 |
Information Retrieval: Data Structures and Algorithms
- Frakes, Baeza-Yates
- 1992
Citation Context ...presents the weighted frequency of term i in document j. A major benefit of this representation is that the algebraic structure of the vector space can be exploited [1]. Modern document sets are huge [3], so we need to find a lower-dimensional representation of the data. To achieve higher efficiency in manipulating the data, it is often necessary to reduce the dimension severely. Since this may resul...

503 |
Pattern Recognition
- Theodoridis, Koutroumbas
- 1999
Citation Context ...ace that best approximates the document collection in the full space [8, 12]. The specific method we present in this paper is based on the discriminant analysis projection used in pattern recognition [4, 15]. Its goal is to find the mapping that transforms each column of A into a column in the lower-dimensional space, while preserving the cluster structure of the full data matrix. This is accomplished by...

176 |
The Smart Retrieval System
- Salton
- 1971
Citation Context ...n, text classification AMS subject classifications. 15A09, 68T10, 62H30, 65F15, 15A18 PII. S0895479801393666 1. Introduction. The vector space–based information retrieval system, originated by Salton [13, 14], represents documents as vectors in a vector space. The document set comprises an $m \times n$ term-document matrix $A = (a_{ij})$, in which each column represents a document and each entry $a_{ij}$ represents the wei...

130 |
Information Retrieval Systems - Theory and Implementation
- Kowalski
- 1997
Citation Context ...use five categories of abstracts from the MEDLINE 1 database. Each category has 40 documents. The total number of terms is 7519 (see Table 2) after preprocessing with stopping and stemming algorithms [9]. For this 7519 × 200 term-document matrix, the original discriminant analysis breaks down, since $S_w$ is singular. However, our improved LDA/GSVD method circumvents this singularity problem. 1 http://w...

95 |
Towards a generalized singular value decomposition
- Paige, Saunders
- 1981
Citation Context ...h the data matrix directly rather than forming the scatter matrices explicitly. Our algorithm follows the generalized singular value decomposition (GSVD) [2, 5, 16] as formulated by Paige and Saunders [11]. For a data matrix with k clusters, we can limit our computation to the generalized right singular vectors that correspond to the k − 1 largest generalized singular values. In this way, our algorithm r...

58 |
Generalizing the singular value decomposition
- Loan
- 1976
Citation Context ...rical properties of the approach by working with the data matrix directly rather than forming the scatter matrices explicitly. Our algorithm follows the generalized singular value decomposition (GSVD) [2, 5, 16] as formulated by Paige and Saunders [11]. For a data matrix with k clusters, we can limit our computation to the generalized right singular vectors that correspond to the k − 1 largest generalized sing...

39 | Lower dimensional representation of text data based on centroids and least squares
- Park, Jeon, et al.
- 2003
Citation Context ...o reduce the dimension severely. Since this may result in loss of information, we seek a representation in the lower-dimensional space that best approximates the document collection in the full space [8, 12]. The specific method we present in this paper is based on the discriminant analysis projection used in pattern recognition [4, 15]. Its goal is to find the mapping that transforms each column of A in...

9 |
Kernel discriminant analysis based on the generalized singular value decomposition, in preparation
- Park, Park
Citation Context ...at both of the above criteria require $S_w$ to be nonsingular or, equivalently, $H_w$ to have full rank. For more measures of cluster quality, their relationships, and their extension to document data, see [6]. In the lower-dimensional space obtained from the linear transformation $G^T$, the within-cluster, between-cluster, and mixture scatter matrices become $S_w^Y = \sum_{i=1}^{k} \sum_{j \in N_i} \ldots$, $S_b^Y = \ldots$, $S_m^Y = \ldots$ ...
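The snippet above describes how the scatter matrices transform under the mapping $y = G^T a$: each becomes its projected counterpart, e.g. $S_m^Y = G^T S_m G$ for the mixture scatter. A quick numerical check with illustrative data and an arbitrary $G$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 40))             # 6 terms, 40 documents
G = rng.normal(size=(6, 2))              # columns of G define the map G^T

def mixture_scatter(X):
    c = X.mean(axis=1, keepdims=True)    # centroid of the columns
    return (X - c) @ (X - c).T

Y = G.T @ A                              # reduced-dimensional representation
print(np.allclose(mixture_scatter(Y), G.T @ mixture_scatter(A) @ G))
```

The same calculation applied cluster by cluster verifies the within- and between-cluster cases.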

8 |
Dimension reduction based on centroids and least squares for efficient processing of text data
- Jeon, Park, et al.
- 2001
Citation Context ...o reduce the dimension severely. Since this may result in loss of information, we seek a representation in the lower-dimensional space that best approximates the document collection in the full space [8, 12]. The specific method we present in this paper is based on the discriminant analysis projection used in pattern recognition [4, 15]. Its goal is to find the mapping that transforms each column of A in...