## Principal Direction Divisive Partitioning (1997)

Venue: Data Mining and Knowledge Discovery

Citations: 104 (22 self)

### BibTeX

```bibtex
@article{Boley97principaldirection,
  author  = {Daniel Boley},
  title   = {Principal Direction Divisive Partitioning},
  journal = {Data Mining and Knowledge Discovery},
  year    = {1997},
  volume  = {2},
  pages   = {325--344}
}
```

### Abstract

We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high-dimensional Euclidean space (i.e., one in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by repeatedly splitting clusters into smaller clusters. The splits are not based on any distance or similarity measure. The documents are assembled into a matrix which is very sparse. It is this sparsity that permits the algorithm to be very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
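
The splitting step the abstract describes can be sketched in a few lines: center the document vectors on their centroid, take the leading principal direction, and split by the sign of each document's projection onto it. This is a minimal dense-matrix sketch (`pddp_split` is a hypothetical name; the paper's efficiency comes from exploiting the sparse term-document matrix, which this toy ignores):

```python
import numpy as np

def pddp_split(docs):
    """One PDDP split: center the rows (documents) on their centroid,
    compute the leading principal direction (leading right singular
    vector of the centered matrix), and partition documents by the
    sign of their projection onto that direction."""
    centroid = docs.mean(axis=0)
    centered = docs - centroid
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0] >= 0  # boolean mask over documents

# Hypothetical usage: two well-separated blobs of 3-D "document" vectors.
rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(-4, 0.5, (8, 3)), rng.normal(4, 0.5, (8, 3))])
mask = pddp_split(docs)
```

Recursing this split on each resulting cluster yields the binary tree of clusters the paper builds; which side of the split is `True` depends on the (arbitrary) sign of the singular vector.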

### Citations

3921 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

3124 | Introduction to Modern Information Retrieval - Salton, McGill - 1983

Citation Context: ... $d_i = \mathrm{TF}_i \big/ \sqrt{\sum_j (\mathrm{TF}_j)^2}$ (3), where $\mathrm{TF}_i$ is the number of occurrences of word $i$ in the particular document $d$. We refer to the scaling (3) as "norm scaling." An alternative scaling is the tf-idf scaling [13], defined as follows: let $\tilde d_i = \bigl(1 + \mathrm{TF}_i/\max_j(\mathrm{TF}_j)\bigr)\,\log\bigl(1/\mathrm{DF}_i\bigr)$, then $d_i = \tilde d_i \big/ \sqrt{\sum_j (\tilde d_j)^2}$ (3.1), where $\mathrm{DF}_i$ is the number of different documents in which the $i$-th word appears, among ...
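
The two scalings quoted above can be sketched as follows. The function names are hypothetical, and because the quoted idf term is truncated in the snippet, `tfidf_scaling` assumes the common $\log(n/\mathrm{DF}_i)$ idf form rather than the paper's exact expression:

```python
import numpy as np

def norm_scaling(tf):
    """Norm scaling (Eq. 3): d_i = TF_i / sqrt(sum_j TF_j^2),
    i.e. scale the raw term-frequency vector to unit Euclidean length."""
    tf = np.asarray(tf, dtype=float)
    return tf / np.sqrt((tf ** 2).sum())

def tfidf_scaling(tf, df, n_docs):
    """tf-idf scaling in the spirit of Eq. 3.1: damp term frequencies,
    weight by idf (assumed log(n/DF_i) form), then normalize the
    resulting vector to unit Euclidean length."""
    tf = np.asarray(tf, dtype=float)
    df = np.asarray(df, dtype=float)
    d = (1.0 + tf / tf.max()) * np.log(n_docs / df)
    return d / np.sqrt((d ** 2).sum())

# Toy example: a two-word document profile.
v_norm = norm_scaling([3, 4])          # -> [0.6, 0.8]
v_tfidf = tfidf_scaling([2, 1], df=[1, 2], n_docs=4)
```

Either way, every document column ends up with unit length, so the subsequent splits compare directions rather than raw counts.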

2152 | Algorithms for Clustering Data - Jain, Dubes - 1988

Citation Context: ...hes to document clustering are generally based on either probabilistic methods, or distance and similarity measures (see [6]). Distance-based methods such as k-means analysis, hierarchical clustering [9] and nearest-neighbor clustering [10] use a selected set of words (features) appearing in different documents as the dimensions. Each such feature vector, representing a document, can be viewed as a p...
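
The distance-based methods this context names can be illustrated with a minimal Lloyd's k-means loop over feature vectors. This is a generic sketch, not any of the cited implementations; the deterministic initialization (one seed point per group) is chosen purely for reproducibility:

```python
import numpy as np

def kmeans(points, init, iters=10):
    """Minimal Lloyd's k-means: repeatedly assign each point to its
    nearest centroid, then recompute each centroid as its cluster mean."""
    centroids = init.astype(float).copy()
    for _ in range(iters):
        # Pairwise Euclidean distances, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated groups; seed one centroid in each group.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(-5, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
labels, centroids = kmeans(points, init=points[[0, -1]])
```

Note the contrast with PDDP: k-means iterates over distances to centroids, whereas PDDP's splits use no distance or similarity measure at all.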

710 | Cluster Analysis for Applications - Anderberg - 1973

Citation Context: ...every stage the clusters are disjoint and their union equals the entire set of documents. The word "Divisive" comes from the following taxonomy of clustering algorithms in [12, p. 298] (originally from [1]): 1. hierarchical agglomerative clustering; 2. hierarchical divisive clustering; 3. iterative partitioning; 4. density search clustering; 5. factor analytic clustering; 6. clumping; 7. graph-theoretic cl...

625 | Statistical Analysis of Finite Mixture Distributions - Titterington, Smith, et al. - 1985

Citation Context: ...h such feature vector, representing a document, can be viewed as a point in this multi-dimensional space. AutoClass [3] is a method using Bayesian analysis based on probabilistic mixture modeling [14]. Given a data set, it finds maximum parameter values for specific probability distribution functions of the clusters. There are a number of problems with clustering in a multi-dimensional space usin...
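
The mixture modeling this context refers to can be sketched with a tiny EM loop for a two-component 1-D Gaussian mixture. This is an illustrative toy of the general technique, not AutoClass's actual Bayesian machinery, and `em_gmm_1d` is a hypothetical name:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """Tiny EM for a two-component 1-D Gaussian mixture:
    E-step computes responsibilities, M-step re-estimates the
    mixing weights, means, and variances."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])      # crude init at the extremes
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of the mixture parameters.
        n = r.sum(axis=0)
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return pi, mu, var

# Toy data: two clearly separated components.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 0.5, 200), rng.normal(10.0, 0.5, 200)])
pi, mu, var = em_gmm_1d(x)
```

Unlike PDDP's hard splits, the mixture model assigns each point a soft membership in every component.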

622 | Scatter/gather: a cluster-based approach to browsing large document collections - Cutting, Karger, et al. - 1992

Citation Context: ...cluster and proceeds to divide up the initial cluster into progressively smaller clusters. As a hierarchical divisive algorithm, the PDDP algorithm would be appropriate for the scatter/gather task of [4]. The scatter/gather task consists of a "scatter" of a large collection of documents into a "small" number of clusters, and a "gather" task in which certain of those resulting clusters are combined. T...

533 | Using linear algebra for intelligent information retrieval - Berry, Dumais, et al. - 1995

Citation Context: ...ures collected using these methods still tends to be very large, and determining which features can be discarded without affecting the quality of clusters is difficult. Latent Semantic Indexing (LSI) [2] was proposed as a method for query-based document retrieval in which the noise present in data sets of very high dimensionality is reduced by orthogonal projection. A low rank (say, rank k ≪ min{m, ...
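
The orthogonal projection LSI relies on is a truncated SVD. A small sketch of the rank-k approximation, with a random matrix standing in for a real term-document matrix:

```python
import numpy as np

# Random stand-in for an m x n term-document matrix.
rng = np.random.default_rng(3)
a = rng.random((20, 8))

u, s, vt = np.linalg.svd(a, full_matrices=False)
k = 2  # retained rank, k << min(m, n)
a_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k]  # best rank-k approximation

# By the Eckart-Young theorem, the Frobenius error of the rank-k
# truncation equals the norm of the discarded singular values.
err = np.linalg.norm(a - a_k)
```

The same SVD machinery underlies PDDP, but PDDP needs only the single leading singular vector of each (centered) cluster rather than a full rank-k factorization.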

485 | Information Retrieval: Data Structures and Algorithms - Frakes, Baeza-Yates - 1992

Citation Context: ...rithms typically have $O(m^2)$ running time [4]. 4 Related Work Existing approaches to document clustering are generally based on either probabilistic methods, or distance and similarity measures (see [6]). Distance-based methods such as k-means analysis, hierarchical clustering [9] and nearest-neighbor clustering [10] use a selected set of words (features) appearing in different documents as the dime...

480 | Bayesian classification (AutoClass): Theory and results - Cheeseman, Stutz - 1995

Citation Context: ...cted set of words (features) appearing in different documents as the dimensions. Each such feature vector, representing a document, can be viewed as a point in this multi-dimensional space. AutoClass [3] is a method using Bayesian analysis based on probabilistic mixture modeling [14]. Given a data set, it finds maximum parameter values for specific probability distribution functions of the clust...

75 | WebACE: a web agent for document categorization and exploration - Han, Boley, et al. - 1998

Citation Context: ...responding to the 185 documents, and 10536 rows each corresponding to a word. Then further heuristics were applied to obtain the matrices for data sets J2 through J11. These are summarized in Table 3 [8]. We applied two algorithms to this data set, the divisive PDDP algorithm (Table 2), and an agglomerative algorithm [5]. The agglomerative algorithm is shown only for comparative purposes, and is brie...

46 | A sentence-to-sentence clustering procedure for pattern analysis - Lu, Fu - 1978

Citation Context: ...ally based on either probabilistic methods, or distance and similarity measures (see [6]). Distance-based methods such as k-means analysis, hierarchical clustering [9] and nearest-neighbor clustering [10] use a selected set of words (features) appearing in different documents as the dimensions. Each such feature vector, representing a document, can be viewed as a point in this multi-dimensional space....

43 | Pattern Recognition Engineering - Nadler, Smith - 1993

Citation Context: ... in one cluster, so that at every stage the clusters are disjoint and their union equals the entire set of documents. The word "Divisive" comes from the following taxonomy of clustering algorithms in [12, p. 298] (originally from [1]): 1. hierarchical agglomerative clustering; 2. hierarchical divisive clustering; 3. iterative partitioning; 4. density search clustering; 5. factor analytic clustering; 6. clumping ...

26 | Web page categorization and feature selection using association rule and principal component clustering - Moore, Han, et al. - 1997

Citation Context: ...her investigation. 1 Introduction Unsupervised clustering of documents is a critical component for the exploration of large unstructured document sets. As part of a larger project, the WebACE Project [11, 8], we have developed a novel algorithm for unsupervised partitioning of a large data set which exhibits several useful features. These features include scalability to large data sets, competitive ...