Results 1-10 of 42
Survey of clustering data mining techniques
, 2002
Cited by 247 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
ADE4: a multivariate analysis and graphical display software
 Stat. Comput
, 1997
Cited by 45 (8 self)
e searching, zooming, selection of points, and display of data values on factor maps. The user interface is simple and homogeneous among all the programs; this contributes to making the use of ADE4 very easy for nonspecialists in statistics, data analysis or computer science. Keywords: Multivariate analysis, principal component analysis, correspondence analysis, instrumental variables, canonical correspondence analysis, partial least squares regression, co-inertia analysis, graphics, multivariate graphics, interactive graphics, Macintosh, HyperCard, Windows 95 1. Introduction ADE4 is a multivariate analysis and graphical display software for Apple Macintosh and Windows 95 microcomputers. It is made up of several standalone applications, called modules, that feature a wide range of multivariate analysis methods, from simple one-table analysis to three-way table analysis and two-table coupling methods. It also provides many possibilitie
Document clustering via adaptive subspace iteration
 In SIGIR
, 2004
Cited by 28 (6 self)
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI, which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existential clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
The analysis of vegetation-environment relationships by canonical correspondence analysis
, 1987
Cited by 22 (1 self)
Canonical correspondence analysis (CCA) is introduced as a multivariate extension of weighted averaging ordination, which is a simple method for arranging species along environmental variables. CCA constructs those linear combinations of environmental variables, along which the distributions of the species are maximally separated. The eigenvalues produced by CCA measure this separation. As its name suggests, CCA is also a correspondence analysis technique, but one in which the ordination axes are constrained to be linear combinations of environmental variables. The ordination diagram generated by CCA visualizes not only a pattern of community variation (as in standard ordination) but also the main features of the distributions of species along the environmental variables. Applications demonstrate that CCA can be used both for detecting species-environment relations, and for investigating specific questions about the response of species to environmental variables. Questions in community ecology that have typically been studied by 'indirect' gradient analysis (i.e. ordination followed by external interpretation of the axes) can now be answered more directly by CCA.
The Gifi System Of Descriptive Multivariate Analysis
 STATISTICAL SCIENCE
, 1998
Cited by 18 (3 self)
The Gifi system of analyzing categorical data through nonlinear varieties of classical multivariate analysis techniques is reviewed. The system is characterized by the optimal scaling of categorical variables which is implemented through alternating least squares algorithms. The main technique of homogeneity analysis is presented, along with its extensions and generalizations leading to nonmetric principal components analysis and canonical correlation analysis. A brief account of stability issues and areas of applications of the techniques is also given.
Partitioning Networks by Eigenvectors
, 1995
Cited by 7 (1 self)
A survey of published methods for partitioning sparse arrays is presented. These include early attempts to describe the partitioning properties of eigenvectors of the adjacency matrix. More direct methods of partitioning are developed by introducing the Laplacian of the adjacency matrix via the directed (signed) edge-vertex incidence matrix. It is shown that the Laplacian solves the minimization of total length of connections between adjacent nodes, which induces clustering of connected nodes by partitioning the underlying graph. Another matrix derived from the adjacency matrix is also introduced via the unsigned edge-vertex matrix. This (the Normal) matrix is not symmetric, and it also is shown to solve the minimization of total length in its own non-Euclidean metric. In this case partitions are induced by clustering the connected nodes. The Normal matrix is closely related to Correspondence Analysis.
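As a rough illustration of the Laplacian construction mentioned in this abstract, the Python sketch below builds the directed (signed) edge-vertex incidence matrix B of a small graph, forms L = B^T B, and evaluates the quadratic form x^T L x. The example graph, the node coordinates, all function names, and the reading of "total length" as a sum of squared coordinate differences over edges are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical 4-node graph (illustration only); each edge is (tail, head).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 2)]
N_NODES = 4


def signed_incidence(edges, n):
    """One row per edge: -1 at the tail vertex, +1 at the head vertex."""
    rows = []
    for tail, head in edges:
        row = [0] * n
        row[tail] = -1
        row[head] = 1
        rows.append(row)
    return rows


def laplacian_from_incidence(b):
    """L = B^T B, which equals the degree matrix minus the adjacency matrix."""
    n = len(b[0])
    return [[sum(row[i] * row[j] for row in b) for j in range(n)]
            for i in range(n)]


def quadratic_form(lap, x):
    """Evaluate x^T L x for a 1-D placement x of the nodes."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


def total_squared_length(edges, x):
    """Sum of squared coordinate differences over all edges (assumed cost)."""
    return sum((x[head] - x[tail]) ** 2 for tail, head in edges)


B = signed_incidence(EDGES, N_NODES)
L = laplacian_from_incidence(B)
x = [0.0, 1.0, 3.0, 4.0]  # arbitrary node coordinates on a line

print(L)
print(quadratic_form(L, x), total_squared_length(EDGES, x))
```

Under these assumptions the two printed costs agree for any placement x, which is one way to see why minimizing the Laplacian quadratic form pulls connected nodes together.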
Algorithms for Clustering High Dimensional and Distributed Data
Cited by 4 (0 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. The clustering problem has been widely studied in machine learning, databases, and statistics. This paper studies the problem of clustering high dimensional data. The paper proposes an algorithm called the CoFD algorithm, which is a non-distance-based clustering algorithm for high dimensional spaces. Based on the
Continuous Extensions of Matrix Formulations in Correspondence Analysis, with Applications to the FGM Family of Distributions
 Advances in Econometrics, Kluwer
, 1998
Cited by 3 (3 self)
Correspondence analysis is a multivariate technique used to visualize categorical data, usually data in a two-way contingency table. Some extensions of correspondence analysis to a continuous bivariate distribution are presented, firstly from a canonical correlation analysis perspective and then from a continuous scaling perspective. These extensions are applied to the Farlie-Gumbel-Morgenstern (FGM) family of bivariate distributions with given marginals, and also to a generalization of this family. 1 Introduction Correspondence analysis (CA) is a method designed to give a graphical representation of a contingency table N and thus to interpret the association between rows and columns. To be specific, correspondence analysis visualizes the so-called correspondence matrix P, which is the discrete bivariate density obtained by dividing N by its grand total n: P = (1/n)N. A continuous extension of CA can be obtained by replacing P with a bivariate probability density h(x, y). The marg...
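A minimal Python sketch of the correspondence matrix defined in this excerpt: it divides a contingency table N by its grand total n to obtain P = (1/n)N. The table values and function names are hypothetical. The row and column sums of P are computed as an illustrative addition and are not a reconstruction of the truncated text.

```python
from fractions import Fraction

# Hypothetical 2x3 contingency table of counts (illustration only).
N = [[10, 20, 30],
     [20, 10, 10]]


def correspondence_matrix(table):
    """Return P = (1/n) N, where n is the grand total of the table."""
    n = sum(sum(row) for row in table)
    return [[Fraction(count, n) for count in row] for row in table]


def marginals(p):
    """Row sums and column sums of P (added here for illustration)."""
    row_sums = [sum(row) for row in p]
    col_sums = [sum(col) for col in zip(*p)]
    return row_sums, col_sums


P = correspondence_matrix(N)
row_masses, col_masses = marginals(P)

print(P)
print(row_masses, col_masses)
```

Exact fractions are used so that the entries of P sum to exactly 1, as expected of a discrete bivariate density.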