Results 1–10 of 44
Similarity-based approaches to natural language processing
, 1997
"... Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and naturallanguagebased user interfaces. However, even huge ..."
Abstract

Cited by 40 (3 self)
 Add to MetaCart
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
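The divergence measure named in the abstract can be illustrated concretely. The sketch below (not the author's code) computes the Kullback-Leibler divergence between two discrete distributions; the "next-word" distributions and their probabilities are invented purely for illustration:

```python
import math

def kl_divergence(p, q):
    # D(p || q) for discrete distributions given as dicts mapping
    # outcome -> probability; terms with p[x] == 0 contribute nothing.
    # q must be nonzero wherever p is (in practice q is smoothed).
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Invented next-word distributions for two verbs; a small divergence
# suggests the two events behave similarly, so one can back up sparse
# estimates of the other.
p = {"wine": 0.5, "beer": 0.4, "ideas": 0.1}
q = {"wine": 0.4, "beer": 0.4, "ideas": 0.2}
d = kl_divergence(p, q)
```

Note that D(p‖q) is asymmetric and is zero exactly when the two distributions agree.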
Low-complexity fuzzy relational clustering algorithms for Web mining
 IEEE TRANSACTIONS ON FUZZY SYSTEMS
, 2001
"... This paper presents new algorithms—fuzzy cmedoids (FCMdd) and robust fuzzy cmedoids (RFCMdd)—for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total fuzzy dissimilarity within each clus ..."
Abstract

Cited by 34 (2 self)
 Add to MetaCart
This paper presents new algorithms, fuzzy c-medoids (FCMdd) and robust fuzzy c-medoids (RFCMdd), for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total fuzzy dissimilarity within each cluster is minimized. A comparison of FCMdd with the well-known relational fuzzy c-means algorithm (RFCM) shows that FCMdd is more efficient. We present several applications of these algorithms to Web mining, including Web document clustering, snippet clustering, and Web access log analysis.
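As a rough illustration of the medoid-selection idea, here is a minimal FCMdd-style iteration on a dissimilarity matrix: fuzzy memberships are computed from distances to the current medoids, and each medoid then moves to the object minimizing the membership-weighted cost. Initialization, tie-breaking, and the robust RFCMdd weighting are simplified and not taken from the paper:

```python
def fcmdd(diss, medoids, m=2.0, iters=20, eps=1e-9):
    # diss: symmetric n x n dissimilarity matrix (list of lists);
    # medoids: initial list of c object indices; m: fuzzifier.
    n, c = len(diss), len(medoids)
    u = [[0.0] * n for _ in range(c)]
    for _ in range(iters):
        # standard fuzzy membership update from distances to medoids
        for j in range(n):
            inv = [(1.0 / (diss[v][j] + eps)) ** (1.0 / (m - 1.0)) for v in medoids]
            s = sum(inv)
            for i in range(c):
                u[i][j] = inv[i] / s
        # each medoid becomes the object with least weighted dissimilarity
        new = [
            min(range(n), key=lambda k: sum(u[i][j] ** m * diss[k][j] for j in range(n)))
            for i in range(c)
        ]
        if new == medoids:
            break
        medoids = new
    return medoids, u

# Toy relational data: pairwise distances between six values on a line,
# forming two obvious groups {0, 1, 2} and {10, 11, 12}.
vals = [0, 1, 2, 10, 11, 12]
diss = [[abs(a - b) for b in vals] for a in vals]
medoids, u = fcmdd(diss, [0, 3])
```

On this toy input the medoids settle on one object from each group, with each medoid belonging almost entirely to its own cluster.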
An interior point algorithm for minimum sum of squares clustering
 SIAM J. Sci. Comput
, 1997
"... Abstract. An exact algorithm is proposed for minimum sumofsquares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean mspace into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to wh ..."
Abstract

Cited by 21 (8 self)
 Add to MetaCart
Abstract. An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0–1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0–1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0–1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve the auxiliary problem quickly as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher's 150 iris. Key words. classification and discrimination, cluster analysis, interior-point methods, combinatorial optimization
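The objective the exact algorithm minimizes over all partitions can be sketched directly; the snippet below (illustrative only, nothing like the paper's column-generation machinery) evaluates the within-cluster sum of squared distances to centroids for a given labeling:

```python
def sum_of_squares(points, labels):
    # Within-cluster sum of squared Euclidean distances to centroids:
    # the minimum-sum-of-squares clustering objective for one partition.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for pts in clusters.values():
        dim = len(pts[0])
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in pts)
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = sum_of_squares(pts, [0, 0, 1, 1])  # pairs the nearby points
bad = sum_of_squares(pts, [0, 1, 0, 1])   # pairs the distant points
```

The exact algorithm's contribution is certifying that no labeling beats the best one found, which brute-force evaluation of this objective cannot do at scale.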
A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering
 in Proc. IEEE Intl. Conf. Fuzzy Systems (FUZZ-IEEE '99), Korea
, 1999
"... This paper presents new algorithms (Fuzzy cMedoids FCMdd and Fuzzy c Trimmed Medoids or FCTMdd) for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total dissimilarity within each cluster ..."
Abstract

Cited by 16 (2 self)
 Add to MetaCart
This paper presents new algorithms (Fuzzy c-Medoids or FCMdd and Fuzzy c Trimmed Medoids or FCTMdd) for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total dissimilarity within each cluster is minimized. A comparison of FCMdd with the Relational Fuzzy c-Means algorithm (RFCM) shows that FCMdd is much faster. We present examples of applications of these algorithms to Web document and snippet clustering.
1. Introduction
Object data refers to the situation where the objects to be clustered are represented by vectors x_i ∈ ℝ^p. Relational data refers to the situation where we have only numerical values representing the degrees to which pairs of objects in the data set are related. Algorithms that generate partitions of relational data are usually referred to as relational (or sometimes pairwise) clustering algorithms. Relational clustering is more general in the sense tha...
Automatic Web User Profiling and Personalization Using Robust Fuzzy Relational Clustering
, 2002
"... The proliferation of information on the world wide Web has made the personalization of this information space a necessity. Personalization of content returned from a Web site is a desired feature that can enhance server performance improve system design, and lead to wise marketing decisions in elect ..."
Abstract

Cited by 14 (4 self)
 Add to MetaCart
The proliferation of information on the World Wide Web has made the personalization of this information space a necessity. Personalization of content returned from a Web site is a desired feature that can enhance server performance, improve system design, and lead to wise marketing decisions in electronic commerce. Mining typical user profiles from the vast amount of historical data stored in access logs is an important component of Web personalization. In the absence of a priori knowledge, unsupervised or clustering methods seem to be ideally suited to categorize the usage behavior of Web surfers. In this chapter, we present a framework for mining typical user profiles from server access logs based on robust fuzzy relational clustering. As a by-product of the clustering process that generates robust profiles, associations between different URL addresses on a given site can easily be inferred. In general, the URLs that are present in the same profile tend to be visited together in the same session or form a large itemset. Finally, we present a personalization system that uses previously mined profiles to automatically generate a Web page containing URLs the user might be interested in. Our personalization approach is based on profiles computed from the prior traversal patterns of users on the website and does not require the user to provide any declarative private information or to log in.
A monothetic clustering method
 Pattern Recognition Letters
, 1998
"... Abstract: The proposed divisive clustering method performs simultaneously a hierarchy of a set of objects and a monothetic characterization of each cluster of the hierarchy. A division is performed according to the withincluster inertia criterion which is minimized among the bipartitions induced by ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
Abstract: The proposed divisive clustering method simultaneously builds a hierarchy of a set of objects and a monothetic characterization of each cluster of the hierarchy. A division is performed according to the within-cluster inertia criterion, which is minimized among the bipartitions induced by a set of binary questions. In order to improve the clustering, the algorithm revises at each step the division which induced the cluster chosen for division.
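One division step of a monothetic method can be sketched as: evaluate every candidate binary question, and keep the bipartition with the smallest total within-cluster inertia. The question then doubles as the cluster's characterization ("answers yes to q"). The predicate-style questions below are illustrative, not the paper's formulation:

```python
def best_binary_split(points, questions):
    # Pick, among candidate binary questions (predicates on a point),
    # the one whose induced bipartition has the smallest total
    # within-cluster inertia (sum of squared distances to centroids).
    def inertia(pts):
        if not pts:
            return 0.0
        dim = len(pts[0])
        c = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        return sum(sum((p[d] - c[d]) ** 2 for d in range(dim)) for p in pts)

    best = None
    for q in questions:
        yes = [p for p in points if q(p)]
        no = [p for p in points if not q(p)]
        cost = inertia(yes) + inertia(no)
        if best is None or cost < best[0]:
            best = (cost, q, yes, no)
    return best

# Illustrative 1-D data with candidate threshold questions "x <= t".
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
questions = [lambda p, t=t: p[0] <= t for t in (0.0, 1.0, 5.0, 10.0)]
cost, q, yes, no = best_binary_split(pts, questions)
```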
Clustering in an object-oriented environment
 Journal of Statistical Software
, 1996
"... This paper describes the incorporation of seven standalone clustering programs into SPLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering meth ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
This paper describes the incorporation of seven standalone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface and are compatible with existing S-PLUS functions. We will describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.
Quantitative evaluation of clustering results using computational negative controls
 In Proc. 2004 SIAM Int. Conf. Data Mining
, 2004
"... Most partitionbased cluster analysis methods (e.g., kmeans) will partition any dataset D into k subsets, regardless of the inherent appropriateness of such a partitioning. This paper presents a family of permutationbased procedures to determine both the number of clusters k best supported by the a ..."
Abstract

Cited by 8 (4 self)
 Add to MetaCart
Most partition-based cluster analysis methods (e.g., k-means) will partition any dataset D into k subsets, regardless of the inherent appropriateness of such a partitioning. This paper presents a family of permutation-based procedures to determine both the number of clusters k best supported by the available data and the weight of evidence in support of this clustering. These procedures use one of 37 cluster quality measures to assess the influence of structure-destroying random permutations applied to the original dataset. Results are presented for a collection of simulated datasets for which the correct cluster structure is known unambiguously.
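The negative-control idea can be sketched as: score a clustering of the real data, then score clusterings of copies whose columns have been independently shuffled (destroying joint structure while preserving marginals), and report how often the shuffled copies score worse. The clustering rule and quality score below are trivial stand-ins, not any of the paper's 37 measures:

```python
import random

def permutation_control(points, cluster_fn, quality_fn, n_perm=20, seed=0):
    # Compare quality on the real data against column-permuted copies;
    # returns the real score and the fraction of permutations scoring
    # strictly worse (higher quality_fn = better).
    rng = random.Random(seed)
    real = quality_fn(points, cluster_fn(points))
    worse = 0
    for _ in range(n_perm):
        cols = [list(col) for col in zip(*points)]
        for col in cols:                      # shuffle each column independently
            rng.shuffle(col)
        permuted = [tuple(row) for row in zip(*cols)]
        if quality_fn(permuted, cluster_fn(permuted)) < real:
            worse += 1
    return real, worse / n_perm

# Stand-in clustering rule and quality score (negative within-cluster SSQ).
def split_on_x(pts):
    return [0 if p[0] < 5 else 1 for p in pts]

def neg_ssq(pts, labels):
    total = 0.0
    for lab in set(labels):
        grp = [p for p, l in zip(pts, labels) if l == lab]
        cen = [sum(p[d] for p in grp) / len(grp) for d in range(len(grp[0]))]
        total += sum(sum((p[d] - cen[d]) ** 2 for d in range(len(cen))) for p in grp)
    return -total

data = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 11), (10, 11)]
real_score, frac_worse = permutation_control(data, split_on_x, neg_ssq)
```

A real cluster structure should beat most permuted controls; a structureless dataset should not.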
Output convergence and international trade: Time-series and fuzzy clustering evidence for New Zealand and her trading partners, 1950–1992
 DEPARTMENT OF ECONOMICS, UNIVERSITY OF VICTORIA
, 2001
"... Using historical timeseries data, we test for convergence and common trends in real per capita output for New Zealand and her four major trading partners. Both bivariate and multivariate timeseries methods are used, and we also implement the fuzzy cmeans clustering algorithm as an alternative bas ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
Using historical time-series data, we test for convergence and common trends in real per capita output for New Zealand and her four major trading partners. Both bivariate and multivariate time-series methods are used, and we also implement the fuzzy c-means clustering algorithm as an alternative basis for detecting convergence. The results of our time-series analysis accord with earlier studies: we find limited evidence of (only bivariate) convergence, but ample evidence of a small number of common trends. In contrast, our fuzzy clustering analysis reveals very strong evidence of a particular form of output convergence when the five trading countries are considered as a group.
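The fuzzy c-means algorithm used as the alternative convergence test can be sketched in one dimension, e.g., clustering economies by some per capita output statistic. The growth figures below are invented; initialization and stopping rules are simplified:

```python
def fuzzy_c_means(xs, c, m=2.0, iters=50, eps=1e-9):
    # Minimal 1-D fuzzy c-means: returns cluster centres and the
    # membership matrix u[i][k] of observation k in cluster i.
    srt = sorted(xs)
    centres = [srt[i * (len(xs) - 1) // (c - 1)] for i in range(c)]
    for _ in range(iters):
        u = [[0.0] * len(xs) for _ in range(c)]
        for k, x in enumerate(xs):
            d = [abs(x - v) + eps for v in centres]
            for i in range(c):
                u[i][k] = 1.0 / sum((d[i] / dj) ** (2.0 / (m - 1.0)) for dj in d)
        # centres move to the membership-weighted means
        centres = [
            sum(u[i][k] ** m * xs[k] for k in range(len(xs)))
            / sum(u[i][k] ** m for k in range(len(xs)))
            for i in range(c)
        ]
    return centres, u

# Hypothetical per capita growth figures for five economies, chosen so
# that two groups emerge; the memberships (not hard labels) are what
# makes the method attractive for borderline converging economies.
growth = [1.0, 1.1, 0.9, 5.0, 5.1]
centres, u = fuzzy_c_means(growth, c=2)
```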
Clustering With Genetic Algorithms
, 1998
"... Clustering is the search for those partitions that reflect the structure of an object set. Traditional clustering algorithms search only a small subset of all possible clusterings (the solution space) and consequently, there is no guarantee that the solution found will be optimal. We report here on ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
Clustering is the search for those partitions that reflect the structure of an object set. Traditional clustering algorithms search only a small subset of all possible clusterings (the solution space) and, consequently, there is no guarantee that the solution found will be optimal. We report here on the application of Genetic Algorithms (GAs), stochastic search algorithms touted as effective search methods for large and complex spaces, to the problem of clustering. GAs which have been made applicable to the problem of clustering (by adapting the representation and fitness function, and by developing suitable evolutionary operators) are known as Genetic Clustering Algorithms (GCAs). There are two parts to our investigation of GCAs: first we look at clustering into a given number of clusters. The performance of GCAs on three generated data sets, analysed using 4320 differing combinations of adaptations, establishes their efficacy. Choice of adaptations and parameter settings is data set depe...
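A GCA in the sense above can be sketched with generic operators (not the paper's specific adaptations): a chromosome assigns each object a cluster label, fitness is the negative within-cluster sum of squares, and the search uses tournament selection, one-point crossover, and point mutation. All parameter values here are arbitrary:

```python
import random

def genetic_clustering(points, k, pop_size=30, gens=60, seed=1):
    # Toy genetic clustering of 1-D data into k clusters; returns the
    # best label vector seen across all generations.
    rng = random.Random(seed)
    n = len(points)

    def fitness(labels):
        ssq = 0.0
        for lab in set(labels):
            grp = [p for p, l in zip(points, labels) if l == lab]
            cen = sum(grp) / len(grp)
            ssq += sum((p - cen) ** 2 for p in grp)
        return -ssq

    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(gens):
        nxt = []
        for _ in range(pop_size):
            mum = max(rng.sample(pop, 2), key=fitness)  # tournament selection
            dad = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, n)                   # one-point crossover
            child = mum[:cut] + dad[cut:]
            if rng.random() < 0.3:                      # point mutation
                child[rng.randrange(n)] = rng.randrange(k)
            nxt.append(child)
        pop = nxt
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand
    return best

data = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
labels = genetic_clustering(data, k=2)
```

Because the chromosome ranges over arbitrary labelings, the GA searches the full partition space rather than the greedy neighborhood a k-means-style iteration explores.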