Similarity-based approaches to natural language processing
, 1997
Cited by 33 (2 self)
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
Low-complexity fuzzy relational clustering algorithms for web mining
- IEEE TRANSACTIONS ON FUZZY SYSTEMS
, 2001
Cited by 21 (2 self)
This paper presents new algorithms, fuzzy c-medoids (FCMdd) and robust fuzzy c-medoids (RFCMdd), for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total fuzzy dissimilarity within each cluster is minimized. A comparison of FCMdd with the well-known relational fuzzy c-means algorithm (RFCM) shows that FCMdd is more efficient. We present several applications of these algorithms to Web mining, including Web document clustering, snippet clustering, and Web access log analysis.
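The medoid-selection idea in this abstract can be sketched in a few lines. The update rules below follow the commonly described form of fuzzy c-medoids (inverse-dissimilarity memberships with a fuzzifier m, then a medoid update that minimizes the membership-weighted dissimilarity); they are an assumption for illustration, not taken from the abstract. The names `memberships`, `fcmdd`, and `init_medoids` and the small constant `eps` are hypothetical.

```python
def memberships(dissim, medoids, m=2.0, eps=1e-12):
    """Fuzzy membership of every object in every cluster, given the medoids."""
    u = []
    for row in dissim:
        # eps (an assumption) avoids division by zero for the medoid itself.
        w = [(1.0 / (row[v] + eps)) ** (1.0 / (m - 1.0)) for v in medoids]
        s = sum(w)
        u.append([wi / s for wi in w])
    return u


def fcmdd(dissim, init_medoids, m=2.0, max_iter=100):
    """Alternate membership and medoid updates on an n x n dissimilarity matrix."""
    n = len(dissim)
    medoids = list(init_medoids)
    for _ in range(max_iter):
        u = memberships(dissim, medoids, m)
        # For each cluster, pick the object that minimizes the
        # membership-weighted total dissimilarity to all objects.
        new_medoids = [
            min(range(n),
                key=lambda k: sum(u[j][i] ** m * dissim[k][j] for j in range(n)))
            for i in range(len(medoids))
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, memberships(dissim, medoids, m)


# Relational input: only pairwise dissimilarities are given, never coordinates.
xs = [0, 1, 2, 10, 11, 12]
dissim = [[abs(a - b) for b in xs] for a in xs]
medoids, u = fcmdd(dissim, init_medoids=[0, 5])
print(medoids)
```

Like other alternating schemes of this kind, the sketch can stop in a local minimum, so the result depends on the initial medoids.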
An interior point algorithm for minimum sum of squares clustering
- SIAM J. Sci. Comput
, 1997
Cited by 17 (6 self)
Abstract. An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0-1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0-1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0-1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve the auxiliary problem quickly as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher’s 150 iris.
Key words. classification and discrimination, cluster analysis, interior-point methods, combinatorial optimization
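The objective described in this abstract, the sum of squared distances from each point to the centroid of its cluster, can be evaluated directly for any given partition. The following is a minimal sketch of that objective only (not of the interior point method); the function name `mssc_objective` and the toy data are illustrative assumptions.

```python
def mssc_objective(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's points.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return total


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(mssc_objective(points, [0, 0, 1, 1]))  # 4.0
print(mssc_objective(points, [0, 1, 0, 1]))  # 100.0
```

An exact method has to certify that no other assignment of labels gives a smaller value, which is what makes the problem hard as the number of points grows.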
A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering
- In Proc. IEEE Intl. Conf. Fuzzy Systems - FUZZIEEE99, Korea
, 1999
Cited by 15 (2 self)
This paper presents new algorithms (Fuzzy c-Medoids, or FCMdd, and Fuzzy c Trimmed Medoids, or FCTMdd) for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total dissimilarity within each cluster is minimized. A comparison of FCMdd with the Relational Fuzzy c-Means algorithm (RFCM) shows that FCMdd is much faster. We present examples of applications of these algorithms to Web document and snippet clustering.
1. Introduction
Object data refers to the situation where the objects to be clustered are represented by vectors x_i ∈ ℜ^p. Relational data refers to the situation where we have only numerical values representing the degrees to which pairs of objects in the data set are related. Algorithms that generate partitions of relational data are usually referred to as relational (or sometimes pair-wise) clustering algorithms. Relational clustering is more general in the sense tha...
Automatic Web User Profiling and Personalization Using Robust Fuzzy Relational Clustering
, 2002
Cited by 13 (4 self)
The proliferation of information on the World Wide Web has made the personalization of this information space a necessity. Personalization of content returned from a Web site is a desired feature that can enhance server performance, improve system design, and lead to wise marketing decisions in electronic commerce. Mining typical user profiles from the vast amount of historical data stored in access logs is an important component of Web personalization. In the absence of a priori knowledge, unsupervised or clustering methods seem to be ideally suited to categorize the usage behavior of Web surfers. In this chapter, we present a framework for mining typical user profiles from server access logs based on robust fuzzy relational clustering. As a by-product of the clustering process that generates robust profiles, associations between different URL addresses on a given site can easily be inferred. In general, the URLs that are present in the same profile tend to be visited together in the same session or form a large itemset. Finally, we present a personalization system that uses previously mined profiles to automatically generate a Web page containing URLs the user might be interested in. Our personalization approach is based on profiles computed from the prior traversal patterns of the users on the website and does not require the user to provide any declarative private information or to log in.
A monothetic clustering method
- Pattern Recognition Letters
, 1998
Cited by 7 (1 self)
Abstract: The proposed divisive clustering method simultaneously builds a hierarchy over a set of objects and a monothetic characterization of each cluster in the hierarchy. Each division is chosen to minimize the within-cluster inertia criterion among the bipartitions induced by a set of binary questions. To improve the clustering, the algorithm revises at each step the division that induced the cluster chosen for division.
Output convergence and international trade: Time-series and fuzzy clustering evidence for New Zealand and her trading partners, 1950-1992
- Econometrics Working Paper EWP0102, Department of Economics, University of Victoria
, 2001
Cited by 5 (4 self)
Using historical time-series data, we test for convergence and common trends in real per capita output for New Zealand and her four major trading partners. Both bivariate and multivariate time-series methods are used, and we also implement the fuzzy c-means clustering algorithm as an alternative basis for detecting convergence. The results of our time-series analysis accord with earlier studies: we find limited evidence of (only bivariate) convergence, but ample evidence of a small number of common trends. In contrast, our fuzzy clustering analysis reveals very strong evidence of a particular form of output convergence when the five trading countries are considered as a group.
Quantitative evaluation of clustering results using computational negative controls
- In Proc. 2004 SIAM Int. Conf. Data Mining
, 2004
Cited by 5 (3 self)
Most partition-based cluster analysis methods (e.g., k-means) will partition any dataset D into k subsets, regardless of the inherent appropriateness of such a partitioning. This paper presents a family of permutation-based procedures to determine both the number of clusters k best supported by the available data and the weight of evidence in support of this clustering. These procedures use one of 37 cluster quality measures to assess the influence of structure-destroying random permutations applied to the original dataset. Results are presented for a collection of simulated datasets for which the correct cluster structure is known unambiguously.
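A permutation-based negative control of the kind this abstract describes can be sketched as follows. Two things here are assumptions rather than details from the abstract: the quality measure (within-cluster sum of squares from a plain Lloyd-style k-means) and the permutation scheme (shuffling each coordinate column independently, which keeps the marginals but breaks the joint structure). All function names and the toy data are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, iters=50):
    """Within-cluster sum of squares after Lloyd iterations (farthest-point init)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda idx: sq_dist(p, centers[idx]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[idx]
            for idx, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def permutation_p_value(points, k, n_perm=20, seed=0):
    """Share of column-shuffled datasets that cluster at least as tightly."""
    rng = random.Random(seed)
    observed = kmeans_wcss(points, k)
    columns = [list(col) for col in zip(*points)]
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        permuted = list(zip(*shuffled))
        if kmeans_wcss(permuted, k) <= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Two tight blobs near (0, 0) and (10, 10) with small deterministic jitter.
data = [(0.02 * i, 0.02 * ((i * 7) % 10)) for i in range(10)]
data += [(10 + 0.02 * i, 10 + 0.02 * ((i * 3) % 10)) for i in range(10)]
print(permutation_p_value(data, k=2))
```

A small value suggests the observed clustering is tighter than what structure-free versions of the same data produce; repeating the procedure for several values of k gives one way to compare candidate cluster counts.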
A new theoretical framework for K-means-type clustering
- FOUNDATIONS AND ADVANCES IN DATA MINING
, 2005
Cited by 4 (1 self)
One of the fundamental clustering problems is to assign n points into k clusters based on the minimal sum-of-squares (MSSC), which is known to be NP-hard. In this paper, by using matrix arguments, we first model MSSC as a so-called 0-1 semidefinite program (SDP). The classical K-means algorithm can be interpreted as a special heuristic for the underlying 0-1 SDP. Moreover, the 0-1 SDP model can be further approximated by relaxed and polynomially solvable linear and semidefinite programs. This opens new avenues for solving MSSC. The 0-1 SDP model can be applied not only to MSSC, but also to other clustering scenarios. In particular, we show that the recently proposed normalized k-cut and spectral clustering can also be embedded into the 0-1 SDP model in various kernel spaces.
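The matrix view behind such a model can be checked numerically on a toy example. For a 0-1 assignment matrix X, the matrix Z = X (X^T X)^{-1} X^T has entry 1/n_c when points i and j share a cluster of size n_c and 0 otherwise, and the sum-of-squares objective equals Tr(W W^T (I - Z)), where the rows of W are the data points. The sketch below verifies that identity; the function names are hypothetical, and the abstract itself does not spell out the formulation, so treat the exact form as an assumption.

```python
def cluster_projection(labels):
    """Z = X (X^T X)^{-1} X^T for the 0-1 assignment matrix X implied by labels."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    n = len(labels)
    return [
        [1.0 / sizes[labels[i]] if labels[i] == labels[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def trace_objective(points, labels):
    """Tr(W W^T (I - Z)) = sum_i |w_i|^2 - sum_ij Z_ij <w_i, w_j>."""
    z = cluster_projection(labels)
    n = len(points)
    total = sum(dot(p, p) for p in points)
    total -= sum(
        z[i][j] * dot(points[i], points[j]) for i in range(n) for j in range(n)
    )
    return total


pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
labs = [0, 0, 1, 1]
z = cluster_projection(labs)
print(trace_objective(pts, labs))          # 4.0, the sum of squares for this partition
print(sum(z[i][i] for i in range(len(z))))  # 2.0, the number of clusters
```

In this example Z is symmetric and nonnegative, its rows sum to 1, and its trace equals k; relaxing the requirement that Z come from a 0-1 assignment is what leads to linear and semidefinite relaxations.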
Approximating k-means-type clustering via semidefinite programming
- SIAM Journal on Optimization
Cited by 3 (1 self)
One of the fundamental clustering problems is to assign n points into k clusters based on the minimal sum-of-squares (MSSC), which is known to be NP-hard. In this paper, by using matrix arguments, we first model MSSC as a so-called 0-1 semidefinite program (SDP). We show that our 0-1 SDP model provides a unified framework for several clustering approaches such as normalized k-cut and spectral clustering. Moreover, the 0-1 SDP model allows us to solve the underlying problem approximately via relaxed linear and semidefinite programs. Secondly, we consider the issue of how to extract a feasible solution of the original MSSC model from the approximate solution of the relaxed SDP problem. By using principal component analysis, we develop a rounding procedure to construct a feasible partitioning from a solution of the relaxed problem. In our rounding procedure, we need to solve a k-means clustering problem in ℜ^(k−1), which can be solved in O(n^(k^2 (k−1))) time. In the case of bi-clustering, the running time of our rounding procedure can be reduced to O(n log n). We show that our algorithm can provide a 2-approximate solution to the original problem. Promising numerical results based on our new method are reported.
Key words. K-means clustering, Principal component analysis, Semidefinite programming, Approximation.

