Results 11 - 20 of 220
Correlation Clustering with Partial Information
, 2003
Cited by 39 (1 self)
We consider the following general correlation-clustering problem [1]: given a graph with real edge weights (both positive and negative), partition the vertices into clusters to minimize the total absolute weight of cut positive edges and uncut negative edges. Thus, large positive weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster; large negative weights encourage the endpoints to belong to different clusters; and weights with small absolute value represent little information. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has weight +1 or −1. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the "region-growing" technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r³)-approximation algorithm for K_{r,r}-minor-free graphs. On the other hand, we show that the problem is APX-hard, and any o(log n)-approximation would require improving the best approximation algorithms known for minimum multicut.
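The objective described in this abstract can be illustrated with a minimal sketch. It only evaluates the cost of a given partition (total absolute weight of cut positive edges plus uncut negative edges); it is not the O(log n)-approximation algorithm from the paper. The function name, the edge-list representation, and the toy graph are assumptions made for illustration.

```python
def correlation_clustering_cost(edges, cluster_of):
    """Cost of a partition under the correlation-clustering objective.

    edges      -- iterable of (u, v, weight) with real weights (hypothetical format)
    cluster_of -- dict mapping each vertex to its cluster label
    """
    cost = 0.0
    for u, v, w in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if w > 0 and not same_cluster:
            cost += w        # positive edge that the partition cuts
        elif w < 0 and same_cluster:
            cost += -w       # negative edge that the partition leaves uncut
    return cost


# Toy graph (made-up weights): a and b strongly agree, b and c disagree.
edges = [("a", "b", 2.0), ("b", "c", -1.5), ("a", "c", 0.5)]
together = {"a": 0, "b": 0, "c": 0}   # one cluster
split = {"a": 0, "b": 0, "c": 1}      # c separated from a and b

print(correlation_clustering_cost(edges, together))
print(correlation_clustering_cost(edges, split))
```

Note that neither the number of clusters nor a distance threshold appears as a parameter: the cost alone determines which partition is preferred, here the one that separates c.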
TwitterStand: News in Tweets
Cited by 37 (9 self)
Twitter is an electronic medium that allows a large user populace to communicate with each other simultaneously. Inherent to Twitter is an asymmetrical relationship between friends and followers that provides an interesting social network-like structure among the users of Twitter. Twitter messages, called tweets, are restricted to 140 characters and thus are usually very focused. We investigate the use of Twitter to build a news processing system, called TwitterStand, from Twitter tweets. The idea is to capture tweets that correspond to late-breaking news. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not known in advance and there may be many of them. Furthermore, tweets are not sent according to a schedule: they occur as news is happening, and tend to be noisy while usually arriving at a high throughput rate. Some of the issues addressed include removing the noise, determining tweet clusters of interest bearing in mind that the methods must be online, and determining the relevant locations associated with the tweets.
Ontologies Improve Text Document Clustering
, 2003
Cited by 34 (13 self)
Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large sets of documents into a small number of meaningful clusters. The bag-of-words representation used for these clustering methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with the problem, we integrate core ontologies as background knowledge into the process of clustering text documents. Our experimental evaluations compare clustering techniques based on precategorizations of texts from Reuters newsfeeds and on a smaller domain of an eLearning course about Java. In the experiments, improvements of results by background knowledge compared to a baseline without background knowledge can be shown in many interesting combinations.
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data
- in Proceedings of Second SIAM International Conference on Data Mining
, 2003
A Comparative Study of Generative Models for Document Clustering
- In SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications
, 2003
Cited by 26 (4 self)
Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both Bernoulli and multinomial models; (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
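As a rough illustration of the hard (k-means-style) assignment strategy mentioned in this abstract, the following is a minimal spherical k-means sketch: documents and centroids are scaled to unit length, each document goes to the centroid with the highest cosine similarity, and each centroid is re-estimated as the normalized sum of its members. It is a plain-Python sketch, not the paper's implementation; the helper names, the fixed iteration count, and the toy vectors are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, centroids, iterations=10):
    """Hard-assignment spherical k-means on term-weight vectors."""
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in centroids]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [
            max(range(len(centroids)), key=lambda k: dot(d, centroids[k]))
            for d in docs
        ]
        # Update step: new centroid is the normalized sum of its members.
        for k in range(len(centroids)):
            members = [d for d, label in zip(docs, labels) if label == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Toy term-count vectors (made up): two documents per topic.
docs = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 0]]
labels, centroids = spherical_kmeans(docs, centroids=[[1, 0, 0], [0, 1, 0]])
print(labels)
```

Soft or stochastic assignment would replace the `max` in the assignment step with posterior weights or sampling; that variant is not shown here.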
Explaining Text Clustering Results Using Semantic Structures
- In Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003
, 2003
Cited by 26 (8 self)
Common text clustering techniques offer rather poor capabilities for explaining to their users why a particular result has been achieved. They have the disadvantage that they do not relate semantically nearby terms and that they cannot explain how resulting clusters are related to each other. In this paper, we discuss a way of integrating a large thesaurus and the computation of lattices of resulting clusters into common text clustering in order to overcome these two problems. As its major result, our approach achieves an explanation using an appropriate level of granularity at the concept level as well as an appropriate size and complexity of the explaining lattice of resulting clusters.
NewsStand: A New View on News
, 2008
Cited by 26 (14 self)
News articles contain a wealth of implicit geographic content that, if exposed to readers, improves understanding of today’s news. However, most articles are not explicitly geotagged with their geographic content, and few news aggregation systems expose this content to users. A new system named NewsStand is presented that collects, analyzes, and displays news stories in a map interface, thus leveraging their implicit geographic content. NewsStand monitors RSS feeds from thousands of online news sources and retrieves articles within minutes of publication. It then extracts geographic content from articles using a custom-built geotagger, and groups articles into story clusters using a fast online clustering algorithm. By panning and zooming in NewsStand’s map interface, users can retrieve stories based on both topical significance and geographic region, and see substantially different stories depending on position and zoom level.
Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering
, 2002
Cited by 26 (3 self)
Recommender systems apply knowledge discovery techniques to the problem of making personalized product recommendations during a live customer interaction. These systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success in E-commerce nowadays. The tremendous growth of customers and products in recent years poses some key challenges for recommender systems. These are: producing high quality recommendations and performing many recommendations per second for millions of customers and products. New recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. We address the performance issues by scaling up the neighborhood formation process through the use of clustering techniques.
Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text
, 2004
Cited by 25 (4 self)
The application of clustering methods for automatic taxonomy construction from text requires knowledge about the tradeoff between, (i), their effectiveness (quality of result), (ii), efficiency (run-time behaviour), and, (iii), traceability of the taxonomy construction by the ontology engineer. In this line, we present an original conceptual clustering method based on Formal Concept Analysis for automatic taxonomy construction and compare it with hierarchical agglomerative clustering and hierarchical divisive clustering.
Evaluating contents-link coupled web page clustering for web search results
- In Proc. 11th Intl. Conference on Information and Knowledge Management
, 2002

