"The challenges of clustering high dimensional data", [online] available: http://www.users.cs.umn.edu/~kumar/papers/high_dim_clustering_19.pdf (0)

by M Steinbach, L Ertöz, V Kumar

Results 1 - 8 of 8

A brief survey of text mining

by Andreas Hotho, Andreas Nürnberger, Gerhard Paaß - LDV Forum - GLDV Journal for Computational Linguistics and Language Technology , 2005
Cited by 17 (0 self)
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field at the intersection of the related areas of information retrieval, machine learning, statistics, computational linguistics and especially data mining. We describe the main analysis tasks: preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.

On the Effects of Dimensionality on Data Analysis With Neural Networks

by M. Verleysen, D. Francois, G. Simon, V. Wertz , 2003
Cited by 6 (1 self)
Modern data analysis often faces high-dimensional data.

Modeling Word Senses with Fuzzy Clustering

by Erik Velldal , 2003
Cited by 1 (0 self)
This thesis describes a clustering approach to automatically inferring soft semantic classes and characterizing senses of a set of Norwegian nouns. The words are represented by way of their distribution in text, identified as local contexts in the form of lexical-syntactic relations. Through a shallow processing step the context features are extracted for lemmatized word forms in syntactically tagged corpora. The corresponding frequency counts of noun--context co-occurrences are weighted with a statistical association measure, and the distributional profile of a given word is represented in the form of a feature vector in a semantic space model. A hybrid approach is taken when clustering the word vectors; a bottom-up hierarchical method is used to initialize various types of fuzzy partitional clusterings. With the purpose of capturing the notion of typicality the clusters are construed as fuzzy sets, and the words are assigned varying degrees of membership with respect to the various classes. Words are assigned graded memberships in clusters on the basis of their resemblance towards a class prototype. The goal is to automatically uncover semantic classes, where the various memberships of a given word in these fuzzy clusters can be used to characterize its various senses.
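
The graded memberships described in this abstract can be illustrated with a minimal sketch. The abstract does not state which membership function or distance the thesis uses, so the code below assumes the standard fuzzy c-means membership formula with Euclidean distance. The function name, the fuzzifier `m`, and the toy vectors are all hypothetical.

```python
import math

def fuzzy_memberships(vector, prototypes, m=2.0):
    """Return fuzzy c-means style memberships of one vector in each cluster.

    Assumptions (not from the thesis): Euclidean distance and the standard
    fuzzy c-means membership formula with fuzzifier m > 1.
    """
    dists = [math.dist(vector, p) for p in prototypes]
    # A vector that coincides with a prototype belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    # Membership shrinks as the distance to a prototype grows relative
    # to the distances to all other prototypes.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical 2-D word vector and class prototypes, for illustration only.
prototypes = [(0.0, 0.0), (4.0, 0.0)]
memberships = fuzzy_memberships((1.0, 0.0), prototypes)
```

Under these assumptions the memberships of a word always sum to one, so a word lying between two prototypes is shared between the corresponding classes instead of being forced into a single one.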

Information filtering. General Terms

by unknown authors
There are currently hundreds of millions of people contributing content to the Web. They do so by rating items, sharing links, photos, music and video, creating their own webpage or writing them for friends, family, or employer, socializing in social networking sites, and blogging their daily life and thoughts. Of those who author Web content there is a group of people who contribute to more than a single Web entity, be it on a different host, on a different application or under a different username. We name this group Serial Sharers. In this paper we analyze patterns in the contributions of Serial Sharers. We examine the overlap between their individual contributions and propose a method for detecting their pages in large and diverse collections of pages.

High-Dimensional Clustering with Sparse Gaussian Mixture Models

by Akshay Krishnamurthy
We consider the problem of clustering high-dimensional data using Gaussian Mixture Models (GMMs) with unknown covariances. In this context, the Expectation-Maximization algorithm (EM), which is typically used to learn GMMs, fails to cluster the data accurately due to the large number of free parameters in the covariance matrices. We address this weakness by assuming that the mixture model consists of sparse Gaussian distributions and leveraging this assumption in a novel algorithm for learning GMMs. Our approach incorporates the graphical lasso procedure for sparse covariance estimation into the EM algorithm for learning GMMs, and by encouraging sparsity, it avoids the problems faced by traditional GMMs. We guarantee convergence of our algorithm and show through experimentation that this procedure outperforms the traditional Expectation Maximization algorithm and other clustering algorithms in the high-dimensional clustering setting.
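
The "large number of free parameters" mentioned in this abstract can be made concrete with a short count. The sketch below assumes a GMM with one full, symmetric covariance matrix per component. The function name and the chosen component count are hypothetical and do not come from the paper.

```python
def gmm_free_parameters(n_components, n_dims):
    """Count free parameters of a GMM with full covariance matrices."""
    weights = n_components - 1                 # mixing weights sum to one
    means = n_components * n_dims              # one mean vector per component
    # A symmetric d x d covariance matrix has d * (d + 1) / 2 free entries.
    covariances = n_components * n_dims * (n_dims + 1) // 2
    return weights + means + covariances

# Five components, for an illustrative range of dimensionalities.
for d in (2, 10, 100):
    print(d, gmm_free_parameters(5, d))
```

The covariance term grows quadratically with the dimension, which is why it dominates the count in high-dimensional settings.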

CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application

by Zunping Cheng, Ding Zhou, Chen Wang, Jiankui Guo, Wei Wang, Baokang Ding, Baile Shi
Clustering is a common technique in data mining to discover hidden patterns from massive datasets. With the development of privacy-maintaining data mining application, clustering incomplete high-dimensional data has become more and more useful. Motivated by these limits, we develop a novel algorithm CLINCH, which could produce fine clusters on incomplete high-dimensional data space. To handle missing attributes, CLINCH employs a prediction method that can be more precise than traditional techniques. On the other hand, we also introduce an efficient way in which dimensions are processed one by one to attack the "curse of dimensionality". Experiments show that our algorithm not only outperforms many existing high-dimensional clustering algorithms in scalability and efficiency, but also produces precise results. Keywords: Clustering, Incomplete Data, High-Dimensional Data.

A Multi-stage, Semi-automated Procedure for Analyzing the Morphology of Nanoparticles

by Chiwoo Park, Jianhua Z. Huang, David Huitink, Subrata Kundu, Bani K. Mallick, Hong Liang, Yu Ding , 2011
This paper presents a multi-stage, semi-automated procedure that can expedite the morphology analysis of nanoparticles. Material scientists have long conjectured that morphology of nanoparticles has a profound impact on the properties of the hosting material but a bottleneck is the lack of a reliable and automated morphology analysis of the particles based on their image measurements. This paper attempts to fill in this critical void. One particular challenge in nano morphology analysis is how to analyze the overlapped nanoparticles, a problem not well addressed by the existing methods but effectively tackled by our proposed method. Our proposed method entails multiple stages of operations, executed sequentially, and is considered semi-automated due to the inclusion of a semi-supervised clustering step. Our method was applied to several images of gold nanoparticles, producing the needed statistical characterization of their morphology.

Clustering for High Dimensional Data: Density based Subspace Clustering Algorithms

by Sunita Jahirabadkar, Parag Kulkarni
Finding clusters in high dimensional data is a challenging task as the high dimensional data comprises hundreds of attributes. Subspace clustering is an evolving methodology which, instead of finding clusters in the entire feature space, aims at finding clusters in various overlapping or non-overlapping subspaces of the high dimensional dataset. Density based subspace clustering algorithms treat clusters as the dense regions compared to noise or border regions. Many notable density based subspace clustering algorithms exist in the literature. Each has distinct characteristics arising from different assumptions, input parameters, or techniques. Hence it is quite infeasible for future developers to compare all these algorithms using one common scale. In this paper, we present a review of various density based subspace clustering algorithms together with a comparative chart focusing on their distinguishing characteristics such as overlapping / non-overlapping, axis parallel / arbitrarily oriented and so on.
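
The idea of dense regions that exist only in some subspaces can be sketched with a simple grid count. This is a simplified illustration chosen for this listing, not one of the algorithms the review covers. The function name, the bin width, the density threshold, and the toy points are all hypothetical.

```python
from collections import Counter

def dense_units(points, dims, bin_width, min_points):
    """Return grid cells in the axis-parallel subspace `dims` that hold
    at least `min_points` points (a crude stand-in for a density test)."""
    counts = Counter(
        tuple(int(p[d] // bin_width) for d in dims) for p in points
    )
    return {cell for cell, n in counts.items() if n >= min_points}

# Toy data: four points are tightly grouped on axis 0 but spread out on axis 1.
points = [
    (0.1, 5.0), (0.2, 1.0), (0.3, 9.0), (0.4, 3.0),
    (7.5, 2.0), (3.2, 6.0),
]
dense_x = dense_units(points, dims=(0,), bin_width=1.0, min_points=3)
dense_xy = dense_units(points, dims=(0, 1), bin_width=1.0, min_points=3)
```

In this toy data the group is dense when only axis 0 is considered and disappears in the full two-dimensional space, which is the situation subspace clustering is meant to handle.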

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University