Results 11 - 20 of 122
Mapping the backbone of science
- Scientometrics, 2005
- Cited by 27 (2 self)
This paper presents a new map representing the structure of all of science, based on journal articles, including both the natural and social sciences. Similar to cartographic maps of our world, the map of science provides a bird's-eye view of today's scientific landscape. It can be used to visually identify major areas of science, their size, similarity, and interconnectedness. In order to be useful, the map needs to be accurate on a local and on a global scale. While our recent work has focused on the former aspect, this paper summarizes results on how to achieve structural accuracy. Eight alternative measures of journal similarity were applied to a data set of 7,121 journals covering over 1 million documents in the combined Science Citation and Social Science Citation Indexes. For each journal similarity measure we generated two-dimensional spatial layouts using the force-directed graph layout tool, VxOrd. Next, mutual information values were calculated for each graph at different clustering levels to give a measure of structural accuracy for each map. The best co-citation and inter-citation maps according to local and structural accuracy were selected and are presented and characterized. These two maps are compared to establish robustness. The inter-citation map is then used to examine linkages between disciplines. Biochemistry appears as the most interdisciplinary discipline in science.
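The abstract does not say exactly how the mutual information values were computed, so the following is only a minimal sketch of the general idea: score a clustering of journals by the mutual information it shares with a reference categorization. The function name `mutual_information` and the toy labels are assumptions made for illustration, not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy data: cluster ids read off a layout vs. journal categories.
categories = ["chem", "chem", "physics", "physics"]
perfect = mutual_information([0, 0, 1, 1], categories)
independent = mutual_information([0, 1, 0, 1], categories)
```

A clustering that reproduces the categories exactly shares all of their information, while one that cuts across them evenly shares none, so higher values indicate a layout whose clusters agree better with the reference structure.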
A Comparative Study of Generative Models for Document Clustering
- In SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications, 2003
- Cited by 26 (4 self)
Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both Bernoulli and multinomial models; (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
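As a rough illustration of the hard (k-means style) assignment strategy, here is a minimal sketch of spherical k-means: documents and centroids are scaled to unit length, each document goes to the centroid with the highest cosine similarity, and each centroid is re-estimated as the normalized sum of its members. The helper names, the fixed iteration count, and the toy term-count vectors are assumptions for illustration; this is not the paper's experimental code, and it assumes non-negative term vectors so that a cluster's summed vector is never zero.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, init_centroids, iterations=10):
    """Hard-assignment spherical k-means on unit-normalized vectors."""
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in init_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: pick the centroid with the largest cosine similarity.
        assignments = [
            max(range(len(centroids)), key=lambda k: dot(d, centroids[k]))
            for d in docs
        ]
        # Update step: centroid = normalized sum of the assigned documents.
        for k in range(len(centroids)):
            members = [d for d, a in zip(docs, assignments) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return assignments, centroids


# Hypothetical term-count vectors for four short documents.
docs = [[2, 1, 0], [3, 1, 0], [0, 1, 4], [0, 2, 5]]
assignments, centroids = spherical_kmeans(docs, init_centroids=[docs[0], docs[2]])
```

The stochastic and soft strategies named in the abstract would replace the `max` in the assignment step with sampling from, or weighting by, per-cluster posterior probabilities.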
A Hybrid Layout Algorithm for Sub-Quadratic Multidimensional Scaling
- In Proc. IEEE Symposium on Information Visualization, 2002
- Cited by 22 (3 self)
Many clustering and layout techniques have been used for structuring and visualising complex data. This paper is inspired by a number of such contemporary techniques and presents a novel hybrid approach based upon stochastic sampling, clustering and 2D layout. We use Chalmers' 1996 O(N^2) spring model as a benchmark when evaluating our technique, comparing layout quality and run times using data sets of synthetic and real data. Our model runs in O(N*sqrt(N)) and requires around a third of the execution time of Chalmers' algorithm, whilst producing superior quality representations in its low dimensional layout. Our results indicate that it is a solid foundation for interactive and visual exploration of data.
Self-Organizing Maps, Vector Quantization, and Mixture Modeling
- IEEE Transactions on Neural Networks, 2001
- Cited by 20 (0 self)
Self-organizing maps are popular algorithms for unsupervised learning and data visualization. Exploiting the link between vector quantization and mixture modeling, we derive EM algorithms for self-organizing maps with and without missing values. We compare self-organizing maps with the elastic-net approach and explain why the former is better suited for the visualization of high-dimensional data. Several extensions and improvements are discussed. As an illustration we apply a self-organizing map based on a multinomial distribution to market basket analysis.
I. Introduction. Self-organizing maps are popular tools for clustering and visualization of high-dimensional data [1], [2]. The well-known Kohonen learning algorithm can be interpreted as a variant of vector quantization with additional lateral interactions [3], [4]. The addition of lateral interaction between units introduces a sense of topology, such that neighboring units represent inputs that are close together in input space [...
Clustering Documents in a Web Directory
- 2003
- Cited by 18 (6 self)
growing interest due to the widespread proliferation of topic hierarchies for text documents. The worst problem of hierarchical supervised classifiers is their high demand in terms of labeled examples, whose amount is related to the number of topics in the taxonomy. Hence, bootstrapping a huge hierarchy with a proper set of labeled examples is a critical issue. In this paper, we propose some solutions for the bootstrapping problem, implicitly or explicitly using a taxonomy definition: a baseline approach where documents are classified according to class labels, and two clustering approaches, where training is constrained by the a priori knowledge of the taxonomy structure, at both the terminological and topological levels. In particular, we propose the TaxSOM model, which clusters a set of documents in a predefined hierarchy of classes, directly exploiting the knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google Web directory.
Design and evaluation of a multi-agent collaborative Web Mining System
- 2003
- Cited by 18 (8 self)
Most existing Web search tools work only with individual users and do not help a user benefit from previous search experiences of others. In this paper, we present the Collaborative Spider, a multi-agent system designed to provide post-retrieval analysis and enable across-user collaboration in Web search and mining. This system allows the user to annotate search sessions and share them with other users. We also report a user study designed to evaluate the effectiveness of this system. Our experimental findings show that subjects' search performance was degraded, compared to individual search scenarios in which users had no access to previous searches, when they had access to a limited number (e.g., 1 or 2) of earlier search sessions done by other users. However, search performance improved significantly when subjects had access to more search sessions. This indicates that gain from collaboration through collaborative Web searching and analysis does not outweigh the overhead of browsing and comprehending other users' past searches until a certain number of shared sessions has been reached. In this paper, we also catalog and analyze several different types of user collaboration behavior observed in the context of Web mining. © 2002 Elsevier Science B.V. All rights reserved.
A brief survey of text mining
- LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 2005
- Cited by 17 (0 self)
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field at the intersection of the related areas of information retrieval, machine learning, statistics, computational linguistics and especially data mining. We describe the main analysis tasks: preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.
Generative model-based document clustering: a comparative study
- Knowledge and Information Systems, 2005
- Cited by 16 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.
Hidden semantic concept discovery in region based image retrieval
- In Proc. CVPR, 2004
- Cited by 15 (2 self)
This paper addresses Content Based Image Retrieval (CBIR), focusing on developing a hidden semantic concept discovery methodology to address effective semantics-intensive image retrieval. In our approach, each image in the database is segmented into regions associated with homogeneous color, texture, and shape features. By exploiting regional statistical information in each image and employing a vector quantization method, a uniform and sparse region-based representation is achieved. With this representation a probabilistic model based on statistical-hidden-class assumptions of the image database is obtained, to which the Expectation-Maximization (EM) technique is applied to analyze semantic concepts hidden in the database. An elaborated retrieval algorithm is designed to support the probabilistic model. The semantic similarity is measured by integrating the posterior probabilities of the transformed query image, as well as a constructed negative example, with respect to the discovered semantic concepts. The proposed approach has a solid statistical foundation, and the experimental evaluations on a database of 10,000 general-purpose images demonstrate its promise and effectiveness.
Analysis of the query logs of a web site search engine
- Journal of the American Society for Information Science and Technology, 2005
- Cited by 14 (1 self)
A large number of studies have investigated the transaction log of general-purpose search engines such as Excite and AltaVista, but few studies have reported on the analysis of search logs for search engines that are limited to particular Web sites, namely, Web site search engines. In this article, we report our research on analyzing the search logs of the search engine of the Utah state government Web site. Our results show that some statistics, such as the number of search terms per query, of Web users are the same for general-purpose search engines and Web site search engines, but others, such as the search topics and the terms used, are considerably different. Possible reasons for the differences include the focused domain of Web site search engines and users' different information needs. The findings are useful for Web site developers to improve the performance of their services provided on the Web and for researchers to conduct further research in this area. The analysis also can be applied in e-government research by investigating how information should be delivered to users in government Web sites.
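Two of the statistics mentioned above, the number of search terms per query and the terms used, can be computed from a list of raw query strings along the following lines. This is a minimal sketch that assumes whitespace tokenization and case folding; the function name `query_stats` and the sample queries are invented for illustration and are not drawn from the Utah log.

```python
from collections import Counter


def query_stats(queries):
    """Return (mean terms per query, term frequency counts) for a query list."""
    term_counts = Counter()
    lengths = []
    for query in queries:
        terms = query.lower().split()  # assumption: whitespace tokenization
        if not terms:
            continue  # skip empty queries
        lengths.append(len(terms))
        term_counts.update(terms)
    mean_terms = sum(lengths) / len(lengths) if lengths else 0.0
    return mean_terms, term_counts


# Hypothetical sample queries, not taken from any real log.
sample_queries = ["driver license renewal", "state parks", "Tax forms", "parks"]
mean_terms, term_counts = query_stats(sample_queries)
top_terms = term_counts.most_common(1)
```

Running the same function over logs from a general-purpose engine and a site-specific engine would give directly comparable figures for query length and vocabulary.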

