Results 1 - 10 of 59
Self Organization of a Massive Document Collection
 IEEE Transactions on Neural Networks
Cited by 204 (14 self)
This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the Self-Organizing Map (SOM) algorithm. As the feature vectors for the documents we use statistical representations of their vocabularies. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms. Keywords: data mining, exploratory data analysis, knowledge discovery, large databases, parallel implementation, random projection, Self-Organizing Map (SOM), textual documents. I. Introduction. A. From simple searches to browsing of self-organized data collections. Locating documents on the basis of keywords and simple search expressions is a c...
Random projection in dimensionality reduction: Applications to image and text data
 in Knowledge Discovery and Data Mining
, 2001
Cited by 137 (0 self)
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases, where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally significantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
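The distance-preservation claim in this abstract can be illustrated with a small sketch. It assumes one common sparse construction (entries sqrt(3) times +1, 0, -1 with probabilities 1/6, 2/3, 1/6) and a 1/sqrt(k) scaling; the abstract itself does not specify the matrix, so these choices, the function names, and the dimensions are illustrative only.

```python
import math
import random

def sparse_random_matrix(k, d, rng):
    """k x d matrix with entries sqrt(3) * {+1, 0, -1} drawn with
    probabilities 1/6, 2/3, 1/6 (an assumed sparse construction)."""
    s = math.sqrt(3.0)
    choices = (s, 0.0, 0.0, 0.0, 0.0, -s)  # uniform pick gives 1/6, 2/3, 1/6
    return [[rng.choice(choices) for _ in range(d)] for _ in range(k)]

def project(R, x):
    """Map a d-dimensional vector x to k dimensions; the 1/sqrt(k) factor
    keeps squared lengths unchanged in expectation."""
    scale = 1.0 / math.sqrt(len(R))
    return [scale * sum(r * v for r, v in zip(row, x)) for row in R]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)          # fixed seed so the run is repeatable
d, k = 1000, 200                # hypothetical original and reduced dimensions
R = sparse_random_matrix(k, d, rng)
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]

# Ratio of projected distance to original distance; values near 1 mean
# the pairwise distance survived the projection.
ratio = euclidean(project(R, x), project(R, y)) / euclidean(x, y)
print(round(ratio, 2))
```

Because two thirds of the matrix entries are zero, most multiplications can be skipped in a real implementation, which is one plausible source of the savings the abstract mentions.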
A Survey of Dimension Reduction Techniques
, 2002
Cited by 87 (0 self)
In this paper, we assume that we have $n$ observations, each being a realization of the $p$-dimensional random variable $x = (x_1, \ldots, x_p)^T$ with mean $E(x) = \mu = (\mu_1, \ldots, \mu_p)^T$ and covariance matrix $E\{(x - \mu)(x - \mu)^T\} = \Sigma_{p \times p}$. We denote such an observation matrix by $X = \{x_{i,j} : 1 \le i \le p,\ 1 \le j \le n\}$. If $\mu_i$ and $\sigma_i = \sqrt{\Sigma_{(i,i)}}$ denote the mean and the standard deviation of the $i$th random variable, respectively, then we will often standardize the observations $x_{i,j}$ by $(x_{i,j} - \hat{\mu}_i)/\hat{\sigma}_i$, where $\hat{\mu}_i = \bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{i,j}$ and $\hat{\sigma}_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n} (x_{i,j} - \bar{x}_i)^2}$.
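The standardization described in this abstract (subtract each variable's sample mean, divide by its sample standard deviation with 1/n normalization) can be sketched as follows. The row-per-variable layout mirrors the p-by-n observation matrix; the function name and the sample data are hypothetical, and the sketch assumes no variable is constant (a zero standard deviation would divide by zero).

```python
import math

def standardize(X):
    """Standardize a p x n observation matrix given as a list of p rows,
    one row per variable, each holding n observations."""
    out = []
    for row in X:
        n = len(row)
        mean = sum(row) / n
        # 1/n normalization, as in the abstract (not the 1/(n-1) variant)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
        out.append([(v - mean) / std for v in row])
    return out

# Hypothetical data: p = 2 variables, n = 4 observations each
X = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 10.0, 30.0, 30.0]]
Z = standardize(X)
for row in Z:
    print([round(v, 3) for v in row])
```

After this step every variable has sample mean 0 and sample variance 1, so variables measured on different scales contribute comparably to distance computations.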
SOM-Based Data Visualization Methods
 Intelligent Data Analysis
, 1999
Cited by 79 (4 self)
The Self-Organizing Map (SOM) is an efficient tool for visualization of multidimensional numerical data. In this paper, an overview and categorization of both old and new methods for the visualization of SOM is presented. The purpose is to give an idea of what kind of information can be acquired from different presentations and how the SOM can best be utilized in exploratory data visualization. Most of the presented methods can also be applied in the more general case of first making a vector quantization (e.g. k-means) and then a vector projection (e.g. Sammon's mapping).
Self-Organization of Very Large Document Collections: State of the Art
 Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks
, 1998
Cited by 55 (2 self)
The Self-Organizing Map (SOM) forms a nonlinear projection from a high-dimensional data manifold onto a low-dimensional grid. A representative model of some subset of data is associated with each grid point. The SOM algorithm computes an optimal collection of models that approximates the data in the sense of some error criterion and also takes into account the similarity relations of the models. The models then become ordered on the grid according to their similarity. When the SOM is used for the exploration of statistical data, the data vectors can be approximated by models of the same dimensionality. When mapping documents, one can represent them statistically by their word frequency histograms or some reduced representations of the histograms that can be regarded as data vectors. We have made SOMs of collections of over one million documents. Each document is mapped onto some grid point, with a link from this point to the document database. The documents are ordered on the grid acco...
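The idea of models attached to grid points that are pulled toward the data, together with their grid neighbors, can be sketched with a textbook online SOM update on a one-dimensional grid. This is a minimal illustration under assumed settings (linear decay of learning rate and neighborhood width, Gaussian neighborhood, synthetic two-cluster data); it is not the large-scale method the abstract refers to, and all names and parameter values are hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the grid node whose model vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def quantization_error(weights, data):
    """Mean distance from each data vector to its best-matching model."""
    total = 0.0
    for x in data:
        w = weights[best_matching_unit(weights, x)]
        total += math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))
    return total / len(data)

def train_som(data, n_nodes, steps, rng,
              lr0=0.5, lr1=0.01, sigma0=2.0, sigma1=0.5):
    """Online SOM on a 1-D grid; returns (initial models, trained models)."""
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    initial = [w[:] for w in weights]
    for t in range(steps):
        frac = t / steps
        lr = lr0 + (lr1 - lr0) * frac              # decaying learning rate
        sigma = sigma0 + (sigma1 - sigma0) * frac  # shrinking neighborhood
        x = rng.choice(data)
        c = best_matching_unit(weights, x)
        for i, w in enumerate(weights):
            # Gaussian neighborhood measured on the grid, not in data space
            h = math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return initial, weights

rng = random.Random(0)
centers = [(5.0, 5.0), (-5.0, -5.0)]   # two synthetic clusters
data = [[cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)]
        for cx, cy in centers for _ in range(50)]
initial, trained = train_som(data, n_nodes=5, steps=2000, rng=rng)
print(quantization_error(initial, data), quantization_error(trained, data))
```

Mapping a document then amounts to calling `best_matching_unit` on its feature vector and storing a link from that grid node to the document.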
Binary Analysis and Optimization-Based Normalization of Gene Expression Data
, 2002
Cited by 53 (6 self)
Motivation: Most approaches to gene expression analysis use real-valued expression data, produced by high-throughput screening technologies, such as microarrays. Often, some measure of similarity must be computed in order to extract meaningful information from the observed data. The choice of this similarity measure frequently has a profound effect on the results of the analysis, yet no standards exist to guide the researcher.
Self-Organizing Maps in Natural Language Processing
, 1997
Cited by 38 (2 self)
Kohonen's Self-Organizing Map (SOM) is one of the most popular artificial neural network algorithms. Word category maps are SOMs that have been organized according to word similarities, measured by the similarity of the short contexts of the words. Conceptually interrelated words tend to fall into the same or neighboring map nodes. Nodes may thus be viewed as word categories. Although no a priori information about classes is given, during the self-organizing process a model of the word classes emerges. The central topic of the thesis is the use of the SOM in natural language processing. The approach based on the word category maps is compared with the methods that are widely used in artificial intelligence research. Modeling gradience, conceptual change, and subjectivity of natural language interpretation are considered. The main application area is information retrieval and textual data mining for which a specific SOM-based method called the WEBSOM has been developed. The WEBSOM metho...
A visual approach for monitoring logs
 In The Proceedings of the 12th Systems Administration Conference (LISA ’98)
, 1998
Cited by 35 (1 self)
Analyzing and monitoring logs that portray system, user, and network activity is essential to meet the requirements of high security and optimal resource availability. While most systems now possess satisfactory logging facilities, the tools to monitor and interpret such event logs are still in their infancy. This paper describes an approach to relieve system and network administrators from manually scanning sequences of log entries. An experimental system based on unsupervised neural networks and spring layouts to automatically classify events contained in logs is explained, and the use of complementary information visualization techniques to visually present and interactively analyze the results is then discussed. The system we present can be used to analyze past activity as well as to monitor real-time events. We illustrate the system’s use for event logs generated by a firewall; however, it can be easily coupled to any source of sequential and structured event logs.
Using the SOM and local models in time-series prediction
 Helsinki University of Technology
, 1997
Cited by 29 (1 self)
In this paper we test the Self-Organizing Map (SOM) on the problem of predicting chaotic time-series (specifically Mackey-Glass series) with local linear models defined separately for each of the prototype vectors of the SOM. We see that the method achieves good results. This, together with the capabilities of the SOM, makes it a valuable tool in exploratory data mining.
Novelty detection using Self-Organizing Maps
 In Proc. of ICONIP'97
, 1997
Cited by 25 (2 self)
Failure detection in process monitoring involves a classification mainly on the basis of data from normal operation. When a Self-Organizing Map is used for the description of normal system behaviour, a compatibility measure is needed for declaring a map and a dataset as matching. We propose a novel variant of one such measure and investigate the usefulness of existing and novel measures both with synthetic data and in two real-world applications. 1 Introduction. A typical aspect of fault diagnosis in a process or system is the limited availability of measurement data concerning faulty situations: often it is hard to acquire a dataset representative of the whole "failure space", whereas the normal operation space can be characterized very accurately. It may be valuable to make an accurate representation of the normal (admissible or healthy) behaviour, and detect faults as significant deviations w.r.t. this admissible domain. When there are some measurements from the abnormal situation, g...