This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the Self-Organizing Map (SOM) algorithm. As the feature vectors for the documents we use statistical representations of their vocabularies. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.

Keywords: Data mining, exploratory data analysis, knowledge discovery, large databases, parallel implementation, random projection, Self-Organizing Map (SOM), textual documents.

I. Introduction

A. From simple searches to browsing of self-organized data collections

Locating documents on the basis of keywords and simple search expressions is a c...