Results 1 - 10 of 10
Euclidean embedding of co-occurrence data
- Advances in Neural Information Processing Systems 17, 2005
Cited by 26 (2 self)
Abstract
Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis.
1 Introduction
Embeddings of objects in a low-dimensional space are an important tool in unsupervised learning and in preprocessing data for supervised learning algorithms. They are especially valuable for exploratory data analysis and visualization by providing easily interpretable representations of the relationships among objects. Most current embedding techniques build low dimensional mappings that preserve certain relationships among objects and differ in the relationships they choose to preserve, which range from pairwise distances in multidimensional scaling (MDS) [4] to neighborhood structure in locally linear embedding [12]. All these methods operate on objects of a single type endowed with a measure of similarity or dissimilarity. However, real-world data often involve objects of several very different types without a natural measure of similarity. For example, typical web pages or scientific papers contain
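The abstract above says that joint distributions are modeled as exponentials of Euclidean distances in the embedding space. The following is a minimal sketch of that idea, not the paper's actual model or fitting procedure. It assumes the simplest form, p(x, y) proportional to exp(-||phi(x) - psi(y)||^2), with no marginal terms. The names `phi`, `psi`, and `joint_from_embeddings` are hypothetical, and the embeddings are random placeholders rather than learned values.

```python
import numpy as np

def joint_from_embeddings(phi, psi):
    """Return p[i, j] proportional to exp(-||phi[i] - psi[j]||^2).

    Assumption: squared Euclidean distance and a single global
    normalizer; the paper may use a different variant.
    """
    # Pairwise differences between every row of phi and every row of psi.
    diff = phi[:, None, :] - psi[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    unnorm = np.exp(-sq_dist)
    # Normalize so that all entries sum to one (a joint distribution).
    return unnorm / unnorm.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 2))  # hypothetical 2-D embeddings, 4 objects of one type
psi = rng.normal(size=(3, 2))  # hypothetical 2-D embeddings, 3 objects of another type
p = joint_from_embeddings(phi, psi)
```

Under this assumed form, pairs that lie close together in the common space receive high joint probability, which is what lets one map represent co-occurrence between two different object types.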
Graph Layout Techniques and Multidimensional Data Analysis
- 2000
Cited by 4 (0 self)
In this paper we explore the relationship between multivariate data analysis and techniques for graph drawing or graph layout. Although both classes of techniques were created for quite different purposes, we find many common principles and implementations. We start with a discussion of the data analysis techniques, in particular multiple correspondence analysis, multidimensional scaling, parallel coordinate plotting, and seriation. We then discuss parallels in the graph layout literature.
Metric Scaling Graphical Representation of Categorical Data
- Penn State University, 1995
Cited by 1 (1 self)
Metric Scaling is a well-known method to represent a finite set with respect to a given Euclidean distance matrix. Several methods to represent rows and columns of a two-way contingency table are available: Correspondence Analysis, Dual Scaling, Canonical Coordinates, etc. We show that metric scaling provides a similar representation by using Hellinger or Rao distances together with Gower's add-a-point formula and discuss its relationship with the other approaches. The present approach suggests an alternative to Multiple Correspondence Analysis for multivariate categorical data.
Keywords: Categorical data; Correspondence Analysis; Distances between observations; Multidimensional scaling; Biplot.
AMS Subject Classification: 62H25, 62H20, 62-09.
1 Introduction
The statistical methodology dealing with categorical data currently has an increasing interest. Under the name Correspondence Analysis (CA), the data analyst recognizes a method of graphical representation of categorical ...
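To illustrate the idea in the abstract above, here is a rough sketch that applies metric scaling to Hellinger distances between the rows of a contingency table. It rests on several assumptions: the Hellinger distance is taken between row profiles without a normalizing constant, the scaling step is classical (Torgerson) scaling, and the table is toy data. It does not reproduce the paper's procedure and leaves out Gower's add-a-point formula. The function names are hypothetical.

```python
import numpy as np

def hellinger_distances(table):
    """Hellinger distances between row profiles of a contingency table.

    Assumption: d(i, j) = sqrt(sum_k (sqrt(p_ik) - sqrt(p_jk))^2),
    where p_i is row i divided by its row total.
    """
    profiles = table / table.sum(axis=1, keepdims=True)
    roots = np.sqrt(profiles)
    diff = roots[:, None, :] - roots[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def classical_scaling(dist, dims=2):
    """Classical metric scaling: coordinates from a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix of squared distances (inner products).
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]
    # Clip tiny negative eigenvalues caused by rounding.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy 3x3 contingency table (made-up counts, for illustration only).
table = np.array([[20.0, 5.0, 1.0],
                  [18.0, 6.0, 2.0],
                  [2.0, 7.0, 25.0]])
dist = hellinger_distances(table)
coords = classical_scaling(dist, dims=2)
```

With this convention the Hellinger distance is an ordinary Euclidean distance between square-root profiles, so three rows can be represented exactly in two dimensions.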
Feature Extraction and Dimension Reduction with Applications to Classification and the Analysis of Co-occurrence Data
- 2001
Relativity and Resolution for High Dimensional Information Visualization with Generalized Association Plots (GAP)
- 2002
Generalized association plots (GAP) (Chen, 1996; 1999; 2002) is an information visualization environment for high dimensional data structure without dimension reduction. There is no limit for sample size and variable number. Three matrix maps for raw data matrix, object proximity matrix, and variable proximity matrix are created for visually extracting grouping structures for objects and variables and the interaction information between object-clusters and variable-groups. Seriation algorithms are developed to permute objects and variables such that rows and columns with similar profiles are arranged at closer positions. Categorical generalized association plots (cGAP) (Chen, 1999; Chen et al., 2002) is an extension of GAP adapted for visualizing high dimensional categorical data structure. Optimal scaling (multiple correspondence analysis) is applied to compute the proximity matrices for objects as well as for variables and to obtain colors for coding all categories in the raw data matrix. Relativity and resolution are two related critical issues in conducting efficient GAP and cGAP analyses. This article discusses possible solutions when standard procedures fail in generating satisfactory relativity and resolution for GAP and cGAP.
Review Indigo: a World-Wide-Web review of genomes and gene functions
The present article describes a genome database reviewing gene-related knowledge of two model bacteria, Bacillus subtilis and Escherichia coli. The database, Indigo, is open through the World-Wide Web
Mutagenic Activity of Disinfection By-Products
Data on raw water quality, disinfection treatment practices, and the resulting mutagenic properties of the treated water were compiled from pilot- and full-scale treatment experiments to evaluate that parameter which might produce variability in the results of a mutagenic study. Analysis of the data and comparison of treatment practices indicated that the measured mutagenic activity is strongly related to the characteristics of the organic matter in the raw water, the methodology used to sample and detect mutagens, the scale of the study both in terms of treatment flow and period of study, and the point at which and the conditions under which oxidants are added during treatment. Conclusions regarding disinfection systems in full-scale water treatment plants include the following: When raw water is pretreated and high concentrations of organics are present in the raw water, both ozonation and chlorination increased mutagenic activity. However, no significant difference in mutagenicity was found between the two oxidants. Both in the case of a nitrified groundwater and a clarified surface water, the mutagenic activity of the water after ozonation was related to its mutagenic activity before ozonation. With ozonation, mutagenic activity decreased after granular activated carbon (GAC) filtration. Thus, when GAC filtration follows ozone disinfection, early addition of oxidants may not be deleterious to the finished water quality. When chlorine or chlorine dioxide is added after GAC filtration, chlorine dioxide was found to produce a less mutagenic water than chlorine. Although these conclusions suggest means of controlling mutagenic activity during treatment, it must be stressed that the measurement of mutagenicity is a presumptive index of contamination level.

