### Citations

908 | Probabilistic Graphical Models: Principles and Techniques
- Koller, Friedman
- 2009
Citation Context ... parameterization makes them transferable; a DCNN learned on one graph can be applied to another. A related branch of work has focused on extending convolutional neural networks to domains where the structure of the graph itself is of direct interest [13, 14, 15]. For example, [15] construct a deep convolutional model that learns real-valued fingerprint representations of chemical compounds. Probabilistic Relational Models DCNNs also share strong ties to probabilistic relational models (PRMs), a family of graphical models that are capable of representing distributions over relational data [16]. In contrast to PRMs, DCNNs are deterministic, which allows them to avoid the exponential blowup in learning and inference that hampers PRMs. Our results suggest that DCNNs outperform partially-observed conditional random fields, the state-of-the-art probabilistic relational model for semi-supervised learning. Furthermore, DCNNs offer this performance at considerably lower computational cost. Learning the parameters of both DCNNs and partially-observed CRFs involves numerically minimizing a nonconvex objective – the backpropagated error in the case of DCNNs and the negative marginal log-... |

308 | Adaptive subgradient methods for online learning and stochastic optimization
- Duchi, Hazan, et al.
- 2011
Citation Context ...tion power series and propagating the input forward to predict the output, then setting the weights by gradient ascent on the back-propagated error. We also make use of windowed early stopping; training is ceased if the validation error of a given epoch is greater than the average of the last few epochs. 3 Experiments In this section we present several experiments to investigate how well DCNNs perform at node and graph classification tasks. In each case we compare DCNNs to other well-known and effective approaches to the task. In each of the following experiments, we use the AdaGrad algorithm [2] for gradient ascent with a learning rate of 0.05. All weights are initialized by sampling from a normal distribution with mean zero and variance 0.01. We choose the hyperbolic tangent for the nonlinear differentiable function f and use the multiclass hinge loss between the model predictions and ground truth as the training objective. The model was implemented in Python using Lasagne and Theano [3]. 3.1 Node classification We ran several experiments to investigate how well DCNNs classify nodes within a single graph. The graphs were constructed from the Cora and Pubmed datasets, which each cons... |
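
The training setup quoted in this context (AdaGrad with a learning rate of 0.05, zero-mean normal initialization with variance 0.01, and windowed early stopping) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' Lasagne/Theano code: the function names, the window size of 5, and the `eps` constant are assumptions, and the update is written as descent on a loss although the excerpt phrases it as gradient ascent.

```python
import math
import random


def init_weights(n, variance=0.01, seed=0):
    """Sample n weights from a zero-mean normal with the given variance."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [rng.gauss(0.0, std) for _ in range(n)]


def adagrad_step(weights, grads, accum, lr=0.05, eps=1e-8):
    """Apply one AdaGrad update and return the new weights.

    accum holds the running sum of squared gradients per weight and is
    updated in place; eps is an assumed constant for numerical stability.
    """
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        accum[i] += g * g
        new_weights.append(w - lr * g / (math.sqrt(accum[i]) + eps))
    return new_weights


def should_stop(val_errors, window=5):
    """Windowed early stopping.

    Stop when the latest validation error exceeds the mean of the
    previous `window` epochs (the window size is an assumption).
    """
    if len(val_errors) <= window:
        return False
    previous = val_errors[-window - 1:-1]
    return val_errors[-1] > sum(previous) / window
```

In a training loop, `should_stop` would be called once per epoch with the list of validation errors collected so far.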

176 | Collective classification in network data
- Sen, Namata, et al.
Citation Context ...enote the exponential diffusion and Laplacian exponential diffusion kernels-on-graphs, respectively, which have previously been shown to perform well on the Cora dataset [1]. These kernel models take the graph structure as input (e.g. node features are not used) and the validation set is used to determine the kernel hyperparameters. ‘CRF-LBP’ indicates a partially-observed conditional random field that uses loopy belief propagation for inference. Results for this model are quoted from prior work [4] that uses the same dataset and experimental protocol. Node Classification Data The Cora corpus [5] consists of 2,708 machine learning papers and the 5,429 citation edges that they share. Each paper is assigned a label drawn from seven possible machine learning subjects, and each paper is represented by a bit vector where each feature corresponds to the presence or absence of a term drawn from a dictionary with 1,433 unique entries. We treat the citation network as an undirected graph. The Pubmed corpus [5] consists of 19,717 scientific papers from the Pubmed database on the subject of diabetes. Each paper is assigned to one of three classes. The citation network that joins the papers consi... |

175 | Theano: a CPU and GPU math expression compiler
- Bergstra, Breuleux, et al.
- 2010
Citation Context ...orm at node and graph classification tasks. In each case we compare DCNNs to other well-known and effective approaches to the task. In each of the following experiments, we use the AdaGrad algorithm [2] for gradient ascent with a learning rate of 0.05. All weights are initialized by sampling from a normal distribution with mean zero and variance 0.01. We choose the hyperbolic tangent for the nonlinear differentiable function f and use the multiclass hinge loss between the model predictions and ground truth as the training objective. The model was implemented in Python using Lasagne and Theano [3]. 3.1 Node classification We ran several experiments to investigate how well DCNNs classify nodes within a single graph. The graphs were constructed from the Cora and Pubmed datasets, which each consist of scientific papers (nodes), citations between papers (edges), and subjects (labels). Protocol In each experiment, the set G consists of a single graph G. During each trial, the input graph’s nodes are randomly partitioned into training, validation, and test sets, with each set having [Table 1: Model; Cora Accuracy, F (micro), F (macro); Pubmed Accuracy, F (micro), F (macro). l1logistic 0.7087 0.7087 0.6829 0....] |

104 | Protein function prediction via graph kernels
- Borgwardt, Ong, et al.
- 2005
Citation Context ...ists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets consist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with whether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines, and each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. MUTAG [8] contains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node features. PTC [9] contains 344 compounds labeled with whether they are carcinogenic in rats with 19 node features. Finally, ENZYMES [10] is a balanced dataset containing 600 proteins with three node features. Results Discussion In contrast with the node classification experiments, there is no clear best model choice across the datasets or evaluation measures. In fact, according to Table 2, the only clear choice is the ‘deepwl’ graph kernel model on the ENZYMES dataset, which significantly outperforms the other methods in terms of accuracy and micro- and macro-averaged F measure. Furthermore, as shown in Figure 3, there is no clear benefit to broadening the search breadth H. These results suggest that, while diffusion processe... |

100 | Graph kernels
- Vishwanathan, Schraudolph, et al.
Citation Context ...fect of search breadth (3c) performance of each model as a function of the proportion of the remaining graphs that are made available for training. Baseline Methods As a simple baseline, we apply linear classifiers to the average feature vector of each graph; ‘l1logistic’ and ‘l2logistic’ indicate ℓ1 and ℓ2-regularized logistic regression applied as described. ‘deepwl’ indicates the Weisfeiler-Lehman (WL) subtree deep graph kernel. Deep graph kernels decompose a graph into substructures, treat those substructures as words in a sentence, and fit a word-embedding model to obtain a vectorization [6]. Graph Classification Data We apply DCNNs to a standard set of graph classification datasets that consists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets consist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with whether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines, and each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. MUTAG [8] contains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node features. PTC [9] contains 34... |

68 | Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity
- Debnath, Compandre, et al.
- 1991
Citation Context ...h into substructures, treat those substructures as words in a sentence, and fit a word-embedding model to obtain a vectorization [6]. Graph Classification Data We apply DCNNs to a standard set of graph classification datasets that consists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets consist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with whether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines, and each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. MUTAG [8] contains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node features. PTC [9] contains 344 compounds labeled with whether they are carcinogenic in rats with 19 node features. Finally, ENZYMES [10] is a balanced dataset containing 600 proteins with three node features. Results Discussion In contrast with the node classification experiments, there is no clear best model choice across the datasets or evaluation measures. In fact, according to Table 2, the only clear choice is the ‘deepwl’ graph kernel model on the ENZYMES dataset, which significantly outper... |

60 | Scene segmentation with CRFs learned from partially labeled images.
- Verbeek, Triggs
- 2007
Citation Context ...c relational model for semi-supervised learning. Furthermore, DCNNs offer this performance at considerably lower computational cost. Learning the parameters of both DCNNs and partially-observed CRFs involves numerically minimizing a nonconvex objective – the backpropagated error in the case of DCNNs and the negative marginal log-likelihood for CRFs. In practice, the marginal log-likelihood of a partially-observed CRF is computed using a contrast-of-partition-functions approach that requires running loopy belief propagation twice: once on the entire graph and once with the observed labels fixed [17]. This algorithm, and thus each step in the numerical optimization, has exponential time complexity $O(E_t N_t^{C_t})$, where $C_t$ is the size of the maximal clique in $G_t$ [18]. In contrast, the learning subroutine for a DCNN requires only one forward and backward pass for each instance in the training data. The complexity is dominated by the matrix multiplication between the graph definition matrix A and the design matrix V, giving an overall polynomial complexity of $O(N_t^2 F)$. Kernel Methods Kernel methods define similarity measures either between nodes (so-called kernels on graphs) [1] or between gra... |

42 | Comparison of descriptor spaces for chemical compound retrieval and classification.
- Wale, Watson, et al.
- 2007
Citation Context ...ple baseline, we apply linear classifiers to the average feature vector of each graph; ‘l1logistic’ and ‘l2logistic’ indicate ℓ1 and ℓ2-regularized logistic regression applied as described. ‘deepwl’ indicates the Weisfeiler-Lehman (WL) subtree deep graph kernel. Deep graph kernels decompose a graph into substructures, treat those substructures as words in a sentence, and fit a word-embedding model to obtain a vectorization [6]. Graph Classification Data We apply DCNNs to a standard set of graph classification datasets that consists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets consist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with whether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines, and each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. MUTAG [8] contains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node features. PTC [9] contains 344 compounds labeled with whether they are carcinogenic in rats with 19 node features. Finally, ENZYMES [10] is a balanced dataset containing 600 proteins with three node fea... |

31 | Link-based classification
- Sen, Getoor
- 2007
Citation Context ...

| Model | Cora Accuracy | Cora F (micro) | Cora F (macro) | Pubmed Accuracy | Pubmed F (micro) | Pubmed F (macro) |
|---|---|---|---|---|---|---|
| ... | 0.8229 | 0.8229 | 0.8117 | 0.8228 | 0.8228 | 0.8086 |
| CRF-LBP | 0.8449 | – | 0.8248 | – | – | – |
| 2-hop DCNN | 0.8677 | 0.8677 | 0.8584 | 0.8976 | 0.8976 | 0.8943 |

Table 1: A comparison of the performance between baseline ℓ1 and ℓ2-regularized logistic regression models, exponential diffusion and Laplacian exponential diffusion kernel models, loopy belief propagation (LBP) on a partially-observed conditional random field (CRF), and a two-hop DCNN on the Cora and Pubmed datasets. The DCNN offers the best performance according to each measure, and the gain is statistically significant in each case. The CRF-LBP result is quoted from [4], which follows the same experimental protocol.

... the same number of nodes. During training, all node features X, all edges E, and the labels Y of the training and validation sets are visible to the model. We report classification accuracy as well as micro- and macro-averaged F1; each measure is reported as a mean and confidence interval computed from several trials. We also provide learning curves for the Cora and Pubmed datasets. In this experiment, the validation and test set each contain 10% of the nodes, and the amount of training data is varied between 10% and 100% of the remaining node... |
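
The micro- and macro-averaged F1 measures mentioned in this context can be illustrated with a short sketch. It assumes single-label multiclass predictions, as in the node classification setting described; the function name `f1_scores` and the toy labels are hypothetical. Under that assumption micro-averaged F1 reduces to accuracy, which would be consistent with the identical Accuracy and F (micro) values quoted above.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}  # true positives per class
    fp = {c: 0 for c in labels}  # false positives per class
    fn = {c: 0 for c in labels}  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


micro, macro = f1_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Macro averaging weights every class equally, so it drops below the micro average when rare classes are predicted poorly.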

20 | Statistical evaluation of the predictive toxicology challenge 2000–2001.
- Toivonen, Srinivasan, et al.
- 2003
Citation Context ...ctorization [6]. Graph Classification Data We apply DCNNs to a standard set of graph classification datasets that consists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets consist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with whether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines, and each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. MUTAG [8] contains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node features. PTC [9] contains 344 compounds labeled with whether they are carcinogenic in rats with 19 node features. Finally, ENZYMES [10] is a balanced dataset containing 600 proteins with three node features. Results Discussion In contrast with the node classification experiments, there is no clear best model choice across the datasets or evaluation measures. In fact, according to Table 2, the only clear choice is the ‘deepwl’ graph kernel model on the ENZYMES dataset, which significantly outperforms the other methods in terms of accuracy and micro- and macro-averaged F measure. Furthermore, as shown in Figure... |

14 | An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification.
- Fouss, Francoisse, et al.
- 2012
Citation Context ...that computes the activations. So, for node classification tasks, the diffusion-convolutional representation of graph t, $Z_t$, will be an $N_t \times H \times F$ tensor, as illustrated in Figure 1a; for graph classification tasks, $Z_t$ will be an $H \times F$ matrix, as illustrated in Figure 1b. The model is built on the idea of a diffusion kernel, which can be thought of as a measure of the level of connectivity between any two nodes in a graph when considering all paths between them, with longer paths being discounted more than shorter paths. Diffusion kernels provide an effective basis for node classification tasks [1]. The term ‘diffusion-convolution’ is meant to evoke the ideas of feature learning, parameter tying, and invariance that are characteristic of convolutional neural networks. The core operation of a DCNN is a mapping from nodes and their features to the results of a diffusion process that begins at that node. In contrast with standard CNNs, DCNN parameters are tied according to diffusion search depth rather than their position in a grid. The diffusion-convolutional representation is invariant with respect to node index rather than position; in other words, the diffusion-convolutional activations... |
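
As a rough illustration of the N × H × F representation described in this context, the sketch below diffuses node features over successive powers of a degree-normalized transition matrix and applies one weight per (hop, feature) pair, so parameters are tied by search depth rather than by node. It is a pure-Python sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the power series is assumed to start at the identity (hop 0), and tanh stands in for the nonlinearity f.

```python
import math


def transition_matrix(adj):
    """Degree-normalize each row of an adjacency matrix (list of lists)."""
    result = []
    for row in adj:
        degree = sum(row)
        result.append([a / degree if degree else 0.0 for a in row])
    return result


def matmul(a, b):
    """Plain-Python product of two dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diffusion_activations(adj, features, weights, f=math.tanh):
    """Return Z as nested lists with shape N x H x F for one graph.

    adj is N x N, features is N x F, and weights is H x F. Hop j uses
    the j-th power of the transition matrix, starting from the identity
    (an assumption of this sketch).
    """
    n = len(adj)
    p = transition_matrix(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    z = [[None] * len(weights) for _ in range(n)]
    for j, w_row in enumerate(weights):
        diffused = matmul(power, features)  # N x F: features after j hops
        for i in range(n):
            z[i][j] = [f(w * v) for w, v in zip(w_row, diffused[i])]
        power = matmul(power, p)
    return z
```

Because the weights depend only on hop and feature, relabeling the nodes permutes the first axis of Z without changing the activations themselves.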

12 | The Graph Neural Network model
- Scarselli, Gori, et al.
- 2009
Citation Context ...cal partitioning of the node set. In the same paper, the authors propose a spectral method that extends the notion of convolution to graph spectra. Later, [12] applied these techniques to data where a graph is not immediately present but must be inferred. DCNNs, which fall within the spatial category, are distinct from this work because their parameterization makes them transferable; a DCNN learned on one graph can be applied to another. A related branch of work has focused on extending convolutional neural networks to domains where the structure of the graph itself is of direct interest [13, 14, 15]. For example, [15] construct a deep convolutional model that learns real-valued fingerprint representations of chemical compounds. Probabilistic Relational Models DCNNs also share strong ties to probabilistic relational models (PRMs), a family of graphical models that are capable of representing distributions over relational data [16]. In contrast to PRMs, DCNNs are deterministic, which allows them to avoid the exponential blowup in learning and inference that hampers PRMs. Our results suggest that DCNNs outperform partially-observed conditional random fields, the state-of-the-art probabi... |

8 | Neural network for graphs: a contextual constructive approach
- Micheli
- 2009
Citation Context ...cal partitioning of the node set. In the same paper, the authors propose a spectral method that extends the notion of convolution to graph spectra. Later, [12] applied these techniques to data where a graph is not immediately present but must be inferred. DCNNs, which fall within the spatial category, are distinct from this work because their parameterization makes them transferable; a DCNN learned on one graph can be applied to another. A related branch of work has focused on extending convolutional neural networks to domains where the structure of the graph itself is of direct interest [13, 14, 15]. For example, [15] construct a deep convolutional model that learns real-valued fingerprint representations of chemical compounds. Probabilistic Relational Models DCNNs also share strong ties to probabilistic relational models (PRMs), a family of graphical models that are capable of representing distributions over relational data [16]. In contrast to PRMs, DCNNs are deterministic, which allows them to avoid the exponential blowup in learning and inference that hampers PRMs. Our results suggest that DCNNs outperform partially-observed conditional random fields, the state-of-the-art probabi... |

8 | Efficient Inference in Large Conditional Random Fields.
- Cohn
- 2006
Citation Context ... DCNNs and partially-observed CRFs involves numerically minimizing a nonconvex objective – the backpropagated error in the case of DCNNs and the negative marginal log-likelihood for CRFs. In practice, the marginal log-likelihood of a partially-observed CRF is computed using a contrast-of-partition-functions approach that requires running loopy belief propagation twice: once on the entire graph and once with the observed labels fixed [17]. This algorithm, and thus each step in the numerical optimization, has exponential time complexity $O(E_t N_t^{C_t})$, where $C_t$ is the size of the maximal clique in $G_t$ [18]. In contrast, the learning subroutine for a DCNN requires only one forward and backward pass for each instance in the training data. The complexity is dominated by the matrix multiplication between the graph definition matrix A and the design matrix V, giving an overall polynomial complexity of $O(N_t^2 F)$. Kernel Methods Kernel methods define similarity measures either between nodes (so-called kernels on graphs) [1] or between graphs (graph kernels) and these similarities can serve as a basis for prediction via the kernel trick. The performance of graph kernels can be improved by decomposing... |

6 | Spectral networks and locally connected networks on graphs. arXiv.org
- Bruna, Zaremba, et al.
- 2014
Citation Context ...ctured data. As a consequence of constructing the latent representation from diffusion processes that begin at each node, we may fail to encode useful long-range spatial dependencies between individual nodes or other non-local graph behavior. 5 Related Work In this section we describe existing approaches to the problems of semi-supervised learning, graph classification, and edge classification, and discuss their relationship to DCNNs. Other Graph-Based Neural Network Models Other researchers have investigated how CNNs can be extended from grid-structured to more general graph-structured data. [11] propose a spatial method with ties to hierarchical clustering, where the layers of the network are defined via a hierarchical partitioning of the node set. In the same paper, the authors propose a spectral method that extends the notion of convolution to graph spectra. Later, [12] applied these techniques to data where a graph is not immediately present but must be inferred. DCNNs, which fall within the spatial category, are distinct from this work because their parameterization makes them transferable; a DCNN learned on one graph can be applied to another. A related branch of work that has f... |

4 | Convolutional Networks on Graphs for Learning Molecular Fingerprints.
- Duvenaud, Maclaurin, et al.
- 2015
Citation Context ...cal partitioning of the node set. In the same paper, the authors propose a spectral method that extends the notion of convolution to graph spectra. Later, [12] applied these techniques to data where a graph is not immediately present but must be inferred. DCNNs, which fall within the spatial category, are distinct from this work because their parameterization makes them transferable; a DCNN learned on one graph can be applied to another. A related branch of work has focused on extending convolutional neural networks to domains where the structure of the graph itself is of direct interest [13, 14, 15]. For example, [15] construct a deep convolutional model that learns real-valued fingerprint representations of chemical compounds. Probabilistic Relational Models DCNNs also share strong ties to probabilistic relational models (PRMs), a family of graphical models that are capable of representing distributions over relational data [16]. In contrast to PRMs, DCNNs are deterministic, which allows them to avoid the exponential blowup in learning and inference that hampers PRMs. Our results suggest that DCNNs outperform partially-observed conditional random fields, the state-of-the-art probabi... |