DP-Graph Convolutional Networks

Graph Convolutional Networks (GCNs) are a type of neural network designed to work directly on graphs and leverage their structural information. In this post, I will introduce how information is propagated through the hidden layers of a GCN.

Definition

A graph convolutional network is a network that operates on graphs. Given a graph $G=(V,E)$, a GCN takes as input:

  • an input feature matrix $X$ with size $N \times F^0$, where $N$ is the number of nodes and $F^0$ is the number of input features for each node, and
  • an $N \times N$ matrix representation of the graph structure, such as the adjacency matrix $A$.

A hidden layer in the GCN can thus be written as $H^i = f(H^{i-1}, A)$, where $H^0 = X$ and $f$ is a propagation rule. Each layer $H^i$ corresponds to an $N \times F^i$ feature matrix in which each row is a feature representation of a node. At each layer, these features are aggregated to form the next layer's features using the propagation rule $f$, so the features become increasingly abstract at each consecutive layer. In this framework, variants of GCN differ only in the choice of the propagation rule $f$.
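
To make these inputs concrete, here is a minimal numpy sketch using a small, hypothetical 4-node directed graph and two input features per node (the graph and feature values are illustrative assumptions, not taken from any particular dataset):

```python
import numpy as np

# Adjacency matrix A (N x N) of a small hypothetical directed graph
# with N = 4 nodes and edges: 0 -> 1; 1 -> 2, 1 -> 3; 2 -> 1; 3 -> 0, 3 -> 2.
A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
], dtype=float)

# Input feature matrix X (N x F^0) with F^0 = 2 features per node.
X = np.array([
    [0.,  0.],
    [1., -1.],
    [2., -2.],
    [3., -3.],
])

print(A.shape, X.shape)  # (4, 4) (4, 2)
```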

A simple propagation rule

One of the simplest possible propagation rules is:

$$f(H^i, A) = \sigma\left(A H^i W^i\right)$$

where $W^i$ is the weight matrix for layer $i$ and $\sigma$ is a non-linear activation function such as the ReLU function. Despite its simplicity, this model is already quite powerful.
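
As a rough numpy sketch of this rule (reusing the toy graph and features from above; the weight initialization is arbitrary and only for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def simple_gcn_layer(A, H, W):
    # Simple propagation rule: f(H, A) = sigma(A @ H @ W)
    return relu(A @ H @ W)

# Toy graph and features (N = 4 nodes, F^0 = 2 features per node).
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[0., 0.], [1., -1.], [2., -2.], [3., -3.]])

# Randomly initialized weight matrix mapping F^0 = 2 to F^1 = 3 features.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 3))

H1 = simple_gcn_layer(A, X, W0)
print(H1.shape)  # (4, 3)
```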

But first, let us address two limitations of this simple model. The first is that multiplication with $A$ means that, for every node, we sum up the feature vectors of all neighboring nodes but not the node itself (unless there are self-loops in the graph). We can “fix” this by enforcing self-loops in the graph: we simply add the identity matrix to $A$.

The second major limitation is that $A$ is typically not normalized, and therefore multiplication with $A$ will completely change the scale of the feature vectors (we can understand this by looking at the eigenvalues of $A$). Normalizing $A$ such that all rows sum to one, i.e. using $D^{-1}A$, where $D$ is the diagonal node degree matrix, gets rid of this problem. Multiplying with $D^{-1}A$ now corresponds to taking the average of neighboring node features. In practice, dynamics get more interesting when we use a symmetric normalization, i.e. $D^{-1/2}AD^{-1/2}$ (as this no longer amounts to mere averaging of neighboring nodes). Combining these two tricks, we essentially arrive at the propagation rule:

$$f(H^i, A) = \sigma\left(\hat D^{-1/2} \hat A \hat D^{-1/2} H^i W^i\right)$$

with $\hat A = A + I$, where $I$ is the identity matrix and $\hat D$ is the diagonal node degree matrix of $\hat A$.
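
A minimal numpy sketch of this combined rule (self-loops plus symmetric normalization); the function and variable names are my own and not from any particular library:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_layer(A, H, W):
    # f(H, A) = sigma(D_hat^{-1/2} @ A_hat @ D_hat^{-1/2} @ H @ W)
    A_hat = A + np.eye(A.shape[0])            # add self-loops: A_hat = A + I
    d_hat = A_hat.sum(axis=1)                 # node degrees of A_hat (all >= 1)
    D_hat_inv_sqrt = np.diag(d_hat ** -0.5)   # D_hat^{-1/2}
    return relu(D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt @ H @ W)
```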

Spatial-Based Convolution

Spectral-Based Convolution

A recent paper by Kipf and Welling proposes fast approximate spectral graph convolutions using the spectral propagation rule:

$$f(H^i, A) = \sigma\left(\hat D^{-1/2} \hat A \hat D^{-1/2} H^i W^i\right)$$

Compared to the sum and mean rules discussed above, the spectral rule differs only in the choice of aggregate function. Although it is somewhat similar to the mean rule in that it normalizes the aggregate using the degree matrix $D$ raised to a negative power, the normalization is symmetric.
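
A quick numpy sanity check of that symmetry claim, using a small hypothetical undirected 3-node graph (compare the row-normalized matrix $D^{-1}A$ of the mean rule with the symmetrically normalized $D^{-1/2}AD^{-1/2}$):

```python
import numpy as np

# Hypothetical undirected graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

mean_norm = np.linalg.inv(D) @ A                       # D^{-1} A (mean rule)
spectral_norm = (np.diag(np.diag(D) ** -0.5) @ A
                 @ np.diag(np.diag(D) ** -0.5))        # D^{-1/2} A D^{-1/2}

print(np.allclose(mean_norm, mean_norm.T))          # False: row normalization is not symmetric
print(np.allclose(spectral_norm, spectral_norm.T))  # True: spectral normalization is symmetric
```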

Aggregation as a Weighted Sum

The sum rule

To see how the aggregate feature representation of the $i$th node is computed using the sum rule, we look at how the $i$th row of the aggregate $AX$ is computed:

$$(AX)_i = \sum_{j=1}^{N} A_{i,j} X_j$$

The contribution of the $j$th node in the aggregate is determined by the value of the $j$th column of the $i$th row of $A$. Since $A$ is an adjacency matrix, this value is 1 if the $j$th node is a neighbour of the $i$th node, and is otherwise 0. Thus, the sum rule corresponds to summing up the feature representations of the neighbors of the $i$th node.
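
In numpy, using the toy graph and features from earlier (an illustrative example with $i = 1$):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[0., 0.], [1., -1.], [2., -2.], [3., -3.]])

aggregate = A @ X   # sum rule

i = 1
# Node 1's neighbors are nodes 2 and 3, so row 1 of A @ X is X[2] + X[3].
print(aggregate[i])   # [5. -5.]
print(X[2] + X[3])    # [5. -5.]
```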

The mean rule

To see how the mean rule aggregates node representations, we again look at how the $i$th row of the aggregate is computed, now using the mean rule. For simplicity, we only consider the mean rule applied to the “raw” adjacency matrix, without adding the identity matrix $I$ to $A$ (which would simply correspond to adding self-loops to the graph):

$$(D^{-1}AX)_i = \frac{1}{D_{i,i}} \sum_{j=1}^{N} A_{i,j} X_j$$

As with the sum rule, we sum over the $N$ entries of the $i$th row of the adjacency matrix $A$, which corresponds to summing over the $i$th node's neighbors. However, the weights in this weighted sum are now guaranteed to sum to 1, because each is divided by the degree $D_{i,i}$ of the $i$th node. Thus, the equation above corresponds to a mean over the feature representations of the neighbors of the $i$th node.
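
The same toy example with the mean rule (again purely illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[0., 0.], [1., -1.], [2., -2.], [3., -3.]])

D = np.diag(A.sum(axis=1))              # diagonal node degree matrix
aggregate = np.linalg.inv(D) @ A @ X    # mean rule: D^{-1} A X

i = 1
# Node 1 has degree 2, so row 1 is the average of X[2] and X[3].
print(aggregate[i])        # [ 2.5 -2.5]
print((X[2] + X[3]) / 2)   # [ 2.5 -2.5]
```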

The spectral rule

As with the mean rule, we transform the adjacency matrix $A$ using the degree matrix $D$. However, as shown in the equation below, we raise the degree matrix to the power of $-0.5$ and multiply it on each side of $A$. Recall again that degree matrices (and powers thereof) are diagonal. The $i$th row of the aggregate $D^{-1/2} A D^{-1/2} X$ is therefore:

$$(D^{-1/2} A D^{-1/2} X)_i = \sum_{j=1}^{N} \frac{A_{i,j}}{\sqrt{D_{i,i}}\sqrt{D_{j,j}}} X_j$$

The equation above shows something quite interesting. When computing the aggregate feature representation of the $i$th node, we not only take into consideration the degree of the $i$th node, but also the degree of the $j$th node.

Similar to the mean rule, the spectral rule normalizes the aggregate such that the aggregate feature representation remains roughly on the same scale as the input features. However, the spectral rule weighs a neighbor in the weighted sum higher if it has a low degree and lower if it has a high degree. This may be useful when low-degree neighbors provide more useful information than high-degree neighbors.
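
A small numpy illustration of this degree-dependent weighting, on a hypothetical 5-node undirected graph in which node 0 has one low-degree and one high-degree neighbor:

```python
import numpy as np

# Node 0 is connected to node 1 (degree 1) and node 2 (degree 3).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)
X = np.arange(5, dtype=float).reshape(-1, 1)   # one feature per node

d = A.sum(axis=1)                              # node degrees
D_inv_sqrt = np.diag(d ** -0.5)
aggregate = D_inv_sqrt @ A @ D_inv_sqrt @ X    # spectral rule: D^{-1/2} A D^{-1/2} X

# Weight of neighbor j in row i is A_ij / (sqrt(D_ii) * sqrt(D_jj)).
w_low  = 1 / np.sqrt(d[0] * d[1])   # weight of the low-degree neighbor  (~0.71)
w_high = 1 / np.sqrt(d[0] * d[2])   # weight of the high-degree neighbor (~0.41)
print(aggregate[0])                  # ~[1.52]
print(w_low * X[1] + w_high * X[2])  # same value
```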

References

How to do Deep Learning on Graphs with Graph Convolutional Networks

Hung-yi Lee (李宏毅), GNN lecture