Recent advances in neural networks have driven progress in data mining and pattern recognition. End-to-end deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and autoencoders, have breathed new life into traditional machine learning tasks such as object detection, machine translation, and speech recognition.

Deep learning is effective at uncovering hidden patterns in Euclidean data (images, text, video). But what about applications whose data comes from non-Euclidean domains, typically represented as a graph with intricate interdependencies and relationships among objects?

This is where graph ML comes into play. In this blog, I attempt to explain why graph convolutional networks were developed and the mathematics behind them.

What are networks/graphs?

Networks are a language for describing large systems of interacting components, and graphs can be used to represent a wide variety of data types.

Social networks, communication networks, networks of neurons, molecules, and event graphs are all examples of networks.

In computer science, a graph is a data structure with two components: nodes (vertices) and edges. A graph is defined as G = (V, E), where V is the set of nodes and E is the set of edges between them.

Edges are directed if there are directional dependencies between nodes. Otherwise, edges are undirected.

Graphs are frequently represented by an adjacency matrix A. If a graph has n nodes, A has dimensions (n × n). Nodes may also possess a set of features: if each node has f features, the node feature matrix X has dimensions (n × f).
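To make these shapes concrete, here is a tiny illustrative example of my own (the names A_toy and X_toy are hypothetical and separate from the worked example later in the post): an undirected triangle graph with n = 3 nodes and f = 2 features per node.

import numpy as np

# adjacency matrix of an undirected 3-node triangle graph: shape (n x n) = (3 x 3)
A_toy = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
])

# node feature matrix: shape (n x f) = (3 x 2), one row of features per node
X_toy = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

print(A_toy.shape, X_toy.shape)  # (3, 3) (3, 2)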

Why is graph analysis difficult?

The complexity of graph data has presented numerous hurdles for conventional machine learning systems.

Conventional machine learning and deep learning tools focus on simple data types, such as images with a fixed size and structure, which can be viewed as fixed-size grid graphs. Text and speech are sequences, so they can be viewed as line graphs.

There are, however, more sophisticated graphs without a fixed structure, with variable-sized, unordered node sets in which each node can have a different number of neighbours. Moreover, existing machine learning methods assume that instances are independent of one another. This does not hold for graph data, since each node is connected to other nodes by various types of links.

Introduction to graph representation learning

Node embedding is a basic notion in graph representation learning. Each node in the graph is mapped to a d-dimensional embedding space (a low-dimensional space rather than the actual dimension of the graph) in such a way that nodes with similar properties in the graph are embedded close to one another.


Embedding nodes in a network involves three steps:

  • Define an encoder
  • Define a node similarity function
  • Optimize the parameters of the encoder so that similarity in the embedding space approximates similarity in the original network

Encoder: Maps a node to a low-dimensional vector

$$Enc(v) = Zv $$

Similarity function: Specifies how the relationships in vector space map to the relationships in the original network

$$\text{similarity}(u, v) \approx Z_u^\top Z_v$$
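As a rough sketch of what the optimization step means (toy numbers and a hypothetical target score of my own): we want the dot product of two embeddings to reproduce the similarity of the corresponding nodes in the original network.

import numpy as np

np.random.seed(0)
d, n = 8, 5                       # embedding size, number of nodes
Z = np.random.randn(d, n)         # one d-dimensional embedding per node (as columns)

def dot_similarity(Z, u, v):
    return Z[:, u] @ Z[:, v]      # similarity in the embedding space: Z_u^T Z_v

target = 1.0                      # hypothetical similarity of nodes 0 and 1 in the network
loss = (dot_similarity(Z, 0, 1) - target) ** 2   # one term of a squared-error objective
print(loss)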

Shallow encoders vs Deep graph encoders

Graphs can be encoded using two families of techniques: shallow encoders and deep encoders.

“Shallow” Encoding

“Shallow” encoding is the simplest way to encode. It means that the encoder is just an embedding-lookup, and it could be written as:

$$Enc(v) = Zv $$

Each column in Z is the embedding of one node, and the number of rows in Z equals the size of the embeddings. v is the indicator vector for node v: it is all zeros except for a one at the position corresponding to node v. In “shallow” encoding, each node is therefore assigned its own unique embedding vector.
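In code, a shallow encoder is literally a column lookup in Z. Here is a minimal sketch with toy numbers of my own (d = 4, n = 6):

import numpy as np

d, n = 4, 6                    # embedding size and number of nodes
np.random.seed(0)
Z = np.random.randn(d, n)      # embedding matrix: one column per node

def enc(v):
    indicator = np.zeros(n)    # indicator vector: all zeros ...
    indicator[v] = 1.0         # ... except a one at the position of node v
    return Z @ indicator       # Enc(v) = Z · v picks out column v of Z

print(np.allclose(enc(2), Z[:, 2]))  # True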


“Shallow Encoders” Have Their Limits

Since each node has its own embedding, shallow encoders do not scale: the number of parameters grows linearly with the number of nodes. They are also transductive by nature, so they can only produce embeddings for a single, fixed graph. Finally, node features are not taken into account.

Different types of shallow encoders:

  • Random walk
  • Node2vec
  • TransE

“Deep” Encoding

State-of-the-art performance is achieved using deep graph encoders.

$$Enc(v) = \text{multiple layers of non-linear transforms}$$

Graph Convolutional Networks (GCN)

GCNs are a type of deep graph encoder. Most neural networks operate on fixed-size, regularly structured inputs such as images and text. However, real-world graphs do not have a fixed size or structure. Let’s have a look at the computational graphs of GCNs.

Given a graph G = (V,A,X) such that:

  • V is the vertex set
  • A is the adjacency matrix
  • X ∈ R^(|V|×f) is the node feature matrix


The computational graph must preserve the structure of graph G and incorporate the features of each node's neighbors. For instance, the embedding vector of node A should be built from the features of its neighbors B, C, and D, without regard to their order. One technique to accomplish this is averaging the features of B, C, and D.
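In plain numpy, this order-invariant aggregation for node A could look like the following sketch (the feature vectors for B, C, and D are hypothetical; the matrix form of this operation is worked through step by step below):

import numpy as np

# hypothetical feature vectors for A's neighbours B, C, and D
features = {
    "B": np.array([1.0, 0.0]),
    "C": np.array([0.0, 1.0]),
    "D": np.array([1.0, 1.0]),
}

message_to_A = np.mean([features["B"], features["C"], features["D"]], axis=0)
print(message_to_A)  # [0.66666667 0.66666667]

# averaging is invariant to the order in which the neighbours are listed
print(np.allclose(message_to_A,
                  np.mean([features["D"], features["B"], features["C"]], axis=0)))  # True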

With two layers, the computational graph of G will appear as follows:

[Figure: the two-layer computational graph of G]

The computational graph for node A:

[Figure: the computational graph for node A]

Analogy between image convolution and graph convolution

CNNs and GCNs have a number of operational commonalities.


In a CNN convolution layer, a window slides across the image and aggregates the local representations at each position before moving to the next stride. In a graph convolution layer, a node's representation is created by aggregating the representations of its adjacent nodes, comparable to aggregating adjacent pixels in a CNN.
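To make the analogy concrete, here is a small illustrative sketch (toy numbers of my own, not from the original post) comparing the two aggregation steps:

import numpy as np

# image "convolution": aggregate a pixel's 3x3 neighbourhood (here, by averaging)
image = np.arange(25, dtype=float).reshape(5, 5)
window = image[1:4, 1:4]            # 3x3 window centred on pixel (2, 2)
pixel_agg = window.mean()

# graph "convolution": aggregate a node's neighbours (here, by averaging)
neighbour_features = np.array([[1., 0.],
                               [0., 1.],
                               [1., 1.]])
node_agg = neighbour_features.mean(axis=0)

print(pixel_agg)  # 12.0
print(node_agg)   # [0.66666667 0.66666667]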

Graph convolutional layer and propagation rule

Every neural network layer can then be written as a nonlinear function of the previous layer's activations H^(l) and the adjacency matrix A, with H^(0) = X:

$$H^{(l+1)} = f(H^{(l)},A) $$

The propagation rule:

$$f(H^{(l)},A) = \sigma (\hat{D}^{-{1 \over 2}}\hat{A}\hat{D}^{-{1 \over 2}}H^{(l)}W^{(l)})$$

where Â = A + I is the adjacency matrix with self-connections added, D̂ is the diagonal degree matrix of Â, W^(l) is the weight matrix of layer l, and σ is a non-linear activation such as ReLU.

Let’s try to understand this equation using code

1. Imports

import numpy as np
import networkx as nx

2. The adjacency matrix

# adjacency matrix: A[i, j] = 1 means node i is connected to node j
A = np.matrix([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0]],
    dtype=float
)

# build a directed networkx graph (nx.from_numpy_matrix was removed in newer networkx versions)
G = nx.from_numpy_array(np.asarray(A), create_using=nx.DiGraph)

This is what the graph looks like: four nodes, with directed edges 0 → 1, 1 → 2, 1 → 3, 2 → 1, and 3 → 0.

3. The feature matrix

# two possible feature vectors: type "a" = [1, 0] and type "b" = [0, 1]
a = [1,0]
b = [0,1]

X = np.matrix([a,b,a,b]) # node feature matrix: nodes 0-3 have features a, b, a, b
print(X)

#matrix([[1, 0],
#        [0, 1],
#        [1, 0],
#        [0, 1]])

4. Propagation rule

Multiplying the adjacency matrix by the feature matrix

print(A @ X)

#matrix([[0., 1.],
#        [1., 1.],
#        [0., 1.],
#        [1., 0.]])

#  a b
#a 0 1   a is connected to 0 a and 1 b (Node 0) 
#b 1 1   b is connected to 1 a and 1 b (Node 1)
#a 0 1   a is connected to 0 a and 1 b (Node 2)
#b 1 0   b is connected to 1 a and 0 b (Node 3)

#The aggregated representation of a node does not include its own features!

Creating an Identity matrix

I = np.matrix(np.eye(A.shape[0]))
print(I)

# matrix([[1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.]])

Adding self-connections: Â = A + I

A_hat = A + I
print(A_hat)

# matrix([[1., 1., 0., 0.],
#         [0., 1., 1., 1.],
#         [0., 1., 1., 0.],
#         [1., 0., 0., 1.]])

Multiplying Â by the feature matrix

print(A_hat @ X)

# matrix([[1., 1.],
#         [1., 2.],
#         [1., 1.],
#         [1., 1.]])

#  a b
#a 1 1   a is connected to 1 a and 1 b (Node 0) 
#b 1 2   b is connected to 1 a and 2 b (Node 1)
#a 1 1   a is connected to 1 a and 1 b (Node 2)
#b 1 1   b is connected to 1 a and 1 b (Node 3)

#The aggregated representation of a node does include its own features!

Yay, now the representations of the nodes include their own features!

Now normalizing the representations

D_hat = np.array(np.sum(A_hat, axis=0))[0] # degrees of each node in A_hat
D_hat = np.matrix(np.diag(D_hat))          # diagonal degree matrix of A_hat
print(D_hat)

# matrix([[2., 0., 0., 0.],
#         [0., 3., 0., 0.],
#         [0., 0., 2., 0.],
#         [0., 0., 0., 2.]])

print(D_hat**-1 @ A_hat @ X)

# matrix([[0.5       , 0.5       ],
#         [0.33333333, 0.66666667],
#         [0.5       , 0.5       ],
#         [0.5       , 0.5       ]])

# Result,
#
# matrix([[1., 1.],
#         [1., 2.],
#         [1., 1.],
#         [1., 1.]])
#
# is normalized to
#
# matrix([[0.5       , 0.5       ],
#         [0.33333333, 0.66666667],
#         [0.5       , 0.5       ],
#         [0.5       , 0.5       ]])

A better way to normalize (symmetric normalization)

D_hat_inv = np.sqrt(D_hat**-1) # D_hat^(-1/2): element-wise sqrt works because D_hat is diagonal
H = D_hat_inv @ A_hat @ D_hat_inv @ X
print(H)

# matrix([[0.5       , 0.40824829],
#         [0.40824829, 0.74158162],
#         [0.5       , 0.40824829],
#         [0.5       , 0.5       ]])
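As a quick check of what this symmetric normalization does (S is a name I am introducing here just for inspection): each entry of Â is divided by the square roots of both endpoints' degrees, rather than only by the degree of the aggregating node.

S = D_hat_inv @ A_hat @ D_hat_inv   # each A_hat[i, j] divided by sqrt(D_hat[i, i] * D_hat[j, j])
print(S)

# matrix([[0.5       , 0.40824829, 0.        , 0.        ],
#         [0.        , 0.33333333, 0.40824829, 0.40824829],
#         [0.        , 0.40824829, 0.5       , 0.        ],
#         [0.5       , 0.        , 0.        , 0.5       ]])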

Assigning weights

# random weight matrix: 2 input features -> 5 hidden units, values in [-1, 1)
np.random.seed(0)
W = 2*np.random.rand(2,5) - 1
W = np.matrix(W)
print(W)

# matrix([[ 0.09762701,  0.43037873,  0.20552675,  0.08976637, -0.1526904 ],
#         [ 0.29178823, -0.12482558,  0.783546  ,  0.92732552, -0.23311696]])

Finding the outputs of the hidden layer

Z = H @ W
print(Z)

# matrix([[ 0.16793555,  0.16422954,  0.42264469,  0.42346224, -0.1715148 ],
#         [ 0.25624085,  0.08313303,  0.66496926,  0.72433453, -0.23521085],
#         [ 0.16793555,  0.16422954,  0.42264469,  0.42346224, -0.1715148 ],
#         [ 0.19470762,  0.15277658,  0.49453638,  0.50854594, -0.19290368]])

That is the propagation rule; all that remains is to apply the non-linear activation σ.

5. Building the Graph Conv layer

def GraphConv_layer(A_hat, D_hat, X, W):
    D_hat_inv = np.sqrt(D_hat**-1)  # D_hat^(-1/2)
    # the non-linearity sigma is taken to be ReLU here
    return np.maximum(0, D_hat_inv @ A_hat @ D_hat_inv @ X @ W)

Which is $$f(H^{(l)},A) = \sigma (\hat{D}^{-{1 \over 2}}\hat{A}\hat{D}^{-{1 \over 2}}H^{(l)}W^{(l)})$$
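To tie everything together, here is a minimal sketch of a two-layer GCN built by stacking this layer twice, reusing A_hat, D_hat, and X from above. The weight matrices W1, W2 and the hidden size of 4 are illustrative choices of mine, not part of the original walkthrough:

# two-layer GCN sketch: H0 = X, H1 = f(H0, A), H2 = f(H1, A)
np.random.seed(1)
W1 = np.matrix(2*np.random.rand(2, 4) - 1)   # 2 input features -> 4 hidden units
W2 = np.matrix(2*np.random.rand(4, 2) - 1)   # 4 hidden units   -> 2 output units

H1 = GraphConv_layer(A_hat, D_hat, X, W1)    # first graph conv layer
H2 = GraphConv_layer(A_hat, D_hat, H1, W2)   # second graph conv layer
print(H2.shape)  # (4, 2): a 2-dimensional embedding for each of the 4 nodes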

References

  1. CS224W: Machine Learning with Graphs: https://web.stanford.edu/class/cs224w/
  2. Semi-Supervised Classification with Graph Convolutional Networks: https://tkipf.github.io/graph-convolutional-networks/
  3. Embedding molecules using GCNs