Introduction to GNN Note 1

Basic Knowledge and Recap

Posted by Rasin on August 18, 2020

Introduction

Motivations

Convolutional Neural Networks

Firstly, GNNs are motivated by convolutional neural networks (CNNs). The key properties of CNNs are:

  • local connection
  • shared weights
  • use of multi-layer structures

These properties are also of great importance in solving problems in the graph domain, because

  • graphs are the most typical locally connected structure
  • shared weights reduce the computational cost compared with traditional spectral graph theory
  • the multi-layer structure is the key to dealing with hierarchical patterns, capturing features of various sizes

It is therefore straightforward to think of generalizing CNNs to graphs.

However, it is hard to define localized convolutional filters and pooling operators on graphs, which hinders the transformation of CNNs from the Euclidean to the non-Euclidean domain.

Network Embedding

Graph embedding learns to represent graph nodes, edges, or subgraphs as low-dimensional vectors.

Following the idea of representation learning and the success of word embedding, DeepWalk, regarded as the first graph embedding method based on representation learning, applies the SkipGram model to generated random walks.

Similar approaches such as node2vec, LINE, and TADW also achieved breakthroughs.

However, these methods suffer from two severe drawbacks:

  • no parameters are shared between nodes in the encoder, which leads to computational inefficiency
  • the direct embedding methods lack the ability of generalization, which means they cannot deal with dynamic graphs or be generalized to new graphs

Basics of Math and Graph

Linear Algebra

Vectors

The norm of a vector measures its length. The $L_p$ norm is defined as follows:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}$$

The $L_1$ norm can be simplified as

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

In Euclidean space, the $L_2$ norm is used to measure the length of vectors, where

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

The $L_\infty$ norm is also called the max norm, as

$$\|x\|_\infty = \max_i |x_i|$$
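As a quick check, these norms can be computed with NumPy; this is a minimal sketch using an arbitrary example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])     # an arbitrary example vector

# L1 norm: sum of absolute values
l1 = np.sum(np.abs(x))             # equivalently np.linalg.norm(x, 1)

# L2 norm: Euclidean length
l2 = np.sqrt(np.sum(x ** 2))       # equivalently np.linalg.norm(x, 2)

# Max (L-infinity) norm: largest absolute component
linf = np.max(np.abs(x))           # equivalently np.linalg.norm(x, np.inf)

print(l1, l2, linf)                # 8.0, ~5.10, 4.0
```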

A set of vectors $x_1, x_2, \dots, x_m$ is linearly independent if and only if there does not exist a set of scalars $\lambda_1, \lambda_2, \dots, \lambda_m$, not all zero, such that

$$\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_m x_m = 0$$

Matrix

For each $n \times n$ square matrix $A$, its determinant is defined as

$$\det(A) = \sum_{k_1 k_2 \cdots k_n} (-1)^{\tau(k_1 k_2 \cdots k_n)} a_{1 k_1} a_{2 k_2} \cdots a_{n k_n}$$

where $k_1 k_2 \cdots k_n$ is a permutation of $1, 2, \dots, n$ and $\tau(k_1 k_2 \cdots k_n)$ is the inversion number of the permutation $k_1 k_2 \cdots k_n$, i.e., the number of inverted pairs in the permutation.

There is a frequently used product between matrices called the Hadamard product. The Hadamard product of two matrices $A, B$ is a matrix $C$, where

$$C_{ij} = A_{ij} B_{ij}$$
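Both the determinant and the Hadamard product are one-liners in NumPy; the following sketch uses two arbitrary $2 \times 2$ matrices purely for illustration:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Determinant of a square matrix
print(np.linalg.det(A))            # -2.0

# Hadamard (element-wise) product: C_ij = A_ij * B_ij
C = A * B
print(C)                           # [[ 5. 12.] [21. 32.]]
```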

Tensor

A tensor is an array with an arbitrary number of dimensions. Most matrix operations can also be applied to tensors.

Eigen Decomposition

Let $A$ be a matrix. A nonzero vector $v$ is called an eigenvector of $A$ if there exists a scalar $\lambda$ such that

$$Av = \lambda v$$

Here the scalar $\lambda$ is an eigenvalue of $A$ corresponding to the eigenvector $v$. If matrix $A$ has $n$ eigenvectors $v_1, v_2, \dots, v_n$ that are linearly independent, corresponding to the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, then it can be deduced that

$$A \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix} = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix} \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$

Let

$$V = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix}$$

Then it is clear that $V$ is an invertible matrix, and we have the eigendecomposition of $A$ (also called diagonalization):

$$A = V \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) V^{-1}$$

It can also be written in the following form:

$$A = \sum_{i=1}^{n} \lambda_i v_i v_i^{\top}$$

However, not all square matrices can be diagonalized in this form, because a matrix may not have $n$ linearly independent eigenvectors. It can be proved that every real symmetric matrix has an eigendecomposition.

A matrix $A$ and its transpose $A^{\top}$ share the same eigenvalues.
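A minimal NumPy sketch of the eigendecomposition above, using a small symmetric matrix (chosen arbitrarily) so the decomposition is guaranteed to exist:

```python
import numpy as np

# A real symmetric matrix always has an eigendecomposition
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, V = np.linalg.eig(A)            # columns of V are eigenvectors

# A = V diag(lambda) V^{-1}
A_rebuilt = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, A_rebuilt))         # True

# Sum form A = sum_i lambda_i v_i v_i^T (valid here because the eigenvectors are orthonormal)
A_sum = sum(lam * np.outer(v, v) for lam, v in zip(eigvals, V.T))
print(np.allclose(A, A_sum))             # True
```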

Singular Value Decomposition

As eigendecomposition can only be applied to certain matrices, we introduce the singular value decomposition, which applies to all matrices.

Singular values: let $r$ denote the rank of $A^{\top}A$. Then there exist $r$ positive scalars $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ such that for $1 \le i \le r$, $v_i$ is an eigenvector of $A^{\top}A$ with corresponding eigenvalue $\sigma_i^2$. The $r$ positive scalars $\sigma_1, \sigma_2, \dots, \sigma_r$ are called the singular values of $A$. Then we have the singular value decomposition:

$$A = U \Sigma V^{\top}$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is an $m \times n$ matrix defined as follows:

$$\Sigma_{ij} = \begin{cases} \sigma_i & \text{if } i = j \le r \\ 0 & \text{otherwise} \end{cases}$$

The column vectors of $U$ are eigenvectors of $AA^{\top}$, and the eigenvectors of $A^{\top}A$ make up the column vectors of $V$.
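The same decomposition can be verified numerically with NumPy (a sketch with an arbitrary $3 \times 2$ matrix); note that `np.linalg.svd` returns $V^{\top}$ rather than $V$:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])               # an arbitrary 3x2 matrix

U, s, Vt = np.linalg.svd(A)              # s holds the singular values in descending order

# Rebuild the full 3x2 Sigma and check A = U Sigma V^T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))    # True

# The squared singular values are the eigenvalues of A^T A
print(np.allclose(np.sort(s ** 2), np.sort(np.linalg.eigvals(A.T @ A).real)))  # True
```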

Probability Theory

Basic Concepts and Formulas

In probability theory, a random variable is a variable that takes a random value. If we denote a random variable by $X$, which has two possible values $x_1$ and $x_2$, then the probability that $X$ equals $x_1$ is $P(X = x_1)$. The following equation holds:

$$P(X = x_1) + P(X = x_2) = 1$$

Suppose there is another random variable $Y$ that has $y_1$ as a possible value. The probability that $X = x_1$ and $Y = y_1$ is written as $P(X = x_1, Y = y_1)$, which is called the joint probability of $X = x_1$ and $Y = y_1$.

The probability of $X = x_1$ on the condition that $Y = y_1$ is written as $P(X = x_1 \mid Y = y_1)$. We call this the conditional probability of $X = x_1$ given $Y = y_1$. With the concepts above, we can write the following two fundamental rules of probability theory:

$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

$$P(X = x, Y = y) = P(Y = y \mid X = x)\, P(X = x)$$

The former is the sum rule while the latter is the product rule. The modified product rule can be written as:

$$P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}$$

which is the famous Bayes formula. It also holds for more than two variables:

$$P(X_i = x_i \mid Y = y) = \frac{P(Y = y \mid X_i = x_i)\, P(X_i = x_i)}{\sum_{j=1}^{n} P(Y = y \mid X_j = x_j)\, P(X_j = x_j)}$$
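A small numeric sketch of the Bayes formula for a two-state variable $X$; the prior and likelihood values below are made up purely for illustration:

```python
# Hypothetical prior P(X=x) over two states and likelihoods P(Y=y | X=x)
prior = {"x1": 0.3, "x2": 0.7}
likelihood = {"x1": 0.9, "x2": 0.2}

# Denominator P(Y=y) via the sum and product rules
evidence = sum(likelihood[x] * prior[x] for x in prior)

# Bayes formula: P(X=x_i | Y=y) = P(Y=y | X=x_i) P(X=x_i) / P(Y=y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)   # {'x1': ~0.659, 'x2': ~0.341}
```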

Using the product rule, we can deduce the chain rule:

$$P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1})$$

where $X_1, \dots, X_n$ are $n$ random variables.

The average value of some function $f(x)$ under a probability distribution $P(x)$ is called the expectation of $f(x)$. It can be written as:

$$\mathbb{E}[f(x)] = \sum_{x} P(x) f(x)$$

Usually, $\mathbb{E}[x]$ stands for the expectation of $x$.

To measure the dispersion level of $f(x)$ around its mean value $\mathbb{E}[f(x)]$, we introduce the variance of $f(x)$:

$$\operatorname{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

Covariance expresses the degree to which two variables vary together:

$$\operatorname{Cov}(f(x), g(y)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\right]$$
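These quantities can be estimated from samples with NumPy; the sketch below uses synthetic data whose true mean, variance, and covariance are known:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # samples with E[x]=1, Var(x)=4
y = 0.5 * x + rng.normal(size=100_000)             # correlated with x: Cov(x, y) = 0.5 * Var(x)

print(np.mean(x))                         # ~1.0
print(np.var(x))                          # ~4.0
print(np.mean(x ** 2) - np.mean(x) ** 2)  # same value via E[x^2] - E[x]^2
print(np.cov(x, y)[0, 1])                 # ~2.0
```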

Probability Distribution

Probability distributions describe the probability of a random variable (or several random variables) taking each possible state.

Gaussian distribution: it is also known as the normal distribution:

$$N(x \mid \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

where $\mu$ is the mean of variable $x$ and $\sigma^2$ is the variance.

Bernoulli distribution: random variable $X$ can either be 0 or 1, with a probability $P(X = 1) = p$. Then the distribution function is:

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

It is quite obvious that $\mathbb{E}[X] = p$ and $\operatorname{Var}(X) = p(1 - p)$.

Binomial distribution: repeat the Bernoulli experiment $N$ times and let $Y$ denote the number of times that $X$ equals 1; then

$$P(Y = k) = \binom{N}{k} p^k (1 - p)^{N - k}$$

is the binomial distribution, satisfying $\mathbb{E}[Y] = Np$ and $\operatorname{Var}(Y) = Np(1 - p)$.

Laplace distribution: is described as

$$P(x \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right)$$
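A sketch that draws samples from the distributions above with NumPy and checks their means and variances; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100_000

gauss = rng.normal(loc=0.0, scale=1.0, size=n_samples)   # Gaussian with mu=0, sigma^2=1
bern = rng.binomial(n=1, p=0.3, size=n_samples)          # Bernoulli: E=p, Var=p(1-p)
binom = rng.binomial(n=10, p=0.3, size=n_samples)        # Binomial: E=Np, Var=Np(1-p)
lap = rng.laplace(loc=0.0, scale=1.0, size=n_samples)    # Laplace with mu=0, b=1

print(gauss.mean(), gauss.var())   # ~0.0, ~1.0
print(bern.mean(), bern.var())     # ~0.3, ~0.21
print(binom.mean(), binom.var())   # ~3.0, ~2.1
print(lap.mean(), lap.var())       # ~0.0, ~2.0 (the Laplace variance is 2b^2)
```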

Graph Theory

Basic Concepts

A graph is often denoted by $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. An edge $e = (u, v)$ has two endpoints $u$ and $v$, which are said to be joined by $e$. In this case, $u$ is called a neighbor of $v$, and these two vertices are adjacent. A graph is called a directed graph if all of its edges are directed. The degree of a vertex $v$, denoted by $d(v)$, is the number of edges connected to $v$.

Algebra Representations of Graphs

Adjacency matrix: for a simple graph $G = (V, E)$ with $n$ vertices, it can be described by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, where

$$A_{ij} = \begin{cases} 1 & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

It is obvious that such a matrix is symmetric when $G$ is an undirected graph.

Degree matrix: for a graph $G = (V, E)$ with $n$ vertices, its degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = d(v_i)$.

Laplacian matrix: for a simple graph $G = (V, E)$ with $n$ vertices, if we consider all edges in $G$ to be undirected, then its Laplacian matrix $L \in \mathbb{R}^{n \times n}$ can be defined as

$$L = D - A$$

Thus, we have the elements:

$$L_{ij} = \begin{cases} d(v_i) & \text{if } i = j \\ -1 & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Symmetric normalized Laplacian: the symmetric normalized Laplacian is defined as:

$$L^{sym} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$

The elements are given by:

$$L^{sym}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } d(v_i) \ne 0 \\ -\frac{1}{\sqrt{d(v_i) d(v_j)}} & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Random walk normalized Laplacian: it is defined as:

$$L^{rw} = D^{-1} L = I - D^{-1} A$$

The elements can be computed by:

$$L^{rw}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } d(v_i) \ne 0 \\ -\frac{1}{d(v_i)} & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Incidence matrix: for a directed graph $G = (V, E)$ with $n$ vertices and $m$ edges, the corresponding incidence matrix is $M \in \mathbb{R}^{n \times m}$, where

$$M_{ij} = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } e_j = (v_i, v_k) \\ -1 & \text{if } \exists k \text{ s.t. } e_j = (v_k, v_i) \\ 0 & \text{otherwise} \end{cases}$$

For an undirected graph, the corresponding incidence matrix satisfies:

$$M_{ij} = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } e_j = \{v_i, v_k\} \\ 0 & \text{otherwise} \end{cases}$$
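A minimal NumPy sketch that builds these matrices for a small undirected graph; the four-node edge list is made up for illustration, and every vertex has nonzero degree so the normalized forms are well defined:

```python
import numpy as np

n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]      # a hypothetical undirected graph

# Adjacency matrix A
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Degree matrix D and Laplacian L = D - A
D = np.diag(A.sum(axis=1))
L = D - A

# Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

# Random walk normalized Laplacian: I - D^{-1} A
L_rw = np.eye(n) - np.diag(1.0 / np.diag(D)) @ A

# Incidence matrix M (n x m): M_ij = 1 if v_i is an endpoint of edge e_j (undirected case)
M = np.zeros((n, len(edges)))
for j, (u, v) in enumerate(edges):
    M[u, j] = M[v, j] = 1

print(L)
print(np.allclose(L_sym, D_inv_sqrt @ L @ D_inv_sqrt))   # True
```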

Basics of Neural Network

A neural network learns in the following way: initialized with random weights or values, the connections between neurons update their weights or values by the backpropagation algorithm repeatedly, until the model performs rather precisely. In the end, the knowledge that a neural network has learned is stored in its connections in a digital manner.

Neuron

The basic units of neural networks are neurons, which receive a series of inputs and return the corresponding output. A neuron receives $n$ inputs $x_1, \dots, x_n$ with corresponding weights $w_1, \dots, w_n$ and an offset $b$; it computes the weighted sum of its inputs and passes the result through an activation function.

Several activation functions are shown as follows:

  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  • Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  • ReLU: $\mathrm{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}$

During the training of a neural network, the choice of activation function is usually essential to the outcome.
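The activation functions listed above can be written directly in NumPy; this is a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}); np.tanh computes exactly this
    return np.tanh(x)

def relu(x):
    # ReLU(x) = x if x > 0, else 0
    return np.maximum(x, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z))
```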

Back Propagation

During the training of a neural network, the backpropagation algorithm is most commonly used. It is an algorithm based on gradient descent that optimizes the parameters of a model.

In summary, the backpropagation process consists of the following two steps, sketched numerically after the list:

  • Forward calculation: given a set of parameters and an input, the neural network computes the values at each neuron in a forward order.
  • Backward propagation: compute the error at each variable to be optimized, and update the parameters with their corresponding partial derivatives in a backward order.
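A minimal numerical sketch of these two steps for a single sigmoid neuron with a squared-error loss; the inputs, target, initial weights, and learning rate are all made-up values for illustration:

```python
import numpy as np

x = np.array([0.5, -1.0])   # hypothetical inputs
t = 1.0                     # hypothetical target output
w = np.array([0.1, 0.2])    # initial weights
b = 0.0                     # offset
lr = 0.1                    # learning rate

for step in range(100):
    # Forward calculation: weighted sum, activation, loss
    z = w @ x + b
    y = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
    loss = 0.5 * (y - t) ** 2

    # Backward propagation: chain rule gives the partial derivatives
    dloss_dy = y - t
    dy_dz = y * (1.0 - y)                 # derivative of the sigmoid
    grad_w = dloss_dy * dy_dz * x
    grad_b = dloss_dy * dy_dz

    # Gradient-descent update of the parameters
    w -= lr * grad_w
    b -= lr * grad_b

print(loss)   # the loss decreases toward 0 over the iterations
```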

Neural Networks

Feedforward neural network

The FNN usually contains an input layer, several hidden layers, and an output layer. The feedforward neural network has a clear hierarchical structure, which always consists of multiple layers of neurons, and each layer is only connected to its neighbor layers. There are no loops in this network.

Convolutional neural network

FNNs are usually fully connected networks while CNNs preserve local connectivity. The CNN architecture usually contains convolutional layers, pooling layers, and several fully connected layers. CNNs are widely used in the area of computer vision and have proven to be effective in many other research fields.

Recurrent neural network

The neurons in RNNs receive not only signals and inputs from other neurons but also their own historical information. The memory mechanism in RNNs helps the model process sequential data effectively. However, RNNs usually suffer from the problem of long-term dependencies. RNNs are widely used in the areas of speech and natural language processing.

Graph neural network

The GNN is designed specifically to handle graph-structured data, such as social networks, molecular structures, and knowledge graphs.

