Introduction
Motivations
Convolutional Neural Networks
Firstly, GNNs are motivated by convolutional neural networks. The key properties of CNNs are:
- local connection
- shared weights
- use of multiple layers
These properties are also of great importance for problems in the graph domain, because
- graphs are the most typical locally connected structure
- shared weights reduce the computational cost compared with traditional spectral graph theory
- the multi-layer structure is the key to handling hierarchical patterns, capturing features of various sizes
It is therefore natural to seek a generalization of CNNs to graphs. However, as shown in Figure 1.1, it is hard to define localized convolutional filters and pooling operators on graphs, which hinders the transfer of CNNs from the Euclidean domain to the non-Euclidean domain.
Network Embedding
Graph embedding learns to represent graph nodes, edges, or subgraphs as low-dimensional vectors.
Following the idea of representation learning and the success of word embedding, DeepWalk, regarded as the first graph embedding method based on representation learning, applies the SkipGram model to generated random walks.
Similar approaches such as node2vec, LINE, and TADW also achieved breakthroughs.
However, these methods suffer from two severe drawbacks:
- no parameters are shared between nodes in the encoder, which leads to computational inefficiency
- the direct embedding methods lack the ability of generalization, which means they cannot deal with dynamic graphs or be generalized to new graphs
Basics of Math and Graph
Linear Algebra
Vectors
The norm of a vector measures its length. The $L_p$ norm is defined as follows:

$$||x||_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}}$$

The $L_1$ norm can be simplified as

$$||x||_1 = \sum_{i=1}^{n} |x_i|$$

In Euclidean space, the $L_2$ norm is used to measure the length of vectors, where

$$||x||_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

The $L_\infty$ norm is also called the max norm, as

$$||x||_\infty = \max_i |x_i|$$

A set of vectors $x_1, x_2, \dots, x_m$ is linearly independent if and only if there does not exist a set of scalars $\lambda_1, \lambda_2, \dots, \lambda_m$, not all 0, such that

$$\lambda_1 x_1 + \lambda_2 x_2 + \dots + \lambda_m x_m = 0$$
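As a quick sanity check of these definitions, here is a minimal NumPy sketch (the vector `x` is an arbitrary example) that computes each norm directly and compares it against `np.linalg.norm`:

```python
import numpy as np

# An arbitrary example vector
x = np.array([3.0, -4.0, 1.0])

l1 = np.sum(np.abs(x))                  # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))            # L2 norm: Euclidean length
linf = np.max(np.abs(x))                # L-infinity (max) norm
lp = np.sum(np.abs(x) ** 3) ** (1 / 3)  # general Lp norm with p = 3

# These agree with NumPy's built-in implementations
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
assert np.isclose(lp, np.linalg.norm(x, 3))
```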
Matrix
For each $n \times n$ square matrix $A$, its determinant is defined as

$$\det(A) = \sum_{k_1 k_2 \dots k_n} (-1)^{\tau(k_1 k_2 \dots k_n)} a_{1k_1} a_{2k_2} \dots a_{nk_n}$$

where $k_1 k_2 \dots k_n$ is a permutation of $1, 2, \dots, n$ and $\tau(k_1 k_2 \dots k_n)$ is the inversion number of the permutation $k_1 k_2 \dots k_n$, i.e., the number of inverted pairs in the permutation.

There is a frequently used product between matrices called the Hadamard product. The Hadamard product of two matrices $A, B$ is a matrix $C$, where

$$C_{ij} = A_{ij} B_{ij}$$
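The Hadamard product is simply element-wise multiplication, which in NumPy is the `*` operator. The small sketch below (with made-up matrices) contrasts it with the ordinary matrix product:

```python
import numpy as np

# Two small example matrices (arbitrary values, for illustration only)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

C = A * B   # Hadamard (element-wise) product: C_ij = A_ij * B_ij
D = A @ B   # ordinary matrix product, shown for contrast

print(C)    # [[ 5. 12.] [21. 32.]]
print(D)    # [[19. 22.] [43. 50.]]
```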
Tensor
A tensor is an array with arbitrary dimension. Most matrix operations can also be applied to tensors.
Eigen Decomposition
Let $A$ be an $n \times n$ matrix. A nonzero vector $v$ is called an eigenvector of $A$ if there exists a scalar $\lambda$ such that

$$Av = \lambda v$$

Here the scalar $\lambda$ is an eigenvalue of $A$ corresponding to the eigenvector $v$. If matrix $A$ has $n$ eigenvectors $v_1, v_2, \dots, v_n$ that are linearly independent, corresponding to the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, then it can be deduced that

$$A \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix} = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix} \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$

Let

$$V = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix}$$

Then $V$ is an invertible matrix, and we have the eigendecomposition of $A$ (also called diagonalization):

$$A = V \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) V^{-1}$$

If the eigenvectors are chosen to be orthonormal (as is possible for real symmetric matrices), it can also be written in the following form:

$$A = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$$

However, not all square matrices can be diagonalized in this form, because a matrix may not have as many as $n$ linearly independent eigenvectors. It can be proved that every real symmetric matrix has an eigendecomposition. The matrix $A$ and its transpose $A^\top$ share the same eigenvalues.
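The decomposition can be verified numerically. The sketch below uses an arbitrary real symmetric matrix, so both the diagonalization and the rank-one expansion apply:

```python
import numpy as np

# A real symmetric matrix (chosen arbitrarily), guaranteed to have an
# eigendecomposition with orthonormal eigenvectors
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eigh is specialized for symmetric matrices;
# the columns of V are the eigenvectors
eigvals, V = np.linalg.eigh(A)

# Reconstruct A = V diag(lambda) V^{-1}; for orthonormal V, V^{-1} = V^T
A_rebuilt = V @ np.diag(eigvals) @ V.T
assert np.allclose(A, A_rebuilt)

# Equivalent rank-one expansion: A = sum_i lambda_i v_i v_i^T
A_sum = sum(lam * np.outer(v, v) for lam, v in zip(eigvals, V.T))
assert np.allclose(A, A_sum)
```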
Singular Value Decomposition
As eigendecomposition can only be applied to certain matrices, we introduce the singular value decomposition, which applies to all matrices.
Singular Value: Let $r$ denote the rank of $A^\top A$. Then there exist $r$ positive scalars $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ such that for $1 \le i \le r$, $v_i$ is an eigenvector of $A^\top A$ with corresponding eigenvalue $\sigma_i^2$. The $r$ positive scalars $\sigma_1, \sigma_2, \dots, \sigma_r$ are called the singular values of $A$. Then we have the singular value decomposition:

$$A = U \Sigma V^\top$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is an $m \times n$ matrix defined as follows:

$$\Sigma_{ij} = \begin{cases} \sigma_i & \text{if } i = j \le r \\ 0 & \text{otherwise} \end{cases}$$

The eigenvectors of $AA^\top$ make up the column vectors of $U$, and the eigenvectors of $A^\top A$ make up the column vectors of $V$.
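A minimal numerical illustration (with an arbitrary 2×3 matrix) of the SVD and its relation to the eigenvalues of $A^\top A$:

```python
import numpy as np

# An arbitrary (non-square) example matrix
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

U, s, Vt = np.linalg.svd(A)           # s holds the singular values, descending

# Rebuild the m x n matrix Sigma from the singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)

# Squared singular values are the leading eigenvalues of A^T A (and A A^T)
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s ** 2, eigvals[:len(s)])
```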
Probability Theory
Basic Concepts and Formulas
In probability theory, a random variable is a variable that takes a random value. If we denote a random variable by $X$, which has two possible values $x_1$ and $x_2$, then the probability that $X$ equals $x_1$ is $P(X = x_1)$. The following equation holds:

$$P(X = x_1) + P(X = x_2) = 1$$

Suppose there is another random variable $Y$ that has $y_1$ as a possible value. The probability that $X = x_1$ and $Y = y_1$ is written as $P(X = x_1, Y = y_1)$, which is called the joint probability of $X = x_1$ and $Y = y_1$.

The probability of $X = x_1$ on the condition that $Y = y_1$ is written as $P(X = x_1 \mid Y = y_1)$; we call this the conditional probability of $X = x_1$ given $Y = y_1$. With the concepts above, we can write the following two fundamental rules of probability theory:

$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

$$P(X = x, Y = y) = P(Y = y \mid X = x)\, P(X = x)$$

The former is the sum rule and the latter is the product rule. Rearranging the product rule gives

$$P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}$$

which is the famous Bayes formula. It also holds for more than two variables:

$$P(X_i = x_i \mid Y = y) = \frac{P(Y = y \mid X_i = x_i)\, P(X_i = x_i)}{\sum_{j=1}^{n} P(Y = y \mid X_j = x_j)\, P(X_j = x_j)}$$

Using the product rule, we can deduce the chain rule:

$$P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1})$$

where $X_1, \dots, X_n$ are $n$ random variables.
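The sum rule, product rule, and Bayes formula can be checked numerically on a small made-up joint distribution; the following sketch is only an illustration, and the table values are arbitrary:

```python
import numpy as np

# A made-up joint distribution P(X, Y) over two binary variables,
# stored as a 2 x 2 table: rows index x, columns index y
P_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])
assert np.isclose(P_xy.sum(), 1.0)

# Sum rule: marginalize Y out to get P(X)
P_x = P_xy.sum(axis=1)

# Product rule: P(X=x, Y=y) = P(Y=y | X=x) P(X=x)
P_y_given_x = P_xy / P_x[:, None]
assert np.allclose(P_y_given_x * P_x[:, None], P_xy)

# Bayes formula: P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)
P_y = P_xy.sum(axis=0)
P_x_given_y = (P_y_given_x * P_x[:, None]) / P_y[None, :]
assert np.allclose(P_x_given_y.sum(axis=0), 1.0)   # each column is a distribution
```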
The average value of some function $f(x)$ under a probability distribution $P(x)$ is called the expectation of $f(x)$. It can be written as

$$\mathbb{E}[f(x)] = \sum_{x} P(x) f(x)$$

Usually, $\mathbb{E}[x]$ stands for the expectation of $x$.

To measure how much $f(x)$ is dispersed around its mean value $\mathbb{E}[f(x)]$, we introduce the variance of $f(x)$:

$$\operatorname{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

Covariance expresses the degree to which two variables vary together:

$$\operatorname{Cov}(f(x), g(y)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\right]$$
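A short sketch (with made-up values) that evaluates the expectation and variance of a discrete distribution and checks the covariance formula against NumPy's built-in estimator:

```python
import numpy as np

# A toy discrete distribution P(x) over three states (made up for illustration)
x = np.array([1.0, 2.0, 3.0])
P = np.array([0.2, 0.5, 0.3])

E_x = np.sum(P * x)                     # E[x] = sum_x P(x) x
Var_x = np.sum(P * (x - E_x) ** 2)      # Var(x) = E[(x - E[x])^2]
assert np.isclose(Var_x, np.sum(P * x ** 2) - E_x ** 2)   # = E[x^2] - E[x]^2

# Covariance from paired samples of two variables (empirical version)
f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([2.0, 1.0, 4.0, 3.0])
cov = np.mean((f - f.mean()) * (g - g.mean()))
assert np.isclose(cov, np.cov(f, g, bias=True)[0, 1])
```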
Probability Distribution
Probability distributions describe the probability of a random variable (or several random variables) taking each possible state.
Gaussian distribution: it is also known as the normal distribution:

$$N(x \mid \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$

where $\mu$ is the mean of variable $x$ and $\sigma^2$ is the variance.

Bernoulli distribution: the random variable $X$ can either be 0 or 1, with probability $P(X = 1) = p$. Then the distribution function is

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

It is quite obvious that $\mathbb{E}(X) = p$ and $\operatorname{Var}(X) = p(1 - p)$.

Binomial distribution: repeat the Bernoulli experiment $N$ times and denote by $Y$ the number of times that $X$ equals 1; then

$$P(Y = k) = \binom{N}{k} p^k (1 - p)^{N - k}$$

is the Binomial distribution, satisfying $\mathbb{E}(Y) = Np$ and $\operatorname{Var}(Y) = Np(1 - p)$.
Laplace distribution: it is described as

$$P(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
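The sketch below writes the Gaussian and Laplace densities directly from the formulas above and checks the Bernoulli and Binomial moments empirically; the parameters are arbitrary examples:

```python
import numpy as np

# Gaussian density N(x | mu, sigma^2), written directly from the formula
def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# Laplace density P(x | mu, b)
def laplace_pdf(x, mu=0.0, b=1.0):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Bernoulli / Binomial moments checked empirically with sampled data
rng = np.random.default_rng(0)
p, N = 0.3, 10
bern = rng.binomial(1, p, size=100_000)    # Bernoulli samples
print(bern.mean(), bern.var())             # ≈ p and p(1 - p)

binom = rng.binomial(N, p, size=100_000)   # Binomial(N, p) samples
print(binom.mean(), binom.var())           # ≈ Np and Np(1 - p)
```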
Graph Theory
Basic Concepts
A graph is often denoted by $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. An edge $e = (u, v)$ has two endpoints $u$ and $v$, which are said to be joined by $e$. In this case, $u$ is called a neighbor of $v$, and these two vertices are adjacent. A graph is called a directed graph if all of its edges are directed. The degree of a vertex $v$, denoted by $d(v)$, is the number of edges connected with $v$.
Algebra Representations of Graphs
Adjacency matrix: for a simple graph $G = (V, E)$ with $n$ vertices, it can be described by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, where

$$A_{ij} = \begin{cases} 1 & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

It is obvious that such a matrix is symmetric when $G$ is an undirected graph.

Degree matrix: for a graph $G = (V, E)$ with $n$ vertices, its degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix, where $D_{ii} = d(v_i)$.

Laplacian matrix: for a simple graph $G = (V, E)$ with $n$ vertices, if we consider all edges in $G$ to be undirected, then its Laplacian matrix $L \in \mathbb{R}^{n \times n}$ can be defined as

$$L = D - A$$

Thus, its elements are

$$L_{ij} = \begin{cases} d(v_i) & \text{if } i = j \\ -1 & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Symmetric normalized Laplacian: the symmetric normalized Laplacian is defined as

$$L^{sym} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$

Its elements are given by

$$L^{sym}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } d(v_i) \ne 0 \\ -\frac{1}{\sqrt{d(v_i) d(v_j)}} & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Random walk normalized Laplacian: it is defined as

$$L^{rw} = D^{-1} L = I - D^{-1} A$$

Its elements can be computed by

$$L^{rw}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } d(v_i) \ne 0 \\ -\frac{1}{d(v_i)} & \text{if } \{v_i, v_j\} \in E \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases}$$

Incidence matrix: for a directed graph $G = (V, E)$ with $n$ vertices and $m$ edges, the corresponding incidence matrix is $M \in \mathbb{R}^{n \times m}$, where

$$M_{ij} = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } e_j = (v_i, v_k) \\ -1 & \text{if } \exists k \text{ s.t. } e_j = (v_k, v_i) \\ 0 & \text{otherwise} \end{cases}$$

For an undirected graph, the corresponding incidence matrix satisfies

$$M_{ij} = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } e_j = \{v_i, v_k\} \\ 0 & \text{otherwise} \end{cases}$$
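To make these definitions concrete, the following sketch builds each matrix for a small made-up undirected graph:

```python
import numpy as np

# A small undirected example graph on 4 vertices (edges chosen arbitrarily):
# 0 - 1, 0 - 2, 1 - 2, 2 - 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1            # adjacency matrix (symmetric, undirected)

D = np.diag(A.sum(axis=1))           # degree matrix
L = D - A                            # Laplacian matrix

d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt   # symmetric normalized Laplacian

d_inv = np.diag(1.0 / np.diag(D))
L_rw = np.eye(n) - d_inv @ A                      # random walk normalized Laplacian

# Sanity checks: row sums of L are zero; L_sym is symmetric
assert np.allclose(L.sum(axis=1), 0)
assert np.allclose(L_sym, L_sym.T)
```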
Basics of Neural Network
A neural network learns in the following way: initialized with random weights or values, the connections between neurons update their weights or values repeatedly through the back propagation algorithm until the model performs sufficiently accurately. In the end, the knowledge the neural network has learned is stored digitally in its connections.
Neuron
The basic units of neural networks are neurons, which receive a series of inputs and return the corresponding output. A neuron receives $n$ inputs $x_1, \dots, x_n$ with corresponding weights $w_1, \dots, w_n$ and an offset $b$; it outputs the weighted sum of the inputs plus the offset, passed through an activation function.
Several activation functions are shown as follows:
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
During the training of a neural network, the choice of activation function is usually essential to the outcome.
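As a concrete illustration (inputs, weights, and offset are made up), a single neuron and the activation functions above can be computed in a few lines:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# A single neuron with made-up weights, offset, and inputs
x = np.array([0.5, -1.0, 2.0])     # n = 3 inputs
w = np.array([0.1, 0.4, -0.3])     # corresponding weights
b = 0.2                            # offset

z = w @ x + b                      # weighted sum of inputs plus offset
print(sigmoid(z), np.tanh(z), relu(z))   # output under different activations
```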
Back Propagation
During the training of a neural network, the back propagation algorithm is most commonly used. It is an algorithm based on gradient descent that optimizes the parameters of a model.
In summary, the process of back propagation consists of the following two steps (a minimal sketch follows the list):
- Forward calculation: given a set of parameters and an input, the neural network computes the values at each neuron in a forward order.
- Backward propagation: compute the error at each variable to be optimized, and update the parameters with their corresponding partial derivatives in a backward order.
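The following is a minimal sketch of these two steps on a tiny one-hidden-layer network fitting XOR with a squared-error loss; the architecture, learning rate, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(5000):
    # Forward calculation: compute the values at each layer in order
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward propagation: compute the error at each variable,
    # then update the parameters with their partial derivatives
    d_out = (out - y) * out * (1 - out)      # error at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated to the hidden layer

    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]
```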
Neural Networks
Feedforward neural network
A feedforward neural network (FNN) usually contains an input layer, several hidden layers, and an output layer. It has a clear hierarchical structure consisting of multiple layers of neurons, where each layer is connected only to its neighboring layers, and there are no loops in the network.
Convolutional neural network
FNNs are usually fully connected networks, while CNNs preserve local connectivity. The CNN architecture usually contains convolutional layers, pooling layers, and several fully connected layers. CNNs are widely used in computer vision and have proven effective in many other research fields.
Recurrent neural network
The neurons in RNNs receive not only signals and inputs from other neurons but also their own historical information. The memory mechanism in RNNs helps the model process sequential data effectively. However, RNNs usually suffer from the problem of long-term dependencies. RNNs are widely used in the areas of speech and natural language processing.
Graph neural network
The GNN is designed specifically to handle graph-structured data, such as social networks, molecular structures, and knowledge graphs.
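For intuition, here is a rough sketch of one propagation step in the spirit of a graph convolutional layer, where each node aggregates its neighbors' features through a normalized adjacency matrix (as in the normalized Laplacians defined earlier); the graph, features, and weights are all invented for illustration:

```python
import numpy as np

# Made-up graph (adjacency matrix), node features, and layer weights
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))   # node features (4 nodes, 3 dims)
W = np.random.default_rng(1).normal(size=(3, 2))   # learnable layer weights

A_hat = A + np.eye(4)                              # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt           # symmetric normalization

H = np.maximum(0.0, A_norm @ X @ W)                # aggregate, transform, ReLU
print(H.shape)                                     # (4, 2): new node representations
```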