Neural Networks

1. Linear Model - Perceptron

[Figure: Perceptron]

\begin{align*} y = xw + b \end{align*}

2. Two Input Model with 1 Neuron

[Figure: Two Input Perceptron]

y = (x_1 \times w_1) + (x_2 \times w_2) + b

This is then passed through an activation function, because the relationship we want to model may not be linear and we may need to normalise the values:

y = \sigma(x_1w_1 + x_2w_2 + b)
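
As a minimal sketch, this neuron can be written directly in Python (the weights and inputs are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x1, x2 = 0.5, -1.2   # inputs
w1, w2 = 0.8, 0.3    # weights (learned in practice)
b = 0.1              # bias

y = sigmoid(x1 * w1 + x2 * w2 + b)
print(y)  # a single activation in (0, 1)
```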

3. Two Neuron Model

[Figure: Two Neurons]

\begin{align} a^{(1)} &= \sigma(xw^{(1)} + b^{(1)}) \\ y &= \sigma(a^{(1)}w^{(2)} + b^{(2)}) \end{align}

This is the feedforward algorithm: we take the output of each layer and feed it into the next. Therefore

\begin{align} a^{(n)} = \sigma \left( \sum_{k} w_k^{(n)} a_k^{(n-1)} + b^{(n)} \right) \end{align}

where a^{(n)} is the activation of layer n and a^{(n-1)} is the activation of the previous layer.

4. Two Layers with Two Neurons Each

[Figure: Two Layer Network]

\begin{align} a_0^{(0)} &= \sigma((x_{0,0} \times w_{0,0}) + (x_{1,0} \times w_{0,1}) + b_0) \\ a_1^{(0)} &= \sigma((x_{0,0} \times w_{0,2}) + (x_{1,0} \times w_{0,3}) + b_0) \\ a_0^{(1)} &= \sigma((a_0^{(0)} \times w_{1,0}) + (a_1^{(0)} \times w_{1,1}) + b_1) \\ a_1^{(1)} &= \sigma((a_0^{(0)} \times w_{1,2}) + (a_1^{(0)} \times w_{1,3}) + b_1) \\ y &= \sigma((a_0^{(1)} \times w_{2,0}) + (a_1^{(1)} \times w_{2,1}) + b_2) \end{align}

Sometimes we write z^i_j to represent \sum_k(w^i_{jk} \times a^{i-1}_k) + b^i_j. We can write this in a more concise manner using vectors and matrices for the transition from one layer to the next:

\begin{align*} a^{(0)} &= \sigma \left( \begin{bmatrix} w_{0,0} & w_{0,1} \\ w_{0,2} & w_{0,3} \end{bmatrix} \begin{bmatrix} x_{0,0} \\ x_{1,0} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_0 \end{bmatrix} \right) = \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \end{bmatrix} \\ W^{(0)} &= \begin{bmatrix} w_{0,0} & w_{0,1} \\ w_{0,2} & w_{0,3} \end{bmatrix}, \quad X = \begin{bmatrix} x_{0,0} \\ x_{1,0} \end{bmatrix}, \quad b^{(0)} = \begin{bmatrix} b_0 \\ b_0 \end{bmatrix} \\ a^{(0)} &= \sigma(W^{(0)}X + b^{(0)}) \end{align*}

\begin{align*} a^{(1)} = \sigma \left( \begin{bmatrix} w_{1,0} & w_{1,1} \\ w_{1,2} & w_{1,3} \end{bmatrix} \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \end{bmatrix} + \begin{bmatrix} b_1 \\ b_1 \end{bmatrix} \right) = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \end{bmatrix} \end{align*}

\begin{align*} y = \sigma \left( \begin{bmatrix} w_{2,0} & w_{2,1} \end{bmatrix} \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \end{bmatrix} + b_2 \right) \end{align*}
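
The same computation in NumPy form (a sketch; the parameter values are arbitrary, not from the notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary illustrative parameters for the 2-2-1 network
W0 = np.array([[0.1, 0.4],
               [0.2, 0.5]])    # first layer, shape (2, 2)
b0 = np.array([0.1, 0.1])
W1 = np.array([[0.3, 0.7],
               [0.6, 0.2]])    # second layer, shape (2, 2)
b1 = np.array([0.2, 0.2])
W2 = np.array([[0.5, 0.9]])    # output layer, shape (1, 2)
b2 = np.array([0.3])

X = np.array([0.5, -1.2])      # input vector [x_00, x_10]

# Feedforward: each layer applies a = sigma(W a_prev + b)
a0 = sigmoid(W0 @ X + b0)
a1 = sigmoid(W1 @ a0 + b1)
y = sigmoid(W2 @ a1 + b2)
print(y)
```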

5. Cost Function

A cost function measures how well a neural network performs with respect to a given training sample and its expected output. A cost function takes the form:

\begin{align*} &C(W, B, S^r, E^r) \\ &W = \text{network weights} \\ &B = \text{network biases} \\ &S^r = \text{input of a single training sample} \\ &E^r = \text{the desired output} \end{align*}

This function can also possibly be dependent on y_j^i and z_j^i for any neuron j in layer i, since those values are dependent on W, B, and S^r. In backpropagation, the cost function is used to compute the error of the output layer, \delta^L, via

\delta^L_j = \frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)

This can also be written as a vector:

\delta^L = \nabla_a C \odot \sigma'(z^L)

Note that \odot denotes the Hadamard (element-wise) product. To be used in backpropagation, a cost function must:

  1. Be able to be written as an average C = \frac{1}{n}\sum_x C_x over cost functions C_x for individual training examples x. This allows us to compute the gradient (w.r.t. W and B) for a single training example.
  2. Not be dependent on any activation values of the network besides the output values a^L.

5.1 Cost Functions (C_x)

5.1.1 Quadratic Cost

Also known as Mean Squared Error:

C_{MSE}(W, B, S^r, E^r) = \frac{1}{2}\sum_j(a^L_j - E^r_j)^2

Differentiating gives the gradient of this cost with respect to the output activations:

\begin{align} C_{MSE} &= \frac{1}{2}\sum_j(a^L_j - E^r_j)^2 \\ \frac{\partial C_{MSE}}{\partial a^L_j} &= \frac{\partial}{\partial a^L_j}\left(\frac{1}{2}\sum_i(a^L_i - E^r_i)^2\right) \\ &= \frac{1}{2} \times 2(a^L_j - E^r_j) \\ &= a^L_j - E^r_j \\ \therefore \text{for some sample } r&: \quad \nabla_a C_{MSE} = (a^L - E^r) \end{align}
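
A short sketch of this cost and its gradient (function names are mine, not from the notes):

```python
import numpy as np

def mse_cost(a_L, E_r):
    # C = 0.5 * sum_j (a_j - E_j)^2
    return 0.5 * np.sum((a_L - E_r) ** 2)

def mse_grad(a_L, E_r):
    # grad_a C = a^L - E^r
    return a_L - E_r

a_L = np.array([0.8, 0.2])   # network outputs
E_r = np.array([1.0, 0.0])   # desired outputs
print(mse_cost(a_L, E_r), mse_grad(a_L, E_r))
```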

5.1.2 Cross-Entropy Cost

Also known as Bernoulli Negative Log-Likelihood and Binary Cross-Entropy:

C_{CE}(W, B, S^r, E^r) = -\sum_j[E^r_j \ln{a^L_j} + (1 - E^r_j)\ln{(1 - a_j^L)}]

Therefore, the gradient is:

\begin{align} C_{CE} &= -\sum_j[E^r_j \ln{a^L_j} + (1 - E^r_j)\ln{(1 - a_j^L)}] \\ \frac{\partial C_{CE}}{\partial a^L_j} &= -\left[\frac{E^r_j}{a^L_j} - \frac{1 - E^r_j}{1 - a^L_j}\right] \\ &= -\frac{E^r_j(1 - a^L_j) - (1 - E^r_j)a^L_j}{a^L_j(1 - a^L_j)} \\ &= \frac{a^L_j - E^r_j}{a^L_j(1 - a^L_j)} \\ \text{w.r.t. one sample } r&: \quad \nabla_a C_{CE} = \frac{a^L - E^r}{a^L(1 - a^L)} \end{align}
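
And the equivalent sketch for cross-entropy (again with illustrative names and values):

```python
import numpy as np

def cross_entropy_cost(a_L, E_r):
    # C = -sum_j [E ln a + (1 - E) ln(1 - a)]
    return -np.sum(E_r * np.log(a_L) + (1 - E_r) * np.log(1 - a_L))

def cross_entropy_grad(a_L, E_r):
    # grad_a C = (a - E) / (a * (1 - a))
    return (a_L - E_r) / (a_L * (1 - a_L))

a_L = np.array([0.8, 0.2])
E_r = np.array([1.0, 0.0])
print(cross_entropy_cost(a_L, E_r), cross_entropy_grad(a_L, E_r))
```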

Other cost functions exist. In general, loss functions help a model determine how wrong a prediction is, which helps the learning algorithm decide how to minimise it.

6. Computing the Gradient

6.1 Gradient Descent

Gradient Descent (the minimising counterpart of hill climbing) is the principle by which learning happens. We want to reduce the loss by driving down the loss function to find its minimum point:

\text{New Weight} = \text{Old Weight} - \text{a small change in } W

W_{new} = W_{old} - \Delta W

To do this, we compute the gradient at each point and move in the opposite direction, driving us toward a minimum.
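
A minimal sketch of this rule on a toy quadratic loss (everything here is illustrative):

```python
def loss_grad(w):
    # Gradient of L(w) = (w - 3)^2, whose minimum is at w = 3
    return 2 * (w - 3)

w = 0.0    # initial weight
lr = 0.1   # the scale of the "small change"
for _ in range(100):
    w = w - lr * loss_grad(w)  # step against the gradient
print(w)   # converges towards 3
```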

6.1.1 Types of Gradient Descent

  1. Batch Gradient Descent: pass the entire data set and calculate the average loss. This is slow and memory-intensive.
  2. Mini-Batch Gradient Descent: we define a batch size n; n randomly chosen samples are selected, and the cost is computed for those data points.
  3. Stochastic Gradient Descent: weights are updated after every record; it is quick and less memory-intensive but has high volatility, meaning it may take longer to converge to a minimum. (A sketch contrasting the three follows.)
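
A rough illustration of how the three variants differ only in how many samples feed each update (the linear-model gradient here is a placeholder):

```python
import numpy as np

def gradient_step(X_batch, y_batch, w, lr):
    # Placeholder: squared-error gradient for a linear model
    m = len(X_batch)
    grad = 2.0 / m * X_batch.T @ (X_batch @ w - y_batch)
    return w - lr * grad

X = np.random.randn(100, 3)
y = np.random.randn(100)
w = np.zeros(3)
lr = 0.01

# Batch: every update sees all 100 samples
w = gradient_step(X, y, w, lr)

# Mini-batch: every update sees n randomly chosen samples
n = 16
idx = np.random.choice(len(X), n, replace=False)
w = gradient_step(X[idx], y[idx], w, lr)

# Stochastic: every update sees a single record
i = np.random.randint(len(X))
w = gradient_step(X[i:i+1], y[i:i+1], w, lr)
```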

6.2 Calculate the Gradient

There are several ways to compute the gradient; two examples are finite differences and backpropagation.

6.2.1 Finite-Difference

\begin{align*} C'(w) = \lim_{\varepsilon \to 0} \frac{C(w + \varepsilon) - C(w)}{\varepsilon} \end{align*}
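
In practice the limit is replaced by a small but finite ε; a minimal sketch (function and names are illustrative):

```python
def finite_difference_grad(C, w, eps=1e-6):
    # Forward-difference approximation of C'(w)
    return (C(w + eps) - C(w)) / eps

cost = lambda w: (w - 3) ** 2
print(finite_difference_grad(cost, 0.0))  # approx. -6, the true gradient
```

This needs an extra cost evaluation per parameter, which is why backpropagation is preferred for networks with many weights.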

7. Activation Functions

Activation functions are mathematical operations which are applied to the outputs of individual neurons in a neural network.

7.1 Types

  1. Sigmoid: f(x) = \frac{1}{1 + e^{-x}}, ranging from 0 to 1

    1. Used in the output layer of binary classification
  2. Hyperbolic Tangent: \tanh(x) = \frac{e^{x} - e^{-x}}{e^x + e^{-x}}, outputs range over [-1, 1]

  3. ReLU: f(x) = \max{(0, x)}, often used but suffers from the dying ReLU problem.

  4. Leaky ReLU: f(x) = \max{(\alpha \times x, x)}, allows a small gradient for negative values to solve the dying ReLU issue.

  5. Exponential Linear Unit (ELU)

    \begin{align} f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{otherwise} \end{cases} \end{align}
    1. Combines Leaky ReLU and ReLU to mitigate the dying ReLU issue.
  6. Swish Activation: f(x) = x \times \sigma(x), proposed by Google; gives smoother behaviour

  7. Parametric ReLU (PReLU)

  8. Randomised Leaky ReLU (RReLU)

  9. Parametric Exponential Linear Unit (PELU)

  10. Softmax - multi-class classification

  11. Softplus

  12. ArcTan

  13. Gaussian Error

  14. Swish-1 Activation

  15. Inverse Square Root Linear Unit (ISRLU)

  16. Scaled Exponential Linear Unit (SELU)

  17. SoftExponential

  18. Bipolar Sigmoid

  19. Binary Step Activation
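
A compact sketch of a few of the functions above (the α defaults are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def swish(x):
    return x * sigmoid(x)

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by max for numerical stability
    return e / e.sum()

x = np.linspace(-2.0, 2.0, 5)
print(relu(x), swish(x), softmax(x))
```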

7.2 Usages

1. Sigmoid: Sigmoid activation is well-suited for binary classification problems where you need outputs that resemble probabilities. It squashes input values into the range between 0 and 1, making it ideal for problems with two distinct classes.

2. Tanh (Hyperbolic Tangent): Tanh is an excellent choice for hidden layers, especially when your input data is centred around zero (mean-zero data). It maps input values to the range [-1, 1], which helps mitigate the vanishing gradient problem and is often preferred in recurrent neural networks (RNNs).

3. ReLU (Rectified Linear Unit): ReLU is a widely used activation function and serves as a good default choice for most situations. It introduces sparsity by setting negative values to zero, making it computationally efficient. However, it may lead to dead neurons during training, so it’s crucial to monitor its performance.

4. Leaky ReLU: Leaky ReLU is a variant of ReLU and is employed when the standard ReLU causes neurons to become inactive. It allows a small gradient for negative values, preventing the issue of dead neurons. It’s a recommended alternative to standard ReLU.

5. ELU (Exponential Linear Unit): ELU is valuable when you want the network to capture both positive and negative values within the hidden layers. It addresses the dying ReLU problem and can lead to faster convergence during training.

6. Swish: Swish is an activation function worth experimenting with, as it combines the computational efficiency of ReLU with a smoother, non-monotonic behaviour. It has shown potential performance improvements in some architectures.

7. PReLU (Parametric ReLU): PReLU extends Leaky ReLU by allowing each neuron to learn its optimal alpha parameter. This can be beneficial when you want the network to adapt its activation function during training.

8. RReLU (Randomised Leaky ReLU): RReLU introduces randomness as a form of regularisation during training. It can help prevent overfitting and enhance the network’s generalisation ability.

9. PELU (Parametric Exponential Linear Unit): PELU extends ELU by enabling neurons to learn their alpha parameter. This flexibility can be advantageous in various scenarios, allowing the network to adapt to the data.

10. Softmax: Softmax activation is essential for multi-class classification problems in the output layer. It transforms a vector of real numbers into a probability distribution over multiple classes, enabling the network to make class predictions.

11. Softplus: Softplus is a smooth approximation of ReLU and can be helpful when you need a smooth activation function with continuous and differentiable derivatives.

12. ArcTan: ArcTan squashes input values to a limited range between -π/2 and π/2. It can be suitable for specific applications where you must restrict the output within this range.

13. GELU (Gaussian Error Linear Unit): GELU is popular in transformer models. It weights its input by the Gaussian cumulative distribution function, f(x) = x\Phi(x), giving a smooth, non-monotonic alternative to ReLU that can lead to improved model performance.

14. Swish-1: Swish-1 is a variant of Swish (commonly, Swish with its trainable β parameter fixed to 1). It offers a slightly different activation profile compared to the general parameterised Swish and is worth considering in experimentation.

15. ISRLU (Inverse Square Root Linear Unit): ISRLU is a smooth alternative to ReLU that can be helpful when maintaining smooth gradients throughout the network.

16. SELU (Scaled Exponential Linear Unit): SELU encourages automatic activation normalisation and can lead to better training performance, especially in deep neural networks.

17. SoftExponential: SoftExponential introduces nonlinearity with a learnable parameter, allowing the network to adapt to specific data distributions.

18. Bipolar Sigmoid: Bipolar Sigmoid maps inputs to the range between -1 and 1, which can be beneficial when you want to model data with positive and negative values.

19. Binary Step: Binary Step is the simplest activation function, providing binary outputs based on a specified threshold. It’s suitable for binary decision problems.

8. Chain Rule and Partial Derivatives

The chain rule is used to differentiate complex functions.

\begin{align} h(x) &= (\sin x)^2 \\ \frac{dh}{dx} &= \frac{d[(\sin x)^2]}{d[\sin x]} \times \frac{d[\sin x]}{dx} \\ \frac{dh}{dx} &= 2\sin x \times \cos x \end{align}

More Generally:

\frac{d}{dx}f(g(x)) = f'(g(x)) \times g'(x)

For example:

\begin{align} &\frac{d}{dx}[5x+3]^4 \\ g(x) &= 5x + 3 \\ f(x) &= x^4 \\ f'(x) &= 4x^3 \\ g'(x) &= 5 \\ \frac{d}{dx}[5x+3]^4 &= f'(g(x)) \times g'(x) \\ &= 4[5x+3]^3 \times 5 \\ &= 20[5x+3]^3 \end{align}
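
The result can be sanity-checked numerically (a throwaway sketch):

```python
h = lambda x: (5 * x + 3) ** 4
dh = lambda x: 20 * (5 * x + 3) ** 3   # chain-rule result

x, eps = 1.0, 1e-6
numeric = (h(x + eps) - h(x)) / eps
print(numeric, dh(x))  # both approximately 10240
```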

8.1 Partial Derivatives

Used in multivariable calculus

\begin{align} f(x,y) &= x^2 + y^3 \\ \frac{\partial f}{\partial x} &= 2x + 0 \\ \frac{\partial f}{\partial y} &= 0 + 3y^2 \end{align}

To compute \partial f/\partial x we treat y as a constant, and to compute \partial f/\partial y we treat x as a constant.

9. Back Propagation

Backpropagation repeatedly adjusts the weights of the connections in the network so as to minimise a measure of the difference between the actual output vector of the net and the desired output vector.

The gradient of a function C(x_1, x_2, \dots, x_m) is a vector of its partial derivatives:

\frac{\partial C}{\partial x} = \left[\frac{\partial C}{\partial x_1}, \frac{\partial C}{\partial x_2}, \dots, \frac{\partial C}{\partial x_m}\right]

\begin{align} \frac{\partial C}{\partial w_{j,k}^{(l)}} &= \frac{\partial C}{\partial z_j^{(l)}} \frac{\partial z^{(l)}_j}{\partial w_{j,k}^{(l)}} \\ \text{where } z^{(l)}_j &= \sum_{k=1}^m w_{j,k}^{(l)} \times a^{(l-1)}_k + b_j^{(l)} \\ \frac{\partial z_j^{(l)}}{\partial w_{j,k}^{(l)}} &= a^{(l-1)}_k \\ \frac{\partial C}{\partial w_{j,k}^{(l)}} &= \frac{\partial C}{\partial z^{(l)}_j} a_{k}^{(l-1)} \end{align}

This uses the local gradient \delta^{(l)}_j = \frac{\partial C}{\partial z^{(l)}_j}.
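
Putting these pieces together, here is a minimal backpropagation sketch for a one-hidden-layer sigmoid network with the quadratic cost (all names and values are illustrative, not from a particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden layer
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # output layer
x, target = np.array([0.5, -1.2]), np.array([1.0])
lr = 0.5

for _ in range(1000):
    # Feedforward, keeping the z values for the backward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Output error: delta^L = grad_a C * sigma'(z^L), element-wise
    delta2 = (a2 - target) * a2 * (1 - a2)
    # Propagate backwards through the next layer's weights
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)

    # dC/dw_jk = delta_j * a_k of the previous layer; dC/db_j = delta_j
    W2 -= lr * np.outer(delta2, a1)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1

print(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))  # approaches the target
```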

9.1 Derivatives of Activation Functions

Sigmoid

\begin{align} \sigma(x) &= \frac{1}{1 + e^{-x}} \\ \sigma'(x) &= \sigma(x)(1 - \sigma(x)) \end{align}
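
For completeness, the identity follows directly from the definition:

\begin{align} \sigma'(x) &= \frac{d}{dx}(1 + e^{-x})^{-1} = \frac{e^{-x}}{(1 + e^{-x})^2} \\ &= \frac{1}{1 + e^{-x}} \times \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) \times \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} \\ &= \sigma(x)(1 - \sigma(x)) \end{align}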

10. Optimisation

10.1 Stochastic Gradient Descent

The issue with gradient descent is that it is very computation-heavy.

Mini-Batch

In Mini-Batch, instead of looking at one data point, we sample a small number of data points.

10.2 Learning Rate

10.2.1 Learning Rate Decay

With learning rate decay, the learning rate is calculated at each update (e.g. end of each mini-batch) as follows:

lrate = initial\_lrate \times \frac{1}{1 + decay \times iteration}

Where lrate is the learning rate for the current epoch, initial_lrate is the learning rate specified as an argument to SGD, decay is the decay rate (greater than zero), and iteration is the current update number.
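
As a sketch (argument names mirror the formula rather than any particular library):

```python
def decayed_lrate(initial_lrate, decay, iteration):
    # Time-based decay: the rate shrinks as the update count grows
    return initial_lrate * (1.0 / (1.0 + decay * iteration))

for it in range(5):
    print(decayed_lrate(0.1, 0.01, it))  # 0.1, 0.099, 0.098, ...
```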

10.2.2 Learning Rate Scheduler

A learning rate scheduler is a method that adjusts the learning rate during the training process, often lowering it as training progresses. There are several schedulers, for example:

  1. Step Decay
  2. Exponential Decay
  3. Cosine Annealing

10.2.3 Adaptive Learning Rate

Different methods exist that adapt the learning rate to the problem rather than following a fixed schedule.

10.3 Momentum

Momentum can smooth the progression of the learning algorithm, which, in turn, can accelerate the training process.

\begin{align} v_t &= \beta v_{t-1} - \eta \nabla L(w_t) \\ w_{t+1} &= w_t + v_t \end{align}

where v_t is the velocity (the accumulated update), \beta is the momentum coefficient, \eta is the learning rate, and \nabla L(w_t) is the gradient of the loss at the current weights.
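
A minimal sketch of this update on a toy quadratic loss (illustrative values):

```python
def loss_grad(w):
    return 2 * (w - 3)  # gradient of L(w) = (w - 3)^2

w, v = 0.0, 0.0
beta, eta = 0.9, 0.05   # momentum coefficient and learning rate
for _ in range(100):
    v = beta * v - eta * loss_grad(w)  # v_t = beta * v_{t-1} - eta * grad
    w = w + v                          # w_{t+1} = w_t + v_t
print(w)  # approaches the minimum at w = 3
```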

11. Data

11.1 Data Shuffling

It is useful to shuffle the training data during training.

  1. Preventing Bias: Without shuffling, the model might learn patterns based on the order of the data, leading to biased training and potentially poor generalisation to unseen data.
  2. Randomness in Batch Selection: When training in mini-batches, shuffling ensures that each batch contains a diverse set of samples from the dataset. This randomness helps the model learn more effectively and prevents it from memorising specific patterns within a batch.
  3. Improving Generalisation: Shuffling the data helps to ensure that the model generalises well to unseen data by exposing it to a variety of samples during training. This can lead to better performance on validation and test datasets.
  4. Breaking Patterns: In some cases, the dataset might have inherent patterns based on the order of samples (e.g., temporal or spatial patterns). Shuffling disrupts these patterns, forcing the model to learn more robust features.
  5. Avoiding Overfitting: Shuffling helps to mitigate overfitting by preventing the model from memorising the specific characteristics of the training data, which may not generalise well to new data.
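
A common way to do this once per epoch (a NumPy sketch with illustrative data):

```python
import numpy as np

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # matching labels

perm = np.random.permutation(len(X))       # a fresh order each epoch
X_shuffled, y_shuffled = X[perm], y[perm]  # keeps samples and labels aligned
```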

11.2 Data Batching

A batch defines how many samples to work through before making any updates to parameters. This is useful for:

  1. Bounding memory use, since only one batch needs to be held in memory at a time.
  2. Updating parameters more frequently than a full pass over the dataset would allow.
  3. Smoothing out some of the gradient noise of single-sample updates.
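
Iterating in batches can then be sketched as follows (continuing the illustrative arrays above):

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batch_size = 4

for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # ...compute the loss and update the parameters on this batch...
    print(X_batch.shape, y_batch.shape)
```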