..............................................730888047.............................................
............................................7688888888057...........................................
...........................................488888888888887..........................................
..........................................7888888888888888..........................................
..........................................70888888888888882.........................................
......................................71427588888888888882751.......................................
.................................774417...762088888888897...757.....................................
.............................72427.......23...713555417.......757...................................
..........746009517.....714477.........767......................757.................................
........48888888808574417.............15..........................757...............................
.......688888888888887...............51.............................157.............................
......7888888888888882.............767........735006317...............157...........................
.......8888888888888832222777.....44........5888888888867...............237.........................
.......18888888888886......77714396777.....088888888888087................43........................
........738888888867237........15...7712225888888888888885..................34......................
...........71111751...731.....57..........188888888888888522217..............732....................
..................15.....23776.............08888888888889.....7724217..........751.7712177..........
...................767.....962..........7347198888888802............7722217......60808888084........
.....................23..767.7237.....157......7122277....................77244108888888888887......
......................7543......7517327........................................688888888888889......
.......................552.......15437.........................................088888888888888......
.....................76..767..732....757..........7...........................7588888888888881......
....................32.....6457........7447..73088888047................7722277.7988888888807.......
..................767....75115............156888888888085.........7724417.......75225900527.........
............7777.44...7237...767...........088888888888885..7724417...........732...................
.........20888888847731........23....777223888888888888888417................32.....................
.......78888888888885........7716952777....088888888888886.................44.......................
......788888888888888112222177....32.......70888888888886................23.........................
......7888888888888882.............757.......16888888857...............237..........................
.......088888888888887...............51.........77777................137............................
.......79888888888893277..............15...........................157..............................
.........7408888037....724277..........767.......................157................................
...........................772427........23....7717777.........757..................................
................................772417....76730888888802.....757....................................
.....................................774417488888888888807.757......................................
..........................................108888888888888657........................................
..........................................7888888888888888..........................................
...........................................988888888888882..........................................
............................................5088888888801...........................................
.............................................7158888047.............................................

The purpose of these notes is to explore the basic architecture and mathematics underlying a feed-forward neural network. I don’t build a high-performing neural network, nor do I explore techniques for improving the performance of the network once that I have built it. This code is my personal exploration of feed-forward neural networks, so all of my code is designed to aid clarity and understanding, and not built for efficiency nor speed.

The 4 resources to which I referred most closely are (in order of use):

1 http://neuralnetworksanddeeplearning.com

2 http://www.deeplearningbook.org/

3 https://matrices.io/deep-neural-network-from-scratch/

4 https://selbydavid.com/2018/01/09/neural-network/

All of these resources are amazing.

Others that were useful were:

5 https://medium.com/@iliakarmanov/neural-networks-from-scratch-in-r-dcf97867c238

6 https://theclevermachine.wordpress.com/2014/09/06/derivation-error-backpropagation-gradient-descent-for-neural-networks/

I first build a network with a single hidden layer to solve a simple classification problem. Then, I build a deeper regression network, going a bit deeper into how back-propagation works.

These notes do not explain all of the concepts required to understand it. To fully understand the content of these notes, read the first 2 chapters of http://neuralnetworksanddeeplearning.com before reading these notes.

I use the terms cost function and loss function interchangeably throughout these notes.

Neural Network Classifier

For the classifier, we wish to make a neural network which can predict the class of each point/observation in our dataset using the 2 features/explanatory variables $X_1$ and $X_2$. First we simulate some data:

simulate some data:

library(mlbench)    
set.seed(69420)

sim_dataset <- mlbench.spirals(500, cycles=1, sd=0)

sim_dataset <- 
   tibble( x1 = sim_dataset$x[,1] + rnorm(100,0,0.1),
               x2 = sim_dataset$x[,2] + rnorm(100,0,0.1),
                y = as.numeric( sim_dataset$classes )-1
   ) 


ggplot( data = sim_dataset %>% mutate(class_of_y = factor(y)),
           aes( x = x1, 
                y = x2, 
                colour = class_of_y
              )
      ) +
  geom_point()

Feed Forward Classifier

Here is a diagram of the network which we are going to build to solve this problem:

Layers between the input and output layers are called hidden layers. This network is a very shallow network, with only 1 hidden layer.

Consider a function which switches on when given a positive input, and is off for negative input:

A neural network is a network of neurons: each layer of neurons takes in information, transforms this information, then feeds it forward as input into the next layer of neurons. A simple type of neuron is a perceptron, whose activation function is this switch function. Inputs into a perceptron neuron are combined into a single value, this value is then put through the switch function, and the neuron outputs either a 0 or a 1.

The sigmoid function is like a smoothed switch, allowing neurons to output a value between 0 and 1 - a gradient between on and off. Using this can allow a model to learn more slowly, because a small change in the input does not cause a big leap into the output (0,1).

The sigmoid function is

\[\text{sigmoid}(x) \quad = \quad \frac{e^x}{1+e^x} \quad = \quad \frac{1}{1+e^{-x}}\]

We can define it in R like this:

sigmoid <- function(x){ 1/( 1+exp(-x) ) }

and it looks like this:

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white")
x <- seq(-10,10,0.01)
plot( x = x, y = sigmoid(x) )
abline( v = 0 )

Sigmoid functions are no longer the activation function of choice when building neural networks because they are known to saturate (generate extremely large weights), but they will do fine for this explanation. In practice, ReLU activation functions are a more popular choice.

We define a function to feed forward a matrix of input features through the network (this is known as forward propagation):

feedforward <- function( X,    # design matrix (including column of 1s)
                         w1,   # weights between input layer and hidden layer
                         w2    # weights between hidden layer and output layer
){
    z1 <- X%*%w1
    h <- sigmoid(z1)
    H <- cbind(1,h)
    z2 <- H %*% w2
    y_predict <- sigmoid( z2 )
    return( sigmoid(z2) )
}


# here is a faster version of the above function:
feedforward2 <- function( X, w1, w2){  sigmoid( cbind(1,sigmoid(X%*%w1)) %*% w2 )  }

Given a network that has already been created, this function takes in inputs and gives us the output of the network.

Forward propagation means passing the input values through the network, where each node that a value passes through applies a (possibly non-linear) function to that value. This allows the neural network to perform a complex non-linear transformation/function of the input variables to produce a prediction, where the complexity of the transformation/function is controlled by the depth of the network (number of hidden layers and number of neurons in each layer). For a thorough treatment of forward propagation, refer to http://neuralnetworksanddeeplearning.com/chap1.html

The loss/cost function

We have many observations in our dataset, but consider a single observation $i$. This observation has some true label $y_i$, which we are trying to build a model to predict.

For this classification problem, we have 2 features $x_{i1}$ and $x_{i2}$ for this observation. We give these features to our neural network model and it outputs a predicted label $\hat{y}_i$. This prediction will be a value somewhere in $\Big[0,1\Big]$. For predictions closer to 1, the model is more certain in it’s prediction that the true value is 1, and vice versa for a label of 0.

Some examples:

$\hat{y}_i=0.99$ is predicting that the true label for observation $i$ is very likely a 1.
$\hat{y}_i=0.7$ is predicting a 1 for observation $i$, but is only moderately certain.
$\hat{y}_i=0.05$ is a confident prediction of label 0.
$\hat{y}_i=0.5$ means the model is very uncertain in it’s prediction.

We can interpret $\hat{y}_i$ as $\hat{y}_i=Pr\Big[y_i = 1 \Big]$ and $1-\hat{y}_i$ as $Pr\Big[y_i = 0 \Big]$

For any single observation $i$, we can improve our model’s accuracy by increasing the value of

\[\hat{y}_i^{y_i} \space (1-\hat{y}_i)^{1-y_i}\]

This is because if the true label of observation $i$ is $y_i=1$, then we want to increase the value of the prediction $\hat{y}_i$, and if the true label is $y_i=0$ then we want to decrease the value of the prediction $\hat{y}_i$.

Here is how this function looks:

We can see that, using this performance-measuring function, our gain in model performance is linearly related to our prediction.

If we take the log of the function, then we get:

\[y_i \space log(\hat{y}_i) \quad + \quad (1-y_i) \space log(1-\hat{y}_i)\]

This measure of model performance is nice because there are smaller gains in model performance (less reward) as the model prediction gets closer to the true y label (the correct prediction). This means that if we are trying to maximise our model performance over many observations, then we can use this log function to make our model focus on improving it’s predictions for observations that it is getting very wrong, and focus less on ones that are closer to being right.

Here is how this cost function looks:

We can measure the performance of our whole model, over all of the $n$ observations in our data, by measuring the performance of each individual, then summing these:

\[\sum_i^n \space \Big( y_i \space log(\hat{y}_i) \quad + \quad (1-y_i) \space log(1-\hat{y}_i) \Big)\]

We can turn this into a loss function (which we want to minimise in order to improve our model) by multiplying it by -1:

\[\mathcal{L} \quad = \quad -\sum_i \space \Big( y_i \space log(\hat{y}_i) \quad + \quad (1-y_i) \space log(1-\hat{y}_i) \Big)\]

Here is how this function is specified in R:

loss_ftn <- function(y,     # vector of true labels 
                     y_hat  # vector of predicted labels
                    ){
     -sum( 
            y*log(y_hat) + (1-y)*log(1-y_hat)
         )    
                      }

We will use this loss function to train our model. Training a neural network means choosing the weights on each neuron in the network, which we do based on how changes to the weights affect the loss function that we’ve chosen.

The loss function that we have chosen here is only a function of our predicted labels and the true observation labels. In practice, we might add more features/components to the cost function, allowing us to control how our network learns (for example, a regularisation term penalising large weights could be added).

Back propagation

The back propagation algorithm allows us to calculate the gradient (rate of change of the loss function) for every individual weight in the network. Knowing how a change in any individual weight in the network will affect the loss function allows us to choose the weights of the network in a way that reduces the loss function value.

Let’s explore how this works:

Suppose for a moment that we have a training set with 10 samples/observations in it:

subset_obs <- sim_dataset[1:10,]
subset_obs

$y$ are the true labels for these observations, and $x_1$ and $x_2$ are the features which the model will use to try to predict these labels.

So, our design matrix for these 10 observations is:

X <- cbind(intercept=1, subset_obs[,-3]) %>% as.matrix()
X

##       intercept         x1          x2
##  [1,]         1 -0.8311040  0.48044131
##  [2,]         1 -0.4909602  0.78836302
##  [3,]         1  0.2567661  0.03589952
##  [4,]         1  0.3455692  0.07807066
##  [5,]         1 -0.2617071  0.82611327
##  [6,]         1  0.5407843 -0.12892211
##  [7,]         1  0.5454220 -0.34649535
##  [8,]         1 -0.3910073 -0.18564491
##  [9,]         1  0.5471577  0.09596109
## [10,]         1  0.1507474  0.15171993

Refer back to the diagram at the beginning of this page for the structure of the neurons and weights in the network.

We initialise the model with random weights:

set.seed(696969420)
# create the weight matrices (random weights)
w1 <- matrix( rnorm(3*5), nrow=3, ncol=5 )
w2 <- matrix( rnorm(6*1), nrow=6, ncol=1 )
w1

##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.6185844 -0.3069030  0.4636220  0.1770278 -0.4188216
## [2,] 1.4971467 -0.4411794 -0.8287034 -0.7124157 -1.3265791
## [3,] 1.2608524 -2.6888881 -0.9632643 -0.6046125  0.9598576

w2

##             [,1]
## [1,] -0.43902060
## [2,]  1.01316230
## [3,]  0.05069603
## [4,]  1.04799010
## [5,]  0.32022737
## [6,] -0.63367454

With these random weights and sample data, the output of the network (predictions $\hat{y}$) is:

feedforward( X = X, w1 = w1, w2 = w2 )

##            [,1]
##  [1,] 0.6198274
##  [2,] 0.6310896
##  [3,] 0.7030117
##  [4,] 0.7043924
##  [5,] 0.6393536
##  [6,] 0.7199627
##  [7,] 0.7285029
##  [8,] 0.6694953
##  [9,] 0.7090447
## [10,] 0.6929599

These predictions are currently meaningless because we’ve used random weights. We want to keep adjusting these weights (using our cost/loss function) to improve the predictions of our model.

We can plug our current (random) predictions into our loss function, which gives us a loss of:

loss_ftn( y = subset_obs$y,
          y_hat = feedforward( X = X, w1 = w1, w2 = w2 )
        )

## [1] 8.224894

Now, we calculate the gradient of our loss function with respect to our weights. This tells us how changing any particular weight affects the output of our loss function.

The gradient in terms of $W_2$ (our matrix of weights to the right of the hidden layer) is

# some things we need to calculate the gradient:
y <- subset_obs$y
y_hat <- feedforward( X = X, w1 = w1, w2 = w2 )
h <- sigmoid( X %*% w1 )
H <- cbind(1, h)
  
# the gradient/derivative of our loss function with respect to w2 is 
dL_dW2 <- t(H) %*% (y_hat - y)
dL_dW2

##           [,1]
## [1,] 1.8176400
## [2,] 1.6475883
## [3,] 0.7534591
## [4,] 0.8481189
## [5,] 0.7259653
## [6,] 0.0466680

What this says is that with our current weights, if we increase our first $W_2$ weight, but keep all of the other weights unchanged, then our loss function will change at a rate of 1.81764 per 1 unit increase in this first $W_2$ weight. Note that this only refers to the change (gradient/slope) at that exact point - it doesn’t mean that a 1 unit increase will result in exactly a change of 1.81764 in the loss function, because moving 1 unit takes us away from that point, passing points with different rates of change. However, it does mean that if we move a small amount (like -0.001), then we can expect a change in the loss of approximately -0.001$$1.81764 = -0.0018176. The actual change is:

adjusted_w2 <- w2 + c(-0.001, 0, 0, 0, 0, 0)

# the change in the loss function is:
loss_ftn( y = y, 
          y_hat = feedforward( X = X, w1 = w1, w2 = adjusted_w2 ) ) -
loss_ftn( y = y, 
          y_hat = feedforward( X = X, w1 = w1, w2 = w2 ) )

## [1] -0.001816562

We can see how close this approximation is.

Similarly, the gradient of the loss function with respect to the first weights (weights before the hidden layer) is:

dh  <- (y_hat - y) %*% t( w2[-1, drop = FALSE] )
dL_dW1 <- t(X) %*% (h * (1 - h) * dh)
dL_dW1

##                  [,1]          [,2]       [,3]        [,4]        [,5]
## intercept  0.25216518  0.0268277230  0.5017940  0.14788822 -0.24293143
## x1         0.36016831  0.0188839617  0.4754534  0.14737721 -0.24212150
## x2        -0.09596134 -0.0005268019 -0.1177939 -0.03598622  0.05556531

This tells us, for example, that we can change our loss by around 0.001$$0.1473772 = 0.0001474 by increasing the 4th weight on $x_1$ by 0.001 (keeping all other weights equal). The actual change is:

adjusted_w1 <- w1 + matrix( c( 0, 0, 0, 0,     0,
                               0, 0, 0, 0.001, 0,
                               0, 0, 0, 0,     0
                               ),
                                byrow = TRUE,
                                nrow=3, ncol = 5
                          )
loss_ftn( y = y, 
          y_hat = feedforward( X = X, w1 = adjusted_w1, w2 = w2 ) ) -
loss_ftn( y = y, 
          y_hat = feedforward( X = X, w1 = w1, w2 = w2 ) )

## [1] 0.0001473835

We can select the weights in the network (train the model) by simultaneously updating all of the weights in our network. We update the weights a very small amount at a time; updating weights by a large amount would make the gradients (instantaneous rates of change) an inaccurate estimate of the effect on the loss function. We control the rate at which the weights are updated in each iteration using a constant (a hyperparameter that we choose) called the learning rate.

We iteratively update each weight according to the rule:

\[W_{t+1} \quad = \quad W_t \quad - \quad \lambda \space \nabla \space \mathcal{L(W_t)} \]

So, for each weight (weights updated simultaneously):

We get the rate of change of the loss function for this weight at it’s current value $\nabla \space \mathcal{L(W_t)}$, assuming that all other weights do not change.
We update this weight by a very small (controlled by $\lambda$) amount in the direction in which we expect to reduce the loss function.

We make a function in R to do this:

backpropagate <- function( X,          # design matrix (with first column of 1s)
                           y,          # vector of true observed outcomes
                           y_hat,      # vector of current model predictions
                           w1,         # matrix of current weights between input layer and hidden layer
                           w2,         # matrix of current weights between hidden layer and output layer
                           learnrate   # the chosen learning rate (lambda) 
){
      
     h <- sigmoid(X%*%w1)
     H <- cbind( 1, h )
     dL_dw2 <- t(H) %*% (y_hat - y)       # calculate the gradients for W2
     dL_dh  <- (y_hat - y) %*% t(w2[-1, , drop = FALSE])
     dL_dw1 <- t(X) %*% (h * (1 - h) * dL_dh)  # calculate the gradients for W1
  
  w1 <- w1 - learnrate * dL_dw1         # update the weights W1
  w2 <- w2 - learnrate * dL_dw2         # update the weights W2
  
  return(  list(w1 = w1, w2 = w2)  )
     
}

As a reminder, our initial (random) weights are:

w1

##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.6185844 -0.3069030  0.4636220  0.1770278 -0.4188216
## [2,] 1.4971467 -0.4411794 -0.8287034 -0.7124157 -1.3265791
## [3,] 1.2608524 -2.6888881 -0.9632643 -0.6046125  0.9598576

w2

##             [,1]
## [1,] -0.43902060
## [2,]  1.01316230
## [3,]  0.05069603
## [4,]  1.04799010
## [5,]  0.32022737
## [6,] -0.63367454

After 1 iteration of weight updating, we get the following updated weights:

y_hat <- feedforward( X = X, w1 = w1, w2 = w2 )          
  
backpropagate( X = X,          # design matrix (with first column of 1s)
               y = y,          # vector of true observed outcomes
               y_hat = y_hat,      # vector of model predictions
               w1 = w1,         # matrix of weights between input layer and hidden layer
               w2 = w2,         # matrix of weights between hidden layer and output layer
               learnrate = 0.001   # the learning rate (lambda) 
)

## $w1
##                [,1]       [,2]       [,3]       [,4]       [,5]
## intercept 0.6183322 -0.3069298  0.4631202  0.1768799 -0.4185787
## x1        1.4967865 -0.4411983 -0.8291788 -0.7125631 -1.3263370
## x2        1.2609484 -2.6888876 -0.9631465 -0.6045765  0.9598020
## 
## $w2
##             [,1]
## [1,] -0.44083824
## [2,]  1.01151471
## [3,]  0.04994258
## [4,]  1.04714198
## [5,]  0.31950140
## [6,] -0.63372121

Testing our classifier network

We are now going to fit our feed-forward network to the full dataset. I’m not splitting into a training and validation set, just seeing how well the neural network can fit to the training data.

Rather than the 6 neurons shown in the initial network diagram, I am going to put 30 neurons in the hidden layer.

So, our design matrix (input features) are:

X <- cbind(1, sim_dataset[,1:2]) %>% as.matrix()

# first 10 rows:
head(X,10)

##       1         x1          x2
##  [1,] 1 -0.8311040  0.48044131
##  [2,] 1 -0.4909602  0.78836302
##  [3,] 1  0.2567661  0.03589952
##  [4,] 1  0.3455692  0.07807066
##  [5,] 1 -0.2617071  0.82611327
##  [6,] 1  0.5407843 -0.12892211
##  [7,] 1  0.5454220 -0.34649535
##  [8,] 1 -0.3910073 -0.18564491
##  [9,] 1  0.5471577  0.09596109
## [10,] 1  0.1507474  0.15171993

# count number of rows:
nrow(X)

## [1] 500

with true labels y:

y <- sim_dataset$y
y

##   [1] 1 1 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0
##  [36] 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 0 0
##  [71] 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0
## [106] 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0
## [141] 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1 0
## [176] 1 1 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1 0 1 1 0 0 0
## [211] 1 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 0
## [246] 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
## [281] 1 1 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 1 1 1
## [316] 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 0
## [351] 0 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 1
## [386] 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1
## [421] 1 1 1 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1
## [456] 1 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 1 1
## [491] 0 1 1 1 0 0 1 1 1 1

We initialise our weights with random normal values with mean 0 and variance 1:

current_w1 <- matrix( rnorm(3*30), ncol=30, nrow=3)
current_w2 <- matrix( rnorm(31*1), nrow=31, ncol=1 )

current_w1

##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.6870580 -0.9496113 -1.9831288 -0.7185292 -0.8155478 -0.7865093
## [2,]  0.3286330 -1.2457112 -0.5869501  0.5992224  1.8865433  0.8664859
## [3,]  0.8883336  0.6363016  0.2260666 -1.1061342 -0.9083412  1.3519927
##            [,7]        [,8]      [,9]      [,10]      [,11]      [,12]
## [1,]  0.2808328  1.49081096 1.0614086 -0.2448154 -0.3526942 -0.7616990
## [2,]  1.2334168  0.04902111 0.5156258 -0.1197076 -2.2944330  1.7187671
## [3,] -0.8259938 -0.93784264 1.0075282  1.0869595 -0.1634781  0.7804397
##            [,13]      [,14]      [,15]       [,16]      [,17]      [,18]
## [1,]  0.94214402 0.02080711 -1.0686916 -0.08145594 -0.1003017  0.4716697
## [2,] -1.03655271 0.30175506 -1.2598245 -0.72588484 -0.1570956  1.9318030
## [3,]  0.03093519 0.46784579  0.6694906 -0.09373220 -0.2339447 -0.9034702
##           [,19]     [,20]      [,21]      [,22]      [,23]      [,24]
## [1,] -0.6140546  1.295499 -2.1890291  0.2760953 -2.6779981 -0.5381040
## [2,]  0.4919629  1.227210  0.3026268  2.1732636 -0.7065492 -2.4699171
## [3,] -1.3152220 -1.195298  0.7441742 -1.9236434 -0.2347173  0.8222032
##          [,25]      [,26]       [,27]      [,28]       [,29]       [,30]
## [1,] 0.5235839 -0.3791413 -0.47212576 -1.4543045  0.61410889 -0.27440419
## [2,] 0.6189947  0.4741847 -0.01077227  0.6310896 -0.02831715 -0.09917564
## [3,] 2.0249083  0.9593751 -2.48506700  1.7701504 -0.78397737 -0.03477970

current_w2

##               [,1]
##  [1,] -0.704985262
##  [2,]  2.144019428
##  [3,]  0.866556852
##  [4,]  1.084165974
##  [5,]  0.730648655
##  [6,] -0.576333306
##  [7,] -1.659551726
##  [8,]  0.084781815
##  [9,]  0.405017534
## [10,]  1.529721700
## [11,]  1.159353318
## [12,]  1.292737261
## [13,] -0.061818866
## [14,] -0.547429417
## [15,] -1.423055109
## [16,] -1.386645037
## [17,]  0.784467251
## [18,]  0.534292218
## [19,]  0.513703930
## [20,]  0.440467077
## [21,] -0.032972790
## [22,]  0.446220393
## [23,]  0.505655071
## [24,] -0.688748827
## [25,] -0.075516373
## [26,]  0.009921586
## [27,]  0.370951784
## [28,] -1.318298527
## [29,] -0.550775592
## [30,]  0.896188228
## [31,]  0.320133730

I am going to update the weights 10,000 times, so I make a vector of length 1000 to store the cost/loss of the output (label predictions) of the network at each update (so that we can refer to this later to see how our cost changed as the network learned). Initially, each element of this vector just contains an NA value.

store_cost <- as.numeric( rep( as.numeric(NA), 10000) )

# show first 10 values
head( store_cost, 10 )

##  [1] NA NA NA NA NA NA NA NA NA NA

With our initial weights, the output of the network is:

current_y_hat <- feedforward( X = X,
                              w1 = current_w1,
                              w2 = current_w2
                            )

# show first 10 values
head( current_y_hat, 10)

##            [,1]
##  [1,] 0.9495247
##  [2,] 0.9392550
##  [3,] 0.9245212
##  [4,] 0.9224533
##  [5,] 0.9298453
##  [6,] 0.9125697
##  [7,] 0.9051290
##  [8,] 0.9347791
##  [9,] 0.9169147
## [10,] 0.9282108

Giving us a cost/loss of

loss_ftn( y = y, y_hat = current_y_hat )

## [1] 659.6744

Now, we perform the following sequence 10,000 times:

forward propagation:Feed our inputs through the network and get the network output.
Calculate our loss/cost for our current network predictions (output) and store this to refer to later.
back propagation: Calculate the gradient (rate of change) of the cost/loss function with respect to each weight in the network for the current weights.
Simultaneously update all of our weights by a small amount using the gradient of the cost function.

for( i in 1:10000){
  
  #print(i)
  
  # get network output for current weights:
  current_y_hat <- feedforward( X = X, 
                                w1 = current_w1,
                                w2 = current_w2
                            )
  
  # store the cost/loss at this point:
  store_cost[i] <- loss_ftn( y = y, y_hat = current_y_hat )
  
  # calculate the gradient (back propagate) and update the weights:
  do_backprop <- backpropagate( X = X,          # design matrix (with first column of 1s)
                                y = y,          # vector of true observed outcomes
                                y_hat = current_y_hat,      # vector of model predictions
                                w1 = current_w1,         # matrix of weights between input layer and hidden layer
                                w2 = current_w2,         # matrix of weights between hidden layer and output layer
                                learnrate = 0.001   # the learning rate (lambda) 
                              )
  
  # store the new weights:
  current_w1 <- do_backprop$w1
  current_w2 <- do_backprop$w2
  
}

We can look at how our loss has changed over each iteration:

plot( x = 1:length(store_cost),
      y = store_cost,
      xlab = "iteration",
      ylab = "loss/cost",
      #type = "b",
      cex = 0.5
    )

If we consider a network output (prediction) of greater than 0.5 as a predicted label 1 and 0.5 or less as a 0 label, then the predictions of our trained network are:

fitted_values <- as.numeric( current_y_hat > 0.5 )

# show first 20 predictions
head(fitted_values, 20)

##  [1] 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1

We can compare our predictions to the true labels of the points using a confusion matrix:

table( fitted_values, y )

##              y
## fitted_values   0   1
##             0 237  15
##             1  13 235

table( fitted_values, y ) %>% prop.table()

##              y
## fitted_values     0     1
##             0 0.474 0.030
##             1 0.026 0.470

And we can plot the decision boundaries of our network predictions over the data that we trained the network on:

pred_grid <- expand.grid( x1 = seq( min(sim_dataset$x1)-0.1, 
                                   max(sim_dataset$x1)+0.1, 
                                   0.01 ),
                          x2 = seq( min(sim_dataset$x2)-0.1, 
                                    max(sim_dataset$x2)+0.1, 
                                   0.01 )
                         )
pred_grid_labels <- as.numeric( 
                  feedforward( X = as.matrix( cbind(1,pred_grid) ),
                               w1 = current_w1,
                               w2 = current_w2
                              ) > 0.5
                  ) 

ggplot( data = sim_dataset,
           aes( x = x1, 
                y = x2, 
                colour = factor(y)
              )
      ) +
  geom_point( data = cbind( pred_grid, y = pred_grid_labels),
              aes(x = x1 , y = x2 ),
              size = 0.02 ) +
  geom_point()

Rather than plotting the hard decision boundary which we created using the rule $\hat{y}_{i}>0.5$, we can plot the actual output of the network using a gradient of colour - this gives us a fuzzy/blurry decision boundary because areas where the network is less certain of it’s label predictions give a mix of the 2 label colours.

pred_grid_fitted_values <- 
                  feedforward( X = as.matrix( cbind(1,pred_grid) ),
                               w1 = current_w1,
                               w2 = current_w2
                              ) 

ggplot( data = sim_dataset,
           aes( x = x1, 
                y = x2, 
                colour = y
              )
      ) +
  geom_point( data = cbind( pred_grid, y = pred_grid_fitted_values),
              aes(x = x1, y = x2 ),
              size = 0.02 ) +
  geom_point()

A Regression Network

Now, I am going to build a deeper network, and one to solve a different prediction problem: regression.

I am also switching the notation of the weights to the style used in the book http://neuralnetworksanddeeplearning.com. All of the notation and mathematics (aside from mistakes that I may have made) are from this resource.

For the regression problem, I am going to use the quadratic cost function (mean squared error):

\[C \quad = \quad \frac{1}{2} \times \frac{\sum_{i=1}^n \bigg( y_i - \hat{y}_i\bigg)^2}{n}\]

This is not a popular choice of loss function any more (cross-entropy loss functions are preferred), but is fine for the purpose of explanation.

I will explore the mechanics of this problem using the following network:

Following this exploration, I will generalise this network in R so that a user can specify how many network layers they would like, and how many neurons they would like in each layer, without rewriting the code each time they make a change.

I use sigmoid activation functions in all layers except the last layer (the last layer will have a linear activation function).

suppose that we have a single observation (a single input to the network):

single_x <- matrix( rnorm(2), nrow=2 )
single_x

##            [,1]
## [1,] -0.1955271
## [2,] -0.1246678

Because we have 2 input neurons (2 features) in our network, this single input is a vector with 2 features.

For a single input $x$, our quadratic cost function is:

\[C_x \quad = \quad \frac{1}{2}\Big( y_x - \hat{y_x}\Big)^2\] ..wherer $y_x$ is the true response value for our single observation and $\hat{y_x}$ is our neural network’s prediction of this true response value.

We initialise all of the weights and biases with random values:

set.seed(69)

W2 <- matrix( rnorm(4,0.5), nrow=2)
W3 <- matrix( rnorm(4,0.5), nrow=2)
W4 <- matrix( rnorm(4,0.5), nrow=2)
W5 <- matrix( rnorm(2,0.5), nrow=1)
b2 <- matrix( rnorm(2,0.5), nrow=2)
b3 <- matrix( rnorm(2,0.5), nrow=2)
b4 <- matrix( rnorm(2,0.5), nrow=2)
b5 <- rnorm(1,2)

Here is how a few of them look:

W2

##           [,1]       [,2]
## [1,] 0.5771654  0.1651861
## [2,] 0.8743156 -0.4498890

W4

##            [,1]      [,2]
## [1,] -0.6668554 0.3308235
## [2,] -1.0710733 0.8517533

W5

##            [,1]     [,2]
## [1,] -0.5550501 0.914994

b2

##            [,1]
## [1,] -0.8113744
## [2,]  0.9499766

b5

## [1] 4.493901

We define a feed-forward function in R. For a given set of weights, biases, and a single input, this function will give the output of the network (also outputting the intermediate steps).

feed_forward <- function( single_x, W2, W3, W4, W5, b2, b3, b4, b5){
  
 z2 <- W2 %*% single_x + b2 
 a2 <- sigmoid(z2)            # recall that the sigmoid function is:  function(x){ 1/( 1+exp(-x) ) }

 z3 <- W3 %*% a2 + b3
 a3 <- sigmoid(z3)
 
 z4 <- W4 %*% a3 + b4
 a4 <- sigmoid(z4)
 
 z5 <- W5 %*% a4 + b5
 a5 <- z5                # last layer has linear activation function (identity)
 
 return( 
    list( z2=z2,z3=z3,z4=z4,z5=z5,
          a2=a2,a3=a3,a4=a4,a5=a5
        )
 )
 
}

So, for the single observation we stated earlier and the weights which we initialised with random values, the output of our network is:

single_x_network_output <- 
   feed_forward( single_x = single_x, 
                       W2 = W2, 
                       W3 = W3, 
                       W4 = W4, 
                       W5 = W5, 
                       b2 = b2, 
                       b3 = b3, 
                       b4 = b4, 
                       b5 = b5
                )

single_x_network_output$a5

##          [,1]
## [1,] 4.566298

We also stored all of the intermediate neuron outputs $a^l=\Big[a^l_j \Big]=\Big[ \sigma(z^l_j) \Big]$:

single_x_network_output

## $z2
##            [,1]
## [1,] -0.9448192
## [2,]  0.8351109
## 
## $z3
##           [,1]
## [1,] 1.1327139
## [2,] 0.2944829
## 
## $z4
##             [,1]
## [1,]  0.66676027
## [2,] -0.08019527
## 
## $z5
##          [,1]
## [1,] 4.566298
## 
## $a2
##           [,1]
## [1,] 0.2799279
## [2,] 0.6974345
## 
## $a3
##           [,1]
## [1,] 0.7563394
## [2,] 0.5730933
## 
## $a4
##           [,1]
## [1,] 0.6607773
## [2,] 0.4799619
## 
## $a5
##          [,1]
## [1,] 4.566298

Assuming that the true response value for our single observation $x$ is $y_x=1$ (this is the value that we’re trying to build out network to predict), our cost function for the single observation gives a cost of:

0.5 * ( 1-single_x_network_output$a5 )^2

##          [,1]
## [1,] 6.359242

For the $5^{th}$ layer, we define the error $\delta^5$:

\[\delta^5 \quad = \quad \frac{\partial C_x}{\partial a^5_1} \space \sigma ' (Z^5_1)\]

The activation on the output layer is linear ($\sigma(z)=z$), so $\sigma'(z)=1$. This gives us, for the output layer:

\[\begin{array}{lcl} \delta^5 &=& \frac{\partial C_x}{\partial a^5_1} \space \times \space 1 \\ &=& \frac{\partial \space \Big[\frac{1}{2}(y-a^5_1)^2 \Big]}{\partial \space a^5_1} \\ &=& a^5_1 - y \\ &=& z^5_1 - y \hspace{4cm} a^5_1=z^5_1 \text{ because the activation function is linear in the output layer}\\ \end{array}\]

$\delta^5$ gives us the rate of change of our cost function with respect to the output of our network ($a^5_1=z^5_1$).

This tells us that a very small change $\Delta z$ in our final network output $z_1^5$ results in approximately a change of $\Delta z\times\underset{\delta^5}{\underbrace{\Big( z^5_1 - y \Big)}}$ in our cost function.

We can see that this is true in R:

# So, our original network output was 
single_x_network_output$a5

##          [,1]
## [1,] 4.566298

# which gave a cost of 
0.5 * ( 1-single_x_network_output$a5 )^2

##          [,1]
## [1,] 6.359242

# If we increase z5=a5 by 0.00001,
# then we'd expect a change in our cost function of approximately:
0.00001 * (single_x_network_output$a5 - 1)

##               [,1]
## [1,] 0.00003566298

# and the actual change in our cost is:
0.5 * ( 1- (single_x_network_output$a5+0.00001) )^2 -       # new cost
0.5 * ( 1- (single_x_network_output$a5) )^2         # previous cost

##               [,1]
## [1,] 0.00003566303

The error one layer back (in the $4^{th}$ layer), we can derive from the error $\delta^5$ that we derived above:

\[\begin{array}{lcl} \delta^4 &=& \begin{bmatrix} \delta^4_1 \\ \delta^4_2 \end{bmatrix} \hspace{5cm} \text{(1 error for each non-bias neuron in the 4th layer)} \\ &=& \begin{bmatrix} \frac{\partial \space C_x}{\partial \space z^4_1 }\\ \frac{\partial \space C_x}{\partial \space z^4_2 } \end{bmatrix} \\ &=& \Bigg( \Big(W^5\Big)^T \delta^5 \Bigg) \odot \sigma'(z^4) \\ &=& \Bigg( \begin{bmatrix} W^5_{11} \\ W^5_{12} \end{bmatrix} (z^5_1 - y) \Bigg) \space \odot \space \begin{bmatrix} \sigma ' (z^4_1) \\ \sigma ' (z^4_2)\end{bmatrix} \\ &=& \begin{bmatrix} W^5_{11} (z^5_1 - y) \times \sigma ' (z^4_1) \\ W^5_{12} (z^5_1 - y) \times \sigma ' (z^4_2) \end{bmatrix} \\ &=& \begin{bmatrix} W^5_{11} (z^5_1 - y) \times \sigma(z^4_1)\Big( 1-\sigma(z^4_1)\Big) \\ W^5_{12} (z^5_1 - y) \times \sigma(z^4_2)\Big( 1-\sigma(z^4_2)\Big) \end{bmatrix} \hspace{5cm} \text{(because we are using a sigmoid activation function)}\\ \end{array} \]

Because we know $\delta^5$ - the rate of change (gradient) of the cost function for changes in our last layer output ($z^5_1$) - and we know how our $4^{th}$ layer affects our $5^{th}$ layer, we can work out the rate of change of the cost function with respect to changes in our $4^{th}$ layer output. For example, if the cost function changes by $\delta^5$ for every 1 unit increase in our last layer output $z^5_1$, then a 1 unit increase in the output $z^4_1$ (the first neuron in layer 4), will give us a change of $\sigma'(z^4_1)$ in $a^4_1$. This change in $a^4_1$ results in a change of $W^5_{11}\times\sigma'(z^4_1)$ in the network output $z^5_1=a^5_1$. This change in the final layer output $z^5_1$ results in a corresponding change of $W^5_{11}\times\sigma'(z^4_1)\times\delta^5$ in the cost function.

Since $z^4_1$ is itself a function of its weights, biases and inputs, we can derive the change/derivative of the cost function for a change in one of the weights or biases on the neuron whose output is $a^4_1$. For example, a 1 unit change in the weight $W^4_{12}$ will lead to an increase of $a^3_2$ in $z^4_1$. We know that the rate of change of the cost function with respect to $z^4_1$ is $W^5_{11}\times\sigma'(z^4_1)\times\delta^5$, so we can see that the instantaneous rate of change of the cost function for a 1 unit increase in the $4^{th}$ layer weight $W^4_{12}$ is

\[a^3_2 \space \times \space W^5_{11}\times\sigma'(z^4_1)\times\delta^5 \quad = \quad a^3_2 \times \delta^4_1\]

So, we use $\delta^4$ to calculate how our cost function will change when we change the weights and biases in the 4th layer of our network.

$\delta^4$ itself is:

delta5 <- c(single_x_network_output$z5)-1
delta4 <- t(W5) * delta5 * sigmoid(single_x_network_output$z4)*(1-sigmoid(single_x_network_output$z4))
delta4

##            [,1]
## [1,] -0.4437004
## [2,]  0.8144752

So, if we change the weight $W^4_{12}$ by some small amount $\Delta w$. The change in our cost function (at this exact point) will be:

\[\frac{\partial \space C}{\partial \space W^l_{jk}} \times \Delta w \quad = \quad a^{l-1}_k \delta^l_j \times \Delta w \quad = \quad \frac{\partial \space C}{\partial \space W^4_{12}} \times \Delta w \quad = \quad a^{3}_2 \space \delta^4_1 \times \Delta w\]

So, for example, if we decrease our weight $W^4_{12}$ by $\Delta w = -0.00001$, then our cost function will change by approximately

single_x_network_output$a3[[2]] * delta4[[1]] * (-0.00001)

## [1] 0.000002542817

We can see this by making this small change to $W^4_{12}$, calculating our new network output with this weight update, calculating the new cost, and comparing this new cost to that theoretical change:

previous_output <- single_x_network_output$a5

new_output <- 
  feed_forward( single_x = single_x, 
                W2 = W2,
                W3 = W3, 
                W4 = W4 + matrix(c(0,0,-0.00001,0),nrow=2),
                W5 = W5, 
                b2 = b2, 
                b3 = b3, 
                b4 = b4, 
                b5 = b5
                )$a5

# change in cost:
0.5 * ( 1 - new_output )^2 -      # new cost
0.5 * ( 1 - previous_output )^2   # old cost

##               [,1]
## [1,] 0.00000254282

Similarly, we can see the effect on our cost function of making a small change to one of the 4th layer bias weights:

Changing $b^4_1$ by $\Delta b = 0.0001$ will give us approximately a change in our cost of:

\[\frac{\partial \space C_x}{\partial \space b^l_j} \times \Delta b \quad = \quad \delta^l_j \times \Delta b \quad = \quad \frac{\partial \space C_x}{\partial \space b^4_1} \times 0.0001 \quad = \quad \delta^4_1 \times 0.0001 \] So, increasing the bias weight $b^4_1$ to $(b^4_1+0.0001)$ we’d expect a change in cost of approximately

delta4[[1]] * 0.0001

## [1] -0.00004437004

and the actual observed change in cost is:

previous_output <- single_x_network_output$a5

new_output <- 
  feed_forward( single_x = single_x, 
                W2 = W2,
                W3 = W3, 
                W4 = W4,
                W5 = W5, 
                b2 = b2, 
                b3 = b3, 
                b4 = b4 + matrix( c(0.0001, 0), nrow=2 ), 
                b5 = b5
                )$a5

# change in cost:
0.5 * ( 1 - new_output )^2 -      # new cost
0.5 * ( 1 - previous_output )^2   # old cost

##                [,1]
## [1,] -0.00004436925

This working backwards through the network to calculate the rate of change of the cost function for changes in the weights and biases is called back propagation.

The general rule for working out the change in the cost for a change in any specific bias or weight in the network is:

\[\frac{\partial\space C_x}{\partial\space b^l_j} \quad = \quad \delta^l_j \] \[\frac{\partial\space C_x}{\partial\space W^l_{jk}} \quad = \quad a^{l-1}_{k} \times \delta^l_j \]

Let’s calculate the other errors (which will allow us to calculate the gradient of the cost function for weights earlier in the network):

\[ \begin{array}{lcl} \delta^3 &=& \begin{bmatrix} \delta^3_1 \\ \delta^3_2 \end{bmatrix} \hspace{5cm} \text{(1 error for each non-bias neuron in the 3rd layer)} \\ &=& \begin{bmatrix} \frac{\partial \space C_x}{\partial \space z^3_1 }\\ \frac{\partial \space C_x}{\partial \space z^3_2 } \end{bmatrix} \\ &=& \Bigg( \Big(W^4\Big)^T \delta^4 \Bigg) \odot \sigma'(z^3) \\ &=& \Bigg( \begin{bmatrix} W^4_{11} & W^4_{21} \\ W^4_{12} & W^4_{22} \end{bmatrix} \begin{bmatrix} \delta^4_1 \\ \delta^4_2 \end{bmatrix} \Bigg) \space \odot \space \begin{bmatrix} \sigma ' (z^3_1) \\ \sigma ' (z^3_2) \end{bmatrix} \\ &=& \begin{bmatrix} W^4_{11}\delta^4_1 + W^4_{21}\delta^4_2 \\ W^4_{12}\delta^4_1 + W^4_{22}\delta^4_2 \end{bmatrix} \space \odot \space \begin{bmatrix} \sigma (z^3_1) \Big( 1 - \sigma (z^3_1) \Big)\\ \sigma(z^3_2) \Big( 1 - \sigma(z^3_2) \Big) \end{bmatrix} \\ &=& \begin{bmatrix} \Big( W^4_{11}\delta^4_1 + W^4_{21}\delta^4_2 \Big) \times \sigma (z^3_1) \Big( 1 - \sigma (z^3_1) \Big) \\ \Big( W^4_{12}\delta^4_1 + W^4_{22}\delta^4_2 \Big) \times \sigma(z^3_2) \Big( 1 - \sigma(z^3_2) \Big) \end{bmatrix} \\ \end{array}\]

For our random weights and single input $x$, we calculate $\delta^3$ as :

delta3 <- ( t(W4) %*% delta4 ) * 
          ( 
            sigmoid(single_x_network_output$z3)*( 1-sigmoid(single_x_network_output$z3) ) 
          )
  
delta3

##            [,1]
## [1,] -0.1062393
## [2,]  0.1338142

And $\delta^2$ is

\[ \begin{array}{lcl} \delta^2 &=& \begin{bmatrix} \delta^2_1 \\ \delta^2_2 \end{bmatrix} \hspace{5cm} \text{(1 error for each non-bias neuron in the 2nd layer)} \\ &=& \begin{bmatrix} \frac{\partial \space C_x}{\partial \space z^2_1 }\\ \frac{\partial \space C_x}{\partial \space z^2_2 } \end{bmatrix} \\ &=& \Bigg( \Big(W^3\Big)^T \delta^3 \Bigg) \odot \sigma'(z^2) \\ &=& \Bigg( \begin{bmatrix} W^3_{11} & W^3_{21} \\ W^3_{12} & W^3_{22} \end{bmatrix} \begin{bmatrix} \delta^3_1 \\ \delta^3_2 \end{bmatrix} \Bigg) \space \odot \space \begin{bmatrix} \sigma ' (z^2_1) \\ \sigma ' (z^2_2) \end{bmatrix} \\ &=& \begin{bmatrix} W^3_{11}\delta^3_1 + W^3_{21}\delta^3_2 \\ W^3_{12}\delta^3_1 + W^3_{22}\delta^3_2 \end{bmatrix} \space \odot \space \begin{bmatrix} \sigma (z^2_1) \Big( 1 - \sigma (z^2_1) \Big)\\ \sigma(z^2_2) \Big( 1 - \sigma(z^2_2) \Big) \end{bmatrix} \\ &=& \begin{bmatrix} \Big( W^3_{11}\delta^3_1 + W^3_{21}\delta^3_2 \Big) \times \sigma (z^2_1) \Big( 1 - \sigma (z^2_1) \Big) \\ \Big( W^3_{12}\delta^3_1 + W^3_{22}\delta^3_2 \Big) \times \sigma(z^2_2) \Big( 1 - \sigma(z^2_2) \Big) \end{bmatrix} \\ \end{array}\]

In R, using our current network, $\delta^2$ is:

delta2 <- 
   t(W3) %*% delta3 * 
   ( sigmoid(single_x_network_output$z2)*(1-sigmoid(single_x_network_output$z2)) )

delta2

##             [,1]
## [1,]  0.05500057
## [2,] -0.05499022

So, for our single input $x=\begin{bmatrix} x_1 \\ x_2\end{bmatrix}$, we can now use $\delta^2$, $\delta^3$, $\delta^4$ and $\delta^5$ to find the derivative of our cost function with respect to any of the weights in our network.

As another example, the bias weights in the second layer are currently:

\[b^2 \quad = \quad \begin{bmatrix} b^2_1 \\ b^2_2\end{bmatrix} \quad = \quad \]

b2

##            [,1]
## [1,] -0.8113744
## [2,]  0.9499766

So, $\delta^2 = \begin{bmatrix} \delta^2_1 \\ \delta^2_2 \end{bmatrix} = \begin{bmatrix} \frac{\partial \space C_x}{\space b^l_j} \\ \frac{\partial C_x}{\partial \space b^l_j} \end{bmatrix}$ gives the rate of change of the cost function at the current values of the bias weights.

For our data at the moment, $\delta^2$ is

delta2

##             [,1]
## [1,]  0.05500057
## [2,] -0.05499022

So, if we change the weight on the first bias in the second layer ($b^2_1$) by -0.001, we’d expect a change of around $\delta^2_1 \times (-0.001) = -0.000055$ in the cost. To compare, the true change in the cost is:

previous_cost <- 0.5 * ( single_x_network_output$a5-1 )^2
new_cost <- 0.5 * 
    ( 
        feed_forward( single_x = single_x, 
                            W2 = W2,
                            W3 = W3, 
                            W4 = W4,
                            W5 = W5, 
                            b2 = b2 + c(-0.001, 0), 
                            b3 = b3, 
                            b4 = b4,
                            b5 = b5
                    )$a5 - 1 
    )^2

new_cost - previous_cost

##                [,1]
## [1,] -0.00005498903

And similarly, decreasing the other 2nd layer bias weight ($b^2_2$) by the same amount (-0.001) would give us approximately a change of $\delta^2_2 \times (-0.001) = 0.000055$ in the cost. The observed change is:

previous_cost <- 0.5 * ( single_x_network_output$a5-1 )^2
new_cost <- 0.5 * 
    ( 
        feed_forward( single_x = single_x, 
                            W2 = W2,
                            W3 = W3, 
                            W4 = W4,
                            W5 = W5, 
                            b2 = b2 + c(0, -0.001), 
                            b3 = b3, 
                            b4 = b4,
                            b5 = b5
                    )$a5 - 1 
    )^2

new_cost - previous_cost

##              [,1]
## [1,] 0.0000550061

Further examples of the effect of changing weights using the gradient of the cost function

At the risk of boring you, and to satisfy my anxiety for correctness, I give a few more examples here of the change in the cost function for changes in the network weights and biases (still for our single input $x$)

We calculated $\delta^2 = \begin{bmatrix} \delta^2_1 \\ \delta^2_2\end{bmatrix}$ as

delta2

##             [,1]
## [1,]  0.05500057
## [2,] -0.05499022

Increasing our weight $W^2_{12}$ by 0.0001 should change our cost function by approximately

\[\begin{array}{lcl} \frac{\partial \space C_x}{\partial \space W^2_{12}} \times 0.0001 &=& a^1_2 \space \delta^2_1 \times 0.0001\\ &=& x_2 \space \delta^2_1 \times 0.0001 \hspace{4cm} a^1 \text{ is our vector of inputs } x \text{ (the 1st layer of our network)}\\ \end{array}\]

our single input $x=\begin{bmatrix} x_1 \\ x_2\end{bmatrix}$ was

single_x

##            [,1]
## [1,] -0.1955271
## [2,] -0.1246678

So,

$x_2 \times \delta^2_1 \times 0.0001 =$ -0.1246678 $\times$ 0.0550006 $\times 0.0001 =$ -0.0000007

The observed change is:

previous_cost <- 0.5 * ( single_x_network_output$a5-1 )^2
new_cost <- 0.5 * 
    ( 
        feed_forward( single_x = single_x, 
                            W2 = W2 + matrix( c(0,0,0.0001,0), nrow=2),
                            W3 = W3, 
                            W4 = W4,
                            W5 = W5, 
                            b2 = b2, 
                            b3 = b3, 
                            b4 = b4,
                            b5 = b5
                    )$a5 - 1 
    )^2

new_cost - previous_cost

##                  [,1]
## [1,] -0.0000006856781

The gradient of the cost function tells us that increasing weight $W^3_{22}$ in the 3rd layer by some small amount (say 0.0001) should change the output of our cost function by

\[\begin{array}{lcl} \frac{\partial \space C_x}{\partial \space W^3_{22}} \times 0.0001 &=& a^2_2 \space \delta^3_2 \times 0.0001\\ \end{array}\]

The matrix of weights in the 3rd layer ($W^3=\begin{bmatrix} W^3_{11} & W^3_{12} \\ W^3_{21} & W^3_{22}\end{bmatrix}$ is

W3

##            [,1]       [,2]
## [1,] -0.4402101  1.9440935
## [2,]  1.6896231 -0.4039467

and $\delta^3=\begin{bmatrix} \delta^3_1 \\ \delta^3_2 \\\end{bmatrix}$ is

##            [,1]
## [1,] -0.1062393
## [2,]  0.1338142

and $a^2=\begin{bmatrix} a^2_1 \\ a^2_2\end{bmatrix}$ is

single_x_network_output$a2

##           [,1]
## [1,] 0.2799279
## [2,] 0.6974345

So,

$a^2_2 \times \delta^3_2 \times 0.0001 =$ 0.6974345 $\times$ 0.1338142 $\times 0.0001 =$ 0.0000093

For comparison, the observed change in cost is

##                [,1]
## [1,] 0.000009332628

Calculating the gradient of the cost function for multiple inputs

Up til now, we have been dealing with a single input (input vector) $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ to our neural network. For this single input, our cost function was

\[C_x \quad = \quad \frac{1}{2}(y - a^5_1)^2 \quad = \quad \frac{1}{2}(y - \hat{y})^2\]

..where $a^5_1=\hat{y}$ is the single output of our network for the single input $x$. If instead of one input we have multiple different inputs, then we can feed each one separately through our neural network. Each single input will have it’s own single output.

Let’s consider $n$ inputs (each individual input consists of 2 input variables/features):

\[X \quad = \quad \begin{bmatrix} \underline{X}_1 \\ \underline{X}_2 \\ \underline{X}_3 \\ . \\ . \\ . \\ \underline{X}_n \end{bmatrix} \quad = \quad \begin{bmatrix} \text{1st input vector} \\ \text{2nd input vector} \\ \text{3rd input vector} \\ . \\ . \\ . \\ n^{th}\text{ input vector} \end{bmatrix} \quad = \quad \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ . & . \\ . & . \\ . & . \\ x_{n1} & x_{n2} \end{bmatrix}\]

Each row of $X$ is a different input vector (we have $n$ input vectors/observations/samples/data rows), so for our previous single input we had

\[X \quad = \quad \Big[ \underline{X}_1 \Big] \quad = \quad \begin{bmatrix} x_{11} & x_{12} \end{bmatrix}\]

When considering multiple input vectors $x_i$ simultaneously, we instead use the cost function

\[C \quad = \quad \frac{ \sum_{i=1}^n C_x}{n} \quad = \quad \frac{ \sum_{i=1}^n \frac{1}{2} (y_i-\hat{y}_i)^2}{n}\]

This cost function is simply calculating the cost individually for each individual input vector, then averaging all of these individual costs to get an average cost across all of them (with each input contributing equally to the cost).

Suppose that we now have 5 input vectors (observations/samples):

# generate some data randomly:
X <- data.frame( runif(2,-1,1),
                 runif(2,-1,1),
                 runif(2,-1,1),
                 runif(2,-1,1),
                 runif(2,-1,1)
                ) %>% 
     as.matrix() %>% 
  t()

colnames(X) <- c("X1","X2")
rownames(X) <- paste("observation", 1:5)
X

##                        X1         X2
## observation 1 -0.03699013  0.3261458
## observation 2 -0.56338334 -0.9632378
## observation 3 -0.80566246 -0.4918422
## observation 4 -0.18534015 -0.6948660
## observation 5  0.01943456  0.6305214

So, we have 5 observations/measurements/samples ($n=5$) of 2 variables ($X_1$ and $X_2$). For instance, we could have measured the height and weight of 5 different people.

Let’s suppose that the response values $y_i$ (the values that we’re trying to predict using our model) for our 5 observations are

## observation 1 observation 2 observation 3 observation 4 observation 5 
##     0.2891556    -1.5266211    -1.2975046    -0.8802061     0.6499560

We define this multiple-input cost function as a function in R:

quadratic_cost_ftn <- function( # the cost function is a function of:
                                y,      # our true response values 
                                y_hat   # our predicted response values  
){
  
  return( mean( 0.5*(y-y_hat)^2 ) )
  
                              }

We define a feed-forward function, which passes our 5 inputs through the network, giving us one output ($y$ prediction) per input:

feed_forward <- function( X,
                          W2,W3,W4,W5,
                          b2,b3,b4,b5
){
     z2 <- W2%*%t(X) + matrix( rep(b2, nrow(X)), nrow=2 )
     a2 <- sigmoid(z2)

     z3 <- W3%*%a2 + matrix( rep(b3, ncol(a2)), nrow=2 )
     a3 <- sigmoid(z3)

     z4 <- W4%*%a3 + matrix( rep(b4, ncol(a3)), nrow=2 )
     a4 <- sigmoid(z4) 
     
     z5 <- W5%*%a4 + matrix( rep(b5, ncol(a4)), nrow=1 )
     a5 <- z5
     output_y_hat <- a5    
     
     return( list( z2=z2, a2=a2,
                   z3=z3, a3=a3,
                   z4=z4, a4=a4,
                   z5=z5, a5=a5, 
                   y_hat = output_y_hat)
     )
}

So, for our 5 input vectors,

##                        X1         X2
## observation 1 -0.03699013  0.3261458
## observation 2 -0.56338334 -0.9632378
## observation 3 -0.80566246 -0.4918422
## observation 4 -0.18534015 -0.6948660
## observation 5  0.01943456  0.6305214

the output of our network with our current weights (we’re still using random weights) is:

network_output <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5,b2=b2,b3=b3,b4=b4,b5=b5 )
network_output$y_hat

##      observation 1 observation 2 observation 3 observation 4 observation 5
## [1,]      4.569956      4.560474       4.56669      4.561194      4.572783

The error in the last (5th) layer $\delta^5$ for any single input $x$ is

\[\delta_{x}^5 \quad = \quad \frac{\partial \space C_x}{\partial \space a^5_{1x}}\quad = \quad a^5_{1x} - y_{x}\]

$\delta_{x}^5$ tells us the instantaneous rate of change of the cost function $C_x$ for a unit increase in one of the network outputs $a^5_{1x}$.

So, there will be a different $\delta^5_x$ value for each of our 5 inputs:

delta5 <- matrix( network_output$a5 - y )
rownames(delta5) <- paste("observation", 1:5)
colnames(delta5) <- "delta5"
delta5

##                 delta5
## observation 1 4.280801
## observation 2 6.087095
## observation 3 5.864195
## observation 4 5.441400
## observation 5 3.922827

For any single observation $x$: because a change in the bias in the 4th layer ($b_{1x}^5$) simply changes the final output of the network ($a_{1x}^5$) by that change, the rate of change of the cost function $C_x$ for a 1 unit increase in a $4^{th}$ layer bias weight is simply $\delta^5_x$. Averaging over all 5 $\delta^5_x$ values gives us the instantaneous rate of change of the cost function $\delta^5 = \frac{ \sum_x \delta^5_x}{5}$ for a 1 unit change in bias $b_{1}^5$, taking into account all 5 inputs $x$.

Our $\delta^5_x$ values we calculated as

delta5

##                 delta5
## observation 1 4.280801
## observation 2 6.087095
## observation 3 5.864195
## observation 4 5.441400
## observation 5 3.922827

This tells us that the rate of change in the cost function $C$ for a 1 unit change in the bias weight $b^5_1$ is

delta5_global <- mean(delta5)
delta5_global

## [1] 5.119264

For a change of -0.0001 in $b^5_1$, we’d expect to see a change in the cost $C$ of around

-0.0001 * delta5_global

## [1] -0.0005119264

..and the actual change we see is:

output1 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5,b2=b2,b3=b3,b4=b4,b5=b5 )$a5 
output2 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5,b2=b2,b3=b3,b4=b4,b5=b5-0.0001 )$a5
  
quadratic_cost_ftn(y=y, y_hat=output2) - quadratic_cost_ftn(y=y, y_hat=output1)

## [1] -0.0005119214

The change in cost resulting from a change in the weights is similarly found by aggregating the individual gradients:

The weights $W^5=\begin{bmatrix} W^5_{11} & W^5_{12}\end{bmatrix}$ between the 4th and 5th layer are:

W5

##            [,1]     [,2]
## [1,] -0.5550501 0.914994

The rate of change of $C$ with respect to these weights $W^5$ is:

\[\frac{\partial \space C}{\partial W^5_{11}} \quad = \quad \frac{\sum_{x=1}^5 \Big[ \frac{\partial \space C_x }{\partial \space W^5_{11}} \Big]}{5} \quad = \quad \frac{\sum_{x=1}^5 \Big[ a^4_{1x} \delta^5_{1x} \Big]}{5}\]

and

\[\frac{\partial \space C}{\partial W^5_{12}} \quad = \quad \frac{\sum_{x=1}^5 \Big[ \frac{\partial \space C_x }{\partial \space W^5_{12}} \Big]}{5} \quad = \quad \frac{\sum_{x=1}^5 \Big[ a^4_{2x} \delta^5_{1x} \Big]}{5}\]

For our network with it’s current weights, $a^4$ is

a4 <- network_output$a4
rownames(a4) <- c("a4_1", "a4_2")
a4

##      observation 1 observation 2 observation 3 observation 4 observation 5
## a4_1     0.6630971     0.6572831     0.6626979     0.6568814     0.6650662
## a4_2     0.4853672     0.4714765     0.4815555     0.4720203     0.4896511

and $\delta^5$ is

delta5

##                 delta5
## observation 1 4.280801
## observation 2 6.087095
## observation 3 5.864195
## observation 4 5.441400
## observation 5 3.922827

So, $a^4_{1x} \delta^5_{1x}$ is

dCx_dW5_11 <- a4[1,] * t(delta5)
rownames(dCx_dW5_11) <- "dCx_dW5_11"
dCx_dW5_11

##            observation 1 observation 2 observation 3 observation 4
## dCx_dW5_11      2.838587      4.000945       3.88619      3.574355
##            observation 5
## dCx_dW5_11       2.60894

and $a^4_{2x} \delta^5_{1x}$ is

dCx_dW5_12 <- a4[2,] * t(delta5)
rownames(dCx_dW5_12) <- "dCx_dW5_12"
dCx_dW5_12

##            observation 1 observation 2 observation 3 observation 4
## dCx_dW5_12       2.07776      2.869922      2.823935      2.568451
##            observation 5
## dCx_dW5_12      1.920817

averaging over all 5 input observations $x$ gives us $\frac{\partial \space C}{\partial W^5_{11}}$:

dC_dW5_11 <- mean(dCx_dW5_11)

and $\frac{\partial \space C}{\partial W^5_{12}}$ is

dC_dW5_12 <- mean(dCx_dW5_12)

Let’s check that these derivatives are correct:

a 0.0001 increase to $W^5_{11}$ should change our cost by approximately

dC_dW5_11*0.0001

## [1] 0.0003381803

and it changes it by precisely:

output1 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5,b2=b2,b3=b3,b4=b4,b5=b5 )$a5 
output2 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5+c(0.0001,0),b2=b2,b3=b3,b4=b4,b5=b5 )$a5
  
quadratic_cost_ftn(y=y, y_hat=output2) - quadratic_cost_ftn(y=y, y_hat=output1)

## [1] 0.0003381825

Similarly, a 0.0001 increase to $W^5_{12}$ should change our cost by

dC_dW5_12*0.0001

## [1] 0.0002452177

and it in fact changes it by:

output1 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5,b2=b2,b3=b3,b4=b4,b5=b5 )$a5 
output2 <- feed_forward( X=X,W2=W2,W3=W3,W4=W4,W5=W5+c(0,0.0001),b2=b2,b3=b3,b4=b4,b5=b5 )$a5
  
quadratic_cost_ftn(y=y, y_hat=output2) - quadratic_cost_ftn(y=y, y_hat=output1)

## [1] 0.0002452189

We can find the derivative of the cost function with respect to all of the other weights in the network in the same way: for each of our observations we find the derivative, and then average this derivative value over all of the input observations.

Building the regression network

Now, I am going to build a simple feed-forward regression network in R. I want it to be generalisable in the sense that it can be given different datasets (different numbers of observations/samples and different numbers of inputs) and that the user can specify the number of layers that they want - and the number of nodes in each layer - without rewriting any code.

The model will learn (i.e. the weights updated) using stochastic gradient descent. This is achieved by feeding the network random subsets of the data (batches) one at a time (i.e. it does not see all of the training data in each forward pass). The model is updated after each batch. The process is called gradient descent because the weights and biases in the network are updated using the gradient (derivative) of the cost function.

The algorithm is as follows:

For each batch of input data:

1 For every layer $\mathcal{l}$ from the second layer to the output layer (2,3,4,…,$L$), feed each input $x$ in the batch seperately forward through the network, storing the activations along the way:

i.e. for each $x$ and $l$, calculate \[z^{x,l}=w^l a^{x,l-1}+b^l \quad \text{ and } \quad a^{x,l}=\sigma(z^{x,l}) \\ [\text{note that } a^{x, l-1} \text{ is a vector}]\]

2 Calculate the output error of the last layer \[\delta^{L}\]

With a single output node with linear activation function on the last layer, this is: \[\delta^L \quad = \quad \frac{1}{n}\sum_{x=1}^n \bigg( \frac{\partial \space C_x}{\partial \space a^{x,L}_1} \bigg)\]

..where $n$ is the number of samples/observations in the batch

3 For each previous layer in the network (aside from the first input layer), calculate the error vector $\delta^l$

This is

\[\begin{array}{lcl} \delta^l &=& \displaystyle\frac{\sum_{x=1}^n \delta^{l,x}}{n} \\ \delta^{l,x} &=& \Bigg( \Big( w^{l+1}\Big)^T \space \delta^{l+1,x} \Bigg) \space \odot \space \sigma ' (z^{l,x}) \end{array}\]

4 Calculate the gradient of the cost function $C$ with respect to every bias and weight in the network using

\[\frac{\partial\space C_x}{\partial\space b^l_j} \quad = \quad \delta^l_j\]

and

\[\frac{\partial\space C_x}{\partial\space W^l_{jk}} \quad = \quad a^{l-1}_{k} \times \delta^l_j\]

5 Update every weight and bias in the network simultaneously using the gradient of the cost function for this batch. For example, if the derivative of the cost function $C$ for one of the weights in our network (e.g. $W^{8}_{31}$) is -0.054, then this means that the cost function is decreasing at a rate of $\frac{\partial\space C}{\partial\space W^{8}_{31}} = -0.054$ per unit increase in this weight. So, a natural way to update this weight would be to add 0.054 to it. So, we could simply update every weight $W^{l}_{jk}$ in the network by subtracting $\frac{\partial\space C}{\partial\space W^{l}_{jk}}$ from it. Unfortunately, because we are updating all of the weights simultaneously, these changes in the weights are too large - the network makes huge leaps over possible solutions. In order to combat this, we can make the network learn more slowly by simultaneously updating each weight using

\[\text{updated } W^l_{jk} \quad = \quad W^l_{jk} - \frac{\partial\space C}{\partial\space W^{l}_{jk}} \times\lambda\] and each bias as

\[\text{updated } b^l_{j} \quad = \quad b^l_{j} - \frac{\partial\space C}{\partial\space b^l_{j}} \times\lambda\]

$\lambda$ is a constant that we choose, called the learning rate. We use $\lambda$ to control the speed at which the network learns.

First, we define a function which gives us random starting weights. We will specify the number of layers and nodes like this:

layers.neurons <- c(
                     layer1=2,  # this is the input layer (always num features == num neurons)
                     layer2=2,
                     layer3=2,
                     layer4=2,
                     layer5=1  # this is the output layer (always 1 neuron)
                    )

and the function giving us initial random weights as:

# this function creates a list of weight matrices of the correct dimensions.
# the weights are initialised with rnorm(0,1) values:
initialise_weights <- function( layers.neurons ){
  
  n_hidden_neurons <- length(layers.neurons)-2       # number of hidden layers
  weight_matrix_list <- vector("list",length = n_hidden_neurons*2 + 2 )   # create empty list 
  
  names(weight_matrix_list) <-    # name the entries in the weights matrix list
        c( 
              paste( names(layers.neurons)[2:(n_hidden_neurons+2)], "biases", sep="_"),
              paste( names(layers.neurons)[2:(n_hidden_neurons+2)], "weights", sep="_")
         )
  
  for( layer in 2:(n_hidden_neurons+2) ){     # for each each hidden layer
    
    # add random biases: 
    weight_matrix_list[[paste0("layer",layer,"_biases")]] <- 
        rnorm( layers.neurons[[paste0("layer",layer)]] , mean=0, sd=1 )
    
    # add random weights:
    weight_matrix_list[[paste0("layer",layer,"_weights")]] <-
        matrix( rnorm( layers.neurons[[paste0("layer",layer-1)]] * # number of neurons on LHS
                       layers.neurons[[paste0("layer",layer)]],    # number of neurons on RHS
                       mean=0, sd=1),
                ncol = layers.neurons[[paste0("layer",layer-1)]],  # number of neurons on LHS
                nrow = layers.neurons[[paste0("layer",layer)]]     # number of neurons on RHS
              )

  }
  
  ################################################################  
  # make a little visual representation of the network:   
  blank_matrix <- matrix( c( rep(" ", ( max(layers.neurons)+1 )*length(layers.neurons) ) ),
                          ncol = length(layers.neurons) 
                          )
  colnames(blank_matrix) <- paste( "layer_", 1:ncol(blank_matrix), sep="")
  
  blank_matrix[,1] <- c( "b", 
                               paste("x", 1:layers.neurons[1], sep=""),
                               rep(" ", nrow(blank_matrix) - layers.neurons[1] - 1 ) 
                       )
  
  for( layer in 2:(ncol(blank_matrix)-1) ){
    
    blank_matrix[,layer] <- c( "b", 
                               paste("a", 1:layers.neurons[layer], sep=""),
                               rep(" ", nrow(blank_matrix) - layers.neurons[layer] - 1 ) 
                             )
  }
  blank_matrix[,ncol(blank_matrix)] <- c("a1=output",
                                         rep(" ", nrow(blank_matrix) - 1 ) )
  print(blank_matrix)
  ################################################################
  
  return( weight_matrix_list)
}

So, giving this function the network that we’ve been discussing gives us:

layers.neurons

## layer1 layer2 layer3 layer4 layer5 
##      2      2      2      2      1

starting_weights <- initialise_weights( layers.neurons )

##      layer_1 layer_2 layer_3 layer_4 layer_5    
## [1,] "b"     "b"     "b"     "b"     "a1=output"
## [2,] "x1"    "a1"    "a1"    "a1"    " "        
## [3,] "x2"    "a2"    "a2"    "a2"    " "

starting_weights

## $layer2_biases
## [1] 0.6882792 1.5473721
## 
## $layer3_biases
## [1] -2.1851455 -0.2275997
## 
## $layer4_biases
## [1] -0.4373370 -0.1741804
## 
## $layer5_biases
## [1] 0.2819461
## 
## $layer2_weights
##            [,1]       [,2]
## [1,] -1.0555686  0.1067898
## [2,]  0.7455053 -1.9559190
## 
## $layer3_weights
##            [,1]      [,2]
## [1,]  1.2160849  1.443702
## [2,] -0.1492242 -1.192196
## 
## $layer4_weights
##           [,1]      [,2]
## [1,] 0.4761741 2.1098629
## [2,] 0.2608518 0.5937334
## 
## $layer5_weights
##           [,1]      [,2]
## [1,] 0.5779044 0.1410554

Now, we define our feed-forward function, which takes in as input a batch of input samples and the current weights of our network. This function passes our inputs through the network using the weights that we give it, giving the network output and the activations in each layer along the way.

generic_feedforward_ftn <- function( X,                 # design matrix of input features 
                                     Weight_matrices    # matrices of weights and biases in each layer 
){
  
  n_hidden_layers <- length(Weight_matrices)/2 - 1
  z_list <- vector("list", length = n_hidden_layers+1 )
  names(z_list) <- paste("z", 2:(n_hidden_layers+2), sep="" )
  activations_list <- vector("list", length = n_hidden_layers+1 )
  names(activations_list) <- paste("a", 2:(n_hidden_layers+2), sep="" )
  
  activations_list$a1 <- t(X)
  
  for( layer in 2:(n_hidden_layers+1)){  # for each hidden layer:
    
    # calculate z_[this layer] and store it:
    z <- Weight_matrices[[paste0("layer",layer,"_weights")]] %*% 
         activations_list[[paste0("a",layer-1)]] +
         matrix( rep( Weight_matrices[[paste0("layer",layer,"_biases")]], nrow(X) ),
                 nrow = nrow(Weight_matrices[[paste0("layer",layer,"_weights")]])
               )
    
    z_list[[paste0("z",layer)]] <- z
    
    # calculate a_[this layer] and store it:
    activations_list[[paste0("a",layer)]] <- sigmoid(z)
    
  }
  
  # do the last layer manually (it has a linear activation function)
  z <- Weight_matrices[[paste0("layer",(n_hidden_layers+2),"_weights")]] %*% 
         activations_list[[paste0("a",(n_hidden_layers+1))]] +
         matrix( rep( Weight_matrices[[paste0("layer",(n_hidden_layers+2),"_biases")]], nrow(X) ),
                 nrow = nrow(Weight_matrices[[paste0("layer",(n_hidden_layers+2),"_weights")]])
               )
    
    z_list[[paste0("z",(n_hidden_layers+2))]] <- z
    activations_list[[paste0("a",(n_hidden_layers+2))]] <- z
  
  return( list( z = z_list,
                a = activations_list)
  )

}

Now, we define a back propagation function - a function given the output of the feedforward function and our true response values $y$ that returns the gradient of the cost function with respect to every bias and weight in the network.

We find each

\[\frac{\partial \space C}{\partial \space W^l_{jk}} \quad = \quad \frac{1}{n} \sum_{i=1}^n \bigg( \frac{\partial\space C_x}{\partial\space W^l_{jk}} \bigg) \quad = \quad \frac{1}{n} \sum_{i=1}^n \bigg( a^{l-1,x}_{k} \times \delta^{l,x}_j \bigg)\]

using the matrix multiplication

\[\frac{\partial \space C}{\partial \space W^l} \quad = \quad \frac{ \delta^l(a^{l-1})^T }{n}\]

For example, using the network structure we’ve been discussing (3 hidden layers with 2 nodes in each), and with (for example) a batch of 4 observations, the weights in layer 3 would be:

\[W^3 \quad = \quad \begin{bmatrix} W^3_{11} & W^3_{12} \\ W^3_{21} & W^3_{22} \end{bmatrix}\] and the derivative of the cost with respect to these weights, for this batch of 4 inputs, is:

\[\begin{array}{lcl} \frac{\partial \space C}{\partial \space W^3} &=& \delta^3(a^{2})^T \space \div n \\ &=& \begin{bmatrix} \delta^{3,x=1}_1 & \delta^{3,x=2}_1 & \delta^{3,x=3}_1 & \delta^{3,x=4}_1 \\ \delta^{3,x=1}_2 & \delta^{3,x=2}_2 & \delta^{3,x=3}_2 & \delta^{3,x=4}_2 \end{bmatrix} \begin{bmatrix} a^{2,x=1}_1 & a^{2,x=1}_2 \\ a^{2,x=2}_1 & a^{2,x=2}_2 \\ a^{2,x=3}_1 & a^{2,x=3}_2 \\ a^{2,x=4}_1 & a^{2,x=4}_2 \end{bmatrix} \space \div n \\ &=& \begin{bmatrix} \sum_{x=1}^4 \Big(\delta^{3,x}_1 a^{2,x}_1 \Big) & \sum_{x=1}^4 \Big(\delta^{3,x}_1 a^{2,x}_2 \Big) \\ \sum_{x=1}^4 \Big(\delta^{3,x}_2 a^{2,x}_1 \Big) & \sum_{x=1}^4 \Big(\delta^{3,x}_2 a^{2,x}_2 \Big) \end{bmatrix} \space \div n \\ &=& \begin{bmatrix} \frac{\partial \space C}{\partial W_{11}} & \frac{\partial \space C}{\partial W_{12}} \\ \frac{\partial \space C}{\partial W_{21}} & \frac{\partial \space C}{\partial W_{22}} \end{bmatrix} \end{array}\]

We also require a function which returns the derivative of the sigmoid function:

sigmoid_derivative <- function(x){ sigmoid(x) * (1-sigmoid(x)) }

We create our backpropagation function:

backpropagate_ftn <- function( network_output, Weight_matrices, y ){
  
   store_deltas <- vector( "list", length=length(network_output$a)-1  )
   names(store_deltas) <- paste( "delta_", 2:(length(store_deltas)+1), sep="") 
  
   # deltas in the last layer:
   store_deltas[[paste0("delta_",length(store_deltas)+1)]] <- 
      ( network_output$a[[paste0("a",length(network_output$a))]] - y ) %>% 
         matrix(., nrow=1)
   
   # deltas in all previous layers down to layer 2:    
  for( d in seq( from=length(network_output$a)-1, to=2, by=-1) )
  {
     store_deltas[[paste0("delta_",d)]] <- 
       
       apply( X = store_deltas[[paste0("delta_",d+1)]],
              MARGIN = 2,    # apply seperately to each column
              FUN = function(x) { t(Weight_matrices[[paste0("layer",d+1,"_weights")]]) %*% x }
       ) *
      sigmoid_derivative( network_output$z[[paste0("z",d)]] ) 
  }

   # store gradient of the cost function in terms of the biases in each layer:
   dC_db <- vector( "list", length=length(network_output$a)-1 )
   names(dC_db) <- paste( "dC_db_layer", 2:length(network_output$a), sep="" )
   
   for( layer in 2:length(network_output$a) )
   {
     
     dC_db[[paste0("dC_db_layer",layer)]] <- 
        apply( store_deltas[[paste0("delta_",layer)]],
               MARGIN = 1,   # apply to each row
               FUN = mean
             )
   }
   
   # store gradient of the cost function in terms of the weights in each layer:
   dC_dW <- vector( "list", length=length(network_output$a)-1 )
   names(dC_dW) <- paste( "dC_dW_layer", 2:length(network_output$a), sep="" )
   
   for( l in 2:length(network_output$a) ){
     
     # this commented code was part of my error-checking when I built this:  
     # dC_dW[[paste0("dC_dW_layer",l)]] <- Weight_matrices[[paste0("layer",l,"_weights")]]
      
      # for( k in 1:nrow(network_output$a[[paste0("a",l-1)]]) ){
      #   
      #    for( j in 1:nrow(store_deltas[[paste0("delta_",l)]]) ){
      #      
      #      dC_dW[[paste0("dC_dW_layer",l)]][j,k] <- 
      #          mean( 
      #                 network_output$a[[paste0("a",l-1)]][k,] *  
      #                 store_deltas[[paste0("delta_",l)]][j,]     
      #              )
      #    }
      # }
     
     # shorter way: 
     dC_dW[[paste0("dC_dW_layer",l)]] <- 
        store_deltas[[paste0("delta_",l)]] %*% t( network_output$a[[paste0("a",l-1)]] ) /
              length(y)
     
   }
   
   return( list( dC_db=dC_db,
                 dC_dW=dC_dW
                )
   )
}

Now, we make a function which uses the gradient of the cost function with respect to the weights and biases (the output of the backpropagation function above) to update the weights of our network. We choose a default learning rate of 0.001 (i.e. if the user doesn’t specify a learning rate, then this is the learning rate that the function will use).

update_weights <- function( current_weights, 
                            dC_db,
                            dC_dW,
                            learning_rate = 0.001 ){
  
  updated_weights <- current_weights
  
  # update the biases:
  for( b in 2:( length(names(dC_db))+1 ) ){
    
    updated_weights[[paste0("layer",b,"_biases")]] <- 
            current_weights[[paste0("layer",b,"_biases")]] - 
            ( learning_rate * dC_db[[paste0("dC_db_layer",b)]] )
      
  }
  
  # update the weights:
  for( w in 2:( length(names(dC_dW))+1 ) ){
    
    updated_weights[[paste0("layer",w,"_weights")]] <- 
            current_weights[[paste0("layer",w,"_weights")]] - 
            ( learning_rate * dC_dW[[paste0("dC_dW_layer",w)]] )
      
  }
    
  return( updated_weights )    
  
}

Now, let’s have a look at using these functions to fit a network to some data.

Using our regression network

The dataset I’ve chosen is called the Boston Housing dataset, found in the mlbench package.

We are trying to predict median value of houses (medv) using a whole lot of attributes of the houses and the suburbs which the houses are in.

Let’s load the dataset and have a look at it:

# load the data and store it in a tibble():
library(mlbench)
data(BostonHousing)
boston_data <- as_tibble( BostonHousing )

# number of rows in the data:
nrow(boston_data)

## [1] 506

# first 10 rows of the data:
head(boston_data, 10)

summary(boston_data)

##       crim                zn             indus       chas   
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1: 35  
##  Median : 0.25651   Median :  0.00   Median : 9.69          
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14          
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10          
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74          
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio            b         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

# store the response values:
y <- boston_data$medv

# store the features in a design matrix: 
X <- boston_data %>% 
        mutate( chas = as.numeric(chas) ) %>%   # make chas() column numeric
        select(-medv) %>% 
        as.matrix() %>% 
        scale()                             # standardize the data
        
head(X)

##            crim         zn      indus       chas        nox        rm
## [1,] -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629
## [2,] -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824
## [3,] -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456
## [4,] -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978
## [5,] -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620
## [6,] -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916
##             age      dis        rad        tax    ptratio         b
## [1,] -0.1198948 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159
## [2,]  0.3668034 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159
## [3,] -0.2655490 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351
## [4,] -0.8090878 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514
## [5,] -0.5106743 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159
## [6,] -0.3508100 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651
##           lstat
## [1,] -1.0744990
## [2,] -0.4919525
## [3,] -1.2075324
## [4,] -1.3601708
## [5,] -1.0254866
## [6,] -1.0422909

Notice that I also standardized the features data - when writing these notes and experimenting with the network, I found that weights tended to explode (saturate) or become zero when I did not standardize the data. Also, R was losing information because it couldn’t store all of the decimal places in my near-zero weights.

Standardizing means giving each column (feature) a mean of 0 and standard deviation of 1:

# check that the data has been correctly standardized:
colMeans(X)

##                        crim                          zn 
## -0.000000000000000006899468  0.000000000000000022983372 
##                       indus                        chas 
##  0.000000000000000015166831 -0.000000000000000158524928 
##                         nox                          rm 
## -0.000000000000000214941152 -0.000000000000000105852415 
##                         age                         dis 
## -0.000000000000000164503896  0.000000000000000114450610 
##                         rad                         tax 
##  0.000000000000000046515273  0.000000000000000019061388 
##                     ptratio                           b 
## -0.000000000000000393103424 -0.000000000000000115599093 
##                       lstat 
## -0.000000000000000070122597

apply( X, 2, sd)

##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##       1       1       1       1       1       1       1       1       1 
##     tax ptratio       b   lstat 
##       1       1       1       1

In addition, notice that all features must be numeric (continuous) variables. Categorical variables (of which there is only one - chas- in this dataset) must be one-hot encoded for the network.

Let’s see if our functions are working correctly:

First, we define our network structure, and create initial random weights:

layers.neurons <- c(
                     layer1=ncol(X),  # this is the input layer (always num features == num neurons)
                     layer2=2,
                     layer3=2,
                     layer4=2,
                     layer5=1  # this is the output layer (always 1 neuron)
                    )

layers.neurons

## layer1 layer2 layer3 layer4 layer5 
##     13      2      2      2      1

starting_weights <- initialise_weights( layers.neurons )

##       layer_1 layer_2 layer_3 layer_4 layer_5    
##  [1,] "b"     "b"     "b"     "b"     "a1=output"
##  [2,] "x1"    "a1"    "a1"    "a1"    " "        
##  [3,] "x2"    "a2"    "a2"    "a2"    " "        
##  [4,] "x3"    " "     " "     " "     " "        
##  [5,] "x4"    " "     " "     " "     " "        
##  [6,] "x5"    " "     " "     " "     " "        
##  [7,] "x6"    " "     " "     " "     " "        
##  [8,] "x7"    " "     " "     " "     " "        
##  [9,] "x8"    " "     " "     " "     " "        
## [10,] "x9"    " "     " "     " "     " "        
## [11,] "x10"   " "     " "     " "     " "        
## [12,] "x11"   " "     " "     " "     " "        
## [13,] "x12"   " "     " "     " "     " "        
## [14,] "x13"   " "     " "     " "     " "

Let’s choose a batch of 20 random observations from our data to feed through the network:

batch_indices <- sample( 1:nrow(X), size=20, replace=FALSE )

batch_indices

##  [1] 395 255 501 262  21 214 446 283 181 203 241  95 108 299 359 234 184
## [18] 267  56 159

y_test <- y[batch_indices] 
X_test <- X[batch_indices,]

# responses for our batch: (we want to predict this)
y_test

##  [1] 12.7 21.9 16.8 43.1 13.6 28.1 11.8 46.0 39.8 42.3 22.0 20.6 20.4 22.5
## [15] 22.7 48.3 32.5 30.7 35.4 24.3

# features for our batch: (we want to use this to predict y)
X_test

##             crim         zn       indus       chas        nox          rm
##  [1,]  1.1330844 -0.4872402  1.01499462 -0.2723291  1.1935426 -0.56593323
##  [2,] -0.4144992  2.9429307 -1.09276866 -0.2723291 -1.4040242 -0.25139493
##  [3,] -0.3940157 -0.4872402 -0.21088983 -0.2723291  0.2615253 -0.36667820
##  [4,] -0.3580059  0.3703025 -1.04466617 -0.2723291  0.7965722  1.75823437
##  [5,] -0.2745709 -0.4872402 -0.43682573 -0.2723291 -0.1440749 -1.01710355
##  [6,] -0.4037651 -0.4872402 -0.07970124 -0.2723291 -0.5669346  0.12861288
##  [7,]  0.8205824 -0.4872402  1.01499462 -0.2723291  1.5991427  0.24816590
##  [8,] -0.4129762  0.3703025 -1.13795583  3.6647712 -0.9647679  1.93614065
##  [9,] -0.4124426 -0.4872402 -1.26477147 -0.2723291 -0.5755644  2.10693068
## [10,] -0.4175707  3.0501236 -1.32745046 -0.2723291 -1.2055390  1.88632689
## [11,] -0.4069308  0.7990739 -0.90473168 -0.2723291 -1.0933517  0.87154949
## [12,] -0.4151096  0.7133196  0.56895343 -0.2723291 -0.7826793 -0.05071665
## [13,] -0.4048521 -0.4872402 -0.37560439 -0.2723291 -0.2994111 -0.22435318
## [14,] -0.4125844  2.5141594 -1.29683979 -0.2723291 -1.3349859  0.08591537
## [15,]  0.1846466 -0.4872402  1.01499462  3.6647712  1.8580364 -0.22435318
## [16,] -0.3815656 -0.4872402 -0.71961001 -0.2723291 -0.4115983  2.79293728
## [17,] -0.4084666 -0.4872402 -1.26477147 -0.2723291 -0.5755644  0.39618392
## [18,] -0.3287576  0.3703025 -1.04466617 -0.2723291  0.7965722  1.03806976
## [19,] -0.4185775  3.3717021 -1.44552019 -0.2723291 -1.3090965  1.37253356
## [20,] -0.2639855 -0.4872402  1.23072696 -0.2723291  0.4341211 -0.31117144
##               age          dis        rad         tax     ptratio
##  [1,]  0.92810499 -0.955944820  1.6596029  1.52941294  0.80577843
##  [2,] -1.29933856  2.576450217 -0.9818712 -0.55321437 -0.94946204
##  [3,]  0.39522376 -0.615869521 -0.4076377 -0.10227512  0.34387304
##  [4,]  0.73982029 -0.786073385 -0.5224844 -0.85581834 -2.51994036
##  [5,]  1.04889141  0.001356935 -0.6373311 -0.60068166  1.17530274
##  [6,] -1.28868094  0.071404563 -0.6373311 -0.77868399  0.06672981
##  [7,]  0.93165753 -0.858210570  1.6596029  1.52941294  0.80577843
##  [8,] -0.67053871  0.672864367 -0.5224844 -1.14062207 -1.64232012
##  [9,]  0.52311526 -0.500564002 -0.7521778 -1.27709053 -0.30279450
## [10,] -1.87840284  1.175355183 -0.8670245 -0.35741180 -1.73470120
## [11,] -0.50712180  1.206746019 -0.4076377 -0.64221553 -0.85708096
## [12,]  0.30996276 -0.085502124 -0.6373311 -0.82021787 -0.11803234
## [13,]  0.59061354 -0.794336631 -0.5224844 -0.14380900  1.12911220
## [14,] -1.72209101  1.915153117 -0.5224844 -0.29807769 -1.68851066
## [15,]  0.52666780 -0.509254657  1.6596029  1.52941294  0.80577843
## [16,]  0.06483739 -0.067978344 -0.1779443 -0.60068166 -0.48755665
## [17,]  0.96007787 -0.450224689 -0.7521778 -1.27709053 -0.30279450
## [18,]  0.56929830 -0.789350190 -0.5224844 -0.85581834 -2.51994036
## [19,] -1.65814526  2.327745519 -0.5224844 -1.08128796 -0.25660396
## [20,]  1.11638970 -0.967722319 -0.5224844 -0.03107419 -1.73470120
##                 b      lstat
##  [1,]  0.44061589  0.5177013
##  [2,]  0.39669229 -0.8518430
##  [3,]  0.44061589  0.2348302
##  [4,]  0.34718238 -0.7552187
##  [5,]  0.21793086  1.1716657
##  [6,]  0.31914137 -0.4583441
##  [7,] -3.43517715  1.5861699
##  [8,]  0.22340762 -1.3503683
##  [9,]  0.42593818 -0.7132081
## [10,]  0.42396655 -1.3363648
## [11,]  0.37872851 -0.1782737
## [12,]  0.44061589 -0.2889015
## [13,]  0.33973399  0.2012217
## [14,]  0.12668805 -1.0758993
## [15,]  0.42451422 -0.1642701
## [16,]  0.24400024 -1.2187352
## [17,]  0.44061589 -0.9764743
## [18,]  0.30008225  0.2992464
## [19,]  0.42999098 -1.0983050
## [20,] -0.03049494 -0.8714479

We feed these through the network using our forward-propagation function:

feedforward_test <- 
  generic_feedforward_ftn( X = X_test,                 # design matrix of input features 
                           Weight_matrices = starting_weights
                         )

The output of our feedforward function (all of the activations) is:

feedforward_test

## $z
## $z$z2
##            [,1]      [,2]       [,3]      [,4]        [,5]       [,6]
## [1,] -1.2514633 0.3404809 -2.2271252 -1.665109 -3.65542990 -0.1755184
## [2,] -0.1724963 3.6851337 -0.7781849  2.818071 -0.05381885 -2.8070118
##           [,7]       [,8]      [,9]    [,10]      [,11]     [,12]
## [1,] 1.6172529 -2.3046574 -1.708404 2.252070 -0.7090949 -2.638019
## [2,] 0.1220298  0.2348854  1.981021 3.357432  2.4396314 -1.138108
##           [,13]    [,14]     [,15]     [,16]     [,17]     [,18]    [,19]
## [1,] -2.5656230 1.132987 -1.451897 0.1139444 -3.347483 -2.118996 3.020145
## [2,] -0.3027117 3.864323 -3.839509 1.7900800  2.643977  1.973718 3.088623
##           [,20]
## [1,] -4.1683464
## [2,] -0.7166504
## 
## $z$z3
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.8737633 -0.8193857 -0.8654643 -0.4603785 -0.6747525 -1.3769399
## [2,] -0.3323693 -0.5436346 -0.2658510 -0.3971617 -0.2722440 -0.3392953
##            [,7]       [,8]       [,9]     [,10]      [,11]      [,12]
## [1,] -1.3702057 -0.6810239 -0.5027434 -1.114305 -0.6313477 -0.8907019
## [2,] -0.5480328 -0.3066277 -0.3838397 -0.648188 -0.4495089 -0.2430128
##           [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## [1,] -0.7614322 -0.9712353 -1.1635796 -0.8560097 -0.3552211 -0.4618573
## [2,] -0.2766748 -0.6013289 -0.2446845 -0.5042316 -0.3539597 -0.3684297
##           [,19]      [,20]
## [1,] -1.1653163 -0.7818159
## [2,] -0.6625522 -0.2410218
## 
## $z$z4
##             [,1]        [,2]       [,3]        [,4]       [,5]        [,6]
## [1,] -2.82449725 -2.78841671 -2.8390207 -2.86023434 -2.8595021 -2.77394685
## [2,] -0.09053358 -0.05248912 -0.1032784 -0.08607696 -0.1054911 -0.08149761
##             [,7]        [,8]        [,9]       [,10]       [,11]
## [1,] -2.73295690 -2.85171599 -2.85763267 -2.73728297 -2.82877871
## [2,] -0.04310659 -0.09885135 -0.08773025 -0.02904271 -0.07307695
##           [,12]      [,13]       [,14]      [,15]       [,16]       [,17]
## [1,] -2.8409564 -2.8484977 -2.76075983 -2.8126255 -2.79203970 -2.88229324
## [2,] -0.1071966 -0.1030605 -0.03958129 -0.1024745 -0.05897949 -0.09625501
##           [,18]       [,19]      [,20]
## [1,] -2.8658459 -2.72962750 -2.8535048
## [2,] -0.0914148 -0.02575974 -0.1094863
## 
## $z$z5
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.5575505 -0.5599363 -0.5568861 -0.5602771 -0.5578579 -0.5554507
##            [,7]       [,8]       [,9]      [,10]      [,11]      [,12]
## [1,] -0.5574472 -0.5582003 -0.5599183 -0.5594616 -0.5599526 -0.5565253
##           [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## [1,] -0.5574904 -0.5597259 -0.5553496 -0.5593726 -0.5603437 -0.5599601
##           [,19]      [,20]
## [1,] -0.5593497 -0.5570079
## 
## 
## $a
## $a$a2
##           [,1]      [,2]       [,3]      [,4]       [,5]       [,6]
## [1,] 0.2224469 0.5843073 0.09734094 0.1590773 0.02519898 0.45623271
## [2,] 0.4569825 0.9755205 0.31471121 0.9436446 0.48654853 0.05694645
##           [,7]       [,8]      [,9]     [,10]     [,11]      [,12]
## [1,] 0.8344159 0.09073797 0.1533708 0.9048290 0.3297989 0.06673128
## [2,] 0.5304697 0.55845287 0.8787900 0.9663474 0.9197999 0.24266792
##          [,13]     [,14]      [,15]     [,16]      [,17]     [,18]
## [1,] 0.0713839 0.7563896 0.18970973 0.5284553 0.03397768 0.1072642
## [2,] 0.4248947 0.9794539 0.02105146 0.8569371 0.93363877 0.8780099
##          [,19]      [,20]
## [1,] 0.9534759 0.01524192
## [2,] 0.9564210 0.32813101
## 
## $a$a3
##           [,1]      [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.2944718 0.3058941 0.296199 0.3868960 0.3374335 0.2015009 0.2025866
## [2,] 0.4176642 0.3673425 0.433926 0.4019945 0.4323563 0.4159807 0.3663209
##           [,8]      [,9]     [,10]     [,11]    [,12]     [,13]     [,14]
## [1,] 0.3360328 0.3768962 0.2470692 0.3472050 0.290965 0.3183354 0.2746343
## [2,] 0.4239381 0.4052011 0.3433980 0.3894775 0.439544 0.4312692 0.3540397
##          [,15]     [,16]     [,17]     [,18]     [,19]     [,20]
## [1,] 0.2380175 0.2981737 0.4121169 0.3865453 0.2377026 0.3139286
## [2,] 0.4391323 0.3765468 0.4124225 0.4089205 0.3401665 0.4400346
## 
## $a$a4
##            [,1]       [,2]       [,3]      [,4]       [,5]       [,6]
## [1,] 0.05601466 0.05795333 0.05525163 0.0541547 0.05419221 0.05874838
## [2,] 0.47738205 0.48688073 0.47420332 0.4784940 0.47365166 0.47963687
##            [,7]       [,8]       [,9]      [,10]     [,11]      [,12]
## [1,] 0.06105643 0.05459268 0.05428811 0.06080889 0.0557887 0.05515068
## [2,] 0.48922502 0.47530727 0.47808149 0.49273983 0.4817389 0.47322648
##           [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## [1,] 0.05475903 0.05948184 0.05664572 0.05775585 0.05303584 0.05386798
## [2,] 0.47425765 0.49010597 0.47440378 0.48525940 0.47595481 0.47716220
##           [,19]      [,20]
## [1,] 0.06124758 0.05450043
## [2,] 0.49356042 0.47265572
## 
## $a$a5
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.5575505 -0.5599363 -0.5568861 -0.5602771 -0.5578579 -0.5554507
##            [,7]       [,8]       [,9]      [,10]      [,11]      [,12]
## [1,] -0.5574472 -0.5582003 -0.5599183 -0.5594616 -0.5599526 -0.5565253
##           [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## [1,] -0.5574904 -0.5597259 -0.5553496 -0.5593726 -0.5603437 -0.5599601
##           [,19]      [,20]
## [1,] -0.5593497 -0.5570079
## 
## $a$a1
##               [,1]       [,2]       [,3]       [,4]         [,5]
## crim     1.1330844 -0.4144992 -0.3940157 -0.3580059 -0.274570851
## zn      -0.4872402  2.9429307 -0.4872402  0.3703025 -0.487240187
## indus    1.0149946 -1.0927687 -0.2108898 -1.0446662 -0.436825726
## chas    -0.2723291 -0.2723291 -0.2723291 -0.2723291 -0.272329068
## nox      1.1935426 -1.4040242  0.2615253  0.7965722 -0.144074855
## rm      -0.5659332 -0.2513949 -0.3666782  1.7582344 -1.017103545
## age      0.9281050 -1.2993386  0.3952238  0.7398203  1.048891406
## dis     -0.9559448  2.5764502 -0.6158695 -0.7860734  0.001356935
## rad      1.6596029 -0.9818712 -0.4076377 -0.5224844 -0.637331090
## tax      1.5294129 -0.5532144 -0.1022751 -0.8558183 -0.600681657
## ptratio  0.8057784 -0.9494620  0.3438730 -2.5199404  1.175302739
## b        0.4406159  0.3966923  0.4406159  0.3471824  0.217930861
## lstat    0.5177013 -0.8518430  0.2348302 -0.7552187  1.171665689
##                [,6]       [,7]       [,8]       [,9]      [,10]      [,11]
## crim    -0.40376508  0.8205824 -0.4129762 -0.4124426 -0.4175707 -0.4069308
## zn      -0.48724019 -0.4872402  0.3703025 -0.4872402  3.0501236  0.7990739
## indus   -0.07970124  1.0149946 -1.1379558 -1.2647715 -1.3274505 -0.9047317
## chas    -0.27232907 -0.2723291  3.6647712 -0.2723291 -0.2723291 -0.2723291
## nox     -0.56693456  1.5991427 -0.9647679 -0.5755644 -1.2055390 -1.0933517
## rm       0.12861288  0.2481659  1.9361406  2.1069307  1.8863269  0.8715495
## age     -1.28868094  0.9316575 -0.6705387  0.5231153 -1.8784028 -0.5071218
## dis      0.07140456 -0.8582106  0.6728644 -0.5005640  1.1753552  1.2067460
## rad     -0.63733109  1.6596029 -0.5224844 -0.7521778 -0.8670245 -0.4076377
## tax     -0.77868399  1.5294129 -1.1406221 -1.2770905 -0.3574118 -0.6422155
## ptratio  0.06672981  0.8057784 -1.6423201 -0.3027945 -1.7347012 -0.8570810
## b        0.31914137 -3.4351771  0.2234076  0.4259382  0.4239665  0.3787285
## lstat   -0.45834408  1.5861699 -1.3503683 -0.7132081 -1.3363648 -0.1782737
##               [,12]      [,13]       [,14]      [,15]       [,16]
## crim    -0.41510955 -0.4048521 -0.41258443  0.1846466 -0.38156558
## zn       0.71331963 -0.4872402  2.51415937 -0.4872402 -0.48724019
## indus    0.56895343 -0.3756044 -1.29683979  1.0149946 -0.71961001
## chas    -0.27232907 -0.2723291 -0.27232907  3.6647712 -0.27232907
## nox     -0.78267931 -0.2994111 -1.33498587  1.8580364 -0.41159834
## rm      -0.05071665 -0.2243532  0.08591537 -0.2243532  2.79293728
## age      0.30996276  0.5906135 -1.72209101  0.5266678  0.06483739
## dis     -0.08550212 -0.7943366  1.91515312 -0.5092547 -0.06797834
## rad     -0.63733109 -0.5224844 -0.52248439  1.6596029 -0.17794429
## tax     -0.82021787 -0.1438090 -0.29807769  1.5294129 -0.60068166
## ptratio -0.11803234  1.1291122 -1.68851066  0.8057784 -0.48755665
## b        0.44061589  0.3397340  0.12668805  0.4245142  0.24400024
## lstat   -0.28890148  0.2012217 -1.07589932 -0.1642701 -1.21873523
##              [,17]      [,18]      [,19]       [,20]
## crim    -0.4084666 -0.3287576 -0.4185775 -0.26398554
## zn      -0.4872402  0.3703025  3.3717021 -0.48724019
## indus   -1.2647715 -1.0446662 -1.4455202  1.23072696
## chas    -0.2723291 -0.2723291 -0.2723291 -0.27232907
## nox     -0.5755644  0.7965722 -1.3090965  0.43412107
## rm       0.3961839  1.0380698  1.3725336 -0.31117144
## age      0.9600779  0.5692983 -1.6581453  1.11638970
## dis     -0.4502247 -0.7893502  2.3277455 -0.96772232
## rad     -0.7521778 -0.5224844 -0.5224844 -0.52248439
## tax     -1.2770905 -0.8558183 -1.0812880 -0.03107419
## ptratio -0.3027945 -2.5199404 -0.2566040 -1.73470120
## b        0.4406159  0.3000823  0.4299910 -0.03049494
## lstat   -0.9764743  0.2992464 -1.0983050 -0.87144793

Now, we back propagate to get the gradient of the cost function with respect to each weight and bias in the network:

test_backprop <- backpropagate_ftn( network_output = feedforward_test,  # output of feedforward function
                                    Weight_matrices = starting_weights,  # current weights
                                    y = y_test                  # responses for this batch 
                                  )

The output of the back propagation function (all of the gradients) is:

test_backprop

## $dC_db
## $dC_db$dC_db_layer2
## [1] -0.003375796  0.017413217
## 
## $dC_db$dC_db_layer3
## [1]  0.1355797 -0.2883371
## 
## $dC_db$dC_db_layer4
## [1] -1.769622  3.468732
## 
## $dC_db$dC_db_layer5
## [1] -28.3334
## 
## 
## $dC_dW
## $dC_dW$dC_dW_layer2
##              crim            zn        indus         chas          nox
## [1,]  0.001096192 -0.0026485736  0.002881480 0.0001052848  0.001683371
## [2,] -0.004358103  0.0008913972 -0.007442556 0.0080698440 -0.003477994
##               rm         age          dis          rad          tax
## [1,] -0.00431844 0.001145423 -0.001661618  0.001374996  0.001885165
## [2,]  0.01450535 0.003558910 -0.002376660 -0.005557485 -0.008565146
##          ptratio             b        lstat
## [1,]  0.00314627 -0.0008711749  0.002519569
## [2,] -0.01017325  0.0025611529 -0.007565479
## 
## $dC_dW$dC_dW_layer3
##             [,1]        [,2]
## [1,]  0.04581234  0.09624287
## [2,] -0.09211988 -0.19783939
## 
## $dC_dW$dC_dW_layer4
##            [,1]       [,2]
## [1,] -0.5471392 -0.7029606
## [2,]  1.0800971  1.3816790
## 
## $dC_dW$dC_dW_layer5
##          [,1]     [,2]
## [1,] -1.59951 -13.6218

Let’s see that this is not a lie:

For this batch of 20 observations, the derivative of the cost function with respect to the third layer weight $W^{3}_{12}$ should be (according to our backpropagation function output):

test_backprop$dC_dW$dC_dW_layer3[1,2]

## [1] 0.09624287

This means that a change of $+0.0001$ to this weight should change the cost by approximately $0.0001\times$ 0.0962429 = 0.0000096

The observed change in cost is:

quadratic_cost_ftn <- function( # the cost function is a function of:
                                y,      # our true response values 
                                y_hat   # our predicted response values  
){
  
  return( mean( 0.5*(y-y_hat)^2 ) )
  
                              }

cost_before_change <- quadratic_cost_ftn(     y = y_test,      # true response values 
                                          y_hat = feedforward_test$a$a5   # predicted response values  
                                        )

# increase weight W3_12 by +0.0001
updated_starting_weights <- starting_weights
updated_starting_weights$layer3_weights <- 
    updated_starting_weights$layer3_weights + matrix( c(0,0.0001,0,0), byrow=TRUE, ncol=2 )

feedforward_with_new_weight_W3_12 <-
      generic_feedforward_ftn( X = X_test, 
                               Weight_matrices = updated_starting_weights
                             )
 
cost_with_new_weight <-
  quadratic_cost_ftn( y = y_test, y_hat = feedforward_with_new_weight_W3_12$a$a5 )

cost_with_new_weight - cost_before_change

## [1] 0.000009624369

Now, let us see if the weight-updating function works correctly by using our back-propagation output to update the weights:

new_weights <- update_weights(  current_weights = starting_weights, 
                                dC_db = test_backprop$dC_db,
                                dC_dW = test_backprop$dC_dW,
                                learning_rate = 0.001 
                              )

Let’s test a random weight to see if this has worked correctly:

The biases on the second layer were initially

starting_weights$layer2_biases

## [1] -1.1769578 -0.1388739

and the derivative of the cost function with respect to these biases was then:

test_backprop$dC_db$dC_db_layer2

## [1] -0.003375796  0.017413217

So, $b^2_1$ = -1.1769578 should have been updated to -1.1769578 - -0.0033758 $\times 0.001$ = -1.1769544 and $b^2_2$ = -0.1388739 should have been updated to -0.1388739 - 0.0174132 $\times 0.001$ = -0.1388913

The function updated these biases to:

new_weights$layer2_biases

## [1] -1.1769544 -0.1388913

Now, let’s see how our network performs on the full dataset.

I split the data into 90% training data (to train the model on) and 10% testing data (to test the performance of the model on):

# store the response variable in a vector y:
y <- boston_data$medv

training_set_indices <- sample( 1:nrow(X), size=round(0.9*nrow(X)), replace=FALSE )

training_y <- y[training_set_indices] 
test_y <- y[-training_set_indices] 

X_unstandard <- 
    boston_data %>% 
        mutate( chas = as.numeric(chas) ) %>%   # make chas() column numeric
        select(-medv) %>% 
        as.matrix() 

training_X_unstandard <- X_unstandard[training_set_indices,] 
test_X_unstandard <- X_unstandard[-training_set_indices,] 

mean_training_X <- mean(training_X_unstandard)
sd_training_X <- sd(training_X_unstandard)

training_X_standard <- (training_X_unstandard - mean_training_X) / sd_training_X     
test_X_standard <- (test_X_unstandard - mean_training_X) / sd_training_X     

length(training_y); length(test_y)

## [1] 455

## [1] 51

nrow(training_X_standard); nrow(test_X_standard)

## [1] 455

## [1] 51

Notice that the standardisation of the test data uses the mean and variance of the training data.

Let’s also fit a bigger network: 6 layers with 70 nodes in each hidden layer:

# specify number of layers, and number of neurons in each layer:
joes_network <- 
          c( 
             layer1=ncol(training_X_standard),  # input layer (always num features == num neurons)
             layer2=70,
             layer3=70,
             layer4=70,
             layer5=70,
             layer6=1    # this is the output layer (always 1 neuron)
            )

# initialise the weights with random values
nn_weights <- initialise_weights( layers.neurons = joes_network )

##       layer_1 layer_2 layer_3 layer_4 layer_5 layer_6    
##  [1,] "b"     "b"     "b"     "b"     "b"     "a1=output"
##  [2,] "x1"    "a1"    "a1"    "a1"    "a1"    " "        
##  [3,] "x2"    "a2"    "a2"    "a2"    "a2"    " "        
##  [4,] "x3"    "a3"    "a3"    "a3"    "a3"    " "        
##  [5,] "x4"    "a4"    "a4"    "a4"    "a4"    " "        
##  [6,] "x5"    "a5"    "a5"    "a5"    "a5"    " "        
##  [7,] "x6"    "a6"    "a6"    "a6"    "a6"    " "        
##  [8,] "x7"    "a7"    "a7"    "a7"    "a7"    " "        
##  [9,] "x8"    "a8"    "a8"    "a8"    "a8"    " "        
## [10,] "x9"    "a9"    "a9"    "a9"    "a9"    " "        
## [11,] "x10"   "a10"   "a10"   "a10"   "a10"   " "        
## [12,] "x11"   "a11"   "a11"   "a11"   "a11"   " "        
## [13,] "x12"   "a12"   "a12"   "a12"   "a12"   " "        
## [14,] "x13"   "a13"   "a13"   "a13"   "a13"   " "        
## [15,] " "     "a14"   "a14"   "a14"   "a14"   " "        
## [16,] " "     "a15"   "a15"   "a15"   "a15"   " "        
## [17,] " "     "a16"   "a16"   "a16"   "a16"   " "        
## [18,] " "     "a17"   "a17"   "a17"   "a17"   " "        
## [19,] " "     "a18"   "a18"   "a18"   "a18"   " "        
## [20,] " "     "a19"   "a19"   "a19"   "a19"   " "        
## [21,] " "     "a20"   "a20"   "a20"   "a20"   " "        
## [22,] " "     "a21"   "a21"   "a21"   "a21"   " "        
## [23,] " "     "a22"   "a22"   "a22"   "a22"   " "        
## [24,] " "     "a23"   "a23"   "a23"   "a23"   " "        
## [25,] " "     "a24"   "a24"   "a24"   "a24"   " "        
## [26,] " "     "a25"   "a25"   "a25"   "a25"   " "        
## [27,] " "     "a26"   "a26"   "a26"   "a26"   " "        
## [28,] " "     "a27"   "a27"   "a27"   "a27"   " "        
## [29,] " "     "a28"   "a28"   "a28"   "a28"   " "        
## [30,] " "     "a29"   "a29"   "a29"   "a29"   " "        
## [31,] " "     "a30"   "a30"   "a30"   "a30"   " "        
## [32,] " "     "a31"   "a31"   "a31"   "a31"   " "        
## [33,] " "     "a32"   "a32"   "a32"   "a32"   " "        
## [34,] " "     "a33"   "a33"   "a33"   "a33"   " "        
## [35,] " "     "a34"   "a34"   "a34"   "a34"   " "        
## [36,] " "     "a35"   "a35"   "a35"   "a35"   " "        
## [37,] " "     "a36"   "a36"   "a36"   "a36"   " "        
## [38,] " "     "a37"   "a37"   "a37"   "a37"   " "        
## [39,] " "     "a38"   "a38"   "a38"   "a38"   " "        
## [40,] " "     "a39"   "a39"   "a39"   "a39"   " "        
## [41,] " "     "a40"   "a40"   "a40"   "a40"   " "        
## [42,] " "     "a41"   "a41"   "a41"   "a41"   " "        
## [43,] " "     "a42"   "a42"   "a42"   "a42"   " "        
## [44,] " "     "a43"   "a43"   "a43"   "a43"   " "        
## [45,] " "     "a44"   "a44"   "a44"   "a44"   " "        
## [46,] " "     "a45"   "a45"   "a45"   "a45"   " "        
## [47,] " "     "a46"   "a46"   "a46"   "a46"   " "        
## [48,] " "     "a47"   "a47"   "a47"   "a47"   " "        
## [49,] " "     "a48"   "a48"   "a48"   "a48"   " "        
## [50,] " "     "a49"   "a49"   "a49"   "a49"   " "        
## [51,] " "     "a50"   "a50"   "a50"   "a50"   " "        
## [52,] " "     "a51"   "a51"   "a51"   "a51"   " "        
## [53,] " "     "a52"   "a52"   "a52"   "a52"   " "        
## [54,] " "     "a53"   "a53"   "a53"   "a53"   " "        
## [55,] " "     "a54"   "a54"   "a54"   "a54"   " "        
## [56,] " "     "a55"   "a55"   "a55"   "a55"   " "        
## [57,] " "     "a56"   "a56"   "a56"   "a56"   " "        
## [58,] " "     "a57"   "a57"   "a57"   "a57"   " "        
## [59,] " "     "a58"   "a58"   "a58"   "a58"   " "        
## [60,] " "     "a59"   "a59"   "a59"   "a59"   " "        
## [61,] " "     "a60"   "a60"   "a60"   "a60"   " "        
## [62,] " "     "a61"   "a61"   "a61"   "a61"   " "        
## [63,] " "     "a62"   "a62"   "a62"   "a62"   " "        
## [64,] " "     "a63"   "a63"   "a63"   "a63"   " "        
## [65,] " "     "a64"   "a64"   "a64"   "a64"   " "        
## [66,] " "     "a65"   "a65"   "a65"   "a65"   " "        
## [67,] " "     "a66"   "a66"   "a66"   "a66"   " "        
## [68,] " "     "a67"   "a67"   "a67"   "a67"   " "        
## [69,] " "     "a68"   "a68"   "a68"   "a68"   " "        
## [70,] " "     "a69"   "a69"   "a69"   "a69"   " "        
## [71,] " "     "a70"   "a70"   "a70"   "a70"   " "

# store how many layers are in the network:
n_layers <- length(joes_network)

We tell the network to learn/train for 10,000 epochs. Each epoch is made up of multiple batches of our training data. Once the neural network has seen every training observation/sample once, this concludes one epoch.

We will divide our training data in half (2 batches) because the training set is quite small.

Then, in each epoch we will divide our data randomly in half (the data will be split differently in each epoch). The network will update it’s weights after seeing batch 1. After concluding batch 1, the network will evaluate batch 2 and update it’s weights according to batch 2.

n_epochs <- 1e5L #10000
n_batches <- 2

We are keep a record of how the cost changes with each epoch:

# create empty vector to store the cost for each epoch:
store_historic_cost <- rep( as.numeric(NA), n_epochs )

Store the initial cost of the network on the training set with the initial random weights:

network_output <- generic_feedforward_ftn( X = training_X_standard, 
                                           Weight_matrices = nn_weights )
store_historic_cost[1] <- quadratic_cost_ftn( y = training_y, 
                                              y_hat = network_output$a[[paste0("a",n_layers)]] 
                                            )

store_historic_cost[1]

## [1] 146.8269

Now, we train the model on the training data for the number of epochs that we specified:

for( e in 1:n_epochs ){
  
  # split the data into n_batches:
  batch_indices <- sample( 1:n_batches, replace=TRUE, size = nrow(training_X_standard) ) 
  
  for( b in 1:n_batches ){      # for each batch
    
    X_b <- training_X_standard[ batch_indices==b, ]
    y_b <- training_y[ batch_indices==b ]
    
    network_out_this_batch <- generic_feedforward_ftn( X = X_b, Weight_matrices = nn_weights )
    
    get_gradients <- backpropagate_ftn( network_output = network_out_this_batch,
                                        Weight_matrices = nn_weights,
                                        y = y_b
                                      )
    
    nn_weights <- update_weights( current_weights = nn_weights, 
                                  dC_db = get_gradients$dC_db,
                                  dC_dW = get_gradients$dC_dW,
                                  learning_rate = 0.0001
                                )
  }
  
  network_output_this_epoch <- generic_feedforward_ftn( X = training_X_standard,
                                                        Weight_matrices = nn_weights )
  
  
  # store the cost at this point:
  store_historic_cost[e+1] <- quadratic_cost_ftn( 
    y = training_y, 
    y_hat = network_output_this_epoch$a[[paste0("a",n_layers)]]  
                                                 )
  if( e%%100 == 0)
  {
    print( paste0("epoch ", e, " done") )
    print( paste0( "cost: ", store_historic_cost[e+1] ) )
  }
  
}

## [1] "epoch 100 done"
## [1] "cost: 45.7752666682867"
## [1] "epoch 200 done"
## [1] "cost: 38.0380953373437"
## [1] "epoch 300 done"
## [1] "cost: 36.2502285586534"
## [1] "epoch 400 done"
## [1] "cost: 34.8967374764455"
## [1] "epoch 500 done"
## [1] "cost: 33.9345930852474"
## [1] "epoch 600 done"
## [1] "cost: 33.2487182566549"
## [1] "epoch 700 done"
## [1] "cost: 32.7005442444636"
## [1] "epoch 800 done"
## [1] "cost: 32.2355580546383"
## [1] "epoch 900 done"
## [1] "cost: 31.8393364600309"
## [1] "epoch 1000 done"
## [1] "cost: 31.5043975777668"
## [1] "epoch 1100 done"
## [1] "cost: 31.2228576499376"
## [1] "epoch 1200 done"
## [1] "cost: 30.9849600781439"
## [1] "epoch 1300 done"
## [1] "cost: 30.782523236858"
## [1] "epoch 1400 done"
## [1] "cost: 30.6089476148343"
## [1] "epoch 1500 done"
## [1] "cost: 30.4584278558616"
## [1] "epoch 1600 done"
## [1] "cost: 30.3261856272187"
## [1] "epoch 1700 done"
## [1] "cost: 30.2086922556804"
## [1] "epoch 1800 done"
## [1] "cost: 30.1025397435349"
## [1] "epoch 1900 done"
## [1] "cost: 30.0056724081183"
## [1] "epoch 2000 done"
## [1] "cost: 29.9160400872418"
## [1] "epoch 2100 done"
## [1] "cost: 29.8325284274627"
## [1] "epoch 2200 done"
## [1] "cost: 29.7538526677022"
## [1] "epoch 2300 done"
## [1] "cost: 29.6796492621507"
## [1] "epoch 2400 done"
## [1] "cost: 29.6089190003076"
## [1] "epoch 2500 done"
## [1] "cost: 29.541391099881"
## [1] "epoch 2600 done"
## [1] "cost: 29.4764926359361"
## [1] "epoch 2700 done"
## [1] "cost: 29.4142106552766"
## [1] "epoch 2800 done"
## [1] "cost: 29.3540484530305"
## [1] "epoch 2900 done"
## [1] "cost: 29.2959094182774"
## [1] "epoch 3000 done"
## [1] "cost: 29.2394449772553"
## [1] "epoch 3100 done"
## [1] "cost: 29.1846029256038"
## [1] "epoch 3200 done"
## [1] "cost: 29.1310662544089"
## [1] "epoch 3300 done"
## [1] "cost: 29.0788094602771"
## [1] "epoch 3400 done"
## [1] "cost: 29.0277462273877"
## [1] "epoch 3500 done"
## [1] "cost: 28.9776571593259"
## [1] "epoch 3600 done"
## [1] "cost: 28.9284862579419"
## [1] "epoch 3700 done"
## [1] "cost: 28.8801341031247"
## [1] "epoch 3800 done"
## [1] "cost: 28.8325458699454"
## [1] "epoch 3900 done"
## [1] "cost: 28.7856491945782"
## [1] "epoch 4000 done"
## [1] "cost: 28.7392427784764"
## [1] "epoch 4100 done"
## [1] "cost: 28.6934553519929"
## [1] "epoch 4200 done"
## [1] "cost: 28.6481636078975"
## [1] "epoch 4300 done"
## [1] "cost: 28.6032357701474"
## [1] "epoch 4400 done"
## [1] "cost: 28.558771103471"
## [1] "epoch 4500 done"
## [1] "cost: 28.5146615294799"
## [1] "epoch 4600 done"
## [1] "cost: 28.470880711606"
## [1] "epoch 4700 done"
## [1] "cost: 28.4274286748804"
## [1] "epoch 4800 done"
## [1] "cost: 28.3842196454681"
## [1] "epoch 4900 done"
## [1] "cost: 28.3413284462181"
## [1] "epoch 5000 done"
## [1] "cost: 28.2986224815217"
## [1] "epoch 5100 done"
## [1] "cost: 28.2561964369518"
## [1] "epoch 5200 done"
## [1] "cost: 28.2139057576401"
## [1] "epoch 5300 done"
## [1] "cost: 28.171797344664"
## [1] "epoch 5400 done"
## [1] "cost: 28.1298432325449"
## [1] "epoch 5500 done"
## [1] "cost: 28.0880018470453"
## [1] "epoch 5600 done"
## [1] "cost: 28.0462054542268"
## [1] "epoch 5700 done"
## [1] "cost: 28.0045923328995"
## [1] "epoch 5800 done"
## [1] "cost: 27.9630392253967"
## [1] "epoch 5900 done"
## [1] "cost: 27.9215458703023"
## [1] "epoch 6000 done"
## [1] "cost: 27.880095031974"
## [1] "epoch 6100 done"
## [1] "cost: 27.8386676261975"
## [1] "epoch 6200 done"
## [1] "cost: 27.7971282297805"
## [1] "epoch 6300 done"
## [1] "cost: 27.7555836412481"
## [1] "epoch 6400 done"
## [1] "cost: 27.7139868372423"
## [1] "epoch 6500 done"
## [1] "cost: 27.6723027600164"
## [1] "epoch 6600 done"
## [1] "cost: 27.630555798711"
## [1] "epoch 6700 done"
## [1] "cost: 27.5886161703308"
## [1] "epoch 6800 done"
## [1] "cost: 27.5465280644392"
## [1] "epoch 6900 done"
## [1] "cost: 27.5042697579398"
## [1] "epoch 7000 done"
## [1] "cost: 27.4617970168053"
## [1] "epoch 7100 done"
## [1] "cost: 27.4190687586579"
## [1] "epoch 7200 done"
## [1] "cost: 27.3760699203972"
## [1] "epoch 7300 done"
## [1] "cost: 27.3327661940215"
## [1] "epoch 7400 done"
## [1] "cost: 27.28917938925"
## [1] "epoch 7500 done"
## [1] "cost: 27.2451915327751"
## [1] "epoch 7600 done"
## [1] "cost: 27.2008491890736"
## [1] "epoch 7700 done"
## [1] "cost: 27.1561201128713"
## [1] "epoch 7800 done"
## [1] "cost: 27.1108912011131"
## [1] "epoch 7900 done"
## [1] "cost: 27.0651501388628"
## [1] "epoch 8000 done"
## [1] "cost: 27.0188923863461"
## [1] "epoch 8100 done"
## [1] "cost: 26.9720585437206"
## [1] "epoch 8200 done"
## [1] "cost: 26.9246177955095"
## [1] "epoch 8300 done"
## [1] "cost: 26.8765729735231"
## [1] "epoch 8400 done"
## [1] "cost: 26.8278163035483"
## [1] "epoch 8500 done"
## [1] "cost: 26.7783283890551"
## [1] "epoch 8600 done"
## [1] "cost: 26.7281056572505"
## [1] "epoch 8700 done"
## [1] "cost: 26.6770792860736"
## [1] "epoch 8800 done"
## [1] "cost: 26.6251281677695"
## [1] "epoch 8900 done"
## [1] "cost: 26.5723166919603"
## [1] "epoch 9000 done"
## [1] "cost: 26.5185620788156"
## [1] "epoch 9100 done"
## [1] "cost: 26.4638635298359"
## [1] "epoch 9200 done"
## [1] "cost: 26.4081095006522"
## [1] "epoch 9300 done"
## [1] "cost: 26.3512941677225"
## [1] "epoch 9400 done"
## [1] "cost: 26.2933099258467"
## [1] "epoch 9500 done"
## [1] "cost: 26.2341426042053"
## [1] "epoch 9600 done"
## [1] "cost: 26.173784626373"
## [1] "epoch 9700 done"
## [1] "cost: 26.1123716762374"
## [1] "epoch 9800 done"
## [1] "cost: 26.0494215553527"
## [1] "epoch 9900 done"
## [1] "cost: 25.9853352651696"
## [1] "epoch 10000 done"
## [1] "cost: 25.9199217300766"
## [1] "epoch 10100 done"
## [1] "cost: 25.8532165996077"
## [1] "epoch 10200 done"
## [1] "cost: 25.7852148121147"
## [1] "epoch 10300 done"
## [1] "cost: 25.715866602885"
## [1] "epoch 10400 done"
## [1] "cost: 25.645247261152"
## [1] "epoch 10500 done"
## [1] "cost: 25.5734651944091"
## [1] "epoch 10600 done"
## [1] "cost: 25.5003994445557"
## [1] "epoch 10700 done"
## [1] "cost: 25.4262970460416"
## [1] "epoch 10800 done"
## [1] "cost: 25.3511741074774"
## [1] "epoch 10900 done"
## [1] "cost: 25.2751308234918"
## [1] "epoch 11000 done"
## [1] "cost: 25.1982418000103"
## [1] "epoch 11100 done"
## [1] "cost: 25.1206603270272"
## [1] "epoch 11200 done"
## [1] "cost: 25.0424054018551"
## [1] "epoch 11300 done"
## [1] "cost: 24.9636888102441"
## [1] "epoch 11400 done"
## [1] "cost: 24.8843651426566"
## [1] "epoch 11500 done"
## [1] "cost: 24.8046878446828"
## [1] "epoch 11600 done"
## [1] "cost: 24.7248573644722"
## [1] "epoch 11700 done"
## [1] "cost: 24.6444624737914"
## [1] "epoch 11800 done"
## [1] "cost: 24.5639322640823"
## [1] "epoch 11900 done"
## [1] "cost: 24.4832450081963"
## [1] "epoch 12000 done"
## [1] "cost: 24.4023721848094"
## [1] "epoch 12100 done"
## [1] "cost: 24.3212732611617"
## [1] "epoch 12200 done"
## [1] "cost: 24.240019703623"
## [1] "epoch 12300 done"
## [1] "cost: 24.1586298100144"
## [1] "epoch 12400 done"
## [1] "cost: 24.077171144852"
## [1] "epoch 12500 done"
## [1] "cost: 23.9955027875705"
## [1] "epoch 12600 done"
## [1] "cost: 23.9135989086674"
## [1] "epoch 12700 done"
## [1] "cost: 23.8315282411613"
## [1] "epoch 12800 done"
## [1] "cost: 23.7489716369714"
## [1] "epoch 12900 done"
## [1] "cost: 23.6661302279796"
## [1] "epoch 13000 done"
## [1] "cost: 23.5831916085403"
## [1] "epoch 13100 done"
## [1] "cost: 23.4999045639264"
## [1] "epoch 13200 done"
## [1] "cost: 23.4164966733472"
## [1] "epoch 13300 done"
## [1] "cost: 23.3325035162937"
## [1] "epoch 13400 done"
## [1] "cost: 23.2482874643604"
## [1] "epoch 13500 done"
## [1] "cost: 23.1635901681541"
## [1] "epoch 13600 done"
## [1] "cost: 23.0785611452443"
## [1] "epoch 13700 done"
## [1] "cost: 22.9929547782829"
## [1] "epoch 13800 done"
## [1] "cost: 22.9069643379473"
## [1] "epoch 13900 done"
## [1] "cost: 22.8205645015691"
## [1] "epoch 14000 done"
## [1] "cost: 22.7336015281579"
## [1] "epoch 14100 done"
## [1] "cost: 22.6461171927036"
## [1] "epoch 14200 done"
## [1] "cost: 22.5583664296888"
## [1] "epoch 14300 done"
## [1] "cost: 22.4699924550255"
## [1] "epoch 14400 done"
## [1] "cost: 22.3811011081913"
## [1] "epoch 14500 done"
## [1] "cost: 22.2915151558554"
## [1] "epoch 14600 done"
## [1] "cost: 22.2013719699761"
## [1] "epoch 14700 done"
## [1] "cost: 22.1108184076398"
## [1] "epoch 14800 done"
## [1] "cost: 22.0196486654927"
## [1] "epoch 14900 done"
## [1] "cost: 21.9282401082039"
## [1] "epoch 15000 done"
## [1] "cost: 21.8358679126077"
## [1] "epoch 15100 done"
## [1] "cost: 21.7430870319547"
## [1] "epoch 15200 done"
## [1] "cost: 21.6497930805668"
## [1] "epoch 15300 done"
## [1] "cost: 21.5559095483612"
## [1] "epoch 15400 done"
## [1] "cost: 21.4613430095323"
## [1] "epoch 15500 done"
## [1] "cost: 21.3660773522129"
## [1] "epoch 15600 done"
## [1] "cost: 21.2704991588987"
## [1] "epoch 15700 done"
## [1] "cost: 21.1740978067024"
## [1] "epoch 15800 done"
## [1] "cost: 21.0774403997014"
## [1] "epoch 15900 done"
## [1] "cost: 20.9800508539193"
## [1] "epoch 16000 done"
## [1] "cost: 20.8822843957738"
## [1] "epoch 16100 done"
## [1] "cost: 20.7840534683148"
## [1] "epoch 16200 done"
## [1] "cost: 20.6852109490688"
## [1] "epoch 16300 done"
## [1] "cost: 20.586034469447"
## [1] "epoch 16400 done"
## [1] "cost: 20.48634941257"
## [1] "epoch 16500 done"
## [1] "cost: 20.3859885234685"
## [1] "epoch 16600 done"
## [1] "cost: 20.285383524727"
## [1] "epoch 16700 done"
## [1] "cost: 20.1841166954214"
## [1] "epoch 16800 done"
## [1] "cost: 20.0824157568636"
## [1] "epoch 16900 done"
## [1] "cost: 19.9808403790827"
## [1] "epoch 17000 done"
## [1] "cost: 19.8783578153613"
## [1] "epoch 17100 done"
## [1] "cost: 19.7761072707045"
## [1] "epoch 17200 done"
## [1] "cost: 19.6729317961062"
## [1] "epoch 17300 done"
## [1] "cost: 19.570186569493"
## [1] "epoch 17400 done"
## [1] "cost: 19.4665089986631"
## [1] "epoch 17500 done"
## [1] "cost: 19.3629022337124"
## [1] "epoch 17600 done"
## [1] "cost: 19.2589609810137"
## [1] "epoch 17700 done"
## [1] "cost: 19.1549351542074"
## [1] "epoch 17800 done"
## [1] "cost: 19.0507349598787"
## [1] "epoch 17900 done"
## [1] "cost: 18.9464481733313"
## [1] "epoch 18000 done"
## [1] "cost: 18.8419634736774"
## [1] "epoch 18100 done"
## [1] "cost: 18.738191797788"
## [1] "epoch 18200 done"
## [1] "cost: 18.6327224452723"
## [1] "epoch 18300 done"
## [1] "cost: 18.5282519456676"
## [1] "epoch 18400 done"
## [1] "cost: 18.4235572370409"
## [1] "epoch 18500 done"
## [1] "cost: 18.3187335035849"
## [1] "epoch 18600 done"
## [1] "cost: 18.2143627880848"
## [1] "epoch 18700 done"
## [1] "cost: 18.1100174721788"
## [1] "epoch 18800 done"
## [1] "cost: 18.0057482654394"
## [1] "epoch 18900 done"
## [1] "cost: 17.9015144583064"
## [1] "epoch 19000 done"
## [1] "cost: 17.7978469073045"
## [1] "epoch 19100 done"
## [1] "cost: 17.6939964521927"
## [1] "epoch 19200 done"
## [1] "cost: 17.5924148040423"
## [1] "epoch 19300 done"
## [1] "cost: 17.4867908537229"
## [1] "epoch 19400 done"
## [1] "cost: 17.3837998060578"
## [1] "epoch 19500 done"
## [1] "cost: 17.2812315146317"
## [1] "epoch 19600 done"
## [1] "cost: 17.1789384728073"
## [1] "epoch 19700 done"
## [1] "cost: 17.0778723151199"
## [1] "epoch 19800 done"
## [1] "cost: 16.9756983314349"
## [1] "epoch 19900 done"
## [1] "cost: 16.873510012521"
## [1] "epoch 20000 done"
## [1] "cost: 16.7766625532173"
## [1] "epoch 20100 done"
## [1] "cost: 16.6767253925928"
## [1] "epoch 20200 done"
## [1] "cost: 16.5740005103387"
## [1] "epoch 20300 done"
## [1] "cost: 16.472964585796"
## [1] "epoch 20400 done"
## [1] "cost: 16.374938699641"
## [1] "epoch 20500 done"
## [1] "cost: 16.2756416301558"
## [1] "epoch 20600 done"
## [1] "cost: 16.1779400057326"
## [1] "epoch 20700 done"
## [1] "cost: 16.0825502431338"
## [1] "epoch 20800 done"
## [1] "cost: 15.9877524934444"
## [1] "epoch 20900 done"
## [1] "cost: 15.8892272669081"
## [1] "epoch 21000 done"
## [1] "cost: 15.7972879713822"
## [1] "epoch 21100 done"
## [1] "cost: 15.7131031190906"
## [1] "epoch 21200 done"
## [1] "cost: 15.6096503435049"
## [1] "epoch 21300 done"
## [1] "cost: 15.5134707540613"
## [1] "epoch 21400 done"
## [1] "cost: 15.422240359602"
## [1] "epoch 21500 done"
## [1] "cost: 15.3303110031875"
## [1] "epoch 21600 done"
## [1] "cost: 15.2416798807339"
## [1] "epoch 21700 done"
## [1] "cost: 15.1508075205759"
## [1] "epoch 21800 done"
## [1] "cost: 15.0636767178921"
## [1] "epoch 21900 done"
## [1] "cost: 14.9737851188581"
## [1] "epoch 22000 done"
## [1] "cost: 14.8878503143118"
## [1] "epoch 22100 done"
## [1] "cost: 14.8060397633104"
## [1] "epoch 22200 done"
## [1] "cost: 14.717257064817"
## [1] "epoch 22300 done"
## [1] "cost: 14.6318768310469"
## [1] "epoch 22400 done"
## [1] "cost: 14.5495477481453"
## [1] "epoch 22500 done"
## [1] "cost: 14.4713560188788"
## [1] "epoch 22600 done"
## [1] "cost: 14.3858258620963"
## [1] "epoch 22700 done"
## [1] "cost: 14.3107195398078"
## [1] "epoch 22800 done"
## [1] "cost: 14.2269199772895"
## [1] "epoch 22900 done"
## [1] "cost: 14.1567364143103"
## [1] "epoch 23000 done"
## [1] "cost: 14.0781149015859"
## [1] "epoch 23100 done"
## [1] "cost: 13.9936077354019"
## [1] "epoch 23200 done"
## [1] "cost: 13.9176968424109"
## [1] "epoch 23300 done"
## [1] "cost: 13.8440798550015"
## [1] "epoch 23400 done"
## [1] "cost: 13.7738918340483"
## [1] "epoch 23500 done"
## [1] "cost: 13.6993270770899"
## [1] "epoch 23600 done"
## [1] "cost: 13.6345587860734"
## [1] "epoch 23700 done"
## [1] "cost: 13.5546135016202"
## [1] "epoch 23800 done"
## [1] "cost: 13.484207118783"
## [1] "epoch 23900 done"
## [1] "cost: 13.4296014285987"
## [1] "epoch 24000 done"
## [1] "cost: 13.3475326788764"
## [1] "epoch 24100 done"
## [1] "cost: 13.280333605753"
## [1] "epoch 24200 done"
## [1] "cost: 13.2148816309388"
## [1] "epoch 24300 done"
## [1] "cost: 13.1481095058234"
## [1] "epoch 24400 done"
## [1] "cost: 13.0914855825121"
## [1] "epoch 24500 done"
## [1] "cost: 13.0206279717778"
## [1] "epoch 24600 done"
## [1] "cost: 12.9558744976469"
## [1] "epoch 24700 done"
## [1] "cost: 12.9459452786937"
## [1] "epoch 24800 done"
## [1] "cost: 12.8331281618307"
## [1] "epoch 24900 done"
## [1] "cost: 12.772949849017"
## [1] "epoch 25000 done"
## [1] "cost: 12.7130607044632"
## [1] "epoch 25100 done"
## [1] "cost: 12.6579044031067"
## [1] "epoch 25200 done"
## [1] "cost: 12.5942405576499"
## [1] "epoch 25300 done"
## [1] "cost: 12.5459287617157"
## [1] "epoch 25400 done"
## [1] "cost: 12.4823066992029"
## [1] "epoch 25500 done"
## [1] "cost: 12.423960852435"
## [1] "epoch 25600 done"
## [1] "cost: 12.3691840225853"
## [1] "epoch 25700 done"
## [1] "cost: 12.3156502555957"
## [1] "epoch 25800 done"
## [1] "cost: 12.2782234387697"
## [1] "epoch 25900 done"
## [1] "cost: 12.2166933059885"
## [1] "epoch 26000 done"
## [1] "cost: 12.1539752088572"
## [1] "epoch 26100 done"
## [1] "cost: 12.102721027479"
## [1] "epoch 26200 done"
## [1] "cost: 12.0530146454776"
## [1] "epoch 26300 done"
## [1] "cost: 12.0024766130799"
## [1] "epoch 26400 done"
## [1] "cost: 11.9524405033958"
## [1] "epoch 26500 done"
## [1] "cost: 11.9013436068688"
## [1] "epoch 26600 done"
## [1] "cost: 11.8864625664264"
## [1] "epoch 26700 done"
## [1] "cost: 11.8119717404853"
## [1] "epoch 26800 done"
## [1] "cost: 11.7559422574536"
## [1] "epoch 26900 done"
## [1] "cost: 11.7148720922957"
## [1] "epoch 27000 done"
## [1] "cost: 11.6663913206569"
## [1] "epoch 27100 done"
## [1] "cost: 11.6227087670474"
## [1] "epoch 27200 done"
## [1] "cost: 11.5883375549974"
## [1] "epoch 27300 done"
## [1] "cost: 11.5288936109697"
## [1] "epoch 27400 done"
## [1] "cost: 11.4998553177942"
## [1] "epoch 27500 done"
## [1] "cost: 11.438399884062"
## [1] "epoch 27600 done"
## [1] "cost: 11.3953215960353"
## [1] "epoch 27700 done"
## [1] "cost: 11.3585540813837"
## [1] "epoch 27800 done"
## [1] "cost: 11.3255711305898"
## [1] "epoch 27900 done"
## [1] "cost: 11.2739957277139"
## [1] "epoch 28000 done"
## [1] "cost: 11.2346116827499"
## [1] "epoch 28100 done"
## [1] "cost: 11.1987116462526"
## [1] "epoch 28200 done"
## [1] "cost: 11.1470262995069"
## [1] "epoch 28300 done"
## [1] "cost: 11.109605881528"
## [1] "epoch 28400 done"
## [1] "cost: 11.0722592111044"
## [1] "epoch 28500 done"
## [1] "cost: 11.0367240049855"
## [1] "epoch 28600 done"
## [1] "cost: 10.9904015826972"
## [1] "epoch 28700 done"
## [1] "cost: 10.9550829961133"
## [1] "epoch 28800 done"
## [1] "cost: 10.9147757612712"
## [1] "epoch 28900 done"
## [1] "cost: 10.8880649502197"
## [1] "epoch 29000 done"
## [1] "cost: 10.8735224314657"
## [1] "epoch 29100 done"
## [1] "cost: 10.8145611277819"
## [1] "epoch 29200 done"
## [1] "cost: 10.7878343657928"
## [1] "epoch 29300 done"
## [1] "cost: 10.759989366547"
## [1] "epoch 29400 done"
## [1] "cost: 10.7291529486256"
## [1] "epoch 29500 done"
## [1] "cost: 10.6975828362237"
## [1] "epoch 29600 done"
## [1] "cost: 10.635975384876"
## [1] "epoch 29700 done"
## [1] "cost: 10.6073284683667"
## [1] "epoch 29800 done"
## [1] "cost: 10.5952597999884"
## [1] "epoch 29900 done"
## [1] "cost: 10.5547160583452"
## [1] "epoch 30000 done"
## [1] "cost: 10.500489505626"
## [1] "epoch 30100 done"
## [1] "cost: 10.4657015769907"
## [1] "epoch 30200 done"
## [1] "cost: 10.4359918218622"
## [1] "epoch 30300 done"
## [1] "cost: 10.4258989769045"
## [1] "epoch 30400 done"
## [1] "cost: 10.3716099753362"
## [1] "epoch 30500 done"
## [1] "cost: 10.3424809156163"
## [1] "epoch 30600 done"
## [1] "cost: 10.3407253905845"
## [1] "epoch 30700 done"
## [1] "cost: 10.3146816426325"
## [1] "epoch 30800 done"
## [1] "cost: 10.2509430731209"
## [1] "epoch 30900 done"
## [1] "cost: 10.2214883068759"
## [1] "epoch 31000 done"
## [1] "cost: 10.1958965729801"
## [1] "epoch 31100 done"
## [1] "cost: 10.1637920385921"
## [1] "epoch 31200 done"
## [1] "cost: 10.1360231548907"
## [1] "epoch 31300 done"
## [1] "cost: 10.1121739291032"
## [1] "epoch 31400 done"
## [1] "cost: 10.1073349095354"
## [1] "epoch 31500 done"
## [1] "cost: 10.0797613346921"
## [1] "epoch 31600 done"
## [1] "cost: 10.0263158721204"
## [1] "epoch 31700 done"
## [1] "cost: 10.0004752840538"
## [1] "epoch 31800 done"
## [1] "cost: 9.97229346319502"
## [1] "epoch 31900 done"
## [1] "cost: 9.94637488272482"
## [1] "epoch 32000 done"
## [1] "cost: 9.94833513116582"
## [1] "epoch 32100 done"
## [1] "cost: 9.89699893171868"
## [1] "epoch 32200 done"
## [1] "cost: 9.8757003332325"
## [1] "epoch 32300 done"
## [1] "cost: 9.85236828733781"
## [1] "epoch 32400 done"
## [1] "cost: 9.82234526009801"
## [1] "epoch 32500 done"
## [1] "cost: 9.79507656228214"
## [1] "epoch 32600 done"
## [1] "cost: 9.78686554874907"
## [1] "epoch 32700 done"
## [1] "cost: 9.74734477328864"
## [1] "epoch 32800 done"
## [1] "cost: 9.72309526008601"
## [1] "epoch 32900 done"
## [1] "cost: 9.70625588383421"
## [1] "epoch 33000 done"
## [1] "cost: 9.67784693076063"
## [1] "epoch 33100 done"
## [1] "cost: 9.66569697814206"
## [1] "epoch 33200 done"
## [1] "cost: 9.64097058256866"
## [1] "epoch 33300 done"
## [1] "cost: 9.60834679347342"
## [1] "epoch 33400 done"
## [1] "cost: 9.58763521427147"
## [1] "epoch 33500 done"
## [1] "cost: 9.56404352332519"
## [1] "epoch 33600 done"
## [1] "cost: 9.54348045805573"
## [1] "epoch 33700 done"
## [1] "cost: 9.5258685859541"
## [1] "epoch 33800 done"
## [1] "cost: 9.5679242699602"
## [1] "epoch 33900 done"
## [1] "cost: 9.49379628135381"
## [1] "epoch 34000 done"
## [1] "cost: 9.46077169825233"
## [1] "epoch 34100 done"
## [1] "cost: 9.45945110750871"
## [1] "epoch 34200 done"
## [1] "cost: 9.41806075385014"
## [1] "epoch 34300 done"
## [1] "cost: 9.40501012366733"
## [1] "epoch 34400 done"
## [1] "cost: 9.449320596025"
## [1] "epoch 34500 done"
## [1] "cost: 9.36186425180048"
## [1] "epoch 34600 done"
## [1] "cost: 9.41323817889064"
## [1] "epoch 34700 done"
## [1] "cost: 9.31757887179756"
## [1] "epoch 34800 done"
## [1] "cost: 9.3038966414407"
## [1] "epoch 34900 done"
## [1] "cost: 9.27840948554685"
## [1] "epoch 35000 done"
## [1] "cost: 9.31253461230479"
## [1] "epoch 35100 done"
## [1] "cost: 9.24315837277891"
## [1] "epoch 35200 done"
## [1] "cost: 9.25348803953304"
## [1] "epoch 35300 done"
## [1] "cost: 9.20284573025339"
## [1] "epoch 35400 done"
## [1] "cost: 9.18695294350252"
## [1] "epoch 35500 done"
## [1] "cost: 9.16865674087904"
## [1] "epoch 35600 done"
## [1] "cost: 9.1491584262596"
## [1] "epoch 35700 done"
## [1] "cost: 9.13749186827212"
## [1] "epoch 35800 done"
## [1] "cost: 9.18603873446357"
## [1] "epoch 35900 done"
## [1] "cost: 9.11927896345702"
## [1] "epoch 36000 done"
## [1] "cost: 9.14121783155486"
## [1] "epoch 36100 done"
## [1] "cost: 9.06612330283762"
## [1] "epoch 36200 done"
## [1] "cost: 9.04394224035023"
## [1] "epoch 36300 done"
## [1] "cost: 9.05861408070691"
## [1] "epoch 36400 done"
## [1] "cost: 9.01964590769215"
## [1] "epoch 36500 done"
## [1] "cost: 8.99553980623699"
## [1] "epoch 36600 done"
## [1] "cost: 8.97718827547407"
## [1] "epoch 36700 done"
## [1] "cost: 8.9807580558446"
## [1] "epoch 36800 done"
## [1] "cost: 9.01381966503855"
## [1] "epoch 36900 done"
## [1] "cost: 8.95255957000716"
## [1] "epoch 37000 done"
## [1] "cost: 8.96827083414673"
## [1] "epoch 37100 done"
## [1] "cost: 8.90572880772721"
## [1] "epoch 37200 done"
## [1] "cost: 8.88147429777368"
## [1] "epoch 37300 done"
## [1] "cost: 8.93281581422285"
## [1] "epoch 37400 done"
## [1] "cost: 8.85346936709732"
## [1] "epoch 37500 done"
## [1] "cost: 8.83745447456232"
## [1] "epoch 37600 done"
## [1] "cost: 8.82892346253156"
## [1] "epoch 37700 done"
## [1] "cost: 8.81469025742392"
## [1] "epoch 37800 done"
## [1] "cost: 8.8130611327604"
## [1] "epoch 37900 done"
## [1] "cost: 8.77768209323496"
## [1] "epoch 38000 done"
## [1] "cost: 8.81819624567332"
## [1] "epoch 38100 done"
## [1] "cost: 8.75410062235515"
## [1] "epoch 38200 done"
## [1] "cost: 8.7503557814217"
## [1] "epoch 38300 done"
## [1] "cost: 8.72376213900358"
## [1] "epoch 38400 done"
## [1] "cost: 8.71932088228204"
## [1] "epoch 38500 done"
## [1] "cost: 8.72141369327057"
## [1] "epoch 38600 done"
## [1] "cost: 8.7191014093548"
## [1] "epoch 38700 done"
## [1] "cost: 8.70901947310727"
## [1] "epoch 38800 done"
## [1] "cost: 8.66647429071209"
## [1] "epoch 38900 done"
## [1] "cost: 8.71492377940188"
## [1] "epoch 39000 done"
## [1] "cost: 8.63403005048926"
## [1] "epoch 39100 done"
## [1] "cost: 8.63018771421928"
## [1] "epoch 39200 done"
## [1] "cost: 8.67598462246468"
## [1] "epoch 39300 done"
## [1] "cost: 8.59573375368228"
## [1] "epoch 39400 done"
## [1] "cost: 8.5777714268957"
## [1] "epoch 39500 done"
## [1] "cost: 8.55502171286839"
## [1] "epoch 39600 done"
## [1] "cost: 8.55293178810039"
## [1] "epoch 39700 done"
## [1] "cost: 8.53717681362863"
## [1] "epoch 39800 done"
## [1] "cost: 8.5459364738197"
## [1] "epoch 39900 done"
## [1] "cost: 8.57542356143819"
## [1] "epoch 40000 done"
## [1] "cost: 8.49204308093791"
## [1] "epoch 40100 done"
## [1] "cost: 8.4892474077985"
## [1] "epoch 40200 done"
## [1] "cost: 8.4667838398331"
## [1] "epoch 40300 done"
## [1] "cost: 8.52929473434576"
## [1] "epoch 40400 done"
## [1] "cost: 8.47412009638519"
## [1] "epoch 40500 done"
## [1] "cost: 8.43246094055088"
## [1] "epoch 40600 done"
## [1] "cost: 8.42737827850511"
## [1] "epoch 40700 done"
## [1] "cost: 8.41948150651067"
## [1] "epoch 40800 done"
## [1] "cost: 8.39757016430407"
## [1] "epoch 40900 done"
## [1] "cost: 8.38429824298543"
## [1] "epoch 41000 done"
## [1] "cost: 8.38080165852355"
## [1] "epoch 41100 done"
## [1] "cost: 8.35987090976266"
## [1] "epoch 41200 done"
## [1] "cost: 8.36299948274062"
## [1] "epoch 41300 done"
## [1] "cost: 8.3405193987468"
## [1] "epoch 41400 done"
## [1] "cost: 8.32525554530602"
## [1] "epoch 41500 done"
## [1] "cost: 8.31590154089256"
## [1] "epoch 41600 done"
## [1] "cost: 8.39734746503863"
## [1] "epoch 41700 done"
## [1] "cost: 8.30461384979034"
## [1] "epoch 41800 done"
## [1] "cost: 8.31708789606688"
## [1] "epoch 41900 done"
## [1] "cost: 8.27560308140529"
## [1] "epoch 42000 done"
## [1] "cost: 8.27599730076414"
## [1] "epoch 42100 done"
## [1] "cost: 8.2832930204624"
## [1] "epoch 42200 done"
## [1] "cost: 8.24383402816342"
## [1] "epoch 42300 done"
## [1] "cost: 8.26122564470917"
## [1] "epoch 42400 done"
## [1] "cost: 8.24780249199519"
## [1] "epoch 42500 done"
## [1] "cost: 8.20805729689243"
## [1] "epoch 42600 done"
## [1] "cost: 8.20067303438138"
## [1] "epoch 42700 done"
## [1] "cost: 8.19543121625202"
## [1] "epoch 42800 done"
## [1] "cost: 8.23841894768342"
## [1] "epoch 42900 done"
## [1] "cost: 8.18606429102305"
## [1] "epoch 43000 done"
## [1] "cost: 8.19463313127834"
## [1] "epoch 43100 done"
## [1] "cost: 8.17193233507159"
## [1] "epoch 43200 done"
## [1] "cost: 8.17417234422117"
## [1] "epoch 43300 done"
## [1] "cost: 8.12606587023983"
## [1] "epoch 43400 done"
## [1] "cost: 8.12788829605094"
## [1] "epoch 43500 done"
## [1] "cost: 8.13708426533088"
## [1] "epoch 43600 done"
## [1] "cost: 8.09418259026926"
## [1] "epoch 43700 done"
## [1] "cost: 8.086571953505"
## [1] "epoch 43800 done"
## [1] "cost: 8.0954463920591"
## [1] "epoch 43900 done"
## [1] "cost: 8.06659080935641"
## [1] "epoch 44000 done"
## [1] "cost: 8.05838148120143"
## [1] "epoch 44100 done"
## [1] "cost: 8.07739950201442"
## [1] "epoch 44200 done"
## [1] "cost: 8.0422162486745"
## [1] "epoch 44300 done"
## [1] "cost: 8.04404613743387"
## [1] "epoch 44400 done"
## [1] "cost: 8.03462982443867"
## [1] "epoch 44500 done"
## [1] "cost: 8.03757676351441"
## [1] "epoch 44600 done"
## [1] "cost: 7.99941102361562"
## [1] "epoch 44700 done"
## [1] "cost: 7.99085124322628"
## [1] "epoch 44800 done"
## [1] "cost: 8.03651462474389"
## [1] "epoch 44900 done"
## [1] "cost: 7.9867841599937"
## [1] "epoch 45000 done"
## [1] "cost: 7.96349528771731"
## [1] "epoch 45100 done"
## [1] "cost: 7.97018658484361"
## [1] "epoch 45200 done"
## [1] "cost: 8.00497156914147"
## [1] "epoch 45300 done"
## [1] "cost: 7.94165090838826"
## [1] "epoch 45400 done"
## [1] "cost: 7.94507665265119"
## [1] "epoch 45500 done"
## [1] "cost: 7.94777434007059"
## [1] "epoch 45600 done"
## [1] "cost: 7.91407960956644"
## [1] "epoch 45700 done"
## [1] "cost: 7.90165058041067"
## [1] "epoch 45800 done"
## [1] "cost: 7.89319077094668"
## [1] "epoch 45900 done"
## [1] "cost: 7.89392543775224"
## [1] "epoch 46000 done"
## [1] "cost: 7.87937142793436"
## [1] "epoch 46100 done"
## [1] "cost: 7.90738326919676"
## [1] "epoch 46200 done"
## [1] "cost: 7.92815222365613"
## [1] "epoch 46300 done"
## [1] "cost: 7.87383529379073"
## [1] "epoch 46400 done"
## [1] "cost: 7.84266938595832"
## [1] "epoch 46500 done"
## [1] "cost: 7.83501212544912"
## [1] "epoch 46600 done"
## [1] "cost: 7.82674803246669"
## [1] "epoch 46700 done"
## [1] "cost: 7.89271055379785"
## [1] "epoch 46800 done"
## [1] "cost: 7.82509188139677"
## [1] "epoch 46900 done"
## [1] "cost: 7.81034762733322"
## [1] "epoch 47000 done"
## [1] "cost: 7.81632504372102"
## [1] "epoch 47100 done"
## [1] "cost: 7.85474589517546"
## [1] "epoch 47200 done"
## [1] "cost: 7.80328133345008"
## [1] "epoch 47300 done"
## [1] "cost: 7.77045938663433"
## [1] "epoch 47400 done"
## [1] "cost: 7.85038016554506"
## [1] "epoch 47500 done"
## [1] "cost: 7.76005162875749"
## [1] "epoch 47600 done"
## [1] "cost: 7.74779308830485"
## [1] "epoch 47700 done"
## [1] "cost: 7.83075212353507"
## [1] "epoch 47800 done"
## [1] "cost: 7.73148070373829"
## [1] "epoch 47900 done"
## [1] "cost: 7.76014968419386"
## [1] "epoch 48000 done"
## [1] "cost: 7.76378139196754"
## [1] "epoch 48100 done"
## [1] "cost: 7.71000761300999"
## [1] "epoch 48200 done"
## [1] "cost: 7.81853804426062"
## [1] "epoch 48300 done"
## [1] "cost: 7.69648363303691"
## [1] "epoch 48400 done"
## [1] "cost: 7.698586704567"
## [1] "epoch 48500 done"
## [1] "cost: 7.97908560676056"
## [1] "epoch 48600 done"
## [1] "cost: 7.67914920618623"
## [1] "epoch 48700 done"
## [1] "cost: 7.68238115265415"
## [1] "epoch 48800 done"
## [1] "cost: 7.6603594145208"
## [1] "epoch 48900 done"
## [1] "cost: 7.67739604419691"
## [1] "epoch 49000 done"
## [1] "cost: 7.76238826819405"
## [1] "epoch 49100 done"
## [1] "cost: 7.64257901702831"
## [1] "epoch 49200 done"
## [1] "cost: 7.65002012827482"
## [1] "epoch 49300 done"
## [1] "cost: 7.65260448597891"
## [1] "epoch 49400 done"
## [1] "cost: 7.61421410812505"
## [1] "epoch 49500 done"
## [1] "cost: 7.60984009555559"
## [1] "epoch 49600 done"
## [1] "cost: 7.60178664353518"
## [1] "epoch 49700 done"
## [1] "cost: 7.68420031199366"
## [1] "epoch 49800 done"
## [1] "cost: 7.58828005168221"
## [1] "epoch 49900 done"
## [1] "cost: 7.57911423052903"
## [1] "epoch 50000 done"
## [1] "cost: 7.63872520927128"
## [1] "epoch 50100 done"
## [1] "cost: 7.64420537298683"
## [1] "epoch 50200 done"
## [1] "cost: 7.5627119056654"
## [1] "epoch 50300 done"
## [1] "cost: 7.57271228397106"
## [1] "epoch 50400 done"
## [1] "cost: 7.56806176641722"
## [1] "epoch 50500 done"
## [1] "cost: 7.56264236694575"
## [1] "epoch 50600 done"
## [1] "cost: 7.56187036875792"
## [1] "epoch 50700 done"
## [1] "cost: 7.53628445506798"
## [1] "epoch 50800 done"
## [1] "cost: 7.52386729861511"
## [1] "epoch 50900 done"
## [1] "cost: 7.55281442033907"
## [1] "epoch 51000 done"
## [1] "cost: 7.52585779930332"
## [1] "epoch 51100 done"
## [1] "cost: 7.49985265741598"
## [1] "epoch 51200 done"
## [1] "cost: 7.49791369516001"
## [1] "epoch 51300 done"
## [1] "cost: 7.5398096946139"
## [1] "epoch 51400 done"
## [1] "cost: 7.51541158133583"
## [1] "epoch 51500 done"
## [1] "cost: 7.49592750523967"
## [1] "epoch 51600 done"
## [1] "cost: 7.46690543394571"
## [1] "epoch 51700 done"
## [1] "cost: 7.53334399433271"
## [1] "epoch 51800 done"
## [1] "cost: 7.45683361823605"
## [1] "epoch 51900 done"
## [1] "cost: 7.45179794431632"
## [1] "epoch 52000 done"
## [1] "cost: 7.45716892372785"
## [1] "epoch 52100 done"
## [1] "cost: 7.45978881817639"
## [1] "epoch 52200 done"
## [1] "cost: 7.47577546116741"
## [1] "epoch 52300 done"
## [1] "cost: 7.42857749301934"
## [1] "epoch 52400 done"
## [1] "cost: 7.42412144050529"
## [1] "epoch 52500 done"
## [1] "cost: 7.43914302682064"
## [1] "epoch 52600 done"
## [1] "cost: 7.42474381391391"
## [1] "epoch 52700 done"
## [1] "cost: 7.39813371154787"
## [1] "epoch 52800 done"
## [1] "cost: 7.41323103907023"
## [1] "epoch 52900 done"
## [1] "cost: 7.3861032141596"
## [1] "epoch 53000 done"
## [1] "cost: 7.38410579656393"
## [1] "epoch 53100 done"
## [1] "cost: 7.44521613757971"
## [1] "epoch 53200 done"
## [1] "cost: 7.51384548293362"
## [1] "epoch 53300 done"
## [1] "cost: 7.49913623300613"
## [1] "epoch 53400 done"
## [1] "cost: 7.4087583339833"
## [1] "epoch 53500 done"
## [1] "cost: 7.35421221809785"
## [1] "epoch 53600 done"
## [1] "cost: 7.34728651249139"
## [1] "epoch 53700 done"
## [1] "cost: 7.33903908136143"
## [1] "epoch 53800 done"
## [1] "cost: 7.36268116220458"
## [1] "epoch 53900 done"
## [1] "cost: 7.35604588826661"
## [1] "epoch 54000 done"
## [1] "cost: 7.33415887047256"
## [1] "epoch 54100 done"
## [1] "cost: 7.37431478865438"
## [1] "epoch 54200 done"
## [1] "cost: 7.32469982288782"
## [1] "epoch 54300 done"
## [1] "cost: 7.31108387716388"
## [1] "epoch 54400 done"
## [1] "cost: 7.30893949296408"
## [1] "epoch 54500 done"
## [1] "cost: 7.31829675100827"
## [1] "epoch 54600 done"
## [1] "cost: 7.31666456130124"
## [1] "epoch 54700 done"
## [1] "cost: 7.28610665916617"
## [1] "epoch 54800 done"
## [1] "cost: 7.37927888098667"
## [1] "epoch 54900 done"
## [1] "cost: 7.28942646391913"
## [1] "epoch 55000 done"
## [1] "cost: 7.26702884114331"
## [1] "epoch 55100 done"
## [1] "cost: 7.27161467390313"
## [1] "epoch 55200 done"
## [1] "cost: 7.25640990956007"
## [1] "epoch 55300 done"
## [1] "cost: 7.25089428874651"
## [1] "epoch 55400 done"
## [1] "cost: 7.31348852604962"
## [1] "epoch 55500 done"
## [1] "cost: 7.39045362364441"
## [1] "epoch 55600 done"
## [1] "cost: 7.30729778015093"
## [1] "epoch 55700 done"
## [1] "cost: 7.29159313032151"
## [1] "epoch 55800 done"
## [1] "cost: 7.2716957600623"
## [1] "epoch 55900 done"
## [1] "cost: 7.26609968392241"
## [1] "epoch 56000 done"
## [1] "cost: 7.22279310882074"
## [1] "epoch 56100 done"
## [1] "cost: 7.2279907070077"
## [1] "epoch 56200 done"
## [1] "cost: 7.22252575971484"
## [1] "epoch 56300 done"
## [1] "cost: 7.19906121312054"
## [1] "epoch 56400 done"
## [1] "cost: 7.2717174605253"
## [1] "epoch 56500 done"
## [1] "cost: 7.19799878283309"
## [1] "epoch 56600 done"
## [1] "cost: 7.2773068087939"
## [1] "epoch 56700 done"
## [1] "cost: 7.18227431096153"
## [1] "epoch 56800 done"
## [1] "cost: 7.1807935938109"
## [1] "epoch 56900 done"
## [1] "cost: 7.18888918247558"
## [1] "epoch 57000 done"
## [1] "cost: 7.17058267882532"
## [1] "epoch 57100 done"
## [1] "cost: 7.16440883006986"
## [1] "epoch 57200 done"
## [1] "cost: 7.19405514759157"
## [1] "epoch 57300 done"
## [1] "cost: 7.14443667437239"
## [1] "epoch 57400 done"
## [1] "cost: 7.15847756845158"
## [1] "epoch 57500 done"
## [1] "cost: 7.24902892233558"
## [1] "epoch 57600 done"
## [1] "cost: 7.13402071042411"
## [1] "epoch 57700 done"
## [1] "cost: 7.15994872834886"
## [1] "epoch 57800 done"
## [1] "cost: 7.14013070837383"
## [1] "epoch 57900 done"
## [1] "cost: 7.15313480297245"
## [1] "epoch 58000 done"
## [1] "cost: 7.11204217245002"
## [1] "epoch 58100 done"
## [1] "cost: 7.16389427386064"
## [1] "epoch 58200 done"
## [1] "cost: 7.12772389843685"
## [1] "epoch 58300 done"
## [1] "cost: 7.2070096750178"
## [1] "epoch 58400 done"
## [1] "cost: 7.09355245658725"
## [1] "epoch 58500 done"
## [1] "cost: 7.08543499408431"
## [1] "epoch 58600 done"
## [1] "cost: 7.08375371293069"
## [1] "epoch 58700 done"
## [1] "cost: 7.12358413876575"
## [1] "epoch 58800 done"
## [1] "cost: 7.11346372256997"
## [1] "epoch 58900 done"
## [1] "cost: 7.09176855369461"
## [1] "epoch 59000 done"
## [1] "cost: 7.06808327701602"
## [1] "epoch 59100 done"
## [1] "cost: 7.2346632464232"
## [1] "epoch 59200 done"
## [1] "cost: 7.06949255130755"
## [1] "epoch 59300 done"
## [1] "cost: 7.05744918783674"
## [1] "epoch 59400 done"
## [1] "cost: 7.05599814267365"
## [1] "epoch 59500 done"
## [1] "cost: 7.06295436803296"
## [1] "epoch 59600 done"
## [1] "cost: 7.06971966341353"
## [1] "epoch 59700 done"
## [1] "cost: 7.03738074897947"
## [1] "epoch 59800 done"
## [1] "cost: 7.02541830556747"
## [1] "epoch 59900 done"
## [1] "cost: 7.02050181000074"
## [1] "epoch 60000 done"
## [1] "cost: 7.05341680255979"
## [1] "epoch 60100 done"
## [1] "cost: 7.01224898390868"
## [1] "epoch 60200 done"
## [1] "cost: 7.09952810697144"
## [1] "epoch 60300 done"
## [1] "cost: 7.08437261088825"
## [1] "epoch 60400 done"
## [1] "cost: 6.99718968769506"
## [1] "epoch 60500 done"
## [1] "cost: 7.00157892364264"
## [1] "epoch 60600 done"
## [1] "cost: 6.99652218024779"
## [1] "epoch 60700 done"
## [1] "cost: 6.98312277852206"
## [1] "epoch 60800 done"
## [1] "cost: 7.16152841167175"
## [1] "epoch 60900 done"
## [1] "cost: 6.99381004145221"
## [1] "epoch 61000 done"
## [1] "cost: 6.97333817732363"
## [1] "epoch 61100 done"
## [1] "cost: 7.01999916761209"
## [1] "epoch 61200 done"
## [1] "cost: 6.96671146802409"
## [1] "epoch 61300 done"
## [1] "cost: 6.97539959682901"
## [1] "epoch 61400 done"
## [1] "cost: 7.01593312371022"
## [1] "epoch 61500 done"
## [1] "cost: 6.9518002405481"
## [1] "epoch 61600 done"
## [1] "cost: 6.97762483080327"
## [1] "epoch 61700 done"
## [1] "cost: 6.96634103802154"
## [1] "epoch 61800 done"
## [1] "cost: 6.9809510335248"
## [1] "epoch 61900 done"
## [1] "cost: 6.9497669136634"
## [1] "epoch 62000 done"
## [1] "cost: 6.92710558540891"
## [1] "epoch 62100 done"
## [1] "cost: 6.93993377287352"
## [1] "epoch 62200 done"
## [1] "cost: 6.92815744629657"
## [1] "epoch 62300 done"
## [1] "cost: 6.9677788859475"
## [1] "epoch 62400 done"
## [1] "cost: 7.14285691565906"
## [1] "epoch 62500 done"
## [1] "cost: 6.9054360422804"
## [1] "epoch 62600 done"
## [1] "cost: 6.90046495384073"
## [1] "epoch 62700 done"
## [1] "cost: 7.11473171240091"
## [1] "epoch 62800 done"
## [1] "cost: 6.89647143576944"
## [1] "epoch 62900 done"
## [1] "cost: 6.9031718277824"
## [1] "epoch 63000 done"
## [1] "cost: 6.88511830880721"
## [1] "epoch 63100 done"
## [1] "cost: 6.89235225014872"
## [1] "epoch 63200 done"
## [1] "cost: 6.92306214558957"
## [1] "epoch 63300 done"
## [1] "cost: 6.96964612117932"
## [1] "epoch 63400 done"
## [1] "cost: 6.92952934751795"
## [1] "epoch 63500 done"
## [1] "cost: 6.8639208127232"
## [1] "epoch 63600 done"
## [1] "cost: 6.8674540674515"
## [1] "epoch 63700 done"
## [1] "cost: 6.86251395074961"
## [1] "epoch 63800 done"
## [1] "cost: 6.96993939962033"
## [1] "epoch 63900 done"
## [1] "cost: 6.85369508308533"
## [1] "epoch 64000 done"
## [1] "cost: 6.87786273863721"
## [1] "epoch 64100 done"
## [1] "cost: 6.85084854036753"
## [1] "epoch 64200 done"
## [1] "cost: 6.83607035907481"
## [1] "epoch 64300 done"
## [1] "cost: 6.8323963999548"
## [1] "epoch 64400 done"
## [1] "cost: 6.86259215217822"
## [1] "epoch 64500 done"
## [1] "cost: 6.85117922953992"
## [1] "epoch 64600 done"
## [1] "cost: 6.86438712067567"
## [1] "epoch 64700 done"
## [1] "cost: 6.86378072086279"
## [1] "epoch 64800 done"
## [1] "cost: 6.81137778055687"
## [1] "epoch 64900 done"
## [1] "cost: 6.8084612687621"
## [1] "epoch 65000 done"
## [1] "cost: 6.80476100576917"
## [1] "epoch 65100 done"
## [1] "cost: 6.83464535756886"
## [1] "epoch 65200 done"
## [1] "cost: 6.86082687416468"
## [1] "epoch 65300 done"
## [1] "cost: 6.79155790431956"
## [1] "epoch 65400 done"
## [1] "cost: 6.78796599735846"
## [1] "epoch 65500 done"
## [1] "cost: 6.82401626716977"
## [1] "epoch 65600 done"
## [1] "cost: 6.79086406637107"
## [1] "epoch 65700 done"
## [1] "cost: 6.77761085083345"
## [1] "epoch 65800 done"
## [1] "cost: 6.77598030265316"
## [1] "epoch 65900 done"
## [1] "cost: 6.83018283417502"
## [1] "epoch 66000 done"
## [1] "cost: 6.78138933916326"
## [1] "epoch 66100 done"
## [1] "cost: 6.8807592920481"
## [1] "epoch 66200 done"
## [1] "cost: 6.76135910376758"
## [1] "epoch 66300 done"
## [1] "cost: 6.76372199268703"
## [1] "epoch 66400 done"
## [1] "cost: 7.08590381969797"
## [1] "epoch 66500 done"
## [1] "cost: 6.84949475156075"
## [1] "epoch 66600 done"
## [1] "cost: 6.9343035328287"
## [1] "epoch 66700 done"
## [1] "cost: 6.73923641826705"
## [1] "epoch 66800 done"
## [1] "cost: 6.74613162121849"
## [1] "epoch 66900 done"
## [1] "cost: 6.75634191554167"
## [1] "epoch 67000 done"
## [1] "cost: 6.73511149905906"
## [1] "epoch 67100 done"
## [1] "cost: 6.72349890312773"
## [1] "epoch 67200 done"
## [1] "cost: 6.72894667150168"
## [1] "epoch 67300 done"
## [1] "cost: 6.77226946894128"
## [1] "epoch 67400 done"
## [1] "cost: 6.786518789564"
## [1] "epoch 67500 done"
## [1] "cost: 6.71161178682885"
## [1] "epoch 67600 done"
## [1] "cost: 6.77543099987452"
## [1] "epoch 67700 done"
## [1] "cost: 6.70920986937588"
## [1] "epoch 67800 done"
## [1] "cost: 6.71233133317396"
## [1] "epoch 67900 done"
## [1] "cost: 6.73586735964391"
## [1] "epoch 68000 done"
## [1] "cost: 6.70272348070784"
## [1] "epoch 68100 done"
## [1] "cost: 6.92488153098987"
## [1] "epoch 68200 done"
## [1] "cost: 6.68743441228457"
## [1] "epoch 68300 done"
## [1] "cost: 6.7203785437089"
## [1] "epoch 68400 done"
## [1] "cost: 6.6827019474973"
## [1] "epoch 68500 done"
## [1] "cost: 6.70439829264421"
## [1] "epoch 68600 done"
## [1] "cost: 6.67089647137264"
## [1] "epoch 68700 done"
## [1] "cost: 6.74994040736709"
## [1] "epoch 68800 done"
## [1] "cost: 6.7594357175757"
## [1] "epoch 68900 done"
## [1] "cost: 6.80818012335775"
## [1] "epoch 69000 done"
## [1] "cost: 6.67054533477455"
## [1] "epoch 69100 done"
## [1] "cost: 6.65307905777323"
## [1] "epoch 69200 done"
## [1] "cost: 6.65292203586568"
## [1] "epoch 69300 done"
## [1] "cost: 6.65697412552425"
## [1] "epoch 69400 done"
## [1] "cost: 6.67968708713417"
## [1] "epoch 69500 done"
## [1] "cost: 6.64713222591515"
## [1] "epoch 69600 done"
## [1] "cost: 6.7352320671757"
## [1] "epoch 69700 done"
## [1] "cost: 6.65212748719035"
## [1] "epoch 69800 done"
## [1] "cost: 6.63651997176345"
## [1] "epoch 69900 done"
## [1] "cost: 6.66103958072549"
## [1] "epoch 70000 done"
## [1] "cost: 6.79707536114023"
## [1] "epoch 70100 done"
## [1] "cost: 6.76912256185342"
## [1] "epoch 70200 done"
## [1] "cost: 6.64425290216772"
## [1] "epoch 70300 done"
## [1] "cost: 6.61035234605794"
## [1] "epoch 70400 done"
## [1] "cost: 6.61760970057431"
## [1] "epoch 70500 done"
## [1] "cost: 6.67823962873691"
## [1] "epoch 70600 done"
## [1] "cost: 6.61311232397745"
## [1] "epoch 70700 done"
## [1] "cost: 6.59842375539555"
## [1] "epoch 70800 done"
## [1] "cost: 6.61495165884132"
## [1] "epoch 70900 done"
## [1] "cost: 6.62787065060715"
## [1] "epoch 71000 done"
## [1] "cost: 6.58971562159728"
## [1] "epoch 71100 done"
## [1] "cost: 6.79070996587188"
## [1] "epoch 71200 done"
## [1] "cost: 6.59906790671651"
## [1] "epoch 71300 done"
## [1] "cost: 6.58799701936293"
## [1] "epoch 71400 done"
## [1] "cost: 6.58588172098727"
## [1] "epoch 71500 done"
## [1] "cost: 6.59988616273991"
## [1] "epoch 71600 done"
## [1] "cost: 6.68615317584396"
## [1] "epoch 71700 done"
## [1] "cost: 6.57807908817349"
## [1] "epoch 71800 done"
## [1] "cost: 6.57375383860587"
## [1] "epoch 71900 done"
## [1] "cost: 6.55779883967945"
## [1] "epoch 72000 done"
## [1] "cost: 6.57047693490209"
## [1] "epoch 72100 done"
## [1] "cost: 6.58086667598427"
## [1] "epoch 72200 done"
## [1] "cost: 6.61982003131768"
## [1] "epoch 72300 done"
## [1] "cost: 6.57791218504744"
## [1] "epoch 72400 done"
## [1] "cost: 6.54362620393902"
## [1] "epoch 72500 done"
## [1] "cost: 6.59819033392564"
## [1] "epoch 72600 done"
## [1] "cost: 6.53548365420404"
## [1] "epoch 72700 done"
## [1] "cost: 6.53168830838412"
## [1] "epoch 72800 done"
## [1] "cost: 6.56657605448561"
## [1] "epoch 72900 done"
## [1] "cost: 6.62153284710505"
## [1] "epoch 73000 done"
## [1] "cost: 6.52268822085042"
## [1] "epoch 73100 done"
## [1] "cost: 6.54224318220711"
## [1] "epoch 73200 done"
## [1] "cost: 6.52040963464495"
## [1] "epoch 73300 done"
## [1] "cost: 6.65928618528423"
## [1] "epoch 73400 done"
## [1] "cost: 6.52315052084622"
## [1] "epoch 73500 done"
## [1] "cost: 6.53746610129809"
## [1] "epoch 73600 done"
## [1] "cost: 6.52060853187767"
## [1] "epoch 73700 done"
## [1] "cost: 6.62994556212611"
## [1] "epoch 73800 done"
## [1] "cost: 6.49668074385732"
## [1] "epoch 73900 done"
## [1] "cost: 6.96794480831902"
## [1] "epoch 74000 done"
## [1] "cost: 6.50403899346448"
## [1] "epoch 74100 done"
## [1] "cost: 6.5064530508121"
## [1] "epoch 74200 done"
## [1] "cost: 6.48615063646974"
## [1] "epoch 74300 done"
## [1] "cost: 6.49521587541001"
## [1] "epoch 74400 done"
## [1] "cost: 6.47808243469794"
## [1] "epoch 74500 done"
## [1] "cost: 6.50630777804688"
## [1] "epoch 74600 done"
## [1] "cost: 6.5954102040012"
## [1] "epoch 74700 done"
## [1] "cost: 6.49241531757746"
## [1] "epoch 74800 done"
## [1] "cost: 6.47592412490074"
## [1] "epoch 74900 done"
## [1] "cost: 6.4728479677135"
## [1] "epoch 75000 done"
## [1] "cost: 6.48887051342758"
## [1] "epoch 75100 done"
## [1] "cost: 6.46788909586737"
## [1] "epoch 75200 done"
## [1] "cost: 6.45697166258796"
## [1] "epoch 75300 done"
## [1] "cost: 6.54328094607128"
## [1] "epoch 75400 done"
## [1] "cost: 6.44964517029464"
## [1] "epoch 75500 done"
## [1] "cost: 6.55973004965784"
## [1] "epoch 75600 done"
## [1] "cost: 6.46532460748227"
## [1] "epoch 75700 done"
## [1] "cost: 6.63252939276188"
## [1] "epoch 75800 done"
## [1] "cost: 6.62716936654047"
## [1] "epoch 75900 done"
## [1] "cost: 6.55635906194133"
## [1] "epoch 76000 done"
## [1] "cost: 6.43748871188551"
## [1] "epoch 76100 done"
## [1] "cost: 6.42794829692432"
## [1] "epoch 76200 done"
## [1] "cost: 6.45903971435192"
## [1] "epoch 76300 done"
## [1] "cost: 6.42126613050296"
## [1] "epoch 76400 done"
## [1] "cost: 6.43306500185809"
## [1] "epoch 76500 done"
## [1] "cost: 6.65347401501778"
## [1] "epoch 76600 done"
## [1] "cost: 6.43983938321904"
## [1] "epoch 76700 done"
## [1] "cost: 6.45902127650474"
## [1] "epoch 76800 done"
## [1] "cost: 6.50805465808419"
## [1] "epoch 76900 done"
## [1] "cost: 6.4585497976079"
## [1] "epoch 77000 done"
## [1] "cost: 6.40384232549049"
## [1] "epoch 77100 done"
## [1] "cost: 6.47880864353474"
## [1] "epoch 77200 done"
## [1] "cost: 6.42274199761"
## [1] "epoch 77300 done"
## [1] "cost: 6.47207709008177"
## [1] "epoch 77400 done"
## [1] "cost: 6.40117468763079"
## [1] "epoch 77500 done"
## [1] "cost: 6.79466868947494"
## [1] "epoch 77600 done"
## [1] "cost: 6.42091779492108"
## [1] "epoch 77700 done"
## [1] "cost: 6.54475735208808"
## [1] "epoch 77800 done"
## [1] "cost: 6.39622626565829"
## [1] "epoch 77900 done"
## [1] "cost: 6.44602505721904"
## [1] "epoch 78000 done"
## [1] "cost: 6.3888816334701"
## [1] "epoch 78100 done"
## [1] "cost: 6.36836579071027"
## [1] "epoch 78200 done"
## [1] "cost: 6.3666645322037"
## [1] "epoch 78300 done"
## [1] "cost: 6.4037986491538"
## [1] "epoch 78400 done"
## [1] "cost: 6.38137009809106"
## [1] "epoch 78500 done"
## [1] "cost: 6.37626054231149"
## [1] "epoch 78600 done"
## [1] "cost: 6.36955104258548"
## [1] "epoch 78700 done"
## [1] "cost: 6.3614504939421"
## [1] "epoch 78800 done"
## [1] "cost: 6.39477330058387"
## [1] "epoch 78900 done"
## [1] "cost: 6.48468415550383"
## [1] "epoch 79000 done"
## [1] "cost: 6.34318287947419"
## [1] "epoch 79100 done"
## [1] "cost: 6.43733533578151"
## [1] "epoch 79200 done"
## [1] "cost: 6.3395697059869"
## [1] "epoch 79300 done"
## [1] "cost: 6.37187597794637"
## [1] "epoch 79400 done"
## [1] "cost: 6.33609448662466"
## [1] "epoch 79500 done"
## [1] "cost: 6.32938290847943"
## [1] "epoch 79600 done"
## [1] "cost: 6.40715689347769"
## [1] "epoch 79700 done"
## [1] "cost: 6.36233768175199"
## [1] "epoch 79800 done"
## [1] "cost: 6.40015125891188"
## [1] "epoch 79900 done"
## [1] "cost: 6.34027906196311"
## [1] "epoch 80000 done"
## [1] "cost: 6.34529432669755"
## [1] "epoch 80100 done"
## [1] "cost: 6.41175465843633"
## [1] "epoch 80200 done"
## [1] "cost: 6.32631872342341"
## [1] "epoch 80300 done"
## [1] "cost: 6.30709665491628"
## [1] "epoch 80400 done"
## [1] "cost: 6.34962723022115"
## [1] "epoch 80500 done"
## [1] "cost: 6.3614877359446"
## [1] "epoch 80600 done"
## [1] "cost: 6.63725949300595"
## [1] "epoch 80700 done"
## [1] "cost: 6.297521467011"
## [1] "epoch 80800 done"
## [1] "cost: 6.39109322639761"
## [1] "epoch 80900 done"
## [1] "cost: 6.29267576763665"
## [1] "epoch 81000 done"
## [1] "cost: 6.29876136719163"
## [1] "epoch 81100 done"
## [1] "cost: 6.33379334240216"
## [1] "epoch 81200 done"
## [1] "cost: 6.30452088653284"
## [1] "epoch 81300 done"
## [1] "cost: 6.28012264589698"
## [1] "epoch 81400 done"
## [1] "cost: 6.30460569385614"
## [1] "epoch 81500 done"
## [1] "cost: 6.27483266802692"
## [1] "epoch 81600 done"
## [1] "cost: 6.32897683346774"
## [1] "epoch 81700 done"
## [1] "cost: 6.27229952597585"
## [1] "epoch 81800 done"
## [1] "cost: 6.29312493305983"
## [1] "epoch 81900 done"
## [1] "cost: 6.32942520190853"
## [1] "epoch 82000 done"
## [1] "cost: 6.28560430962022"
## [1] "epoch 82100 done"
## [1] "cost: 6.29220043392653"
## [1] "epoch 82200 done"
## [1] "cost: 6.26137538302202"
## [1] "epoch 82300 done"
## [1] "cost: 6.2535212682813"
## [1] "epoch 82400 done"
## [1] "cost: 6.25483548052863"
## [1] "epoch 82500 done"
## [1] "cost: 6.29890375088122"
## [1] "epoch 82600 done"
## [1] "cost: 6.2984052927508"
## [1] "epoch 82700 done"
## [1] "cost: 6.25722187728436"
## [1] "epoch 82800 done"
## [1] "cost: 6.3187001923064"
## [1] "epoch 82900 done"
## [1] "cost: 6.34447060920883"
## [1] "epoch 83000 done"
## [1] "cost: 6.23718891283533"
## [1] "epoch 83100 done"
## [1] "cost: 6.23340950415724"
## [1] "epoch 83200 done"
## [1] "cost: 6.24691905639909"
## [1] "epoch 83300 done"
## [1] "cost: 6.24259299937238"
## [1] "epoch 83400 done"
## [1] "cost: 6.25485660342053"
## [1] "epoch 83500 done"
## [1] "cost: 6.23032867810955"
## [1] "epoch 83600 done"
## [1] "cost: 6.21971830366784"
## [1] "epoch 83700 done"
## [1] "cost: 6.21851375388537"
## [1] "epoch 83800 done"
## [1] "cost: 6.26916986535483"
## [1] "epoch 83900 done"
## [1] "cost: 6.21403220042344"
## [1] "epoch 84000 done"
## [1] "cost: 6.24743740453012"
## [1] "epoch 84100 done"
## [1] "cost: 6.2497255882092"
## [1] "epoch 84200 done"
## [1] "cost: 6.24548347116929"
## [1] "epoch 84300 done"
## [1] "cost: 6.20466290956049"
## [1] "epoch 84400 done"
## [1] "cost: 6.22674291289844"
## [1] "epoch 84500 done"
## [1] "cost: 6.20283773718148"
## [1] "epoch 84600 done"
## [1] "cost: 6.21334024544229"
## [1] "epoch 84700 done"
## [1] "cost: 6.19301136938893"
## [1] "epoch 84800 done"
## [1] "cost: 6.24565157293182"
## [1] "epoch 84900 done"
## [1] "cost: 6.21651605785211"
## [1] "epoch 85000 done"
## [1] "cost: 6.2524188887352"
## [1] "epoch 85100 done"
## [1] "cost: 6.37317866352009"
## [1] "epoch 85200 done"
## [1] "cost: 6.18702974204391"
## [1] "epoch 85300 done"
## [1] "cost: 6.27465267172544"
## [1] "epoch 85400 done"
## [1] "cost: 6.17487416118181"
## [1] "epoch 85500 done"
## [1] "cost: 6.18191222919295"
## [1] "epoch 85600 done"
## [1] "cost: 6.18214776529931"
## [1] "epoch 85700 done"
## [1] "cost: 6.25886811622856"
## [1] "epoch 85800 done"
## [1] "cost: 6.37097577208827"
## [1] "epoch 85900 done"
## [1] "cost: 6.34696383962389"
## [1] "epoch 86000 done"
## [1] "cost: 6.16596785754396"
## [1] "epoch 86100 done"
## [1] "cost: 6.21715734960611"
## [1] "epoch 86200 done"
## [1] "cost: 6.31861028884112"
## [1] "epoch 86300 done"
## [1] "cost: 6.16741085617836"
## [1] "epoch 86400 done"
## [1] "cost: 6.18796745058946"
## [1] "epoch 86500 done"
## [1] "cost: 6.15067064606539"
## [1] "epoch 86600 done"
## [1] "cost: 6.14773644074847"
## [1] "epoch 86700 done"
## [1] "cost: 6.16307733273738"
## [1] "epoch 86800 done"
## [1] "cost: 6.14252689448013"
## [1] "epoch 86900 done"
## [1] "cost: 6.16230958871203"
## [1] "epoch 87000 done"
## [1] "cost: 6.13744648699042"
## [1] "epoch 87100 done"
## [1] "cost: 6.16716736969769"
## [1] "epoch 87200 done"
## [1] "cost: 6.13126547325625"
## [1] "epoch 87300 done"
## [1] "cost: 6.26723552037772"
## [1] "epoch 87400 done"
## [1] "cost: 6.14119188023718"
## [1] "epoch 87500 done"
## [1] "cost: 6.21318395147624"
## [1] "epoch 87600 done"
## [1] "cost: 6.12029450741449"
## [1] "epoch 87700 done"
## [1] "cost: 6.11792535069868"
## [1] "epoch 87800 done"
## [1] "cost: 6.12904710252124"
## [1] "epoch 87900 done"
## [1] "cost: 6.11815055515365"
## [1] "epoch 88000 done"
## [1] "cost: 6.21566042155251"
## [1] "epoch 88100 done"
## [1] "cost: 6.12727269788496"
## [1] "epoch 88200 done"
## [1] "cost: 6.10665096095211"
## [1] "epoch 88300 done"
## [1] "cost: 6.15765024967128"
## [1] "epoch 88400 done"
## [1] "cost: 6.46906951981039"
## [1] "epoch 88500 done"
## [1] "cost: 6.09986308885366"
## [1] "epoch 88600 done"
## [1] "cost: 6.14280306082967"
## [1] "epoch 88700 done"
## [1] "cost: 6.09414825888777"
## [1] "epoch 88800 done"
## [1] "cost: 6.09290975395915"
## [1] "epoch 88900 done"
## [1] "cost: 6.09918575795744"
## [1] "epoch 89000 done"
## [1] "cost: 6.15541047687101"
## [1] "epoch 89100 done"
## [1] "cost: 6.09506743737205"
## [1] "epoch 89200 done"
## [1] "cost: 6.09255489239849"
## [1] "epoch 89300 done"
## [1] "cost: 6.08022163231187"
## [1] "epoch 89400 done"
## [1] "cost: 6.13131352450633"
## [1] "epoch 89500 done"
## [1] "cost: 6.23325947944722"
## [1] "epoch 89600 done"
## [1] "cost: 6.09076867987644"
## [1] "epoch 89700 done"
## [1] "cost: 6.11646311941465"
## [1] "epoch 89800 done"
## [1] "cost: 6.25886620508246"
## [1] "epoch 89900 done"
## [1] "cost: 6.14294724567255"
## [1] "epoch 90000 done"
## [1] "cost: 6.07426464013664"
## [1] "epoch 90100 done"
## [1] "cost: 6.06277395104518"
## [1] "epoch 90200 done"
## [1] "cost: 6.0776227830469"
## [1] "epoch 90300 done"
## [1] "cost: 6.06000835457375"
## [1] "epoch 90400 done"
## [1] "cost: 6.44602793679987"
## [1] "epoch 90500 done"
## [1] "cost: 6.30126810865745"
## [1] "epoch 90600 done"
## [1] "cost: 6.05452423642485"
## [1] "epoch 90700 done"
## [1] "cost: 6.06293373319169"
## [1] "epoch 90800 done"
## [1] "cost: 6.04584985206605"
## [1] "epoch 90900 done"
## [1] "cost: 6.07050407571557"
## [1] "epoch 91000 done"
## [1] "cost: 6.1143475478914"
## [1] "epoch 91100 done"
## [1] "cost: 6.17745828416969"
## [1] "epoch 91200 done"
## [1] "cost: 6.05634450714472"
## [1] "epoch 91300 done"
## [1] "cost: 6.03912579931086"
## [1] "epoch 91400 done"
## [1] "cost: 6.03860638745278"
## [1] "epoch 91500 done"
## [1] "cost: 6.04296215500184"
## [1] "epoch 91600 done"
## [1] "cost: 6.04578327959753"
## [1] "epoch 91700 done"
## [1] "cost: 6.06516907311063"
## [1] "epoch 91800 done"
## [1] "cost: 6.18837487952026"
## [1] "epoch 91900 done"
## [1] "cost: 6.13700774433911"
## [1] "epoch 92000 done"
## [1] "cost: 6.02156643728028"
## [1] "epoch 92100 done"
## [1] "cost: 6.01622780849093"
## [1] "epoch 92200 done"
## [1] "cost: 6.01561227318683"
## [1] "epoch 92300 done"
## [1] "cost: 6.01647580028013"
## [1] "epoch 92400 done"
## [1] "cost: 6.0097622494829"
## [1] "epoch 92500 done"
## [1] "cost: 6.00782774938876"
## [1] "epoch 92600 done"
## [1] "cost: 6.06936151731368"
## [1] "epoch 92700 done"
## [1] "cost: 6.03888361197648"
## [1] "epoch 92800 done"
## [1] "cost: 6.09668152201371"
## [1] "epoch 92900 done"
## [1] "cost: 6.11490093431333"
## [1] "epoch 93000 done"
## [1] "cost: 6.13278350145317"
## [1] "epoch 93100 done"
## [1] "cost: 6.13200472153523"
## [1] "epoch 93200 done"
## [1] "cost: 5.99194323358457"
## [1] "epoch 93300 done"
## [1] "cost: 5.99272629703461"
## [1] "epoch 93400 done"
## [1] "cost: 6.0497825251164"
## [1] "epoch 93500 done"
## [1] "cost: 6.03742745220484"
## [1] "epoch 93600 done"
## [1] "cost: 5.9839632813512"
## [1] "epoch 93700 done"
## [1] "cost: 6.03452408517129"
## [1] "epoch 93800 done"
## [1] "cost: 6.16134100612668"
## [1] "epoch 93900 done"
## [1] "cost: 6.05869633783227"
## [1] "epoch 94000 done"
## [1] "cost: 5.97443574937121"
## [1] "epoch 94100 done"
## [1] "cost: 6.03925084477593"
## [1] "epoch 94200 done"
## [1] "cost: 5.97101613106883"
## [1] "epoch 94300 done"
## [1] "cost: 5.99191720666583"
## [1] "epoch 94400 done"
## [1] "cost: 5.96576496413207"
## [1] "epoch 94500 done"
## [1] "cost: 5.96953141586829"
## [1] "epoch 94600 done"
## [1] "cost: 5.96147309005696"
## [1] "epoch 94700 done"
## [1] "cost: 5.966299396202"
## [1] "epoch 94800 done"
## [1] "cost: 6.03546446291225"
## [1] "epoch 94900 done"
## [1] "cost: 5.95776092151098"
## [1] "epoch 95000 done"
## [1] "cost: 6.118702507391"
## [1] "epoch 95100 done"
## [1] "cost: 5.97537951352356"
## [1] "epoch 95200 done"
## [1] "cost: 6.06517295395872"
## [1] "epoch 95300 done"
## [1] "cost: 5.98746414163864"
## [1] "epoch 95400 done"
## [1] "cost: 5.97003522321439"
## [1] "epoch 95500 done"
## [1] "cost: 6.21586712004423"
## [1] "epoch 95600 done"
## [1] "cost: 5.96252517507556"
## [1] "epoch 95700 done"
## [1] "cost: 6.01283349486455"
## [1] "epoch 95800 done"
## [1] "cost: 5.9644288715044"
## [1] "epoch 95900 done"
## [1] "cost: 5.93827409724664"
## [1] "epoch 96000 done"
## [1] "cost: 6.14018738399642"
## [1] "epoch 96100 done"
## [1] "cost: 5.95179891167515"
## [1] "epoch 96200 done"
## [1] "cost: 6.11894365268492"
## [1] "epoch 96300 done"
## [1] "cost: 6.01621917912935"
## [1] "epoch 96400 done"
## [1] "cost: 5.92963087833745"
## [1] "epoch 96500 done"
## [1] "cost: 5.9288476287396"
## [1] "epoch 96600 done"
## [1] "cost: 5.99600725121104"
## [1] "epoch 96700 done"
## [1] "cost: 5.93385543853806"
## [1] "epoch 96800 done"
## [1] "cost: 5.98547548624564"
## [1] "epoch 96900 done"
## [1] "cost: 5.92734043606831"
## [1] "epoch 97000 done"
## [1] "cost: 6.00184520541022"
## [1] "epoch 97100 done"
## [1] "cost: 5.98121511972427"
## [1] "epoch 97200 done"
## [1] "cost: 6.12699250508058"
## [1] "epoch 97300 done"
## [1] "cost: 6.04738380338944"
## [1] "epoch 97400 done"
## [1] "cost: 6.1411771447277"
## [1] "epoch 97500 done"
## [1] "cost: 5.91018577057017"
## [1] "epoch 97600 done"
## [1] "cost: 5.90315013383567"
## [1] "epoch 97700 done"
## [1] "cost: 5.93677976283108"
## [1] "epoch 97800 done"
## [1] "cost: 6.03365841214874"
## [1] "epoch 97900 done"
## [1] "cost: 5.94630019420147"
## [1] "epoch 98000 done"
## [1] "cost: 6.00078163500429"
## [1] "epoch 98100 done"
## [1] "cost: 5.88809871194647"
## [1] "epoch 98200 done"
## [1] "cost: 5.88994179907224"
## [1] "epoch 98300 done"
## [1] "cost: 5.89154018414829"
## [1] "epoch 98400 done"
## [1] "cost: 5.94912735074941"
## [1] "epoch 98500 done"
## [1] "cost: 5.94632198345465"
## [1] "epoch 98600 done"
## [1] "cost: 5.88234666132759"
## [1] "epoch 98700 done"
## [1] "cost: 5.87627577970838"
## [1] "epoch 98800 done"
## [1] "cost: 5.9028961318134"
## [1] "epoch 98900 done"
## [1] "cost: 6.05313992928772"
## [1] "epoch 99000 done"
## [1] "cost: 5.86990228885473"
## [1] "epoch 99100 done"
## [1] "cost: 5.88819704776816"
## [1] "epoch 99200 done"
## [1] "cost: 5.87814123976861"
## [1] "epoch 99300 done"
## [1] "cost: 5.86512534815085"
## [1] "epoch 99400 done"
## [1] "cost: 5.86450870094147"
## [1] "epoch 99500 done"
## [1] "cost: 5.86582371868998"
## [1] "epoch 99600 done"
## [1] "cost: 5.85760247933286"
## [1] "epoch 99700 done"
## [1] "cost: 5.97374114278003"
## [1] "epoch 99800 done"
## [1] "cost: 5.87056128642451"
## [1] "epoch 99900 done"
## [1] "cost: 5.92245586274464"
## [1] "epoch 100000 done"
## [1] "cost: 5.84980513813439"

Let’s see a plot of the change in cost on the training set as the network learned:

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white") # plot styling

plot( x = 0:( length(store_historic_cost)-1 ),
      y = store_historic_cost,
      xlab = "epoch",
      ylab = "cost",
      type = "l"
    )

Let’s see how close the response $y$ observations in the training set are to their fitted values $\hat{y}$ using the network:

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white") # plot styling

final_network_output <- generic_feedforward_ftn( X = training_X_standard, 
                                                 Weight_matrices = nn_weights )

final_y_hat_training <- final_network_output$a[[paste0("a",n_layers)]]

plot( x = 1:length(training_y),
      y = training_y,
      pch=16,
      cex=0.5,
      xlab = "sample id",
      ylab = "y",
      main = "neural network fit to training data",
      ylim = c( 0, max(training_y) )
    )
points( x = 1:length(training_y),
        y = final_y_hat_training,
        pch = 16,
        col = 3,
        cex = 0.5
      )
segments( x0 = 1:length(training_y),
          y0 = training_y,
          x1 = 1:length(training_y),
          y1 = final_y_hat_training,
          col = 2,
          lwd  =0.5,
          lty = 3
        )
legend( "topright",
        legend = c( "true value", "predicted value" ),
        pch = c(16,16),
        cex = 0.8,
        col = c(8,3),
        y.intersp   = 0.5,
        box.col = 8,
        title = NULL
      )

To compare, here is how well a linear model fits to the same training data:

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white") # plot styling

fit.lm <- lm( medv ~ .,
              data = cbind(medv=training_y, training_X_standard) %>% as.data.frame()
            )

plot( x = 1:length(training_y),
      y = training_y,
      cex = 0.5,
      pch=16,
      xlab = "sample id",
      ylab = "y",
      main = "simple linear model fit to training data",
      ylim = c( 0, max(training_y) )
    )
points( x = 1:length(training_y),
        y = fitted.values(fit.lm),
        pch = 16,
        col = 3,
        cex = 0.5
      )
segments( x0 = 1:length(training_y),
          y0 = training_y,
          x1 = 1:length(training_y),
          y1 = fitted.values(fit.lm),
          col = 2,
          lwd=0.5,
          lty=3
        )
legend( "topright",
        legend = c( "true value", "predicted value" ),
        pch = c(16,16),
        cex = 0.8,
        col = c(8,3),
        y.intersp   = 0.5,
        box.col = 8,
        title = NULL
      )

The network achieves a final cost on the training set of:

quadratic_cost_ftn( y = training_y, y_hat = final_y_hat_training )

## [1] 5.849805

while the linear model achieves cost on the training set of:

quadratic_cost_ftn( y = training_y, y_hat = fitted.values(fit.lm) )

## [1] 10.70182

Let’s see how both perform on the test set:

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white") # plot styling
####### NEURAL NETWORK #####

network_out_testset <- generic_feedforward_ftn( X = test_X_standard, 
                                                 Weight_matrices = nn_weights )

network_test_y_predict <- network_out_testset$a[[paste0("a",n_layers)]]

network_cost <- quadratic_cost_ftn( y = test_y, y_hat = network_test_y_predict )

plot( x = 1:length(test_y),
      y = test_y,
      pch=16,
      cex=0.5,
      main = paste0( "Nueral Network: test set predictions \n COST: ", round(network_cost,2) ),
      xlab = "sample index",
      ylab = "y",
      ylim = c(0, max(test_y)+10 )
    )
points( x = 1:length(test_y),
        y = network_test_y_predict,
        pch = 16,
        col = 3,
        cex = 0.5
      )
segments( x0 = 1:length(test_y),
          y0 = test_y,
          x1 = 1:length(test_y),
          y1 = network_test_y_predict,
          col = 2,
          lwd = 0.5,
          lty = 3
        )
legend( "topright",
        legend = c( "true value", "predicted value" ),
        pch = c(16,16),
        cex = 0.8,
        col = c(8,3),
        y.intersp   = 0.5,
        box.col = 8,
        title = NULL
      )

####### LINEAR MODEL ######
lm_test_y_predict <- predict(  object = fit.lm, 
                              newdata = cbind(medv=test_y, test_X_standard) %>% as.data.frame()
                            )

lm_cost <- quadratic_cost_ftn( y = test_y, y_hat = lm_test_y_predict )

plot( x = 1:length(test_y),
      y = test_y,
      pch=16,
      cex=0.5,
      main = paste0( "Linear Model: test set predictions \n COST: ", round(lm_cost,2) ),
      xlab = "sample index",
      ylab = "y",
      ylim = c(0, max(test_y)+10 )
    )
points( x = 1:length(test_y),
        y = lm_test_y_predict,
        pch = 16,
        col = 3,
        cex = 0.5
      )
segments( x0 = 1:length(test_y),
          y0 = test_y,
          x1 = 1:length(test_y),
          y1 = lm_test_y_predict,
          col = 2,
          lwd = 0.5,
          lty = 3
        )
legend( "topright",
        legend = c( "true value", "predicted value" ),
        pch = c(16,16),
        cex = 0.8,
        col = c(8,3),
        y.intersp   = 0.5,
        box.col = 8,
        title = NULL
      )

Here are the sizes of the weights in the final trained network:

store_weights <- tibble(name="namenull", value=0) %>% slice(0)

for( i in 1:length(nn_weights) ){
  store_weights <- 
    bind_rows( store_weights 
               ,
               tibble(  name = names(nn_weights)[[i]],
                       value = c( nn_weights[[i]] ) 
                     )
             )
}

ggplot( data = store_weights,
          aes( x = name,
               y = value
             )
      ) +
  geom_jitter( colour="white", size=0.1, height=0, width=0.3 ) +
  theme( axis.text.x = element_text(angle=90) ) +
  labs( title = "Weights in the Final Trained Network")

Feed-Forward Neural Network from Scratch in R

Joseph Bolton

20 September 2018