HTC Education Series: Getting Started with Deep Learning - Lesson 4

  Ok, let's dive into the 4th lecture in our Getting Started with Deep Learning series.  

We're going to go under the hood in this lecture to get a better understanding of how deep learning neural nets really work.  We'll do this by building and training one from scratch.  We'll start old school with a single-layer linear network, then extend it with ReLU nonlinearities to create a deep nonlinear network.  Along the way we will meet our old friend the sigmoid function, as well as the newer softmax activation function.

Remember, a great way for you to learn and retain the specific material these lectures cover is to put together your own summary of what was covered in the lecture.  You should ideally do this before reading our summary below.

You can also watch the video on the fastai course site here. The advantage of doing that is that you can access the searchable transcript, interactive notebooks, setup guides, questionnaires, etc. on that site.


Don't forget to read the course book



What is covered in this lecture

Arthur Samuel's description of machine learning
breaks down into 7 steps (a minimal code sketch follows the list)

initialize the weights
processing cycle
    for each input, use weights to predict output of system
    calculate how good the model is (its loss) based on the predictions
    calculate the gradient (measure for each weight how changing the weight would change the loss)
    change all the weights based on the above calculation 
repeat cycle until you decide to stop training
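
Here's a minimal sketch of those seven steps as PyTorch code on a toy problem.  The toy data, the variable names (params, lr), and the 100-step stopping rule are illustrative assumptions, not something from the lecture itself.

    import torch

    # Toy problem: fit y = 3x + 2 with two parameters (a weight and a bias).
    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = 3 * x + 2

    params = torch.randn(2, requires_grad=True)   # step 1: initialize the weights
    lr = 0.1

    for step in range(100):                       # repeat the cycle...
        w, b = params
        preds = x * w + b                         # step 2: use the weights to predict the output
        loss = ((preds - y) ** 2).mean()          # step 3: how good is the model (its loss)
        loss.backward()                           # step 4: gradient of the loss w.r.t. each weight
        with torch.no_grad():
            params -= lr * params.grad            # step 5: change all the weights
            params.grad.zero_()
    # ...until you decide to stop training (here, after 100 steps)

    print(params)                                 # should end up close to (3, 2)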

Gradient Descent can be thought of graphically as an error landscape (with hills and valleys)
error landscape - the marble wants to get to the bottom (lowest error)

Dataset
    training
    validation

building a neural net
    parameters
        weights
        biases

we need to restructure the computation to be suitable for a GPU
    use computational blocks like matrix multiplication
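
For example (a minimal sketch; the tensor shapes are just illustrative), a linear layer applied to a whole mini-batch is a single matrix multiplication plus a bias add, and the identical code runs on a GPU once the tensors are moved there:

    import torch

    x = torch.randn(64, 784)    # a mini-batch of 64 flattened 28x28 images
    w = torch.randn(784, 10)    # weights
    b = torch.randn(10)         # biases
    out = x @ w + b             # one matrix multiply covers the whole batch

    # the same computation on a GPU, if one is available
    if torch.cuda.is_available():
        out = x.cuda() @ w.cuda() + b.cuda()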

activation function

gradient of a function is its slope
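
A quick autograd sketch of this idea (the function f(x) = x**2 is just an illustrative example):

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    y = x ** 2          # f(x) = x^2, whose slope at x is 2x
    y.backward()        # ask PyTorch for the derivative
    print(x.grad)       # tensor(6.) -- the slope of x^2 at x = 3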

Batch processing
    mini batches (random subset of total data)
    introduces stochastic variation that turns GD into SGD

DataLoader abstracts batch processing for you
    uses a Dataset as its input

train one epoch
    processing the entire data set per weight update is gradient descent (GD)
    processing a mini batch per weight update is stochastic gradient descent (SGD)
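
A sketch of one SGD epoch using a plain PyTorch Dataset/DataLoader (fastai's DataLoader builds on the same idea); the toy data, model, loss, and batch size here are illustrative assumptions:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    xs = torch.randn(1000, 20)                          # toy inputs
    ys = (xs.sum(dim=1, keepdim=True) > 0).float()      # toy labels

    dset = TensorDataset(xs, ys)                        # the Dataset is the DataLoader's input
    dl = DataLoader(dset, batch_size=64, shuffle=True)  # random mini-batches each epoch

    model = torch.nn.Linear(20, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_func = torch.nn.BCEWithLogitsLoss()

    # one epoch: the weights are updated once per mini-batch (that's the "stochastic" part)
    for xb, yb in dl:
        loss = loss_func(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()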

Learner
    uses DataLoader as input
    uses model as input
    uses optimization function as input
    uses metrics as input
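
A minimal sketch of wiring those pieces together in fastai (dls and model are assumed to have been built already; the loss function and learning rate are illustrative):

    from fastai.vision.all import *

    learn = Learner(dls, model,
                    opt_func=SGD,                     # optimization function
                    loss_func=nn.CrossEntropyLoss(),  # loss function
                    metrics=accuracy)                 # metric reported on the validation set
    learn.fit(4, lr=0.1)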

neural network
    linear function
    nonlinearity
    linear function

Rectified Linear Unit (ReLU)
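
A sketch of that linear → ReLU → linear structure in PyTorch (the layer sizes are illustrative, along the lines of the lecture's MNIST example):

    import torch.nn as nn

    simple_net = nn.Sequential(
        nn.Linear(28 * 28, 30),   # linear function
        nn.ReLU(),                # nonlinearity
        nn.Linear(30, 1),         # linear function
    )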

linear vs nonlinear network

nonlinear network = universal function approximator
    can approximate any computable function to arbitrary accuracy (WOW!)

High Level fastai multiple category classification example
    - use DataLoader to read in the data
    - use DataLoader to randomly augment data on the fly when training
        fastai has some really nice data augmentation features that live here
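
A sketch of what that high-level setup can look like with a fastai DataBlock ('path' and the parent_label labeling function are placeholder assumptions for whatever dataset you're using):

    from fastai.vision.all import *

    dblock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        get_y=parent_label,              # placeholder labeling function
        item_tfms=Resize(224),
        batch_tfms=aug_transforms())     # random augmentation applied to each batch, on the GPU
    dls = dblock.dataloaders(path)       # 'path' points at your image folder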

SoftMax Activation Function
    - used for classification with more than two categories (non-binary classification)
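
Softmax turns the raw score for each class into a probability, and all the probabilities sum to 1.  A tiny sketch (the three scores are made up):

    import torch

    acts = torch.tensor([0.5, -1.2, 2.0])              # raw scores for 3 classes
    probs = torch.exp(acts) / torch.exp(acts).sum()    # softmax by hand
    print(probs, probs.sum())                          # same result as torch.softmax(acts, dim=0)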

Choose the ethical path when evaluating your data, and when deploying your deep learning data model into the world


Additional HTC Course Material

1: Xander will continue to get us pumped up about How Neural Networks Learn in Part 3: The Learning Dynamics Behind Generalization and Overfitting.

Remember, Part 1 focused on neural network Feature Visualization, and then Part 2 focused on taking what we learned about Feature Visualization and using it to fool a deep learning system by creating adversarial examples.  Part 3 looks at how deep learning networks memorize and represent data from an information theory analysis viewpoint. 



Observations

1:  This fastai lecture covered some important fundamental concepts associated with building and training a neural net.  Think about how the concepts below match up with the individual steps in Arthur Samuel's description of machine learning.

initialize the weights
processing cycle
    for each input, use weights to predict output of system
    calculate how good the model is (its loss) based on the predictions
    calculate the gradient (measure for each weight how changing the weight would change the loss)
    change all the weights based on the above calculation 
repeat cycle until you decide to stop training

Neural Net Concepts discussed in the lecture

forward pass    - pass an input to the model; the model computes an output
        ex: a prediction, if the model is a classifier

loss    - a value that represents the performance of the model,
            computed using a function chosen to reflect some useful metric

gradient    - derivative of the loss with respect to the parameters of the model

backward pass    - computing the gradients of the loss with respect to the model parameters

gradient descent (GD)    - step the parameters in the direction opposite to the gradients to make the model better (lower error)

learning rate    - the size of the step taken in the direction opposite to the gradients

ReLU    - nonlinear function that returns 0 for negative numbers and leaves positive numbers unchanged

from GD to SGD
    mini-batch    - a random subset of your data, stored in 2 big arrays
        ex: a few inputs (images) and their associated labels (cat, dog)
    batch processing effectively adds noise to the system, jostling the error landscape
        - the shaking helps the marble get over obstacles on the error landscape (think of a pinball machine)
            remember, we want that marble to get to the very bottom of the error landscape


2:  Tensors are used to store the following for the neural net model
activations    - numbers that are calculated by the net
parameters    - numbers that are randomly initialized, then optimized (weights, biases)
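
A small sketch of the distinction (shapes are illustrative):

    import torch

    # parameters: randomly initialized, marked for optimization
    w = torch.randn(784, 30, requires_grad=True)
    b = torch.zeros(30, requires_grad=True)

    # activations: computed by the net from the inputs and the parameters
    x = torch.randn(64, 784)    # a mini-batch of inputs
    act = x @ w + b             # an activation tensor of shape (64, 30)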

Working with tensors and tensor-based functional blocks like matrix multiplication structures the computation in a way that lends itself to GPU speedup.  Why?


3:  With Arthur Samuel's description of machine learning, and gradient descent (an optimization idea with roots going back to Newton's Method in the early 1700s), we actually have everything we need to build neural nets.  At least all of the knowledge; you still have to run that knowledge on some kind of hardware.  Imagine a parallel-universe steampunk society that developed AI technology very early in its history using steampunk technology.  How would it work?

4:  DataLoader is a very powerful abstraction object in fastai
    - handles batch processing
    - handles data augmentation
        - runs on GPU
        - combines multiple geometric transformations into a single GPU transform step (higher quality)
    - handles getting the physical data into the form the model wants to see

In other courses on deep learning, you typically end up spending a huge amount of manual programming time dealing with these issues.  The fastai DataLoader object just does it all for you in a very flexible way that can also be extended or customized if necessary.

These other deep learning courses also spend huge amounts of time manually tinkering with their data sets before they even begin to run them through a model.  Fastai takes the 180-degree opposite viewpoint, which is extremely practical: use your model from the very beginning to help clean up your dataset (and automate whenever possible).

Randomly augmenting your data on the fly during batch processing is also a huge fastai feature.  They have put a lot of thought into this, and it really shows.  It's one big factor in why fastai gets such good results in competitions, and why it can tackle problems with smaller training datasets and still get very good results (defying conventional wisdom).


Fun Stuff

1:  Jeremy shows off how show_image(w[0].view(28,28)) can show you a feature visualization of the first layer in the MNIST example discussed in the fastai lecture.



Need to review something from the previous lessons in the course?
No problem.

You can access Lesson 1 here.

You can access Lesson 2 here.

You can access Lesson 3 here.

You can move on to Lesson 5, the next lesson in the course, here.
