Utilizing cloud based GPU clusters for neural net training

The modern revolution in the development and application of deep learning neural net systems is to some extent based on the increased computational resources available today for training deep learning nets. The availability of GPU chips is a huge factor in providing those increased computational resources.

A lack of sufficient computational resources for training neural nets was the primary issue that stalled the back-prop neural net revolution that was taking place in the late 80s and early 90s. Computers at the time were just not powerful enough to successfully train neural nets with more than 2 or 3 layers.

Once sufficient computational resources were available around 2012, approaches were developed to successfully train neural networks with many layers. This is where the term deep comes from in deep learning: deep refers to the many layers in the neural network. As we mentioned in yesterday's post on Keras, the 2014 Simonyan and Zisserman VGG ImageNet networks had 16 and 19 layers.

Just to be clear, the availability of extremely large data sets for training deep learning nets was also a huge factor in the current deep learning revolution, as were some fundamental insights into how to design the weight initialization and the non-linear activation functions used within the neural net algorithms.

Access to GPU training is a huge win for training your deep learning neural net. You can certainly train on a desktop CPU. The issue is that, for many real world problems you would want to solve, you might have to wait a long time for the training to finish. Being able to utilize a powerful GPU on your personal computer can help quite a bit. And of course if you have access to a local GPU cluster, that can really help out.

People may have access to a GPU on their desktop computer. But you quickly start to run into questions: which GPU is it, how powerful is it really, and does my particular computer support the GPU I would like to use (lack of Apple support for Nvidia GPUs being one notable factor)? And of course all of this stuff costs money you may not be willing to spend.

Amazon Web Services (AWS) provides access to cloud based GPU configurations, including GPU clusters. AWS is not a free system. You will be paying an hourly rate for computation, and the cost varies based on how many GPUs you are working with. But using GPU clusters can dramatically speed up training times, and give you the ability to tackle problems that would just not be feasible to solve on a desktop computer.
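
To make the cost versus speed trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the function name training_cost, the idea that the price scales linearly with the number of GPUs, the simple scaling_efficiency speedup model, and the numbers in the example call. None of it comes from an actual AWS price list, so substitute the provider's current rates and your own benchmark timings.

```python
def training_cost(single_gpu_hours, n_gpus, rate_per_gpu_hour,
                  scaling_efficiency=1.0):
    """Estimate (wall_clock_hours, total_cost) for one training job.

    Hypothetical linear model, not a real pricing formula:
      - each extra GPU adds `scaling_efficiency` of one GPU's throughput
      - the hourly bill is n_gpus * rate_per_gpu_hour
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")

    speedup = 1 + (n_gpus - 1) * scaling_efficiency
    wall_clock_hours = single_gpu_hours / speedup
    total_cost = wall_clock_hours * n_gpus * rate_per_gpu_hour
    return wall_clock_hours, total_cost


# Placeholder numbers only: a job that takes 40 hours on one GPU,
# priced at a made-up 1.0 currency units per GPU hour.
for gpus in (1, 4, 8):
    hours, cost = training_cost(40, gpus, 1.0, scaling_efficiency=0.8)
    print(f"{gpus} GPU(s): about {hours:.1f} hours, cost about {cost:.2f}")
```

Under this simple model, perfect scaling leaves the total bill unchanged while the waiting time drops. Anything less than perfect scaling means you pay somewhat more in exchange for getting the result sooner.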

Cloud based deep learning net training also isolates you from the ever present problem of buying hardware and then immediately having it become out of date as new and more powerful chips are developed and released. And you may live somewhere (Maui for example) where the environmental conditions can be brutal to computer hardware, limiting its potential lifespan.

MXNet is an open source deep learning framework that supports working with cloud based GPU clusters. MXNet has deep integration with the Python programming language. It also has a variety of other language bindings. The MXNet dependency engine provides a way to parallelize computation across multiple devices.

Keras-MXNet is a fork of the Keras project, adding support for MXNet as a backend.

MXNet's dependency engine is designed not only for deep learning, but for any domain specific problem. It's designed to solve a general problem: execute a bunch of functions following their dependencies. Any two functions where one depends on the other must be executed one after the other. To boost performance, functions with no dependencies between them can be executed in parallel.
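
The scheduling idea described above can be sketched in a few lines of plain Python. To be clear, this is not MXNet's engine or its API. It is a minimal illustration built on the standard library's thread pool, and the names run_with_dependencies, tasks, and deps are made up for this example. A task is started only once everything it depends on has finished, and tasks that do not depend on each other are free to run at the same time.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_dependencies(tasks, deps, max_workers=4):
    """Run every task, starting each only after its dependencies finish.

    tasks: dict mapping a task name to a zero-argument callable.
    deps:  dict mapping a task name to the names it must wait for.
    Returns the task names in the order they finished.
    """
    # Dependencies each not-yet-started task is still waiting on.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    running = {}   # future -> task name
    finished = []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Start every task that has nothing left to wait for.
            ready = [name for name, waits in remaining.items() if not waits]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name

            if not running:
                raise ValueError("dependency cycle or unknown dependency")

            # Block until at least one running task completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                finished.append(name)
                for waits in remaining.values():
                    waits.discard(name)
    return finished


# Hypothetical pipeline: "load" and "init_weights" are independent, so they
# may run in parallel; "train" has to wait for both of its inputs.
tasks = {
    "load": lambda: None,
    "augment": lambda: None,
    "init_weights": lambda: None,
    "train": lambda: None,
}
deps = {"augment": ["load"], "train": ["augment", "init_weights"]}
print(run_with_dependencies(tasks, deps))
```

A real engine tracks dependencies on the data being read and written rather than on task names, and schedules work across GPUs as well as CPU threads, but the serialize-when-dependent, parallelize-when-independent rule is the same.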

It would be interesting to explore its use for speeding up and optimizing other computation on multi-core desktop CPUs, as opposed to rolling your own core splitting libraries. NNPACK can work with MXNet to provide multi-core CPU acceleration on x86-64, ARMv7, and ARM64 architectures. It also offers some clever optimizations for convolutions.

Another alternative to AWS cloud systems would be Google's Cloud Services. One thing they offer is a Deep Learning VM image. What they mean by 'image' is a virtual machine image with a pre-installed deep learning framework. Obviously they support TensorFlow. Google also provides information on running distributed TensorFlow on their generic cloud compute engine. The offerings mentioned earlier in this paragraph are part of their specific AI platform.

You can train a neural network on their AI platform using the Keras API we discussed in a previous post.

Another alternative is Microsoft's Azure Machine Learning cloud computing services. You can train a neural network using the Keras API on their system as well.

That was a lot of overview material with very little specific information about how to explicitly work with these systems. We will be discussing more specific details of using some of these systems in future posts. One of the goals here at HTC is to provide our members with specific portals or templates for these various cloud based platforms, to make it easier to train and use deep learning networks.

