Deep Learning Deployment Options - part 2

Let's continue our exploration of deep learning deployment options by taking a look at embedded deployment. So we're talking about potentially deploying deep learning nets in Internet of Things (IoT) style devices.


So you may have heard of a single-board computer called the Raspberry Pi, which starts at $35. Surely such an inexpensive board can't run sophisticated deep learning nets. Well, actually you can run them on a Pi, although they will run slowly.  But things start getting way more interesting when you look at some very inexpensive coprocessor options that can be paired with the Raspberry Pi.

Some performance specs to keep in mind:

ResNet [45], arguably one of the most powerful and accurate CNN architectures, requires roughly 2 to 16 GFLOPs (billions of floating-point operations) per forward pass, depending on how deep the model is.

The Raspberry Pi 3 includes a quad-core ARM Cortex-A53 running at 1.2 GHz, with roughly 10x the performance of the original RPi [6], giving us approximately 0.41 GFLOPS of compute.
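
To put those two numbers together, here's a quick back-of-the-envelope sketch (the ~4 GFLOPs figure for a single ResNet-50 forward pass is an approximation):

```python
# Rough estimate of ResNet-50 inference time on a bare Raspberry Pi 3.
resnet50_gflop = 4.0   # ~4 billion FLOPs per forward pass (approximate)
pi3_gflops = 0.41      # Pi 3 throughput from above, in GFLOPS

seconds_per_image = resnet50_gflop / pi3_gflops
print(f"~{seconds_per_image:.1f} seconds per image")  # roughly 10 s/image
```

So you're looking at something like 10 seconds per image on the Pi's CPU alone, which is why the coprocessors below are so appealing.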

Let's start by taking a look at the Google Coral Tensor Processing Unit (TPU), a USB accessory that performs high-speed deep learning inferencing. It is capable of running 4 trillion operations per second (4 TOPS), using only 0.5 watts of power for each TOPS (so about 2 watts total).

The Coral coprocessor contains an Edge TPU designed by Google.  Now Google runs Cloud TPUs in their data centers that can perform 240 teraflops, making them ideal for training deep learning nets (or just running already trained ones very fast).  The Edge TPU is specifically designed for small low-power devices, and is really only meant to run already trained neural nets (inference). It also takes some liberty with the precision of the calculations being performed (the weights of the net are quantized to 8-bit integers).

One thing to be aware of is that the top speed depends on having a USB 3.0 connection. The stick works with USB 2.0, but runs much slower. This is important to pay attention to, because you want to make sure your Pi system has a USB 3 port (the Raspberry Pi 4 does; earlier Pis only have USB 2.0). If you connect to the Pi over USB 2, it runs quite a bit slower.

The Edge TPU works with TensorFlow Lite models.  So if you are using Keras to train up a TensorFlow 2.0 model, then you need to convert it to TensorFlow Lite.  It looks like there is a conversion path for Keras model files as well (sketched below).

And be aware that TensorFlow Lite supports a limited subset of TensorFlow operations, and the Edge TPU expects all input data in uint8 format.  You can read more about the quantization options here.
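
Here's a minimal sketch of that conversion using the TensorFlow 2.x converter API. The model and sample_images names are placeholders for your own trained Keras model and a representative slice of your training data:

```python
import numpy as np
import tensorflow as tf

# `model` is your trained tf.keras model and `sample_images` is a small,
# representative batch of float32 training inputs (placeholders here).
def representative_dataset():
    for image in sample_images[:100]:
        yield [np.expand_dims(image, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the model to int8 ops with uint8 I/O, which the Edge TPU requires.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting model.tflite file then gets run through Google's edgetpu_compiler command-line tool before it will actually execute on the Edge TPU.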

So again, this whole thing works because you can pretty dramatically reduce the precision of the calculations running in a trained deep learning net and still get reasonable results out of it.
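
Once the model is compiled, running it on the stick is just the standard TensorFlow Lite interpreter with the Edge TPU delegate loaded. A sketch, with model_edgetpu.tflite being the hypothetical output of the compiler step above:

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the compiled model with the Edge TPU delegate, so supported ops
# execute on the stick instead of the Pi's CPU.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one uint8 image shaped to match the model's input tensor.
image = np.zeros(input_details[0]["shape"], dtype=np.uint8)  # placeholder
interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()
scores = interpreter.get_tensor(output_details[0]["index"])
```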

I should point out that Coral has other deployment options besides the USB accelerator (which was originally priced at $74.99 but has dropped to $59.99).


Another inexpensive coprocessor option is the Intel Movidius Neural Compute Stick (NCS).  It is similar to the Coral stick in that both plug into USB ports and act as coprocessors to speed up deep learning computations.  The NCS uses an array of 12 vector processors (called SHAVE processors) running in parallel, 4 Gbit of DRAM, and a SPARC coprocessor core.

The NCS can run between 80 and 150 GFLOPS using just over 1 W of power.

Google reports that their Coral line of products is over 10x faster than the NCS. But that assumes USB 3; if you are using USB 2, the Coral stick is about 10x slower, which puts it roughly on par with the NCS.

One thing about the NCS is that you can run a greater class of algorithms on it than on the graph-computation-constrained Edge TPU (which basically does one thing extremely well). So while it may not be as fast, it could potentially be a bit more versatile.
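
For the NCS, the usual route is Intel's OpenVINO toolkit: you convert your trained model to OpenVINO's IR format with its Model Optimizer, then target the "MYRIAD" device. A minimal sketch using the older OpenVINO Python API, where model.xml/model.bin are the hypothetical Model Optimizer outputs:

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# "MYRIAD" is OpenVINO's device name for the Movidius VPU in the NCS.
exec_net = ie.load_network(network=net, device_name="MYRIAD")

input_blob = next(iter(net.input_info))
image = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder input
result = exec_net.infer(inputs={input_blob: image})
```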


A third (even more versatile) inexpensive coprocessor option is the NVIDIA Jetson Nano, which costs $99 and is a small coprocessor board, as opposed to a USB stick like the other two options discussed.

When I first read about the Jetson Nano, my immediate reaction was to wonder if you could configure a bunch of them as a low-cost GPU cluster.  And of course someone has already done this, building a GPU-enabled Kubernetes cluster.  Kubernetes, originally created by Google, is a very commonly used software tool for managing distributed applications running on hundreds, thousands, or maybe even hundreds of thousands of machines.

Now the Jetson Nano runs CUDA, so you can use it to do way more than you can with those other two approaches.
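
For example, because the Nano exposes CUDA through the standard frameworks, ordinary PyTorch GPU code runs on it unmodified. A sketch, assuming NVIDIA's Jetson build of PyTorch is installed:

```python
import torch
import torchvision.models as models

# Same code you'd run on a desktop GPU; on the Nano, "cuda" is the
# onboard 128-core Maxwell GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet18(pretrained=True).to(device).eval()

with torch.no_grad():
    batch = torch.randn(1, 3, 224, 224, device=device)  # placeholder image
    prediction = model(batch).argmax(dim=1)
print(prediction)
```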


The Jetson Nano includes a 128-core Maxwell GPU, a quad-core ARM Cortex-A57 processor running at 1.43 GHz, and 4GB of 64-bit LPDDR4 RAM. The Nano can provide 472 GFLOPS while drawing only 5-10 W of power.

Another accelerator board option would be the Google Coral TPU Dev Board, which is capable of 32-634 GFLOPS (so much more powerful than the USB stick option). But you can only run TensorFlow Lite models on it.


So the embedded space for deploying deep learning neural nets is actually very active, and has a number of potentially very interesting low-cost options you can take advantage of. So you don't have to just think about deploying to desktop or mobile device applications.

For example, HTC has been discussing potential applications of deep learning that Maui farmers could take advantage of. We were originally thinking of something like targeted smartphone applications. But the things we discussed in this post offer up other possibilities, like dedicated small boxes for specific tasks that could be constructed for under $200.

Of course any of these options could be used in conjunction with portable computers as well, to speed up or offload neural net computations. More on that in a later post.


