Exploring OAK, DepthAI and SpatialAI

We're continuing our exploration of the super nifty open source OAK hardware boards, which let you run VPU-accelerated neural net and computer vision algorithms.

DepthAI is the term OAK's developers use for their complete embedded Spatial AI platform: the open source hardware, the firmware, and the software interface API. It lets you run preconfigured neural net AI models (OpenVINO models) to build real-time embedded computer vision systems. If you want to use custom neural net models, you can do that as well by following a set of instructions the developers provide.

SpatialAI is all about localizing objects in 3D space, so finding their positions in physical space as opposed to just pixel space in a 2D image.

DepthAI provides 2 different methods for generating SpatialAI results.
  -Monocular neural inference fused with stereo depth
  -Stereo neural inference

Monocular neural inference fused with stereo depth means that the neural net model is run on a single camera, and its results are then fused with the stereo depth (disparity) results. Any of the 3 cameras in the system (RGB, left, or right) can be used to run the neural inference.
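To make the fusion idea a bit more concrete, here is a minimal sketch of one way a 2D bounding box could be combined with a depth map. This is not DepthAI's internal implementation. The function name `locate_detection`, the median-of-the-box strategy, and the assumption that the depth map is already aligned to the camera running the detector are all mine, purely for illustration.

```python
from statistics import median


def locate_detection(depth_mm, bbox, fx, fy, cx, cy):
    """Estimate a 3D position (in mm) for a 2D detection (hypothetical helper).

    depth_mm: 2D list of depth values in millimetres; 0 means "no depth".
    bbox: (xmin, ymin, xmax, ymax) in pixels, with the max edges exclusive.
    fx, fy, cx, cy: assumed pinhole intrinsics of the camera the depth
                    map is aligned to.
    """
    xmin, ymin, xmax, ymax = bbox

    # Collect the valid depth samples that fall inside the bounding box.
    samples = [
        depth_mm[v][u]
        for v in range(ymin, ymax)
        for u in range(xmin, xmax)
        if depth_mm[v][u] > 0
    ]
    if not samples:
        return None  # no usable depth inside the box

    # The median is less sensitive than the mean to background pixels
    # and holes in the depth map.
    z = median(samples)

    # Back-project the box centre through the pinhole model.
    u_c = (xmin + xmax) / 2.0
    v_c = (ymin + ymax) / 2.0
    x = (u_c - cx) * z / fx
    y = (v_c - cy) * z / fy
    return (x, y, z)
```

Calling it with a detector's bounding box and the matching depth frame returns an `(x, y, z)` tuple in millimetres relative to the camera, or `None` when the box contains no valid depth.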

Stereo neural inference means that the neural net model is run in parallel on the left and right stereo views. The disparity between the two sets of results is then triangulated using the calibrated camera intrinsics to give the 3D position of the detected features.
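The triangulation step can be sketched with the standard rectified-stereo pinhole relations: depth Z = f * B / d, where d is the horizontal disparity and B is the baseline between the two cameras. The function name `triangulate` and the numbers in the usage example are illustrative assumptions, not values read from a real OAK calibration, and the sketch assumes the two views are already rectified.

```python
def triangulate(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Triangulate one feature seen in both rectified stereo views.

    u_left, v_left: pixel position of the feature in the left image.
    u_right: horizontal pixel position of the same feature in the right image.
    fx, fy, cx, cy: assumed pinhole intrinsics of the rectified left camera.
    baseline: distance between the two cameras; the result uses the same unit.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        return None  # feature at infinity or a bad left/right match

    z = fx * baseline / disparity   # depth from disparity
    x = (u_left - cx) * z / fx      # back-project to camera coordinates
    y = (v_left - cy) * z / fy
    return (x, y, z)


# Illustrative values only: 800 px focal length, 75 mm baseline.
point = triangulate(340, 200, 320, 800.0, 800.0, 320.0, 200.0, 75.0)
```

With those made-up numbers the disparity is 20 pixels, which puts the feature 3000 mm in front of the camera and 75 mm to the side of the optical axis.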

Neither of these approaches requires neural net models to be trained with depth data.  You can use standard 2D networks and get accurate 3D results out of them.

If you want to use depth data to train custom neural net models, you certainly can do that. You'd need to add 2 extra input layers to your neural net model for the 2 depth camera images, and then train this new model on RGB, left, and right input images. Keep in mind that standard 2D object recognition models were trained on huge databases to get the high quality results they generate.

The DepthAI API provides support for getting 3D positional data from standard 2D neural net models that do object recognition, hiding all of the gory details of how that works internally. That's really great for you, the software developer using the system.

I should point out that the spatial positioning support discussed above can work with neural nets other than bounding box object detectors. You can also use it with neural net feature detectors that return single points (or sets of single points). Things like facial landmark detectors or pose estimators can also generate 3D position output data for the feature points they detect in a viewed scene.

