Machine Learning on GPUs and Embedded Devices

Deep nets are becoming pervasive in desktop and mobile applications, but running them on small embedded devices remains a challenge. The difficulties include choosing the right implementation for the embedded device (cuDNN, TensorFlow, etc.) and applying other techniques that improve performance, for example off-loading power-hungry algorithms to the cloud while keeping local the algorithms that reduce the amount of data transmitted.

Deep learning, and machine learning more broadly, describes systems that learn algorithmically from representations of data. The algorithms have been around since the late 1960s, but only recently has the hardware become powerful enough, and the algorithms good enough, to make the processing feasible. The algorithms are flexible enough for image recognition, voice recognition, and a wide variety of other domains. A familiar application is the image recognition used when you deposit a check at an ATM. The system reads the amount of the check almost error-free in spite of the uniqueness of each person’s handwriting.

Python libraries

There are many software libraries for these algorithms. The most popular one is TensorFlow from Google. Torch (used by Facebook), Microsoft's Cognitive Toolkit (CNTK), and NVIDIA's cuDNN are also gaining in popularity.

Training the algorithm

The computer is presented with example inputs and their desired outputs, which together are called “training sets”. The training sets come from a known library of data in which the inputs represent the wide range of input expected in use. The data set can also be broken down further into training and verification sets. While the training set is used to tune the algorithm, the verification set is used after training to verify the performance of the tuned algorithm. The goal is for the system to learn a general set of rules, expressed as parameter weights, that allow it to classify objects. For example, the bank’s ATM can identify the check amount by reading each digit separately. For the system to learn to recognize the number “1”, lots of training data is required because it must handle differences in rotation, size, shape, handwriting, pen thickness, and background noise. When a check image is input, we want the system to output the correct deposit amount.
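The split into training and verification sets can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any particular product: the data is synthetic, the function names (split_dataset, train_perceptron, accuracy) are hypothetical, and a simple perceptron stands in for a deep net so that the example needs only the standard library.

```python
import random

def split_dataset(examples, verification_fraction=0.2, seed=0):
    """Shuffle labelled examples and split them into training and verification sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_verify = int(len(shuffled) * verification_fraction)
    return shuffled[n_verify:], shuffled[:n_verify]

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train_perceptron(training_set, n_features, epochs=20, learning_rate=0.1):
    """Tune the parameter weights with the classic perceptron update rule."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in training_set:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

def accuracy(weights, bias, dataset):
    """Fraction of examples in the dataset that are classified correctly."""
    correct = sum(predict(weights, bias, f) == y for f, y in dataset)
    return correct / len(dataset)

# Synthetic toy data (an assumption for illustration): label is 1 when x0 + x1 > 1.
data_rng = random.Random(42)
examples = []
for _ in range(200):
    point = (data_rng.random(), data_rng.random())
    examples.append((point, 1 if point[0] + point[1] > 1.0 else 0))

training_set, verification_set = split_dataset(examples)
weights, bias = train_perceptron(training_set, n_features=2)

# The verification set was never seen during training, so this number
# estimates how well the tuned weights generalize.
print("verification accuracy:", accuracy(weights, bias, verification_set))
```

Keeping the verification examples out of the training loop is the point of the split: a score measured on the training set itself would overstate how well the weights handle new input.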

There are two phases. The first is training, which is usually performed on specialized Graphics Processing Unit (GPU) hardware to reduce the training time. The second is deployment: once the model is trained, it can often be adapted to inexpensive platforms. Pruning and reducing the algorithmic features can allow certain algorithms to run on simple embedded devices. For other, more complicated algorithms, a higher-end CPU can be used. GPU advances have also allowed very complex processing to be performed on smaller, low-power devices.
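As a rough illustration of pruning, the sketch below zeroes out the weights with the smallest absolute values. Magnitude-based pruning is only one common approach and is assumed here for the example; the function name prune_by_magnitude and the sample weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical trained weights for a single layer.
layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned_weights = prune_by_magnitude(layer_weights, sparsity=0.5)
print(pruned_weights)  # → [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

Zeroed weights can be skipped or stored in a sparse format, which is where the memory and compute savings on an embedded device would come from. In practice a pruned model is usually re-checked against the verification set, since pruning too aggressively reduces accuracy.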

Why Run Machine Learning on GPUs?

The training phase requires a lot of processing power for its iterative matrix multiplication. GPUs excel at this work and can deliver several tera floating-point operations per second (TFLOPS) per graphics accelerator. The math required for these algorithms is similar to that of graphics processing, which has opened a new market for GPUs. GPUs used this way are sometimes called GPGPUs (general-purpose GPUs), and they can be scaled out to provide an incredible amount of processing power for complex math.
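To get a feel for the numbers, the following sketch counts the floating-point operations in a single matrix multiplication and converts that count into time at a given throughput. The layer dimensions and the 1 TFLOPS sustained rate are assumptions chosen for illustration, not measurements of any particular GPU.

```python
def matmul_flops(m, k, n):
    """Approximate floating-point operations for an (m x k) by (k x n) product.

    Each of the m * n outputs needs k multiplications and k - 1 additions,
    which is conventionally rounded to 2 * m * k * n.
    """
    return 2 * m * k * n

def seconds_at(flops, tflops):
    """Time to perform the given operations at a sustained rate in TFLOPS."""
    return flops / (tflops * 1e12)

# Hypothetical layer: a batch of 256 inputs, 4096 features in, 4096 features out.
flops_per_pass = matmul_flops(256, 4096, 4096)
print(flops_per_pass)  # → 8589934592

# At an assumed sustained 1 TFLOPS this one multiplication takes about 8.6 ms.
print(seconds_at(flops_per_pass, tflops=1.0))
```

Training repeats multiplications like this for every layer, every batch, and every pass over the data, which is why the total quickly reaches a scale where GPU throughput matters.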

Business benefits of Machine Learning on GPUs

Using Amazon EC2 with P2 instances, it is possible to turn on a training environment for a couple of days, do the training on a large system with 8 or 16 GPUs, and then turn it off when training is complete. This powerful and flexible method is cost-effective even for startups. Larger companies may use GPU farms or even custom ASICs for faster processing.

Porting to an embedded end device

Once the training session is complete, a set of weighted parameters is loaded into the processing nodes. The end device might be a phone, a Raspberry Pi, or another embedded device. In some cases it might use a custom ASIC with math accelerators. The major challenges with this port are memory size and processor speed. If the embedded device runs on a battery, it is also necessary to minimize processing requirements and power usage. Well-known network architectures such as VGG-16, ResNet, and AlexNet can serve as starting points for models that are then reduced to fit the device.
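One way to reason about the memory constraint is to estimate the size of the stored weights and to see how much a smaller number format saves. The sketch below uses 8-bit quantization, a technique the article does not name and which is assumed here purely for illustration; the function names and sample values are hypothetical.

```python
def footprint_bytes(n_parameters, bytes_per_parameter):
    """Storage needed for the weights alone, ignoring activations and code."""
    return n_parameters * bytes_per_parameter

def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# A hypothetical model with one million parameters:
print(footprint_bytes(1_000_000, 4))  # 32-bit floats → 4000000
print(footprint_bytes(1_000_000, 1))  # 8-bit integers → 1000000

# Hypothetical weights from a trained layer.
trained = [0.5, -1.27, 0.0, 1.0]
stored, scale = quantize_int8(trained)
print(stored)  # → [50, -127, 0, 100]

# The recovered values differ from the originals by at most half a step.
recovered = dequantize(stored, scale)
print(max(abs(a - b) for a, b in zip(trained, recovered)))
```

The trade-off is a small loss of precision in exchange for a fourfold reduction in weight storage, so the quantized model should be checked against held-out data before it is shipped to the device.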


As everyday products get smarter by running deep nets on small embedded devices, these technologies will enable new solutions, and learning networks will be the cornerstone of such products. GPUs will remain popular for the training phase of algorithm development because they deliver the required processing power cost-effectively.

For more information on Voler Systems’ GPU, FPGA, and embedded hardware and software expertise, call us at (408) 412-9175.
