The Complete Guide to Setting Up the Intel Movidius Neural Compute Stick for Edge AI Development
The Intel Movidius Neural Compute Stick (NCS) is a powerful USB device that brings hardware acceleration for deep neural networks to the edge. Featuring the Myriad 2 Vision Processing Unit (VPU), the Movidius NCS enables rapid prototyping and deployment of AI applications on resource-constrained devices, from smart cameras to drones to industrial equipment.
In this comprehensive guide, we'll walk through everything you need to know to get up and running with the Movidius NCS, including:
- An overview of the Myriad VPU architecture and its advantages for edge AI
- Setting up a development environment with the Intel Movidius Neural Compute SDK (NCSDK)
- Running pre-trained models and converting custom models for the Movidius NCS
- Code samples demonstrating the NCSDK API and how to build Python applications
- Practical tips for optimizing performance, managing power consumption, and deploying to production
- A look at the AI accelerator landscape and how the Movidius compares to other solutions
Whether you're an IoT solutions architect, embedded systems engineer, or data scientist experimenting with edge AI, this guide will provide a solid foundation for developing with the Movidius NCS. Let's dive in!
Myriad 2 VPU Architecture
At the core of the Movidius NCS is the Myriad 2 VPU, a system-on-chip designed from the ground up for high-performance, low-power computer vision and AI applications. The Myriad 2 combines multiple specialized processing units into a unique heterogeneous architecture^1:
- 12 128-bit VLIW Vector Processors (SHAVE cores) optimized for computer vision workloads
- Two 32-bit RISC processors (LEON cores) that handle control, scheduling, and housekeeping tasks
- Dedicated hardware accelerators for common neural network operations like convolutions and activations
- Interfaces for image sensors and displays
- Hardware for cryptography, video encoding/decoding, and power management
This architecture allows the Myriad 2 to achieve up to 1 TFLOPS of compute within a 1W power envelope^1. By offloading AI inference from the CPU to the Myriad 2, edge devices can perform complex neural network tasks in real time with low latency while preserving battery life.
Benchmarks show significant performance gains compared to running models on the CPU alone. For example, in one test classifying images with GoogLeNet, the Movidius NCS reached 15 FPS while an Intel Atom CPU maxed out at 2.5 FPS^2. For a more recent MobileNet-SSD model, the Movidius reached 36 FPS compared to just 4.8 FPS on a Raspberry Pi 3 CPU^3.
Of course, benchmark results depend heavily on model architecture and configuration. But in general, the Movidius excels at running larger, compute-bound neural networks where the SHAVE cores and dedicated accelerators make a big difference. It's less beneficial for smaller models that underutilize the VPU.
Setting Up a Development Environment
The first step in developing with the Movidius NCS is configuring a development machine and installing the NCSDK. Intel recommends using Ubuntu 16.04, which can run on a PC or inside a virtual machine (VM) on Windows/Mac.
If setting up a VM, use a hypervisor like VirtualBox, allocate at least 2 CPUs, 4 GB RAM, and 10 GB disk space to the VM, and enable USB passthrough so the guest OS can see the NCS (see the example below). See the NCSDK documentation for detailed instructions.
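For VirtualBox specifically, you can attach the stick with a USB filter from the host's command line. This is a hedged example; the VM name ubuntu-ncsdk is a placeholder for your own:
VBoxManage usbfilter add 0 --target ubuntu-ncsdk --name NCS --vendorid 03e7
Filtering on the Movidius vendor ID (03e7) rather than a specific product ID is deliberate: the stick re-enumerates with a different product ID once firmware is loaded onto it, and a vendor-wide filter catches both identities.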
With Ubuntu ready, open a terminal and install prerequisites:
sudo apt update
sudo apt install git python3-pip python3-dev libusb-1.0-0-dev libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libatlas-base-dev python3-setuptools cmake
Next, clone the NCSDK repo:
git clone https://github.com/movidius/ncsdk.git
cd ncsdk
Plug in your Movidius NCS and verify it's visible to the OS:
lsusb
Look for an entry with the Movidius vendor ID 03e7, such as:
Bus 001 Device 007: ID 03e7:2150 Intel Movidius MyriadX
(The description string comes from your system's usb.ids database, so it may read differently on your machine.)
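If the stick appears in lsusb but the SDK later reports permission errors, check the udev rules. The NCSDK installer normally sets these up for you; for reference, a rule along the following lines (the file name is illustrative, not necessarily what the installer writes) grants non-root access to any Movidius device:
# /etc/udev/rules.d/97-ncs.rules (illustrative name)
SUBSYSTEM=="usb", ATTRS{idVendor}=="03e7", MODE="0666"
After adding a rule, reload udev and replug the stick:
sudo udevadm control --reload-rules
sudo udevadm trigger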
Finally, run the installer and build the examples:
make install
make examples
After a successful installation, you're ready to start developing AI applications with the NCSDK.
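Before moving on, it's worth confirming that the SDK's Python bindings can actually see the stick. Here's a minimal sanity-check sketch using the same NCSDK v2 calls the bundled examples rely on:
import mvnc.mvncapi as mvnc

# Enumerate attached Movidius devices
devices = mvnc.enumerate_devices()
if not devices:
    print('No Movidius NCS found - check the USB connection and udev rules')
else:
    print('Found %d device(s)' % len(devices))
    device = mvnc.Device(devices[0])
    device.open()   # raises an exception if the device cannot be claimed
    print('Device opened successfully')
    device.close()
    device.destroy()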
Running Pre-trained Models
The NCSDK includes a number of pre-trained Caffe and TensorFlow models that are ready to run on the Movidius NCS. This is a great way to verify your setup and quickly experiment with different neural network architectures.
From the ncsdk directory, run:
cd examples/caffe
make
cd GoogLeNet
make run
This downloads the pre-trained GoogLeNet model, compiles it for the Movidius, and performs image classification on a sample image. You should see the top 5 classification results, something like:
Top predictions for examples/images/cat.jpg
ID Prediction Probability
--- ---------- -----------
281 'tabby, tabby cat' 0.3025
282 'tiger cat' 0.1993
278 'kit fox, Vulpes macrotis' 0.0799
288 'lynx, catamount' 0.0301
287 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor' 0.0203
To try other models, just change to the model directory and run make run. Options include AlexNet, GoogLeNet, SqueezeNet, tiny-yolo, and more.
These examples use a simple Python API provided by the NCSDK to load the compiled model, send input data, and receive inference results. In the next section, we'll see how to use this API to build custom applications.
Developing a Custom Application
While the pre-trained models are great for testing and demos, most real-world applications will require a custom neural network trained for a specific task. The NCSDK provides tools to convert models from popular frameworks to the Movidius format and a Python API to integrate the model into an application.
Let's walk through an end-to-end example of building an image classifier in Python using the Movidius NCS. We'll use a pre-trained MobileNet model, but the same process applies to custom-trained models.
import mvnc.mvncapi as mvnc
import numpy as np
from PIL import Image

# Path to the compiled model file
GRAPH_PATH = 'mobilenet.graph'

# Open the first attached NCS device
devices = mvnc.enumerate_devices()
device = mvnc.Device(devices[0])
device.open()

# Load the precompiled model and allocate it on the device
with open(GRAPH_PATH, 'rb') as f:
    graph_buffer = f.read()
graph = mvnc.Graph('MobileNet')
fifo_in, fifo_out = graph.allocate_with_fifos(device, graph_buffer)

# Load class labels
with open('labels.txt', 'r') as f:
    labels = [line.strip() for line in f]

# Read the image and preprocess it to match MobileNet's expected input
img = Image.open('example.jpg').convert('RGB')
img = img.resize((224, 224))
img = np.array(img).astype(np.float32)

# Normalize input to [-1, 1]
img = (img - 127.5) * 0.007843

# Run inference; the FIFOs created above carry FP32 data by default,
# and the SDK converts to the FP16 format the VPU uses internally
graph.queue_inference_with_fifo_elem(fifo_in, fifo_out, img, 'user object')
output, userobj = fifo_out.read_elem()

# Get top 5 results
top_indices = (-output).argsort()[:5]
print('Top predictions:')
for i in top_indices:
    print(labels[i], output[i])

# Clean up
fifo_in.destroy()
fifo_out.destroy()
graph.destroy()
device.close()
device.destroy()
This script does the following:
- Loads the pre-compiled model graph file
- Opens the Movidius device and allocates the model on it
- Reads the class labels from a file
- Preprocesses an input image (resize, normalize)
- Runs inference by sending the image to the model and reading the output
- Prints the top 5 predicted labels and probabilities
- Cleans up the device and resources
The key parts are loading the model onto the Movidius, preprocessing the input to match the model (MobileNet expects 224×224 RGB images normalized to [-1, 1]), and using the NCSDK FIFO API to pass data to and from the device. Note that the FIFOs created by allocate_with_fifos carry 32-bit floats by default; the SDK converts to and from the FP16 format the VPU uses internally.
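One benefit of the FIFO design is pipelining: you can queue a second inference while the first is still running, keeping the VPU busy instead of idling while the host prepares the next frame. Here is a sketch of that pattern, with the caveat that it assumes the graph and FIFOs from the script above plus a list of already-preprocessed float32 frames (the default FIFOs created by allocate_with_fifos hold only a couple of elements, so the sketch caps the in-flight count):
import time

def classify_stream(graph, fifo_in, fifo_out, frames, depth=2):
    # Keep up to `depth` inferences in flight so host and VPU overlap work
    results = []
    in_flight = 0
    start = time.perf_counter()
    for frame in frames:
        graph.queue_inference_with_fifo_elem(fifo_in, fifo_out, frame, None)
        in_flight += 1
        if in_flight == depth:
            output, _ = fifo_out.read_elem()
            results.append(output)
            in_flight -= 1
    # Drain any inferences still in flight
    while in_flight > 0:
        output, _ = fifo_out.read_elem()
        results.append(output)
        in_flight -= 1
    elapsed = time.perf_counter() - start
    print('%d frames in %.2f s (%.1f FPS)' % (len(frames), elapsed, len(frames) / elapsed))
    return results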
To convert another model to run on the Movidius, use the NCSDK compile tools:
mvNCCompile model.prototxt -w weights.caffemodel -s 12 -in input_name -on output_name -o compiled.graph
This compiles a Caffe model into the Movidius graph format, ready to use with the Python API (TensorFlow models are supported as well; see the NCSDK docs for the expected input format). The -s 12 argument sets the number of SHAVE cores to use (up to 12).
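mvNCCompile has two companion tools worth knowing about. mvNCProfile compiles a model, runs it on the stick, and produces a per-layer timing report, which is useful for spotting the layers that dominate inference time. mvNCCheck runs inference on both the NCS and the host framework and compares the outputs, which helps catch accuracy loss from the FP16 conversion. Both take the same style of arguments as mvNCCompile:
mvNCProfile model.prototxt -w weights.caffemodel -s 12
mvNCCheck model.prototxt -w weights.caffemodel -s 12 -in input_name -on output_name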
With the NCSDK API, you can integrate the Movidius into all kinds of edge AI applications, from object detection on a security camera to gesture recognition on a smart home device. The key is designing your neural network architecture and preprocessing pipeline to run efficiently on the Myriad VPU.
Optimizing for Performance and Power
To get the best performance and lowest power consumption from the Movidius NCS, consider the following tips:
- Use models designed for embedded/mobile deployment, such as MobileNet, SqueezeNet, or other mobile-oriented architectures (note the NCSDK consumes Caffe and TensorFlow models, not TensorFlow Lite files)
- Keep precision in mind: the NCSDK converts weights and activations to 16-bit floating point (FP16) at compile time to reduce computation and bandwidth, so verify that your model tolerates the reduced precision (mvNCCheck helps here)
- Fuse operations like batch norm and scale into conv layers to reduce data movement
- Aim for a compute-bound model that keeps the SHAVE cores busy – large, deep models tend to perform better than small, shallow ones
- Downsample input images as much as tolerable for your application – smaller inputs mean less data to process
- Compile your model with an appropriate number of SHAVE cores depending on its complexity – using all 12 cores can increase performance but also power consumption
- Reduce operating frequency and disable unused interfaces to lower power usage in production
- Use the NCSDK API to duty cycle the device, putting it into a low-power idle state when not processing data (one approach is sketched after this list)
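On that last point: the NCSDK doesn't expose an explicit sleep or power-state call as far as I know, so the simplest way to duty cycle is to tear the device down between bursts of work and reopen it when the next burst arrives. A sketch of that pattern, assuming graph_buffer has already been read from disk as in the earlier script (reallocation has a startup cost of its own, so this only pays off when idle periods are long):
import mvnc.mvncapi as mvnc

def run_burst(graph_buffer, frames):
    # Open the device and allocate the model only for the duration of the burst
    device = mvnc.Device(mvnc.enumerate_devices()[0])
    device.open()
    graph = mvnc.Graph('burst')
    fifo_in, fifo_out = graph.allocate_with_fifos(device, graph_buffer)
    results = []
    for frame in frames:
        graph.queue_inference_with_fifo_elem(fifo_in, fifo_out, frame, None)
        output, _ = fifo_out.read_elem()
        results.append(output)
    # Tear everything down so the stick idles at lower power between bursts
    fifo_in.destroy()
    fifo_out.destroy()
    graph.destroy()
    device.close()
    device.destroy()
    return results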
Following these guidelines can help you achieve the impressive performance/watt that makes the Movidius well-suited for AI at the edge.
Comparison to Other Edge AI Solutions
The Movidius is just one of many options in the rapidly growing edge AI accelerator space. Other popular choices include:
- NVIDIA Jetson series (Nano, TX2, Xavier): GPU-based modules with high performance but also higher cost and power consumption than the Movidius
- Google Coral: a USB accelerator and SoM featuring the Edge TPU ASIC, which offers performance in the same class as the Movidius for quantized TensorFlow Lite models
- Dedicated ASICs: chips from startups like Kneron, Gyrfalcon, and Mythic designed for ultra-low-power AI at the edge, but less flexible than general-purpose accelerators
The best choice depends on your specific application requirements and constraints. The Movidius offers a good balance of performance, power efficiency, and ease of development with the NCSDK API and supported frameworks. Its low price also makes it attractive for startups and makers experimenting with edge AI.
One downside of the Movidius is that the Myriad 2 VPU is a few years old at this point, and newer devices offer significantly more computing power. Intel's own Neural Compute Stick 2 is built on the newer Myriad X, and Intel is continuing to innovate with the Myriad line in its Keem Bay chip for edge servers.
Conclusion
The Intel Movidius Neural Compute Stick is a powerful tool for bringing AI workloads to the edge. With the Myriad 2 VPU architecture optimized for computer vision and neural network processing, the Movidius offers significant performance and efficiency gains compared to running models on CPU alone.
In this guide, we covered:
- The key features and capabilities of the Myriad 2 VPU
- Setting up an Ubuntu development environment and installing the NCSDK
- Running pre-trained models with the NCSDK
- Building custom applications in Python with the NCSDK API
- Tips for optimizing models for best performance and power efficiency on the Movidius
- Comparing the Movidius to other edge AI hardware solutions
For developers looking to add AI smarts to edge devices, the Movidius is definitely worth evaluating. While it's not the fanciest or highest-performing option, it hits a sweet spot of capability, usability, and cost.
Of course, AI at the edge is still an evolving field and new solutions are emerging all the time. Picking an accelerator is just one part of the challenge – you also need to consider factors like security, connectivity, model management, and more for real-world deployments.
The techniques and examples covered in this guide provide a foundation to start from. I encourage you to experiment with the Movidius, prototype your ideas, and see what you can create! Feel free to reach out with any questions.