MobileNet SSD object detection using the Intel Neural Compute Stick 2 and a Raspberry Pi

I had successfully run ssd_mobilenet_v2_coco object detection using an Intel NCS 2 on an Ubuntu PC in the past but had not tried this on a Raspberry Pi running Raspbian as, if I remember correctly, it was not supported at that time. Now that OpenVINO does run on Raspbian, I thought it would be fun to get this working on the Pi. The main task consisted of getting the CSSD rt-ai Stream Processing Element (SPE) to compile and run using Raspbian and its version of OpenVINO rather than the usual x86_64 Ubuntu system.

Compiled rt-ai SPEs use Qt, so it was a case of putting together a different qmake .pro file to reflect the particular requirements of the Raspbian environment. Once I had sorted out the slight changes to the link command, the SPE crashed as soon as it tried to read in the model .xml file. I was stuck here for quite a long time until I realized that I was missing a compiler flag, which meant that my binary was incompatible with the OpenVINO inference engine. This was fixed by adding the following line to the Raspbian .pro file:

QMAKE_CXXFLAGS += -march=armv7-a

Once that was added, the code worked perfectly. To test, I set up a simple rt-ai design:


For this test, the CSSDPi SPE was the only thing running on the Pi itself (rtai1); the other two SPEs were running on a PC (default). The frames captured by the webcam and sent to the CSSDPi SPE were 1280 x 720 at 30fps. The CSSDPi SPE was able to process 17 frames per second, not at all bad for a Raspberry Pi 3 model B! Incidentally, I had tried a similar setup using the Coral Edge TPU device and its version of the SSD SPE, CoralSSD, but the performance was nowhere near as good. One obvious difference is that CoralSSD is a Python SPE because, at that time, the C++ API was not documented. One day I may change this to a C++ SPE and then the comparison will be more representative.

Of course you can use multiple NCS 2s to get better performance if required, although I haven't tried this on the Pi as yet. The same could be done with the Coral, given suitable code. In any case, rt-ai has the Scaler SPE that allows any number of edge inference devices on any number of hosts to be used together to accelerate processing of a single flow. I have to say, the ability to use rt-ai and rtaiDesigner to quickly deploy distributed stream processing networks to heterogeneous hosts is a lot of fun!

The motivation for all of this is to move from x86 processors with big GPUs to Raspberry Pis with edge inference accelerators to save power. The driveway project has been running for months now, heating up the basement very nicely. Moving from YOLOv3 on a GTX 1080 to MobileNet SSD and a Coral edge TPU saved about 60W, moving the entire thing from that system to the Raspberry Pi has probably saved a total of 80W or so.

This is the design now running full time on the Pi:


CPU utilization for the CSSDPi SPE is around 21% and it uses around 23% of the RAM. The raw output of the CSSDPi SPE is fed through a filter SPE that only outputs a message when a detection has passed certain criteria to avoid false alarms. Then, I get an email with a frame showing what triggered the system. The View module is really just for debugging – this is the kind of thing it displays:


The metadata displayed on the right is what the SSDFilter SPE uses to determine whether the detection should be reported or not. It requires a configurable number of sequential frames with a similar detection (e.g. car rather than something else) above a configurable confidence level before emitting a message. Then, it has a hold-off in case the detected object remains in the frame for a long time and, even then, requires a defined gap before that detection is re-armed. It seems to work pretty well.
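To make the behaviour concrete, here is a minimal sketch of that kind of temporal filtering logic. This is not the actual SSDFilter code and the parameter names are invented; it just illustrates the sequential-frame, confidence, hold-off and re-arm ideas:

import time

class DetectionFilterSketch:
    def __init__(self, min_frames=5, min_confidence=0.6, rearm_gap=30.0):
        self.min_frames = min_frames          # sequential similar detections required
        self.min_confidence = min_confidence  # confidence threshold
        self.rearm_gap = rearm_gap            # seconds with no detection before re-arming
        self.run_length = 0
        self.last_label = None
        self.last_detection_time = 0.0
        self.armed = True

    def process(self, label, confidence):
        now = time.time()
        if confidence >= self.min_confidence:
            self.run_length = self.run_length + 1 if label == self.last_label else 1
            self.last_label = label
            self.last_detection_time = now
        else:
            self.run_length = 0
            self.last_label = None
            # re-arm only after a defined gap with no qualifying detections
            if not self.armed and now - self.last_detection_time > self.rearm_gap:
                self.armed = True
        if self.armed and self.run_length >= self.min_frames:
            # hold off while the object remains in frame; stay disarmed until the gap elapses
            self.armed = False
            return True    # emit the filtered detection message (e.g. trigger the email)
        return False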

One advantage of using CSSD rather than CYOLO as before is that, while I don’t get specific messages for things like a USPS van, it can detect a wider range of objects:


Currently the filter only accepts all the COCO vehicle classes and the person class while rejecting others, all in the interest of reducing false detection messages.

I had expected to need a Raspberry Pi 4 (mine is on its way 🙂 ) to get decent performance but clearly the Pi 3 is well able to cope with the help of the NCS 2.

Optimizing inference engine utilization with multiplexed streams

One of the issues with the GPU-based CYOLO (for example) is that it uses about 8GB of GPU memory meaning that, even on a GTX 1080 ti GPU card, it is only possible to have one instance of the CYOLO SPE on any one GPU card. A way around this is to run multiple streams through a single SPE instance. The architecture of rt-ai always supported fan in (i.e. stream multiplexing) but not fan out (i.e. stream demultiplexing). The new FanOut module solves this problem. The screen capture above shows the new FanOut SPE running with the Intel NCS 2-based CSSD SPE. Video streams from three cameras are multiplexed on the CSSD SPE’s input pin. The multiplexed output is then passed to the FanOut SPE which demultiplexes the composite stream to up to eight individual streams. The screen capture also shows the FanOut configuration dialog – you just enter the source SPE name for the stream to be associated with each output pin.
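As a rough illustration of the FanOut idea (this is not the actual rt-ai code – the class and field names here are made up), demultiplexing by source SPE name amounts to a simple lookup:

class FanOutSketch:
    """Routes messages from a multiplexed stream to output pins by source SPE name."""
    def __init__(self, pin_sources):
        # pin_sources: up to eight source SPE names, one per output pin
        self.pin_for_source = {name: pin for pin, name in enumerate(pin_sources)}

    def route(self, message):
        # message is assumed to carry the name of the SPE that generated the frame
        pin = self.pin_for_source.get(message["source"])
        if pin is not None:
            self.send(pin, message)

    def send(self, pin, message):
        # stand-in for putting the message on the corresponding output pin
        print("pin %d: frame from %s" % (pin, message["source"]))

# three camera streams multiplexed through a single CSSD instance
fan_out = FanOutSketch(["Camera0", "Camera1", "Camera2"])
fan_out.route({"source": "Camera1", "frame": b"..."})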


Since my second NCS 2 had arrived, I was able to run the triple NCS configuration shown above. The old NCS didn't really contribute much in this case – the two NCS 2s were able to achieve an aggregate throughput of around 26 frames per second. This is shared between the three input streams of course.

The fan in/fan out multiplexing idea fits very well with the NCS 2 as you can just add more NCS 2s (or more likely, a special purpose multiple Myriad X board) to a node to increase aggregate throughput.

SSD object detection using the Neural Compute Stick 2 now has its own rt-ai stream processing element


Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.


This is the design, showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, but still interesting. It would be much better if I had the same model running on a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python-based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2


Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine but it is kind of awkward how the .pb file is generated so I created a new simple script to do this:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    # build an id -> name map; only the number of classes is actually used below
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names

def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs (batch of 416 x 416 RGB images)
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    # build the YOLOv3 graph and the ops that load the original Darknet weights
    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    # compute the output boxes; the frozen graph is cut at the resulting 'concat_1' node
    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        sess.run(load_ops)
        # freeze the variables into constants and write the graph out as yolo_v3.pb
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':
    tf.app.run()

This has the important filenames hardcoded – you just need to put yolov3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:

python3 script.py

and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The --input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:

source ~/intel/computer_vision_sdk/bin/setupvars.sh

in the same terminal session in which the demo will be run, in order for CPU mode to work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn't handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underscores. Otherwise it thinks a mouse is a donut and a monitor is a dog, which is a little confusing.
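If you prefer not to edit the file by hand, something like this one-off snippet does the conversion (file names as above, locations assumed):

# write coco.names out as yolo_v3.labels with spaces replaced by underscores
with open("coco.names") as src, open("yolo_v3.labels", "w") as dst:
    for line in src:
        dst.write(line.strip().replace(" ", "_") + "\n")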

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding --data_type FP16 to the mo_tf.py command line fixes that but unfortunately it reports that the NCS 2 doesn't support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image.resize_nearest_neighbor call into a tf.image.resize_bilinear call (by editing yolo_v3.py in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR. This worked fine in CPU mode but still failed to run on the NCS 2, just in a different way:


Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

*** Update: YOLOv3 does now work on the NCS 2 using the latest OpenVINO release.

The new ZED camera SPE and CYOLO SPE with support for depth cameras


The new SPE for the Stereolabs ZED depth camera is now working nicely, as is the new support for depth data in the CYOLO SPE. The extra depth information can be seen in the metadata display on the right of the screen capture – the annotation on the image itself still comes from the standard code but, since that is just for testing, it is ok.


This is the design used for testing. The ZED camera SPE has two outputs: one looks like a standard camera output while the other has both left and right images and the depth image. The CYOLO SPE can now accept either standard video messages or depth video messages using the appropriate input pin. The depth image adds about 3.7MB to each message so it isn’t a trivial overhead but the CYOLO module only ever outputs a standard video frame so the large payload is contained in the single link in this design. Even running everything on a busy machine with 1280 x 720 frames, the whole design still runs at around 15fps which is not too bad.
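Just to give an idea of what travels over that link, the two message types look roughly like this (the field names are mine, not the actual rt-ai message format):

from dataclasses import dataclass

@dataclass
class VideoMessage:
    timestamp: float
    jpeg: bytes             # standard compressed video frame

@dataclass
class DepthVideoMessage:
    timestamp: float
    left_jpeg: bytes        # left camera image
    right_jpeg: bytes       # right camera image
    depth: bytes            # depth image (the roughly 3.7MB extra per message)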

 

Old-school IoT sensing with absolutely no AI

Winter is coming, as one might say. Actually it is here, with freezing temperatures and inches of snow. Time to turn on the heater in the garage. However, if someone leaves a door open it will heat the atmosphere in general and burn through all of the propane. Yes, it would be absolutely trivial to put magnetic sensors on the doors that turn off the heater if the door is open but that is no fun at all. This has actually been a long-running project with all kinds of interesting solutions, some even including Apache NiFi and other big data things. One solution used Insteon and actually allowed the system to automatically close doors. However, various disasters could occur and I turned this system off. In any case, the Insteon door sensors were not reliable.

The root of the problem is that the system really has to understand what is happening in the space, not just the states of various parts of it. Someone could be just about to back a car out when the system decides to shut a door. Not good. Seemingly, a perfect and autonomous solution is still a little way down the road. A short term hack seems in order, however!

Since I am working on rt-ai Edge right now, it was natural to think about how to use rt-ai Edge to solve this problem. I have some ZeroSensors around so I put one of them in the garage. I realized that, when the doors are shut, the garage is very dark and the temperature is stable at the thermostat setting (allowing for hysteresis). If a door is open, the garage gets illuminated in one way or another (garage door motor lights, ambient light, outside lights at night etc), which is a somewhat reliable, albeit indirect, indicator of door status. Also, the temperature will drop pretty quickly, giving an indication that either a door is open or the heater has failed.

Given how simple this is, I put together a quick filter SPE to process the light and temperature data from a ZeroSensor and send me an email if the doors are seemingly open for more than a preset time. The screen capture above shows what is going on. There are actually a couple of stream processing networks (SPNs) in action in this space, mainly running on the rtai0 node. One of them is the YOLO-based driveway vehicle detection system while the other is the garage environmental sensing network. A node can be running multiple SPNs at the same time – they are ships in the night and can be managed totally independently.
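The decision logic is about as simple as it sounds. A rough sketch (not the actual SensorFilter SPE code – the thresholds and field names are invented for illustration):

import time

class GarageDoorSketch:
    def __init__(self, light_threshold=50.0, temp_margin=2.0, open_time=300.0):
        self.light_threshold = light_threshold  # light level suggesting a door is open
        self.temp_margin = temp_margin          # degrees below the setpoint suggesting heat loss
        self.open_time = open_time              # seconds before an alert is sent
        self.open_since = None
        self.alerted = False

    def process(self, light, temperature, setpoint):
        now = time.time()
        seems_open = (light > self.light_threshold or
                      temperature < setpoint - self.temp_margin)
        if seems_open:
            if self.open_since is None:
                self.open_since = now
            if not self.alerted and now - self.open_since > self.open_time:
                self.alerted = True
                return "doors apparently open"     # e.g. send the email
        else:
            self.open_since = None
            if self.alerted:
                self.alerted = False
                return "doors apparently shut"     # the confirmation mentioned below
        return None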

 

This screen capture shows the configuration dialog for the SensorFilter SPE. I currently also have it set to confirm when the garage doors have been shut so that I know if someone else has taken care of the problem.

Another step would be to turn the heater off if a door is open and the temperature is dropping – right now, I am not planning to allow the door to be shut remotely for the reasons mentioned earlier. At least turning the heater off avoids wasting a load of energy. I can control the heater via Insteon. I would just need another SPE to provide the interface between the filtered alerts and my Insteon server.

Still, one day it would be nice to do this properly. This entails understanding the state of the space – things like are there any people in it? This is not completely trivial as the system can’t always see someone inside a car. However, it can assume (until I get an autonomous vehicle) that if a car is moving or has moved, there must be someone driving and, until the system detects that person leaving the space, it must also assume that it should take no autonomous action. However, it needs to maintain state in order to be sure that there’s nobody in the space (or in a car just outside the space) so that it can take control. I have a feeling that there may be quite a few corner cases with this but it would be fun to try, even if it only simulates trying to close doors.

Stereolabs ZED depth camera with YOLO

The Stereolabs ZED camera is a quite effective way of generating depth-enhanced video streams and it seemed like it was time to get one and integrate it with rt-ai. I have worked with one of these before in a different context and I knew that using the ZED was pretty straightforward.

The screen capture above shows the ZED YOLO C++ example code running. The mug in the shot was a bit too close to the monitor to get picked up and my hand was probably too close in general hence the strange 4.92m depth reading. However, it does seem to work pretty well. It even picked up the image of the monitor on the screen as a monitor.

Just as a note, I did have to modify the main.cpp code to run. At line 49, I had to add a std:: in front of an isfinite() call for some reason. Maybe something odd on my Ubuntu system. Also, to get the standard samples to build, I had to add libxmu-dev as another dependency.

Now comes the task of adding this to rt-ai. I am going to split this into two parts: the first is to produce a new camera SPE that works with the ZED and outputs the depth image in addition to the normal camera image. Then, the CYOLO SPE will be modified to accept optional depth information and perform the processing needed to generate the actual object depth value. This seems like a more general solution as the ZED SPE then looks like a standard depth camera, while the upgraded CYOLO will be able to work with any depth camera.
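For the depth value itself, one straightforward approach (a sketch of what the upgraded CYOLO might do, not its actual code) is to take the median depth over the central part of each detection box:

import numpy as np

def object_depth(depth_map, box):
    """depth_map: HxW array of depths in metres; box: (x, y, w, h) detection rectangle."""
    x, y, w, h = box
    # use the central half of the box to avoid background pixels at the edges
    cx0, cy0 = x + w // 4, y + h // 4
    cx1, cy1 = x + 3 * w // 4, y + 3 * h // 4
    region = depth_map[cy0:cy1, cx0:cx1]
    valid = region[np.isfinite(region)]
    return float(np.median(valid)) if valid.size else float("nan")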

CYOLO – a pure C++ implementation of a YOLOv3 SPE for rt-ai

The Python-based YOLOv3 SPE has been working for a while now but the performance was a little disappointing at 2 or 3 fps using 1280 x 720 frames on an i7 5820K CPU/GTX 1080ti GPU machine. I was interested to see how much effect the Python code was having on overall performance. To do this I implemented the C++ rt-ai SPE API and added the C version of the YOLOv3 demo code. The result is shown above and this version now runs at just over 14 fps (17 fps at 640 x 480) which is very usable.

While Python is very convenient, it is clearly (and unsurprisingly) more efficient to use C/C++ so I will probably do that in the future where possible. The main side-effect is that rtaiDesigner has to deploy the correct compiled SPE for the target node (typically x64 or ARM) and ensure that any shared libraries that are not part of the standard install are included too. A Dockerized version would of course solve the dependency problem and just require a container for each target architecture.