rt-ai dynamic and adaptive parallel inference using the new Scaler SPE

One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.

The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.

Pipelines are configured with a maximum to the number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.

The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.

Optimizing inference engine utilization with multiplexed streams

One of the issues with the GPU-based CYOLO (for example) is that it uses about 8GB of GPU memory meaning that, even on a GTX 1080 ti GPU card, it is only possible to have one instance of the CYOLO SPE on any one GPU card. A way around this is to run multiple streams through a single SPE instance. The architecture of rt-ai always supported fan in (i.e. stream multiplexing) but not fan out (i.e. stream demultiplexing). The new FanOut module solves this problem. The screen capture above shows the new FanOut SPE running with the Intel NCS 2-based CSSD SPE. Video streams from three cameras are multiplexed on the CSSD SPE’s input pin. The multiplexed output is then passed to the FanOut SPE which demultiplexes the composite stream to up to eight individual streams. The screen capture also shows the FanOut configuration dialog – you just enter the source SPE name for the stream to be associated with each output pin.


Since my second NCS 2 has arrived I was able to run the triple NCS configuration shown above. The old NCS didn’t really contribute much in this case – the two NCS 2s were able to get an aggregate throughput of around 26 frames per second. This is shared between the three input streams of course.

The fan in/fan out multiplexing idea fits very well with the NCS 2 as you can just add more NCS 2s (or more likely, a special purpose multiple Myriad X board) to a node to increase aggregate throughput.

SSD object detection using the Neural Compute Stick 2 now has its own rt-ai stream processing element


Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.


This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2


Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine but it is kind of awkward how the .pb file is generated so I created a new simple script to do this:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names
    
def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        sess.run(load_ops)
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':
    tf.app.run()

This has the important filenames hardcoded – you just need to put yolo_v3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:

python3 script.py

and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The –input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:

~/intel/computer_vision_sdk/bin/setupvars.sh

in the same terminal session as the demo will be run in order for CPU mode to work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn’t handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underlines. Otherwise it thinks a mouse is a donut and a monitor a dog which is a little confusing.

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding –data_type FP16 to the mo_tf.py command line fixes that but unfortunately it reports that the NCS 2 doesn’t support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image_resize_nearest_neighbor call into a tf.image.resize.bilinear call (by editing yolo_v3.py in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR.  This worked fine in CPU mode but still failed to run on the NCS 2 except in a different way:


Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

*** Update: YOLOv3 does now work on the NCS 2 using the latest OpenVINO release.

The new ZED camera SPE and CYOLO SPE with support for depth cameras


The new SPE for the Stereolabs ZED depth camera is now working nicely, as is the new support for depth data in the CYOLO SPE. The extra depth information can be seen in the metadata display on the right of the screen capture – the annotation on the image itself is still the standard code but, since that is just for testing, it is ok.


This is the design used for testing. The ZED camera SPE has two outputs: one looks like a standard camera output while the other has both left and right images and the depth image. The CYOLO SPE can now accept either standard video messages or depth video messages using the appropriate input pin. The depth image adds about 3.7MB to each message so it isn’t a trivial overhead but the CYOLO module only ever outputs a standard video frame so the large payload is contained in the single link in this design. Even running everything on a busy machine with 1280 x 720 frames, the whole design still runs at around 15fps which is not too bad.

 

Old-school IoT sensing with absolutely no AI

Winter is coming, as one might say. Actually it is here, with freezing temperatures and inches of snow. Time to turn on the heater in the garage. However, if someone leaves a door open it will heat the atmosphere in general and burn through all of the propane. Yes, it would be absolutely trivial to put magnetic sensors on the doors that turn off the heater if the door is open but that is no fun at all. This has actually been a long-running project with all kinds of interesting solutions, some even including Apache NiFi and other big data things. One solution used Insteon and actually allowed the system to automatically close doors. However, various disasters could occur and I turned this system off. In any case, the Insteon door sensors were not reliable.

The root of the problem is that the system really has to understand what is happening in the space, not just the states of various parts of it. Someone could be just about to back a car out when the system decides to shut a door. Not good. Seemingly, a perfect and autonomous solution is still a little way down the road. A short term hack seems in order, however!

Since I am working on rt-ai Edge right now, it was natural to think about how to use rt-ai Edge to solve this problem. I have some ZeroSensors around so I put one of them in the garage. I realized that, when the doors are shut, the garage is very dark and the temperature stable at the thermostat setting (allowing for hysteresis). If a door is open, the garage gets illuminated in one way or another (garage door motor lights, ambient light, outside lights at night etc) and is a somewhat reliable, albeit indirect, indicator of door status. Also, the temperature will drop pretty quickly, giving an indication that either a door is open or the heater has failed.

Given how simple this is, I put together a quick filter SPE to process the light and temperature data from a ZeroSensor and send me an email if the doors are seemingly open for more than a preset time. The screen capture above shows what is going on. There are actually a couple of stream processing networks (SPNs) in action in this space, mainly running on the rtai0 node. One of them is the YOLO-based driveway vehicle detection system while the other is the garage environmental sensing network. A node can be running multiple SPNs at the same time – they are ships in the night and can be managed totally independently.

 

This screen capture shows the configuration dialog for the SensorFilter SPE. I currently also have it set to confirm when the garage doors have been shut so that I know if someone else has taken care of the problem.

Another step would be to turn the heater off if a door is open and the temperature is dropping – right now, I am not planning to allow the door to be shut remotely for the reasons mentioned earlier. At least turning the heater off avoids wasting a load of energy. I can control the heater via Insteon. I would just need another SPE to provide the interface between the filtered alerts and my Insteon server.

Still, one day it would be nice to do this properly. This entails understanding the state of the space – things like are there any people in it? This is not completely trivial as the system can’t always see someone inside a car. However, it can assume (until I get an autonomous vehicle) that if a car is moving or has moved, there must be someone driving and, until the system detects that person leaving the space, it must also assume that it should take no autonomous action. However, it needs to maintain state in order to be sure that there’s nobody in the space (or in a car just outside the space) so that it can take control. I have a feeling that there may be quite a few corner cases with this but it would be fun to try, even if it only simulates trying to close doors.

Stereolabs ZED depth camera with YOLO

The Stereolabs ZED camera is a quite effective way of generating depth-enhanced video streams and it seemed like it was time to get one and integrate it with rt-ai. I have worked with one of these before in a different context and I knew that using the ZED was pretty straightforward.

The screen capture above shows the ZED YOLO C++ example code running. The mug in the shot was a bit too close to the monitor to get picked up and my hand was probably too close in general hence the strange 4.92m depth reading. However, it does seem to work pretty well. It even picked up the image of the monitor on the screen as a monitor.

Just as a note, I did have to modify the main.cpp code to run. At line 49, I had to add a std:: in front of an isfinite() call for some reason. Maybe something odd on my Ubuntu system. Also, to get the standard samples to build, I had to add libxmu-dev as another dependency.

Now comes the task of adding this to rt-ai. I am going to split this into two: the first is to produce a new camera SPE that works with the ZED and outputs the depth image in addition to the normal camera image. Then, the CYOLO SPE will be modified to accept optional depth information and perform the processing to generate the actual object depth value. This seems like a more general solution as the ZED SPE then looks like a standard depth camera while the upgraded CYOLO will be able to work with any depth camera.

CYOLO – a pure C++ implementation of a YOLOv3 SPE for rt-ai

The Python-based YOLOv3 SPE has been working for a while now but the performance was a little disappointing at 2 or 3 fps using 1280 x 720 frames on an i7 5820K CPU/GTX 1080ti GPU machine. I was interested to see how much effect the Python code was having on overall performance. To do this I implemented the C++ rt-ai SPE API and added the C version of the YOLOv3 demo code. The result is shown above and this version now runs at just over 14 fps (17 fps at 640 x 480) which is very usable.

While Python is very convenient, it is clearly (and unsurprisingly) more efficient to use C/C++ so I will probably do that in the future where possible. The main side-effect is that rtaiDesigner has to deploy the correct compiled SPE for the target node (typically x64 or ARM) and that any shared libraries that are not part of the standard install are included too. A Dockerized version would of course solve the dependency problem and just require a container for each target architecture.

Dockerized YOLOv3 rt-ai SPE = YAOD (yet another object detector)

I had intended to be doing something completely different today (working on auto-compiling highlight reels of interesting events generated from the prototype production rt-ai object detection system) but managed to get sidetracked by reading about Darknet-based YOLOv3.  As Darknet itself is in C and compiles to a shared library this was a good candidate for a Dockerized stream processing element. I used a cuDNN image from NVIDIA as the base since it provides pretty much everything required – I just had to add in the rt-ai SPE library software and compile Darknet on top of that.

The results are pretty good. The preview above shows some detected objects. I discovered that it could detect toothbrushes which is why I am waving one around. It also did a good job of picking up the second mouse just by my left shoulder. 2fps with 1280 x 720 frame size is a little disappointing but this seems to be due to the Python parts of the code since the C demo provided with the library runs much faster. It is a little faster with preview turned off, however (which would be the production mode anyway).

Speaking of production, it does have a problem as it consumes just over 7GB of memory on my GTX 1080 ti GPU card. This means that one GPU card can’t run two instances simultaneously, unlike with the TensorFlow SSD detector. In fact, I can get two instances of that working on a GTX 1080 card with 8GB total memory.


Just for completeness, this is the design which looks just like the usual test designs. The Docker container is built and pushed to a private Docker registry automatically when the design is generated. The target node then just pulls the image from the registry when the design starts up.


This is the MediaView output showing the metadata. The metadata format is equivalent to that generated by the TensorFlow object detector so that they are completely interchangeable.

Completed ZeroSensors all ready for long term data collection

Finally this is a ZeroSensor all ready to go into full time service, capturing video, audio and environmental data. The goal is to use this data, and that from other cameras around the space, as training data for machine learning systems.

One specific goal is to create an anomaly detector with minimal supervision. As much as possible, it will learn from experience. This is kind of tricky as it requires detection of unknown length sequences depending on the circumstances. I am intrigued by the ideas behind the Universal Translator but not sure how much could carry over to this application. This paper reviews some of the techniques usually applied, at least for video processing. The situation here is a little different as there are quite different types of features involved. My plan is to preprocess video and audio to recognize salient features (using object detection or whatever) and then input these features, along with environmental sensor data, in the form of uniform time-slotted data sets to the anomaly detector. This doesn’t help with detecting the length of an interesting sequence – that’s the fun part of the project.