Using shared memory for rt-ai inter-SPE transfers

The screen capture above couldn’t have been obtained previously as it is passing uncompressed (RGB888) video between rt-ai SPEs on the same node (a Jetson Nano in this case). The CVideoView window shows the output of the simple network, which uses the CSSDJetson SPE to classify objects, and also computes the frames per second and latency of received frames. The source of the frames is a Logitech C920 webcam running at 1280 x 720, 30fps. It shows that the latency is around 128ms at around 15fps.

This screen capture shows what happens when shared memory isn’t used. Actually, the latency here is misleading as it seems to be the link from the CUVCCam SPE to the MQTT server that is causing the bottleneck when running uncompressed video. The latency goes monotonically upwards until there is no memory left as there is no throttling on that interface since normally it isn’t a problem.
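To put a number on why uncompressed video is hard on the MQTT link, here is the raw data rate worked out from the figures quoted above (1280 x 720, RGB888, 30fps). This is plain arithmetic rather than a measurement, and it ignores any message framing overhead.

```python
# Raw data rate of uncompressed RGB888 video at the resolution and frame
# rate quoted in the post (1280 x 720 at 30fps from the C920).
WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 3  # RGB888: one byte each for R, G and B
FPS = 30

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_second = bytes_per_frame * FPS

print(f"{bytes_per_frame} bytes per frame")        # 2764800 bytes per frame
print(f"{bytes_per_second / 1e6:.1f} MB/s")        # 82.9 MB/s
print(f"{bytes_per_second * 8 / 1e6:.1f} Mbit/s")  # 663.6 Mbit/s
```

Roughly 83 MB/s has to be copied for every hop, which is why a link with no throttling can fall behind and queue frames until memory runs out.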

There doesn’t seem to be much benefit when passing smaller messages between SPEs.

The screen capture above shows shared memory being used when transferring JPEG frames. The one below is with shared memory support turned off.

This just shows that bouncing off the MQTT server within the same node is pretty efficient, at least when compared to the latency of the inference.

Being able to pass large messages around efficiently, even if only point to point within the same node, is quite a step forward by itself. For example, it makes it practical to create networks that pass RGBD frames around.

Shared memory support in rt-ai2 uses the Qt QSharedMemory and QSystemSemaphore wrappers to make things simple. When a design is generated, rtaiDesigner determines if shared memory has been enabled for the network, if the publisher and subscriber are on the same node and if the connection is point to point (i.e. exactly one subscriber). If so, the publisher and subscriber SPEs are told to use shared memory instead of MQTT for that particular connection. The SPE configuration file for the publisher SPE also includes the shared memory slot size to use and how big the pending transmission queue should be. The system is set up at the moment to always use three shared memory slots forming a rotating buffer. The shared memory slots are created by the publisher and attached by the subscriber.
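The three conditions that rtaiDesigner checks can be sketched as a simple predicate. This is only an illustration of the rule as described above, not rtaiDesigner's actual code: the `Connection` class, its field names and the `use_shared_memory` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Hypothetical description of one publisher output in a design."""
    publisher_node: str
    subscriber_nodes: List[str] = field(default_factory=list)


def use_shared_memory(conn: Connection, shm_enabled: bool) -> bool:
    """Return True only if all three conditions from the post hold."""
    if not shm_enabled:                   # shared memory enabled for the network
        return False
    if len(conn.subscriber_nodes) != 1:   # point to point: exactly one subscriber
        return False
    # publisher and subscriber must be on the same node
    return conn.subscriber_nodes[0] == conn.publisher_node


local = Connection("nano1", ["nano1"])
remote = Connection("nano1", ["desktop"])
fanout = Connection("nano1", ["nano1", "nano1"])

print(use_shared_memory(local, shm_enabled=True))    # True
print(use_shared_memory(remote, shm_enabled=True))   # False, different nodes
print(use_shared_memory(fanout, shm_enabled=True))   # False, two subscribers
print(use_shared_memory(local, shm_enabled=False))   # False, feature disabled
```

Any connection that fails the test simply stays on MQTT, so the optimization never changes what a design can express.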

To minimize latency, every time the publisher places a new message in the next shared memory slot, it releases a QSystemSemaphore to unblock a thread in the subscriber that can then extract the message, free the shared memory slot and process the received message.
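The rotating buffer and semaphore handshake can be sketched in a few lines. This is a minimal single-process sketch using Python threads and `threading.Semaphore` as stand-ins for QSharedMemory and QSystemSemaphore, so it shows the pattern rather than the real implementation. The `SlotRing` class is hypothetical, and the second semaphore that counts free slots is an assumption: the post only describes the semaphore that wakes the subscriber, and a real publisher would use its pending transmission queue rather than block when all slots are full.

```python
import threading

SLOT_COUNT = 3  # the post says three slots form the rotating buffer


class SlotRing:
    """Single-publisher, single-subscriber ring of message slots."""

    def __init__(self, slot_count: int = SLOT_COUNT):
        self.slots = [None] * slot_count
        self.filled = threading.Semaphore(0)         # messages waiting to be read
        self.free = threading.Semaphore(slot_count)  # empty slots (assumption)
        self.write_index = 0  # only touched by the publisher
        self.read_index = 0   # only touched by the subscriber

    def publish(self, message: bytes) -> None:
        self.free.acquire()                    # wait for an empty slot
        self.slots[self.write_index] = message
        self.write_index = (self.write_index + 1) % len(self.slots)
        self.filled.release()                  # unblock the subscriber thread

    def receive(self) -> bytes:
        self.filled.acquire()                  # block until a message arrives
        message = self.slots[self.read_index]  # extract the message
        self.slots[self.read_index] = None     # free the slot
        self.read_index = (self.read_index + 1) % len(self.slots)
        self.free.release()
        return message


def run_demo(count: int = 10) -> list:
    """Push `count` messages through the ring and return what was received."""
    ring = SlotRing()
    received = []

    def subscriber() -> None:
        for _ in range(count):
            received.append(ring.receive())

    thread = threading.Thread(target=subscriber)
    thread.start()
    for i in range(count):
        ring.publish(f"frame {i}".encode())
    thread.join()
    return received


print(run_demo(5))
```

Because the subscriber blocks on the semaphore rather than polling, it wakes as soon as a slot is filled, which is the latency benefit described above.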

This implementation of shared memory seems to work very well and is highly reliable. In principle, it could be extended to support multiple subscribers by replicating the shared memory slot structure for each subscriber.

Jetson Nano SSD-Mobilenet-v2 SPE for rt-ai

Following on from the earlier work with the Jetson Nano, the SSD-Mobilenet-v2 model is now running as an rt-ai Stream Processing Element (SPE) for Jetson and so is fully integrated with the rt-ai system. Custom models created using transfer learning can also be used – it’s just a case of setting the model name in the SPE’s configuration and placing the required model files on the rt-ai data server. Since models are automatically downloaded at runtime if necessary, it’s pretty trivial to change the model being used on an existing Stream Processing Network (SPN).

The screen capture above shows the rt-ai design that generated the implementation. Here I am using the UVCCam SPE so that the video is sourced from a webcam but any of the other rt-ai video sources (such as RTSPCam) could be used, simply by replacing the camera SPE in the design using the graphical editor – originally this design used RTSPCam in fact.

Using 1280 x 720 video frames, the SSDJetson SPE processes around 17fps. This is not bad but less than the 21fps achieved by the monolithic example code. The problem is that, in order to achieve the one to many and many to one, heterogeneous multi-node network graphical design capability, rt-ai currently uses MQTT brokers to move data and provide multicast as necessary. Even when the broker and the SPEs are running on the same node, it is obviously less efficient than pointer passing within monolithic code.

This “inefficiency of generality” isn’t really visible on powerful x86 machines but has an impact on devices like the Jetson Nano and Raspberry Pi. The solution to this is to recognize such local links and side-step the MQTT broker interface using shared memory. This optimization will be done automatically in rtaiDesigner when it generates the configurations for each SPE in an SPN, flagging appropriate nodes as sources or sinks of shared memory links when both source and sink SPEs reside on the same node.

The ghost in the AI machine

The driveway monitoring system has been running full time for months now and it’s great to know if a vehicle or a person is moving on the driveway up to the house. The only bad thing is that it will give occasional false detections like the one above. This only happens at night and I guess there’s enough correct texture to trigger the “person” response with a very high confidence. Those white streaks might be rain or bugs being illuminated by the IR light. It also only seems to happen when the trash can is out for collection – it is in the frame about half way out from the center to the right.

It is well known that the image recognition capabilities of convolutional networks aren’t always exactly what they seem and this is a good example of the problem. Clearly, in this case, MobileNet feature detectors have detected things in small areas with a particular spatial relationship and added these together to come to the completely wrong conclusion. My problem is how to deal with these false detections. A couple of ideas come to mind. One is to use a different model in parallel and only generate an alert if both detect the same object at (roughly) the same place in the frame. Or instead of another CNN, use semantic segmentation to detect the object in a somewhat different way.
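The first idea (only alert if two models agree on the same object at roughly the same place) could look something like the sketch below. It uses intersection over union (IoU) as the measure of "roughly the same place", which is my choice of metric rather than anything the post specifies. The `Detection` type, the `confirmed` function and the 0.5 threshold are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def iou(a: Tuple[float, float, float, float],
        b: Tuple[float, float, float, float]) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def confirmed(primary: List[Detection], secondary: List[Detection],
              min_iou: float = 0.5) -> List[Detection]:
    """Keep primary detections that the second model also reports nearby."""
    return [
        p for p in primary
        if any(s.label == p.label and iou(p.box, s.box) >= min_iou
               for s in secondary)
    ]


real = Detection("person", 0.90, (100, 100, 200, 300))
ghost = Detection("person", 0.95, (900, 400, 1000, 600))
other_model = [Detection("person", 0.80, (110, 105, 205, 310))]

print(confirmed([real, ghost], other_model))  # only the first detection survives
```

The ghost is dropped here only because the second model does not report it; if both models shared the same blind spot, this approach would not help.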

Whatever, it is a good practical demonstration of the fact that these simple neural networks don’t in any way understand what they are seeing. However, they can certainly be used as the basis of a more sophisticated system which adds higher level understanding to raw detections.

Object detection on the Raspberry Pi 4 with the Neural Compute Stick 2

Following on from the Coral USB experiment, the next step was to try it out with the NCS 2. Installation of OpenVINO on Raspbian Buster was straightforward. The rt-ai design was basically the same as for the Coral USB experiment but with the CoralSSD SPE replaced with the OpenVINO equivalent called CSSDPi. Both SPEs run ssd_mobilenet_v2_coco object detection.

Performance was pretty good – 17fps with 1280 x 720 frames. This is a little better than the Coral USB accelerator attained but then again the OpenVINO SPE is a C++ SPE while the Coral USB SPE is a Python SPE, and image preparation and post processing take their toll on performance. One day I am really going to use the C++ API to produce a new Coral USB SPE so that the two are on a level playing field. The raw inference time on the Coral USB accelerator is about 40ms or so, meaning that there is plenty of opportunity for higher throughput.

Object detection on the Raspberry Pi 4 with the Coral USB accelerator

SSD object detection with the Coral USB accelerator had been running on a Raspberry Pi 3 but the performance was disappointing and I was curious to see what would happen on the Raspberry Pi 4.

This is the test rt-ai design. The UVCCam and MediaView SPEs are running on an Ubuntu desktop, the CoralSSD SPE is running on the Raspberry Pi 4. It is getting a respectable 12fps with 1280 x 720 frames (an earlier version of this post had reported much worse performance but that was due to some silly image loading code). The utilization of one CPU core is around 93% which is fair enough for a Python SPE. I am sure that a C++ version of this SPE would be considerably faster again.

Getting this running at all was interesting as the Pi 4 requires Raspbian Buster and that comes with Python 3.7 which is not supported by the edgetpu_api toolkit at this point in time.

After writing the original blog post I discovered that in fact it is trivial to convert the edgetpu_api installation to work with Python 3.7. Without doing any virtualenv and Python 3.5 stuff, just run the install script (modified as described below to recognize the Pi 4 and fix the sudo bug) and enter these commands:

cd /usr/local/lib/python3.7/dist-packages/edgetpu/swig
sudo cp

Turns out all it needed was a correctly named .so file to match the Python version. Anyway, if you want to go the Python 3.5 route…

The ARM version of the Python library is only compiled for Python 3.5. So, Python 3.5 needs to be installed alongside Python 3.7. To do this, download the GZipped source from here and expand and build with:

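# Unpack the Python 3.5.7 source
tar xzf Python-3.5.7.tgz
cd Python-3.5.7
# OpenSSL headers are needed to build the ssl module (used by pip)
sudo apt-get install libssl-dev
./configure --enable-optimizations
# altinstall avoids replacing the system python3
sudo make -j4 altinstall
# Create and activate a Python 3.5 virtual environment
virtualenv --python=python3.5 venv
source venv/bin/activate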

The result of all of this should be Python 3.5 available in a virtual environment. Any specific packages that need to be installed should be installed using pip3.5 as required. Regarding numpy, I found that the install didn’t work for some reason (there were missing dependencies when imported) and I had to use this command (as described here):

pip3.5 install numpy --upgrade --no-binary :all:

Now it is time to install the edgetpu_api which is basically a case of following the instructions here. However, the install script has a small bug and also will not recognize the Pi 4.

Modify the install script to recognize the Pi 4 by adding this after line 59:

  elif [[ "${MODEL}" == "Raspberry Pi 4 Model B Rev"* ]]; then
    info "Recognized as Raspberry Pi 4 B."

Once that is added, go to line 128 and replace it with:

sudo udevadm control --reload-rules && sudo udevadm trigger

The original is missing the second sudo. Once that is done, the Coral USB accelerator should be able to run the bird classifier example.

MobileNet SSD object detection using the Intel Neural Compute Stick 2 and a Raspberry Pi

I had successfully run ssd_mobilenet_v2_coco object detection using an Intel NCS2 running on an Ubuntu PC in the past but had not tried this using a Raspberry Pi running Raspbian as it was not supported at that time (if I remember correctly). Now, OpenVINO does run on Raspbian so I thought it would be fun to get this working on the Pi. The main task consisted of getting the CSSD rt-ai Stream Processing Element (SPE) compiling and running using Raspbian and its version of OpenVINO rather than the usual x86-64 Ubuntu system.

Compiled rt-ai SPEs use Qt so it was a case of putting together a different .pro qmake file to reflect the particular requirements of the Raspbian environment. Once I had sorted out the slight link command changes, the SPE crashed as soon as it tried to read in the model .xml file. I got stuck here for quite a long time until I realized that I was missing a compiler argument that meant that my binary was incompatible with the OpenVINO inference engine. This was fixed by adding the following line to the Raspbian .pro file:

QMAKE_CXXFLAGS += -march=armv7-a

Once that was added, the code worked perfectly. To test, I set up a simple rt-ai design:

For this test, the CSSDPi SPE was the only thing running on the Pi itself (rtai1); the other two SPEs were running on a PC (default). The incoming captured frames from the webcam to the CSSDPi SPE were 1280 x 720 at 30fps. The CSSDPi SPE was able to process 17 frames per second, not at all bad for a Raspberry Pi 3 Model B! Incidentally, I had tried a similar setup using the Coral Edge TPU device and its version of the SSD SPE, CoralSSD, but the performance was nowhere near as good. One obvious difference is that CoralSSD is a Python SPE because, at that time, the C++ API was not documented. One day I may change this to a C++ SPE and then the comparison will be more representative.

Of course you can use multiple NCS 2s to get better performance if required although I haven’t tried this on the Pi as yet. Still, the same can be done with Coral with suitable code. In any case, rt-ai has the Scaler SPE that allows any number of edge inference devices on any number of hosts to be used together to accelerate processing of a single flow. I have to say, the ability to use rt-ai and rtaiDesigner to quickly deploy distributed stream processing networks to heterogeneous hosts is a lot of fun!

The motivation for all of this is to move from x86 processors with big GPUs to Raspberry Pis with edge inference accelerators to save power. The driveway project has been running for months now, heating up the basement very nicely. Moving from YOLOv3 on a GTX 1080 to MobileNet SSD and a Coral Edge TPU saved about 60W; moving the entire thing from that system to the Raspberry Pi has probably saved a total of 80W or so.

This is the design now running full time on the Pi:

CPU utilization for the CSSDPi SPE is around 21% and it uses around 23% of the RAM. The raw output of the CSSDPi SPE is fed through a filter SPE that only outputs a message when a detection has passed certain criteria to avoid false alarms. Then, I get an email with a frame showing what triggered the system. The View module is really just for debugging – this is the kind of thing it displays:

The metadata displayed on the right is what the SSDFilter SPE uses to determine whether the detection should be reported or not. It requires a configurable number of sequential frames with a similar detection (e.g. car rather than something else) over a configurable confidence level before emitting a message. Then, it has a hold-off in case the detected object remains in the frame for a long time and, even then, requires a defined gap before that detection is re-armed. It seems to work pretty well.
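The filter behaviour described above can be sketched as a small state machine. This is a simplified illustration, not the SSDFilter source: the `DetectionFilter` class and its parameter names are hypothetical, everything is counted in frames rather than seconds, and the hold-off and re-arm gap are folded into a single rule (no further alerts until the object has been absent for a set number of frames).

```python
from typing import Optional


class DetectionFilter:
    """Emit one alert per appearance of an object, after N consistent frames."""

    def __init__(self, min_confidence: float = 0.6,
                 frames_required: int = 3, rearm_gap_frames: int = 5):
        self.min_confidence = min_confidence
        self.frames_required = frames_required
        self.rearm_gap_frames = rearm_gap_frames
        self.streak_label: Optional[str] = None
        self.streak = 0     # sequential frames with the same qualifying label
        self.absent = 0     # sequential frames with no qualifying detection
        self.armed = True   # False while held off after an alert

    def update(self, label: Optional[str], confidence: float = 0.0) -> Optional[str]:
        """Feed the best detection for one frame; return a label to alert on."""
        if label is None or confidence < self.min_confidence:
            self.streak_label, self.streak = None, 0
            self.absent += 1
            if self.absent >= self.rearm_gap_frames:
                self.armed = True   # gap was long enough, re-arm
            return None

        self.absent = 0
        if label == self.streak_label:
            self.streak += 1
        else:
            self.streak_label, self.streak = label, 1

        if self.armed and self.streak >= self.frames_required:
            self.armed = False      # hold off while the object stays in frame
            return label
        return None


f = DetectionFilter(min_confidence=0.6, frames_required=3, rearm_gap_frames=5)
print([f.update("car", 0.9) for _ in range(5)])  # [None, None, 'car', None, None]
```

A car that sits in the driveway for an hour produces one message here, and a single high-confidence glitch frame produces none, which is the point of requiring sequential frames.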

One advantage of using CSSD rather than CYOLO as before is that, while I don’t get specific messages for things like a USPS van, it can detect a wider range of objects:

Currently the filter accepts only the COCO vehicle classes and the person class, rejecting all others, in the interest of reducing false detection messages.

I had expected to need a Raspberry Pi 4 (mine is on its way 🙂 ) to get decent performance but clearly the Pi 3 is well able to cope with the help of the NCS 2.

Raspberry Pi 3 Model B with Coral Edge TPU acceleration running SSD object detection

It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera. The rt-ai Edge test design for this SPE is pretty simple again:

As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.
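To illustrate the difference between the two outputs, here is a sketch that builds a combined message and a metadata-only message from the same frame. The message layout is entirely hypothetical: the field names (`index`, `detections`, `jpeg`) and the use of base64 inside JSON are assumptions made for the example, not the actual rt-ai wire format.

```python
import base64
import json


def split_outputs(frame_info: dict, detections: list, jpeg: bytes):
    """Return (combined, metadata_only) messages as JSON strings."""
    metadata = dict(frame_info, detections=detections)
    # Combined output: same metadata plus the encoded frame (assumed base64).
    combined = dict(metadata, jpeg=base64.b64encode(jpeg).decode("ascii"))
    return json.dumps(combined), json.dumps(metadata)


frame_info = {"index": 42, "width": 1280, "height": 720}
detections = [{"label": "person", "score": 0.91, "box": [412, 180, 540, 610]}]
fake_jpeg = b"\xff\xd8" + bytes(50_000)  # placeholder standing in for a real JPEG

combined, metadata_only = split_outputs(frame_info, detections, fake_jpeg)
print(len(combined), "bytes with the frame")
print(len(metadata_only), "bytes metadata only")
```

The metadata-only message is a small fraction of the combined one, and nothing in it can be used to reconstruct the image, which covers both the WiFi and the privacy points above.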

An Edge TPU stream processing element for rt-ai using the Coral USB Accelerator

A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:

The very basic rt-ai test design looks like this:

Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…

Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates

The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!

This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.

Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.
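The caching idea can be sketched as follows. This is a Python illustration of the pattern only (the real app runs on iOS): the `DepthMapCache` class, its method names and the 30-entry limit are hypothetical. Bounding the cache is one simple way to cope with frames that the pipeline drops, since their depth maps would otherwise never be claimed.

```python
from collections import OrderedDict
from typing import Any, Optional


class DepthMapCache:
    """Hold depth maps locally until the matching processed frame returns."""

    def __init__(self, max_entries: int = 30):
        self.max_entries = max_entries
        self._maps: "OrderedDict[int, Any]" = OrderedDict()
        self._next_index = 0

    def store(self, depth_map: Any) -> int:
        """Cache a depth map and return the index to send with its image."""
        index = self._next_index
        self._next_index += 1
        self._maps[index] = depth_map
        # Evict the oldest entries; these belong to frames that were
        # dropped by the pipeline or are too late to be useful.
        while len(self._maps) > self.max_entries:
            self._maps.popitem(last=False)
        return index

    def claim(self, index: int) -> Optional[Any]:
        """Remove and return the depth map for a returned frame, if still cached."""
        return self._maps.pop(index, None)


cache = DepthMapCache(max_entries=3)
sent = [cache.store([[float(i)]]) for i in range(4)]  # index 0 is evicted
print(sent)            # [0, 1, 2, 3]
print(cache.claim(0))  # None, this frame's depth map has been evicted
print(cache.claim(2))  # [[2.0]]
```

The index travels with the image as an extra JSON field and comes back with the pose result, so the lookup on return is a single dictionary access. Turning the (x, y) keypoints plus depth into real spatial distances would additionally need the camera intrinsics, which this sketch leaves out.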

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai that runs very nicely so an obvious solution was to run rt-ai OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.

The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.

Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult, which leaves lots of resources for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe that future mixed reality headsets will be very lightweight devices that avoid both complex processing in the headset (unlike the HoloLens for example) and cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.