Extreme edge depth video processing: Intel L515 LiDAR + Raspberry Pi and Stereolabs ZED + Jetson Nano

Depth cameras are an important component of rt-ispace but things just aren’t going to scale if each one needs a server with GPU just to generate useful data. This means that the extreme edge, consisting of hopefully low cost components that can be widely distributed, needs to be able to interface to depth cameras and make the data available to the wider network.

I have been testing with a Stereolabs ZED camera connected to a Jetson Nano and an Intel L515 LiDAR connected to a Raspberry Pi 4. The depth video stream generated by the rt-ai capture code is 1280 x 720 pixels, JPEG encoded along with uncompressed 640 x 360 16 bit depth data with a target frame rate of 15fps. Both systems seem quite capable of capturing and transmitting the data streams as shown above. The rt-ai design being used is this:

The rtai0 node is the Raspberry Pi 4. In this case, the depth video streams from the cameras are not being processed further on the extreme edge systems. The depth data views, which display data coming directly from the extreme edge systems, show that the Jetson Nano is generating frames at the target rate while the Raspberry Pi 4 is achieving about half that. Both are usable rates for many applications.

The depth video frames are also passed to the OpenPoseGPU Stream processing Element (SPE). This is an implementation of OpenPose that uses the desktop GPU (a GTX1080 Ti) to implement pose estimation. The OpenPoseGPU SPE can work with standard video streams but if given a depth video stream will work out the depth of each identified joint and add that to the metadata generated.

The total throughput of the OpenPoseGPU SPE is around 14fps. As can be seen in the rt-ai design, the depth video streams are multiplexed into the OpenPoseGPU SPE so that this capability is being shared between the two streams. The FanOut SPE separates the output streams which are then sent to viewers. Due to the limited throughput of the OpenPoseGPU SPE, the data streams are reduced in frame rate by a factor of 2.

So this design, where OpenPose processing is offloaded from the extreme edge, works fine but it would be far more interesting to do this at the extreme edge.

The screen capture above shows pose estimation running at the extreme edge using an Intel NCS 2 to run inference as implemented by the OpenPoseVINO SPE on the rtai0 Raspberry Pi 4 node. This does work pretty well but can only achieve 2fps. This might be ok for some applications but it would be nice to get to around 10fps.

I did also try running trt_pose on the Jetson Nano to try extreme edge pose estimation there but this was not successful. It may be that trying to run the ZED camera and trt_pose on the same Nano is just asking too much. Moving to a Xavier NX would probably make sense as it has double the memory and more power in general but it is a fair bit more expensive that the Nano so somewhat less relevant to the extreme edge application.

Work is now moving onto a new architecture using distributed inference to relieve the load on the extreme edge while still achieving usable pose estimation frame rates.

Embedding AI inference in the Spatial Networking Cloud

rt-ai is a system for graphically composing edge AI stream processing networks either distributed on multiple processing nodes or all running on a single node. In the latter case, shared memory can be used for transfers between the functional blocks, making it almost as efficient as monolithic code. So when it comes to embedding AI inference in a Spatial Networking Cloud (SNC), rt-ai makes perfect sense. However, the underlying network styles are completely different – SNC uses a highly dynamic series of multicast and end to end virtual links whereas rt-ai uses static MQTT or shared memory links. Each makes sense for the different applications so it is necessary to create a bridge between the two worlds.

Right now, bridging is done (for video streams) with the GetSNCVideo and PutSNCVideo stream processing elements (SPEs) that can be added to any video stream processing network (SPN). GetSNCVideo can grab SNC video frames from the configured stream source which then acts as an rt-ai source for the downstream SPEs. Once processing has been completed, the frames can be re-injected into SNC using the PutSNCVideo SPE. There can be similar bridges for sensor or any other type of data that needs to be passed through an rt-ai SPN.

Originally, rt-ai had its own SPEs for collecting sensor data but this led to quite a bit of duplication between rt-ai and SNC. The embedding technique completely removes the need for this duplication as rt-ai SPNs can hook into any SNC data stream, no matter what hardware generated it.

The screen capture above shows an example that I am using as part of the driveway detection system that I have been running for quite a long term now to detect vehicles or people moving around the driveway – this post describes the original system. The heart of this is an NCS 2 inference engine with some post processing code to generate email alerts when something has been detected. All of the SPEs in this case are running on the same Raspberry Pi 4 which is humming along nicely, running inference on a 1280 x 720 10fps video stream. Now that this SPN has been embedded in SNC, it is possible to save all of the annotated video using standard SNC storage if required or else further process and add to the metadata with anything that connects to SNC.

rt-ai SPNs can be used to create synth modules (basically SPN macros) that can be replicated any number of times and individually configured to process different streams. Alternatively, a single SPN can process data from multiple SNC video streams using an SNC fan out SPE, similar to this one.

So what does this do for rt-ispace? The whole idea of rt-ispace is that ubiquitous sensing and other real-time data streams are collected in SNC, AI inference distills the raw streams into meaningful data and then the results are fed to SHAPE for integration into real world augmentations. Embedding rt-ai SPNs in SNC provides the AI data distillation in a highly efficient and reusable way.

Object detection on the Raspberry Pi 4 with the Neural Compute Stick 2

Following on from the Coral USB experiment, the next step was to try it out with the NCS 2. Installation of OpenVINO on Raspbian Buster was straightforward. The rt-ai design was basically the same as for the Coral USB experiment but with the CoralSSD SPE replaced with the OpenVINO equivalent called CSSDPi. Both SPEs run ssd_mobilenet_v2_coco object detection.

Performance was pretty good – 17fps with 1280 x 720 frames. This is a little better than the Coral USB accelerator attained but then again the OpenVINO SPE is a C++ SPE while the Coral USB SPE is a Python SPE and image preparation and post processing takes its toll on performance. One day I am really going to use the C++ API to produce a new Coral USB SPE so that the two are on a level playing field. The raw inference time on the Coral USB accelerator is about 40mS or so meaning that there is plenty of opportunity for higher throughputs.

MobileNet SSD object detection using the Intel Neural Compute Stick 2 and a Raspberry Pi

I had successfully run ssd_mobilenet_v2_coco object detection using an Intel NCS2 running on an Ubuntu PC in the past but had not tried this using a Raspberry Pi running Raspbian as it was not supported at that time (if I remember correctly). Now, OpenVINO does run on Raspbian so I thought it would be fun to get this working on the Pi. The main task consisted of getting the CSSD rt-ai Stream Processing Element (SPE) compiling and running using Raspbian and its version of OpenVINO rather then the usual x86 64 Ubuntu system.

Compiled rt-ai SPEs use Qt so it was a case of putting together a different .pro qmake file to reflect the particular requirements of the Raspbian environment. Once I had sorted out the slight link command changes, the SPE crashed as soon as it tried to read in the model .xml file. I got stuck here for quite a long time until I realized that I was missing a compiler argument that meant that my binary was incompatible with the OpenVINO inference engine. This was fixed by adding the following line to the Raspbian .pro file:

QMAKE_CXXFLAGS += -march=armv7-a

Once that was added, the code worked perfectly. To test, I set up a simple rt-ai design:

For this test, the CSSDPi SPE was the only thing running on the Pi itself (rtai1), the other two SPEs were running on a PC (default). The incoming captured frames from the webcam to the CSSDPi SPE were 1280 x 720 at 30fps. The CSSDPi SPE was able to process 17 frames per second, not at all bad for a Raspberry Pi 3 model B! Incidentally, I had tried a similar setup using the Coral Edge TPU device and its version of the SSD SPE, CoralSSD, but the performance was nowhere near as good. One obvious difference is that CoralSSD is a Python SPE because, at that time, the C++ API was not documented. One day I may change this to a C++ SPE and then the comparison will be more representative.

Of course you can use multiple NCS 2s to get better performance if required although I haven’t tried this on the Pi as yet. Still, the same can be done with Coral with suitable code. In any case, rt-ai has the Scaler SPE that allows any number of edge inference devices on any number of hosts to be used together to accelerate processing of a single flow. I have to say, the ability to use rt-ai and rtaiDesigner to quickly deploy distributed stream processing networks to heterogeneous hosts is a lot of fun!

The motivation for all of this is to move from x86 processors with big GPUs to Raspberry Pis with edge inference accelerators to save power. The driveway project has been running for months now, heating up the basement very nicely. Moving from YOLOv3 on a GTX 1080 to MobileNet SSD and a Coral edge TPU saved about 60W, moving the entire thing from that system to the Raspberry Pi has probably saved a total of 80W or so.

This is the design now running full time on the Pi:

CPU utilization for the CSSDPi SPE is around 21% and it uses around 23% of the RAM. The raw output of the CSSDPi SPE is fed through a filter SPE that only outputs a message when a detection has passed certain criteria to avoid false alarms. Then, I get an email with a frame showing what triggered the system. The View module is really just for debugging – this is the kind of thing it displays:

The metadata displayed on the right is what the SSDFilter SPE uses to determine whether the detection should be reported or not. It requires a configurable number of sequential frames with a similar detection (e.g. car rather than something else) over a configurable confidence level before emitting a message. Then, it has a hold-off in case the detected object remains in the frame for a long time and, even then, requires a defined gap before that detection is re-armed. It seems to work pretty well.

One advantage of using CSSD rather than CYOLO as before is that, while I don’t get specific messages for things like a USPS van, it can detect a wider range of objects:

Currently the filter only accepts all the COCO vehicle classes and the person class while rejecting others, all in the interest of reducing false detection messages.

I had expected to need a Raspberry Pi 4 (mine is on its way 🙂 ) to get decent performance but clearly the Pi 3 is well able to cope with the help fo the NCS 2.

Raspberry Pi 3 Model B with Coral Edge TPU acceleration running SSD object detection

It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera.  The rt-ai Edge test design for this SPE is pretty simple again:

As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.

An Edge TPU stream processing element for rt-ai using the Coral USB Accelerator

A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:

The very basic rt-ai test design looks like this:

Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…

Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates

The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!

This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.

Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.

Generating 3D spatial coordinates from OpenPose with the help of the Stereolabs ZED camera

OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints is what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose using RGBD data from a StereoLabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.

The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.

Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.

The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as this is essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).

This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai that runs very nicely so an obvious solution was to run rt-ai OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.

The results are show in the screen captures above. Using a GTX 1080 ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.

Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML  for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe that future mixed reality headset will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens for example) or require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.

rt-ai dynamic and adaptive parallel inference using the new Scaler SPE

One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.

The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.

Pipelines are configured with a maximum to the number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.

The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.