Extreme edge depth video processing: Intel L515 LiDAR + Raspberry Pi and Stereolabs ZED + Jetson Nano

Depth cameras are an important component of rt-ispace but things just aren’t going to scale if each one needs a server with a GPU just to generate useful data. This means that the extreme edge, consisting of hopefully low cost components that can be widely distributed, needs to be able to interface to depth cameras and make the data available to the wider network.

I have been testing with a Stereolabs ZED camera connected to a Jetson Nano and an Intel L515 LiDAR connected to a Raspberry Pi 4. The depth video stream generated by the rt-ai capture code consists of 1280 x 720 JPEG-encoded color frames along with uncompressed 640 x 360 16-bit depth data, with a target frame rate of 15fps. Both systems seem quite capable of capturing and transmitting the data streams as shown above. The rt-ai design being used is this:
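As a rough illustration of what each transmitted frame carries, here is a minimal sketch of packaging one RGBD frame in that format. The camera SDK calls are omitted (frame acquisition differs between the ZED and the L515) and the field names and JPEG quality are assumptions, not the actual rt-ai message layout:

```python
import time

import cv2
import numpy as np

def package_rgbd(color_bgr: np.ndarray, depth_mm: np.ndarray) -> dict:
    """color_bgr: 1280 x 720 BGR8 colour frame; depth_mm: 16-bit depth image."""
    ok, jpeg = cv2.imencode(".jpg", color_bgr, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if not ok:
        raise RuntimeError("JPEG encode failed")
    # Depth stays uncompressed but is downsampled to 640 x 360; nearest-neighbour
    # resampling avoids blending invalid (zero) depth values with valid ones.
    depth_small = cv2.resize(depth_mm, (640, 360), interpolation=cv2.INTER_NEAREST)
    return {
        "timestamp": time.time(),          # used later for latency measurement
        "color_jpeg": jpeg.tobytes(),      # 1280 x 720 JPEG-encoded colour frame
        "depth_raw": depth_small.astype(np.uint16).tobytes(),
        "depth_width": 640,
        "depth_height": 360,
    }
```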

The rtai0 node is the Raspberry Pi 4. In this case, the depth video streams from the cameras are not being processed further on the extreme edge systems. The depth data views, which display data coming directly from the extreme edge systems, show that the Jetson Nano is generating frames at the target rate while the Raspberry Pi 4 is achieving about half that. Both are usable rates for many applications.

The depth video frames are also passed to the OpenPoseGPU Stream Processing Element (SPE). This is an implementation of OpenPose that uses the desktop GPU (a GTX 1080 Ti) to perform pose estimation. The OpenPoseGPU SPE can work with standard video streams but, if given a depth video stream, will work out the depth of each identified joint and add that to the generated metadata.
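That depth annotation step can be sketched roughly as follows: map each joint’s colour-frame coordinates into the smaller depth image and look up a depth value there. The field names and the simple nearest-pixel lookup are illustrative assumptions rather than the OpenPoseGPU SPE’s actual code:

```python
import numpy as np

def add_joint_depths(joints, depth_mm: np.ndarray,
                     color_size=(1280, 720), depth_size=(640, 360)):
    """joints: list of dicts like {"name": "nose", "x": 640, "y": 360} in colour-frame pixels."""
    sx = depth_size[0] / color_size[0]
    sy = depth_size[1] / color_size[1]
    for joint in joints:
        # Scale colour-frame coordinates down to the depth image resolution.
        dx = min(int(joint["x"] * sx), depth_size[0] - 1)
        dy = min(int(joint["y"] * sy), depth_size[1] - 1)
        joint["depth_mm"] = int(depth_mm[dy, dx])   # 0 usually means "no depth"
    return joints
```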

The total throughput of the OpenPoseGPU SPE is around 14fps. As can be seen in the rt-ai design, the depth video streams are multiplexed into the OpenPoseGPU SPE so that this capability is being shared between the two streams. The FanOut SPE separates the output streams which are then sent to viewers. Due to the limited throughput of the OpenPoseGPU SPE, the data streams are reduced in frame rate by a factor of 2.

So this design, where OpenPose processing is offloaded from the extreme edge, works fine but it would be far more interesting to do this at the extreme edge.

The screen capture above shows pose estimation running at the extreme edge using an Intel NCS 2 to run inference as implemented by the OpenPoseVINO SPE on the rtai0 Raspberry Pi 4 node. This does work pretty well but can only achieve 2fps. This might be ok for some applications but it would be nice to get to around 10fps.

I also tried running trt_pose on the Jetson Nano for extreme edge pose estimation there, but this was not successful. It may be that trying to run the ZED camera and trt_pose on the same Nano is just asking too much. Moving to a Xavier NX would probably make sense as it has double the memory and more power in general, but it is a fair bit more expensive than the Nano and so somewhat less relevant to the extreme edge application.

Work is now moving on to a new architecture using distributed inference to relieve the load on the extreme edge while still achieving usable pose estimation frame rates.

Using shared memory for rt-ai inter-SPE transfers

The screen capture above couldn’t have been obtained previously as it is passing uncompressed (RGB888) video between rt-ai SPEs on the same node (a Jetson Nano in this case). The CVideoView window shows the output of a simple network that uses the CSSDJetson SPE to classify objects; it also computes the frame rate and latency of received frames. The source of the frames is a Logitech C920 webcam running at 1280 x 720, 30fps. It shows that the latency is around 128ms at around 15fps.
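The frame rate and latency bookkeeping a viewer like CVideoView does can be sketched as follows. It assumes each received message carries the capture timestamp set by the source SPE (as in the packaging sketch earlier); the actual rt-ai metadata fields may well differ:

```python
import time
from collections import deque

class FrameStats:
    """Rolling frame rate plus last-frame latency for a viewer window."""

    def __init__(self, window=30):
        self.arrivals = deque(maxlen=window)
        self.latency_ms = 0.0

    def update(self, capture_timestamp: float):
        now = time.time()
        self.arrivals.append(now)
        # Latency is the age of the frame when it reaches the viewer.
        self.latency_ms = (now - capture_timestamp) * 1000.0

    @property
    def fps(self) -> float:
        if len(self.arrivals) < 2:
            return 0.0
        span = self.arrivals[-1] - self.arrivals[0]
        return (len(self.arrivals) - 1) / span if span > 0 else 0.0
```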

This screen capture shows what happens when shared memory isn’t used. Actually, the latency figure here is misleading: it is the link from the CUVCCam SPE to the MQTT server that seems to be the bottleneck when running uncompressed video. Since there is no throttling on that interface (normally it isn’t a problem), the latency climbs monotonically until the node runs out of memory.

There doesn’t seem to be much benefit from shared memory when passing smaller messages between SPEs.

The screen capture above shows shared memory being used when transferring JPEG frames. The one below is with shared memory support turned off.

This just shows that bouncing off the MQTT server within the same node is pretty efficient, at least when compared to the latency of the inference.

Being able to pass large messages around efficiently, even if only point to point within the same node, is quite a step forward by itself. For example, it makes it practical to create networks that pass RGBD frames around.

Shared memory support in rt-ai2 uses the Qt QSharedMemory and QSystemSemaphore wrappers to keep things simple. When a design is generated, rtaiDesigner determines whether shared memory has been enabled for the network, whether the publisher and subscriber are on the same node and whether the connection is point to point (i.e. exactly one subscriber). If so, the publisher and subscriber SPEs are told to use shared memory instead of MQTT for that particular connection. The publisher SPE’s configuration file also specifies the shared memory slot size to use and how big the pending transmission queue should be. At the moment the system always uses three shared memory slots forming a rotating buffer. The shared memory slots are created by the publisher and attached by the subscriber.

To minimize latency, every time the publisher places a new message in the next shared memory slot it releases a QSystemSemaphore to unblock a thread in the subscriber, which can then extract the message, free the shared memory slot and process the received message.
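Here is a minimal sketch of the publisher side of this mechanism, using the same Qt primitives (via PyQt5 here; rt-ai itself is C++/Qt). The slot count follows the description above, but the key names, slot size, length header and error handling are illustrative assumptions rather than rt-ai’s actual implementation:

```python
import struct

from PyQt5.QtCore import QSharedMemory, QSystemSemaphore

NUM_SLOTS = 3                    # rotating buffer of three slots, as described above
SLOT_SIZE = 4 * 1024 * 1024      # per-slot size would come from the SPE configuration

class ShmPublisher:
    def __init__(self, link_name: str):
        # The publisher creates the slots; the subscriber attaches to the same keys.
        self.slots = []
        for i in range(NUM_SLOTS):
            shm = QSharedMemory(f"{link_name}_slot{i}")
            if not shm.create(SLOT_SIZE):
                raise RuntimeError(shm.errorString())
            self.slots.append(shm)
        # Semaphore used to wake the subscriber's receive thread.
        self.sem = QSystemSemaphore(f"{link_name}_sem", 0, QSystemSemaphore.Create)
        self.next_slot = 0

    def publish(self, message: bytes):
        shm = self.slots[self.next_slot]
        self.next_slot = (self.next_slot + 1) % NUM_SLOTS
        payload = struct.pack("<I", len(message)) + message   # assumed length header
        if len(payload) > SLOT_SIZE:
            raise ValueError("message too large for shared memory slot")
        shm.lock()
        try:
            shm.data()[:len(payload)] = payload
        finally:
            shm.unlock()
        self.sem.release()       # unblock the subscriber so it can read the slot
```

The subscriber side would attach to the same keys with QSharedMemory.attach(), open the semaphore with QSystemSemaphore.Open and block on acquire() in its receive thread before reading out the next slot.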

This implementation of shared memory seems to work very well and is highly reliable. In principle, it could be extended to support multiple subscribers by replicating the shared memory slot structure for each subscriber.

Jetson Nano SSD-Mobilenet-v2 SPE for rt-ai

Following on from the earlier work with the Jetson Nano, the SSD-Mobilenet-v2 model is now running as an rt-ai Stream Processing Element (SPE) for Jetson and so is fully integrated with the rt-ai system. Custom models created using transfer learning can also be used – it’s just a case of setting the model name in the SPE’s configuration and placing the required model files on the rt-ai data server. Since models are automatically downloaded at runtime if necessary, it’s pretty trivial to change the model being used on an existing Stream Processing Network (SPN).

The screen capture above shows the rt-ai design that generated the implementation. Here I am using the UVCCam SPE so that the video is sourced from a webcam but any of the other rt-ai video sources (such as RTSPCam) could be used, simply by replacing the camera SPE in the design using the graphical editor – originally this design used RTSPCam in fact.

Using 1280 x 720 video frames, the SSDJetson SPE processes around 17fps. This is not bad but less than the 21fps achieved by the monolithic example code. The problem is that, in order to support one-to-many and many-to-one connections in graphically designed, heterogeneous multi-node networks, rt-ai currently uses MQTT brokers to move data and provide multicast where necessary. Even when the broker and the SPEs are running on the same node, this is obviously less efficient than pointer passing within monolithic code.
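To illustrate the broker hop in isolation, the snippet below (a standalone sketch using the paho-mqtt 1.x client API, not rt-ai’s own transport code) measures a local publish/subscribe round trip for a payload roughly the size of one uncompressed 720p RGB888 frame:

```python
import time

import paho.mqtt.client as mqtt

TOPIC = "latency_test"
PAYLOAD = bytes(1280 * 720 * 3)     # roughly one uncompressed RGB888 720p frame
sent_at = 0.0

def on_connect(client, userdata, flags, rc):
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Called when the frame-sized payload comes back from the local broker.
    print(f"broker round trip: {(time.time() - sent_at) * 1000.0:.1f} ms "
          f"for {len(msg.payload)} bytes")

client = mqtt.Client()              # paho-mqtt 1.x style client
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)   # broker on the same node
client.loop_start()

time.sleep(0.5)                     # give the subscription time to complete
sent_at = time.time()
client.publish(TOPIC, PAYLOAD)
time.sleep(1.0)
client.loop_stop()
```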

This “inefficiency of generality” isn’t really visible on powerful x86 machines but has an impact on devices like the Jetson Nano and Raspberry Pi. The solution to this is to recognize such local links and side-step the MQTT broker interface using shared memory. This optimization will be done automatically in rtaiDesigner when it generates the configurations for each SPE in an SPN, flagging appropriate nodes as sources or sinks of shared memory links when both source and sink SPEs reside on the same node.
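The selection rule, as described in the shared memory section earlier, comes down to something like this (a restatement of the conditions, not rtaiDesigner’s actual code):

```python
def use_shared_memory(shm_enabled: bool, publisher_node: str,
                      subscriber_nodes: list) -> bool:
    # Shared memory fast path only for point-to-point links on a single node.
    return (shm_enabled
            and len(subscriber_nodes) == 1
            and subscriber_nodes[0] == publisher_node)
```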

Jetson Nano and rt-ai

The Jetson Nano is an obvious platform for rt-ai to support, to go with the existing Intel NCS2 and Coral edge platforms. One nice plus is that the Jetson Nano comes basically ready to go, all set up for inference.

The screen capture above shows the Nano running the detectnet-camera example code using a webcam as the source generating 1280 x 720 frames and SSD-Mobilenet-v2 as the model. Performance was not bad at 21fps running at 10W, 16fps running at 5W. The heatsink did get very hot in a pretty short space of time however!
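Something close to what that example does can be reproduced with the jetson-inference Python bindings, roughly as sketched below; the camera device path is a placeholder and the exact API (this is the older gstCamera/glDisplay style) depends on the jetson-inference release installed:

```python
import jetson.inference
import jetson.utils

# Load SSD-Mobilenet-v2 with TensorRT acceleration and open the webcam.
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)
camera = jetson.utils.gstCamera(1280, 720, "/dev/video0")   # device path is a placeholder
display = jetson.utils.glDisplay()

while display.IsOpen():
    img, width, height = camera.CaptureRGBA()
    detections = net.Detect(img, width, height)
    display.RenderOnce(img, width, height)
    display.SetTitle("SSD-Mobilenet-v2 | {:.0f} FPS".format(net.GetNetworkFPS()))
```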

Installing the rt-ai runtime was no problem at all and it was easy to utilize the H.264 accelerated pipeline in rt-ai’s RTSP camera capture module. The screen capture above shows this running along with a viewer, demonstrating basic rt-ai functionality.
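A hardware-accelerated RTSP decode pipeline of that kind, read here through OpenCV’s GStreamer backend, might look roughly like the sketch below; the camera URL is a placeholder and element names such as nvv4l2decoder depend on the JetPack release (this is not rt-ai’s actual capture code):

```python
import cv2

RTSP_URL = "rtsp://camera.local:554/stream"   # placeholder URL

# Depayload and parse the H.264 stream, decode it on the Nano's hardware decoder,
# then convert to BGR for consumption by downstream code.
pipeline = (
    f"rtspsrc location={RTSP_URL} latency=200 ! "
    "rtph264depay ! h264parse ! nvv4l2decoder ! "
    "nvvidconv ! video/x-raw,format=BGRx ! "
    "videoconvert ! video/x-raw,format=BGR ! appsink drop=true"
)

cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # frame is now a decoded BGR image ready for further processing
cap.release()
```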

Next up is to roll the detection code into an rt-ai Stream Processing Element (SPE). This will generate identical metadata to the existing SSD detectors, allowing full compatibility between server GPU, Jetson, NCS 2 and Coral SSD detectors.