Using shared memory for rt-ai inter-SPE transfers

The screen capture above couldn’t have been obtained previously, as it shows uncompressed (RGB888) video being passed between rt-ai SPEs on the same node (a Jetson Nano in this case). The CVideoView window shows the output of the simple network, which uses the CSSDJetson SPE to classify objects, and also displays the frames per second and latency of received frames. The source of the frames is a Logitech C920 webcam running at 1280 x 720, 30fps. The latency is around 128ms at around 15fps.

This screen capture shows what happens when shared memory isn’t used. Actually, the latency here is misleading, as the bottleneck when running uncompressed video seems to be the link from the CUVCCam SPE to the MQTT server. The latency climbs steadily until there is no memory left, because there is no throttling on that interface; normally it isn’t a problem.
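To put a rough number on why uncompressed video strains the MQTT path, here is a back-of-the-envelope calculation of the data rate for the camera mode used above. It is only arithmetic on the figures already quoted (1280 x 720, RGB888, 30fps); the function and constant names are my own and are not part of rt-ai.

```cpp
#include <cstdint>

// Bytes in one uncompressed frame: width x height x bytes per pixel.
constexpr std::uint64_t frameBytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bytesPerPixel)
{
    return width * height * bytesPerPixel;
}

// Sustained data rate in bytes per second at a given frame rate.
constexpr std::uint64_t bytesPerSecond(std::uint64_t bytesPerFrame, std::uint64_t fps)
{
    return bytesPerFrame * fps;
}

// RGB888 is 3 bytes per pixel; 1280 x 720 at 30fps is the C920 mode used above.
constexpr std::uint64_t kFrameBytes = frameBytes(1280, 720, 3);       // 2,764,800 bytes
constexpr std::uint64_t kCameraRate = bytesPerSecond(kFrameBytes, 30); // 82,944,000 bytes/s
```

So the camera produces roughly 83 MB/s, and even at the 15fps the network actually sustains, each hop still has to move about 41 MB/s.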

There doesn’t seem to be much benefit when passing smaller messages between SPEs.

The screen capture above shows shared memory being used when transferring JPEG frames. The one below is with shared memory support turned off.

This just shows that bouncing off the MQTT server within the same node is pretty efficient, at least when compared to the latency of the inference.

Being able to pass large messages around efficiently, even if only point to point within the same node, is quite a step forward by itself. For example, it makes it practical to create networks that pass RGBD frames around.

Shared memory support in rt-ai2 uses the Qt QSharedMemory and QSystemSemaphore wrappers to make things simple. When a design is generated, rtaiDesigner checks whether shared memory has been enabled for the network, whether the publisher and subscriber are on the same node and whether the connection is point to point (i.e. exactly one subscriber). If so, the publisher and subscriber SPEs are told to use shared memory instead of MQTT for that particular connection. The SPE configuration file for the publisher SPE also includes the shared memory slot size to use and how big the pending transmission queue should be. The system is set up at the moment to always use three shared memory slots forming a rotating buffer. The shared memory slots are created by the publisher and attached by the subscriber.
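The three conditions above can be sketched as a single decision function. This is only an illustration of the rule as described, not the rtaiDesigner code: the Connection struct, the Transport enum and chooseTransport() are hypothetical names invented for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one publisher-to-subscriber connection.
struct Connection {
    std::string publisherNode;                 // node hosting the publisher SPE
    std::vector<std::string> subscriberNodes;  // one entry per subscriber SPE
};

enum class Transport { Mqtt, SharedMemory };

// Shared memory is chosen only when all three conditions hold:
// it is enabled for the network, there is exactly one subscriber,
// and that subscriber runs on the same node as the publisher.
Transport chooseTransport(bool sharedMemoryEnabled, const Connection &c)
{
    const bool pointToPoint = c.subscriberNodes.size() == 1;
    if (sharedMemoryEnabled && pointToPoint &&
        c.subscriberNodes.front() == c.publisherNode)
        return Transport::SharedMemory;
    return Transport::Mqtt;  // everything else stays on MQTT
}
```

Anything that fails a condition simply falls back to MQTT, so a design still works if shared memory is turned off or an SPE is moved to another node.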

To minimize latency, every time the publisher places a new message in the next shared memory slot, it releases a QSystemSemaphore to unblock a thread in the subscriber that can then extract the message, free the shared memory slot and process the received message.
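The slot handoff can be sketched roughly as follows. To keep the example self-contained, this is an in-process stand-in: a std::mutex and std::condition_variable play the roles that QSharedMemory and QSystemSemaphore play in the real system, and the SlotRing class and its method names are my own invention, not the rt-ai2 code. What happens when all three slots are full is also an assumption here (publish() reports failure so the caller can hold the message in its pending queue).

```cpp
#include <array>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// In-process stand-in for the three-slot rotating buffer.
class SlotRing {
public:
    static constexpr std::size_t kSlots = 3;
    using Message = std::vector<std::uint8_t>;

    // Publisher side: copy a message into the next slot if it is free.
    // Returns false when every slot is still occupied, so the caller can
    // keep the message in its pending transmission queue and retry later.
    bool publish(const Message &msg)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_count == kSlots)
            return false;
        m_slots[m_head] = msg;
        m_head = (m_head + 1) % kSlots;
        ++m_count;
        m_ready.notify_one();  // stands in for releasing the semaphore
        return true;
    }

    // Subscriber side: block until a slot is filled, take the message out
    // and free the slot before returning, so the caller processes the
    // message without holding a slot.
    Message receive()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return m_count > 0; });
        Message msg = std::move(m_slots[m_tail]);
        m_slots[m_tail].clear();
        m_tail = (m_tail + 1) % kSlots;
        --m_count;
        return msg;
    }

private:
    std::array<Message, kSlots> m_slots;
    std::size_t m_head = 0;   // next slot the publisher writes
    std::size_t m_tail = 0;   // next slot the subscriber reads
    std::size_t m_count = 0;  // number of occupied slots
    std::mutex m_mutex;
    std::condition_variable m_ready;
};
```

In use, the publisher would call publish() for each new frame while a dedicated subscriber thread loops on receive(). Freeing the slot before processing is what lets the publisher keep writing while a slow stage such as inference is still working on the previous frame.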

This implementation of shared memory seems to work very well and is highly reliable. In principle, it could be extended to support multiple subscribers by replicating the shared memory slot structure for each subscriber.