Using UWB asset tags to dynamically create rt-ispace augmentations

Essentially, an asset tag is a small device that can be used to locate and instantiate an augmentation in an rt-ispace environment completely dynamically, using UWB (in this case) for positioning. The augmentation follows the position and orientation of the asset tag, making for a very simple way to implement augmented spaces. Properly engineered, it could be an extremely simple piece of hardware: essentially a UWB radio along with a MEMS IMU and a battery. Instead of WiFi, as in this prototype, pose updates could be sent over the UWB infrastructure to make things really simple. Ideally, these would be extremely cheap and could be placed anywhere in a space as a simple way of adding augmentations. These augmentations can be proxy objects (all aspects of a proxy object augmentation can be modified by remote servers) and can be as simple or complex as desired.

There are both similarities to and differences from the ArUco marker system for instantiating augmentations. An ArUco marker can provide an ID, but that ID has to be matched to a previously instantiated object configured with the same ID. Asset tags don’t require any pre-configuration like that. Another problem with ArUco markers is that they are very sensitive to occlusion – even a wire running across a marker might make it undetectable. Asset tags are not affected by occlusion and so will function correctly in a much wider range of circumstances. They do require UWB-enabled spaces, however. In the end, both styles of augmentation instantiation have their place.

Note that the asset tag doesn’t need to contain the actual asset data (although it could if desired). All it needs to provide is the URL of a repository where the asset (either a Unity assetbundle or a glTF blob) can be found. The asset is then streamed dynamically when it needs to be instantiated. The tag also provides information about where to find function servers in the case of a proxy object. The rt-ispace user app (in this case an iOS app running on an iPad Pro) doesn’t need to know anything about asset tags – they just inject normal-looking (but transient) augmentation updates into the rt-ispace system so that augmentations magically appear. Obviously, this kind of flexibility could easily be abused and, in real life, a proper security strategy would need to be implemented in most cases. For development, though, it’s nice for things to just work!
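As a rough illustration of the kind of information a tag needs to supply (the field names here are hypothetical, not the actual rt-ispace message format):

```python
# Hypothetical sketch of the data an asset tag supplies - field names and URLs
# are illustrative only, not the real rt-ispace message format.
asset_tag_advert = {
    "tagID": "tag-0042",                                             # unique ID of this tag
    "assetURL": "https://assets.example.com/models/CesiumMan.glb",   # glTF blob or Unity assetbundle
    "assetType": "gltf",                                             # or "assetbundle"
    "functionServers": ["https://funcs.example.com/cesiumman"],      # only needed for proxy objects
    "pose": {
        "position": [1.2, 0.0, -0.7],                                # meters, from UWB ranging
        "orientation": [0.0, 0.0, 0.0, 1.0],                         # quaternion (x, y, z, w) from the IMU
    },
}
```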

One application that I like is a shared space where people can bring along their virtual creations in the form of asset tags and just place them in the rt-ispace space so that any user in the space can see them.

Another idea is that items in stores could have rt-ispace asset tags attached to them (like security tags today) so that looking at the item with an AR device would perhaps demonstrate some kind of feature. Manufacturers could supply the asset and function servers, freeing the retail store from having to implement something for every stocked item.

The video above shows the augmentation tracking the UWB tag around the space, with the IMU controlling the augmentation’s orientation. For now, the hardware is a complete hack with multiple components, but it does prove that the concept is viable. The UWB tag (the white box on the floor under the figure’s right foot) controls the location of the augmentation in the physical space. A Raspberry Pi fitted with an IMU provides orientation information and sends the resulting pose via WiFi to the rt-ispace servers. The augmentation is the usual glTF sample, CesiumMan.
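The tag-side loop in the prototype boils down to something like this – a sketch only, assuming an MQTT transport (paho-mqtt 1.x style shown) and placeholder UWB/IMU helpers standing in for the real drivers:

```python
# Minimal sketch of a tag-side pose publisher. The broker address, topic and
# helper functions are placeholders; the real prototype's transport details
# are not shown here.
import json, time
import paho.mqtt.client as mqtt

def read_uwb_position():
    return [0.0, 0.0, 0.0]            # placeholder: x, y, z in meters from UWB ranging

def read_imu_quaternion():
    return [0.0, 0.0, 0.0, 1.0]       # placeholder: x, y, z, w from the MEMS IMU

client = mqtt.Client()                # paho-mqtt 1.x constructor
client.connect("rt-ispace-server.local", 1883)   # hypothetical broker address
client.loop_start()

while True:
    pose = {"tagID": "tag-0042",
            "position": read_uwb_position(),
            "orientation": read_imu_quaternion()}
    client.publish("rtispace/assettags/tag-0042", json.dumps(pose))
    time.sleep(0.1)                   # roughly 10 pose updates per second
```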

Linking AR augmentations to physical space using the ArUco marker system

Following on from the earlier work with ArUco markers, rt-ispace can now associate ArUco markers with augmentations in a space. The image above shows two glTF sample models attached to two different ArUco marker codes (23 and 24 in this case). Since these models are animated, a video also seems appropriate!

The image and video were obtained using an iPad Pro running the rt-ispace app that forms the front end for the rt-ispace system. A new server, EdgeAnchor, receives the AR video stream from the iPad via the assigned EdgeAccess, detecting any ArUco markers that may be in view. The video stream also contains the iPad camera intrinsics and AR camera pose, which allows EdgeAnchor to determine the physical pose of the marker relative to the camera view. The marker detection results are sent back to the iPad app (via EdgeAccess), which then matches the ArUco IDs to instantiated augmentations and calculates the world space pose for each augmentation. There are some messy calculations in there but it actually works very well.
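The detection step itself is standard OpenCV territory. A minimal sketch of that part (using the classic OpenCV contrib ArUco API and placeholder intrinsics, not the actual EdgeAnchor code) looks like this:

```python
# Detect ArUco markers in a frame and estimate each marker's pose relative to
# the camera, given the intrinsics supplied with the video stream. Intrinsics
# values here are placeholders; uses the classic cv2.aruco API.
import cv2
import numpy as np

camera_matrix = np.array([[1000.0, 0.0, 640.0],
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]])        # fx, fy, cx, cy from the device
dist_coeffs = np.zeros(5)                          # assume an undistorted AR frame
marker_side = 0.10                                 # marker side length in meters

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
frame = cv2.imread("frame.jpg")                    # stand-in for a decoded video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)
if ids is not None:
    result = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_side, camera_matrix, dist_coeffs)
    rvecs, tvecs = result[0], result[1]
    for marker_id, rvec, tvec in zip(ids.flatten(), rvecs, tvecs):
        # rvec/tvec give the 6-dof pose of the marker in the camera frame; these
        # are the results returned to the user app to be combined with the AR
        # camera's world pose.
        print(marker_id, rvec.ravel(), tvec.ravel())
```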

The examples shown are set up to instantiate the augmentation based on a horizontal marker. However, the augmentation configuration allows for a 6-dof offset to the marker. This means that markers can be hung on walls with augmentations either on the walls or in front of the walls, for example.
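Just to illustrate the math involved (the conventions here are assumptions, not the exact rt-ispace calculation), the world pose of the augmentation is the composition of the AR camera’s world pose, the marker pose relative to the camera and the configured offset:

```python
# Compose camera-to-world, marker-in-camera and the per-augmentation offset as
# 4x4 homogeneous transforms. The numbers and axis conventions are illustrative.
import numpy as np

def make_transform(rotation_3x3, translation_3):
    T = np.eye(4)
    T[:3, :3] = rotation_3x3
    T[:3, 3] = translation_3
    return T

camera_to_world = make_transform(np.eye(3), [0.0, 1.5, 0.0])   # from the AR session
marker_in_camera = make_transform(np.eye(3), [0.0, 0.0, 2.0])  # from marker detection
offset = make_transform(np.eye(3), [0.0, 0.5, 0.0])            # e.g. 0.5 m out from a wall marker

augmentation_to_world = camera_to_world @ marker_in_camera @ offset
print(augmentation_to_world[:3, 3])   # world-space position of the augmentation
```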

A single EdgeAnchor instance can be shared among many rt-ispace users as no state is retained between frames, allowing the system to scale very nicely. Also, there is nothing specific to ArUco markers here: in principle, EdgeAnchor could support multiple marker types, providing great flexibility. The only requirement is that the marker detection results in a 6-dof pose relative to the camera.

Previously, I had been resistant to the use of markers, preferring to use the spatial mapping capabilities of the user device to provide spatial lock and location of augmentations. However, there are many limitations to these systems, especially where there is very limited visual texture or depth variation to act as a natural anchor. Adding physical anchors means that augmentations can be reliably placed in very featureless spaces, which is a big plus in terms of creating a pleasant user experience.

Adding ArUco marker detection to rt-ai

There are many situations where it is necessary to establish the spatial relationship between a camera in a space and 3D points within the same space. One particular application of interest is the ability to use markers to accurately locate holograms in a space so that AR headset users see the holograms locked in the space, even as they look or move around the space. OpenCV includes ArUco marker detection, so that seemed like a good place to start. The screen capture above shows the rt-ai ArUco marker detector identifying the pose of a few example markers.

This is the simple rt-ai test design with the new ArUcoDetect stream processing element (SPE). The UVC camera was running at 1920 x 1080, 30 fps, and the ArUco SPE had no trouble keeping up with this.

This screen capture is a demonstration of the kind of thing that might be useful to do in an AR application. The relative pose of the marker has been detected, allowing the marker to be replaced by an associated hologram by a 3D application.

While the detection is quite stable, the ArUco SPE implements a configurable filter to help eliminate occasional artifacts, especially regarding the blue (z) axis, which can swing around quite a bit under some circumstances due to the pose ambiguity problem. The trick is to tune the filter to eliminate any residual pose jitter while maintaining adequate response to headset movement.
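As an illustration of the general idea (not the actual SPE code), a simple exponential smoother over translation and rotation using scipy might look like this:

```python
# Sketch of a pose jitter filter: exponentially smooth the translation and
# slerp toward the new rotation. The blend factor is just an example; the real
# ArUco SPE filter is configurable.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

class PoseFilter:
    def __init__(self, alpha=0.3):
        self.alpha = alpha            # 0 = frozen, 1 = no filtering
        self.tvec = None
        self.rot = None

    def update(self, rvec, tvec):
        new_rot = Rotation.from_rotvec(np.asarray(rvec).ravel())
        tvec = np.asarray(tvec).ravel()
        if self.tvec is None:
            self.tvec, self.rot = tvec, new_rot
        else:
            self.tvec = (1 - self.alpha) * self.tvec + self.alpha * tvec
            slerp = Slerp([0.0, 1.0], Rotation.concatenate([self.rot, new_rot]))
            self.rot = slerp(self.alpha)
        return self.rot.as_rotvec(), self.tvec
```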

One challenge here is management of camera intrinsic parameters. In this case, I was using a Logitech C920 webcam for which calibration intrinsics had been determined using a version of the ChArUco calibration sample here. It wouldn’t be hard for the CUVCCam SPE to include camera intrinsic parameters in the JSON associated with each frame, assuming it could detect the type of UVC camera and pick up a pre-determined matrix for that type. Whether that’s adequate is TBD. In other situations, where the video source is able to supply calibration data, the problem goes away. Anyway, more work needs to be done in this area.
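For illustration, per-frame intrinsics metadata might look something like this (field names and values are hypothetical, not the actual CUVCCam JSON):

```python
# Hypothetical example of camera intrinsics carried in per-frame JSON metadata.
# The C920-like values shown are illustrative only.
frame_metadata = {
    "timestamp": 1590000000.0,
    "cameraIntrinsics": {
        "width": 1920,
        "height": 1080,
        "fx": 1400.0, "fy": 1400.0,                 # focal lengths in pixels
        "cx": 960.0,  "cy": 540.0,                  # principal point
        "distortion": [0.05, -0.1, 0.0, 0.0, 0.0],  # k1, k2, p1, p2, k3
    },
}
```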

Since rt-ai stream processing networks (SPNs) can be integrated with SHAPE via the Conductor SPE (an example of the Conductor is here), an AR headset running the SHAPE application could stream front-facing video to the ArUco SPN, which would then return the relative pose of detected markers that have previously been associated with SHAPE virtual objects. This would allow the headset to correctly instantiate the SHAPE virtual objects in the space and avoid the problems of relying on inside-out tracking alone (such as in a spatial environment with a repeating texture that prevents unique identification).

Extreme edge depth video processing: Intel L515 LiDAR + Raspberry Pi and Stereolabs ZED + Jetson Nano

Depth cameras are an important component of rt-ispace but things just aren’t going to scale if each one needs a server with a GPU just to generate useful data. This means that the extreme edge, consisting of hopefully low-cost components that can be widely distributed, needs to be able to interface to depth cameras and make the data available to the wider network.

I have been testing with a Stereolabs ZED camera connected to a Jetson Nano and an Intel L515 LiDAR connected to a Raspberry Pi 4. The depth video stream generated by the rt-ai capture code is 1280 x 720 pixels, JPEG-encoded, along with uncompressed 640 x 360 16-bit depth data, with a target frame rate of 15fps. Both systems seem quite capable of capturing and transmitting the data streams as shown above. The rt-ai design being used is this:

The rtai0 node is the Raspberry Pi 4. In this case, the depth video streams from the cameras are not being processed further on the extreme edge systems. The depth data views, which display data coming directly from the extreme edge systems, show that the Jetson Nano is generating frames at the target rate while the Raspberry Pi 4 is achieving about half that. Both are usable rates for many applications.
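As a rough sketch of how a combined depth video frame like this might be packaged for the network (the framing and header shown are illustrative, not the actual rt-ai wire format):

```python
# JPEG-compress the 1280 x 720 color image and append the uncompressed
# 640 x 360 16-bit depth map, preceded by a small JSON header.
import cv2
import json
import numpy as np

color = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in for a captured color frame
depth = np.zeros((360, 640), dtype=np.uint16)      # stand-in for the depth map (millimeters)

ok, jpeg = cv2.imencode(".jpg", color, [cv2.IMWRITE_JPEG_QUALITY, 80])
depth_bytes = depth.tobytes()

header = json.dumps({"jpegLength": len(jpeg),
                     "depthWidth": 640, "depthHeight": 360,
                     "depthFormat": "uint16"}).encode()
message = len(header).to_bytes(4, "big") + header + jpeg.tobytes() + depth_bytes
```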

The depth video frames are also passed to the OpenPoseGPU Stream Processing Element (SPE). This is an implementation of OpenPose that uses the desktop GPU (a GTX 1080 Ti) to perform pose estimation. The OpenPoseGPU SPE can work with standard video streams but, if given a depth video stream, will work out the depth of each identified joint and add that to the generated metadata.
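Conceptually, the per-joint depth lookup is straightforward – something along these lines, scaling joint pixel coordinates from the color frame down to the smaller depth map (a sketch, not the actual OpenPoseGPU code; the metadata field names are illustrative):

```python
# Add a depth value to each detected joint by sampling the 16-bit depth map at
# the joint's (scaled) pixel location.
import numpy as np

def add_joint_depths(joints, depth, color_size=(1280, 720)):
    """joints: list of dicts with 'x' and 'y' pixel coordinates in the color frame."""
    sx = depth.shape[1] / color_size[0]
    sy = depth.shape[0] / color_size[1]
    for joint in joints:
        dx = min(int(joint["x"] * sx), depth.shape[1] - 1)
        dy = min(int(joint["y"] * sy), depth.shape[0] - 1)
        joint["depth_mm"] = int(depth[dy, dx])   # 0 usually means no valid depth
    return joints
```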

The total throughput of the OpenPoseGPU SPE is around 14fps. As can be seen in the rt-ai design, the depth video streams are multiplexed into the OpenPoseGPU SPE so that this capability is being shared between the two streams. The FanOut SPE separates the output streams which are then sent to viewers. Due to the limited throughput of the OpenPoseGPU SPE, the data streams are reduced in frame rate by a factor of 2.

So this design, where OpenPose processing is offloaded from the extreme edge, works fine but it would be far more interesting to do this at the extreme edge.

The screen capture above shows pose estimation running at the extreme edge using an Intel NCS 2 to run inference as implemented by the OpenPoseVINO SPE on the rtai0 Raspberry Pi 4 node. This does work pretty well but can only achieve 2fps. This might be ok for some applications but it would be nice to get to around 10fps.

I did also try running trt_pose on the Jetson Nano for extreme edge pose estimation there, but this was not successful. It may be that trying to run the ZED camera and trt_pose on the same Nano is just asking too much. Moving to a Xavier NX would probably make sense as it has double the memory and more power in general, but it is a fair bit more expensive than the Nano and so somewhat less relevant to the extreme edge application.

Work is now moving on to a new architecture using distributed inference to relieve the load on the extreme edge while still achieving usable pose estimation frame rates.

Embedding AI inference in the Spatial Networking Cloud

rt-ai is a system for graphically composing edge AI stream processing networks, either distributed across multiple processing nodes or all running on a single node. In the latter case, shared memory can be used for transfers between the functional blocks, making it almost as efficient as monolithic code. So when it comes to embedding AI inference in a Spatial Networking Cloud (SNC), rt-ai makes perfect sense. However, the underlying network styles are completely different – SNC uses a highly dynamic set of multicast and end-to-end virtual links whereas rt-ai uses static MQTT or shared memory links. Each makes sense for its application, so it is necessary to create a bridge between the two worlds.

Right now, bridging is done (for video streams) with the GetSNCVideo and PutSNCVideo stream processing elements (SPEs) that can be added to any video stream processing network (SPN). GetSNCVideo grabs SNC video frames from the configured stream source and then acts as an rt-ai source for the downstream SPEs. Once processing has been completed, the frames can be re-injected into SNC using the PutSNCVideo SPE. There can be similar bridges for sensor or any other type of data that needs to be passed through an rt-ai SPN.

Originally, rt-ai had its own SPEs for collecting sensor data but this led to quite a bit of duplication between rt-ai and SNC. The embedding technique completely removes the need for this duplication as rt-ai SPNs can hook into any SNC data stream, no matter what hardware generated it.

The screen capture above shows an example that I am using as part of the driveway detection system that I have been running for quite a long time now to detect vehicles or people moving around the driveway – this post describes the original system. The heart of this is an NCS 2 inference engine with some post-processing code that generates email alerts when something has been detected. All of the SPEs in this case are running on the same Raspberry Pi 4, which is humming along nicely, running inference on a 1280 x 720 10fps video stream. Now that this SPN has been embedded in SNC, it is possible to save all of the annotated video using standard SNC storage if required, or else further process and add to the metadata with anything that connects to SNC.

rt-ai SPNs can be used to create synth modules (basically SPN macros) that can be replicated any number of times and individually configured to process different streams. Alternatively, a single SPN can process data from multiple SNC video streams using an SNC fan out SPE, similar to this one.

So what does this do for rt-ispace? The whole idea of rt-ispace is that ubiquitous sensing and other real-time data streams are collected in SNC, AI inference distills the raw streams into meaningful data and then the results are fed to SHAPE for integration into real world augmentations. Embedding rt-ai SPNs in SNC provides the AI data distillation in a highly efficient and reusable way.

Using shared memory for rt-ai inter-SPE transfers

The screen capture above couldn’t have been obtained previously as it shows uncompressed (RGB888) video being passed between rt-ai SPEs on the same node (a Jetson Nano in this case). The CVideoView window shows the output of a simple network that uses the CSSDJetson SPE to classify objects; it also computes the frame rate and latency of received frames. The source of the frames is a Logitech C920 webcam running at 1280 x 720, 30fps. It shows that the latency is around 128ms at around 15fps.

This screen capture shows what happens when shared memory isn’t used. Actually, the latency here is misleading: it seems to be the link from the CUVCCam SPE to the MQTT server that is causing the bottleneck when running uncompressed video. The latency climbs monotonically until there is no memory left, as there is no throttling on that interface – normally it isn’t a problem.

There doesn’t seem to be much benefit when passing smaller messages between SPEs.

The screen capture above shows shared memory being used when transferring JPEG frames. The one below is with shared memory support turned off.

This just shows that bouncing off the MQTT server within the same node is pretty efficient, at least when compared to the latency of the inference.

Being able to pass large messages around efficiently, even if only point to point within the same node, is quite a step forward by itself. For example, it makes it practical to create networks that pass RGBD frames around.

Shared memory support in rt-ai2 uses the Qt QSharedMemory and QSystemSemaphore wrappers to keep things simple. When a design is generated, rtaiDesigner determines whether shared memory has been enabled for the network, whether the publisher and subscriber are on the same node and whether the connection is point to point (i.e. exactly one subscriber). If so, the publisher and subscriber SPEs are told to use shared memory instead of MQTT for that particular connection. The SPE configuration file for the publisher SPE also includes the shared memory slot size to use and how big the pending transmission queue should be. At the moment, the system always uses three shared memory slots, forming a rotating buffer. The shared memory slots are created by the publisher and attached by the subscriber.

To minimize latency, every time the publisher places a new message in the next shared memory slot, it releases a QSystemSemaphore to unblock a thread in the subscriber, which can then extract the message, free the shared memory slot and process the received message.
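A highly simplified publisher-side sketch of this mechanism, using the same Qt wrappers (via PyQt5 here), looks something like the following. The slot naming, sizes and framing are illustrative, not the actual rt-ai2 layout, and a real implementation would also write the message length into the slot.

```python
# Three rotating QSharedMemory slots plus a QSystemSemaphore used to wake the
# subscriber. The publisher creates the slots; the subscriber attaches to the
# same keys and blocks on the semaphore.
from PyQt5.QtCore import QSharedMemory, QSystemSemaphore

SLOT_COUNT = 3
SLOT_SIZE = 4 * 1024 * 1024          # example slot size from the SPE configuration

slots = []
for i in range(SLOT_COUNT):
    shm = QSharedMemory(f"rtai_link_slot_{i}")
    shm.create(SLOT_SIZE)            # publisher creates each slot
    slots.append(shm)

sem = QSystemSemaphore("rtai_link_sem", 0, QSystemSemaphore.Create)

next_slot = 0

def publish(message: bytes):
    global next_slot
    shm = slots[next_slot]
    shm.lock()
    shm.data()[:len(message)] = message      # copy the message into the slot
    shm.unlock()
    sem.release()                            # unblock the subscriber thread
    next_slot = (next_slot + 1) % SLOT_COUNT
```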

This implementation of shared memory seems to work very well and is highly reliable. In principle, it could be extended to support multiple subscribers by replicating the shared memory slot structure for each subscriber.

Jetson Nano SSD-Mobilenet-v2 SPE for rt-ai

Following on from the earlier work with the Jetson Nano, the SSD-Mobilenet-v2 model is now running as an rt-ai Stream Processing Element (SPE) for Jetson and so is fully integrated with the rt-ai system. Custom models created using transfer learning can also be used – it’s just a case of setting the model name in the SPE’s configuration and placing the required model files on the rt-ai data server. Since models are automatically downloaded at runtime if necessary, it’s pretty trivial to change the model being used on an existing Stream Processing Network (SPN).
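As a purely hypothetical illustration of what that configuration change amounts to (the key names are not the actual rt-ai schema):

```python
# Illustrative SPE configuration fragment - swapping models is just a matter of
# changing modelName; the model files live on the rt-ai data server and are
# downloaded at runtime if not already cached locally.
spe_config = {
    "modelName": "ssd-mobilenet-v2-custom",   # name of the model on the data server
    "confidenceThreshold": 0.5,
    "inputWidth": 1280,
    "inputHeight": 720,
}
```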

The screen capture above shows the rt-ai design that generated the implementation. Here I am using the UVCCam SPE so that the video is sourced from a webcam, but any of the other rt-ai video sources (such as RTSPCam) could be used, simply by replacing the camera SPE in the design using the graphical editor – in fact, this design originally used RTSPCam.

Using 1280 x 720 video frames, the SSDJetson SPE processes around 17fps. This is not bad but less than the 21fps achieved by the monolithic example code. The problem is that, in order to support one-to-many and many-to-one links in heterogeneous, multi-node, graphically designed networks, rt-ai currently uses MQTT brokers to move data and provide multicast as necessary. Even when the broker and the SPEs are running on the same node, this is obviously less efficient than pointer passing within monolithic code.

This “inefficiency of generality” isn’t really visible on powerful x86 machines but has an impact on devices like the Jetson Nano and Raspberry Pi. The solution to this is to recognize such local links and side-step the MQTT broker interface using shared memory. This optimization will be done automatically in rtaiDesigner when it generates the configurations for each SPE in an SPN, flagging appropriate nodes as sources or sinks of shared memory links when both source and sink SPEs reside on the same node.

Shared nothing – sometimes being selfish is the way to go

Lock-free code is all the rage these days but it’s not just a fad. Having recently quantified the performance impact of a single lock on shared memory, it’s easy to understand why eliminating locks (and indeed any other kind of kernel interaction) is the key to high performance.

A logical consequence of this is that threads must share no state (memory, disk or anything else) with any other thread unless it can be done in a safe manner without requiring synchronization. While there are some patterns that can be used for this, in general the solution is the shared nothing (or sharded) architecture where each thread works completely independently.

Coupled with core-locked threads, shared nothing architectures are capable of extracting the last drop of performance out of the underlying hardware. Suddenly that multi-core CPU looks like a very loosely coupled bunch of bare-metal processors.

One core == one thread

Back in the dark ages, when CPUs only had one, two or maybe four cores, the idea of dedicating an entire core to a single thread was ridiculous. Then it became apparent that the only way to scale CPU performance was to integrate more cores onto a single CPU chip. People started wondering how to use all these cores in a meaningful way without getting bogged down in delays from cache coherency, locks and other synchronization issues.

Turns out the answer may well be to hard-allocate threads to cores – just one thread locked into each core. This means that almost all of an application can be free of kernel interaction. This is how DPDK gets its speed for example. It uses user space polling to minimize latency and maximize performance.
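To give a flavour of the pattern (this is just the general idea on Linux, not DPDK itself, and the worker/queue names are placeholders):

```python
# One worker per core: pin each worker process to a single core and busy-poll
# for work instead of blocking in the kernel. The pinned core sits at 100%
# even when idle, which is the trade-off described above.
import multiprocessing
import os
import queue

def process_item(item):
    pass                                        # placeholder for the real work

def worker(core_id, work_queue):
    os.sched_setaffinity(0, {core_id})          # Linux-only: lock this process to one core
    while True:
        try:
            item = work_queue.get_nowait()      # poll rather than block
        except queue.Empty:
            continue
        process_item(item)

if __name__ == "__main__":
    workers = []
    for core in range(2, os.cpu_count()):       # leave cores 0 and 1 for the OS
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=worker, args=(core, q))
        p.start()
        workers.append((p, q))
```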

I have been running some tests using one thread per core with DPDK and lock-free shared memory links. So far, on my old i7-2700K dev machine (with another machine generating test data over a 40Gbps link), I have been seeing over 16Gbps of throughput through DPDK into the shared memory link using a single core without even trying to optimize the code. It’s kind of weird seeing certain cores holding at 100% continuously, even if they are doing nothing, but this is the new reality.

Jetson Nano and rt-ai

The Jetson Nano is an obvious platform for rt-ai to support, to go with the existing Intel NCS2 and Coral edge platforms. One nice plus is that the Jetson Nano comes basically ready to go, all set up for inference.

The screen capture above shows the Nano running the detectnet-camera example code, using a webcam as the source generating 1280 x 720 frames and SSD-Mobilenet-v2 as the model. Performance was not bad at 21fps running at 10W and 16fps running at 5W. The heatsink did get very hot in a pretty short space of time, however!
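For reference, the detectnet-camera example boils down to something like this with the jetson-inference Python bindings (the exact video API and device path vary between jetson-inference releases):

```python
# Run SSD-Mobilenet-v2 object detection on webcam frames using the
# jetson-inference Python bindings (TensorRT inference on the Nano's GPU).
import jetson.inference
import jetson.utils

net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)
camera = jetson.utils.videoSource("/dev/video0")      # V4L2 webcam
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()
    detections = net.Detect(img)                      # list of bounding boxes + classes
    display.Render(img)
    display.SetStatus("{:.0f} FPS".format(net.GetNetworkFPS()))
```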

Installing the rt-ai runtime was no problem at all and it was easy to utilize the H.264 accelerated pipeline in rt-ai’s RTSP camera capture module. The screen capture above shows this running along with a viewer, demonstrating basic rt-ai functionality.
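As an illustration of the kind of accelerated pipeline involved (not the actual rt-ai capture code), an RTSP H.264 stream can be decoded in hardware on the Nano and pulled into OpenCV roughly like this; the element names such as nvv4l2decoder come from the Jetson L4T GStreamer plugins and the camera URL is a placeholder:

```python
# Hardware-accelerated H.264 RTSP decode on the Jetson Nano via GStreamer,
# delivering BGR frames to Python through OpenCV's appsink.
import cv2

RTSP_URL = "rtsp://camera.local:554/stream"   # placeholder camera URL

pipeline = (
    f"rtspsrc location={RTSP_URL} latency=200 ! "
    "rtph264depay ! h264parse ! nvv4l2decoder ! "
    "nvvidconv ! video/x-raw,format=BGRx ! "
    "videoconvert ! video/x-raw,format=BGR ! appsink drop=true"
)

cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # frame is now a regular BGR numpy array ready for further processing
```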

Next up is to roll the detection code into an rt-ai Stream Processing Element (SPE). This will generate identical metadata to the existing SSD detectors, allowing full compatibility between server GPU, Jetson, NCS 2 and Coral SSD detectors.