Using UWB asset tags to dynamically create rt-ispace augmentations

Essentially, an asset tag is a small device that can be used to locate and instantiate an augmentation in an rt-ispace environment completely dynamically – using UWB for positioning in this case. The augmentation follows the position and orientation of the asset tag, making for a very simple way to implement augmented spaces. If engineered properly, the tag itself could be an extremely simple piece of hardware: essentially a UWB radio along with a MEMS IMU and a battery. Instead of WiFi as in this prototype, pose updates could be sent over the UWB infrastructure to simplify things further. Ideally, these tags would be extremely cheap and could be placed anywhere in a space as a simple way of adding augmentations. The augmentations can be proxy objects (every aspect of a proxy object augmentation can be modified by remote servers) and can be as simple or complex as desired.

There are some similarities and differences with the ArUco marker system for instantiating augmentations. An ArUco marker can provide an ID, but that ID has to be matched with a previously instantiated object carrying the same ID. Asset tags don’t require any pre-configuration like that. Another problem with ArUco markers is that they are very sensitive to occlusion – even a wire running across a marker might make it undetectable. Asset tags are not affected by occlusion and so function correctly in a much wider range of circumstances, although they do require UWB-enabled spaces. In the end, both styles of augmentation instantiation have their place.

Note that the asset tag doesn’t need to contain the actual asset data (although it could if desired). All it needs to do is provide a URL of a repository where the asset (either a Unity assetbundle or a glTF blob) can be found. The asset is then streamed dynamically when it needs to be instantiated. The tag also provides information about where to find function servers in the case of a proxy object. The rt-ispace user app (in this case an iOS app running on an iPad Pro) doesn’t need to know anything about asset tags – they just inject normal-looking (but transient) augmentation updates into the rt-ispace system so that augmentations magically appear. Obviously, this kind of flexibility could easily be abused and, in real life, a proper security strategy would need to be implemented in most cases. For development, though, it’s nice for things to just work!
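To make the idea a little more concrete, here is a rough sketch of the kind of information an asset tag would have to supply in its transient augmentation update. This is purely illustrative – the field names and layout are my own invention, not the actual rt-ispace message format:

#include <string>

// Hypothetical contents of a transient augmentation update injected by an asset tag.
// Field names are illustrative only - this is not the real rt-ispace message format.
struct AssetTagUpdate {
    std::string tagID;          // unique ID of this asset tag
    std::string assetURL;       // repository URL of the Unity assetbundle or glTF blob
    std::string functionURL;    // function server endpoint (only needed for proxy objects)
    float position[3];          // position in the space from UWB ranging (meters)
    float orientation[4];       // orientation quaternion from the MEMS IMU
    double timestamp;           // time of the pose sample
};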

One application that I like is a shared space where people can bring along their virtual creations in the form of asset tags and just place them in the space so that any rt-ispace user there can see them.

Another idea is that items in stores could have rt-ispace asset tags attached to them (like security tags today) so that looking at an item with an AR device would perhaps demonstrate some kind of feature. Manufacturers could supply the asset and function servers, freeing the retail store from having to implement something for every stocked item.

The video above shows how the augmentation tracks the UWB tag around the space and how the IMU controls the augmentation’s orientation. For now, the hardware is a complete hack with multiple components, but it does prove that the concept is viable. The UWB tag (the white box on the floor under the figure’s right foot) controls the location of the augmentation in the physical space. A Raspberry Pi fitted with an IMU provides orientation information and sends the resulting pose via WiFi to the rt-ispace servers. The augmentation is the usual glTF sample, CesiumMan.

Linking AR augmentations to physical space using the ArUco marker system

Following on from the earlier work with ArUco markers, rt-ispace can now associate ArUco markers with augmentations in a space. The image above shows two glTF sample models attached to two different ArUco marker codes (23 and 24 in this case). Since these models are animated, a video also seems appropriate!

The image and video were obtained using an iPad Pro running the rt-ispace app that forms the front end for the rt-ispace system. A new server, EdgeAnchor, receives the AR video stream from the iPad via the assigned EdgeAccess and detects any ArUco markers that may be in view. The video stream also contains the iPad camera intrinsics and the AR camera pose, which allows EdgeAnchor to determine the physical pose of the marker relative to the camera view. The marker detection results are sent back to the iPad app (via EdgeAccess), which then matches the ArUco IDs to instantiated augmentations and calculates the world space pose for each augmentation. There are some messy calculations in there but it actually works very well.

The examples shown are set up to instantiate the augmentation based on a horizontal marker. However, the augmentation configuration allows for a 6-dof offset from the marker, which means that markers can be hung on walls with augmentations either on the walls or out in front of them, for example.
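For reference, the world space pose calculation boils down to chaining rigid transforms. This is a minimal sketch of the idea using OpenCV’s cv::Affine3d – the real rt-ispace code is different (and has to deal with the differing axis conventions between ARKit and OpenCV), and the names here are just placeholders:

#include <opencv2/core.hpp>
#include <opencv2/core/affine.hpp>

// Compose the augmentation's world pose from:
//  - the AR camera's pose in world space (from the iPad's AR session),
//  - the marker's pose relative to the camera (from ArUco detection),
//  - the configured 6-dof offset of the augmentation from the marker.
cv::Affine3d augmentationWorldPose(const cv::Affine3d& cameraToWorld,        // AR camera pose in world space
                                   const cv::Vec3d& markerRvec,              // Rodrigues rotation from ArUco
                                   const cv::Vec3d& markerTvec,              // marker translation in the camera frame
                                   const cv::Affine3d& augmentationToMarker) // configured 6-dof offset
{
    // ArUco's (rvec, tvec) maps marker coordinates into the camera frame
    cv::Affine3d markerToCamera(markerRvec, markerTvec);

    // Chain the transforms: augmentation -> marker -> camera -> world
    return cameraToWorld * markerToCamera * augmentationToMarker;
}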

A single EdgeAnchor instance can be shared among many rt-ispace users as no state is retained between frames, allowing the system to scale very nicely. Also, there is nothing specific to ArUco markers in the design: in principle, EdgeAnchor could support multiple marker types, providing great flexibility. The only requirement is that marker detection results in a 6-dof pose relative to the camera.

Previously, I had been resistant to the use of markers, preferring to use the spatial mapping capabilities of the user device to provide spatial lock and location for augmentations. However, those systems have real limitations, especially where there is very little visual texture or depth variation to act as a natural anchor. Adding physical anchors means that augmentations can be reliably placed even in very featureless spaces, which is a big plus in terms of creating a pleasant user experience.

Adding ArUco marker detection to rt-ai

There are many situations where it is necessary to establish the spatial relationship between a camera in a space and 3D points within the same space. One particular application of interest is the ability to use markers to accurately locate holograms in a space so that AR headset users see the holograms locked in place, even as they look or move around the space. OpenCV includes ArUco marker detection, so that seemed like a good place to start. The screen capture above shows the rt-ai ArUco marker detector identifying the pose of a few example markers.

This is the simple rt-ai test design with the new ArUcoDetect stream processing element (SPE). The UVC camera was running at 1920 x 1080, 30 fps, and the ArUco SPE had no trouble keeping up with this.
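The heart of the detection is only a few OpenCV calls. The code below isn’t the ArUcoDetect SPE itself, just a minimal standalone sketch using the classic (pre-4.7) OpenCV aruco API – the camera matrix, distortion coefficients and marker size are placeholders and would have to come from a real calibration:

#include <opencv2/opencv.hpp>
#include <opencv2/aruco.hpp>

int main()
{
    cv::VideoCapture cap(0);                       // UVC camera
    cv::Ptr<cv::aruco::Dictionary> dictionary =
        cv::aruco::getPredefinedDictionary(cv::aruco::DICT_6X6_250);

    // Placeholder intrinsics - these must come from a real calibration
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) << 1400, 0, 960, 0, 1400, 540, 0, 0, 1);
    cv::Mat distCoeffs = cv::Mat::zeros(5, 1, CV_64F);
    const float markerLength = 0.05f;              // marker side length in meters

    cv::Mat frame;
    while (cap.read(frame)) {
        std::vector<int> ids;
        std::vector<std::vector<cv::Point2f>> corners;
        cv::aruco::detectMarkers(frame, dictionary, corners, ids);

        if (!ids.empty()) {
            std::vector<cv::Vec3d> rvecs, tvecs;
            cv::aruco::estimatePoseSingleMarkers(corners, markerLength,
                                                 cameraMatrix, distCoeffs, rvecs, tvecs);
            cv::aruco::drawDetectedMarkers(frame, corners, ids);
            for (size_t i = 0; i < ids.size(); i++)
                cv::aruco::drawAxis(frame, cameraMatrix, distCoeffs,
                                    rvecs[i], tvecs[i], markerLength * 0.5f);
        }
        cv::imshow("aruco", frame);
        if (cv::waitKey(1) == 27)
            break;
    }
    return 0;
}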

This screen capture is a demonstration of the kind of thing that might be useful to do in an AR application. The relative pose of the marker has been detected, allowing the marker to be replaced by an associated hologram by a 3D application.

While the detection is quite stable, the ArUco SPE implements a configurable filter to help eliminate occasional artifacts, especially regarding the blue (z) axis, which can swing around quite a bit in some circumstances due to the pose ambiguity problem. The trick is to tune the filter to eliminate any residual pose jitter while maintaining adequate response to headset movement.
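The SPE’s actual filter isn’t shown here, but the general idea is low-pass filtering the pose with a configurable smoothing factor. Something along these lines (an exponential moving average on position, with naive component blending and renormalization for the orientation quaternion) illustrates the trade-off between jitter and responsiveness:

#include <cmath>

// Very simple pose smoother: exponential moving average with a configurable
// smoothing factor. Illustration only - the ArUco SPE's filter is more involved.
struct PoseFilter {
    double alpha = 0.2;              // 0..1: smaller = heavier smoothing, slower response
    double pos[3] = {0, 0, 0};
    double quat[4] = {1, 0, 0, 0};   // w, x, y, z
    bool initialized = false;

    void update(const double newPos[3], const double newQuat[4]) {
        if (!initialized) {
            for (int i = 0; i < 3; i++) pos[i] = newPos[i];
            for (int i = 0; i < 4; i++) quat[i] = newQuat[i];
            initialized = true;
            return;
        }
        for (int i = 0; i < 3; i++)
            pos[i] += alpha * (newPos[i] - pos[i]);

        // Blend with whichever of q and -q is nearer to avoid sign flips,
        // then renormalize.
        double dot = 0;
        for (int i = 0; i < 4; i++) dot += quat[i] * newQuat[i];
        double sign = (dot < 0) ? -1.0 : 1.0;
        double norm = 0;
        for (int i = 0; i < 4; i++) {
            quat[i] += alpha * (sign * newQuat[i] - quat[i]);
            norm += quat[i] * quat[i];
        }
        norm = std::sqrt(norm);
        for (int i = 0; i < 4; i++) quat[i] /= norm;
    }
};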

One challenge here is the management of camera intrinsic parameters. In this case, I was using a Logitech C920 webcam for which calibration intrinsics had been determined using a version of the ChArUco calibration sample here. It wouldn’t be hard for the CUVCCam SPE to include camera intrinsic parameters in the JSON associated with each frame, assuming it could detect the type of UVC camera and pick up a pre-determined matrix for that type. Whether that’s adequate is TBD. In other situations, where the video source is able to supply calibration data, the problem goes away. Anyway, more work needs to be done in this area.
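For what it’s worth, the per-frame data the detector actually needs is quite small. A hypothetical structure like the one below (field names are my own, not the CUVCCam JSON format) would be enough to carry the intrinsics with each frame:

// Hypothetical per-frame metadata that a video source SPE could attach to each frame.
// Field names are illustrative only.
struct FrameIntrinsics {
    double fx, fy;       // focal lengths in pixels
    double cx, cy;       // principal point in pixels
    double dist[5];      // k1, k2, p1, p2, k3 distortion coefficients
    int width, height;   // frame size the calibration applies to
};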

Since rt-ai stream processing networks (SPNs) can be integrated with SHAPE via the Conductor SPE (an example of the Conductor is here), an AR headset running the SHAPE application could stream front-facing video to the ArUco SPN, which would then return the relative pose of any detected markers that have previously been associated with SHAPE virtual objects. This would allow the headset to correctly instantiate the SHAPE virtual objects in the space and avoid the problems of relying on inside-out tracking alone (such as in a spatial environment with a repeating texture that prevents unique identification).

Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. Converting to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration, which is required for best accuracy – see this page for details of how to use calibration data in iOS, for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data, and it seems to work, but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

// videoFieldOfView is in degrees - convert to radians
hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
// derive the vertical FOV from the aspect ratio: tan(vFOV/2) = (height/width) * tan(hFOV/2)
vFOV = 2 * atan(height / width * tan(hFOV / 2))
tanHalfHFOV = tan(hFOV / 2)
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
// the ZED SDK reports the field of view in degrees - convert to radians
hFOV = cp.h_fov * M_PI / 180.0;
vFOV = cp.v_fov * M_PI / 180.0;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point’s (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this appears to be the perpendicular distance from the center of the camera to the plane that passes through the target point and is perpendicular to the camera’s look-at direction – the yellow plane in the diagram. Other types of depth sensor might give the radial distance from the center of the camera to the hit point, which would obviously require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the angles that correspond to the horizontal and vertical components of the angle between the ray from the camera to the screen plane hit point and the camera’s look-at ray – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and are expressed in the camera’s coordinate system. Converting from the camera’s coordinate system to world coordinates is a standard operation given the camera’s pose in world space.
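Putting the pieces together, here is a compact sketch of the whole conversion as described above, including the radial-depth variant and the final transform into world space. This is illustrative C++ rather than the code I actually used on the iPad or with the ZED, and the world transform uses OpenCV’s cv::Affine3d just for convenience:

#include <cmath>
#include <opencv2/core.hpp>
#include <opencv2/core/affine.hpp>

// Convert a screen-space hit point plus depth into a 3D point in world space.
// tanHalfHFOV/tanHalfVFOV are precomputed from the camera's field of view and
// cameraToWorld is the camera's pose in world space (e.g. from ARKit or the ZED tracker).
cv::Vec3d screenToWorld(double x, double y, double depth,
                        double width, double height,
                        double tanHalfHFOV, double tanHalfVFOV,
                        const cv::Affine3d& cameraToWorld,
                        bool depthIsRadial = false)
{
    // Tangents of the angles between the hit ray and the look-at ray
    double tanThetaX = (x - width / 2) / (width / 2) * tanHalfHFOV;
    double tanThetaY = (y - height / 2) / (height / 2) * tanHalfVFOV;

    // If the sensor reports radial distance to the hit point rather than the
    // perpendicular distance to the yellow plane, project it first.
    double spatialZ = depthIsRadial
        ? depth / std::sqrt(1.0 + tanThetaX * tanThetaX + tanThetaY * tanThetaY)
        : depth;

    double spatialX = spatialZ * tanThetaX;
    double spatialY = spatialZ * tanThetaY;

    // Transform from the camera's coordinate system into world coordinates
    return cameraToWorld * cv::Vec3d(spatialX, spatialY, spatialZ);
}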