Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. To convert to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration which is required to get best accuracy. See this page for details of how to use calibration data in iOS for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data and it seems to work but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
vFOV = atan(height / width * tan(hFOV))
tanHalfHFOV = tan(hFOV / 2)
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
hFOV = cp.h_fov;
vFOV = cp.v_fov;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this seems to be the perpendicular distance from the center of the camera to the plane defined by the target point that is perpendicular to the camera look at point – the yellow plane in the diagram. Other types of depth sensors might give the radial distance from the center of the camera to the hit point which will obviously require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the reduced angles that correspond to the horizontal and vertical angle components between the ray from the camera to the screen plane hit point and the ray that is the camera’s look at point. – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialZ, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and in the camera’s coordinate system. To convert the camera’s coordinate system to world coordinates is a standard operation given the camera’s pose in the world space.