OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints is what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose using RGBD data from a StereoLabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.
The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.
Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.
The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as this is essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).
This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.