Shared nothing – sometimes being selfish is the way to go

Lock-free code is all the rage these days but it’s not just a fad. Having recently quantified the performance impact of a single lock on shared memory, it’s easy to understand why eliminating locks (and indeed any other kind of kernel interaction) is the key to high performance.

A logical consequence of this is that threads must share no state (memory, disk or anything else) with any other thread unless it can be done in a safe manner without requiring synchronization. While there are some patterns that can be used for this, in general the solution is the shared nothing (or sharded) architecture where each thread works completely independently.

Coupled with core-locked threads, shared nothing architectures are capable of extracting the last drop of performance out of the underlying hardware. Suddenly that multi-core CPU looks like a very loosely coupled bunch of bare-metal processors.

One core == one thread

Back in the dark ages, when CPUs only had one, two or maybe four cores, the idea of dedicating an entire core to a single thread was ridiculous. Then it became apparent that the only way to keep scaling CPU performance was to integrate more cores onto a single chip, and people started wondering how to use all these cores in a meaningful way without getting bogged down by cache coherency, locks and other synchronization issues.

It turns out the answer may well be to hard-allocate threads to cores – just one thread locked to each core. This means that almost all of an application can be free of kernel interaction. This is how DPDK gets its speed, for example: it uses user space polling to minimize latency and maximize performance.
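DPDK handles core assignment itself via its EAL core mask, but just to illustrate the basic idea of core pinning, here is a minimal Python sketch for Linux (the core number is an arbitrary choice):

import os

# pin this process (pid 0 means "self") to core 3 - an arbitrary choice for
# illustration; sched_setaffinity restricts the kernel scheduler to the given
# set of cores (Linux only)
os.sched_setaffinity(0, {3})

print("running on cores:", os.sched_getaffinity(0))

# a user space polling loop would then spin here on core 3, never making a
# blocking kernel call

A real shared nothing design would do something like this for every worker thread, each pinned to its own core.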

I have been running some tests using one thread per core with DPDK and lock-free shared memory links. So far, on my old i7-2700K dev machine (with another machine generating test data over a 40Gbps link), I have been seeing over 16Gbps of throughput through DPDK into the shared memory link using a single core, without even trying to optimize the code. It’s kind of weird seeing certain cores holding at 100% continuously even when they have no real work to do (polling never sleeps), but this is the new reality.

Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. Converting to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration, which is required for best accuracy. See this page for details of how to use calibration data on iOS, for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data and it seems to work, but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
vFOV = 2 * atan(height / width * tan(hFOV / 2))
tanHalfHFOV = tan(hFOV / 2) 
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
hFOV = cp.h_fov * M_PI / 180.0;   // h_fov and v_fov are reported in degrees
vFOV = cp.v_fov * M_PI / 180.0;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point’s (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this seems to be the perpendicular distance from the camera to the plane that contains the target point and is perpendicular to the camera’s look-at direction – the yellow plane in the diagram. Other types of depth sensor may instead give the radial distance from the camera to the hit point, which would require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the horizontal and vertical components of the angle between the ray from the camera to the screen plane hit point and the camera’s look-at direction – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents calculated previously:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and in the camera’s coordinate system. Converting from the camera’s coordinate system to world coordinates is a standard operation given the camera’s pose in world space.
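Putting the pieces together, this is roughly what the whole conversion looks like in Python – not code from either app, just the same calculation restated as a single function, with depthBuffer, width, height and hFOV (in radians) assumed as inputs:

import math

def screenToSpatial(x, y, depthBuffer, width, height, hFOV):
    # the vertical half-FOV tangent follows from the aspect ratio
    tanHalfHFOV = math.tan(hFOV / 2)
    tanHalfVFOV = height / width * tanHalfHFOV

    # assume the depth buffer holds the perpendicular distance for each pixel
    spatialZ = depthBuffer[y][x]

    # normalized offset of the hit point from the image center, scaled by the FOV tangents
    tanThetaX = (x - width / 2) / (width / 2) * tanHalfHFOV
    tanThetaY = (y - height / 2) / (height / 2) * tanHalfVFOV

    spatialX = spatialZ * tanThetaX
    spatialY = spatialZ * tanThetaY
    return spatialX, spatialY, spatialZ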

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2


Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine, but the way the .pb file is generated is a bit awkward, so I created a simple script to do it:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names
    
def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    # build the YOLOv3 graph and load the original darknet weights into it
    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    boxes = detections_boxes(detections)

    # run the weight loading ops, freeze the graph at the 'concat_1' output node
    # and write it out as yolo_v3.pb
    with tf.Session() as sess:
        sess.run(load_ops)
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':
    tf.app.run()

This has the important filenames hardcoded – you just need to put yolov3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:

python3 script.py

and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The --input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:

~/intel/computer_vision_sdk/bin/setupvars.sh

in the same terminal session that the demo will be run in, otherwise CPU mode will not work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn’t handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underscores. Otherwise it thinks a mouse is a donut and a monitor a dog, which is a little confusing.
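Something like this trivial Python snippet (filenames hardcoded, as before) does the conversion:

# copy coco.names to yolo_v3.labels, replacing spaces with underscores so the
# demo's label file reader doesn't split multi-word labels
with open('coco.names') as src, open('yolo_v3.labels', 'w') as dst:
    for line in src:
        dst.write(line.rstrip('\n').replace(' ', '_') + '\n')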

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding --data_type FP16 to the mo_tf.py command line fixes that but unfortunately it reports that the NCS 2 doesn’t support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image.resize_nearest_neighbor call into a tf.image.resize_bilinear call (by editing yolo_v3.py in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR. This worked fine in CPU mode but it still failed to run on the NCS 2, although with a different error.


Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

*** Update: YOLOv3 does now work on the NCS 2 using the latest OpenVINO release.

Sending and receiving binary data using JSON encoding, Python and MQTT

I really like using JSON encoding as a way of transferring messages between processes as it is machine and language independent. Plus, it is very well suited to stream processing networks (such as rt-ai Edge) as arbitrary fields can be added to existing JSON messages and passed along. Contrast this with compiled IDLs which typically have no flexibility whatsoever.

One problem though is that binary data cannot be included in JSON messages directly. Typically base64 encoding is used to convert binary data into text. However, this is inefficient, especially in a stream processing network where base64 decoding and encoding might have to be done several times.

There are a variety of modifications to JSON around, but it is very simple to just prefix the JSON with its length and append the binary data to the end, forming a complete message that can be transferred via MQTT for example.

In Python, an MQTT message can be published like this:

    import json
    import struct
    ...
    def publish(topic, jsonData, binData = None):
        # 4-byte big-endian length of the JSON part, then the JSON itself, then the binary data
        jsonDump = json.dumps(jsonData).encode('utf-8')
        payload = struct.pack('>I', len(jsonDump)) + jsonDump
        if binData is not None:
            payload += binData
        MQTTClient.publish(topic, payload)
        ...

Here, jsonData contains the normal JSON message text, binData contains the binary data to be sent along with it. To receive the message, use something like this:

    import json
    import struct
    ...
    def onMessage(client, userdata, message):
        # the first four bytes give the length of the JSON part
        jsonLength = struct.unpack('>I', message.payload[0:4])[0]
        jsonData = json.loads(message.payload[4:4+jsonLength].decode('utf-8'))
        binData = message.payload[4+jsonLength:]
        ...
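For completeness, this is a minimal way to wire the two functions up with paho-mqtt – the broker address, port, topic name and file name here are just placeholders:

    import paho.mqtt.client as mqtt

    MQTTClient = mqtt.Client()
    MQTTClient.on_message = onMessage
    MQTTClient.connect("localhost", 1883)   # placeholder broker address and port
    MQTTClient.subscribe("video")           # placeholder topic name

    # send a JSON header plus an arbitrary binary payload (a Jpeg frame, say)
    with open("frame.jpg", "rb") as f:
        publish("video", {"type": "jpeg", "width": 1280, "height": 720}, f.read())

    MQTTClient.loop_forever()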

Using TensorFlow for things other than machine learning

TensorFlow provides a very convenient dataflow graph framework for not just machine learning applications but really anything where data goes through a number of processing stages. The great thing about using TensorFlow for this is that all the GPU and scaling capabilities are potentially available, along with a Python API for added convenience.

To test this out, I created a simple Python script to act as an image processor that can be inserted into a Jpeg video stream, using MQTT as a way of moving the data around. The script uses TensorFlow to shrink each frame in the stream by a factor of two (using average pooling) and then perform simple edge detection using a discrete Laplacian, implemented as a 2-D convolution. Jpeg encoding and decoding are also performed using TensorFlow functions.
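The core of the graph looks something like this – a simplified, single-image sketch using the TF 1.x API rather than the actual script, working on a grayscale version of the frame:

import numpy as np
import tensorflow as tf

# decode the incoming Jpeg into a single-channel float image with a batch dimension
jpegIn = tf.placeholder(tf.string)
image = tf.cast(tf.image.decode_jpeg(jpegIn, channels=1), tf.float32)
image = tf.expand_dims(image, 0)

# shrink by a factor of two using 2x2 average pooling
pooled = tf.nn.avg_pool(image, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# discrete Laplacian as a 3x3 convolution kernel for simple edge detection
lapKernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float32).reshape(3, 3, 1, 1)
edges = tf.nn.conv2d(pooled, tf.constant(lapKernel), strides=[1, 1, 1, 1], padding='SAME')

# clip back to 8-bit range and re-encode as Jpeg
jpegOut = tf.image.encode_jpeg(tf.cast(tf.clip_by_value(edges[0], 0, 255), tf.uint8))

with tf.Session() as sess:
    with open('frame.jpg', 'rb') as f:
        result = sess.run(jpegOut, feed_dict={jpegIn: f.read()})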

The frame rate tops out at around 17 frames per second on my i7-2700K/GTX 970 machine (the video source frame rate was 30 frames per second). I am guessing that there is a fixed per-run overhead in TensorFlow – it’s no doubt highly inefficient to run the graph with one image at a time.

There’s no rocket science here and the functionality is trivial. However, it is interesting to think how else TensorFlow can be used. Given the incredible interest and the likelihood of dedicated hardware acceleration one day, there might be considerable value in mapping problems onto TensorFlow graphs.

Setting up an NVMe SSD on Ubuntu

I need a fast PC for some current work and decided that it’d be nice to use an NVMe SSD to speed up storage. I am using an Intel 750 PCIe 3.0 x4 add-in card as it’s a simple way to go. Ubuntu has the NVMe driver built in so it came up straight away as /dev/nvme0n1. One small issue is that gparted doesn’t seem to think this is a disk, so it’s a case of reverting to old school command line stuff to get it set up for use.


Deep Gaussian Processes

I have a project that requires identifying sequences of signals and classifying them in various ways and I have been looking for good techniques that could be applied to the problem. I came across a paper on Deep Gaussian Processes. They are somewhat related to deep neural networks but have the advantage of requiring a lot less training data. Since the generation of high quality training data is a big issue with DNNs, this is quite appealing. There are some GitHub repos with Python code to make getting started easier. The screenshot is from a demo in the deepGPy repo. Hopefully it will do what I want but, at the very least, I am learning some new mathematics.

Using ethtool to prevent receive-side Ethernet IP frame coalescing

Some Linux Ethernet drivers have an enhancement that reassembles the fragments of a fragmented IP packet before handing the data off to userland. This is great unless you are trying to build a “lumpy cable” analyzer, as I happen to be doing right now. For this application, the frames need to be left unchanged as they are passed across the analyzer.

ethtool provides a way to turn off coalescing. Enter:

sudo ethtool -C eth0 rx-usecs 0

to turn off receive coalescing for eth0 for example. If it’s a dedicated system, this line (without the sudo) can be added to /etc/rc.local so that it is executed on every restart.