Back in the dark ages, when CPUs only had one, two or maybe four cores, the idea of dedicating an entire core to a single thread was ridiculous. Then it became apparent that the only way to scale CPU performance was to integrate more cores onto a single CPU chip. People started wondering – how to use all these cores in a meaningful way without getting bogged down in delays from cache coherency, locks and other synchronization issues.
Turns out the answer may well be to hard-allocate threads to cores – just one thread locked into each core. This means that almost all of an application can be free of kernel interaction. This is how DPDK gets its speed for example. It uses user space polling to minimize latency and maximize performance.
I have been running some tests using one thread per core with DPDK and lock-free shared memory links. So far, on my old i7-2700K dev machine (with another machine generating test data over a 40Gbps link), I have been seeing over 16Gbps of throughput through DPDK into the shared memory link using a single core without even trying to optimize the code. It’s kind of weird seeing certain cores holding at 100% continuously, even if they are doing nothing, but this is the new reality.