Easy dynamic dispatch using GLIBC Hardware Capabilities

TL;DR With GLIBC 2.33+, you can build a shared library multiple times targeting various optimization levels, and the dynamic linker/loader will pick the highest version supported by the current CPU. For example, with the layout below, on a Ryzen 9 5900X, x86-64-v3/libfoo0.so would be loaded:

/usr/lib/glibc-hwcaps/x86-64-v4/libfoo0.so
/usr/lib/glibc-hwcaps/x86-64-v3/libfoo0.so
/usr/lib/glibc-hwcaps/x86-64-v2/libfoo0.so
/usr/lib/libfoo0.so

Longer Version

GLIBC Hardware Capabilities or "hwcaps" are an easy, almost trivial way to add a simple form of dynamic dispatch to any amd64 or POWER build, provided that either the build target or the compiler's optimizations can make use of certain CPU extensions.

Mo Zhou pointed me towards this when I was faced with the challenge of creating a performant Debian package for ggml, the tensor library behind llama.cpp and whisper.cpp.

The Challenge

A performant yet universally loadable library needs to make use of some form of dynamic dispatch to leverage the most effective SIMD extensions available on any given CPU it may run on. Last January, when I first started with the packaging of ggml for Debian, ggml did have support for this through its GGML_CPU_ALL_VARIANTS=ON option, but this was limited to amd64. This meant that on all the other architectures that Debian supports, I would need to target some ancient baseline, thus effectively crippling the package there.

Dynamic Dispatch using hwcaps

hwcaps were introduced in GLIBC 2.33 and replace the (now) Legacy Hardware Capabilities, which were removed in 2.37. The way hwcaps work is delightfully simple: the dynamic linker/loader will look for a shared library not just in the standard library paths, but also in subdirectories thereof of the form hwcaps/<level>, starting with the highest <level> that the current CPU supports. The levels are predefined. I'm using the amd64 levels below.

For ggml, this meant that I simply could build the library in multiple passes, each time targeting a different <level>, and install the result in the corresponding subdirectory, which resulted in the following layout (reduced to libggml.so for brevity):

/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v4/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v3/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v2/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/libggml.so

In practice, this means that on a CPU supporting AVX512, the linker/loader would load x86-64-v4/libggml.so if it existed, and otherwise continue to look for the other levels, all the way down to the lowest one. On a CPU which supported only SSE4.2, the lookup process would be the same, ending with picking x86-64-v2/libggml.so. With QEMU, all of this was quickly verified.

Note that the lowest-level library, targeting x86-64-v1, is not installed to a subdirectory, but to the path where the library would normally have been installed. This has the nice property that on systems not using GLIBC, and thus not having hwcaps available, package installation will still result in a loadable library, albeit the version with the worst performance. And a careful observer might have noticed that in the example above, the library is installed to a private ggml/ directory, so this mechanism also works when using RUNPATH or LD_LIBRARY_PATH.

As mentioned above, Debian's ggml package will soon switch to GGML_CPU_ALL_VARIANTS=ON, but this was still quite the useful feature to discover.