Easy dynamic dispatch using GLIBC Hardware Capabilities
TL;DR With GLIBC 2.33+, you can build a shared library multiple times, each time targeting a different microarchitecture level, and the dynamic linker/loader will pick the highest version supported by the current CPU. For example, with the layout below, on a Ryzen 9 5900X, x86-64-v3/libfoo0.so would be loaded:
/usr/lib/glibc-hwcaps/x86-64-v4/libfoo0.so
/usr/lib/glibc-hwcaps/x86-64-v3/libfoo0.so
/usr/lib/glibc-hwcaps/x86-64-v2/libfoo0.so
/usr/lib/libfoo0.so
Longer Version
GLIBC Hardware Capabilities, or "hwcaps", are an easy, almost trivial way to add a simple form of dynamic dispatch to any amd64 or POWER build, provided that the build target or the compiler's optimizations can make use of certain CPU extensions.
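To make this concrete, here is a minimal sketch of what one build pass might look like, reusing the libfoo0 naming from the TL;DR. The foo_variant() function and the FOO_LEVEL macro are made up for illustration; they are not part of any real library.

/* foo.c -- toy library built once per microarchitecture level.
 * FOO_LEVEL is an illustrative macro set on the command line, e.g.:
 *
 *   gcc -shared -fPIC -O2 -march=x86-64-v3 -DFOO_LEVEL='"x86-64-v3"' \
 *       -o glibc-hwcaps/x86-64-v3/libfoo0.so foo.c
 *
 * Repeat with -march=x86-64-v2 and -march=x86-64-v4 for the other
 * subdirectories, plus a plain -march=x86-64 build for the default path.
 */
const char *foo_variant(void) {
    return FOO_LEVEL;  /* reports which build the loader actually picked */
}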
Mo Zhou pointed me towards this when I was faced with the challenge of creating a performant Debian package for ggml, the tensor library behind llama.cpp and whisper.cpp.
The Challenge
A performant yet universally loadable library needs to make use of some form of dynamic dispatch to leverage the most effective SIMD extensions available on whatever CPU it ends up running on. Last January, when I first started packaging ggml for Debian, ggml did have support for this through its GGML_CPU_ALL_VARIANTS=ON option, but it was limited to amd64. This meant that on all the other architectures that Debian supports, I would have needed to target some ancient baseline, effectively crippling the package there.
Dynamic Dispatch using hwcaps
hwcaps were introduced in GLIBC 2.33 and replace the now-legacy Hardware Capabilities, which were removed in 2.37. The way hwcaps work is delightfully simple: the dynamic linker/loader will look for a shared library not just in the standard library paths, but also in subdirectories thereof of the form glibc-hwcaps/<level>, starting with the highest <level> that the current CPU supports. The levels are predefined; I'm using the amd64 levels below.
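The same level check the loader performs can also be done in your own code. As a sketch, assuming a compiler that accepts the level names in __builtin_cpu_supports() (GCC 12 and newer, I believe), a program can probe the levels in the same order the loader tries the subdirectories:

/* levels.c -- print the highest x86-64 psABI level this CPU satisfies,
 * probing from the highest level down, like the dynamic loader does. */
#include <stdio.h>

int main(void) {
    if (__builtin_cpu_supports("x86-64-v4"))
        puts("x86-64-v4");
    else if (__builtin_cpu_supports("x86-64-v3"))
        puts("x86-64-v3");
    else if (__builtin_cpu_supports("x86-64-v2"))
        puts("x86-64-v2");
    else
        puts("x86-64 (baseline)");
    return 0;
}

On a Ryzen 9 5900X, which has AVX2 but no AVX-512, this would print x86-64-v3, matching the TL;DR above.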
For ggml, this meant that I could simply build the library in multiple passes, each time targeting a different <level>, and install the results into the corresponding subdirectories, which resulted in the following layout (reduced to libggml.so for brevity):
/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v4/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v3/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/glibc-hwcaps/x86-64-v2/libggml.so
/usr/lib/x86_64-linux-gnu/ggml/libggml.so
In practice, this means that on a CPU supporting AVX-512, the linker/loader would load x86-64-v4/libggml.so if it existed, and otherwise continue to look for the other levels, all the way down to the lowest one. On a CPU supporting only SSE4.2, the lookup process would be the same, ending with x86-64-v2/libggml.so.
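One way to observe the outcome at run time is to dlopen() the library by name and ask the link map which file was actually mapped. A small sketch (run it with the library's directory on the search path if the library is not in a default location):

/* which.c -- print which file the dynamic loader actually mapped for a
 * given library name.
 * Build: gcc -o which which.c   (add -ldl on glibc < 2.34)
 * Usage: ./which <soname> */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <library>\n", argv[0]);
        return 1;
    }
    void *handle = dlopen(argv[1], RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    struct link_map *map = NULL;
    /* RTLD_DI_LINKMAP yields the link_map entry; its l_name field holds
     * the path of the file that was actually loaded. */
    if (dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0)
        printf("%s\n", map->l_name);
    dlclose(handle);
    return 0;
}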
All of this was quick to verify with QEMU.
Note that the lowest-level library, targeting x86-64-v1, is not installed into a subdirectory, but into the path where the library would normally have been installed. This has the nice property that on systems not using GLIBC, and thus without hwcaps, package installation will still result in a loadable library, albeit the variant with the worst performance.
And a careful observer might have noticed that in the example above, the library is installed into a private ggml/ directory, so this mechanism also works when the library is found via RUNPATH or LD_LIBRARY_PATH.
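For example, pointing the which.c sketch from above at the private directory should, on a v3-capable CPU, print the glibc-hwcaps/x86-64-v3/ path:

LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ggml ./which libggml.so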
As mentioned above, ggml has its own dynamic-dispatch mechanism in GGML_CPU_ALL_VARIANTS=ON, and Debian's ggml package will soon switch to it, but hwcaps were still quite the useful feature to discover.