The definitions were a bit of a mess and there wasn't even a fall back to
__builtin_clz() on compilers supporting it. Now we instead define a macro
for each implementation that is set on an arch-dependent case by case,
and add the fall back ones only when not defined. This also allows the
flsnz8() to automatically fall back to the 32-bit arch-specific version
if available. This shows a consistent 33% speedup on arm for strings.
This is cbtree commit c6075742e8d0a6924e7183d44bd93dec20ca8049.
This is ebtree commit f452d0f83eca72f6c3484ccb138d341ed6fd27ed.