btrfs: add xxhash to fast checksum implementations
The implementation of XXHASH is currently CPU-only but still fast enough
to be considered for synchronous checksumming, like the non-generic
crc32c implementations.
A userspace benchmark comparing it to various implementations (patched
hash-speedtest from btrfs-progs):
  Block size:     4096
  Iterations:     1000000
  Implementation: builtin
  Units:          CPU cycles

        NULL-NOP: cycles:     73384294, cycles/i       73
     NULL-MEMCPY: cycles:    228033868, cycles/i      228,    61664.320 MiB/s
      CRC32C-ref: cycles:  24758559416, cycles/i    24758,      567.950 MiB/s
       CRC32C-NI: cycles:   1194350470, cycles/i     1194,    11773.433 MiB/s
  CRC32C-ADLERSW: cycles:   6150186216, cycles/i     6150,     2286.372 MiB/s
  CRC32C-ADLERHW: cycles:    626979180, cycles/i      626,    22427.453 MiB/s
      CRC32C-PCL: cycles:    466746732, cycles/i      466,    30126.699 MiB/s
          XXHASH: cycles:    860656400, cycles/i      860,    16338.188 MiB/s
The comparison covers the purely software implementation (ref), the
current, outdated implementation accelerated with the crc32q instruction
(NI), the optimized implementations by M. Adler (ADLERSW, ADLERHW;
https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775)
and the best one, taken from the kernel, which uses the PCLMULQDQ
instruction (PCL).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>