Opsmas 2025 Day 9: crcspeed
I wrote crcspeed in 2014 over about 8 months of figuring out how speed up CRC-Jones-64 and CRC-16-CCITT variants. It got a lot of exposure and inclusion in other popular public and private projects over the years.
a background on crcs
CRC stands for “cyclic redundancy check” and it’s a fairly accurate name. It generates a value for any input bitstream, except the calculation happens entirely serially where, during the calculation, every next calculation depends on the previous bits already calcluated. The CRC systems were designed in the time of slow serial connections with data rates measured in bits per second so burning time calculating checks during incoming data latency arrival delays processing byte-per-byte streaming data wasn’t too much of a burden.
But then things got faster. Do you know how inefficient it is to calculate a value by suming up individual bytes over gigabytes of data? It’s pretty bad.
the first fixes
On modern systems capable of reading 60+ GB/s out of main memory, a standard unoptimized CRC algorithm operates at about 70 MB/s. Quite a bad algorithm for modern times.
You can do “tricks” to make it faster like pre-calculate all single byte values and we boost speed from 70 MB/s to 330 MB/s at the cost of a lookup table being referenced constantly, but still awful for any kind of data check in real time (i’ve seen people use standard CRC implementations to like checksum 100+ GB data snapshot files).
My trick with crcspeed was using the recently (?) discovered or popularized “slicing by 8” method of pre-calculating a single CRC byte table, you use 8 of them for longer pre-calcualted byte sequences, and you get up to 1.5 GB/s of throughput now. Not bad for some math and cache waste going from 70 MB/s to 1.5 GB/s.
but the whale was still waiting.
even more modern speedups
Intel had long ago (2009-ish) wrote a paper on how to use SIMD/SSE instructions for CRC speed up with hand written assembly, but as it goes with things, it wasn’t easy to adapt or build or convert into other systems. They even have implementations and weird libraries of their own.
Except, I needed more. I wanted a full cross-platform SIMD (SSE+NEON) implementation which didn’t seem to exist and even the standard Intel versions weren’t performing the best.
thus, ergo, now we have crcspeed with SIMD support for SSE (Intel) and NEON (ARM) which get about 40+ GB/s in my testing for CRC-64-Jones (30x speedup from previous) and around 10 GB/s for CRC-16-CCITT (7x speedup from previous):
./crc_bench 33333333
===================
CRC Benchmark Suite
===================
SIMD support: YES
Buffer size: 33333333 bytes (31.79 MB)
Iterations: 100
CRC64 Benchmarks:
-----------------
crc64 (bit-by-bit) : 71.96 MB/s, 0.32 cycles/byte
crc64_lookup (table) : 378.39 MB/s, 0.06 cycles/byte
crc64speed (slice-8) : 1432.05 MB/s, 0.02 cycles/byte
crc64_simd (PCLMUL) : 41997.47 MB/s, 0.00 cycles/byte
CRC16 Benchmarks:
-----------------
crc16 (bit-by-bit) : 110.53 MB/s, 0.21 cycles/byte
crc16_lookup (table) : 335.39 MB/s, 0.07 cycles/byte
crc16speed (slice-8) : 1416.56 MB/s, 0.02 cycles/byte
crc16_simd (PCLMUL) : 9426.57 MB/s, 0.00 cycles/byte
Small Buffer Benchmarks (64 bytes):
-----------------------------------
crc64speed (slice-8) : 2277.01 MB/s, 0.01 cycles/byte
crc64_simd (PCLMUL) : 7174.70 MB/s, 0.00 cycles/byte
crc16speed (slice-8) : 2337.62 MB/s, 0.01 cycles/byte
crc16_simd (PCLMUL) : 7006.68 MB/s, 0.00 cycles/byte
Medium Buffer Benchmarks (4KB):
-------------------------------
crc64speed (slice-8) : 1468.52 MB/s, 0.02 cycles/byte
crc64_simd (PCLMUL) : 45002.88 MB/s, 0.00 cycles/byte
crc16speed (slice-8) : 1395.84 MB/s, 0.02 cycles/byte
crc16_simd (PCLMUL) : 11047.09 MB/s, 0.00 cycles/byte
Large Buffer Benchmarks (1MB):
------------------------------
crc64speed (slice-8) : 1457.24 MB/s, 0.02 cycles/byte
crc64_simd (PCLMUL) : 42405.88 MB/s, 0.00 cycles/byte
crc16speed (slice-8) : 1418.09 MB/s, 0.02 cycles/byte
crc16_simd (PCLMUL) : 9431.78 MB/s, 0.00 cycles/byte
===================./crcspeed -n 1 ~/Downloads/Blame\!\ by\ Tsutomu\ Nihei\ Collection.zip
═══════════════════════════════════════════════════════════════════════
Benchmark: /Users/matt/Downloads/Blame! by Tsutomu Nihei Collection.zip
═══════════════════════════════════════════════════════════════════════
File Size: 11.02 GB (11831396389 bytes)
Iterations: 1
Total Data: 11.02 GB per algorithm
CRC64 Results:
Algorithm CRC Value MB/s GB/s Cycles/B Status
────────────── ────────────────── ──────────── ────────── ────────── ──────
crc64 0xfadd0161b519b53e 71.93 0.07 0.318 [PASS]
crc64_lookup 0xfadd0161b519b53e 375.62 0.37 0.061 [PASS]
crc64speed 0xfadd0161b519b53e 1455.17 1.42 0.016 [PASS]
crc64_simd 0xfadd0161b519b53e 41937.87 40.95 0.001 [PASS]
CRC16 Results:
Algorithm CRC Value MB/s GB/s Cycles/B Status
────────────── ────────────────── ──────────── ────────── ────────── ──────
crc16 0x9770 111.92 0.11 0.204 [PASS]
crc16_lookup 0x9770 334.97 0.33 0.068 [PASS]
crc16speed 0x9770 1397.31 1.36 0.016 [PASS]
crc16_simd 0x9770 9268.78 9.05 0.002 [PASS]
Summary:
CRC64: 0xfadd0161b519b53e PASS
CRC16: 0x9770 PASS
Speedup (SIMD vs Table):
CRC64: 28.8x
CRC16: 6.6x
═══════════════════════════════════════════════════════════════════════
stats
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
C 8 3568 2507 548 513
C Header 4 128 64 43 21
-------------------------------------------------------------------------------
Markdown 1 119 0 88 31
|- Haskell 1 126 102 5 19
(Total) 245 102 93 50
===============================================================================
Total 13 3815 2571 679 565
===============================================================================conclusion
Is this the best it can be? Probably not. You can probably use even wider SIMD instructions in more specific ways to rip through the sequences faster, but more advanced features on more advanced CPU architectures are outside the scope of what I have access to for free unpaid globally reusable work resulting in no compensation or glory along the way, womp womp.
any questions?
oh, also STOP USING CRC FOR ANYTHING MODERN! Stop using this! This library should be archived and abandoned and rotting! You should be using xxhash64 or xxhash128 for any modern application requiring “hashes” or “checksums” built after 2013.