Opsmas 2025 Day 9: crcspeed

I wrote crcspeed in 2014 over about 8 months of figuring out how speed up CRC-Jones-64 and CRC-16-CCITT variants. It got a lot of exposure and inclusion in other popular public and private projects over the years.

a background on crcs

CRC stands for “cyclic redundancy check” and it’s a fairly accurate name. It generates a value for any input bitstream, except the calculation happens entirely serially where, during the calculation, every next calculation depends on the previous bits already calcluated. The CRC systems were designed in the time of slow serial connections with data rates measured in bits per second so burning time calculating checks during incoming data latency arrival delays processing byte-per-byte streaming data wasn’t too much of a burden.

But then things got faster. Do you know how inefficient it is to calculate a value by suming up individual bytes over gigabytes of data? It’s pretty bad.

the first fixes

On modern systems capable of reading 60+ GB/s out of main memory, a standard unoptimized CRC algorithm operates at about 70 MB/s. Quite a bad algorithm for modern times.

You can do “tricks” to make it faster like pre-calculate all single byte values and we boost speed from 70 MB/s to 330 MB/s at the cost of a lookup table being referenced constantly, but still awful for any kind of data check in real time (i’ve seen people use standard CRC implementations to like checksum 100+ GB data snapshot files).

My trick with crcspeed was using the recently (?) discovered or popularized “slicing by 8” method of pre-calculating a single CRC byte table, you use 8 of them for longer pre-calcualted byte sequences, and you get up to 1.5 GB/s of throughput now. Not bad for some math and cache waste going from 70 MB/s to 1.5 GB/s.

but the whale was still waiting.

even more modern speedups

Intel had long ago (2009-ish) wrote a paper on how to use SIMD/SSE instructions for CRC speed up with hand written assembly, but as it goes with things, it wasn’t easy to adapt or build or convert into other systems. They even have implementations and weird libraries of their own.

Except, I needed more. I wanted a full cross-platform SIMD (SSE+NEON) implementation which didn’t seem to exist and even the standard Intel versions weren’t performing the best.

thus, ergo, now we have crcspeed with SIMD support for SSE (Intel) and NEON (ARM) which get about 40+ GB/s in my testing for CRC-64-Jones (30x speedup from previous) and around 10 GB/s for CRC-16-CCITT (7x speedup from previous):

./crc_bench 33333333
===================
CRC Benchmark Suite
===================

SIMD support: YES
Buffer size:  33333333 bytes (31.79 MB)
Iterations:   100

CRC64 Benchmarks:
-----------------
  crc64 (bit-by-bit)      :    71.96 MB/s,  0.32 cycles/byte
  crc64_lookup (table)    :   378.39 MB/s,  0.06 cycles/byte
  crc64speed (slice-8)    :  1432.05 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 41997.47 MB/s,  0.00 cycles/byte

CRC16 Benchmarks:
-----------------
  crc16 (bit-by-bit)      :   110.53 MB/s,  0.21 cycles/byte
  crc16_lookup (table)    :   335.39 MB/s,  0.07 cycles/byte
  crc16speed (slice-8)    :  1416.56 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9426.57 MB/s,  0.00 cycles/byte

Small Buffer Benchmarks (64 bytes):
-----------------------------------
  crc64speed (slice-8)    :  2277.01 MB/s,  0.01 cycles/byte
  crc64_simd (PCLMUL)     :  7174.70 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  2337.62 MB/s,  0.01 cycles/byte
  crc16_simd (PCLMUL)     :  7006.68 MB/s,  0.00 cycles/byte

Medium Buffer Benchmarks (4KB):
-------------------------------
  crc64speed (slice-8)    :  1468.52 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 45002.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1395.84 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     : 11047.09 MB/s,  0.00 cycles/byte

Large Buffer Benchmarks (1MB):
------------------------------
  crc64speed (slice-8)    :  1457.24 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 42405.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1418.09 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9431.78 MB/s,  0.00 cycles/byte
===================

./crcspeed -n 1 ~/Downloads/Blame\!\ by\ Tsutomu\ Nihei\ Collection.zip

═══════════════════════════════════════════════════════════════════════
Benchmark: /Users/matt/Downloads/Blame! by Tsutomu Nihei Collection.zip
═══════════════════════════════════════════════════════════════════════
  File Size:   11.02 GB (11831396389 bytes)
  Iterations:  1
  Total Data:  11.02 GB per algorithm

  CRC64 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc64           0xfadd0161b519b53e         71.93        0.07       0.318  [PASS]
  crc64_lookup    0xfadd0161b519b53e        375.62        0.37       0.061  [PASS]
  crc64speed      0xfadd0161b519b53e       1455.17        1.42       0.016  [PASS]
  crc64_simd      0xfadd0161b519b53e      41937.87       40.95       0.001  [PASS]

  CRC16 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc16           0x9770                    111.92        0.11       0.204  [PASS]
  crc16_lookup    0x9770                    334.97        0.33       0.068  [PASS]
  crc16speed      0x9770                   1397.31        1.36       0.016  [PASS]
  crc16_simd      0x9770                   9268.78        9.05       0.002  [PASS]

  Summary:
  CRC64: 0xfadd0161b519b53e PASS
  CRC16: 0x9770             PASS

  Speedup (SIMD vs Table):
  CRC64: 28.8x
  CRC16: 6.6x
═══════════════════════════════════════════════════════════════════════

stats

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                       8         3568         2507          548          513
 C Header                4          128           64           43           21
-------------------------------------------------------------------------------
 Markdown                1          119            0           88           31
 |- Haskell              1          126          102            5           19
 (Total)                            245          102           93           50
===============================================================================
 Total                  13         3815         2571          679          565
===============================================================================

conclusion

Is this the best it can be? Probably not. You can probably use even wider SIMD instructions in more specific ways to rip through the sequences faster, but more advanced features on more advanced CPU architectures are outside the scope of what I have access to for free unpaid globally reusable work resulting in no compensation or glory along the way, womp womp.

any questions?

oh, also STOP USING CRC FOR ANYTHING MODERN! Stop using this! This library should be archived and abandoned and rotting! You should be using xxhash64 or xxhash128 for any modern application requiring “hashes” or “checksums” built after 2013.

Metric	Min	Max	Mean	Median	Total
Humor	0	5	1.83	1.5	11
Helpfulness	8	9	8.17	8.0	49
Aggression	0	2	0.83	0.5	5
Spiciness	0	4	0.83	0.0	5