Opsmas 2025 Day 9: crcspeed

I wrote crcspeed in 2014 over about 8 months of figuring out how to speed up the CRC-64-Jones and CRC-16-CCITT variants. It has gotten a lot of exposure and been included in other popular public and private projects over the years.

a background on crcs

CRC stands for “cyclic redundancy check” and it’s a fairly accurate name. It generates a check value for any input bitstream, but the calculation happens entirely serially: every step depends on the result of all the bits already calculated. CRCs were designed in the time of slow serial connections with data rates measured in bits per second, so burning time calculating checks byte-by-byte while waiting for the next piece of streaming data to arrive wasn’t too much of a burden.

But then things got faster. Do you know how inefficient it is to calculate a value by summing up individual bytes over gigabytes of data? It’s pretty bad.

the first fixes

On modern systems capable of reading 60+ GB/s out of main memory, a standard unoptimized CRC algorithm operates at about 70 MB/s. Quite a bad algorithm for modern times.

You can do “tricks” to make it faster, like pre-calculating the CRC of every single-byte value, which boosts speed from 70 MB/s to 330 MB/s at the cost of a lookup table being referenced constantly. That's still awful for any kind of real-time data check (I’ve seen people use standard CRC implementations to checksum 100+ GB data snapshot files).

My trick with crcspeed was using the recently (?) discovered or popularized “slicing by 8” method: instead of pre-calculating a single CRC byte table, you use 8 of them to cover longer pre-calculated byte sequences, and you get up to 1.5 GB/s of throughput. Not bad for some math and cache waste: going from 70 MB/s to 1.5 GB/s.

but the whale was still waiting.

even more modern speedups

Intel long ago (2009-ish) wrote a paper on how to use SIMD/SSE instructions to speed up CRC with hand-written assembly, but as it goes with these things, it wasn’t easy to adapt, build, or convert into other systems. They even have implementations and weird libraries of their own.

Except, I needed more. I wanted a full cross-platform SIMD (SSE+NEON) implementation, which didn’t seem to exist, and even the standard Intel versions weren’t performing the best.

thus, ergo, now we have crcspeed with SIMD support for SSE (Intel) and NEON (ARM), which gets about 40+ GB/s in my testing for CRC-64-Jones (roughly a 30x speedup over slicing-by-8) and around 10 GB/s for CRC-16-CCITT (roughly a 7x speedup over slicing-by-8):

./crc_bench 33333333
===================
CRC Benchmark Suite
===================

SIMD support: YES
Buffer size:  33333333 bytes (31.79 MB)
Iterations:   100

CRC64 Benchmarks:
-----------------
  crc64 (bit-by-bit)      :    71.96 MB/s,  0.32 cycles/byte
  crc64_lookup (table)    :   378.39 MB/s,  0.06 cycles/byte
  crc64speed (slice-8)    :  1432.05 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 41997.47 MB/s,  0.00 cycles/byte

CRC16 Benchmarks:
-----------------
  crc16 (bit-by-bit)      :   110.53 MB/s,  0.21 cycles/byte
  crc16_lookup (table)    :   335.39 MB/s,  0.07 cycles/byte
  crc16speed (slice-8)    :  1416.56 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9426.57 MB/s,  0.00 cycles/byte

Small Buffer Benchmarks (64 bytes):
-----------------------------------
  crc64speed (slice-8)    :  2277.01 MB/s,  0.01 cycles/byte
  crc64_simd (PCLMUL)     :  7174.70 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  2337.62 MB/s,  0.01 cycles/byte
  crc16_simd (PCLMUL)     :  7006.68 MB/s,  0.00 cycles/byte

Medium Buffer Benchmarks (4KB):
-------------------------------
  crc64speed (slice-8)    :  1468.52 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 45002.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1395.84 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     : 11047.09 MB/s,  0.00 cycles/byte

Large Buffer Benchmarks (1MB):
------------------------------
  crc64speed (slice-8)    :  1457.24 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 42405.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1418.09 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9431.78 MB/s,  0.00 cycles/byte
===================

./crcspeed -n 1 ~/Downloads/Blame\!\ by\ Tsutomu\ Nihei\ Collection.zip

═══════════════════════════════════════════════════════════════════════
Benchmark: /Users/matt/Downloads/Blame! by Tsutomu Nihei Collection.zip
═══════════════════════════════════════════════════════════════════════
  File Size:   11.02 GB (11831396389 bytes)
  Iterations:  1
  Total Data:  11.02 GB per algorithm

  CRC64 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc64           0xfadd0161b519b53e         71.93        0.07       0.318  [PASS]
  crc64_lookup    0xfadd0161b519b53e        375.62        0.37       0.061  [PASS]
  crc64speed      0xfadd0161b519b53e       1455.17        1.42       0.016  [PASS]
  crc64_simd      0xfadd0161b519b53e      41937.87       40.95       0.001  [PASS]

  CRC16 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc16           0x9770                    111.92        0.11       0.204  [PASS]
  crc16_lookup    0x9770                    334.97        0.33       0.068  [PASS]
  crc16speed      0x9770                   1397.31        1.36       0.016  [PASS]
  crc16_simd      0x9770                   9268.78        9.05       0.002  [PASS]

  Summary:
  CRC64: 0xfadd0161b519b53e PASS
  CRC16: 0x9770             PASS

  Speedup (SIMD vs Table):
  CRC64: 28.8x
  CRC16: 6.6x
═══════════════════════════════════════════════════════════════════════

stats

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                       8         3568         2507          548          513
 C Header                4          128           64           43           21
-------------------------------------------------------------------------------
 Markdown                1          119            0           88           31
 |- Haskell              1          126          102            5           19
 (Total)                            245          102           93           50
===============================================================================
 Total                  13         3815         2571          679          565
===============================================================================

conclusion

Is this the best it can be? Probably not. You can probably use even wider SIMD instructions in more specific ways to rip through the sequences faster, but more advanced features on more advanced CPU architectures are outside the scope of what I have access to for free unpaid globally reusable work resulting in no compensation or glory along the way, womp womp.

any questions?

oh, also STOP USING CRC FOR ANYTHING MODERN! Stop using this! This library should be archived and abandoned and rotting! You should be using xxhash64 or xxhash128 for any modern application requiring “hashes” or “checksums” built after 2013.

Article Analysis

Summary

This document introduces `crcspeed`, a tool developed to overcome the performance limitations of traditional CRC algorithms. It details the creation of an optimized, cross-platform SIMD implementation that achieves significant speed improvements, as evidenced by benchmark data across various buffer sizes and file types. However, despite these optimizations, the author ultimately advises against using CRC for modern applications, recommending xxhash64/xxhash128 as superior alternatives.

Content Scores

Metric	Min	Max	Mean	Median	Total
Humor	0	5	1.83	1.5	11
Helpfulness	8	9	8.17	8.0	49
Aggression	0	2	0.83	0.5	5
Spiciness	0	4	0.83	0.0	5

Chunk-by-Chunk Analysis

Chunk Summary

This post introduces `crcspeed`, a tool developed in 2014 to address the performance limitations of traditional CRC algorithms on modern, high-speed data processing systems, explaining the inherent serial nature of CRCs and the bottlenecks it creates.

Chunk Ratings

Metric	Score	Reason
Humor	3	The author uses mild sarcasm and rhetorical questions to express frustration with the inefficiency of standard CRC algorithms, which adds a touch of lightheartedness.
Helpfulness	8	The text provides a clear explanation of what CRC is, its historical context, and a detailed, quantitative description of its performance limitations on modern hardware. It sets the stage for a solution.
Aggression	2	The text expresses frustration with the inefficiency of CRC algorithms but does not direct this negativity towards any specific entity or individual. It's more of a technical critique.
Spiciness	1	The tone is professional, with only a slight edge of exasperation due to the technical limitations discussed. There is no offensive language or content.

Show Original Text

---
date: '2025-12-20'
frame: frame-front
frontTitle: 'Opsmas 2025 Day 09: crcspeed'
pageClasses: ['opsmas-2025']
published: true
subframe: frame-article
title: 'Opsmas 2025 Day 9: crcspeed'
---

# Opsmas 2025 Day 9: crcspeed

I wrote [`crcspeed`](https://github.com/mattsta/crcspeed) in 2014 over about 8 months of figuring out how to speed up the CRC-64-Jones and CRC-16-CCITT variants. It has gotten a lot of exposure and been included in other popular public and private projects over the years.

## a background on crcs

CRC stands for "[cyclic redundancy check](https://en.wikipedia.org/wiki/Cyclic_redundancy_check)" and it's a fairly [accurate name](https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks). It generates a check value for any input bitstream, but the calculation happens entirely serially: every step depends on the result of all the bits already calculated. CRCs were designed in the time of slow serial connections with data rates measured in bits per second, so burning time calculating checks byte-by-byte while waiting for the next piece of streaming data to arrive wasn't too much of a burden.

But then things got faster. Do you know how inefficient it is to calculate a value by summing up individual bytes over gigabytes of data? It's pretty bad.

On modern systems capable of reading 60+ GB/s out of main memory, a standard unoptimized CRC algorithm operates at about 70 MB/s. Quite a bad algorithm for modern times.

You can do "tricks" to make it faster, like pre-calculating the CRC of every single-byte value, which boosts speed from 70 MB/s to 330 MB/s at the cost of a lookup table being referenced constantly. That's still awful for any kind of real-time data check (I've seen people use standard CRC implementations to checksum 100+ GB data snapshot files).

Chunk Summary

This text details the author's development of an optimized, cross-platform SIMD implementation for CRC calculations in the `crcspeed` project, achieving significant speed improvements over previous methods.

Chunk Ratings

Metric	Score	Reason
Humor	3	The humor is very subtle, primarily coming from the slightly dramatic phrasing like "but the whale was still waiting" and "ergo," which adds a touch of personality without being overtly funny.
Helpfulness	8	The text provides clear technical details about optimizing CRC calculation speeds, explains the methods used (slicing by 8, SIMD/SSE/NEON), and offers concrete performance improvements with specific benchmarks. The GitHub link is also helpful for further exploration.
Aggression	1	The tone is generally positive and focused on technical achievement. The mention of Intel's work being "not easy to adapt" is a mild critique, but not aggressive.
Spiciness	0	The content is entirely professional and technical, with no offensive or inappropriate language.

Show Original Text


My trick with [`crcspeed`](https://github.com/mattsta/crcspeed) was using the recently (?) discovered or popularized "slicing by 8" method: instead of pre-calculating a single CRC byte table, you use 8 of them to cover longer pre-calculated byte sequences, and you get up to 1.5 GB/s of throughput. Not bad for some math and cache waste: going from 70 MB/s to 1.5 GB/s.

but the whale was still waiting.

Intel long ago (2009-ish) wrote a paper on how to use SIMD/SSE instructions to speed up CRC with hand-written assembly, but as it goes with these things, it wasn't easy to adapt, build, or convert into other systems. They even have [implementations and weird libraries](https://deepwiki.com/intel/isa-l/5-crc-functions) of their own.

Except, I needed more. I wanted a full cross-platform SIMD (SSE+NEON) implementation, which didn't seem to exist, and even the standard Intel versions weren't performing the best.

thus, ergo, now we have [`crcspeed`](https://github.com/mattsta/crcspeed) with SIMD support for SSE (Intel) and NEON (ARM), which gets about 40+ GB/s in my testing for CRC-64-Jones (roughly a 30x speedup over slicing-by-8) and around 10 GB/s for CRC-16-CCITT (roughly a 7x speedup over slicing-by-8):

```haskell
./crc_bench 33333333
===================
CRC Benchmark Suite
===================

SIMD support: YES
Buffer size:  33333333 bytes (31.79 MB)
Iterations:   100

CRC64 Benchmarks:
-----------------
  crc64 (bit-by-bit)      :    71.96 MB/s,  0.32 cycles/byte
  crc64_lookup (table)    :   378.39 MB/s,  0.06 cycles/byte
  crc64speed (slice-8)    :  1432.05 MB/s,  0.02 cycles/byte

Chunk Summary

This text presents benchmark data for various CRC algorithms, detailing their performance in MB/s and cycles/byte across different buffer sizes.

Chunk Ratings

Metric	Score	Reason
Humor	0	The text is a technical benchmark report and contains no attempts at humor.
Helpfulness	9	The text provides clear and detailed performance metrics for various CRC algorithms across different buffer sizes, which is highly valuable for performance tuning and algorithm selection.
Aggression	0	The text is purely informational and lacks any emotional tone or negativity.
Spiciness	0	The text is a neutral, technical report and contains no offensive or inappropriate content.

Show Original Text

  crc64_simd (PCLMUL)     : 41997.47 MB/s,  0.00 cycles/byte

CRC16 Benchmarks:
-----------------
  crc16 (bit-by-bit)      :   110.53 MB/s,  0.21 cycles/byte
  crc16_lookup (table)    :   335.39 MB/s,  0.07 cycles/byte
  crc16speed (slice-8)    :  1416.56 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9426.57 MB/s,  0.00 cycles/byte

Small Buffer Benchmarks (64 bytes):
-----------------------------------
  crc64speed (slice-8)    :  2277.01 MB/s,  0.01 cycles/byte
  crc64_simd (PCLMUL)     :  7174.70 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  2337.62 MB/s,  0.01 cycles/byte
  crc16_simd (PCLMUL)     :  7006.68 MB/s,  0.00 cycles/byte

Medium Buffer Benchmarks (4KB):
-------------------------------
  crc64speed (slice-8)    :  1468.52 MB/s,  0.02 cycles/byte
  crc64_simd (PCLMUL)     : 45002.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1395.84 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     : 11047.09 MB/s,  0.00 cycles/byte

Large Buffer Benchmarks (1MB):
------------------------------
  crc64speed (slice-8)    :  1457.24 MB/s,  0.02 cycles/byte

Chunk Summary

This text presents benchmark results for various CRC algorithms, detailing their performance in MB/s and cycles/byte when processing a large zip file.

Chunk Ratings

Metric	Score	Reason
Humor	0	The text consists of technical benchmark results and does not contain any elements of humor.
Helpfulness	8	The text provides clear and specific performance benchmarks for different CRC algorithms on a given file size, including MB/s, GB/s, and cycles/byte, which is useful for performance analysis.
Aggression	0	The text is purely objective technical data and exhibits no emotional tone, aggression, negativity, or anger.
Spiciness	0	The content is strictly technical data and contains no offensive or controversial material.

Show Original Text

  crc64_simd (PCLMUL)     : 42405.88 MB/s,  0.00 cycles/byte
  crc16speed (slice-8)    :  1418.09 MB/s,  0.02 cycles/byte
  crc16_simd (PCLMUL)     :  9431.78 MB/s,  0.00 cycles/byte
===================
```


```
./crcspeed -n 1 ~/Downloads/Blame\!\ by\ Tsutomu\ Nihei\ Collection.zip

═══════════════════════════════════════════════════════════════════════
Benchmark: /Users/matt/Downloads/Blame! by Tsutomu Nihei Collection.zip
═══════════════════════════════════════════════════════════════════════
  File Size:   11.02 GB (11831396389 bytes)
  Iterations:  1
  Total Data:  11.02 GB per algorithm

  CRC64 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc64           0xfadd0161b519b53e         71.93        0.07       0.318  [PASS]
  crc64_lookup    0xfadd0161b519b53e        375.62        0.37       0.061  [PASS]

Chunk Summary

This technical report presents benchmark results for CRC64 and CRC16 algorithms, indicating performance metrics and successful completion for all tested implementations.

Chunk Ratings

Metric	Score	Reason
Humor	0	The text is a technical benchmark report and contains no elements of humor.
Helpfulness	8	The text provides clear and concise performance data for CRC64 and CRC16 algorithms, including various implementation methods (e.g., SIMD, lookup table) and relevant metrics like MB/s, GB/s, and Cycles/B. This information is directly useful for understanding and comparing algorithm efficiency.
Aggression	0	The text is a neutral report of technical performance data and exhibits no emotional tone or negativity.
Spiciness	0	The text is strictly technical and professional, containing no offensive or inappropriate content.

Show Original Text

  crc64speed      0xfadd0161b519b53e       1455.17        1.42       0.016  [PASS]
  crc64_simd      0xfadd0161b519b53e      41937.87       40.95       0.001  [PASS]

  CRC16 Results:
  Algorithm       CRC Value                   MB/s        GB/s    Cycles/B  Status
  ──────────────  ──────────────────  ────────────  ──────────  ──────────  ──────
  crc16           0x9770                    111.92        0.11       0.204  [PASS]
  crc16_lookup    0x9770                    334.97        0.33       0.068  [PASS]
  crc16speed      0x9770                   1397.31        1.36       0.016  [PASS]
  crc16_simd      0x9770                   9268.78        9.05       0.002  [PASS]

  Summary:
  CRC64: 0xfadd0161b519b53e PASS
  CRC16: 0x9770             PASS

  Speedup (SIMD vs Table):

Chunk Summary

The author strongly advises against using CRC for modern applications, recommending xxhash64/xxhash128 instead and highlighting the limitations of their current optimization efforts.

Chunk Ratings

Metric	Score	Reason
Humor	5	The humor is dry and self-deprecating, leaning into the programmer trope of unpaid work and lack of recognition. The "womp womp" is a classic internet-era expression of mild disappointment.
Helpfulness	8	While the initial numbers lack context, the subsequent text provides a strong, actionable recommendation to stop using CRC for modern applications and switch to xxhash64/xxhash128, with links to relevant projects.
Aggression	2	The tone is assertive and dismissive of older technologies, but not overtly angry or negative. The aggression is directed at the technology, not the reader.
Spiciness	4	The language used to describe abandoning CRC ("abandoned and rotting!") is strong and opinionated, bordering on provocative within a technical context.

Show Original Text

  CRC64: 28.8x
  CRC16: 6.6x
═══════════════════════════════════════════════════════════════════════
```

Is this the best it can be? Probably not. You can probably use even wider SIMD instructions in more specific ways to rip through the sequences faster, but more advanced features on more advanced CPU architectures are outside the scope of what I have access to for free unpaid globally reusable work resulting in no compensation or glory along the way, womp womp.

any questions?

oh, also STOP USING CRC FOR ANYTHING MODERN! Stop [using this](https://github.com/mattsta/crcspeed)! [This library](https://github.com/mattsta/crcspeed) should be archived and abandoned and rotting! You should be using xxhash64 or xxhash128 for any modern application requiring "hashes" or "checksums" built after 2013.