Opsmas 2025 Day 3: datakit & ram prices
TOC:
- datakit
- ram prices
datakit oh datakit how thy data structures are so memory-efficient!
datakit (ref: datakit public release announcement) is my “most memory-efficient-possible” collection of data structures, OS abstractions, algorithms, and system/platform architectures.
This year I had a chance to revisit and update it for modern usage: more ARM optimizations, more correctness checks, more test coverage, more feature extensions, more capabilities, more data structures, more documentation, more usable interfaces, more optimizations, and more safety in many places.
random new stuff
basic modules and features are still:
Core & Utilities:
- databox - Universal 16-byte polymorphic container
- flex/mflex - Compressed variable-length arrays
- String utilities (str, dks, strDoubleFormat)
Scale-Aware Containers:
- multimap - Key-value store with 1-to-3 bytes of per-entry overhead (Small/Medium/Full/Atom variants)
- multilist - Linked lists with multi-element nodes
- multiarray - Dynamic arrays (Small/Medium/Large variants)
- multidict, multilru, multiroar
Integer Sets & Cardinality:
- intset - Variable-width integer sets (16/32/64-bit auto-promotion)
- intsetU32 - Fixed 32-bit integer sets
- hyperloglog - Probabilistic cardinality estimation
Compression & Encoding:
- xof - XOR-based double compression (Gorilla-style)
- dod - Delta-of-delta integer encoding
- float16 - Half-precision floating point
Memory Management:
- membound, fibbuf, jebuf, ptrPrevNext, offsetArray
System & OS:
- OSRegulate, setproctitle, versionOSRuntime
- fastmutex, timeUtil, multiTimer
Utilities:
- list, multiheap, intersectInt (SIMD-optimized)
more additions and improvements include, but are not limited to:
NEW MODULES
Fenwick Tree Template System (49 files)
- Complete Binary Indexed Tree implementation with 2-tier architecture (Small/Full)
- 10 numeric types: I16, I32, I64, I128, U16, U32, U64, U128, Float, Double
- Operations: O(log n) point update, prefix query, range query, lower bound
- Memory efficiency: I16 uses 75% less memory than I64, better cache utilization
- Automatic tier transitions: Small (contiguous, 128KB max) → Full (unlimited)
- Comprehensive documentation and testing (fenwickI64Test.c)
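The point-update/prefix-query mechanics behind all those generated variants are the classic BIT index walk. A minimal standalone sketch (the `fw*` names and fixed size are illustrative, not datakit's generated fenwickI64 API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal 1-indexed Fenwick (Binary Indexed Tree) sketch over int64_t.
 * Illustrative only; datakit's generated variants add tiering (Small/Full)
 * and capacity management on top of this core idea. */
#define FW_N 64

typedef struct fw {
    int64_t tree[FW_N + 1];
} fw;

static void fwInit(fw *f) {
    memset(f->tree, 0, sizeof(f->tree));
}

/* O(log n) point update: add 'delta' at position i (1-based). */
static void fwUpdate(fw *f, int i, int64_t delta) {
    for (; i <= FW_N; i += i & -i) {
        f->tree[i] += delta;
    }
}

/* O(log n) prefix sum over positions [1, i]. */
static int64_t fwPrefix(const fw *f, int i) {
    int64_t sum = 0;
    for (; i > 0; i -= i & -i) {
        sum += f->tree[i];
    }
    return sum;
}

/* Range query [lo, hi] as a difference of two prefix sums. */
static int64_t fwRange(const fw *f, int lo, int hi) {
    return fwPrefix(f, hi) - fwPrefix(f, lo - 1);
}
```

The `i & -i` trick isolates the lowest set bit, which is what makes both walks logarithmic.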
Segment Tree Template System (49 files)
- Range query structure supporting MIN/MAX/SUM operations in O(log n)
- Same 10 numeric types and 2-tier architecture as Fenwick
- Full tier includes lazy propagation for efficient range updates
- Template-based code generation from 3 core implementation files
- Testing infrastructure with operation-specific validation
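For a feel of the O(log n) range query the module provides, here is a minimal iterative segment tree for MIN over a fixed power-of-two array (no lazy propagation; names are illustrative, not datakit's generated API):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal bottom-up segment tree sketch for MIN queries. The real module
 * generates MIN/MAX/SUM variants across 10 numeric types with 2 tiers. */
#define SEG_N 8 /* power of two for simplicity */

typedef struct seg {
    int64_t t[2 * SEG_N]; /* leaves live at t[SEG_N .. 2*SEG_N-1] */
} seg;

static void segBuild(seg *s, const int64_t *vals) {
    for (int i = 0; i < SEG_N; i++) {
        s->t[SEG_N + i] = vals[i];
    }
    /* each internal node is the min of its two children */
    for (int i = SEG_N - 1; i >= 1; i--) {
        s->t[i] = s->t[2 * i] < s->t[2 * i + 1] ? s->t[2 * i] : s->t[2 * i + 1];
    }
}

/* Min over half-open range [lo, hi) in O(log n). */
static int64_t segMin(const seg *s, int lo, int hi) {
    int64_t best = INT64_MAX;
    for (lo += SEG_N, hi += SEG_N; lo < hi; lo >>= 1, hi >>= 1) {
        if (lo & 1) { if (s->t[lo] < best) best = s->t[lo]; lo++; }
        if (hi & 1) { hi--; if (s->t[hi] < best) best = s->t[hi]; }
    }
    return best;
}
```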
Persistence Subsystem (15 files in src/persist/)
- Pluggable serialization framework for all major datakit structures
- Snapshot files (.snap) for point-in-time state capture
- Write-Ahead Log (.wal) for incremental operation durability
- Automatic crash recovery via snapshot + WAL replay
- Configurable compaction to merge WAL into snapshots
- Multiple checksum algorithms (XXH32/64/128) for data integrity
- Supported structures: flex, intset, multilist, multimap, multidict, multiOrderedSet, multilru, multiroar
- persistCtx wrapper provides automatic WAL logging with sync policies
- Comprehensive test suite with round-trip verification
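The recovery ordering is the important invariant: restore the last snapshot, then replay the WAL on top of it, in order. A toy sketch where the "state" is just a counter and each logged operation is a signed increment (the real subsystem replays serialized structure operations inside checksummed frames):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model of snapshot + WAL recovery. Hypothetical names; not the
 * persistCtx API. */
typedef struct walOp {
    int64_t delta;
} walOp;

static int64_t recover(int64_t snapshotState, const walOp *wal, size_t nOps) {
    int64_t state = snapshotState;   /* 1) load point-in-time snapshot */
    for (size_t i = 0; i < nOps; i++) {
        state += wal[i].delta;       /* 2) replay WAL entries in order */
    }
    return state;                    /* state as of the crash point */
}
```

Compaction is then just "take the recovered state, write a fresh snapshot, truncate the WAL."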
NEW DATA STRUCTURES
multiOrderedSet (8 files, 3-tier architecture)
- Sorted set storing (score, member) pairs
- Automatic tier promotion: Small (flex) → Medium (indexed flexes) → Full (dict+flex)
- Operations: Add, Remove, GetRank, RangeQuery, PopMin/Max, IncrBy
- Tiering thresholds: 2KB (Small→Medium), 6KB (Medium→Full)
- Full tier uses multidict for O(1) member lookup + flex for sorted iteration
multiFenwick (4 files)
- Binary Indexed Tree wrapper using multilist for storage
- Supports any numeric databox type without type decay
- O(log n) operations: Update, Query, RangeQuery, Get, Set, LowerBound
- Leverages multilist’s automatic tier management
atomPool (2 files)
- Unified string interning interface with pluggable backends
- VTable-based polymorphism for backend abstraction
- HASH backend (stringPool): O(1) operations, ~84 bytes/entry
- TREE backend (multimapAtom): O(log n) operations, ~22 bytes/entry, 73% memory savings
- Reference counting with automatic cleanup on release
stringPool (2 files)
- High-performance hash-based string interning (atomPool HASH backend)
- Two-way mapping: multidict (string→ID) + array (ID→entry)
- O(1) average-case operations for intern/lookup
- Free list for ID reuse after release, keeps IDs compact
- 1-based IDs (0 = invalid/not found)
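The ID scheme is worth a sketch on its own: 1-based IDs (so 0 can mean "not found"), refcounts, and a free-list stack so released IDs get recycled and the ID space stays compact. Lookup below is a linear scan purely for brevity; the real pool uses multidict for O(1):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy string-interning pool illustrating stringPool's ID + free-list scheme.
 * Stores caller-owned pointers without copying, for brevity. */
#define POOL_MAX 16

typedef struct pool {
    const char *entry[POOL_MAX + 1]; /* slot 0 unused: ID 0 = invalid */
    uint32_t refs[POOL_MAX + 1];
    uint32_t freeList[POOL_MAX];     /* stack of released IDs */
    uint32_t nFree;
    uint32_t next;                   /* next never-used ID */
} pool;

static void poolInit(pool *p) {
    memset(p, 0, sizeof(*p));
    p->next = 1;
}

static uint32_t poolIntern(pool *p, const char *s) {
    for (uint32_t id = 1; id < p->next; id++) {
        if (p->entry[id] && strcmp(p->entry[id], s) == 0) {
            p->refs[id]++;           /* already interned: bump refcount */
            return id;
        }
    }
    /* prefer a recycled ID so IDs stay dense */
    uint32_t id = p->nFree ? p->freeList[--p->nFree] : p->next++;
    p->entry[id] = s;
    p->refs[id] = 1;
    return id;
}

static void poolRelease(pool *p, uint32_t id) {
    if (id && p->refs[id] && --p->refs[id] == 0) {
        p->entry[id] = NULL;
        p->freeList[p->nFree++] = id; /* ID becomes reusable */
    }
}
```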
ulid (2 files, 3,949 lines)
- Universally Unique Lexicographically Sortable Identifiers
- Time-ordered IDs with nanosecond precision + randomness
- Base36 encoding for compact string representation (25 chars for 128-bit)
- Multiple SIMD implementations: Scalar, SWAR, SSE2, AVX2, NEON
- 13+ variant types: EPOCH2020/2024, NS/US/MS precision, DUALNS, SNOWFLAKE
- Runtime capability detection with auto-selection of best SIMD path
- Monotonic counter prevents collisions within same nanosecond
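That last bullet is the subtle one. A sketch of the monotonic-counter rule under an assumed layout (high 64 bits = nanosecond timestamp, low 64 bits = random payload; the actual ulid variants pack these differently per type):

```c
#include <assert.h>
#include <stdint.h>

/* Toy monotonic ID generator: if two IDs land in the same nanosecond, bump
 * the random payload instead of re-rolling it, so IDs stay unique and sort
 * in creation order. Hypothetical layout, not datakit's actual ulid types. */
typedef struct ulid128 {
    uint64_t hi; /* nanosecond timestamp */
    uint64_t lo; /* randomness / monotonic counter */
} ulid128;

static ulid128 ulidNext(ulid128 *last, uint64_t nowNs, uint64_t rnd) {
    ulid128 id;
    if (nowNs == last->hi) {
        id.hi = last->hi;
        id.lo = last->lo + 1;  /* same tick: increment, keep ordering */
    } else {
        id.hi = nowNs;
        id.lo = rnd;           /* new tick: fresh randomness */
    }
    *last = id;
    return id;
}
```

(A production version also has to handle counter overflow within a tick and clocks moving backwards, which this sketch ignores.)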
MAJOR REFACTORS
multidict (6,112+ lines added)
- Conditional insert: AddNX (if-not-exists), AddXX (if-exists), Replace
- Atomic operations: GetAndDelete, PopRandom, IncrBy, IncrByFloat
- Dict operations: Copy, Merge (REPLACE/KEEP modes)
- Bulk operations: AddMultiple, DeleteMultiple
- Memory management: SetMaxMemory, EvictToLimit with eviction callbacks
- Eviction policies: NONE, RANDOM, LRU, LFU, SIZE_LRU
- Optional LRU tracking with zero overhead when disabled
- Byte-based expansion mode for large-value workloads
- Enhanced statistics: GetStats, GetDetailedStats, GetLoadMetrics with histograms
- Result codes: multidictResult enum (ERR, OK_REPLACED, OK_INSERTED)
- Better load factor control: separate expand/shrink thresholds (200%/10%)
- Byte tracking: keyBytes, valBytes, totalBytes fields for memory accounting
- Struct size: 124→140 bytes with new self-management fields
multilru (5,674+ lines added)
- Adaptive entry width system: 9 tiers (5-16 bytes) scaling to 1 exabyte
- Width 5 (16-bit): 64K max entries
- Width 9 (32-bit): 4B max entries
- Width 16 (60-bit): 1E max entries
- Full S4LRU implementation: Multi-level scan-resistant eviction (up to 64 levels)
- Weight tracking (optional): Per-entry weights for size-aware eviction
- Policy system: MLRU_POLICY_COUNT, SIZE, HYBRID
- Eviction strategies: LRU, SIZE_WEIGHTED, SIZE_LRU
- Configuration API: multilruNewWithConfig for declarative setup
- Bulk eviction: EvictN, EvictToSize with callbacks
- Introspection: GetStats, GetNLowest, GetNHighest
- Weight operations: InsertWeighted, UpdateWeight, GetWeight, TotalWeight
- Level mask optimization: O(1) lowest-entry lookup via bitmap
- Intrusive free list for efficient slot recycling
- Per-level tail pointers for S4LRU demotion semantics
- Zero-overhead optionality: weight tracking adds 8 bytes/entry only when enabled
- Promotion/demotion: Hits move up one level, evictions demote down before removal
multiroar (8,422+ lines added)
- Bitcount: Count total set bits across all chunks
- Min/Max operations: Get min/max set bit, IsEmpty check
- Comparison: Intersects, IsSubset, Equals
- Rank/Select (succinct): Rank (count bits in [0, pos)), Select (find k-th set bit)
- Range operations: RangeCount, BitSetRange, BitClearRange, BitFlipRange
- Set difference: AndNot (A-B) with in-place and constructor variants
- N-way operations: AndN, OrN, XorN for N≥2 bitmaps (in-place and constructors)
- Iterator: Forward iteration with Init, Next, Reset
- Bulk operations: BitSetMany, BitGetMany, ToArray, FromArray
- Similarity metrics: Jaccard, HammingDistance, Overlap, Dice
- Serialization: Serialize, Deserialize, SerializedSize for persistence
- Memory stats: MemoryUsage introspection
- Remove operation: Clear bit with auto-conversion between representations
- Const correctness: Many functions now accept const multiroar*
- SIMD-optimized popcount via StrPopCntExact
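Rank is a good example of why popcount matters here: over a flat bitmap, Rank(pos) is whole-word popcounts plus one masked partial word. multiroar does the same per chunk with cached cardinalities; this illustrative flat version shows the core:

```c
#include <assert.h>
#include <stdint.h>

/* Rank sketch: count set bits in [0, pos) over a flat uint64_t bitmap.
 * Illustrative only; multiroar skips chunks using per-chunk cardinalities. */
static uint64_t bitmapRank(const uint64_t *words, uint64_t pos) {
    uint64_t rank = 0;
    uint64_t word = pos / 64, bit = pos % 64;
    for (uint64_t w = 0; w < word; w++) {
        rank += (uint64_t)__builtin_popcountll(words[w]);
    }
    if (bit) {
        /* partial word: keep only bits strictly below 'pos' */
        rank += (uint64_t)__builtin_popcountll(words[word] & ((1ULL << bit) - 1));
    }
    return rank;
}
```

Select (find the k-th set bit) is the inverse walk: skip words by popcount until the k-th bit must be inside one, then scan within it.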
intset (Tiered refactor, 7 new files)
- Three-tier architecture with automatic promotion:
- Small: int16_t only, max 64KB or 32K elements
- Medium: int16_t + int32_t, max 8MB or 2M elements
- Full: int16_t + int32_t + int64_t, unlimited
- Separate sorted arrays per type within each tier (was: single array with encoding)
- Pointer tagging: Lower 2 bits encode tier type (SMALL/MEDIUM/FULL)
- Zero-copy upgrades when possible, arrays reused during transitions
- Memory efficiency: Small sets use 2 bytes/element, upgrade only when needed
- Per-tier iterators: intsetSmallIterator, intsetMediumIterator, intsetFullIterator
- Binary search implemented per-type within each tier
- New files: intsetCommon.h, intsetSmall.{c,h}, intsetMedium.{c,h}, intsetFull.{c,h}, intsetTest.c
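The pointer-tagging trick relies on allocations being at least 4-byte aligned, which leaves the low 2 bits of every pointer free to carry the tier. A sketch with illustrative names (not datakit's actual symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Low-2-bit pointer tagging sketch: encode the tier in the pointer itself,
 * so the handle type stays a single word with zero extra storage. */
typedef enum { TIER_SMALL = 0, TIER_MEDIUM = 1, TIER_FULL = 2 } tier;

static inline void *tagPtr(void *p, tier t) {
    return (void *)((uintptr_t)p | (uintptr_t)t);
}

static inline tier tagGet(void *tagged) {
    return (tier)((uintptr_t)tagged & 0x3);
}

static inline void *tagStrip(void *tagged) {
    return (void *)((uintptr_t)tagged & ~(uintptr_t)0x3);
}
```

Every public operation then dispatches on `tagGet()` and strips the tag before touching the underlying tier structure.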
offsetArray (548+ lines added, was stub)
- Sparse array with automatic offset adjustment for upward/downward growth
- Template-based type system: offsetArrayCreateTypes(Name, ValueType, IndexType)
- Operations: Grow, Get (lvalue), Set, Count, Low/High bounds, Empty check
- Memory efficiency: Only allocates storage for actual range being used
- Dynamic range: Grows both up and down while maintaining efficient access
- Use case: Sliding windows, network packet sequences, time-series data
PERFORMANCE & BENCHMARKING
compressionBench (682 lines, new)
- Comprehensive framework comparing datakit vs varint compression
- Integer: dod (delta-of-delta), varintDelta (ZigZag), varintBP128 (SIMD block-packed)
- Float: xof (bit-packed XOR), varintFloat (IEEE 754 component separation)
- Metrics: bytes/element, encode/decode throughput (Mops/s), compression ratio
- Test scales: Small (100), Medium (10K), Large (1M), XLarge (10M)
- Table formatting with performance summaries
dataspeed (1,331+ lines added)
- System performance benchmark suite with comprehensive profiling
- CPU: Integer/float ops, FMA throughput, function call overhead (Gops/s)
- Memory: Bandwidth (read/write/copy GB/s), latency (L1/L2/L3/RAM ns)
- Cache hierarchy benchmarks at different levels
- System info detection: CPU model/count/freq, cache sizes, memory
- SIMD support: AVX-512, AVX2, SSE2, ARM NEON detection and usage
- Accurate timing: Compiler barriers, memory fences, RDTSC for cycle-accurate
- Statistical analysis: min/max/median/p95/stddev with sample collection
- Output formats: Human-readable, CSV, JSON with box-drawing reports
- Platform-specific timing: mach_absolute_time (macOS), clock_gettime (Linux)
membound (993+ lines added)
- Optimize: O(1) freelist search via bitmap + CTZ (was O(LOGMAX) linear scan)
- Optimize: Replace division with shifts for power-of-2 atom sizes
- Add: SIMD-accelerated zeroing in memboundCalloc (AVX-512/AVX2/SSE2/NEON, 256-byte threshold)
- Add: memboundCalloc (zero-init), memboundBytesUsed, memboundBytesAvailable, memboundCapacity, memboundOwns
- Fix: Bitmap corruption during link/unlink operations
- Fix: Upgrade statistics from uint32_t to uint64_t to prevent overflow
- Improve: NULL safety, overflow protection, growth validation
- Safety: IncreaseSize only allows growth when no allocations exist
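The bitmap + CTZ change is the fun one: a 64-bit occupancy bitmap marks which size classes have free blocks, and count-trailing-zeros finds the smallest class that can satisfy a request in one instruction instead of a linear scan. A sketch (illustrative names; membound's real structures differ):

```c
#include <assert.h>
#include <stdint.h>

/* O(1) freelist lookup sketch: bit i of 'freeBitmap' is set when size
 * class i has at least one free block. */
static int smallestFreeClass(uint64_t freeBitmap, int minClass) {
    /* mask off classes too small for the request (minClass in 0..63) */
    uint64_t eligible = freeBitmap & (~0ULL << minClass);
    if (!eligible) {
        return -1;                     /* nothing large enough is free */
    }
    return __builtin_ctzll(eligible);  /* lowest set bit = smallest fit */
}
```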
memtest (Major refactor: 417 deleted, 948 added)
- Complete rewrite with modular context system
- Multiple test patterns: Address, random fill, solid (0/0xFF), checkerboard (0xAA/0x55)
- Progress callbacks for custom reporting
- Result structures: memtestResult with bytes_tested, errors_found, passes, duration
- Two modes: Destructive (fast) and Preserving (backup/restore, cache-defeating)
- API functions: memtest, memtestWithResult, memtestWithProgress, memtestInteractive, memtestAllocAndTest, memtestProcessMemory
- Platform support: Linux (/proc/self/maps), macOS (limited), BSD (procfs-dependent)
- Comprehensive unit test suite for all patterns and modes
- Timing tracking via memtest_time_ns, throughput reporting (MB/s)
TESTING & VALIDATION
databox (451 lines added)
- Extensive fuzz testing with 10K+ random comparisons using oracles
- Test coverage: signed/unsigned/double comparison correctness
- Cross-type comparison consistency validation
- Bytes comparison with strcmp oracle
- 128-bit integer operations
- Stress test: 100K random comparisons
- Boundary testing: embed threshold (0-16 bytes), INT64_MIN/MAX, UINT64_MAX
- Type coverage: void, error, bool, null, floats, all numeric types
databoxLinear (1,103+ lines added)
- Linear array operations: Append, insert, remove with bounds checking
- Memory safety: Reallocation with capacity tracking
- Iterator support: Forward/backward iteration
- Sorting: Multiple sort algorithms with comparison functions
- PARTS_DECODE macro enhancements for fixed-width types
linearBloom Enhancements
- Simplified bit indexing: Direct slot/mask computation with LB_SLOT/LB_MASK macros
- Early-exit check variant: linearBloomHashCheckEarlyExit for faster negative queries
- SWAR-optimized decay operations: Half, Quarter, Eighth (O(1))
- Time-based exponential decay with probabilistic rounding (xorshift64 PRNG)
- LUT-based deterministic decay (faster than per-entry float multiply)
- Fix: Boundary value handling (entries 21, 42 spanning word boundaries)
- Performance: SWAR avoids per-entry multiplies; ~90K boundary values fit on the stack
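The SWAR "Half" decay is two operations for eight counters: shift the whole packed word right by one, then mask the top bit of every byte lane so no counter leaks a bit into its neighbor. A sketch assuming 8-bit counters packed in a uint64_t (the real module's lane width may differ):

```c
#include <assert.h>
#include <stdint.h>

/* SWAR halving of eight 8-bit counters in one uint64_t. The mask clears
 * the bit that each counter's LSB would otherwise shift into the byte
 * below it. Illustrative, not linearBloom's actual code. */
static uint64_t swarHalfBytes(uint64_t counters) {
    return (counters >> 1) & 0x7F7F7F7F7F7F7F7FULL;
}
```

Quarter and Eighth are just the same idea with larger shifts and correspondingly wider masks.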
General Testing Infrastructure
- ctest.h: Better test framework integration
- All new components include extensive test suites
- Performance validation: Benchmarks verify correctness alongside speed
- Cross-platform: Tests work on Linux, macOS, BSD with platform-specific paths
BUG FIXES & CORRECTNESS
Core Fixes
- xof: Fix undefined behavior (bits >> 64) with proper guard for lengthOfOldBits
- membound: Bitmap corruption during link/unlink, overflow in statistics
- linearBloom: Boundary value corruption in SWAR operations
- databox: Signed comparison edge cases, embed threshold correctness
- intset: Variable encoding issues resolved via tiered separate-array design
Build System
- Remove multimapIndex.{c,h} (75+26 lines, obsolete)
- Remove multimapAtomLRU.h (1 line, unused)
- Add DATAKIT_TEST mode option independent of BuildTestBinary
- Add fenwick/ and segment/ subdirectories with conditional 128-bit support
- Ensure tests can run with assertions enabled (add_definitions(-UNDEBUG))
ENHANCEMENTS TO EXISTING MODULES
float16 (282+ lines added)
- Batch conversion implementations for arrays
- ARM64 NEON hardware conversion support
- Consistent handling across x86 (F16C) and ARM64 platforms
hyperloglog (143+ lines added)
- Enhanced cardinality estimation operations
- Improved merge and union operations
flex (803+ lines added)
- Enhanced encoding/decoding operations
- Better memory management
- Improved iterator support
dod (120+ lines added)
- Enhanced writer API with dodCloseWrites, dodGetWriterBytes
- Better support for delta-of-delta compression
- Improved read operations (dodReadAll)
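For anyone new to dod, the transform itself is tiny: store the change in the deltas rather than the values, so regularly spaced timestamps become runs of zeros that a bit-packer then encodes in about a bit each. A sketch of the transform and its inverse, without the bit packing (illustrative names, not the dod writer API):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Delta-of-delta transform: out[i] = (ts[i]-ts[i-1]) - (ts[i-1]-ts[i-2]),
 * with an implicit 0 baseline before the first element. */
static void dodEncode(const int64_t *ts, size_t n, int64_t *out) {
    int64_t prev = 0, prevDelta = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t delta = ts[i] - prev;
        out[i] = delta - prevDelta;   /* steady cadence -> zero */
        prevDelta = delta;
        prev = ts[i];
    }
}

/* Exact inverse: rebuild deltas, then values. */
static void dodDecode(const int64_t *dod, size_t n, int64_t *out) {
    int64_t prev = 0, prevDelta = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t delta = prevDelta + dod[i];
        out[i] = prev + delta;
        prevDelta = delta;
        prev = out[i];
    }
}
```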
dks (392+ lines added)
- UTF-8 substring operations: DKS_SUBSTR_UTF8
- Improved string manipulation functions
- Better memory safety and bounds checking
str utilities (158+ lines added)
- strPopcnt: Enhanced popcount operations (165+ lines)
- strToBufSplat: Improved buffer operations (64+ lines)
- strToNative: New native conversion utilities (68+ lines)
- strUTF8: UTF-8 handling enhancements (134+ lines)
xof (302+ lines added)
- Resumable reader with O(1) sequential access
- Better documentation and API clarity
- Fixed undefined behavior in bit shifts
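The XOR transform at the heart of xof is worth seeing once: reinterpret each double's bits and XOR with the previous value, so repeated or slowly changing values produce mostly-zero words that pack tightly. The real encoder then stores leading/trailing-zero runs (the Gorilla scheme); this shows only the XOR step:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-level XOR of consecutive doubles. memcpy is the strict-aliasing-safe
 * way to reinterpret the bits. Illustrative, not the xof API. */
static uint64_t doubleBits(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u;
}

static uint64_t xorWithPrev(double prev, double cur) {
    return doubleBits(prev) ^ doubleBits(cur); /* identical values -> 0 */
}
```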
timeUtil (40+ lines added)
- Enhanced time utilities
- Better cross-platform time handling
list (minor fixes)
- Correctness improvements
multiarray (268+ lines added)
- Enhanced array operations
- Better memory management
multilist (343+ lines added)
- Improved list operations
- Better tier management
multiTimer (8+ lines changed)
- Minor improvements
INFRASTRUCTURE & TOOLING
Compiler Optimizations
- Add: COMPILER_BARRIER, DO_NOT_OPTIMIZE, TIMING_BARRIER for accurate benchmarking
- Prevent optimizer from invalidating benchmark measurements
- Memory fences for instruction reordering prevention
Platform Support
- Enhanced macOS support (mach_absolute_time)
- Better Linux support (clock_gettime, /proc/self/maps)
- BSD support (procfs-dependent features)
- ARM64 (NEON) optimizations throughout
- x86-64 (AVX-512, AVX2, SSE2) optimizations
Build System (CMakeLists.txt, 351+ lines changed)
- Add fenwick/ subdirectory with CMakeLists.txt
- Add segment/ subdirectory with CMakeLists.txt
- Add persist/ subdirectory integration
- Conditional 128-bit type compilation (check_type_size("__int128_t"))
- DATAKIT_TEST mode option (independent of BuildTestBinary)
- Ensure assertions enabled for tests (-UNDEBUG)
- Updated compiler flags and optimizations
BENEFITS
Memory Efficiency
- Tiered architectures reduce memory footprint by 50-75% for small datasets
- Type-specific implementations (I16/I32 vs I64) enable optimal sizing
- Adaptive width system in multilru scales from 64K to 1 exabyte
- Separate arrays in intset tiers eliminate encoding overhead
Performance
- SIMD optimizations: AVX-512, AVX2, SSE2, NEON throughout
- O(1) operations via bitmap optimization (membound CTZ)
- Cache-friendly Small tiers: 20-33% faster than Full for hot data
- Branch-free code patterns in critical paths
Correctness
- Comprehensive fuzz testing with oracles
- Extensive boundary and edge case testing
- Fixed undefined behavior (bit shifts, overflows)
- Better assertions and validation
Maintainability
- Template systems: 49 files generated from 3 templates (zero code duplication)
- DRY principle: Single source of truth
- Clear separation of concerns (tiers, backends)
- Comprehensive test coverage for all new code
Production Readiness
- Persistence subsystem with crash recovery
- Serialization support for network transmission
- Memory limits and eviction policies
- Comprehensive statistics and introspection
- Cross-platform support (Linux, macOS, BSD, ARM64, x86-64)
stats
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
C 170 139753 99778 17746 22229
C Header 147 20444 11094 6340 3010
CMake 3 848 571 147 130
-------------------------------------------------------------------------------
Markdown 53 15739 0 10256 5483
|- BASH 29 554 409 93 52
|- C 51 23340 15088 4222 4030
|- CMake 5 65 41 13 11
|- Makefile 1 26 18 3 5
|- YAML 1 31 24 1 6
(Total) 39755 15580 14588 9587
===============================================================================
Total 373 176784 111443 34489 30852
===============================================================================
license
as always, my large 10,000+ hour projects like datakit are released under a custom “it’s source-available for you to read, but you must pay me to use it” license. If you want to use it and your company consists of more than just a squirrel and an entry-level intern, rates start as low as $64,000 per year and go up to $65,000,000 per year (net after taxes) depending on your scale and how evil you are (tech industry just done gone been doing some evil-maxxing speedrun these past 10 years huh).
ram prices
Why have I spent time over the past 10+ years focusing on things like creating byte-minimal perfect data structures?
Maybe it was for this very day.
The day when you realize RAM prices are up 300% to 500% in 3 months because AI companies want to privately claim over 80% of the planet’s entire semiconductor output capacity for their digital god datacenters.
Maybe that’s why efficient data structures matter.
The truth is, the industry never actually even reconciled with “the 64-bit apocalypse” when all pointers went from 4 bytes to 8 bytes due to the 32->64 bit expanded memory space adventure. If you have a billion 4-byte pointers tracking 16 bytes of customer data each, you’re already at 20% overhead using 4 GB of pointers to track 16 GB of data, but with 8-byte pointers, now you are burning 16 GB in user data and 8 GB in useless (to “business value”) pointer accounting (not to mention: blown CPU cache hierarchies and fetch paths due to just larger “fluff” everywhere; and worse, 64-bit pointers only use 48 bits for memory address space!).
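That back-of-envelope overhead math, spelled out as code (integer math, rounded down):

```c
#include <assert.h>

/* Pointer bytes as a percentage of total bytes per record. */
static unsigned ptrOverheadPct(unsigned ptrBytes, unsigned dataBytes) {
    return 100u * ptrBytes / (ptrBytes + dataBytes);
}
```

4-byte pointers over 16-byte records: 20% overhead. 8-byte pointers over the same records: 33%.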
But, nobody really cared. Memory was getting cheaper every year for years at a time.
Until it wasn’t.
Now maybe you’d be interested in systems experts with lifetime experience capable of optimizing platforms down to 1 to 3 bytes of overhead instead of burning 8-96 bytes of overhead per data point if you are storing millions to billions of in-memory datapoints. Memory overhead minimization from data structure implementation efficiency gets you improved datapoint efficiency and greater business value and even power savings and lower app latency across all your platforms. You get improved data scanning and iteration speeds. You get faster and better and wake up feeling happier and looking more attractive, what’s not to like?
Imagine if memory prices don’t go down and every byte of RAM on every device suddenly matters a whole lot more. Think of the billions of phones with 8, 12, 16 GB RAM which may not increase again for years if things don’t turn around. Why isn’t your app more efficient? What if your servers won’t all just default to having 1.5 TB RAM forever like everybody seems to program as a default expectation these days? What if you actually have to be efficient for the next 5+ years with no relief from “cheap big servers everywhere on demand?” What if?
I still have some old unmaintained stats at my slightly (just slightly) abandoned in-memory database (due to lack of interest at the time) marketing page over here we’ll also revisit again soon:
- https://carriercache.com/replace-memcached
- https://carriercache.com/benchmarks
- https://carriercache.com/specs
- https://carriercache.com/business
Why burn 30 GB RAM storing user data when you can store the same user data in 3-5 GB RAM instead? All automatically? Just by using better from-the-ground-up “every-byte-matters” data architecture? In the right hands, these tools and mindsets can save billions of dollars when dealing with platforms at-scale.
but ymmv. who cares, just keep running mysql json graveyards doing ETL to elasticsearch and storing 96 replicated copies of your data across 13 cities in 9 countries fronted by cloudfail to run your 8,000 unauthenticated-dependency-deep “but it only takes 8 lines of code to write!” todolist demo app you rebranded as “AI todo lists for private equity vet clinics!” and got $80 million in VC funding at a $15 billion valuation after 3 months of bro-hustling work (“work”). nothing makes sense anymore anyway. good luck, call me when you get too lost in inefficiency and finally want to talk to somebody who knows how to turn the world around.