Opsmas 2025 Day 3: datakit & ram prices
TOC:
- datakit
- ram prices
datakit oh datakit how thy data structures are so memory-efficient!
datakit (ref: datakit public release announcement) is my “most memory-efficient-possible” collection of data structures, OS abstractions, algorithms, and system/platform architectures.
This year I had a chance to revisit and update it for modern usage: more ARM optimizations, more correctness checks, more test coverage, more feature extensions, more capabilities, more data structures, more documentation, more usable interfaces, more optimizations, and more safety in many places.
random new stuff
basic modules and features are still:
Core & Utilities:
- databox - Universal 16-byte polymorphic container
- flex/mflex - Compressed variable-length arrays
- String utilities (str, dks, strDoubleFormat)
Scale-Aware Containers:
- multimap - Key-value store with 1-to-3 bytes of per-entry overhead (Small/Medium/Full/Atom variants)
- multilist - Linked lists with multi-element nodes
- multiarray - Dynamic arrays (Small/Medium/Large variants)
- multidict, multilru, multiroar
Integer Sets & Cardinality:
- intset - Variable-width integer sets (16/32/64-bit auto-promotion)
- intsetU32 - Fixed 32-bit integer sets
- hyperloglog - Probabilistic cardinality estimation
Compression & Encoding:
- xof - XOR-based double compression (Gorilla-style)
- dod - Delta-of-delta integer encoding
- float16 - Half-precision floating point
Memory Management:
- membound, fibbuf, jebuf, ptrPrevNext, offsetArray
System & OS:
- OSRegulate, setproctitle, versionOSRuntime
- fastmutex, timeUtil, multiTimer
Utilities:
- list, multiheap, intersectInt (SIMD-optimized)
more additions and improvements include, but are not limited to:
NEW MODULES
Fenwick Tree Template System (49 files)
- Complete Binary Indexed Tree implementation with 2-tier architecture (Small/Full)
- 10 numeric types: I16, I32, I64, I128, U16, U32, U64, U128, Float, Double
- Operations: O(log n) point update, prefix query, range query, lower bound
- Memory efficiency: I16 uses 75% less memory than I64, better cache utilization
- Automatic tier transitions: Small (contiguous, 128KB max) → Full (unlimited)
- Comprehensive documentation and testing (fenwickI64Test.c)
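The point-update/prefix-query mechanics behind all those generated variants are the classic BIT index walk. A minimal standalone sketch (the `fw*` names and fixed size are illustrative, not datakit's generated fenwickI64 API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal 1-indexed Fenwick (Binary Indexed Tree) sketch over int64_t.
 * Illustrative only; datakit's generated variants add tiering (Small/Full)
 * and capacity management on top of this core idea. */
#define FW_N 64

typedef struct fw {
    int64_t tree[FW_N + 1];
} fw;

static void fwInit(fw *f) {
    memset(f->tree, 0, sizeof(f->tree));
}

/* O(log n) point update: add 'delta' at position i (1-based). */
static void fwUpdate(fw *f, int i, int64_t delta) {
    for (; i <= FW_N; i += i & -i) {
        f->tree[i] += delta;
    }
}

/* O(log n) prefix sum over positions [1, i]. */
static int64_t fwPrefix(const fw *f, int i) {
    int64_t sum = 0;
    for (; i > 0; i -= i & -i) {
        sum += f->tree[i];
    }
    return sum;
}

/* Range query [lo, hi] as a difference of two prefix sums. */
static int64_t fwRange(const fw *f, int lo, int hi) {
    return fwPrefix(f, hi) - fwPrefix(f, lo - 1);
}
```

The `i & -i` trick isolates the lowest set bit, which is what makes both walks logarithmic.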
Segment Tree Template System (49 files)
- Range query structure supporting MIN/MAX/SUM operations in O(log n)
- Same 10 numeric types and 2-tier architecture as Fenwick
- Full tier includes lazy propagation for efficient range updates
- Template-based code generation from 3 core implementation files
- Testing infrastructure with operation-specific validation
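For a feel of the O(log n) range query the module provides, here is a minimal iterative segment tree for MIN over a fixed power-of-two array (no lazy propagation; names are illustrative, not datakit's generated API):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal bottom-up segment tree sketch for MIN queries. The real module
 * generates MIN/MAX/SUM variants across 10 numeric types with 2 tiers. */
#define SEG_N 8 /* power of two for simplicity */

typedef struct seg {
    int64_t t[2 * SEG_N]; /* leaves live at t[SEG_N .. 2*SEG_N-1] */
} seg;

static void segBuild(seg *s, const int64_t *vals) {
    for (int i = 0; i < SEG_N; i++) {
        s->t[SEG_N + i] = vals[i];
    }
    /* each internal node is the min of its two children */
    for (int i = SEG_N - 1; i >= 1; i--) {
        s->t[i] = s->t[2 * i] < s->t[2 * i + 1] ? s->t[2 * i] : s->t[2 * i + 1];
    }
}

/* Min over half-open range [lo, hi) in O(log n). */
static int64_t segMin(const seg *s, int lo, int hi) {
    int64_t best = INT64_MAX;
    for (lo += SEG_N, hi += SEG_N; lo < hi; lo >>= 1, hi >>= 1) {
        if (lo & 1) { if (s->t[lo] < best) best = s->t[lo]; lo++; }
        if (hi & 1) { hi--; if (s->t[hi] < best) best = s->t[hi]; }
    }
    return best;
}
```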
Persistence Subsystem (15 files in src/persist/)
- Pluggable serialization framework for all major datakit structures
- Snapshot files (.snap) for point-in-time state capture
- Write-Ahead Log (.wal) for incremental operation durability
- Automatic crash recovery via snapshot + WAL replay
- Configurable compaction to merge WAL into snapshots
- Multiple checksum algorithms (XXH32/64/128) for data integrity
- Supported structures: flex, intset, multilist, multimap, multidict, multiOrderedSet, multilru, multiroar
- persistCtx wrapper provides automatic WAL logging with sync policies
- Comprehensive test suite with round-trip verification
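The recovery ordering is the important invariant: restore the last snapshot, then replay the WAL on top of it, in order. A toy sketch where the "state" is just a counter and each logged operation is a signed increment (the real subsystem replays serialized structure operations inside checksummed frames):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model of snapshot + WAL recovery. Hypothetical names; not the
 * persistCtx API. */
typedef struct walOp {
    int64_t delta;
} walOp;

static int64_t recover(int64_t snapshotState, const walOp *wal, size_t nOps) {
    int64_t state = snapshotState;   /* 1) load point-in-time snapshot */
    for (size_t i = 0; i < nOps; i++) {
        state += wal[i].delta;       /* 2) replay WAL entries in order */
    }
    return state;                    /* state as of the crash point */
}
```

Compaction is then just "take the recovered state, write a fresh snapshot, truncate the WAL."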
NEW DATA STRUCTURES
multiOrderedSet (8 files, 3-tier architecture)
- Sorted set storing (score, member) pairs
- Automatic tier promotion: Small (flex) → Medium (indexed flexes) → Full (dict+flex)
- Operations: Add, Remove, GetRank, RangeQuery, PopMin/Max, IncrBy
- Tiering thresholds: 2KB (Small→Medium), 6KB (Medium→Full)
- Full tier uses multidict for O(1) member lookup + flex for sorted iteration
multiFenwick (4 files)
- Binary Indexed Tree wrapper using multilist for storage
- Supports any numeric databox type without type decay
- O(log n) operations: Update, Query, RangeQuery, Get, Set, LowerBound
- Leverages multilist’s automatic tier management
atomPool (2 files)
- Unified string interning interface with pluggable backends
- VTable-based polymorphism for backend abstraction
- HASH backend (stringPool): O(1) operations, ~84 bytes/entry
- TREE backend (multimapAtom): O(log n) operations, ~22 bytes/entry, 73% memory savings
- Reference counting with automatic cleanup on release
stringPool (2 files)
- High-performance hash-based string interning (atomPool HASH backend)
- Two-way mapping: multidict (string→ID) + array (ID→entry)
- O(1) average-case operations for intern/lookup
- Free list for ID reuse after release, keeps IDs compact
- 1-based IDs (0 = invalid/not found)
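The ID scheme is worth a sketch on its own: 1-based IDs (so 0 can mean "not found"), refcounts, and a free-list stack so released IDs get recycled and the ID space stays compact. Lookup below is a linear scan purely for brevity; the real pool uses multidict for O(1):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy string-interning pool illustrating stringPool's ID + free-list scheme.
 * Stores caller-owned pointers without copying, for brevity. */
#define POOL_MAX 16

typedef struct pool {
    const char *entry[POOL_MAX + 1]; /* slot 0 unused: ID 0 = invalid */
    uint32_t refs[POOL_MAX + 1];
    uint32_t freeList[POOL_MAX];     /* stack of released IDs */
    uint32_t nFree;
    uint32_t next;                   /* next never-used ID */
} pool;

static void poolInit(pool *p) {
    memset(p, 0, sizeof(*p));
    p->next = 1;
}

static uint32_t poolIntern(pool *p, const char *s) {
    for (uint32_t id = 1; id < p->next; id++) {
        if (p->entry[id] && strcmp(p->entry[id], s) == 0) {
            p->refs[id]++;           /* already interned: bump refcount */
            return id;
        }
    }
    /* prefer a recycled ID so IDs stay dense */
    uint32_t id = p->nFree ? p->freeList[--p->nFree] : p->next++;
    p->entry[id] = s;
    p->refs[id] = 1;
    return id;
}

static void poolRelease(pool *p, uint32_t id) {
    if (id && p->refs[id] && --p->refs[id] == 0) {
        p->entry[id] = NULL;
        p->freeList[p->nFree++] = id; /* ID becomes reusable */
    }
}
```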
ulid (2 files, 3,949 lines)
- Universally Unique Lexicographically Sortable Identifiers
- Time-ordered IDs with nanosecond precision + randomness
- Base36 encoding for compact string representation (25 chars for 128-bit)
- Multiple SIMD implementations: Scalar, SWAR, SSE2, AVX2, NEON
- 13+ variant types: EPOCH2020/2024, NS/US/MS precision, DUALNS, SNOWFLAKE
- Runtime capability detection with auto-selection of best SIMD path
- Monotonic counter prevents collisions within same nanosecond
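That last bullet is the subtle one. A sketch of the monotonic-counter rule under an assumed layout (high 64 bits = nanosecond timestamp, low 64 bits = random payload; the actual ulid variants pack these differently per type):

```c
#include <assert.h>
#include <stdint.h>

/* Toy monotonic ID generator: if two IDs land in the same nanosecond, bump
 * the random payload instead of re-rolling it, so IDs stay unique and sort
 * in creation order. Hypothetical layout, not datakit's actual ulid types. */
typedef struct ulid128 {
    uint64_t hi; /* nanosecond timestamp */
    uint64_t lo; /* randomness / monotonic counter */
} ulid128;

static ulid128 ulidNext(ulid128 *last, uint64_t nowNs, uint64_t rnd) {
    ulid128 id;
    if (nowNs == last->hi) {
        id.hi = last->hi;
        id.lo = last->lo + 1;  /* same tick: increment, keep ordering */
    } else {
        id.hi = nowNs;
        id.lo = rnd;           /* new tick: fresh randomness */
    }
    *last = id;
    return id;
}
```

(A production version also has to handle counter overflow within a tick and clocks moving backwards, which this sketch ignores.)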
MAJOR REFACTORS
multidict (6,112+ lines added)
- Conditional insert: AddNX (if-not-exists), AddXX (if-exists), Replace
- Atomic operations: GetAndDelete, PopRandom, IncrBy, IncrByFloat
- Dict operations: Copy, Merge (REPLACE/KEEP modes)
- Bulk operations: AddMultiple, DeleteMultiple
- Memory management: SetMaxMemory, EvictToLimit with eviction callbacks
- Eviction policies: NONE, RANDOM, LRU, LFU, SIZE_LRU
- Optional LRU tracking with zero overhead when disabled
- Byte-based expansion mode for large-value workloads
- Enhanced statistics: GetStats, GetDetailedStats, GetLoadMetrics with histograms
- Result codes: multidictResult enum (ERR, OK_REPLACED, OK_INSERTED)
- Better load factor control: separate expand/shrink thresholds (200%/10%)
- Byte tracking: keyBytes, valBytes, totalBytes fields for memory accounting
- Struct size: 124→140 bytes with new self-management fields
multilru (5,674+ lines added)
- Adaptive entry width system: 9 tiers (5-16 bytes) scaling to 1 exabyte
- Width 5 (16-bit): 64K max entries
- Width 9 (32-bit): 4B max entries
- Width 16 (60-bit): 1E max entries
- Full S4LRU implementation: Multi-level scan-resistant eviction (up to 64 levels)
- Weight tracking (optional): Per-entry weights for size-aware eviction
- Policy system: MLRU_POLICY_COUNT, SIZE, HYBRID
- Eviction strategies: LRU, SIZE_WEIGHTED, SIZE_LRU
- Configuration API: multilruNewWithConfig for declarative setup
- Bulk eviction: EvictN, EvictToSize with callbacks
- Introspection: GetStats, GetNLowest, GetNHighest
- Weight operations: InsertWeighted, UpdateWeight, GetWeight, TotalWeight
- Level mask optimization: O(1) lowest-entry lookup via bitmap
- Intrusive free list for efficient slot recycling
- Per-level tail pointers for S4LRU demotion semantics
- Zero-overhead optionality: weight tracking adds 8 bytes/entry only when enabled
- Promotion/demotion: Hits move up one level, evictions demote down before removal
multiroar (8,422+ lines added)
- Bitcount: Count total set bits across all chunks
- Min/Max operations: Get min/max set bit, IsEmpty check
- Comparison: Intersects, IsSubset, Equals
- Rank/Select (succinct): Rank (count bits in [0, pos)), Select (find k-th set bit)
- Range operations: RangeCount, BitSetRange, BitClearRange, BitFlipRange
- Set difference: AndNot (A-B) with in-place and constructor variants
- N-way operations: AndN, OrN, XorN for N≥2 bitmaps (in-place and constructors)
- Iterator: Forward iteration with Init, Next, Reset
- Bulk operations: BitSetMany, BitGetMany, ToArray, FromArray
- Similarity metrics: Jaccard, HammingDistance, Overlap, Dice
- Serialization: Serialize, Deserialize, SerializedSize for persistence
- Memory stats: MemoryUsage introspection
- Remove operation: Clear bit with auto-conversion between representations
- Const correctness: Many functions now accept const multiroar*
- SIMD-optimized popcount via StrPopCntExact
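Rank is a good example of why popcount matters here: over a flat bitmap, Rank(pos) is whole-word popcounts plus one masked partial word. multiroar does the same per chunk with cached cardinalities; this illustrative flat version shows the core:

```c
#include <assert.h>
#include <stdint.h>

/* Rank sketch: count set bits in [0, pos) over a flat uint64_t bitmap.
 * Illustrative only; multiroar skips chunks using per-chunk cardinalities. */
static uint64_t bitmapRank(const uint64_t *words, uint64_t pos) {
    uint64_t rank = 0;
    uint64_t word = pos / 64, bit = pos % 64;
    for (uint64_t w = 0; w < word; w++) {
        rank += (uint64_t)__builtin_popcountll(words[w]);
    }
    if (bit) {
        /* partial word: keep only bits strictly below 'pos' */
        rank += (uint64_t)__builtin_popcountll(words[word] & ((1ULL << bit) - 1));
    }
    return rank;
}
```

Select (find the k-th set bit) is the inverse walk: skip words by popcount until the k-th bit must be inside one, then scan within it.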
intset (Tiered refactor, 7 new files)
- Three-tier architecture with automatic promotion:
- Small: int16_t only, max 64KB or 32K elements
- Medium: int16_t + int32_t, max 8MB or 2M elements
- Full: int16_t + int32_t + int64_t, unlimited
- Separate sorted arrays per type within each tier (was: single array with encoding)
- Pointer tagging: Lower 2 bits encode tier type (SMALL/MEDIUM/FULL)
- Zero-copy upgrades when possible, arrays reused during transitions
- Memory efficiency: Small sets use 2 bytes/element, upgrade only when needed
- Per-tier iterators: intsetSmallIterator, intsetMediumIterator, intsetFullIterator
- Binary search implemented per-type within each tier
- New files: intsetCommon.h, intsetSmall.{c,h}, intsetMedium.{c,h}, intsetFull.{c,h}, intsetTest.c
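The pointer-tagging trick relies on allocations being at least 4-byte aligned, which leaves the low 2 bits of every pointer free to carry the tier. A sketch with illustrative names (not datakit's actual symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Low-2-bit pointer tagging sketch: encode the tier in the pointer itself,
 * so the handle type stays a single word with zero extra storage. */
typedef enum { TIER_SMALL = 0, TIER_MEDIUM = 1, TIER_FULL = 2 } tier;

static inline void *tagPtr(void *p, tier t) {
    return (void *)((uintptr_t)p | (uintptr_t)t);
}

static inline tier tagGet(void *tagged) {
    return (tier)((uintptr_t)tagged & 0x3);
}

static inline void *tagStrip(void *tagged) {
    return (void *)((uintptr_t)tagged & ~(uintptr_t)0x3);
}
```

Every public operation then dispatches on `tagGet()` and strips the tag before touching the underlying tier structure.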
offsetArray (548+ lines added, was stub)
- Sparse array with automatic offset adjustment for upward/downward growth
- Template-based type system: offsetArrayCreateTypes(Name, ValueType, IndexType)
- Operations: Grow, Get (lvalue), Set, Count, Low/High bounds, Empty check
- Memory efficiency: Only allocates storage for actual range being used
- Dynamic range: Grows both up and down while maintaining efficient access
- Use case: Sliding windows, network packet sequences, time-series data
PERFORMANCE & BENCHMARKING
compressionBench (682 lines, new)
- Comprehensive framework comparing datakit vs varint compression
- Integer: dod (delta-of-delta), varintDelta (ZigZag), varintBP128 (SIMD block-packed)
- Float: xof (bit-packed XOR), varintFloat (IEEE 754 component separation)
- Metrics: bytes/element, encode/decode throughput (Mops/s), compression ratio
- Test scales: Small (100), Medium (10K), Large (1M), XLarge (10M)
- Table formatting with performance summaries
dataspeed (1,331+ lines added)
- System performance benchmark suite with comprehensive profiling
- CPU: Integer/float ops, FMA throughput, function call overhead (Gops/s)
- Memory: Bandwidth (read/write/copy GB/s), latency (L1/L2/L3/RAM ns)
- Cache hierarchy benchmarks at different levels
- System info detection: CPU model/count/freq, cache sizes, memory
- SIMD support: AVX-512, AVX2, SSE2, ARM NEON detection and usage
- Accurate timing: Compiler barriers, memory fences, RDTSC for cycle-accurate
- Statistical analysis: min/max/median/p95/stddev with sample collection
- Output formats: Human-readable, CSV, JSON with box-drawing reports
- Platform-specific timing: mach_absolute_time (macOS), clock_gettime (Linux)
membound (993+ lines added)
- Optimize: O(1) freelist search via bitmap + CTZ (was O(LOGMAX) linear scan)
- Optimize: Replace division with shifts for power-of-2 atom sizes
- Add: SIMD-accelerated zeroing in memboundCalloc (AVX-512/AVX2/SSE2/NEON, 256-byte threshold)
- Add: memboundCalloc (zero-init), memboundBytesUsed, memboundBytesAvailable, memboundCapacity, memboundOwns
- Fix: Bitmap corruption during link/unlink operations
- Fix: Upgrade statistics from uint32_t to uint64_t to prevent overflow
- Improve: NULL safety, overflow protection, growth validation
- Safety: IncreaseSize only allows growth when no allocations exist
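The bitmap + CTZ change is the fun one: a 64-bit occupancy bitmap marks which size classes have free blocks, and count-trailing-zeros finds the smallest class that can satisfy a request in one instruction instead of a linear scan. A sketch (illustrative names; membound's real structures differ):

```c
#include <assert.h>
#include <stdint.h>

/* O(1) freelist lookup sketch: bit i of 'freeBitmap' is set when size
 * class i has at least one free block. */
static int smallestFreeClass(uint64_t freeBitmap, int minClass) {
    /* mask off classes too small for the request (minClass in 0..63) */
    uint64_t eligible = freeBitmap & (~0ULL << minClass);
    if (!eligible) {
        return -1;                     /* nothing large enough is free */
    }
    return __builtin_ctzll(eligible);  /* lowest set bit = smallest fit */
}
```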
memtest (Major refactor: 417 deleted, 948 added)
- Complete rewrite with modular context system
- Multiple test patterns: Address, random fill, solid (0/0xFF), checkerboard (0xAA/0x55)
- Progress callbacks for custom reporting
- Result structures: memtestResult with bytes_tested, errors_found, passes, duration
- Two modes: Destructive (fast) and Preserving (backup/restore, cache-defeating)
- API functions: memtest, memtestWithResult, memtestWithProgress, memtestInteractive, memtestAllocAndTest, memtestProcessMemory
- Platform support: Linux (/proc/self/maps), macOS (limited), BSD (procfs-dependent)
- Comprehensive unit test suite for all patterns and modes
- Timing tracking via memtest_time_ns, throughput reporting (MB/s)
TESTING & VALIDATION
databox (451 lines added)
- Extensive fuzz testing with 10K+ random comparisons using oracles
- Test coverage: signed/unsigned/double comparison correctness
- Cross-type comparison consistency validation
- Bytes comparison with strcmp oracle
- 128-bit integer operations
- Stress test: 100K random comparisons
- Boundary testing: embed threshold (0-16 bytes), INT64_MIN/MAX, UINT64_MAX
- Type coverage: void, error, bool, null, floats, all numeric types
databoxLinear (1,103+ lines added)
- Linear array operations: Append, insert, remove with bounds checking
- Memory safety: Reallocation with capacity tracking
- Iterator support: Forward/backward iteration
- Sorting: Multiple sort algorithms with comparison functions
- PARTS_DECODE macro enhancements for fixed-width types
linearBloom Enhancements
- Simplified bit indexing: Direct slot/mask computation with LB_SLOT/LB_MASK macros
- Early-exit check variant: linearBloomHashCheckEarlyExit for faster negative queries
- SWAR-optimized decay operations: Half, Quarter, Eighth (O(1))
- Time-based exponential decay with probabilistic rounding (xorshift64 PRNG)
- LUT-based deterministic decay (faster than per-entry float multiply)
- Fix: Boundary value handling (entries 21, 42 spanning word boundaries)
- Performance: SWAR avoids per-entry multiplies; ~90K boundary values fit on the stack
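The SWAR "Half" decay is two operations for eight counters: shift the whole packed word right by one, then mask the top bit of every byte lane so no counter leaks a bit into its neighbor. A sketch assuming 8-bit counters packed in a uint64_t (the real module's lane width may differ):

```c
#include <assert.h>
#include <stdint.h>

/* SWAR halving of eight 8-bit counters in one uint64_t. The mask clears
 * the bit that each counter's LSB would otherwise shift into the byte
 * below it. Illustrative, not linearBloom's actual code. */
static uint64_t swarHalfBytes(uint64_t counters) {
    return (counters >> 1) & 0x7F7F7F7F7F7F7F7FULL;
}
```

Quarter and Eighth are just the same idea with larger shifts and correspondingly wider masks.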
General Testing Infrastructure
- ctest.h: Better test framework integration
- All new components include extensive test suites
- Performance validation: Benchmarks verify correctness alongside speed
- Cross-platform: Tests work on Linux, macOS, BSD with platform-specific paths
BUG FIXES & CORRECTNESS
Core Fixes
- xof: Fix undefined behavior (bits >> 64) with proper guard for lengthOfOldBits
- membound: Bitmap corruption during link/unlink, overflow in statistics
- linearBloom: Boundary value corruption in SWAR operations
- databox: Signed comparison edge cases, embed threshold correctness
- intset: Variable encoding issues resolved via tiered separate-array design
Build System
- Remove multimapIndex.{c,h} (75+26 lines, obsolete)
- Remove multimapAtomLRU.h (1 line, unused)
- Add DATAKIT_TEST mode option independent of BuildTestBinary
- Add fenwick/ and segment/ subdirectories with conditional 128-bit support
- Ensure tests can run with assertions enabled (add_definitions(-UNDEBUG))
ENHANCEMENTS TO EXISTING MODULES
float16 (282+ lines added)
- Batch conversion implementations for arrays
- ARM64 NEON hardware conversion support
- Consistent handling across x86 (F16C) and ARM64 platforms
hyperloglog (143+ lines added)
- Enhanced cardinality estimation operations
- Improved merge and union operations
flex (803+ lines added)
- Enhanced encoding/decoding operations
- Better memory management
- Improved iterator support
dod (120+ lines added)
- Enhanced writer API with dodCloseWrites, dodGetWriterBytes
- Better support for delta-of-delta compression
- Improved read operations (dodReadAll)
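For anyone new to dod, the transform itself is tiny: store the change in the deltas rather than the values, so regularly spaced timestamps become runs of zeros that a bit-packer then encodes in about a bit each. A sketch of the transform and its inverse, without the bit packing (illustrative names, not the dod writer API):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Delta-of-delta transform: out[i] = (ts[i]-ts[i-1]) - (ts[i-1]-ts[i-2]),
 * with an implicit 0 baseline before the first element. */
static void dodEncode(const int64_t *ts, size_t n, int64_t *out) {
    int64_t prev = 0, prevDelta = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t delta = ts[i] - prev;
        out[i] = delta - prevDelta;   /* steady cadence -> zero */
        prevDelta = delta;
        prev = ts[i];
    }
}

/* Exact inverse: rebuild deltas, then values. */
static void dodDecode(const int64_t *dod, size_t n, int64_t *out) {
    int64_t prev = 0, prevDelta = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t delta = prevDelta + dod[i];
        out[i] = prev + delta;
        prevDelta = delta;
        prev = out[i];
    }
}
```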
dks (392+ lines added)
- UTF-8 substring operations: DKS_SUBSTR_UTF8
- Improved string manipulation functions
- Better memory safety and bounds checking
str utilities (158+ lines added)
- strPopcnt: Enhanced popcount operations (165+ lines)
- strToBufSplat: Improved buffer operations (64+ lines)
- strToNative: New native conversion utilities (68+ lines)
- strUTF8: UTF-8 handling enhancements (134+ lines)
xof (302+ lines added)
- Resumable reader with O(1) sequential access
- Better documentation and API clarity
- Fixed undefined behavior in bit shifts
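The XOR transform at the heart of xof is worth seeing once: reinterpret each double's bits and XOR with the previous value, so repeated or slowly changing values produce mostly-zero words that pack tightly. The real encoder then stores leading/trailing-zero runs (the Gorilla scheme); this shows only the XOR step:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-level XOR of consecutive doubles. memcpy is the strict-aliasing-safe
 * way to reinterpret the bits. Illustrative, not the xof API. */
static uint64_t doubleBits(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u;
}

static uint64_t xorWithPrev(double prev, double cur) {
    return doubleBits(prev) ^ doubleBits(cur); /* identical values -> 0 */
}
```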
timeUtil (40+ lines added)
- Enhanced time utilities
- Better cross-platform time handling
list (minor fixes)
- Correctness improvements
multiarray (268+ lines added)
- Enhanced array operations
- Better memory management
multilist (343+ lines added)
- Improved list operations
- Better tier management
multiTimer (8+ lines changed)
- Minor improvements
INFRASTRUCTURE & TOOLING
Compiler Optimizations
- Add: COMPILER_BARRIER, DO_NOT_OPTIMIZE, TIMING_BARRIER for accurate benchmarking
- Prevent optimizer from invalidating benchmark measurements
- Memory fences for instruction reordering prevention
Platform Support
- Enhanced macOS support (mach_absolute_time)
- Better Linux support (clock_gettime, /proc/self/maps)
- BSD support (procfs-dependent features)
- ARM64 (NEON) optimizations throughout
- x86-64 (AVX-512, AVX2, SSE2) optimizations
Build System (CMakeLists.txt, 351+ lines changed)
- Add fenwick/ subdirectory with CMakeLists.txt
- Add segment/ subdirectory with CMakeLists.txt
- Add persist/ subdirectory integration
- Conditional 128-bit type compilation (check_type_size("__int128_t"))
- DATAKIT_TEST mode option (independent of BuildTestBinary)
- Ensure assertions enabled for tests (-UNDEBUG)
- Updated compiler flags and optimizations
BENEFITS
Memory Efficiency
- Tiered architectures reduce memory footprint by 50-75% for small datasets
- Type-specific implementations (I16/I32 vs I64) enable optimal sizing
- Adaptive width system in multilru scales from 64K to 1 exabyte
- Separate arrays in intset tiers eliminate encoding overhead
Performance
- SIMD optimizations: AVX-512, AVX2, SSE2, NEON throughout
- O(1) operations via bitmap optimization (membound CTZ)
- Cache-friendly Small tiers: 20-33% faster than Full for hot data
- Branch-free code patterns in critical paths
Correctness
- Comprehensive fuzz testing with oracles
- Extensive boundary and edge case testing
- Fixed undefined behavior (bit shifts, overflows)
- Better assertions and validation
Maintainability
- Template systems: 49 files generated from 3 templates (zero code duplication)
- DRY principle: Single source of truth
- Clear separation of concerns (tiers, backends)
- Comprehensive test coverage for all new code
Production Readiness
- Persistence subsystem with crash recovery
- Serialization support for network transmission
- Memory limits and eviction policies
- Comprehensive statistics and introspection
- Cross-platform support (Linux, macOS, BSD, ARM64, x86-64)
stats
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
C 170 139753 99778 17746 22229
C Header 147 20444 11094 6340 3010
CMake 3 848 571 147 130
-------------------------------------------------------------------------------
Markdown 53 15739 0 10256 5483
|- BASH 29 554 409 93 52
|- C 51 23340 15088 4222 4030
|- CMake 5 65 41 13 11
|- Makefile 1 26 18 3 5
|- YAML 1 31 24 1 6
(Total) 39755 15580 14588 9587
===============================================================================
Total 373 176784 111443 34489 30852
===============================================================================
license
as always, my large 10,000+ hour projects like datakit are released under a custom “it’s source-available for you to read, but you must pay me to use it” license. If you want to use it and your company consists of more than just a squirrel and an entry-level intern, rates start as low as $64,000 per year and go up to $65,000,000 per year (net after taxes) depending on your scale and how evil you are (tech industry just done gone been doing some evil-maxxing speedrun these past 10 years huh).
ram prices
Why have I spent time over the past 10+ years focusing on things like creating byte-minimal perfect data structures?
Maybe it was for this very day.
The day when you realize RAM prices are up 300% to 500% in 3 months because AI companies want to privately claim over 80% of the planet’s entire semiconductor output capacity for their digital god datacenters.
Maybe that’s why efficient data structures matter.
The truth is, the industry never actually even reconciled with “the 64-bit apocalypse” when all pointers went from 4 bytes to 8 bytes due to the 32->64 bit expanded memory space adventure. If you have a billion 4-byte pointers tracking 16 bytes of customer data each, you’re already at 20% overhead using 4 GB of pointers to track 16 GB of data, but with 8-byte pointers, now you are burning 16 GB in user data and 8 GB in useless (to “business value”) pointer accounting (not to mention: blown CPU cache hierarchies and fetch paths due to just larger “fluff” everywhere; and worse, 64-bit pointers only use 48 bits for memory address space!).
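That back-of-envelope overhead math, spelled out as code (integer math, rounded down):

```c
#include <assert.h>

/* Pointer bytes as a percentage of total bytes per record. */
static unsigned ptrOverheadPct(unsigned ptrBytes, unsigned dataBytes) {
    return 100u * ptrBytes / (ptrBytes + dataBytes);
}
```

4-byte pointers over 16-byte records: 20% overhead. 8-byte pointers over the same records: 33%.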
But, nobody really cared. Memory was getting cheaper every year for years at a time.
Until it wasn’t.
Now maybe you’d be interested in systems experts with lifetime experience capable of optimizing platforms down to 1 to 3 bytes of overhead instead of burning 8-96 bytes of overhead per data point if you are storing millions to billions of in-memory datapoints. Memory overhead minimization from data structure implementation efficiency gets you improved datapoint efficiency and greater business value and even power savings and lower app latency across all your platforms. You get improved data scanning and iteration speeds. You get faster and better and wake up feeling happier and looking more attractive, what’s not to like?
Imagine if memory prices don’t go down and every byte of RAM on every device suddenly matters a whole lot more. Think of the billions of phones with 8, 12, 16 GB RAM which may not increase again for years if things don’t turn around. Why isn’t your app more efficient? What if your servers won’t all just default to having 1.5 TB RAM forever like everybody seems to program as a default expectation these days? What if you actually have to be efficient for the next 5+ years with no relief from “cheap big servers everywhere on demand?” What if?
I still have some old unmaintained stats at my slightly (just slightly) abandoned in-memory database (due to lack of interest at the time) marketing page over here we’ll also revisit again soon:
- https://carriercache.com/replace-memcached
- https://carriercache.com/benchmarks
- https://carriercache.com/specs
- https://carriercache.com/business
Why burn 30 GB RAM storing user data when you can store the same user data in 3-5 GB RAM instead? All automatically? Just by using better from-the-ground-up “every-byte-matters” data architecture? In the right hands, these tools and mindsets can save billions of dollars when dealing with platforms at-scale.
but ymmv. who cares, just keep running mysql json graveyards doing ETL to elasticsearch and storing 96 replicated copies of your data across 13 cities in 9 countries fronted by cloudfail to run your 8,000 unauthenticated-dependency-deep “but it only takes 8 lines of code to write!” todolist demo app you rebranded as “AI todo lists for private equity vet clinics!” and got $80 million in VC funding at a $15 billion valuation after 3 months of bro-hustling work (“work”). nothing makes sense anymore anyway. good luck, call me when you get too lost in inefficiency and finally want to talk to somebody who knows how to turn the world around.