Opsmas 2025 Day 12: carrier-db & carrier-cache
carrier db & carrier cache
We’re back here again.
what happens if you have the world’s most efficient in-memory database system, but nobody notices?
carrier db (a redis-protocol-compatible optimized concurrent data storage and processing system) and carrier cache (a memcached-protocol-compatible version) exist and, at scale, can reduce your memory cache node count by 30% to 60% or more depending on your data shape and access patterns. Not to mention, which we shall mention, the carrier platforms are fully multi-core concurrency capable, so you don’t end up doing really stupid things like allocating a 64+ core server with 128 GB RAM to some lazily implemented single-core server process. What a waste! Maybe people will care more now about memory and compute efficiency since the price of RAM has gone up 500% in six months?
We previously covered carrier db a couple posts ago in best database ever, but there have been even more advancements since then.
Let’s review.
(also, the modern carrier db/cache updates aren’t fully released or updated in public anywhere; if this system, architecture, platform, and extensive capability surface sounds useful to you, let me know, i guess, and i can get you a newly optimized build or two to try out.)
carrier db architecture
Carrier is a massively parallel, lock-free, bytecode-driven in-memory database reaching extreme performance through:
- Zero-copy operations where possible
- Per-worker data partitioning eliminating global lock contention
- Dynamic VM dispatch for zero-overhead bytecode execution
- Custom implementations of all critical path components
- Lock-free ring buffers for inter-thread communication
- CPU pinning and thread-local storage optimization
Thread Topology Auto-Configuration
Carrier automatically determines optimal thread counts based on CPU cores:
| CPU Cores | Accept/Parse | Workers | Replier | Total Threads |
|---|---|---|---|---|
| 4 | 2 | 3 | 2 | 7 + listener |
| 16 | 4 | 13 | 4 | 21 + listener |
| 32 | 5 | 27 | 5 | 37 + listener |
| 64 | 6 | 54 | 6 | 66 + listener |
| 128 | 16 | 108 | 16 | 140 + listener |
| 256 | 20 | 220 | 20 | 260 + listener |
Data Flow: Request to Response
1. Connection Acceptance (Listener Thread)
Client connects
│
▼
Listener: accept(listenSock) → clientFd
│
▼
Listener: loopyRingPublishEntry(clientFd) → Parser[hash(clientFd) % N]
2. Parsing (Parser Thread)
Parser: read(clientFd) → buffer
│
▼
Parser: DRPTokenize(buffer) → tokens
│
▼
Parser: carrierDRPParser(tokens) → {cmd, args}
│
▼
Parser: zvmNew(cmd, args) → zvm instance
│
▼
Parser: hash(key) → workerIdx
│
▼
Parser: loopyRingPublishEntry(zvm) → Worker[workerIdx]
3. Execution (Worker Thread)
Worker: zvmRun(zvm) → executes bytecode
│
├─ Key lookup in per-worker KV store
├─ Value manipulation
├─ Cross-shard coordination (if needed)
└─ Store result in zvm
│
▼
Worker: loopyRingPublishEntry(zvm) → Replier[round-robin]
4. Reply (Replier Thread)
Replier: formatResponse(zvm) → RESP/ASCII format
│
▼
Replier: write(clientFd, response)
│
▼
Replier: zvmFree(zvm)
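The key→worker routing in step 2 can be sketched concretely. Any stable hash gives the property that matters (every command for a given key lands on the same worker's private shard); the FNV-1a hash and the `workerFor` name below are illustrative stand-ins, not Carrier's actual hash path:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of hash(key) -> workerIdx routing (not the real
 * Carrier hash). FNV-1a is an any-stable-hash stand-in. */
static uint64_t fnv1a(const char *s, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;          /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= (uint8_t)s[i];
        h *= 0x100000001b3ULL;                   /* FNV prime */
    }
    return h;
}

static unsigned workerFor(const char *key, unsigned numWorkers) {
    /* Same key always maps to the same worker, so single-key
     * commands never need cross-worker coordination. */
    return (unsigned)(fnv1a(key, strlen(key)) % numWorkers);
}
```

Because the mapping is deterministic, the parser can route a pipelined batch of commands to different workers while each key still has exactly one owner.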
Core Components Deep Dive
ZVM: The Virtual Machine (Zero Virtual Machine)
ZVM is a combination of architectures (a mix of sqlite3 VM, erlang VM, and luajit architecture internals) joined together for a more dynamic, high-performance, distributed, massively-multi-core-capable, predictable-performance, memory-efficient database platform.
Purpose: Execute command bytecode with zero dispatch overhead
Design:
- Register-based (not stack-based) for fewer instructions
- ~200+ opcodes covering all operations
Example bytecode (SADD):
Op# | Opcode | P1 | P2 | Description
----|---------------------|-----|-----|---------------------------
0 | zopKeyCreate | 0 | - | Create key register 0
1 | zopKeySetName | 0 | 0 | key[0] = args[0]
2 | zopKeyExistsAsType | 0 | 5 | Check type, jump if wrong
3 | zopForEachValue | 1 | 2 | Loop over args[1..]
4 | zopSetAdd | 0 | 1 | Add value to set
5 | zopJumpOffset | -2 | - | Loop back
6 | zopKeyUpdate | 0 | - | Persist changes
7 | zopReplyCount | 0 | - | Reply with count added
8 | zopHalt | - | - | Done
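To make the dispatch model concrete, here is a toy register-style interpreter in C that executes the loop/add/reply shape of the SADD program above. The opcode names mirror the table, but the semantics are simplified guesses for illustration, not Carrier's actual zvm:

```c
#include <assert.h>
#include <string.h>

/* Toy register-style dispatch loop executing the SADD shape above.
 * Opcode names mirror the table; semantics are simplified guesses. */
typedef enum { zopForEachValue, zopSetAdd, zopJumpOffset,
               zopReplyCount, zopHalt } Op;
typedef struct { Op op; int p1, p2; } Insn;

/* Crude set: linear scan over string pointers (illustration only). */
typedef struct { const char *items[16]; int n; } Set;

static int setAdd(Set *s, const char *v) {
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->items[i], v) == 0) return 0;   /* already a member */
    s->items[s->n++] = v;
    return 1;
}

/* args[0] is the command name; values start at args[1], matching the
 * "Loop over args[1..]" row of the table. */
static int run(const Insn *prog, Set *key, const char **args, int argc) {
    int pc = 0, argi = 1, added = 0;
    for (;;) {
        Insn in = prog[pc];
        switch (in.op) {
        case zopForEachValue:             /* p2 = forward jump when done */
            if (argi >= argc) { pc += in.p2; continue; }
            break;
        case zopSetAdd:
            added += setAdd(key, args[argi++]);
            break;
        case zopJumpOffset:               /* p1 = relative jump (loop back) */
            pc += in.p1;
            continue;
        case zopReplyCount:               /* "reply with count added" */
        case zopHalt:
            return added;
        }
        pc++;
    }
}
```

Register-based programs like this stay short because the loop body touches named registers directly instead of shuffling a stack.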
ZVML: The Command Language
Purpose: Define commands declaratively, compile to bytecode at startup
Why not pre-compile?
- Allows runtime command modification
- Easier development (edit .zvml → rebuild → restart)
- Type safety and error checking at compile time
- Template system for code reuse
Compilation pipeline:
.zvml source files
│ (embedded into binary as C strings)
▼
zvml_lexer.re (re2c) → Tokens
▼
zvml_parser.y → AST
▼
ir/irgen.c → Intermediate Representation
▼
codegen/codegen.c → ZVM bytecode (zvmOperation[])
▼
Register in global command tables
Example ZVML:
command SCARD {
access: READ
run_where: ONE
pre_access: VALUE_PTR
params: (key: Key)
returns: Integer
pre {
key: ensure_type(SET_MAP) or return_zero
}
body {
let count = @set_count(key)
return count
}
}
Compiles to: 6 ZVM operations
loopyRing: Lock-Free Communication
Purpose: Pass data between threads without locks
Design:
- MPSC (Multi-Producer Single-Consumer) ring buffers
- Each level has independent rings for each instance
- Atomic head/tail pointers
- Memory barriers for cross-thread visibility
Structure:
Parser Thread 0 ─┐
Parser Thread 1 ─┼─→ Ring[Worker 3] ─→ Worker Thread 3
Parser Thread 2 ─┘
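A minimal sketch of the ring idea, simplified to single-producer/single-consumer for brevity (loopyRing itself is MPSC, and the names below are illustrative, not its real API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* SPSC bounded ring: one atomic cursor per side, no locks. */
#define RING_CAP 8                        /* must be a power of two */

typedef struct {
    _Atomic uint32_t head;                /* written only by producer */
    _Atomic uint32_t tail;                /* written only by consumer */
    void *slot[RING_CAP];
} Ring;

static bool ringPublish(Ring *r, void *msg) {
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_CAP) return false;  /* full */
    r->slot[h & (RING_CAP - 1)] = msg;
    /* release: the slot write becomes visible before the new head */
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

static void *ringConsume(Ring *r) {
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h) return NULL;              /* empty */
    void *msg = r->slot[t & (RING_CAP - 1)];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return msg;
}
```

The acquire/release pairing is what replaces the lock: the consumer's acquire of `head` guarantees it sees the slot contents the producer released.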
Data Structures: databox
Purpose: Universal value container
Size: 16 bytes on 64-bit systems
Types supported:
- Integers: i8, i16, i32, i64, i128, u8, u16, u32, u64
- Floats: f32, f64
- Bytes: embedded (≤8 bytes) or pointer (>8 bytes)
- Complex: Set, List, QSet, QNSet, HLL, Bitmap
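A hedged sketch of what a 16-byte tagged container can look like in C (field and enum names here are illustrative, not the actual databox definition):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 16-byte tagged value container. */
enum { BOX_I64, BOX_F64, BOX_BYTES_EMBED, BOX_BYTES_PTR };

typedef struct {
    uint8_t type;                 /* which union member is live */
    uint8_t len;                  /* embedded byte length (<= 8) */
    uint8_t pad[6];
    union {
        int64_t i64;
        double  f64;
        uint8_t embed[8];         /* <= 8 bytes live inline: no alloc */
        void   *ptr;              /* > 8 bytes point at external storage */
    } data;
} databox;

_Static_assert(sizeof(databox) == 16, "databox must stay 16 bytes");

static databox boxBytes(const void *p, uint8_t n) {
    databox b = {0};
    if (n <= 8) {                 /* short strings avoid the heap entirely */
        b.type = BOX_BYTES_EMBED;
        b.len = n;
        memcpy(b.data.embed, p, n);
    } else {
        b.type = BOX_BYTES_PTR;   /* caller owns the pointed-at bytes */
        b.data.ptr = (void *)p;
    }
    return b;
}
```

The embed-vs-pointer split is why a universal container can stay at 16 bytes: the common short-value case never allocates.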
Protocol Parsing
DRP (RESP):
- State machine parser (parser generator + re2c)
- Zero-copy where possible (points into read buffer)
- Handles pipelining (multiple commands in one read)
DMP (Memcached ASCII/Binary):
- Text protocol: get key\r\n
- Binary protocol: Fixed headers + data
Pipelining support:
Client sends:
SET key1 val1
SET key2 val2
GET key1
Parser creates:
ZVM[SET key1 val1] → Worker[hash(key1)]
ZVM[SET key2 val2] → Worker[hash(key2)]
ZVM[GET key1] → Worker[hash(key1)]
All execute in parallel!
Replies sent in original order.
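A toy RESP request parser shows the contract pipelining needs: parse one command, report bytes consumed, and return 0 on an incomplete buffer so the caller can read() more. This is an illustration only, not the re2c-generated DRP parser (it assumes a NUL-terminated buffer for strtol, skips trailing-CRLF validation, and caps arguments at 31 bytes):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "*N\r\n($len\r\ndata\r\n)xN" command from buf.
 * Returns bytes consumed, or 0 if the buffer is incomplete/invalid. */
static size_t parseOne(const char *buf, size_t len,
                       char out[][32], int *argc) {
    const char *p = buf, *end = buf + len;
    if (p >= end || *p != '*') return 0;
    long n = strtol(p + 1, (char **)&p, 10);
    if (n < 1 || n > 8) return 0;                 /* toy argument cap */
    if (p + 2 > end || memcmp(p, "\r\n", 2) != 0) return 0;
    p += 2;
    *argc = 0;
    for (long i = 0; i < n; i++) {
        if (p >= end || *p != '$') return 0;
        long blen = strtol(p + 1, (char **)&p, 10);
        if (blen < 0 || blen >= 32) return 0;
        if (p + 2 + blen + 2 > end || memcmp(p, "\r\n", 2) != 0) return 0;
        p += 2;
        memcpy(out[i], p, (size_t)blen);
        out[i][blen] = '\0';
        p += blen + 2;            /* skip data + trailing \r\n */
        (*argc)++;
    }
    return (size_t)(p - buf);
}
```

One read() can then be drained with `while ((used = parseOne(...)))`, creating one ZVM per command, which is how multiple pipelined commands fan out to workers from a single buffer.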
Process Architecture
Forked Processes
Parent (carrierMain)
│
├─> Logger Process (fork)
│ └─> Async log writing
│
└─> Server Process (fork)
├─> Listener Thread (1)
├─> Parser Threads (N)
├─> Worker Threads (M) ← CPU PINNED
└─> Replier Threads (N)
Why fork?
- Logger can get backlogged without taking down the server
- Parent can restart logger if it dies
- Clean separation of concerns
Logger:
- Shared memory region (mmap)
- Lock-free circular buffer
- Async writes to disk/stdout
- Automatic respawn if crashed
LRU Modes
1. SNLRU (Segmented N-Level LRU):
- Multiple levels based on access frequency
- Promotion on hits
- Evict from lowest level first
2. RANDOM_MEMORY:
- Bloom filter for admission control
- Random eviction among eligible keys
- Lower overhead than LRU
3. RANDOM:
- Pure random eviction
- Lowest overhead
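The promotion/demotion mechanics can be sketched with the classic two-segment case of segmented LRU (SNLRU generalizes this to N levels; keys, capacities, and names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two-segment segmented LRU: the N=2 case of SNLRU. */
#define SEG_CAP 2

typedef struct { int k[SEG_CAP + 1]; int n; } Seg;   /* k[0] = MRU */

static bool segRemove(Seg *s, int key) {
    for (int i = 0; i < s->n; i++)
        if (s->k[i] == key) {
            memmove(&s->k[i], &s->k[i + 1],
                    (size_t)(s->n - 1 - i) * sizeof(int));
            s->n--;
            return true;
        }
    return false;
}

static void segPushMRU(Seg *s, int key) {
    memmove(&s->k[1], &s->k[0], (size_t)s->n * sizeof(int));
    s->k[0] = key;
    s->n++;
}

static int segPopLRU(Seg *s) { return s->k[--s->n]; }

typedef struct { Seg probation, protected_; int evicted; } SLRU;

static void slruAccess(SLRU *c, int key) {
    if (segRemove(&c->protected_, key)) {       /* hot hit: refresh MRU */
        segPushMRU(&c->protected_, key);
        return;
    }
    if (segRemove(&c->probation, key)) {        /* warm hit: promote up */
        segPushMRU(&c->protected_, key);
        if (c->protected_.n > SEG_CAP)          /* demote, don't evict */
            segPushMRU(&c->probation, segPopLRU(&c->protected_));
    } else {
        segPushMRU(&c->probation, key);         /* miss: admit at bottom */
    }
    if (c->probation.n > SEG_CAP)               /* evict from lowest level */
        c->evicted = segPopLRU(&c->probation);
}
```

The key behavior: one-hit-wonder keys churn through the probation segment and get evicted, while keys hit twice are promoted and survive scans.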
Configuration
Network
A key component of carrier’s security architecture is allowing multiple network port bindings, with security boundaries set per network configuration.
You can have “admin” networks (like localhost-only or internal-only for stats and admin commands, so you never “accidentally” expose admin features to public IPs), and you can bind a network to a namespace so clients are automatically “chrooted” into a sub-namespace of your server and can’t “escape up” to other keyspaces (i.e. clients can create deeper nested namespace entries, but they can’t escape above or beyond the top-level namespace as defined in a network config).
network main {
host = "0.0.0.0"
port = 64175
protocol = drp
access = "read write stats"
}
network secure {
host = "0.0.0.0"
port = 64176
protocol = drp
enabletls = true
certchain = "/path/to/cert.pem"
privatekey = "/path/to/key.pem"
}
Runtime
runtime {
protocolParserInstances = 8 # Parser threads
dataWorkerInstances = 24 # Worker threads
networkReplierInstances = 8 # Replier threads
pinWorkersToCPU = true # Enable CPU pinning
}
Memory
memory {
softStayUnder = "16GB" # Memory limit
precacheMultiplier = 4 # Object pool size
}
lru {
enable = true
snlruLevels = 4 # LRU promotion levels
}
Startup Sequence
1. main() parses command line
2. Fork logger process
3. Fork server process
4. Server: carrierGlobalInit()
└─> zvmlCommandsInit()
├─> Compile all .zvml files
├─> Generate bytecode
└─> Register commands
5. Load [config file](https://carriercache.com/specs#internals)
6. Setup network (TLS, sockets)
7. processingTreeInstantiate()
├─> Allocate per-worker KV stores
├─> Create loopyRing buffers
├─> Spawn Listener thread
├─> Spawn Parser threads
├─> Spawn Worker threads (CPU pinned)
└─> Spawn Replier threads
8. Enter event loop (never returns)
Shutdown Sequence
1. Receive SIGTERM/SIGINT
2. Set state->shutdown.now = true
3. Levels halt in reverse order:
Replier → Worker → Parser → Listener
4. Wait for all threads to finish
5. Free resources
6. Exit
Graceful shutdown: Completes in-flight requests before exiting
Key Files Reference
| Component | Primary Files |
|---|---|
| Entry point | src/carrierMain.c, src/carrier.c |
| Global init | src/carrierGlobal.c |
| Listeners | src/hierarchy/level-0-Listener.c |
| Parsers | src/hierarchy/level-1-B-Parser.c |
| Workers | src/hierarchy/level-2-Worker.c |
| Repliers | src/hierarchy/level-3-Replier.c |
| ZVM | src/zvm/zvmRuntime.c |
| ZVML | src/zvml/*.c, src/zvml/commands/*.zvml |
| Storage | src/carrierState.c |
| Threading | src/hierarchy/processingTree.c |
| Ring buffers | deps/loopyRing/ |
| Data structures | deps/datakit/ |
Build System
CMake Structure
CMakeLists.txt
├─> deps/CMakeLists.txt # Build all dependencies
└─> src/CMakeLists.txt # Build Carrier
├─> Generate parsers
├─> Generate lexers (re2c)
├─> Generate ZVM dispatch tables
├─> Embed ZVML sources
├─> Compile ZVML compiler
└─> Link everything
What Makes Carrier Better
ZVML: Domain-Specific Command Language
What is it: A custom language for defining database commands compiling to optimized bytecode at startup.
Why it’s unique: Most databases hardcode commands in C. CarrierDB and Carrier Cache define commands declaratively.
How It Works
Write a command in ZVML:
command SCARD {
access: READ
run_where: ONE
pre_access: VALUE_PTR
params: (key: Key)
returns: Integer
pre {
key: ensure_type(SET_MAP) or return_zero
}
body {
let count = @set_count(key)
return count
}
}
Carrier compiles at startup:
ZVML Source → Lexer → Parser → IR → Codegen → ZVM Bytecode
Result: Command executes as if handwritten in C, but:
- Type-safe at compile time
- Easy to modify and extend
- Template system for code reuse
- Cross-shard support built-in
Benefits
Compared to hardcoded C:
- 10x faster development (no rebuild for command changes)
- Type safety catches errors before runtime
- Templates eliminate code duplication
- Easier to audit and verify correctness
Compared to interpreted scripts:
- Compiles to native bytecode (not interpreted)
- Zero overhead dispatch
- No FFI boundary crossings
- Deterministic performance
Example: All 111 commands compile in ~2-3ms at startup.
Template System
Define once, instantiate for multiple types:
template SetOperation<Prefix, Type, Prim> {
command {Prefix}ADD {
params: (key: Key, values: String...)
returns: Integer
body {
for_values(1) {
let added = @{Prim}_add(key, value)
}
return added
}
}
}
# Instantiate for 3 set types
instantiate SetOperation<S, SET_MAP, set> # SADD
instantiate SetOperation<QS, SET_Q, qset> # QSADD
instantiate SetOperation<QNS, SET_QN, qnset> # QNSADD
Result: 3 commands from 1 template, all type-checked and optimized.
Bitmap Operations with Cross-Shard Support
What is it: Full Roaring bitmap support with distributed operations.
Why it’s unique: Cross-shard bitmap operations
Operations
Basic:
SETBIT key offset value
GETBIT key offset
BITCOUNT key [start end]
Advanced:
BITOP AND dest key1 key2 key3 # Works across shards!
BITOP OR dest key1 key2 key3
BITOP XOR dest key1 key2 key3
BITOP NOT dest key
Similarity metrics:
BITJACCARD key1 key2 # Jaccard similarity
BITHAMMINGDIST key1 key2 # Hamming distance
BITOVERLAP key1 key2 # Overlap coefficient
BITDICE key1 key2 # Dice coefficient
Statistical:
BITMIN key # Minimum set bit
BITMAX key # Maximum set bit
BITRANK key position # Rank of bit
BITSELECT key k # Select k-th set bit
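The similarity metrics have standard definitions; here they are over plain 64-bit words using GCC/Clang popcount intrinsics. Illustrative only: Carrier operates on Roaring bitmaps, but the math is the same per word:

```c
#include <assert.h>
#include <stdint.h>

static double jaccard64(uint64_t a, uint64_t b) {
    int inter = __builtin_popcountll(a & b);
    int uni   = __builtin_popcountll(a | b);
    return uni ? (double)inter / uni : 1.0;   /* |A∩B| / |A∪B| */
}

static int hamming64(uint64_t a, uint64_t b) {
    return __builtin_popcountll(a ^ b);       /* count of differing bits */
}

static double dice64(uint64_t a, uint64_t b) {
    int inter = __builtin_popcountll(a & b);
    int total = __builtin_popcountll(a) + __builtin_popcountll(b);
    return total ? 2.0 * inter / total : 1.0; /* 2|A∩B| / (|A|+|B|) */
}
```

Because each metric reduces to popcounts of AND/OR/XOR, the cross-shard versions only need to stream one bitmap's words past the other's, which is why the protocol below works with a single remote read.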
Cross-Shard Example
BITJACCARD bitmap1 bitmap2
# If bitmap1 on Worker 3, bitmap2 on Worker 15:
Worker 3:
1. Lock bitmap1
2. Signal ready
3. Wait for Worker 15
Worker 15:
1. Lock bitmap2
2. Read bitmap1 from Worker 3's memory
3. Compute Jaccard similarity locally
4. Signal complete
Worker 3:
1. Receive completion signal
2. Unlock bitmap1
3. Forward result to client
Cross-Shard Set Operations
What is it: Set operations (UNION, INTERSECT, DIFF) work across workers.
Why it’s unique: Most sharded systems don’t support cross-shard set ops.
Operations
SUNION dest key1 key2 key3 # Union across any workers
SINTER dest key1 key2 key3 # Intersection
SDIFF dest key1 key2 key3 # Difference
SUNIONSTORE dest key1 key2 key3 # Store result
# Also for QSets and QNSets:
QSUNION, QSINTER, QSDIFF, QSUNIONSTORE, ...
QNSUNION, QNSINTER, QNSDIFF, QNSUNIONSTORE, ...
ZVML Compiler Internals
Complete Compilation Pipeline
┌─────────────────────────────────────────────────────────────┐
│ 1. EMBEDDING (Build Time) │
│ src/zvml/commands/*.zvml → zvml_embedded.c │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. LEXICAL ANALYSIS (Startup) │
│ zvml_lexer.re (re2c-generated) │
│ │
│ Input: "command SADD { ... }" │
│ Output: [COMMAND, IDENT, LBRACE, ...] │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. PARSING (Startup) │
│ zvml_parser.y │
│ │
│ Input: Tokens │
│ Output: AST (CommandNode tree) │
│ │
│ CommandNode { │
│ name: "SADD" │
│ params: [(key, KEY), (values, STRING_VARIADIC)] │
│ pre: [TypeCheck(SET_MAP, or_create)] │
│ body: [ForValues, OpCall("set_add"), Return] │
│ } │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. IR GENERATION (Startup) │
│ ir/irgen.c │
│ │
│ • Expand templates │
│ • Resolve type parameters │
│ • Expand meta-primitives │
│ • Type checking │
│ • Generate intermediate representation │
│ │
│ define @SADD: │
│ %key = param 0 │
│ %type_check = call @ensure_type(%key, SET_MAP) │
│ br %type_check, %body, %error │
│ %body: │
│ %count = call @set_add_multi(%key, variadic 1) │
│ ret %count │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. CODE GENERATION (Startup) │
│ codegen/codegen.c │
│ │
│ • Map IR operations to ZVM opcodes │
│ • Allocate registers │
│ • Compute jump offsets │
│ • Optimize (constant folding, dead code elimination) │
│ │
│ zvmOperation ops[] = { │
│ {zopKeyCreate, 0, 0, 0, 0, 0}, │
│ {zopKeySetName, 0, 0, 0, 0, 0}, │
│ {zopKeyExistsAsTypeOrCreate, 0, 5, SET_MAP, 0, 0}, │
│ {zopForEachValue, 1, 3, 0, 0, 0}, │
│ {zopSetAdd, 0, 1, 0, 0, 0}, │
│ {zopJumpOffset, -2, 0, 0, 0, 0}, │
│ {zopKeyUpdate, 0, 0, 0, 0, 0}, │
│ {zopReplyCount, 0, 0, 0, 0, 0}, │
│ {zopHalt, 0, 0, 0, 0, 0} │
│ }; │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. REGISTRATION (Startup) │
│ zvmlRegisterCommands() │
│ │
│ • Lookup slot: carrierGlobalToken.drp.cmds.set.sadd │
│ • Reset slot (clear any manual bytecode) │
│ • Copy ZVML bytecode to slot │
│ • Command now active! │
└─────────────────────────────────────────────────────────────┘
Template Instantiation
Template definition:
template SetOp<Prefix, Type, Primitive> {
command {Prefix}CARD {
params: (key: Key)
returns: Integer
pre {
key: ensure_type({Type}) or return_zero
}
body {
let count = @{Primitive}_count(key)
return count
}
}
}
Instantiation:
instantiate SetOp<S, SET_MAP, set>
What happens:
1. Create a new AST with substitutions:
   - {Prefix}CARD → SCARD
   - {Type} → SET_MAP
   - @{Primitive}_count → @set_count
2. Type-check with concrete types
3. Generate IR and bytecode
Result: SCARD command exists, type-safe, optimized
Meta-Primitive Expansion
Definition:
meta verify_numeric() expands_to {
@verify_all_values_numeric128()
}
meta error_not_numeric() expands_to {
@error(3, 2) # ERROR_TYPE_VALUE, CARRIER_ERR_VALUE_NOT_NUMERIC
}
Usage:
command QNSADD {
pre {
@verify_numeric() # ← Expands here
}
}
Expansion (irgen.c):
1. See @verify_numeric()
2. Look it up in the meta registry
3. Get its body: @verify_all_values_numeric128()
4. Expand inline
5. Continue with IR generation
Recursion: Metas can call other metas!
ZVM Bytecode System
Operation Encoding
zvmOperation structure (40 bytes):
typedef struct zvmOperation {
uint8_t opcode; // 1 byte: Operation type
// 5 tagged parameters (6 bytes each = 30 bytes)
struct {
uint16_t tag : 4; // Parameter type (reg, literal, offset)
uint16_t val : 12; // Parameter value
} p1, p2, p3, p4, p5;
uint8_t padding[9]; // Alignment
} zvmOperation;
Parameter tags:
- TAG_REGISTER - Register index (key[0], val[0])
- TAG_LITERAL - Immediate value
- TAG_ARG - Argument index
- TAG_OFFSET - Jump offset
Advanced ZVML Features
Type Families and Traits
Problem: Sets, QSets, and QNSets share 90% of operations but differ in storage.
Solution: Type families with @if_type conditionals
template UnifiedSetOp<Prefix, Type, Prim> {
command {Prefix}ISMEMBER {
params: (key: Key, value: String)
returns: Integer
pre {
key: ensure_type({Type}) or return_zero
# Only QNSet requires numeric validation
@if_type {Type} in [SET_QN] {
@verify_numeric()
}
}
body {
@if_type {Type} in [SET_Q, SET_QN] {
# Quantum sets use sorted lookup
let exists = @{Prim}_member_sorted(key, value)
} @else {
# Standard sets use hash lookup
let exists = @{Prim}_member(key, value)
}
return exists
}
}
}
# Generates 3 optimized commands:
instantiate UnifiedSetOp<S, SET_MAP, set> # SISMEMBER (hash)
instantiate UnifiedSetOp<QS, SET_Q, qset> # QSISMEMBER (sorted)
instantiate UnifiedSetOp<QNS, SET_QN, qnset> # QNSISMEMBER (sorted + numeric check)
Key insight: @if_type evaluated at template instantiation (compile-time), not runtime!
Result:
- SISMEMBER: No numeric check, hash lookup (5 ops)
- QSISMEMBER: No numeric check, sorted lookup (5 ops)
- QNSISMEMBER: Numeric check, sorted lookup (7 ops)
Variadic Parameters
Support for variable argument counts:
command SADD {
params: (key: Key, values: String...) # ← Variadic
returns: Integer
body {
for_values(1) { # Loop over args[1..]
@set_add(key, value)
}
return @set_count(key)
}
}
Pre-Condition Patterns
Declarative key validation:
pre {
# Ensure key exists with correct type, or create it
key: ensure_type(BITMAP) or create
# Ensure key exists, or return default
key: ensure_type(SET_MAP) or return_zero
key: ensure_type(LIST) or return_minus_one
# Parameter validation
offset: validate(0..4294967295)
bit: validate(0..1)
}
Mutable state:
- Per-worker KV stores (one owner)
- Atomic counters (explicitly synchronized)
- Ring buffers (lock-free algorithm)
Lock-Free Guarantee
Guarantee: Single-key operations never take locks.
Code path (GET operation):
parseFromAcceptedFd() // No locks
→ parseDRP() // No locks
→ DRPPrepare() // No locks
→ zvmNew() // No locks
→ sendToWorker() // Lock-free ring publish
→ zvmRun() // No locks
→ multimapGet() // No locks
→ reply() // No locks
Locks only used for:
- Cross-worker coordination (explicit in bytecode)
- LRU updates (per-worker lock)
- Dynamic command registration (rare)
woah you mean i can dodge bullets
no, i’m saying when the time comes you won’t have to.
carrier db and carrier cache were designed for maximum memory efficiency, maximum security, and maximum computational efficiency. As our world accelerates towards more and more compute lockdown at higher and higher compute prices, you can’t afford to use slow+lazy traditional database compute platforms wasting RAM more valuable by weight than gold just because legacy developers refuse to learn modern systems.
get with the times. save money. save yourself. use carrier cache and carrier db for everything. when you do, you’ll feel better about life the universe and everything.