A Trillion Dollar Data Structure? In *this* Economy?

what’s a trillion dollars1 between friends?

crisis in the linked list forums

recently on the linked list fan forums, a well regarded programmer — perhaps the most highly regarded coder ever — wrote a love letter to linked lists, and the forums went wild.

so what are linked lists and why are so many internet celebrity programmers wrong about them? let’s get into it.

<cue extended opening title sequence>

now for a word from our sponsor: knowing your shit — are you annoyed by being constantly confused about everything? have you tried just knowing your shit going forward instead of not knowing your shit? give knowing your shit a try today!
promocode: KNOWSHIT valid for the first 10,000,000 subscribers limited time offer not available in rural saskatchewan

Reverse TOC:

what is linked list

linked list is a list linked together by pointers.

what are pointers? pointers are whiny little pieces of garbage.

single

so here’s a basic linked list data layout:

so now it looks like: [A] -> [B] -> [C] -> [] — (list object itself is: list { head = A, size = 3 })

but above is 3 list elements with 2 pointers each -> 3 elements == 6 pointers
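
a minimal sketch of that layout in C (names here are just illustrative, not from any particular implementation):

```c
#include <stddef.h>

/* one node of a singly linked list: two pointers per element
 * (next + externally allocated data), as counted above */
struct slist_node {
    struct slist_node *next; /* NULL at the end: [C] -> [] */
    void *data;              /* separately allocated payload */
};

/* the list object itself: list { head = A, size = 3 } */
struct slist {
    struct slist_node *head;
    size_t size;
};
```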

double

we can be a little fancy and give it pointers to the previous element too (so you could walk backwards from any node too):

so now looks: [] <-> [A] <-> [B] <-> [C] <-> [] — (object: list {head = A, tail = C, size = 3})

but above is 3 list elements with 3 pointers each -> 3 elements == 9 pointers
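
and the doubly linked flavor, again just an illustrative sketch:

```c
#include <stddef.h>

/* doubly linked node: three pointers per element (prev, next, data) */
struct dlist_node {
    struct dlist_node *prev;
    struct dlist_node *next;
    void *data;
};

/* the list object: list { head = A, tail = C, size = 3 } */
struct dlist {
    struct dlist_node *head;
    struct dlist_node *tail;
    size_t size;
};
```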

why bad?

implementation simple, production folly

there’s a rift in the developer-sphere between “only do conceptually simple/pure things and never look back” versus “do the best thing possible, no matter how much work it takes, because optimization saves energy/performance/cost/humans over time”2.

speaking of linked lists, the top four complaints about them on modern systems are (there’s a toy demonstration sketch after this list):

  • linked lists are “links” of small (tiny!) individually allocated things, which in a long lived system will end up randomly placed throughout your entire memory space
  • traversing random “links” of small memory addresses kills your performance (plus, each memory allocation carries allocator accounting metadata and size-class padding overhead, so you don’t actually “see” your entire memory usage internally, but your operating system does!)
  • invoking allocator requests for each new list element kills your performance
  • deleting a long-lived randomly-temporally-spatially-allocated data structure now also kills your performance, because even if during “standard operations” you never need to traverse your entire 10 million element linked list, when you go to delete an entire huge list, you now have to walk somewhere between 10 million and 20 million individual pointers from memory3 — which, altogether now: KILLS YOUR PERFORMANCE. this is the opposite of vroom vroom and we do not approve.
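
here’s a toy C program (a sketch, not a rigorous benchmark) showing the pointer-chasing vs. linear-scan gap; note a fresh process tends to hand back fairly sequential nodes, so this understates the long-lived-heap case described above:

```c
/* error handling omitted for brevity; compile with: cc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; long long value; };

int main(void) {
    enum { N = 10 * 1000 * 1000 };

    /* one allocation per element: scattered, allocator-taxed */
    struct node *head = NULL;
    for (long long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    /* one allocation total: linear, prefetch-friendly */
    long long *flat = malloc(N * sizeof *flat);
    for (long long i = 0; i < N; i++) flat[i] = i;

    clock_t t0 = clock();
    long long sumList = 0;
    for (struct node *n = head; n; n = n->next) sumList += n->value;

    clock_t t1 = clock();
    long long sumFlat = 0;
    for (long long i = 0; i < N; i++) sumFlat += flat[i];
    clock_t t2 = clock();

    printf("list sum %lld in %.3fs; array sum %lld in %.3fs\n",
           sumList, (double)(t1 - t0) / CLOCKS_PER_SEC,
           sumFlat, (double)(t2 - t1) / CLOCKS_PER_SEC);

    /* and the deletion complaint: freeing the list is N more pointer walks */
    while (head) { struct node *next = head->next; free(head); head = next; }
    free(flat);
    return 0;
}
```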

overhead

the two primary reasons you should never use unbounded linked lists in large/live data products:

  • pointer overhead can overwhelm data size
  • dereferencing pointers in long-lived systems is the death of performance

so painfully simple “my first data structure” linked lists give you three of the worst things in data land:

  • excessive metadata wasting your expensive RAM
  • unpredictable long-tail data access causing system pauses from inefficient memory access
    • you don’t know where your pointers go in memory4 (L1/L2/L3 caches / RAM?)
    • you don’t know the alignment of elements so a single access could evict 1-2 CPU cache lines
  • unpredictable creation latency due to running allocator requests for each new addition

since we also live in the future, we must recognize pointers are now 8 bytes each, up from the 4 bytes of legacy 32-bit systems.

let’s run some numbers (there’s a quick sanity-check sketch after this list):

  • an implementation-naive double linked list of 100 elements
  • with 32 bit pointers:
    • 100 elements * 3 pointers per element (prev, next, data) * 4 bytes per pointer = 1,200 bytes of overhead (12 bytes per element)
    • so, if your data is less than 12 bytes in each element, you have over 100% wasted memory, but the pointers are small so the accounting overhead is less of a burden than 64-bit systems.
  • with 64 bit pointers:
    • 100 elements * 3 pointers per element (prev, next, data5) * 8 bytes per pointer = 2,400 bytes of overhead (24 bytes per element)
    • so, if your data is less than 24 bytes in each element, you have over 100% wasted memory for just useless accounting data!
      • thought experiment: now imagine you have 300 million lists each with 5,000 elements: that’s 36 TB of wasted overhead not serving any user-pleasing purpose! pure wasted money and user time due to developer incompetence.
      • (plus add an extra 65% overhead for memory allocator fragmentation over time! now you’re wasting 60 TB for no reason other than simple developers stopped learning two decades ago for some unknowable reason)
    • not to mention additional unpredictable latency caused by excessive allocator overhead and unpredictable system access time for scattered random non-spatially-related data through dozens, hundreds, or thousands of gigabytes in memory space.
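
the per-element math above is easy to sanity check with sizeof (a quick sketch; exact values depend on your ABI and allocator):

```c
#include <stdio.h>

/* naive doubly linked node with externally allocated data:
 * prev + next + data = 3 pointers of pure bookkeeping per element */
struct dnode {
    struct dnode *prev;
    struct dnode *next;
    void *data;
};

int main(void) {
    /* typically prints 24 on 64-bit (3 * 8 bytes), 12 on 32-bit (3 * 4),
     * and that's before any allocator header/padding per allocation */
    printf("per-element pointer overhead: %zu bytes\n", sizeof(struct dnode));

    /* the thought experiment: 300 million lists * 5,000 elements * 24 bytes */
    double wastedTB = 300e6 * 5000.0 * 24.0 / 1e12;
    printf("overhead at scale: %.0f TB (x1.65 fragmentation ~= %.1f TB)\n",
           wastedTB, wastedTB * 1.65);
    return 0;
}
```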

there’s lies, damned lies, and computers

additionally, everything about computers is a lie.

multitasking? lie. we just run each process for 1 ms to 10 ms, pause it, run another one, pause it, repeat forever. to our worm-evolved ape brains it looks like multiple things are running concurrently, but it’s mostly lies.

memory addresses? lie. we have gigabytes or terabytes of physical memory, but it gets divided up into random clumps lazily attached to each process as needed then mapped into a fake linear address space your program can see but doesn’t necessarily reflect the underlying hardware reality.

If you think your memory is what you see, you don’t know memory.

Your operating system gives each process a linear view of its own memory space, but the underlying physical pages of memory can be anywhere in your RAM banks. On huge modern systems with terabytes of RAM across 48 RAM slots, part of your process memory could live on completely opposite memory chips, sometimes even requiring trips through additional CPUs! As a developer, your goal is to reduce the number of small non-linear allocations every step of the way so your data remains as spatially consistent in hardware as possible.

The worst thing for performance in any system is having thousands of tiny allocations all referencing each other. Each new allocation, if performed over a long lived process, has no guarantee of being physically adjacent even if the numbers look relatively sequential in your local process address space. You can only improve your chances of obtaining physical guarantees if you allocate memory blocks in a single allocator request6.
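
one common shape for that single-allocation idea (see footnote 6) is a bump/arena allocator; a minimal sketch, with hypothetical names and no resizing or per-object freeing:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* carve many "allocations" out of one malloc so they stay physically
 * close together instead of scattering across the heap over time */
struct arena {
    uint8_t *base;
    size_t   used;
    size_t   capacity;
};

static int arena_init(struct arena *a, size_t capacity) {
    a->base = malloc(capacity);
    a->used = 0;
    a->capacity = capacity;
    return a->base != NULL;
}

static void *arena_alloc(struct arena *a, size_t n) {
    size_t need = (n + 15) & ~(size_t)15;          /* keep pointers aligned */
    if (a->used + need > a->capacity) return NULL; /* arena exhausted */
    void *p = a->base + a->used;
    a->used += need;
    return p;
}

static void arena_destroy(struct arena *a) {
    free(a->base); /* one free() releases every carved-out element at once */
    a->base = NULL;
    a->used = a->capacity = 0;
}
```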

Trillion Dollar Dollar Data Structure Yall

How do we fix the mental virus colloquially named linked lists?

Fixing the mind virus is as simple as spending a couple thousand hours writing memory and CPU optimal data layouts with matching easy to use data access utilities.

To fix linked lists, we need two things:

  • eliminate next/prev pointers
  • store data in blocks without allocating new memory for each new datum

THUSLY, we create a 3-tier system for using “lists” of “linked” elements:

  • small (less than 128 kb total data)
    • just a linear buffer with no pointers for data access
      • (data = buffer)
    • here, the buffer is our linear fully-typed multi-element data holder, so one memory allocation holds multiple elements
  • medium (two small lists)
    • two linear buffers stored in an array next to each other
      • (data[0] = buffer0; data[1] = buffer1)
    • here, we have two small buffers referenced positionally (so the “next” of the last element in buffer0 is the head of buffer1)
  • large (full scale unlimited size)
    • unlimited linear buffers stored in scalable arrays
      • (data.get(0) = buffer0; ...; data.get(N) = bufferN)
    • here, we have an expanded version of medium with growable arrays of buffer references

The implementation secret to unlimited lists without prev/next pointers is using a scalable array underneath which may have amortized pointers for every 2048 linear buffers (and each linear buffer can hold anywhere from 1 to 32k elements7), so our pointers are reduced from “2-3 pointers per list element” to “1 pointer per 2048 buffers” (and if a buffer has 32k elements, we’ve reached 1 pointer per (2048 * 32768) = 67 million elements!)

Pointer-minimization is accomplished using an underlying multi-tier growable array implementation (where we can just use array-based positional encoding for next/previous direction instead of allocating unnecessarily excessive memory everywhere):

  • small
    • just a regular array
    • (holding up to 512 pointers in a traditional linear array)
    • (things[42] = {buffer0, buffer1, ..., buffer41})
  • medium
    • an array of regular arrays
    • (holding up to 512 small arrays indexed by an outer array, so 512 * 512 = 262k arrays max)
    • (thingsMedium[4] = {thingsA[42], thingsB[42], thingsC[42], thingsD[42]})
  • large / unlimited
    • a fully expansive collection of medium arrays
    • (holding unlimited numbers of 262k medium arrays with no scalability limit)
    • (thingsFull_(3) = {thingsMediumA[4] <^> thingsMediumB[4] <^> thingsMediumC[4]})
    • this is the only support structure in the entire data system having an actual “linked” element, but even here we use optimized xor links to minimize unnecessary overhead8.
    • also note: for fast positional searching, each node tracks the current number of elements in the inner opaque buffer allocations, so we can quickly index huge lists by skipping over thousands or millions of elements without ever having to actually “open up” and traverse the inner data buffers.

what about da code?

What does the code for these things look like? here’s a snomp pompk:

the supporting scalable array implementation
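
a rough sketch of the tiered-array idea described above (hypothetical names, plain next pointers instead of xor links, not the shipping code):

```c
#include <stddef.h>
#include <stdlib.h>

#define SMALL_SLOTS   512  /* pointers per small array */
#define MEDIUM_SMALLS 512  /* smalls per medium: 512 * 512 = 262,144 slots */

struct smallArray {
    void *slot[SMALL_SLOTS];
};

struct mediumArray {
    struct smallArray *small[MEDIUM_SMALLS]; /* allocated lazily */
    struct mediumArray *next;                /* chain of mediums = "large" tier */
};

/* map a global position to (medium, small, slot) and fetch the stored pointer */
static void *tieredGet(struct mediumArray *m, size_t pos) {
    const size_t perMedium = (size_t)SMALL_SLOTS * MEDIUM_SMALLS;
    while (m && pos >= perMedium) {
        pos -= perMedium;
        m = m->next;
    }
    if (!m) return NULL;
    struct smallArray *s = m->small[pos / SMALL_SLOTS];
    return s ? s->slot[pos % SMALL_SLOTS] : NULL;
}

/* store a pointer at a global position, growing tiers lazily as needed */
static int tieredSet(struct mediumArray *m, size_t pos, void *value) {
    const size_t perMedium = (size_t)SMALL_SLOTS * MEDIUM_SMALLS;
    while (pos >= perMedium) {
        if (!m->next && !(m->next = calloc(1, sizeof *m->next))) return -1;
        pos -= perMedium;
        m = m->next;
    }
    struct smallArray **s = &m->small[pos / SMALL_SLOTS];
    if (!*s && !(*s = calloc(1, sizeof **s))) return -1;
    (*s)->slot[pos % SMALL_SLOTS] = value;
    return 0;
}
```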

the actual list implementation
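
and a matching sketch of the list side (again hypothetical, simplified to a flat growable descriptor array where the real thing would hang descriptors off the scalable array above): each buffer holds many elements in one allocation, and each descriptor tracks a count so positional lookups skip whole buffers without opening them:

```c
#include <stddef.h>
#include <stdlib.h>

struct bufferDesc {
    size_t count;  /* elements currently stored in this buffer */
    long  *elems;  /* one allocation holding many elements */
};

struct bufferList {
    struct bufferDesc *bufs;
    size_t nbufs;
    size_t perBuffer;      /* element capacity per buffer, e.g. 32768 */
    size_t totalElements;
};

/* append, starting a fresh buffer only when the newest one is full */
static int bufferListAppend(struct bufferList *l, long value) {
    if (l->nbufs == 0 || l->bufs[l->nbufs - 1].count == l->perBuffer) {
        struct bufferDesc *nb = realloc(l->bufs, (l->nbufs + 1) * sizeof *nb);
        if (!nb) return -1;
        l->bufs = nb;
        nb[l->nbufs].count = 0;
        nb[l->nbufs].elems = malloc(l->perBuffer * sizeof(long));
        if (!nb[l->nbufs].elems) return -1;
        l->nbufs++;
    }
    struct bufferDesc *b = &l->bufs[l->nbufs - 1];
    b->elems[b->count++] = value;
    l->totalElements++;
    return 0;
}

/* positional lookup: skip whole buffers using their counts */
static long *bufferListGet(struct bufferList *l, size_t pos) {
    for (size_t i = 0; i < l->nbufs; i++) {
        if (pos < l->bufs[i].count) return &l->bufs[i].elems[pos];
        pos -= l->bufs[i].count;
    }
    return NULL;
}

/* tearing down N elements costs nbufs + 1 frees, not N (or 2N) frees */
static void bufferListFree(struct bufferList *l) {
    for (size_t i = 0; i < l->nbufs; i++) free(l->bufs[i].elems);
    free(l->bufs);
    l->bufs = NULL;
    l->nbufs = l->totalElements = 0;
}
```

usage assumes zero-initializing the struct and picking a buffer size, e.g. struct bufferList l = { .perBuffer = 32768 };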

BUT WAIT — THERE IS MORE

Since we now have an amazing zero-qua-minimal-pointer data structure here, what else can we do?

We can implement scalable zero-pointer maps too!

Performance of our near-zero-overhead maps is legendary9: our real world measured overhead is less than 2% for arbitrary key-value maps at scale and we still retain library performance of 2 million to 5 million inserts per second depending on the data size10.

additionally, a metric almost never tested but which matters tons: how fast can you delete things? If you have a linked list grown in a live server over 3 months of heavy data processing containing a million elements and you want to delete it… well, your process can potentially block for seconds due to combinations of inefficient pointer traversal and allocator calls for each node and its contents! Under my bulk buffer-backed-by-array-metadata method, lists of millions or billions of elements can be deallocated with 1/200,000th the number of memory and allocator calls (it really is amazing watching the benchmarks for freeing a huge linked list take 7,000% longer than freeing the same data in an actual modern-architected system).

draw the rest

If you were closely paying attention (which you weren’t), you noticed we haven’t mentioned the most important part of a zero-pointer system: the linear type-aware buffer enabling us to not allocate additional unique memory for each new data item.

My modern linear type-aware buffer thing is called flex for flexible buffer (which also includes a thing called databox for boxing multiple data types in a shared container) and it’s all sized as:
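
as pure flavor, here’s a hypothetical sketch of the type-tagged-buffer idea (explicitly not the real flex/databox encoding or its size classes): tiny values like true/false/null/small ints spend one tag byte, and bigger values append only the bytes they need:

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical toy encoding, NOT the real flex/databox format:
 * every element starts with a 1-byte tag; tiny values live entirely
 * inside the tag, larger values append just the bytes they need. */
enum {
    TAG_NULL  = 0x00,
    TAG_FALSE = 0x01,
    TAG_TRUE  = 0x02,
    TAG_SMALL = 0x40, /* 0x40..0x7f encode integers 0..63 in the tag itself */
    TAG_INT64 = 0x80, /* tag byte followed by 8 little-endian value bytes */
};

/* append an integer at offset *len (caller guarantees buffer space);
 * returns the number of bytes written */
static size_t appendInt(uint8_t *buf, size_t *len, int64_t v) {
    if (v >= 0 && v < 64) {
        buf[(*len)++] = (uint8_t)(TAG_SMALL | (uint8_t)v); /* 1 byte total */
        return 1;
    }
    buf[(*len)++] = TAG_INT64;
    for (int i = 0; i < 8; i++)
        buf[(*len)++] = (uint8_t)((uint64_t)v >> (8 * i));
    return 9;
}
```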

where does all this live?

this work was part of my 5 year project re-implementing in-memory caches for:

  • modern hardware (fully multi-threaded async operations throughout, non-blocking data operations where possible with a custom preemptive VM architecture)
  • modern use cases (no restrictions on data types in nested elements, unlimited nesting depth, delete root of nested trees drops all sub-elements, lock clients to specific namespaces, enables native JSON output for all commands directly from the server)
  • and modern organizational requirements (security, monitoring, observability)

There’s some older size improvement benchmark data versus existing systems on an abandoned product page too (but the numbers are still valid).

and yes, those results are accurate: my modern approach results in using 600 MB to store 80 million elements where legacy memcached required 8 GB to store the same data — I think this is an amazing improvement, but in modern { i woRk in thE cLouD } mindsets nobody gives a crap about 90% performance improvements or 80% cost reductions anymore. people don’t realize their modern “click to deploy” lifestyle is really just “click to destroy” on a long enough time horizon.

one would think companies would be willing to pay for such massive improvements — my current calculations estimate all this aggregate data structure optimization work plus associated modern server improvements could save 5,000 organizations11 around the world about $100 million (some of them even $100 million per year), so if we multiply those savings out on a conservative 5x multiple, this is a $2.5 trillion data structure project (which nobody actually cares about).

ta da.

so what?

this work is a development genre i call “impossible software” — you couldn’t set a project plan and hire people to create this final result because it requires weird experience history combined with bursts of concentrated sequential work where the goal is “make a better thing better than any other thing in the world” and such development paths can’t be organized, they can’t be “unit tested” against as an up-front goal — “test: does this data structure use the least memory possible yet?” — so we’re left in a mix of madness: create the best software possible under the burden of creative isolation, then emerge to find… nobody cares because you’ve built a different paracode architecture no organization has conceptual models to evaluate cost-benefit or ROI tradeoffs of. relative cost to build: $50 million; potential value if used at scale: $2.5 trillion; relative value in the market of uninformed tech bros: $3.50.

is it anywhere?

there’s an implementation of these wrapped in my modern optimized memcached clone at https://carriercache.com/ with more tech specs and documentation (but no downloads there).

there’s an implementation of these with demo downloads available (old but still works12) at https://carrierdb.cloud/ — also includes new command guides and some changelogs and overall product details.

there’s also an architecture overview from two years ago at https://matt.sh/best-database-ever

releasing any of this as a public code dump seems untenable these days. if i post it all online for free, multiple trillion dollar companies will steal it, deploy it on 20 million servers, resell it for private profit, and use it to subsidize their own infrastructure margins then never pay anything for it. but if i try to sell it, nobody buys software anymore, so that’s just more years of wasted time. current idea: release it as a fully annotated 2,000 page book using a non-OCR-capable font by 2030.

what does all this add up to? i’m not sure. writing this entire 100k loc optimized system over 5 years leveled up my knowledge and personal capabilities, but “hyper-optimizing data infrastructure capable of saving companies billions of dollars in a couple years” isn’t a growth market according to my previous 3 interview cycles13. even “big tech hyper-scalers” basically refuse to buy software these days — everybody would rather waste $1 billion a year in inefficient software for the next 20 years instead of paying $5 million to capture a 98% cost reduction.

overall the best we can do is make the world better one vroom vroom at a time.

14.


  1. wherein, we mean actual dollars, not dollarydoos, or loonies, or virtual kiwis, or huge dongs.

  2. this is basically the old “East Coast” (MIT) vs “West Coast” (Berkeley) development split from the 70s to early 00s — these days nobody really believes “never optimize anything” because now software is business and if your business only does “CS 101 data structures and never optimize anything!” you’ll just outright fail unless you want to be wasteful slugs forever because you have more transient ego-derived money than brains.

  3. and remember: “memory” isn’t a single thing! it’s a memory hierarchy from: main memory -> [L3] -> L2 -> L1 -> CPU registers; so every unnecessary fetch from main memory you perform just to use your “conceptually simple, low effort data structure” causes cache pollution. cache pollution results in evicting actually useful hot data for temporary “hey it’s easy to use so who cares” pointer jumping across potentially terabytes of physical RAM chips these days.

  4. some developers have a false belief CPUs magically “pre-fetch” next pointer addresses when traversing linked lists, but those people are confusing memory pre-fetching for linear array-like access with pointer traversal. No CPU can “pre-fetch” the next linked list node because there’s no such thing as a “standard linked list format,” so there’s nothing in the system capable of detecting which values are memory addresses to speculate against (plus arbitrary memory speculation is a forbidden word these days anyway).

  5. yes, you can obviously inline the data if you know its size up front, but we’re using this example of data also allocated externally needing its own pointer traversals and you can’t stop us.

  6. there are obvious tricks here, where if you have single-purpose systems, your software can just allocate the entire system memory up front for itself, then do local allocation splits inside your own code (basically re-implementing a memory allocator on your own); there’s also the option of pre-allocating sufficient object pools up front so you know things already exist, can be reused, and are located in good places, but again, this is more actual implementation work needing more experience and effort than the imaginary “pure data structure implemented in 50 lines then never touched again” style of development.

  7. 32k+ elements in a single allocation sounds somewhat extreme, but we created custom 1-byte positional encodings for special cases of true/false/null/0 as well as other small integers only using small widths so we don’t waste space for unused integer bits.

  8. debuggers really hate this trick!

  9. in a good way though. a couple years ago a paper about “neural” hash table lookup got a lot of attention, but it’s basically a scam where the “neural network” learns the distribution of a static data set, so instead of doing blind binary searches, it can start a binary search “closer” to the underlying data position, and if it’s lucky it often lands O(1) instead of O(lg N) lookups (but as with all “neural” things currently, the “neural” mapping can’t be updated in real time and it requires an excessive number of low quality hacks to conform non-numeric data into a search structure using various external mappings/LUTs and false AOT fixed-vocabulary tokenization schemes).

  10. reported as “library performance” because attaching these libraries to a server results in network latency and TLS processing dominating performance (yes, in 2022+ server speed still matters for performance!), but in these scenarios we can still prove the data structures are maximally optimal.

  11. and we call you organizations because yeah, we know you’re not all companies, you sneaky TLAs you.

  12. the download page doesn’t have current arm hardware builds, but I have it all working under ARM locally (minus some of the more advanced features requiring avx instructions we haven’t re-mapped to arm equivalents yet).

  13. durrrrr u know cloud? u wr8 yoonut teztz? gothob merge PR CI badge? u node rust gofsckurslf? click aws create rds cluster & doploy too kubhelmelasticchartautospotscale? U HIRED!!!! /durrrrr

  14. to conclude: linked lists: bad at scale, but linked lists are fine for single-use and single-purpose instances and also for illustrating basic concepts of pointer mechanics when learning how actual programming works, but implementing “my first program” style linked lists in modern hundreds-to-thousands-of-gigabyte memory systems is outright computational malpractice.