datakit: the most efficient data structures possible

datakit — the most efficient data structure library

A while ago I made a data structure library fixing memory efficiency problems across many data structure types, but I never ended up releasing it.

Now I’m releasing it.

TOC:

History / Why
databox
Efficient Maps
Efficient Lists
Efficient Strings
Efficient Arrays
Efficient JSON Writing
Other Weird Things
What
Conclusion

Here’s details all about the datakit library.

History / Why

I consider this a genre of IMPOSSIBLE SOFTWARE.

What is IMPOSSIBLE SOFTWARE?

My definition of IMPOSSIBLE SOFTWARE: when you can’t just hire anybody to create something.

It’s when you can’t assign tasks to people and end up with the best result. You can’t break apart the design from the implementation from the motivation. You can’t plan the design up front because the more you create the more you discover along the way, which all loops back to informing the design itself. You can’t create this work when operating under a control-hierarchy full of meetings and status reports and consequences. You can’t give a “status update” because the final result is still 18-96 months away. You can’t schedule your KPIs and OKRs and daily standups and weekly VP status meetings and monthly board roll call reports around building what you know can exist but not how (yet). It’s a process of cyclical iteration and refinement until satisfied. We build the future along the way.

The results here rely on my experience across programming language implementations, database implementations, operating system internals, compilers and interpreters, hardware sympathy, some coincidental accidental discoveries, and building systems for purpose cohorts to stand against other systems similar but lesser.

The software is IMPOSSIBLE to plan and schedule and organize, yet we persist.

also, my usual pity party disclaimer: i don’t think i’ve ever passed a coding technical interview (sorry, my brain doesn’t work that way?) and I haven’t had a resume reply in something like 10 months at this point? so just consider everything i say useless and outdated and hazardous to your professional success. for the project on this page, I burnt down my savings working on this and that for about 5 years, until I finally accepted the grim reality nobody actually buys software anymore and my dreams of selling the world’s most efficient in-memory database to 1,000 companies for $5 million ARR each would never happen. whoops. opportunity cost of missing out on a decade of income and savings and exponentially growing passive investment hoarding? never heard of it.

Usage

How to Usage It?

use me baby one more time

It’s here at mattsta/datakit on le githubs.

The code is readable if you focus on it, but there’s no overall documentation except for detailed code comments. The best case for examples of how to use datakit are the extensive test cases located at the end of all important source files.

For more active documentation or advice or bug fixes or security reviews, my current creative output rate floats with my AI Indulgence rate. Pricing is always difficult because we live in a bimodal world where 10 companies have $100 billion in free cash sitting around and everybody else can’t make payroll. Consider 1 indulgence for a yearly license and 5 indulgences for a perpetual license (which, as much as people argue, is not unreasonable). Sorry, but I can’t pay my continually rising rent and food costs and buy supplies for building AI robots and renting GPUs with github stars alone, but also feel free to github sponsor me for a recurring $10k USD per month too.

why it matters

A quick purpose rundown: RAM is power hungry and expensive. It’s best if we can reduce any and all unnecessary storage overhead when possible. The purpose of datakit is to provide high performance data structures with minimal overhead so we don’t waste your expensive memory. datakit is founded on using only pointer-free maps, pointer-free lists, low metadata strings, and having metadata grow as your data grows instead of large fixed metadata upfront for small datasets.

Use cases for low-overhead, pointer-free, highly-optimized data structures include: in-memory databases serving unstructured high cardinality data, efficient garbage collector state tracking, limited memory systems to avoid unnecessary pointer overhead bloat, and much much more! Many hours were spent comparing alternative implementations of various logic flows in godbolt using various compile flags to achieve the most optimally generated code possible.

About 10 years ago the first 1 TB RAM servers started showing up. Just recently, AWS announced they now have servers with up to 32 TB RAM ready to go. If you are using expensive high-RAM servers you want your budget to go towards storing your actual data and not just storing redundant unoptimized unnecessary structural overhead everywhere.

Imagine you had a dataset with 100 billion entries. Each extra byte of unnecessary overhead is 100 GB of RAM used for no reason. datakit goes to great effort to remove every unnecessary byte of overhead everywhere possible. Other traditional simple data structures can waste 30% to 60% of your RAM just with idle metadata overhead. datakit models can be higher performance with as little as 2% to 7% metadata overhead depending on your data layout.

databox

A major goal of datakit is providing clean C interfaces everywhere. Traditional footguns are either minimized or outright removed.

Due to our centralized polymorphic data container, we can implement clean interfaces all over the place.

Traditionally, C interfaces look something like this:

bool appendToListBytes(list *input, uint8_t *string);
bool appendToListInt(list *input, int32_t i);
bool appendToListUInt(list *input, uint32_t i);
bool appendToListInt64(list *input, int64_t i);
bool appendToListUInt64(list *input, uint64_t i);
bool appendToListFloat(list *input, float f);
bool appendToListDouble(list *input, double d);

uint8_t *readFromListBytes(list *input, size_t position);
int32_t readFromListInt(list *input, size_t position);
uint32_t readFromListUInt(list *input, size_t position);
int64_t readFromListInt64(list *input, size_t position);
uint64_t readFromListUInt64(list *input, size_t position);
float readFromListFloat(list *input, size_t position);
double readFromListDouble(list *input, size_t position);

and then consider if you want to write functions which replace one type with another type: you would end up with 7^2 == 49 functions instead.

Our interfaces are much cleaner here where just two functions replace the 14 functions above:

bool appendToList(list *input, databox *box);
bool readFromList(list *input, size_t position, databox *result);

As much as possible, you want to remove complexity from developer thought so we can centralize decision complexity in a better tested and verified implementation abstraction itself.

the databox is a traditional sum type tagged union with 8 bytes of data and 8 bytes of metadata:

typedef union databoxUnion {
    int8_t i8;
    uint8_t u8;
    int16_t i16;
    uint16_t u16;
    int32_t i32;
    uint32_t u32;
    int64_t i64;
    uint64_t u64;
    intmax_t i;
    uintmax_t u;
    float f32;
    double d64;

    /* Special access cases: retrieving external 16 byte integers */
    __int128_t *i128;
    __uint128_t *u128;

    /* Various ways to store and retrieve bytes */
    union {
        const char *ccstart;
        char *cstart;
        char cembed[8];
        const uint8_t *custart;
        uint8_t *start;
        uint8_t embed[8];
        size_t offset;
    } bytes;

    /* Pointer shorthands */
    void *ptr;
    uintptr_t uptr;
} databoxUnion;

_Static_assert(sizeof(databoxUnion) == 8, "databox value too big!");

typedef struct databox {
    databoxUnion data;      /* 8 bytes */
    uint64_t type : 8;      /* databoxtype; valid types start at offset 1 */
    uint64_t allocated : 1; /* bool; true if need to free data.bytes.start */
    uint64_t created : 1;   /* bool; if retrieved+created, true. */
    uint64_t big : 1;       /* bool; true if actually a databoxBig */
    uint64_t unused : 5;
    uint64_t len : 48       /* length of 'data.bytes'. 281 TB should be enough. */
} databox;

_Static_assert(sizeof(databox) == 16, "databox struct too big!");

Since the databox is guaranteed to only be 16 bytes, you can also cleanly stack allocate and return it from functions when necessary too:

databox databoxNewUnsigned(uint64_t val) {
    databox box = {.type = DATABOX_UNSIGNED_64, .data.u = val};
    return box;
}

databox databoxNewSigned(int64_t val) {
    databox box = {.type = DATABOX_SIGNED_64, .data.i = val};
    return box;
}

Or you can just declare them directly as well:

const databox DATABOX_BOX_TRUE = {.type = DATABOX_TRUE};
const databox DATABOX_BOX_FALSE = {.type = DATABOX_FALSE};
const databox DATABOX_BOX_NULL = {.type = DATABOX_NULL};
const databox DATABOX_BOX_VOID = {.type = DATABOX_VOID};

It’s also important to note databox is our intermediate data format, but databox is unrelated to the in-memory or on-disk storage formats. Actual storage formats are converted to and from databoxes. A 16 byte databox to represent TRUE as an argument input doesn’t mean we waste 16 bytes storing the value in our data structures. Each data structure API accepts databoxes as inputs, converts the databox value to proper storage types for the data structure, then can also retrieve the efficient serialized values back out as a databox containers again when being read too.

Efficient Flat List / Flat Map

okay, fine, you made a sum type, big deal, whatever.

A problem with data structures is you have to balance space vs. speed efficiency and space vs. speed varies depending on if your use cases are read heavy, write heavy, or balance read/write usage.

How can we combine the best of space and speed efficiencies?

The most efficient data structure is a regular array with sequential data access and full math-oriented access since every element can be (roughly) instantly jumped to just using direct offset math with no other pointers or logic in the middle. Directly allocated arrays are also memory and performance optimal for scanning because your CPU will see how to pre-fetch future memory addresses as you read along. But, arrays fall apart under two scenarios: inserting and deleting elements if your array requires adjusting underlying allocation extents.

For space efficiency, we want to make data as linear as possible so we don’t waste overhead on pointer storage and pointer chasing. Maximizing use of linear data structures also means we get incredibly fast sequential scan performance (versus slow pointer chasing individually allocated nodes across pages with no locality) and also unlimited performance in deletion of data structures (which, if you’ve never noticed, fully deleting a large data structure using thousands of allocated pointers across a large memory space can block your program substantially). Linear data structures also means we minimize “allocator overhead” because your memory allocator never allocates exactly the bytes you ask for, but it allocates bytes in its own blocks larger than what you actually asked for, so the fewer allocations you need, the more memory you save from an additional source of unnecessary overhead waste.

For speed efficiency, we want direct access to each data point without further unwrapping or parsing.

We also want to avoid operational overhead as much as possible by reducing allocations and rebalancing operations.

Unfortunately, not all data formats can just be shoved into an array and left untouched for maximum performance. Data has problems: storing strings of different lengths, storing different data types (strings, multiple integer types, multiple float types, native types of pointers or offsets, etc), inserting and removing data from various positions…

A while ago Pieter Noordhuis made a nice flat list/map library called ziplist which had some unique features:

reasonably efficient element metadata (variable length integers for sizes in multiple places)
allows traversal of elements forwards or backwards in the flat byte-array list
element traversal with data “count” so you can treat a list as key-value map by jumping elements

Unfortunately it was abandoned prematurely and ended up with some long term scars:

the interface is full of footguns where each write function accepts input pointer and sometimes returns a different result pointer leading things to often work in testing but die in production due to developers forgetting to re-assign every result back to the input again.
maintaining the reverse traversal mechanism is too complex and fragile because of size updates
it maintains a cache of total element count, unless there’s too many elements, then it just counts by doing a linear scan.
each list has an “END” byte marker to terminate the structure, but it’s just a wasted byte
the interface didn’t have a clean “replace” feature, so updates were always a full delete + full add
the interface supported writing multiple data types, but then the API exploded into too many special-type-purpose functions since there is no unified data container interface

Using the good ideas as a starting point, I rewrote it over a year using the original ideas but refactored out all the broken components.

I named my replacement flex for “flexible buffer” and it has:

efficient metadata for total structure length (varints, can grow to unlimited length)
efficient metadata for total element count (varints, can grow to unlimited length)
no “END” byte marker because we can just use math of “total size + current position” to determine when we reach the end
clean reverse element traversal metadata via backwards variable length integers
automatic size reductions for integers and floats
- all integers are reduced to minimum byte widths for their storage size in byte-level access sizes of: 8, 16, 24, 32, 40, 48, 56, 64, 96, 128 byte integers both positive and negative as well as floats/reals in 16, 32, and 64 byte widths.
- basically: if you store a double, but the double fits in a float with no loss, we store the float instead. for integers, each next byte quantity has a “floor” of the previous byte quantity, so we can store more details than the size would suggest in byte widths. also we store negative number metadata in the type byte for the entry and not the entry itself, so we don’t need to do anything awful like zig-zag integer encodings (and also, thusly, our negative integer storage is as efficient as the non-negative and positive integer storage)
fully typed operations using a simple databox interface so a single write function can write any data, and a single read function can return any data
update-in-place operations to avoid delete + write for updates
binary search capability — super important for using flat buffers as efficient maps!
- adding cached binary search capability to a multi-type byte-oriented flat map is a world first from what I can tell. Basically, there are additional write and read functions taking both a pointer to the flex itself, but also another pointer to the middle element of the flex — and on write, the in/out middle element pointer argument is updated in-place for the caller to retain middle-entry position updates for maximally efficient binary search performance.
global ordering logic — a feature inspired by erlang types: any databox types can be compared to any other type.
- if types match as comparable (e.g. any number vs any other number), they are compared by value; but if types don’t match (e.g. boolean vs string), then they are just ordered by their type index.

The overall performance measure for a flex: we can write about 1 million entries per second to a nicely sized flex buffer even with all the pointer-free byte-array-oriented data structure manipulation overhead involved. PDG (pretty darn good).

Another major benefit of flat buffers and pointer-free systems is deleting large blocks of data. We don’t need to break our memory bus by chasing down scattered pointers fallen out of cache. We can delete hundreds to millions of items with a couple single calls instead of having to call free a million times to delete a million items as in more traditional/legacy systems (ever notice things like when you shutdown your browser, it can take minutes just for the browser to “stop and restart?” — all of those are pointer scrambling operations nobody has cared to optimize for 40 years).

which brings us to how we use the efficient flex flat buffer to build even higher level optimized data structures.

Efficient Growable Maps

All the datakit data structures follow a 3-tier storage pattern:

for a limited number of elements, just use a single flex directly for the storage
- a flex flat buffer can operate directly as a list or as a map (by treating even elements as keys and odd elements as matching values)
for more elements, upgrade the single flex to 2 allocated flex instances (by splitting it in half) then use both to extend existing data
for even more elements, upgrade the dual flex to a “full” multi-flex interface allowing unlimited space-efficient fully-typed data structure growth

The clever part though: the growth pattern from small -> medium -> full is transparent to the user. A small multimap has only 16 bytes of metadata overhead, a medium multimap has 28 bytes of metadata overhead, and a full multimap has 56+ bytes of metadata overhead and grows as more flex containers are allocated into to the multimap lookup interface.

We track the small -> medium -> full status via pointer tagging the data structure struct pointer itself, then the APIs can rapidly switch between the underlying small -> medium -> full API implementations with some bit shifting and jumps.

The API automatically detects when a current level needs to be upgraded to the next size category and it happens transparently during an element insert operation.

The API is also extremely clean because every data structure input is a double pointer, so the caller never has to worry about forgetting to update their historical pointers with updated references when they get altered/reallocated/modified out from under the caller. The API is footgun-free.

Here’s the centralized map API, but each of these functions is implemented three times for each of the three underlying small -> medium -> full implementations, all handled transparently via this single meta-API wrapper using clean tagged pointer dispatch logic:

/* opaque multimap type; there's no user accessible data here. */
typedef struct multimap multimap;

multimap *multimapNew(multimapElements elementsPerEntry);
multimap *multimapNewLimit(multimapElements elementsPerEntry,
                           const uint32_t limit);
multimap *multimapNewCompress(multimapElements elementsPerEntry,
                              const uint32_t limit);
multimap *multimapNewConfigure(multimapElements elementsPerEntry, bool isSet,
                               bool compress, flexCapSizeLimit sizeLimit);
multimap *multimapSetNew(multimapElements elementsPerEntry);
multimap *multimapCopy(const multimap *m);
size_t multimapCount(const multimap *m);
size_t multimapBytes(const multimap *m);
flex *multimapDump(const multimap *m);

bool multimapInsert(multimap **m, const databox *elements[]);
void multimapInsertFullWidth(multimap **m, const databox *elements[]);
void multimapAppend(multimap **m, const databox *elements[]);

void multimapInsertWithSurrogateKey(
    multimap **m, const databox *elements[], const databox *insertKey,
    const struct multimapAtom *referenceContainer);

bool multimapGetUnderlyingEntry(multimap *m, const databox *key,
                                multimapEntry *me);
bool multimapGetUnderlyingEntryWithReference(
    multimap *m, const databox *key, multimapEntry *me,
    const struct multimapAtom *referenceContainer);
void multimapResizeEntry(multimap **m, multimapEntry *me, size_t newLen);
void multimapReplaceEntry(multimap **m, multimapEntry *me, const databox *box);

bool multimapExists(const multimap *m, const databox *key);
bool multimapExistsFullWidth(const multimap *m, const databox *elements[]);
bool multimapExistsWithReference(const multimap *m, const databox *key,
                                 databox *foundRef,
                                 const struct multimapAtom *referenceContainer);

bool multimapLookup(const multimap *m, const databox *key, databox *elements[]);
bool multimapRandomValue(multimap *m, bool fromTail, databox **found,
                         multimapEntry *me);

bool multimapDelete(multimap **m, const databox *key);
bool multimapDeleteFullWidth(multimap **m, const databox *elements[]);
bool multimapDeleteWithReference(multimap **m, const databox *key,
                                 const struct multimapAtom *referenceContainer,
                                 databox *foundReference);
bool multimapDeleteWithFound(multimap **m, const databox *key,
                             databox *foundReference);
bool multimapDeleteRandomValue(multimap **m, bool fromTail, databox **deleted);

int64_t multimapFieldIncr(multimap **m, const databox *key,
                          uint32_t fieldOffset, int64_t incrBy);

void multimapReset(multimap *m);
void multimapFree(multimap *m);

void multimapEntryResize(multimap **m, const databox *key, size_t newSize);
void multimapEntryReplace(multimap **m, const databox *key,
                          const databox *elements[]);

bool multimapFirst(multimap *m, databox *elements[]);
bool multimapLast(multimap *m, databox *elements[]);

bool multimapDeleteByPredicate(multimap **m, const multimapPredicate *p);
bool multimapProcessPredicate(const multimapPredicate *p, const databox *value);
typedef bool(multimapElementWalker)(void *userData, const databox *elements[]);
size_t multimapProcessUntil(multimap *m, const multimapPredicate *p,
                            bool forward, multimapElementWalker *walker,
                            void *userData);

bool multimapIteratorInitAt(const multimap *m, multimapIterator *iter,
                            bool forward, const databox *box);
void multimapIteratorInit(const multimap *m, multimapIterator *iter,
                          bool forward);
bool multimapIteratorNext(multimapIterator *iter, databox **elements);

void multimapIntersectKeys(multimap **restrict dst,
                           multimapIterator *restrict const a,
                           multimapIterator *restrict const b);

void multimapDifferenceKeys(multimap **restrict dst,
                            multimapIterator *restrict const a,
                            multimapIterator *restrict const b,
                            const bool symmetricDifference);

void multimapCopyKeys(multimap **restrict dst, const multimap *restrict src);

Efficient Growable Lists

The same pattern exists for lists too. multilist is my 3rd generation of space-efficient unrolled flat-buffer-backed linked lists which originally started as my quicklist redis refactor, but it was never complete and always had more overhead than necessary, which is now fixed here forever due to the 3-tier underlying efficient data structure system.

The multilist API looks like:

/* opaque multilist type; there's no user accessible data here. */
typedef struct multilist multilist;

multilist *multilistNew(flexCapSizeLimit limit, uint32_t depth);
multilist *multilistNewFromFlex(flexCapSizeLimit limit, uint32_t depth,
                                flex *fl);
multilist *multilistDuplicate(const multilist *ml);

size_t multilistCount(const multilist *m);
size_t multilistBytes(const multilist *m);

void multilistPushByTypeHead(multilist **m, mflexState *state,
                             const databox *box);
void multilistPushByTypeTail(multilist **m, mflexState *state,
                             const databox *box);


void multilistInsertByTypeAfter(multilist **m, mflexState *state[2],
                                multilistEntry *node, const databox *box);
void multilistInsertByTypeBefore(multilist **m, mflexState *state[2],
                                 multilistEntry *node, const databox *box);

void multilistDelEntry(multilistIterator *iter, multilistEntry *entry);
bool multilistDelRange(multilist **m, mflexState *state, mlOffsetId start,
                       int64_t values);
bool multilistReplaceByTypeAtIndex(multilist **m, mflexState *state,
                                   mlNodeId index, const databox *box);

void multilistIteratorInit(multilist *ml, mflexState *state[2],
                           multilistIterator *iter, bool forward,
                           bool readOnly);
#define multilistIteratorInitReadOnly(ml, s, iter, forward)                    \
    multilistIteratorInit(ml, s, iter, forward, true)
#define multilistIteratorInitForwardReadOnly(ml, s, iter)                      \
    multilistIteratorInit(ml, s, iter, true, true)
#define multilistIteratorInitForward(ml, s, iter)                              \
    multilistIteratorInit(ml, s, iter, true, false)
#define multilistIteratorInitReverse(ml, s, iter)                              \
    multilistIteratorInit(ml, s, iter, false, false)
#define multilistIteratorInitReverseReadOnly(ml, s, iter)                      \
    multilistIteratorInit(ml, s, iter, false, true)

bool multilistIteratorInitAtIdx(const multilist *ml, mflexState *state[2],
                                multilistIterator *iter, mlOffsetId idx,
                                bool forward, bool readOnly);
#define multilistIteratorInitAtIdxForward(ml, s, iter, idx)                    \
    multilistIteratorInitAtIdx(ml, s, iter, idx, true)
#define multilistIteratorInitAtIdxForwardReadOnly(ml, s, iter, idx)            \
    multilistIteratorInitAtIdx(ml, s, iter, idx, true, true)
#define multilistIteratorInitAtIdxReverse(ml, s, iter, idx)                    \
    multilistIteratorInitAtIdx(ml, s, iter, idx, false)
#define multilistIteratorInitAtIdxReverseReadOnly(ml, s, iter, idx)            \
    multilistIteratorInitAtIdx(ml, s, iter, idx, false, true)

bool multilistNext(multilistIterator *iter, multilistEntry *entry);

void multilistRewind(multilist *ml, multilistIterator *iter);
void multilistRewindTail(multilist *ml, multilistIterator *iter);

void multilistIteratorRelease(multilistIterator *iter);

bool multilistIndex(const multilist *ml, mflexState *state, mlOffsetId index,
                    multilistEntry *entry, bool openNode);
#define multilistIndexGet(ml, s, i, e) multilistIndex(ml, s, i, e, true)
#define multilistIndexCheck(ml, s, i, e) multilistIndex(ml, s, i, e, false)

void multilistRotate(multilist *m, mflexState *state[2]);
bool multilistPop(multilist **m, mflexState *state, databox *got,
                  bool fromTail);
#define multilistPopTail(ml, s, box) multilistPop(ml, s, box, true)
#define multilistPopHead(ml, s, box) multilistPop(ml, s, box, false)

void multilistFree(multilist *m);

multilist has the same 3-tier small -> medium -> full implementation logic where small has 8 bytes of metadata overhead, medium has 16 bytes of metadata overhead, and full has 24+ bytes of metadata overhead.

The transitions between small -> medium -> and full are handled automatically due to the double pointer in/out parameters for the multilist on mutating functions and there is also optional transparent compression of the datasets built-in too (all those mflexState parameters allow you to pass in optional stack-allocated space to use as scratch space for the compression to work across calls).

Efficient Strings

strings in C are always an excellent bikeshedding opportunity.

There’s modern libraries like StringZilla which is a maximum-complexity, maximum-performance library, but there is a middle ground where we need simple access with simple controllable interfaces.

The original sds library attempted some clever tricks like storing string metadata inside the allocation of the string memory itself before the string access pointer like this where the library always returns a pointer to data but it knows it can look back 8 bytes from the start to find the actual length:

struct {
    uint32_t len;
    uint32_t free;
    char *data;
}

the primary drawback there was you always have 8 bytes of overhead for every string and also you are limited to a maximum of 4 GB string buffers. So, lazy users would say “just make them all 64 bits!” but now you are wasting 16 bytes of metadata for every string (if you have 100 billion strings, you’re wasting 160 GB of RAM just on over-allocated metadata!).

Of course, I fixed all this too.

I refactored the sds pattern to such an extent it’s a full re-implementation by rewriting every function to operate using metadata logic of:

the size quantities adjustable on 8 bit slices
- so you can have 8 bit, 16 bit, 24 bit, 32 bit, 40 bit, or 48 bit metadata for size lengths
- you also have those same bit widths for the free value
- but you also have the option of not having a free value so you only have length if you are just storing fixed data and don’t need to waste free length metadata storage.

I called my re-implementation a dks for “datakit string” — but due to the multiple ways metadata is handled (“with free” or “without free-length” metadata), the dks is an abstract implementation with two concrete implementations:

mds — mutable data string
mdsc - mutable data string, compact

just remember mds is powered by dks.

The metadata storage format isn’t a simple struct now, but instead just a byte array formatted as:

[CACHED LENGTH][CACHED FREE][DATA]

We are free to manipulate those bytes as any size and values we need. For 8 bit metadata, we allocate a 1 byte length field and 1 byte free field, but — the secret is here — we store the metadata size lookback type information in the last two bits of the free field. So technically our free value is always 2 to 3 bits less than the allocated size for it.

In the case of 8 bit metadata, we have all 8 bits available for length storage, but only (8 - 2) == 6 bits available for free size storage, which isn’t really a problem because small values are small values. On larger values like 32 bit storage, we have 4 bytes for the string length and (32 - 3) == 29 bits available for the free size storage (why the 2 bits vs 3 bits difference? 8 and 16 bit storage uses 2 type bits while 24, 32, 40, and 48 bit storage uses 3 type bits inside the free length).

Great, we have different size classes, but how are they managed? Obviously you don’t want an API with 7 implementations of every function for individual size classes.

Switching Strings

Each string manipulation function starts with a call to populate a dksInfo struct with details to detect which of the underlying 7 string metadata storage sizes is being used:

/* dksInfo = dks information for administrative tasks.
 *   (8 + 8 + 8 + 8 + 4) = 36 byte struct:
 *     - 64-bit start pointer
 *     - 64-bit buf pointer
 *     - 64-bit length
 *     - 64-bit free
 *     - 32-bit type enum */
typedef struct dksInfo {
    uint8_t *start;
    DKS_TYPE *buf;
    size_t len;
    size_t free;
    dksType type;
} dksInfo;

dksInfo is kinda like a single-purpose databox where we use dksInfo as the logical place for storing the parsed/extracted active string metadata we operate on, then before we return the string back to the user, we re-process the dksInfo values to determine whether the string types need to be updated.

The way we handle two string implementations of mdsc and mds sitting on top of dks:

dks is almost entirely parameterized via macro expansion for function names
mds and mdsc both have different concrete implementations of the dksInfo extraction function.
- mds knows how to populate both len and free
- mdsc knows free doesn’t exist, so it only populates len (at the expense of storing the size-expansion bits at the end of len instead of stealing them from free)

So, the dks implementation looks a bit “shouty” with C macro names in all caps everywhere, because mds and mdsc rename the underlying dks functions via isolated macro namespace takeovers, which turn into mds and mdsc prefix function names when linking against mds.c or mdsc.c (and dks.c can’t be linked against because it’s not a compile target, it’s only use is to be included in mds.c or mdsc.c with macro names being pre-defined for name takeovers).

Due to being a C string-like library, dks qua mds / mdsc retain the classic C footgun pattern of returning a new string pointer on write operations which may or may not differ from your original input. It’s just the way those things work, but it would be nice to move the implementation to a full double-pointer in/out parameter to avoid it entirely.

also, as a bonus, mds sitting on dks has always had proper size upgrade logic, which the legacy sds library has had a broken realloc size upgrade logic sitting around for a dozen years nobody has cared to fix. Also, the sds attempt at multi-size metadata has problems with struct padding and general implementation weakness we tolerate no longer.

My dks improvements are the most efficient string metadata system possible given these usage patterns.

Efficient Arrays

Now, even weirder, what if we want to improve C arrays themselves?

The only real problem with C arrays is once they get too big, you can’t easily insert into them anymore because each insert requires a new allocation and potentially a full O(N^2) memmove() if you want to do bad things like always prepending to an array (or just if realloc() can’t extend your current allocation any longer and must completely move your allocation to a different location).

We can follow our previous 3-tier self-tagging pattern to improve C arrays outright too with a very simple interface:

multiarray *multiarrayNew(uint16_t len, uint16_t rowMax);
size_t multiarrayCount(const multiarray *m);
size_t multiarrayBytes(const multiarray *m);

void *multiarrayGet(multiarray *m, multiarrayIdx idx);
void *multiarrayGetHead(multiarray *m);
void *multiarrayGetTail(multiarray *m);
void multiarrayInsertAfter(multiarray **m, multiarrayIdx idx, void *what);
void multiarrayInsertBefore(multiarray **m, multiarrayIdx idx, void *what);
void multiarrayDelete(multiarray **m, const uint32_t index);

There’s the entire multiarray interface implemented across 3 size classes:

native — 8 bytes overhead
medium — 16 + (16 * N) bytes pointer array
full — 24 + (16 * N) bytes pointer xor linked list

Basically, we can have a “regular” array acting normally, but this has the same performance problems of a single array.

Or we can have an “array of arrays” where a parent array attaches N child arrays. This delays the performance problems until each array becomes too big to manage with efficient memory allocations.

Finally we can have an “array of xor linked lists of arrays of arrays” allowing unlimited scalability because now you can allocated unlimited memory-size-efficient arrays as your data grows.

Why bother with multiarray though? One way we minimize pointer overhead is by using arrays to store flex addresses for multimap and multilist. Normally, one wouldn’t use arrays to store growing data due to the previously mentioned “when array get too big array performance breaks” problem, but now we have an unlimited array growth system maintaining the high performance and low overhead of arrays everywhere. We don’t end up with linked lists of flex pointers because we use multiarray to manage the metadata of flex lists inside of multilist and multimap at the full storage levels. Actually, there are no standard linked lists used in datakit. When we need a linked list, we use xor linked lists instead to minimize wasted memory overhead everywhere possible.

Efficient JSON Writing

Writing JSON is annoying. The JSON spec is annoying.

JSON demands:

no trailing commas (sin!)
all keys must be strings, so these can never be JSON keys unless wrapped in quotes: numbers, true, false, null
only double quotes are allowed
also, due to only quoted strings allowed, you have to scan all your inputs to escape quotes or other unacceptable values

I couldn’t find a library to do all of the above as efficiently as I wanted, so I adapted my own.

dj.c (for “datakit json”) provides:

adapted SIMD checks of input strings for escaping from rapidjson (both x64 and arm now)
efficient metadata for creating a new write request (only 64 bytes per JSON document being generated)
a caller-agnostic way of appending to collections without worrying about the “trailing comma” problem
- dj.c always writes a trailing comma to the intermediate generation buffer, but it also detects when you switch to a different type and then removes the previous (now abandoned) trailing comma.
remembering “which type you are in” for nested objects and nested lists is done via a bitmap for maximum memory efficiency
- why do we need to remember which type we were in? because we have to close the container eventually! if you write 100 nested items, as we un-nest on the way back up, we have to match the closing tag to the correct type based on each potentially different data type we were writing before we started the next nesting level.
also, because the dj library knows which type you are writing, you can use djCloseElement() and it knows whether to append ], or }, automatically
also, because we know types and positions, after you open a map, the first thing you write is the key, which gets an automatic : appended, then the next write is its value.
and as a bonus, since dj.c operates as a generic “open type, write data, close type” wrapper, we can have additional types like python set syntax in addition to JSON object syntax.

API looks like:

void djInit(djState *state);
void djInitWithBuffer(djState *state, mds *buf);
mds *djConsumeBuffer(djState *state, bool reset);
void djAppend(djState *dst, djState *src);
void djFree(djState *state);
mds *djFinalize(djState *state);
mds **djGetMulti(djState *state, size_t *ssCount);
mds **djFinalizeMulti(djState *state, size_t *ssCount);

/* Auto-detect whether closing list or map and do the right thing */
void djCloseElement(djState *state);

void djMapOpen(djState *state);
void djMapCloseElement(djState *state);
void djMapCloseFinal(djState *state);
void djArrayOpen(djState *state);
void djArrayCloseElement(djState *state);
void djArrayCloseFinal(djState *state);
void djTrue(djState *state);
void djFalse(djState *state);
void djNULL(djState *state);
void djString(djState *state, const void *data_, size_t len,
              const bool supportUTF8);
void djStringDirect(djState *state, const void *data, size_t len);
void djNumericDirect(djState *state, const void *data, size_t len);

/* Bonus: for generating python-style sets */
void djSetOpen(djState *state);
void djSetCloseElement(djState *state);
void djSetCloseFinal(djState *state);

/* consume any typed databox value and write it into the JSON */
void djBox(djState *state, const databox *const box);

Other Weird Things

datakit also includes custom implementations of:

dod.c - a library for delta-of-deltas (or diff of differences) compact sequential integer storage implemented as a bit stream
xof.c - a library for compact sequential float storage implemented as a bit stream using the xor float trick
strDoubleFormat.c - I ported a nice human-preferences-readable float formatting library to C from erlang which was ported to erlang from scheme. It did require the addition of a bignum library though
multilru.c - an attempt at memory efficient L4N LRU evictions
jebuf.c - align your malloc() size requests to the next jemalloc size bucket they will hit anyway using efficient binary search through size classes
fibbuf.c - like jebuf.c except use fibonacci size increments
hyperloglog.c - a performance and usability refactor of the redis hyperloglog implementation
multiroar.c - even more memory efficient roaring bitmaps
ptrlib.h - a helper for automatically handling pointer tagging logic with user-defined bit widths for storage
ctest.h - a simple test case and performance harness wrapper
intset.c and intsetU32.c and intsetBig.c - adapted from redis integer set logic, but extended for better memory efficiency, better performance, and better usability
utilities in util.c and str/* - many maximially machine-efficient operations for common data manipulation and parsing problems collected from other projects
membound.c - an adapted internal memory allocator from sqlite3’s mem5 implementation
multiTimer.c - an attempt at a memory-efficient timer implementation adapted from freebsd timer logic
linearBloom.c - a simple bit-oriented bloom filter in linear array
linearBloomCount.c - a bit-oriented bloom filter with counts for each entry instead of boolean on/off checks

What

The data structure library has components building on each other:

a varint library :: 6,000 lines
a byte-oriented fully-typed flat buffer implementation :: 8,000 lines
a common data interchange object so all our C functions can accept arbitrarily sized strings/integers/floats/bools all as one parameter
efficient array built in a 3-tier system for small/medium/full usage :: 2,000 lines
efficient lists (also 3-tier) built on optimally chained pointer-minimized flat buffers :: 6,000 lines
efficient maps (also 3-tier) also built on pointer-minimized flat buffers :: 7,000 lines
space-optimized string library (2-tier: one knows about free space, the other doesn’t) :: 3,000 lines
special types: optimized sequential integer storage, optimized sequential float storage, integer sets from 32 to 128 bits
a custom JSON writer :: 1,200 lines
bitmap bloom filters (both element checking and counting)

All data components of the library are implemented in a maximally user friendly way with fully isolated types between user calls and API interfaces. All metadata is fully encapsulated in opaque structs so users don’t try to read into values which can change across calls.

The line counts are illustrative but not prescriptive. A 6,000 line library isn’t just sitting down and typing 6,000 lines in two weeks obviously, but months to years of combination exploration and refining implementation details constrained by our goals of balancing maximum memory efficiency with optimized access speeds.

The datakit library also serves as a repository or collection of efficient reusable C functions around things like string and number manipulation, timers, formatting, etc borrowed from other permissive projects like freebsd, sqlite3/4, and luajit to name a few.

Overall, I’d say this project has over 5,000 hours of (uncompensated) work put into it combined across design, implementation, experimentation, debugging, refactoring, and refinement.

Status

overall status: datakit exists and you can look at it to praise me, but there’s no guarantees and you better not ask any questions without dropping five to nine to thirteen figures of good ole american cash monay on my head.

Conclusion

conclusion: pay people for writing software. because it’s “just typing” doesn’t mean it isn’t valuable. sure, it’s “just typing” but it’s also building entire conceptual cathedral universes in mental state space to advance the knowledge and capabilities of the world.

I’ve had multiple companies ask for direct consulting work, reject my requests for paid contractual arrangements, then say they won’t pay because “it just requires thinking and typing,” then they add me to their weekly “knowledge sharing progress update” calls because “we just want your ideas, it doesn’t require work.” — I just find it astounding so many people think software is free and requires no talent or experience and should never be professionally compensated (unless you have some sweetheart deal in forever-middle-management at a public company with exponentially growing stock for infinite passive accumulation). In my own day-to-day life economics I can’t argue I don’t have to pay rent because “the building is already built, so I don’t have to pay you” or argue against paying doctors because “you’re just talking and writing, so you’re not actually doing anything, so I don’t have to pay you,” but software is assumed to be low status people begging for scraps everywhere even though we build the modern world out of our own heads.

products can’t be made just from json yaml node docker terraform configuration files glued together no matter how many senior copy/paste developers with 500 hours of experience you put together in a corpochatroom.

anyway, i could use a few dozen more high value github sponsors or employment or sponsor an open source software maintainer funding agreement or else my next step is just living under a bridge and working at some rural walmart i guess: https://github.com/sponsors/mattsta