Database Overhead and You
Your battle cry is more data!
Your budget’s battle cry is more efficiency!
Do you know how bloated your data storage is right now?
Let’s talk about your data overhead.
A paradox of data storage is: the more data you have, the less space you have to waste.
If you want to store 1 TB of data consisting of 300 billion individual entires, but your inefficient database requires 64 bytes of overhead for each entry, you end up with 19 TB of overhead to store your 1 TB of data! Clearly not the most efficient storage scheme in the world.
The more data you have, the more you need efficient data storage.
Compare Overhead
Let’s compare data set size vs. per-element overhead numbers.
64-Byte Overhead
memcached requires about 64 bytes of data structure overhead per entry before even storing your data.
Entry Count (billion) | Entry Size (bytes) | Data Size (TB) | Overhead Size (TB) | Total Size (TB) | Overhead Percent |
---|---|---|---|---|---|
1 | 100 | 0.1 | 0.064 | 0.164 | 64 |
10 | 10 | 0.1 | 0.64 | 0.74 | 640 |
100 | 10 | 1 | 6.4 | 7.4 | 640 |
500 | 5 | 2.5 | 32 | 34.5 | 1280 |
100 | 5 | 0.5 | 6.4 | 6.9 | 1280 |
200 | 3 | 0.6 | 12.8 | 13.4 | 2133 |
Notice the last entry there. If you are storing small 3 byte data, your overhead is 21x larger than your stored data! This is not sustainable as your data grows. You have 600 GB of raw data, but need 12.8 TB to track your 600 GB individual entries? We can do better (much better).
Why is memcached so bad with data overhead? Well, it has a complicated history: memcached is now a legacy codebase with no meaningful code improvements. If you use memcached, you’re wasting a lot of money due to poor architecture decisions made 15 years ago you remain stuck with today.
48-Byte Overhead
redis has a minimum of 48 bytes of overhead for each new key-value pair, and it suffers from the same inefficient data storage overhead as memcached, which often gets worse as you use some of their more bloated and inefficient data structures. At this point in our timeline, redis is basically a liability—clearly a hazard for your live data.
Entry Count (billion) | Entry Size (bytes) | Data Size (TB) | Overhead Size (TB) | Total Size (TB) | Overhead Percent |
---|---|---|---|---|---|
1 | 100 | 0.1 | 0.048 | 0.148 | 48 |
10 | 10 | 0.1 | 0.48 | 0.58 | 480 |
100 | 10 | 1 | 4.8 | 5.8 | 480 |
500 | 5 | 2.5 | 24 | 26.5 | 960 |
100 | 5 | 0.5 | 4.8 | 5.3 | 960 |
200 | 3 | 0.6 | 9.6 | 10.2 | 1600 |
Wow, only 16x overhead when storing small data. Great system! Do your terabytes of wasted RAM grow on trees?
6-Byte Overhead
Let’s consider a seemingly impossible data storage possibility: only 6 bytes overhead per entry.
Why is 6 bytes of overhead seemingly impossible? Well, a 64-bit pointer is 8 bytes, so we are using sub-pointer and sub-reference sized accounting mechanisms!
Using such low overhead per entry means we actually solved a huge problem plaguing data structures since the advent of 64-bit computing: we don’t use pointers for data storage. We are not subject to the oppression of even a single 64-bit pointer in our data structure overhead.
I’ve implemented a theoretically optimal map to use as the core of a key-value platform using completely pointer-free data tracking.
My key-value map is completely pointer-free with overhead as low as 2 bytes per entry (growing, as necessary, up to 9 bytes per entry) and can store arbitrarily sized data for keys and values, all while maintaining perfect low overhead data efficiency.
Storing all your data in-memory with such low overhead means your data performs better than ever due to increased cache locality too. Keep your data close, but keep your cache closer.
We now have access to the most efficient in-memory generalized data storage system ever created. Not only is this system the most efficient ever created—it’s the most efficient possible. You cannot make anything more efficient given the constraints of modern hardware. Using the pointer-free data structures with variable, and minimal, overhead we created is the best anybody can ever create.
Oh, and it’s production ready for use right now.
Unlike other research ideas making uncharacteristically noisy Twitter rounds, revealing an odd amount of cheerleading optimism over unseen systems, you can use pointer-free efficient maps in production today. Reduce your server costs immensely by decommissioning half your in-memory cache infrastructure while serving the same data as before.
Jumping back to numbers, assuming an average of 6 bytes per entry overhead, we yield:
Entry Count (billion) | Entry Size (bytes) | Data Size (TB) | Overhead Size (TB) | Total Size (TB) | Overhead Percent |
---|---|---|---|---|---|
1 | 100 | 0.1 | 0.006 | 0.106 | 6 |
10 | 10 | 0.1 | 0.06 | 0.16 | 60 |
100 | 10 | 1 | 0.6 | 1.6 | 60 |
500 | 5 | 2.5 | 3 | 5.5 | 120 |
100 | 5 | 0.5 | 0.6 | 1.1 | 120 |
200 | 3 | 0.6 | 1.2 | 1.8 | 200 |
Notice the 3 byte element size again: we are down to 2x overhead using 6 bytes of overhead per entry instead of the memcached-sized 21x overhead with 64-byte overhead per entry.
We’ve taken your in-memory dataset from being impossible on a single modern system, which would require 13 TB RAM, down to achievable 2 TB storage.
These optimal space-efficient and cache-efficient data structures are available only in Carrier Cache. What’s Carrier Cache? Carrier Cache is a memcached replacement with full lock-free multi-core scalability, efficient pointer-free data storage, transport encryption, multiple network support, simple access controls, and modern JSON stats reporting too.
Join the Fun and Happy Data Storage Season!