Redis Quicklist - From a More Civilized Age

Adventures in Encodings

[Short attention span? Jump to graphs.]

Redis is a data structure server.

One of the data structures it servers is a doubly linked list.

The Redis lists have two representations: regular and and ziplist.

Regular linked lists are CS 101 lists with each node having next/prev pointers. ziplists are solid chunks of memory with internal size offsets.

Regular linked lists have overhead of 40+ bytes per entry. Ziplists have overhead ranging from 1 byte to 10 bytes per entry (depending on entry size and type).

If you store a million integers in a linked list, your data size is 4 MB but your overhead is 40+ MB. If you store the same in a ziplist, your data size is 4 MB and your overhead is around 1 MB.

But—ziplists of 1 million elements aren't a good idea since a ziplist is a solid chunk of memory. What we need is a way to store lists of unbounded size in ziplists while also keeping the ziplist size small for individual operations.

We can combine both approaches: make linked list (fat) of ziplists (compact).

Let's call the hybrid list a quicklist.

(NB: Twitter started doing this years ago to store everybody's timelines, but for various reasons they haven't contributed their implementation back to the world1. This is my re-imagining of the concept.)

[[Did you know Twitter runs over 10,000 Redis servers per data center and their Redis deployment uses over 100 TB of live memory per datacenter? If they have 3-5 datacenters, they probably have 300 TB to over half a petabyte of live Redis data. In a very real way, Twitter is just Redis-as-a-Service for certain operations.]]

Quicklist

Quicklists have one tunable (🐟) option: the maximum size of each internal ziplist.

Below are comparisons between original Redis linked list encoding (overhead of 40+ bytes per entry) and the new quicklist encoding (overhead of 1 byte to 10 bytes per entry).

The quicklist results are reported by graphs—quicklists vary their memory usage depending on how large ziplists are allowed to grow2.

Results

Benchmark #1: Store 200 lists each with 1 million integers

Linked List (original)

Linked lists only have one data representation, so we just report their results as a table. There's nothing interesting to graph for the old implementation.

Allocator Allocated Memory Allocator Fragmentation Total Process Size
jemalloc 11.86 GB 2% 12.1 GB
libc (ptmalloc) 13.34 GB 33% 17.7 GB

Quicklist (new; using jemalloc allocator)

The quicklist implementation improves memory usage by storing multiple elements per list node. Graphs show memory improvements (hopefully) by increasing the number of entries stored in each ziplist.

Every length holds the same data. The space savings are due to more efficient encodings, not because we are storing less data.

Highlights of int test (jemalloc)
ziplist length Allocated Memory
1 8.9 GB
2 6.0 GB
3 4.0 GB
4 4.0 GB
74 1.0 GB

Technical note: a quicklist with ziplist length 1 has the same properties as a regular linked list — every node holds one element. But, notice how much more memory efficient the quicklist is even with one entry per ziplist: regular linked list required almost 12 GB while this test of one entry per ziplist in the quicklist uses 9 GB and goes down rapidly the more elements you allow per ziplist.

For more details about why the ziplist with size 1 is more space efficient than the original linked list, see the Internals section below.

Quicklist (new; using default libc allocator)

The memory test with the default libc allocator requires more memory and introduces a higher internal fragmentation rate than the jemalloc allocator, but the immediate memory reduction is smoother because of the less advanced allocator.

Benchmark #2: Single list with 23,588,600 elements

This comparison is a single list with every word in OS X /usr/share/dict/web2 appended to a list 100 times (for i in {1..100}; do cat web2 |xargs redis-cli rpush abc; done), resulting in one list with 23.5 million small elements (small = the longest word is 24 characters).

The web2 file is 2.4 MB of text. Storing 2.4 MB of text 100 times is 240 MB of text. Anything more than 240 MB is overhead due to internal accounting by both the memory allocator and Redis itself.

Linked List (original)

Allocator Allocated Memory Allocator Fragmentation Total Process Size
jemalloc 1.67 GB 3% 1.7 GB
libc (ptmalloc) 1.72 GB 39% 2.4 GB

Quicklist (new; using jemalloc allocator)

Highlights of dict test (jemalloc)
ziplist length Allocated Memory
1 1.4 GB
2 0.8 GB
3 0.6 GB
4 0.5 GB
32 0.3 GB

Notice how making the ziplist length too big (starting around 290 entries per ziplist for this use case) starts causing more memory to be used.

Quicklist (new; using default libc allocator)

Benchmark #3: 3,000 lists of 800 elements, each element 2.5k

This test is opposite of the others. The previous tests benchmarked large lists of small elements. This test is many lists of medium sized elements.

Linked List (original)

Allocator Allocated Memory Allocator Fragmentation Total Process Size
jemalloc 8.12 GB 3% 8.4 GB
libc (ptmalloc) 7.92 GB 1% 8.0 GB

Quicklist (new; using jemalloc allocator)

Highlights of 2k quicklist test (jemalloc)
ziplist length Allocated Memory
1 8.1 GB
2-6 9.2 GB
7 7.9 GB
13 8.5 GB
20 7.8 GB

What's going on here? This test is storing 800 individual 2.5k blobs per list with a total of 3,000 lists.

Unlike the others, this test strongly shows the underlying over-allocation behavior of the malloc implementation. You can see here jemalloc prefers to over-allocate memory requests. The initial optimization of placing just a few elements together in a ziplist doesn't improve memory usage since jemalloc is over-allocating for us. (The jemalloc over-allocation allows subsequent re-allocation requests to be delivered in-place with no copying and no fragmentation).

But, once we reach double digit ziplist lengths, everything gets better. A safe number here is probably a ziplist length of 32.

Quicklist (new; using default libc allocator)

For these large sequential requests, the libc allocator actually does well, but these tests were conducted in the absence of any other concurrent access, allocations, or deletes on the server. Under normal conditions, the libc allocator can fragment 25% to 40% causing more severe memory usage problems.

If you had an entirely static dataset comprised of large elements that you only WORM against, you could potentially be better off using the libc allocator for this use case.

Internals

Those were the benchmarks. The rest of this page is a description of how the Redis linked list implementation works and how the ziplist implementation works so you can see how using pieces of each gives us more efficient memory usage.

Linked List

A regular Redis linked list is a simple doubly linked list. The list itself tracks: the head and tail of the list for O(1) unshift and push operations as well as the length of the list so we don't need to traverse the entire list to count it.

A linked list looks like: [0] <-> [1] <-> [2] <-> ... <-> [N]

Here, [0] is the head of the list and [N] is the tail of the list.

Each node is a struct listNode containing three pointers: previous node, next node, and value. (3 pointers = 8 bytes * 3 = 24 bytes total)

Each list value is a pointer to a struct robj (redis object) containing: 32 bits of metadata, an integer reference count, and a pointer to the contents. (2 integers + 1 pointer = 2 * 4 bytes + 1 * 8 bytes = 16 bytes total)

The value inside the struct robj is a Redis sds with two integer fields and the string contents. (2 integers + contents = 2 * 2 bytes + contents = 4 bytes + contents)

For every value inserted into this list encoding, Redis creates around 40 bytes of metadata plus additional overhead due to internal malloc accounting.

If you store a 10 character string in a list, you are storing 10 bytes of data and over 40 bytes of metadata generating 4x more overhead than contents.

Pros

  • O(1) head/tail access
  • O(1) append/prepend
  • O(1) delete from either end
  • Any internal deletions just join previous/next pointers together.
  • Any internal insertions just create new previous/next pointers in sibling nodes.

Cons

  • O(N) access to elements
  • requires oppressive memory overhead for storing small values
  • not cache efficient for traversing the list

Ziplist

It turns out linked lists are great in theory, but not in practice if you need to store a list of many not-huge elements. Chained pointers don't allow for any cache locality while traversing the list and all the pointers generate additional overhead we can avoid.

To help get around the problem of linked lists having more overhead size than contents, Redis has an alternate list encoding called a ziplist.

A ziplist is a solid block of memory where all elements are smooshed together. Such an arrangement makes for great data locality—your entire data structure can fit into CPU-appropriate caches.

But, a large solid block of memory can make certain operations tricky: namely, inserting into the middle of a list or deleting from the middle of the list.

So, Redis restricts ziplist usage to lists with a combination of small values and small element counts. With both linked lists and ziplists available, Redis automatically switches between ziplist encoding and linked list encoding as the length of your list grows or shrinks.

A ziplist looks like:

[total size][tail offset][cached element count][entry 0]...[entry N][END]

So, an empty ziplist is 11 bytes:

[size=4 bytes][tail offset=4 bytes][count=2 bytes][END=1 byte]

But, an empty ziplist isn't very useful. Lists need list entries.

A ziplist entry looks like:

[entry header][contents]

An entry header looks like:

[length of previous entry][length of this entry]

ziplists cleverly store the length of each previous element too so you can traverse the list back-to-front as well.

ziplist entries use variable encodings, so if you store small data, the offset sizes are small too. If you store numeric data, the ziplist entry will store your numbers as native integers instead of as strings.

ziplist entries can represent numbers in one of 5 compact forms: numbers 0 to 12 use exactly 4 bits of space, numbers representable in one byte use one byte of storage, and the same goes for two bytes, four bytes, and eight byte numbers.

But, we're not done yet!

ziplist entries can also be strings, and ziplists have an efficient encoding scheme there too. If the length of your string fits in 6 bits, the size is stored inside the metadata byte.

If the length of your string fits in 14 bits, the size is stored inside the metadata byte plus one additional byte, for a total of two bytes used. Otherwise, if your string is longer than 16,383 characters, a full 4 bytes are used to store the length of your string (plus the additional byte of metadata saying "this is a full 4 byte size").

Pros

  • sequential memory representation
  • no internal pointers required
  • efficient integer storage and efficient offset recording
  • cache efficient
  • O(1) head/tail access

Cons

  • Inserting to the list requires moving all subsequent elements down
    • inserting also requires reallocating memory (since your solid block of memory must now grow) which could cause your entire list to be copied to a new memory location in addition to your list now needing to move all subsequent list elements further down memory after the new reallocation copy as well.
  • Deleting from the list requires moving all subsequent elements up
    • deleting the tail of the list is the most efficient case
  • individual ziplists should be kept to a cache-friendly size
    • keeping ziplists small also helps with the speed of reallocation and the speed of move-all-the-elements operations

Quicklist

A quicklist looks like:

[ziplist 0] <-> [ziplist 1] <-> ... <-> [ziplist N]

The quicklist stores list-max-ziplist-entries entries per ziplist. When one ziplist grows beyond that number, a new linked list node is created with a new ziplist. Each linked list node holding a ziplist stores a cached count of the ziplist length so we can skip over entire ziplist ranges when doing long traversals.

We still have O(1) access to the head of the quicklist (element 0 in ziplist 0) and we still have O(1) access to the tail of the quicklist (tail element in ziplist N).

The tricky parts are inserting into the middle of already full ziplist nodes in the middle of the list. For that, we split existing ziplists so we can keep the element counts below the fill size.

Because we allow arbitrary ziplist splitting, we also attempt to merge internal ziplists that have been split too many times back into one ziplist that is still under the fill size.

Pros

  • Efficient memory usage for storing lists of any length
  • Efficient memory representation for cache-friendly traversals
  • O(1) access to head and tail of lists
  • Deleting large ranges are efficient since we can just drop entire internal ziplists nodes with no traversals
  • Keeps existing RDB and AOF formats so old DBs can be loaded into the new implementation
  • If you force the ziplist fill factor (list-max-ziplist-entries) to one entry per ziplist, you get the same performance as a traditional linked list but with even better memory utilization.

Cons

  • Inserting elements into the middle of a quicklist may require splitting existing ziplists and creating new internal quicklist nodes.
  • Deleting from the middle of a quicklist may require joining ziplists if too many splits leave, for example, 3 ziplist nodes each with 2 elements hanging around.
  • All inserts invoke a realloc on the ziplist itself, potentially triggering a memory copy of the entire ziplist to new memory.
  • Inserting into the head of the quicklist requires a moving the existing ziplist entries down one position to fit the new head entry.

Recap

Linked lists provide O(1) access to the head and tail of the list allowing for constant time insertion and deletion from either end. But if you are storing many small values (timestamps, references to other data, usernames, etc), your in-memory size will be dominated by data structure overhead instead of your data itself.

ziplists provide O(1) access to the head and tail of the list, but appending to the list requires reallocating memory (which potentially copies the entire list to new memory), and appending to the list requires a reallocation plus an explicit moving of current memory to the new offset, which means the entire list can potentially be copied twice on inserts (but in-memory copies are very fast if you run reasonable hardware).

Removing the tail element of a ziplist is easy: just move the [END] marker one entry up and update the cached data about the size and count of the ziplist.

Removing the head element of a ziplist is trickier: you have to delete the head element then move the remaining elements up one position since ziplists represent one contiguous memory space.

Because ziplists may have their entire contents copied and moved on insert/delete, it's best to keep their size limited for quicker operations. You don't want to have a 20 GB ziplist because every list pop from the head would require moving memory for the rest of the 20 GB ziplist up one entry to accommodate for the delete.

Ideally, we could have small ziplists linked together to allow for quick single-ziplist operations while still not limiting the number of elements we can store in an efficient, pointer-free data structure.

For more information see the first half of this page.

Implementation available at Matt's GitHub Redis Quicklist branch.

fin.

  • 1,000 lines of C for the quicklist
  • 1,000 lines of C for the tests
  • Two weeks of coding
  • 2,800+ words in this post

  1. Contributing to non-twitter-originated open source projects doesn't help them "be one of the top revenue generating Internet companies in the world." Sure, let's just exploit everybody else's hard work without contributing back. Bueno.

  2. A quicklist is a linked list of ziplists, but each ziplist has a maximum size before we create a new linked list node. The parameter is the number of elements in each ziplist before cutting a new linked list node with another ziplist.