Redis as JSON document store

Redis Experiment #3: JSON Storage

This is my last experiment for a while. Enjoy.

It started out simple enough.

hmgetalljson key

Have a hash…

Get some JSON…

How about wrapping arbitrary commands?

jsonwrap command args…

Have a multi-result…

Get some JSON…

Have multiple score-value pairs…

Get some JSON…

How about hash from JSON?

hmsetbyjson key json

Have some JSON…

Get a hash…

Nice, but not JSON-y enough.

How about storing nested JSON?

Donut document:

One line for easier inline command adding:

jsondocset key json

Set donuts document inside Redis…

jsondocget json-key

Get donuts document back…

Eva JSON:

Set eva document inside Redis…

Get eva document back…

jsondocmget json-key1 json-key2json-keyN

Get eva and donuts with one command…

Setting multiple JSON documents?

redis-cli supports option -x meaning “read the last argument from stdin.”

jsondocsetbyjson field-for-keys json-array-of-maps

Investigate a newly added document…

127.0.0.1:6379> jsondocget 31144714
{"milestone":null,"html_url":"https://github.com/antirez/redis/pull/1673","user":{"login":"dchest","id":52677,"avatar_url":"https://avatars.githubusercontent.com/u/52677?","gravatar_id":"641aceb7e3d2eebea49f397c38048d0b","url":"https://api.github.com/users/dchest","html_url":"https://github.com/dchest","followers_url":"https://api.github.com/users/dchest/followers","following_url":"https://api.github.com/users/dchest/following{/other_user}","gists_url":"https://api.github.com/users/dchest/gists{/gist_id}","starred_url":"https://api.github.com/users/dchest/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/dchest/subscriptions","organizations_url":"https://api.github.com/users/dchest/orgs","repos_url":"https://api.github.com/users/dchest/repos","events_url":"https://api.github.com/users/dchest/events{/privacy}","received_events_url":"https://api.github.com/users/dchest/received_events","type":"User","site_admin":false},"comments":0,"created_at":"2014-04-09T10:10:49Z","number":1673,"state":"open","labels_url":"https://api.github.com/repos/antirez/redis/issues/1673/labels{/name}","labels":[],"url":"https://api.github.com/repos/antirez/redis/issues/1673","id":31144714,"assignee":null,"events_url":"https://api.github.com/repos/antirez/redis/issues/1673/events","pull_request":{"url":"https://api.github.com/repos/antirez/redis/pulls/1673","html_url":"https://github.com/antirez/redis/pull/1673","diff_url":"https://github.com/antirez/redis/pull/1673.diff","patch_url":"https://github.com/antirez/redis/pull/1673.patch"},"body":"","title":"Fix typo in 00-RELEASENOTES","comments_url":"https://api.github.com/repos/antirez/redis/issues/1673/comments","updated_at":"2014-04-09T10:10:49Z","closed_at":null}

What about the inner life of JSON documents?

jsonfieldget json-key sub-key1 sub-key2sub-keyN

Extract batter list…

Extract 4th batter in list (0-based indexing)…

Extract the type of the 4th batter…

Extract eva year(s)…

jsondockeys json-key sub-key1 sub-key2sub-keyN

Investigate keys for eva…

Investigate keys for 1st donut topping…

Pop a topping

Existing noms…

jsonfieldrpop json-key sub-key1 sub-key2sub-keyN

Remove a bad nom…

Get updated noms…

Push better noms

jsonfieldrpushx json-key sub-key1sub-keyN new-json

Push much preferred topping (adding a string to array of maps)…

Observe all the noms…

Stuff a ballot box

Chicken Dead JSON

Set chicken…

Get chicken…

jsonfieldincrby json-key sub-key1sub-keyN integer

Vote for MOR CHICKENGEIST

Check out your affront to democracy…

jsonfieldincrbyfloat json-key sub-key1sub-keyN double

Boost some ratings (IEEE-754 style)…

Shut it all down

jsonfielddel json-key sub-key1sub-keyN

Demand plain donuts…

Verify the blasphemy has been removed…

jsondocdel json-key

Actually, you’re going keto…

Verify you can’t sneak one at 3am…

What’s all this now?

This is an experiment in using Redis to:

  • parse incoming JSON
  • store full JSON as Redis data types
    • with unlimited nesting support
  • allow access and updating of nested fields
  • generate result JSON

Data Types

As we saw in the examples above, this approach supports round-tripping JSON data types with no loss of type information.

The seven JSON data types are: map, list, number, string, true, false, null. The Redis JSON parser makes sure to store a bare true differently than the string "true". The same goes for numbers — the number 42 is stored differently than the string "42".

JSON types are in heavy contrast to the default Redis storage model where, by default, everything is stored as a string. If you tell Redis SET mykey 5 then GET mykey will return the string "5". It is fully up to the caller to interpret results appropriately.1

Implementation Background

We use boxing to store full JSON types in Redis. “Boxing” is when you store information about a type with the type, so you can figure out the true intent when you read it back.

A good example of boxing is any non-JIT interpreter. Let’s use Python as an example. If you type 1 + 1 in Python, the interpreter doesn’t run one plus one because the interpreter doesn’t store integers directly. The interpreter stores [TYPE][THING]. In this case: [INTEGER]1 [OPERATOR]+ [INTEGER]1. Performing your operation, Python has to “unbox” the values to generate a usable expression. So, Python checks:2

  • What is the type of first operand?
  • Oh, it’s an integer. Let’s read a number.
  • What’s the type of operator?
  • Oh, it’s a valid operator. Let’s read it.
  • What’s the type of final operand?
  • Oh, it’s an integer too. Let’s read a number.
  • Apply the operands to the operator.
  • What is the type of result? It’s an integer.
  • Let’s store [INTEGER]2.

Python eventually obtained enough information to run your original request of 1 + 1, but in a very slow and indirect way. You can see why math in Python is typically 20x to 50x slower than using a module calling out to native code skipping the entire boxing->unboxing->boxing process (e.g. numpy/blas/cython).

Implementation

For JSON storage in Redis we box JSON types because we must round-trip native JSON types.

The Redis JSON type box is very tiny. It’s just one byte. One byte is eight bits. Since JSON has seven types, we can use one bit to represent each type and still have one bit free.

The Box

We’re going to use brackets to represent “bits in a box.”

Here’s an entire Redis JSON type box with nothing defined: [00000000] (eight bits with no positions set).

Types are defined by position in the box, so each bit position represents a specific type. The (0-based) position meanings are:

  • Bit 1: type is number
  • Bit 2: type is string
  • Bit 3: type is map
  • Bit 4: type is list
  • Bit 5: true
  • Bit 6: false
  • Bit 7: null

So, to represent the number 42, we store [00000010]42. To represent the string "bits in a box" we store [00000100]bits in a box.

Optimization: One More Type

Because Redis stores everything internally as strings, we can make one big optimization: If your container type (a map or list) has homogeneous contents, we can skip boxing each individual value.

We only defined seven bits above. We have one remaining bit free:

  • Bit 0: type is homogeneous container

Now, to store {"name":"rocko","struggle":"modern life"} we add the strings to a Redis hash directly instead of boxing each string like [00000100]rocko.

With the addition of our homogeneous container type, we can define our map as:

  • is map
  • is string
  • is homogeneous

Our box for the entire map becomes [00001101]. This means: if you store homogeneous types of numbers or strings in your map (or list), your only storage overhead is one byte to define the type of the map (or list). If you have a map with 1,000 values, your storage overhead is only one byte instead of 1,000 bytes if we boxed each string individually.

Notice we can do the same with numbers. {"age": 20, "height": 4000} can store numbers directly without individual boxing because we define the map here as:

  • is map
  • is number
  • is homogeneous

Giving us one box for the entire map of: [00001011].

The same goes for lists too. Storing a list of only numbers or only strings has an overhead of exactly one byte to define the type of the homogeneous storage container regardless of the size of the list.

Note: The homogeneous type of the container is based on values in the map. It’s okay if your map keys are strings while all your values are numbers. That still counts as a homogeneous number map. JSON defines keys only as strings anyway, so they can’t ever be non-homogeneous themselves.

Those Other Types

What about our friends true, false, and null? Those types have zero variability and we can use their boxes as an immediate value.

For example, a JSON true value results in storage of only [00100000] with none of the actual letters of true being stored anywhere. true only exists when we reconstruct JSON for output.

Nesting Syndrome

How do we store nested maps and lists? After all, we can store {"a": {"b": {"c": [null, {"d":false}]}}} and it gets reconstructed just fine.

Rules for nested storage:

  • Every top-level JSON document is a JSON map
  • Every top-level JSON document is named by its storage key
  • Every JSON map is a Redis hash
  • Every JSON array is a Redis list
  • Keys of sub-types are stored as depth-appended keys
    • Sub-containers of arrays use their list position for sub-keys

A depth-appended key is just the name of the current field appended to its parent’s name with :.

Storing the document above with name zombo will create keys: zombo for the document itself only containing the a map, zombo:a to store the b map, zombo:a:b to store to store the c map, zombo:a:b:c to store the list, and zombo:a:b:c:1 to store the d map.

zobmo, zombo:a, zombo:a:b, and zombo:a:b:c:1 are Redis hashes. zombo:a:b:c is a Redis list.

Now we have these keys in redis:

Storing highly nested JSON with only one value per map is a pathological case just for testing. You wouldn’t want to store large amounts of zero-contents nested maps like that because it’s taking up overhead for creating new top-level keys, but each key doesn’t contain much useful information.

Reconstruction

How are pointers to sub-containers stored? We rely on a combination of boxes and depth keys.

In the zombo example, the top-level document stored in the zombo Redis hash is: a => [BOX] where [BOX] is the type of the sub-container.

For this specific case, a => [00001000] meaning “When you try to read a, look for a top-level hash key with name zombo:a.”

Then, when we fetch zombo:a, we get b => [00001000] meaning “Go read a top-level hash key with name zombo:a:b.”

And thus it goes all the way down the container tree. The only trick here is: sub-containers of lists are referenced by position in the list instead of by their name, since list elements have no names3.

Processing All The Things

Given all the box types and nested naming rules above, we have nine (9) total ways to retrieve nested data.

Four cases are speed optimizations because we detect homogeneous container types, so we can basically do a big LRANGE [key] 0 -1 or a big HGETALL [key] instead of inspecting each individual value for unboxing.

The remaining five cases are for reconstructing JSON values based on individual box types.

Le Traversal Cases

This section is a more wordy version of the jsonObjFromBox() function.

1 – BOX_FETCH_MAP_NUMBER – [00001011]

Field points to homogeneous map of only numbers, so we can do HGETALL and return only JSON numbers.

2 – BOX_FETCH_MAP_STRING – [00001101]

Field points to homogeneous map of only strings, so we can do HGETALL and return only JSON strings.

3 – BOX_FETCH_LIST_NUMBER – [00010011]

Field points to homogeneous list of only numbers, so we can do LRANGE 0 -1 and return only JSON numbers.

4 – BOX_FETCH_LIST_STRING – [00010101]

Field points to homogeneous list of only strings, so we can do LRANGE 0 -1 and return only JSON strings.

5 – BOX_FETCH_MAP_DECODE – [00001000]

Field points to map with mixed types, so we must unbox each individual value (potentially pointing to more nested maps or lists) and reconstruct the appropriate JSON type for each individual value.

6 – BOX_FETCH_LIST_DECODE – [00010000]

Field points to list with mixed types, so we must unbox each individual value (potentially pointing to more nested maps or lists) and reconstruct the appropriate JSON type for each individual value.

7 – BOX_PARSE_NUMBER_AFTER_BOX – [00000010]

Field is a number after the box. Chop the box off and return the remaining bytes as a JSON number. Only used when storing numbers in mixed-type containers.

8 – BOX_PARSE_STRING_AFTER_BOX – [00000100]

Field is a string after the box. Chop the box off and return the remaining bytes as a JSON string. Only used when storing strings in mixed-type containers.

9 – BOX_DECODE_IMMEDIATE – [00100000] or [01000000] or [10000000]

Field is represented by the box itself. No more data needed. Read the box and generate an appropriate JSON true, false, or null based on which bit is set in the box.

Codex

This experiment started, grew, and quickly became un-brain-keepable. After the first file reached 1,200 lines, I split it into seven total files for easier management and more clearly defined interfaces.

After the entire codebase reached 3,000 lines, I stopped adding features and wrote this. There are many unimplemented features and abilities, but no need to write production capabilities into a one-off experiment.

The code lives in the Kit of Redis Module Tools (krmt).

If you want to play around with JSON commands, you can build a json.so module and load it into a Dynamic Redis. Dynamic Redis is my set of patches on top of Redis allowing runtime loading of modular commands.

If you run JSON commands, run them with the expectation of having your keyspace cluttered with one top-level key for every sub-container.

If you run JSON commands, only use them for keeping transient data or for playing around with JSON API dumps like in the issues example. Further work on the JSON commands could introduce different data formats making any long-term storage unreadable.

It’s a toy, not a product.

If you want to add missing functionality, optimize exiting functionality for speed and/or space, fix bugs, add tests, or extend features, feel free to open issues or pull requests on krmt.

Benchmarks

I’m not including actual benchmarks here since perfomance varies wildly based on the shape and size of the JSON you are importing.

But, as a baseline, I can JSONDOCSET the eva JSON 16,000 to 17,000 times per second depending on the compiler and optimization settings used.

That’s obviously not amazing Redis 100,000+ operations-per-second speed, but it’s still usable (and you may not be write-heavy with importing JSON anyway).

Scrappy Little Tests

Estimated test coverage: 30% to 65%.

% ./runtest --single unit/type/jsonobj
Cleanup: may take some time... OK
Starting test server at port 11111
[ready]: 73676
Testing unit/type/jsonobj
[ready]: 73675
[ready]: 73677
[ready]: 73678
[ready]: 73680
[ready]: 73679
[ready]: 73681
[ready]: 73682
[ready]: 73683
[ready]: 73684
[ready]: 73685
[ready]: 73686
[ready]: 73687
[ready]: 73688
[ready]: 73689
[ready]: 73690
[ok]: JSONDOC - set simple
[ok]: JSONDOC - remove simple
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONDOC - set complex
[ok]: JSONDOC - remove complex
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONDOC - add sample types
[ok]: JSONDOC - remove all basic types
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONFIELD - basic get string
[ok]: JSONFIELD - basic get number
[ok]: JSONFIELD - basic get list (numbers)
[ok]: JSONFIELD - basic get list (strings)
[ok]: JSONFIELD - basic get list (with map)
[ok]: JSONFIELD - basic get map
[ok]: JSONFIELD - basic get true
[ok]: JSONFIELD - basic get false
[ok]: JSONFIELD - basic get null
[ok]: JSONDOC - remove all test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONFIELD - multi-layered get list
[ok]: JSONFIELD - multi-layered get first
[ok]: JSONFIELD - multi-layered get second
[ok]: JSONFIELD - multi-layered get second key
[ok]: JSONFIELD - multi-layered get third
[ok]: JSONFIELD - multi-layered get fourth
[ok]: JSONDOC - remove all test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONFIELD - deep nested get - level 1
[ok]: JSONFIELD - deep nested get - level 2
[ok]: JSONFIELD - deep nested get - level 3
[ok]: JSONFIELD - deep nested get - level 4
[ok]: JSONFIELD - deep nested get - level 5
[ok]: JSONFIELD - deep nested get - level 6
[ok]: JSONFIELD - deep nested get - level [too low]
[ok]: JSONFIELD - deep nested get - level [too high]
[ok]: JSONDOC - remove all test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONFIELDDEL - basic delete
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - basic map delete
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - basic list delete
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - basic list delete II
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - nested list delete
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - nested list delete direct positional
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONFIELDDEL - nested map to empty map
[ok]: JSONFIELDDEL - verify document after nested map delete
[ok]: JSONFIELDDEL - nested list to empty list
[ok]: JSONFIELDDEL - verify document after nested list delete
[ok]: JSONFIELDDEL - nested map later field delete
[ok]: JSONFIELDDEL - verify document after delete
[ok]: JSONDOC - remove remaining test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONDOCSETBYJSON - set map
[ok]: JSONDOCSETBYJSON - set map inside list
[ok]: JSONDOCSETBYJSON - set multi-map inside list
[ok]: JSONDOC - remove remaining test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: JSONDOCKEYS - get keys of a map
[ok]: JSONFIELDINCRBY - increment by integer
[ok]: JSONFIELDINCRBYFLOAT - increment by float
[ok]: JSONFIELDRPUSHX - rpush string
[ok]: JSONFIELDRPUSHX - rpush string verify
[ok]: JSONFIELDRPUSHX - rpush number
[ok]: JSONFIELDRPUSHX - rpush string verify
[ok]: JSONFIELDRPOP - rpop number
[ok]: JSONFIELDRPUSHX - rpop string verify
[ok]: JSONFIELDRPOP - rpop number
[ok]: JSONFIELDRPUSHX - rpop string verify
[ok]: JSONFIELDRPOP - rpop mixed contents
[ok]: JSONFIELDRPOP - rpop mixed contents verify
[ok]: JSONFIELDRPOP - rpop mixed contents
[ok]: JSONFIELDRPOP - rpop mixed contents verify
[ok]: JSONFIELDRPOP - rpop mixed contents
[ok]: JSONFIELDRPOP - rpop mixed contents verify
[ok]: JSONFIELDRPOP - rpop mixed contents last element
[ok]: JSONFIELDRPOP - rpop mixed contents verify empty
[ok]: JSONDOC - remove remaining test documents
[ok]: JSONDOC - verify remove cleaned all keys
[ok]: Check for memory leaks (pid 73692)
[1/1 done]: unit/type/jsonobj (0 seconds)

                   The End

Execution time of different units:
  0 seconds - unit/type/jsonobj

\o/ All tests passed without errors!

Cleanup: may take some time... OK

fin.

Summary:

  • 3,171+ lines of C implementing 15 JSON commands.
  • 4,500+ words in this post (including copy/pasted commands, JSON, and tests)
  • Two lost weekends

  1. One exception to the “everything is a string” rule is the result of INCRBY commands. Because Redis knows you are incrementing a number, it returns the result as an integer. Though, Redis doesn’t have a floating point type (which would complicate the protocol a lot), so when you run an INCRBYFLOAT, your result is returned as a string.

  2. This description is potentially factually wrong, but logically correct. I didn’t go back to look at any Python internals, but it’s just how ’terperters and language VMs without JIT or compile-time definitions have to do math.

  3. Yes, this means adding or removing elements inside of lists is tricky. In fact, right now we only support appending to lists and popping elements off the end of lists, so we don’t worry about needing to rename any sub-containers because their position in the list changed.