Thinking in Redis: Part One

Redis is a data structure server. Redis doesn't serve raw data, but rather it serves data you've stored in a structured way. This means you must decide up front how you're going to store your data.

In a traditional RDBMS, you create schemas with tables and columns up front before storing any data. In Redis, you don't need schemas or column names, but you do need to figure out how your data is best represented in terms of simple key-value strings, lists, hashes, sets, or sorted sets.

People use Redis because of its fast retrieval speeds, but in order to take full advantage of Redis, you must use proper data structures. For example, storing a million jobs in a list, then needing to retrieve jobs in alphabetical order, will kill your Redis performance. You need to use appropriate data structures for appropriate use cases.

This post covers how you can think about storing your data in Redis.

Why is Redis different?

Why do you have to change your thinking? Because Redis isn't SQL. Redis isn't BerkeleyDB. Redis isn't Hadoop. Redis isn't Riak. Redis isn't your file system.

In traditional SQL databases, you have two types of general queries: queries scanning an entire table (or multiple tables) or queries using an index. Table scans are O(N) while fully indexed queries are O(lg N).
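To make the O(N) versus O(lg N) difference concrete, here is a minimal Python sketch that counts the steps each approach takes. The data (100,000 sorted integer IDs) and the function names are hypothetical, and the binary search only stands in for an index lookup; real databases use more elaborate index structures.

```python
# Hypothetical data: 100,000 sorted integer IDs standing in for an indexed column.
ids = list(range(0, 200_000, 2))

def table_scan(rows, target):
    """O(N): look at every row until the target turns up."""
    steps = 0
    for row in rows:
        steps += 1
        if row == target:
            return True, steps
    return False, steps

def index_lookup(sorted_rows, target):
    """O(lg N): binary search, halving the candidate range on every step."""
    steps = 0
    lo, hi = 0, len(sorted_rows)
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_rows) and sorted_rows[lo] == target
    return found, steps

# Worst case for the scan: the target is the last row.
scan_found, scan_steps = table_scan(ids, 199_998)
index_found, index_steps = index_lookup(ids, 199_998)
print(scan_steps, index_steps)
```

The scan touches every row to reach the last one, while the binary search needs fewer than twenty comparisons for the same data.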

In SQL systems, you have complex query planners to convert your SQL into internal database retrieval, table scanning, index lookup, and result returning commands.

In Redis, you run direct data retrieval commands with no queries or query planner abstractions in the middle.

Redis doesn't need an internal query engine because you run data retrieval commands directly. Complex server-side rewrites of what you meant to do aren't needed (and would get in the way of Redis speed optimizations anyway).

The lack of a query engine in Redis also means that if you use Redis with a SQL-based mindset, storing a lot of data and hoping to figure out your retrieval queries and indexing structure later, you'll run into data retrieval trouble.

Redis specializes in getting your data back to you as fast as possible. But, to do so, you must store your data in a way it can be returned to you as fast as possible. You get to decide how to store and how to retrieve your data.

Considerations

You should keep five things in mind when storing data in Redis: retrieval, storage, size, growth, and user access.

With some clever rearranging, those form an acronym: RUGSS. ("Check under the RUGSS before storing data in Redis?")

Retrieval

Retrieval is half of what Redis does best.

Before you put any data into Redis, you have to figure out how you want to get it back out. How you retrieve data is directly related to which of the five Redis data types you'll use to store it.

To review, the five Redis data types are: strings, lists, hashes, sets, and sorted sets.

Start by asking yourself what kind of results you want from your data.

What do you need to retrieve?

  • a value associated to a key?
  • the most recently added N elements of a list?
  • multiple fields associated with one overall key?
  • counts of elements you've stored under one collection?
  • membership testing to see if you've stored an element in a set already?
  • ranges of ordered elements in increasing or decreasing order?
  • highest N rated elements stored together?
  • lowest N rated elements stored together?

Once you've figured out how you want to read your data, you've pretty much isolated which Redis data type to use.
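One way to keep the questions above handy is a small lookup table. The sketch below is only an illustration: the dictionary, its wording, and the suggest helper are hypothetical, and each entry names just one representative read command for the matching data type.

```python
# Retrieval need -> (Redis data type, one representative read command).
RETRIEVAL_GUIDE = {
    "value for a key": ("string", "GET"),
    "most recent N elements": ("list", "LRANGE"),
    "multiple fields under one key": ("hash", "HGETALL"),
    # Counting depends on the type: LLEN, HLEN, SCARD, or ZCARD.
    "count of elements in a collection": ("set", "SCARD"),
    "membership test": ("set", "SISMEMBER"),
    # ZRANGE reads in increasing order, ZREVRANGE in decreasing order.
    "ordered ranges": ("sorted set", "ZRANGE"),
    "highest N rated": ("sorted set", "ZREVRANGE"),
    "lowest N rated": ("sorted set", "ZRANGE"),
}

def suggest(need):
    """Return a one-line suggestion for a retrieval need listed in the guide."""
    data_type, command = RETRIEVAL_GUIDE[need]
    return f"{need}: use a {data_type}, read with {command}"

print(suggest("membership test"))
```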

Structuring your data storage around how you want to retrieve your data solves the one problem you must never get yourself into with Redis: storing data and not knowing where it is.

Custom Indexing

Sometimes you need to read data by more than one retrieval key. In traditional RDBMS setups, you have a primary key on your row then you have the option of doing full table scans to find other data, or you can manually specify secondary keys to be kept up to date as INSERT/UPDATE/DELETE operations occur.

In Redis, you can still access your data by more than the main access key, but you will need to maintain the secondary index yourself.

Let's look at an example. Say you are storing employee records in hashes. Each employee hash is named employee:[EmployeeId] and each hash has fields of name, location, hireDate, hireYear.

If you want to find employees by name instead of ID, you can't do that easily. You would have to SCAN over every employee key, read the name field of each hash with HGET, then record when you've found your match.

A better approach is to maintain a secondary index yourself. If you also want to look up people by name, then when you create their employee hash, you also create a top level string named employeeName:[Name] whose value is the name of the employee hash: employee:[EmployeeId]. Then, to look up an employee by name, you GET employeeName:[Name], which returns the name of the hash you can read for more employee details.

The indexing approach assumes every name points to only one primary value, employee:[EmployeeId]. In this situation you can't have two employees with the same name or else you lose track of EmployeeId mappings.
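The name index described above can be sketched in a few lines of Python. Plain dictionaries stand in for Redis here so the example runs without a server, and the comments name the Redis command each step corresponds to. The function names and sample employees are hypothetical, and only some of the hash fields are included.

```python
# Plain dicts stand in for Redis; comments name the equivalent commands.
hashes = {}    # hash name -> dict of fields (Redis hashes)
strings = {}   # key -> value               (Redis top-level strings)

def add_employee(employee_id, name, location, hire_year):
    key = f"employee:{employee_id}"
    # HMSET employee:<id> name <name> location <location> hireYear <year>
    hashes[key] = {"name": name, "location": location, "hireYear": hire_year}
    # SET employeeName:<name> employee:<id>   (the secondary index entry)
    strings[f"employeeName:{name}"] = key
    return key

def find_by_name(name):
    # GET employeeName:<name>, then HGETALL on the hash name it returns
    key = strings.get(f"employeeName:{name}")
    return None if key is None else hashes[key]

add_employee(3321, "Ada", "Boston", 2013)
add_employee(3322, "Grace", "Austin", 2014)
print(find_by_name("Grace"))
```

Adding a second employee named "Ada" would overwrite the index entry for the first one, which is the limitation this section points out.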

A more common case is when you want to have one lookup key point to multiple records. Let's say we want to find all employees with the same hireYear. If we don't have secondary indexes, we would have to SCAN over all our employee hashes, read every hireYear, then record every matching result while ignoring the rest.

A better multi-lookup approach is to maintain a secondary index of hireYear to EmployeeId mappings. In this case, every time you add an employee, you can add their employee hash name to a set containing all employees with the same hire year. For example, if employee 3321 has a hire year of 2013, you can: SADD hireYear:2013 employee:3321. To read all employees who were hired in 2013, all you need is a quick SSCAN over the set hireYear:2013 (or, if you know the set is small, you can SMEMBERS the entire set.)
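Here is a minimal sketch of that one-to-many index, again using plain Python containers as a stand-in for Redis so it runs without a server. The function names and sample employees are hypothetical, and the comments name the matching Redis commands.

```python
from collections import defaultdict

hashes = {}                 # stand-in for Redis hashes
sets = defaultdict(set)     # stand-in for Redis sets

def add_employee(employee_id, name, hire_year):
    key = f"employee:{employee_id}"
    # HMSET employee:<id> name <name> hireYear <year>
    hashes[key] = {"name": name, "hireYear": hire_year}
    # SADD hireYear:<year> employee:<id>
    sets[f"hireYear:{hire_year}"].add(key)
    return key

def hired_in(year):
    # SMEMBERS hireYear:<year> (sorted here only to make the output stable)
    return sorted(sets.get(f"hireYear:{year}", set()))

add_employee(3321, "Ada", 2013)
add_employee(4410, "Grace", 2013)
add_employee(5002, "Linus", 2014)
print(hired_in(2013))
```

Because you maintain the index yourself, deleting an employee or changing their hire year also means removing the old set member (SREM in Redis).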

Storage

Storing data is the other half of what Redis does best.

When using Redis, you should only store what you know you're going to use.

You don't want to take the SQL approach of having a table with 100 columns, storing a lot of data, then figuring out how to read it all back later (and aggregate, deduplicate, count, etc).

Why can't we just store 100 datapoints for each website hit in Redis and just decide what we want to query later? First, Redis is optimized for explicit reads, not for random discovery with ad-hoc querying like SQL systems. Those explicit-read optimizations are what make Redis fast under normal conditions. Second, Redis stores all data live in memory. Storing data with ambiguous future needs is wasteful and will run you out of memory capacity sooner than necessary.

After you figure out what to store, you need to decide how to store it. If you followed along with the Retrieval section, you already have an idea of how you can write your data to Redis. If you haven't decided on your reading strategy yet, you can ask yourself a few questions about what you need to store.

Is what you need to store:

  • just a value to get out again?
  • ordered naturally based on when you add it?
  • a group of fields and values all under the same retrieval key?
  • strings you need to keep together for quick membership testing?
  • ordered based on per-item information?

Size

Each Redis data type has its own in-memory storage overhead.

Because Redis is an in-memory data store, some data structures are constructed with pointers. On a 64-bit system, each pointer is 8 bytes wide.

What does an 8 byte pointer mean for your Redis storage requirements?

Let's say you're storing seven million key-value pairs. Each key is 12 bytes and each value is 32 bytes. You expect to end up with (12+32)*7000000 = 308 MB of memory usage. But, you didn't account for the pointers to maintain the seven million key-value relationships. Let's see how the math for storing seven million strings plays out below.
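The naive estimate is easy to reproduce. The sketch below also shows what a single 8 byte pointer per pair would add. That single-pointer figure is only an illustrative assumption: real Redis structures carry more per-entry overhead than one pointer.

```python
KEY_BYTES = 12
VALUE_BYTES = 32
PAIRS = 7_000_000
POINTER_BYTES = 8   # pointer width on a 64-bit system

# Raw user data only, in decimal megabytes.
naive_mb = (KEY_BYTES + VALUE_BYTES) * PAIRS / 1_000_000

# Illustrative assumption: exactly one pointer per key-value pair.
one_pointer_each_mb = POINTER_BYTES * PAIRS / 1_000_000

print(naive_mb, one_pointer_each_mb)
```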

Example with Seven Million Keys

This example uses real numbers from a Redis 2.8.x instance. The keys and values are seven million unique 12 byte keys pointing to seven million unique 32 byte values.

Attempt I

Empty redis: 1152 k (Redis is very tiny on startup.)

After running 7 million SETs: 983 MB

Whoops. The 983 MB shows Redis is using more than 3x the memory we expected because we didn't take into account data structure overhead.

We can do better though. Redis has a way of representing strings more compactly in hashes than in top-level strings. We can make a sequence of hashes where each hash contains a subset of our keys.

Trying again with the multiple-hashes with multiple-field-value-pairs approach, we end up with:

Attempt II

Empty redis: 1152 k (again, an empty Redis uses about 1 MB of RAM. You can run many Redis instances on one host easily.)

After running 7 million HSETs: 358 MB

That's over 2.7x smaller than storing top-level key value pairs.

This time we are creating a two-level lookup mapping. We create 70,000 hashes, each named with an integer, where each hash has 100 field-value pairs inside of it.

After running 7 million HSETs to fill those 70,000 hashes (each field name is 12 bytes and each value is 32 bytes), we're using 2.7x less space than storing our seven million key-value pairs in top level Redis strings.
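For the two-level mapping to work at read time, you need a repeatable way to decide which hash a given key lives in. The post doesn't say how keys were assigned to hashes, so the sketch below is an assumption: it picks the hash with a CRC32 checksum of the key modulo the number of hashes. The function names are hypothetical, and the functions only build the command arguments rather than talking to a server.

```python
import zlib

NUM_BUCKETS = 70_000   # number of hashes, matching the example

def bucket_for(key):
    """Name of the hash that holds this key (hypothetical checksum scheme)."""
    return str(zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS)

def hset_args(key, value):
    # Instead of: SET <key> <value>
    return ("HSET", bucket_for(key), key, value)

def hget_args(key):
    # Instead of: GET <key>
    return ("HGET", bucket_for(key), key)

print(hset_args("user00000042", "x" * 32))
print(hget_args("user00000042"))
```

A checksum spreads keys only approximately evenly, so some hashes will hold a few more than 100 fields. If your keys are sequential integers, integer division by 100 would give exact buckets instead.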

Here, you can see the math works out reasonably well. We have around 308 MB of user data. The remaining 50 MB or so is overhead: spread over seven million field-value pairs, it comes to roughly 7 bytes per pair, plus a little for the 70,000 hash names and their pointers (an 8 byte pointer for each of 70,000 hashes is only about 560 KB). The result is a very compact and very fast way of storing 7 million 12 byte keys pointing to 32 byte values in memory.
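The measured figures from both attempts can be compared with a few lines of arithmetic. This sketch assumes the reported sizes are decimal megabytes, and it ignores the roughly 1 MB an empty Redis uses.

```python
PAIRS = 7_000_000
user_data_mb = (12 + 32) * PAIRS / 1_000_000   # raw keys and values

strings_mb = 983   # measured: seven million top-level strings
hashes_mb = 358    # measured: 70,000 hashes of 100 fields each

# How far the top-level strings overshoot the raw data size.
strings_vs_expected = strings_mb / user_data_mb

# How much smaller the hash layout is than the top-level strings.
strings_vs_hashes = strings_mb / hashes_mb

# Overhead left in the hash layout, spread across every field-value pair.
overhead_bytes_per_pair = (hashes_mb - user_data_mb) * 1_000_000 / PAIRS

print(round(strings_vs_expected, 2),
      round(strings_vs_hashes, 2),
      round(overhead_bytes_per_pair, 1))
```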

Other Size Optimizations

Redis has a few other space saving optimizations built in too.

The most relevant optimizations are:

  • small lists get stored compactly without pointers (as a list grows, it automatically converts to a with-pointers data type)
  • sets of only integers get encoded in a more efficient way
  • small hashes (under ~100 fields) get encoded in a more compact way (as we saw previously)
  • small sorted sets get stored compactly without pointers (like small hashes and small lists)

More information about compact representations in Redis can be found in the OBJECT command documentation, and Redis even has a specific documentation page about memory optimization.

Growth

Redis is great for rapid prototyping, but don't let your prototypes get away from you and become unmanageable live services. You need to plan into the future for growth and infrastructure expansion.

You need to keep in mind how your data and memory usage will grow as you add more users, services, and interactions to your services.

If you store 512 KB of data in Redis for each user of your service and you add a million users over the course of a month, you'll need almost 500 GB of RAM by the end of the month.
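A quick projection makes the growth math easy to rerun with your own numbers. This sketch assumes the same per-user footprint and a steady one million new users every month. The steady rate is a hypothetical extension of the single-month example, and the result covers user data only, not Redis overhead.

```python
PER_USER_BYTES = 512 * 1024        # 512 KB stored per user
NEW_USERS_PER_MONTH = 1_000_000    # assumed steady growth rate

def ram_needed_gib(months):
    """User data accumulated after the given number of months, in GiB."""
    total_bytes = PER_USER_BYTES * NEW_USERS_PER_MONTH * months
    return total_bytes / 1024**3

for months in (1, 3, 6):
    print(months, round(ram_needed_gib(months), 2))
```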

You can manage your memory usage in three ways:

  • You can optimize data storage as we covered in the Size section.
  • You can add more Redis servers to increase your live memory capacity (Redis Cluster will make this easier).
  • You can isolate data that isn't used frequently and store it on disk using a traditional disk-based database.

If you use Redis as a materialized view server or a caching layer, then you can set your keys in Redis to expire to help limit your memory usage requirements.

In the case of forgettable transient data, you should PEXPIRE or PEXPIREAT those keys so Redis knows it can clean unneeded keys out automatically to maintain free space for new writes.
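To show what a millisecond timeout means for a key, here is a toy Python model. It is not how Redis implements expiry. The class name and the injected clock are hypothetical, and only the visible behavior is modeled: a key with a timeout stops being readable once its time is up.

```python
class ExpiringStore:
    """Toy stand-in for keys with a millisecond TTL (not Redis internals)."""

    def __init__(self, clock_ms):
        self._clock_ms = clock_ms   # callable returning the current time in ms
        self._data = {}             # key -> (value, expire_at_ms or None)

    def set(self, key, value):
        # Like SET: store the value with no timeout.
        self._data[key] = (value, None)

    def pexpire(self, key, ttl_ms):
        """Like PEXPIRE: 1 if the timeout was set, 0 if the key is missing."""
        if self.get(key) is None:
            return 0
        value, _ = self._data[key]
        self._data[key] = (value, self._clock_ms() + ttl_ms)
        return 1

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expire_at = entry
        if expire_at is not None and self._clock_ms() >= expire_at:
            del self._data[key]     # expired: drop the key on access
            return None
        return value

# A fake clock keeps the example deterministic.
now = [0]
store = ExpiringStore(lambda: now[0])
store.set("session:1", "cart-data")
store.pexpire("session:1", 1500)
now[0] = 1499
print(store.get("session:1"))
now[0] = 1500
print(store.get("session:1"))
```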

As you design your applications to use Redis, you must keep a growth strategy in mind: either expand your available live memory or move unused data out of memory to long term disk storage.

User Access

User access is more specific than generic retrieval. How do you plan your operations so users can get and set data? Do you namespace your keys with colons? However you do it, make sure you don't end up with keyspace injections.

Redis only has one type of access control: an optional global access password. If you enable the Redis password feature, all connections have to present the password before any reads or writes are allowed.

Redis has no concept of database users, access permissions, or access restrictions. If you are connected to the server, you can read anything, write anything, delete anything, or erase the entire dataset.

There is no way to limit access to users based on who they are, where they are connecting from, or which commands they are allowed to run. (You can fake some access restrictions by using firewall rules, but those are outside of Redis.)

Redis does allow you to disable (or rename) any commands, which can give you limited restricted functionality, but the disabled or renamed commands take effect for all clients connecting to the server.

So, how do we allow users to access Redis?

There are two types of end users for databases:

  • users who are customers with their own data stored in Redis
  • users who are employees who need to get data out of Redis for running reports

In both cases, the only safe way to give somebody access is the same: create an API between the user and Redis.

For your customers, your website or native application uses your API to authenticate the user and gate access to the database on their behalf. The API layer makes sure the user isn't reading outside of their allowed keys and isn't constructing malicious keys inadvertently.
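One piece of such an API layer is a key builder that refuses unsafe input. The sketch below is a minimal example under stated assumptions: the user:[UserId]:[Field] key layout, the allowed character set, and the function name are all hypothetical choices, not something Redis prescribes.

```python
import re

# Assumed policy: key components may only contain letters, digits, "_" and "-".
# That rules out ":" (the namespace separator) and glob characters like "*".
_SAFE_PART = re.compile(r"[A-Za-z0-9_-]+")

def user_key(user_id, field):
    """Build user:<id>:<field>, rejecting components that could escape the namespace."""
    for part in (str(user_id), field):
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe key component: {part!r}")
    return f"user:{user_id}:{field}"

print(user_key(42, "profile"))

try:
    user_key("42:admin", "profile")
except ValueError as err:
    print(err)
```

Rejecting the separator means a caller can't turn a request for their own key into a request for a key in someone else's namespace.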

For your employees or analysts or anybody who needs to run reports and analytics, you can give them an API to use with their reporting scripts/websites. Your custom reporting API will be a less restrictive wrapper than your end user API, but it can implement any access model you need. You could create an internal API service to restrict access to certain Redis keys or fields based on API credentials.

If the whole "create an internal API" solution is too much of a burden, you can also replicate your live Redis instance to another instance used just for internal analytics. Then, your employees can run all the bad queries they want, they can stall the server, and they can lock up the server without impacting production performance. You have the choice of making replicas read-only (so they don't drift from the master replication stream) or you can allow writes on the replicas so custom Lua scripts or set results or bitop results can be stored in the same instance.

The guiding rule is: don't let users talk to Redis directly without supervision.

Remember: if someone connects to Redis and does something bad (KEYS * or FLUSHALL), your entire system can suffer everything from mildly annoying higher latency to an outage to complete data loss.

Conclusion (Part One)

Using Redis efficiently requires you to adjust how you approach some problems. You need to consider how you'll use Redis to store your live data versus your ever-growing historical data, how you can store your data efficiently, and how you can retrieve your data efficiently to give your users access to their data in the quickest ways possible.

Next, we consider how we can keep a dozen Redis features in mind while building out new services and products to enable fast, low latency interaction models. Building services based on Redis changes the way you can approach problems at scale.