Redis Benchmarks - Compilers and Options

Redis Benchmarks with Optimizations

(tldr; click some buttons)

Background

I took every Redis release starting with 2.6 and compiled them 12 different ways:

  • clang 3.5.0
    • -O2
    • -Os
    • -Ofast
    • -O2 -flto
    • -Os -flto
    • -Ofast -flto
  • gcc-4.9
    • -O2
    • -Os
    • -Ofast
    • -O2 -flto
    • -Os -flto
    • -Ofast -flto

Then, for each compiled version, I ran redis-benchmark in two modes, repeated each test three times, and then took the best performing run for each mode.

It took my computer (actual computer, real cores, no VM, no SMT/HT) about 5 days to run through the entire test. I repeated the entire test four times and assembled the best results from each run.

Why

The main job of Redis is storing and retrieving data. To store and retrieve data, Redis is constantly creating C strings and allocating/freeing memory. These benchmarks were a test to see if different optimization options can give us a higher performing Redis without any additional coding.

What is -flto?

LTO is short for Link Time Optimization. Both clang and gcc support LTO.

LTO is an alternative, and preferred, way to improve the optimize-at-compile-time capability of compilers. Compilers can only inline and optimize code when all relevant functions are evaluated at the same time and result in the same compilation unit (.o file). But, if your project is generating 50 independent .o files (or thousands of them), the compiler can’t reach into one compiled .o file, extract a function, and inline it in another .o file.

Historically, the way you get around optimization being limited to compilation units is to use an amalgamation where you basically concatenate all your source files together into one giant 110,000 line file. But, Redis conditionally includes some files based on your current system, so you can’t just cat *.c > redis-everything.c.

This is where LTO comes in. LTO retains modular function details in an intermediate format (for LLVM/clang the format is LLVM bitcode, for GCC the format is GIMPLE).

Now your compiler can still compile each source file to an independent object file, but it keeps the ability to cross-inline and optimize functions at Link Time, giving us full Link Time Optimization without needing to concatenate all our files together.

Relevance to Redis

Redis uses an internal string library for managing C strings along with an internal allocation wrapper for malloc/free calls. By using LTO, the compiler can cross-inline relevant calls to the abundantly used string creation/updating/destruction functions as well as relevant calls to zmalloc/zfree, resulting in reduced function call overhead for many operations.

You’ll notice compiles with LTO enabled dominate the benchmark results because they generate faster performing code.

Implications

GCC allows per-function optimization annotations. Based on empirical data, we can decorate our functions with GCC pragmas to optimize some functions for space so specific tight loop functions remain in hot cache easier, and for other functions we can optimize for lowering call overheads.

But, you’ll also notice clang outperforms GCC in most benchmarks.

Clang doesn’t allow per-function optimization levels, but we could also decide to use specific optimization levels for specific files instead of using one default -O2 across the entire project. Some data structures clearly benefit from using -Os or even -Ofast over -O2.

Note: none of my tests used -march, but it would be interesting to re-run them under that condition too.

Note: Redis currently has over 150 individual commands, but the redis-benchmark suite, by default, only tests about 10 common commands. There’s enough combinations of compiler options and commands to run tests filling a non-trivial cluster for a month of continuous performance testing across historical versions.

Methods

data

data benchmark is: redis-benchmark -n 120000 –csv -d 64 -r 9999999999

Run every default redis-benchmark test, but override the default setting of using only 2 bytes of data with using 64 bytes of data. Also populate the keyspace using random numbers up to 9999999999 (otherwise, redis-benchmark only overrides the same key for each test). Also run each test 120,000 times instead of the default 10,000 times.

pipe

pipe benchmark is: redis-benchmark -n 120000 –csv -d 64 -r 9999999999 -P 1000

Same as the previous test, but we add a pipeline of 1,000 requests.

This reduces network overhead and round-trip-time since instead of sending request->response, Redis receives request-request-…-request and replies with one response-response-…-response.

1,000 requests are bundled together, sent at once, then Redis sends 1,000 replies at once. Pipelining is how Redis scales up to a million operations per second.

Pipeline testing also reveals better Redis performance because your calls are not dominated by network overhead of sending/receiving 120,000 individual requests/responses. You’re connecting a total of 12,000 times with each connection sending 1,000 commands, then you receive a 1,000 element response.

Benchmark Computer

Redis is an in-memory database, which means the speed of your memory interface determines the speed of your Redis performance.

The tests for this benchmark were performed on a 2008 i7 920 @ 2.67GHz system running Ubuntu 14.04 with the following performance characteristics:

The key number to notice here is the memory write speed of 7.38 GB/s. If you are comparing systems for Redis performance, high performance throughput requires high performance memory write speed. (Note: this is not a high performance system. My laptop outperforms this server from 2008.)

For comparison, my MacBook Pro, a 2013 i7-4960HQ CPU @ 2.60GHz, has practical memory throughput almost twice as fast as the 2008 server:

Tests above were run using mem-speed.

You cannot estimate Redis speed based on CPU frequency. You must account for the memory speed of your system.

If you read the CPU spec sheets, the 2013 CPU has three times the memory bandwidth of the 2008 CPU at the same clock frequency (regardless of clock frequency, actually).

Graph Usage

Finally some results!

We have two methods of showing result graphs: Real View or Max Change View.

Real View zooms the graphs out about 10% so you can see larger trends.

Max Change View shows the most change. Max Change View will make normal benchmark variances look huge and volatile.

Note: the graphs don’t start at zero.

If you click a button and nothing happens, click again or reload the page. Some browsers have glitches with these manipulations.

Click on any bar for a per-compile-test breakdown for that version. Click on any bar again to return to the original graph. All in-depth compiler views render at max zoom view.

There are over 1600 individual graphs here for you to explore. Each benchmark graph shows the highest performing compiler option for that version out of about 20 tests. Each version has 12 individual sub-results showing the non-winning compiler performance too.

Graphs are powered by Flotr2 and my specialized 300 line CoffeeScript graph manager.

Click Buttons to Show Graphs and Do Things

Tap a button then scroll down to see all the graphs.

Graphs