Redis Benchmarks with Optimizations
(tldr; click some buttons)
I took every Redis release starting with 2.6 and compiled them 12 different ways:
- clang 3.5.0
Then, for each compiled version, I ran
redis-benchmark in two modes, repeated each test three times, and then took the best performing run for each mode.
It took my computer (actual computer, real cores, no VM, no SMT/HT) about 5 days to run through the entire test. I repeated the entire test four times and assembled the best results from each run.
The main job of Redis is storing and retrieving data. To store and retrieve data, Redis is constantly creating C strings and allocating/freeing memory. These benchmarks were a test to see if different optimization options can give us a higher performing Redis without any additional coding.
What is -flto?
LTO is short for Link Time Optimization. Both clang and gcc support LTO.
What is Link Time Optimization?
LTO is an alternative, and preferred, way to improve the optimize-at-compile-time capability of compilers. Compilers can only inline and optimize code when all relevant functions are evaluated at the same time and result in the same compilation unit (
.o file). But, if your project is generating 50 independent
.o files (or thousands of them), the compiler can’t reach into one compiled
.o file, extract a function, and inline it in another
Historically, the way you get around optimization being limited to compilation units is to use an amalgamation where you basically concatenate all your source files together into one giant 110,000 line file. But, Redis conditionally includes some files based on your current system, so you can’t just
cat *.c > redis-everything.c.
This is where LTO comes in. LTO retains modular function details in an intermediate format (for LLVM/clang the format is LLVM bitcode, for GCC the format is GIMPLE).
Now your compiler can still compile each source file to an independent object file, but it keeps the ability to cross-inline and optimize functions at Link Time, giving us full Link Time Optimization without needing to concatenate all our files together.
Relevance to Redis
Redis uses an internal string library for managing C strings along with an internal allocation wrapper for malloc/free calls. By using LTO, the compiler can cross-inline relevant calls to the abundantly used string creation/updating/destruction functions as well as relevant calls to zmalloc/zfree, resulting in reduced function call overhead for many operations.
You’ll notice compiles with LTO enabled dominate the benchmark results because they generate faster performing code.
GCC allows per-function optimization annotations. Based on empirical data, we can decorate our functions with GCC pragmas to optimize some functions for space so specific tight loop functions remain in hot cache easier, and for other functions we can optimize for lowering call overheads.
But, you’ll also notice clang outperforms GCC in most benchmarks.
Clang doesn’t allow per-function optimization levels, but we could also decide to use specific optimization levels for specific files instead of using one default
-O2 across the entire project. Some data structures clearly benefit from using
-Os or even
Note: none of my tests used
-march, but it would be interesting to re-run them under that condition too.
Note: Redis currently has over 150 individual commands, but the
redis-benchmark suite, by default, only tests about 10 common commands. There’s enough combinations of compiler options and commands to run tests filling a non-trivial cluster for a month of continuous performance testing across historical versions.
data benchmark is: redis-benchmark -n 120000 –csv -d 64 -r 9999999999
Run every default
redis-benchmark test, but override the default setting of using only 2 bytes of data with using 64 bytes of data. Also populate the keyspace using random numbers up to 9999999999 (otherwise,
redis-benchmark only overrides the same key for each test). Also run each test 120,000 times instead of the default 10,000 times.
pipe benchmark is: redis-benchmark -n 120000 –csv -d 64 -r 9999999999 -P 1000
Same as the previous test, but we add a pipeline of 1,000 requests.
This reduces network overhead and round-trip-time since instead of sending request->response, Redis receives request-request-…-request and replies with one response-response-…-response.
1,000 requests are bundled together, sent at once, then Redis sends 1,000 replies at once. Pipelining is how Redis scales up to a million operations per second.
Pipeline testing also reveals better Redis performance because your calls are not dominated by network overhead of sending/receiving 120,000 individual requests/responses. You’re connecting a total of 12,000 times with each connection sending 1,000 commands, then you receive a 1,000 element response.
Redis is an in-memory database, which means the speed of your memory interface determines the speed of your Redis performance.
The tests for this benchmark were performed on a 2008 i7 920 @ 2.67GHz system running Ubuntu 14.04 with the following performance characteristics:
matt@neon:~/repos/crc64-compare% ./mem-speed 512 32 Computing average call overhead... Average call overhead: 0.0394 microseconds Running 32 iterations of writing 512 MB of memory Completed writing. Wrote 16384 MB in 2.1660 seconds for a memory write speed of 7564.1216 MB/s (7.3868 GB/s) Now running a quick CPU speed test for 1500000000 iterations... Iterations completed in 3.38 seconds for an execution speed of 443.5 million (subtraction) operations per second using one core. Loop latency: 2.25 nanoseconds per iteration.
The key number to notice here is the memory write speed of
7.38 GB/s. If you are comparing systems for Redis performance, high performance throughput requires high performance memory write speed. (Note: this is not a high performance system. My laptop outperforms this server from 2008.)
For comparison, my MacBook Pro, a 2013 i7-4960HQ CPU @ 2.60GHz, has practical memory throughput almost twice as fast as the 2008 server:
matt@ununoctium:~/repos/crc64-compare% ./mem-speed 512 32 Computing average call overhead... Average call overhead: 0.0529 microseconds Running 32 iterations of writing 512 MB of memory Completed writing. Wrote 16384 MB in 1.2490 seconds for a memory write speed of 13117.2842 MB/s (12.8098 GB/s) Now running a quick CPU speed test for 1500000000 iterations... Iterations completed in 3.08 seconds for an execution speed of 487.1 million (subtraction) operations per second using one core. Loop latency: 2.05 nanoseconds per iteration.
Tests above were run using mem-speed.
You cannot estimate Redis speed based on CPU frequency. You must account for the memory speed of your system.
If you read the CPU spec sheets, the 2013 CPU has three times the memory bandwidth of the 2008 CPU at the same clock frequency (regardless of clock frequency, actually).
Finally some results!
We have two methods of showing result graphs: Real View or Max Change View.
Real View zooms the graphs out about 10% so you can see larger trends.
Max Change View shows the most change. Max Change View will make normal benchmark variances look huge and volatile.
Note: the graphs don’t start at zero.
If you click a button and nothing happens, click again or reload the page. Some browsers have glitches with these manipulations.
Click on any bar for a per-compile-test breakdown for that version. Click on any bar again to return to the original graph. All in-depth compiler views render at max zoom view.
There are over 1600 individual graphs here for you to explore. Each benchmark graph shows the highest performing compiler option for that version out of about 20 tests. Each version has 12 individual sub-results showing the non-winning compiler performance too.
Click Buttons to Show Graphs and Do Things
Tap a button then scroll down to see all the graphs.