Redis 7 Server Start-Up Internals

Redis 7 Internals Series: Redis server starts in standalone mode

Vadim Samokhin
Better Programming

This is the first post in the Redis 7 Internals Series. You can find this scenario here. Read it left to right; that’s the order in which each logic block is executed while the Redis server starts.

Everything starts with the main() procedure in the server.c file. There are several major parts to it.

Config Initialization

It can be roughly divided into two stages. The first is initializing default config values hard-coded in the Redis source code. This logic is implemented in an initServerConfig() procedure. The second one is reading a config file and overwriting default settings; you can find it in a loadServerConfig() procedure.

initServerConfig()

There is a standardConfig static_configs[] variable defined in the config.c file. Each standardConfig struct has, among other things, a default value for the setting it defines and the address of the redisServer field that stores this value. For example, one of the static_configs[] elements goes like this:

createBoolConfig("appendonly", NULL, MODIFIABLE_CONFIG | DENY_LOADING_CONFIG, server.aof_enabled, 0, NULL, updateAppendonly)

It means that the default value for server.aof_enabled is 0. Default config values are set in the initConfigValues() procedure. There, for each standardConfig struct defined in the static_configs array, config->interface.init(config) is called. The interface field is a typeInterface struct, which mainly deals with getting and setting data of a specific type. For example, for a boolean config, the init procedure is set to boolConfigInit — you can see that in the createBoolConfig macro, which initializes a standardConfig struct:

#define createBoolConfig(name, alias, flags, config_addr, default, is_valid, apply) { \
    embedCommonConfig(name, alias, flags) \
    embedConfigInterface(boolConfigInit, boolConfigSet, boolConfigGet, boolConfigRewrite, apply) \
    .type = BOOL_CONFIG, \
    .data.yesno = { \
        .config = &(config_addr), \
        .default_value = (default), \
        .is_valid_fn = (is_valid), \
    } \
}

Besides, each standardConfig is added to a dictionary, that is, a hash table — so that any item can be found by its config name in constant time. A dictionary is represented as a dict struct in Redis.
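To illustrate, looking up a setting by name then boils down to a plain dict lookup. Here's a hedged sketch with illustrative names (the real code lives in config.c):

/* Hedged sketch: fetching a standardConfig by name from the configs dictionary.
 * Variable and helper names here are illustrative, not the literal ones. */
static standardConfig *lookup_config(dict *configs, sds name) {
    dictEntry *de = dictFind(configs, name);   /* constant-time hash lookup */
    return de ? dictGetVal(de) : NULL;         /* the registered standardConfig */
}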

You can check out a more concise Miro version here.

loadServerConfig()

Here, each setting provided in redis.conf is read and applied with the typeInterface's set procedure. Thus, default values are overwritten.

All data manipulation procedures for each typeInterface implementation are defined in config.c.

Listening Sockets Creation

Pair this description with visuals.

connTypeInitialize()

It starts with ConnectionType initialization. This struct contains procedures that are related to a connection and specific to an underlying socket type. It has procedures both for a client and a server. Among other things, they connect to an endpoint, create a listening socket bound to a specific address, accept connections, write data to a client socket, read data from a listening or connection socket, and shut down a connection.

So, for now, initialization simply means that a static ConnectionType *connTypes[CONN_TYPE_MAX] is filled with connection types for TCP sockets, Unix domain sockets, and, if Redis was compiled with BUILD_TLS=yes, for TLS (which in Redis can be used only with TCP sockets).
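To make the idea concrete, here is a rough, illustrative sketch of such a table of per-socket-type procedures. This is not the literal ConnectionType definition from connection.h; the field names are simplified:

#include <stddef.h>

/* Illustrative only: a struct of function pointers, with one instance per
 * socket type (plain TCP, Unix domain sockets, TLS). */
typedef struct connection connection;

typedef struct SocketTypeOps {
    int  (*listen)(const char *addr, int port);                 /* server: create a listening socket  */
    void (*accept_handler)(int listen_fd);                      /* server: accept pending connections */
    int  (*connect)(connection *c, const char *addr, int port); /* client: connect to an endpoint     */
    int  (*read)(connection *c, void *buf, size_t len);
    int  (*write)(connection *c, const void *buf, size_t len);
    int  (*shutdown)(connection *c);
} SocketTypeOps;

/* filled in during initialization: one slot per supported socket type */
static SocketTypeOps *conn_types[3];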

initListeners()

Two main things happen here. First, listening sockets are created, and second, their file descriptors are added in an epoll set. Let’s now go through everything in order.

Overall lifecycle of a network application
To connect, a client first creates a socket on its side and then sends a connection request, which boils down to the connect() system call. To accept connection requests, the server side must have listening sockets. Each listening socket has some memory allocated for a queue that temporarily stores incoming client connection requests:


Each accept() system call issued against a listening socket extracts the first pending connection request and creates a new connected socket. This socket is used to communicate with the specific client that made the request: everything the client sends from now on goes through it. Here's what the flow looks like:

When either the server or the client wants to terminate the connection, it first issues the shutdown() system call. It disables sending or receiving data on the specified socket so any pending requests can still be handled. After that, the socket is closed, which removes the underlying file descriptor and releases its resources.
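Here's a minimal server-side sketch of that lifecycle using the plain POSIX socket API (not Redis code; error handling omitted):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Create a listening socket, accept one client, talk to it, and shut down. */
static void serve_one_client(int port) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));

    listen(listen_fd, 511);            /* 511 is Redis' default tcp-backlog */

    /* extract one pending connection request; get a connected socket back */
    int conn_fd = accept(listen_fd, NULL, NULL);

    /* ... read()/write() on conn_fd to talk to this client ... */

    shutdown(conn_fd, SHUT_RDWR);      /* stop further sends and receives */
    close(conn_fd);                    /* release the descriptor and its resources */
    close(listen_fd);
}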

So, for now, listening sockets are created. In the case of TCP, this logic is implemented in the listenToPort() procedure. Binding and listening specifically happen in the anetListen() procedure.

Two more things are worth noting. First, the addresses that sockets are bound to are made reusable. Second, the created sockets are made non-blocking. But what does this mean?

Reusable addresses
If an address is not reusable, which is the default, only one socket can be bound to it. Otherwise, more than one socket can be bound to it, and the OS automatically distributes incoming requests among those sockets. Here's what the flow looks like:

It’s useful when you want to restart a server quickly, but your operating system thinks that the address is still used by some socket, which, in the case of Redis, is not the case: it binds an address to a single socket. So that’s the way to avoid this situation. The address is made reusable in a anetSetReuseAddr() procedure.

Non-blocking sockets
When you read data from an ordinary blocking socket, and there is none available, the read() system call will hang — or block — until data becomes available. It's not a busy-wait, that is, your CPU is not utilized, but further code execution just doesn't happen. In the case of a non-blocking socket, the call returns immediately with a special error code (EAGAIN/EWOULDBLOCK): it means that there is no data to be read right now. Thus, your thread doesn't hang, but your code should be ready for this scenario.
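Here's a small sketch of both sides of that: switching a descriptor to non-blocking mode (an fcntl() call with O_NONBLOCK) and handling the "no data yet" outcome:

#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

/* Put a file descriptor into non-blocking mode. */
static void make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Try to read; treat "nothing to read right now" as a normal outcome. */
static ssize_t try_read(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* no data yet: not an error, just try again later (e.g. when epoll says so) */
        return 0;
    }
    return n;   /* bytes read, 0 on peer close, or -1 on a real error */
}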

The same thing applies to the write() system call. But how can writing block? First, here is what writing to a socket looks like.

When you issue a write() system call, you don't actually write into the underlying file descriptor. It's not like writing to a file. Moreover, there is no file at all. Instead, there is a file descriptor, which is just an entry in a file descriptor table maintained by the kernel for each process, and some kernel memory associated with it, called a buffer. I wrote about it earlier.

That’s where the kernel actually writes your data. What happens if you write a message that doesn’t fit into a buffer? If it’s a blocking socket, your write() system call hangs. When a kernel finally releases some memory, and your message fits in the buffer, the write() call still hangs. It waits until this message is delivered to a recipient. And since TCP has no upper bound for when exactly a message is delivered, your app can hang indefinitely.

Now, if I write to a non-blocking socket and the message doesn't fit in the buffer entirely, only the part that fits is written, and write() reports how many bytes it actually wrote (or returns a special "try again" error if nothing could be written at all). Your app should be ready to resume writing when the buffer has room again. And this brings us to an epoll instance.

epoll instance
An epoll instance is a Linux kernel data structure that consists of two lists: an interest list and a ready list. When you want to know whether some file descriptor has data to be read or is ready for writing, you add that file descriptor to the interest list, which is also called an epoll set. The epoll instance asynchronously tracks the state of the registered file descriptors and adds them to the ready list as soon as they become ready for I/O. You can get the ready list by calling epoll_wait().

Each epoll set item is called an epoll event, which might sound misleading. Usually, when there is an “event,” it’s already happened. You can subscribe to events; after some time, they happen, and you can somehow get that event. That’s what happens in the case of an epoll set as well, but an epoll_event is probably an unfortunate name for an interest list item.

You do the following when you subscribe to read or write events for your file descriptors. First, you create an epoll_event struct. Then, you specify what exactly you're interested in: if you subscribe to read events, it goes like epoll_event.events |= EPOLLIN. And finally, you specify the file descriptor you're interested in: epoll_event.data.fd = my_fd;.
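Put together, registering interest in read events looks roughly like this (plain epoll API; epfd and my_fd are placeholders):

#include <sys/epoll.h>

/* Add my_fd to the epoll set (the interest list), asking for read events. */
static int watch_for_reads(int epfd, int my_fd) {
    struct epoll_event ev = {0};
    ev.events |= EPOLLIN;      /* "there is data to be read" */
    ev.data.fd = my_fd;        /* which descriptor this interest item refers to */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, my_fd, &ev);
}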


In Redis, the epoll logic is implemented in the ae_epoll.c file. To get the events that have actually happened since the last time you checked, you call epoll_wait.

Event handlers
How does Redis know which procedure to run when a specific event occurs? It keeps event handlers in the aeEventLoop struct's aeFileEvent *events array. Its name seems a bit misleading; it should've been called handlers, I think. Array indices are file descriptors, so Redis can quickly find the correct handler. If you subscribe to read events, you set a read handler, which is stored in the aeFileProc *rfileProc field; for write events, you set a handler in aeFileProc *wfileProc.

aeCreateFileEvent is a procedure which subscribes to specific events and sets corresponding handlers.
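Here's a hedged sketch of how a read handler might be registered with it; the names are placeholders, and this is not the literal Redis call site:

/* Sketch only: subscribing to read events on a listening socket and setting
 * its handler. aeCreateFileEvent() stores my_accept_handler in rfileProc. */
static void my_accept_handler(aeEventLoop *el, int fd, void *privdata, int mask) {
    /* accept() pending connections on fd here */
}

static void register_accept_handler(aeEventLoop *el, int listen_fd) {
    if (aeCreateFileEvent(el, listen_fd, AE_READABLE,
                          my_accept_handler, NULL) == AE_ERR) {
        /* registration failed: too many open files, or epoll_ctl() failed */
    }
}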

Quick Recap
So, in short, you subscribe to read or write events for your file descriptors, and then you check their state. If you do so periodically in an infinite loop, this whole mechanism is called an event loop. And that's exactly what happens with listening sockets. Redis puts them in an epoll set, subscribes to read events, and sets the aeFileProc *rfileProc field to ConnectionType.accept_handler, which is connSocketAcceptHandler in the case of TCP connections. After that, Redis periodically checks whether those socket descriptors have pending client connections. When they do, Redis accepts them. There will be a separate post about it. For now, you can check this diagram.

Background I/O Initialization

There are some system calls that can block for quite some time. Most notably, it's fsync, which actually moves data from OS memory pages to a storage device. Besides, if a large amount of memory has been allocated with malloc() and free() is then called on it, the allocator may need to perform some behind-the-scenes bookkeeping, which can take a significant amount of time and make the free() call slow.

That’s why Redis doesn’t run operations that employ those calls in the main thread. Instead, it utilizes special background threads — one thread per job type, three threads in total. In a source code, they are often referred to as bio threads, where “bio” stands for background I/O, although free has nothing to do with I/O. There are four job types:

enum {
    BIO_CLOSE_FILE = 0, /* Deferred close(2) syscall. */
    BIO_AOF_FSYNC,      /* Deferred AOF fsync. */
    BIO_LAZY_FREE,      /* Deferred objects freeing. */
    BIO_CLOSE_AOF,      /* Deferred close for AOF files. */
    BIO_NUM_OPS
};

Closing a file is usually not a blocking call, and most of the time it returns immediately. But there are some obscure cases when it can block, and what's worse, this behavior varies across systems. So generally, closing file descriptors in the background is good practice. Moreover, in Redis this close is preceded by an fsync call, which is relatively costly.

Each thread runs the bioProcessBackgroundJobs procedure. If the bio_cpulist setting is specified, it's applied here with the redisSetCpuAffinity procedure, which boils down to the sched_setaffinity system call. Effectively, you specify a set of CPUs for those threads to run on. This can reduce the overhead of context switching, cache invalidation, and other CPU-related costs, leading to faster execution and improved performance.

Then, an infinite loop processes the job list specific to this thread. If it's empty, the thread sleeps on a condition variable, waiting for a signal about a new job:

if (listLength(bio_jobs[worker]) == 0) {
    pthread_cond_wait(&bio_newjob_cond[worker], &bio_mutex[worker]);
    continue;
}

If it’s not, Redis gets the first element from bio_jobs[worker] list and processes it.

bioSubmitJob is the procedure used to add a job for background execution.
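To pair with the wait loop above, here's a hedged sketch of the producer side (simplified, not the literal bio.c code):

/* Hedged sketch of job submission: enqueue under the worker's mutex and wake it up.
 * The variable names mirror the consumer-side snippet above; details are simplified. */
static void submit_job(int worker, bio_job *job) {
    pthread_mutex_lock(&bio_mutex[worker]);
    listAddNodeTail(bio_jobs[worker], job);         /* enqueue the job */
    pthread_cond_signal(&bio_newjob_cond[worker]);  /* wake the sleeping worker */
    pthread_mutex_unlock(&bio_mutex[worker]);
}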

Threaded I/O

Redis is usually claimed to be single-threaded. That doesn't mean Redis has a single thread for everything. As you've already seen in the previous part, there are background threads, for instance. Moreover, Redis sometimes forks child processes. What it actually means is that all client commands are handled by a single main thread at the data storage level.

For instance, if you have ten thousand simultaneously connected clients, and all of them issue a SET command against the same key at the very same moment, Redis will set this key to the specified value ten thousand times: one by one, one at a time, in total isolation from each other, in the single main thread.

But this fact nevertheless leaves some space for optimization in specific scenarios. The primary scenario that is being targeted is as follows. If you have lots of clients connected at the same time, and they issue commands that have large responses, it might take some time to iterate all those clients and, for each of them, issue a write() system call.

This system call copies data from Redis, which resides in user space, into a TCP buffer, which is in kernel space. So, first, it's inherently slow because of this boundary crossing, and second, the larger the response, the more time it takes to copy. Redis came up with a solution targeting this issue: it can write responses in threads running in parallel. Those threads are usually referred to as I/O threads. The number of threads can be set with the io-threads setting.

The second scenario is conceptually the same, but it targets command parsing. Since it’s less typical for clients to have a large request body, this case is even rarer. This behavior is controlled by a separate setting, io-threads-do-reads.

I/O threads implementation is considered in more detail in this diagram. Threads are started in initThreadedIO() procedure. IOThreadMain is a procedure that runs in each thread.

A couple of notes about the implementation. First, how does Redis understand that there are some responses to write or some incoming requests to parse? It calls the getIOPendingCount procedure, which returns the number of items to be processed by the thread whose id is passed as an argument. Second, how does Redis know exactly what it should do — send a response or parse a command? A global static variable, io_threads_op, defines it. Here's how it is used:

if (io_threads_op == IO_THREADS_OP_WRITE) {
    writeToClient(c,0);
} else if (io_threads_op == IO_THREADS_OP_READ) {
    readQueryFromClient(c->conn);
}

In the case of writing responses to clients, both thread operation and pending responses count are set synchronously in handleClientsWithPendingWritesUsingThreads. Reads are done in handleClientsWithPendingReadsUsingThreads.

Last but not least, it might seem that I/O threads are busy-waiting when there is nothing for them to do. Here's the code:

for (int j = 0; j < 1000000; j++) {
    if (getIOPendingCount(id) != 0) break;
}
/* Give the main thread a chance to stop this thread. */
if (getIOPendingCount(id) == 0) {
    pthread_mutex_lock(&io_threads_mutex[id]);
    pthread_mutex_unlock(&io_threads_mutex[id]);
    continue;
}

They’re not, actually. Their state is controlled by locking corresponding mutexes, io_threads_mutex[id] — you can see that in a comment that goes like Give the main thread a chance to stop this thread. Initially, they are locked by the main thread. When Redis decides that there are enough clients to justify their processing with I/O threads, it unlocks their mutexes in a startThreadedIO() procedure. And when active clients quantity decreases, Redis simply acquires a lock again in a stopThreadedIO() procedure. That leads to I/O threads hanging when they try to lock already locked mutex.

I’ll write a separate post about how the whole client request handling is done with I/O threads. For now, you can head over to this Miro diagram.

Load Client Data From Disk Into Memory

First, here’s a quick overview of what persistence looks like in Redis.

Two persistence options in Redis

The first one is generating a data snapshot: all Redis data in a single binary file. This file is called an RDB file, and the whole mechanism is usually referred to as RDB, which stands for Redis Database. The filename can be configured with the dbfilename setting.

The second one is logging all mutating commands in a human-readable format — the same one used in the Redis protocol. All files created by Redis when this option is used are collectively called AOF files, which stands for append-only file. You can enable this option with the appendonly setting. The specific implementation of an underlying mechanism changes across different Redis versions.

AOF in Redis 7

In Redis 7, the AOF mechanism was reworked significantly. It employs two types of files for storing client data.

The first one is a base file. It's a snapshot of the entire Redis instance dataset. Its format depends on the aof-use-rdb-preamble setting. If it's on, the base file contains raw binary data; in this case, it is no different from an RDB file. If it's off, the base file contains the data as a compacted list of commands.

For instance, if some client has set a specific key to different values a hundred times, only the last command will end up in the base file. The same applies to other data structures. If a client appended a hundred values to a list, one item per command, the base file will contain a single RPUSH command with a hundred arguments.

The second type is an increment file, or, for short, an incr file. That's where all the mutating commands currently being executed are written. They are stored as-is.

When the Redis server starts, it checks the config for the currently used persistence mechanism and loads the corresponding files. In the case of AOF, it first loads a base file and then an incr file.

As new data is written to storage, the current incr file requires rotation so that it doesn't grow indefinitely. This rotation process, usually referred to as AOF rewrite, runs periodically. Instead of overwriting existing files, new ones are created. In short, it goes like this: first, a new incr file is created, and all current commands are now written to this new file. After that, a new snapshot is taken. When it's completed, the previous versions of the base file and the incr file are deleted. I describe this process in more detail here, and I'll write about it a bit later.

Now, let’s consider the loading part. As usual, you can find it in a diagram here.

Implementation of AOF files loading into memory

As I’ve written earlier, two types of files store client data. They can have several versions at once, so they have to be tracked somehow. That’s what a manifest file serves for. It stores currently active versions of a base file and incr files, along with their previous, historical, versions that will be deleted soon. If appendfilename setting equals my_appendonly_file, a manifest file can look like that:

# this is a base file, type "b" for "base"
file my_appendonly_file.aof.2.base.aof seq 2 type b

# This is a previous version of a base file; mind the ".base" part of the filename.
# Type "h" means "historical"
file my_appendonly_file.aof.1.base.aof seq 1 type h

# this is a previous version of an incr file; mind the ".incr" part
file my_appendonly_file.aof.1.incr.aof seq 1 type h

# this is the currently active incr file. All current commands are written here
file my_appendonly_file.aof.2.incr.aof seq 2 type i

AOF loading starts with reading this manifest file into memory, namely into an aofManifest struct. Here's what that looks like:

typedef struct {
    aofInfo *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list *incr_aof_list;          /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list *history_aof_list;       /* HISTORY AOF list. When the AOFRW success, The aofInfo contained
                                     in `base_aof_info` and `incr_aof_list` will be moved to this list.
                                     We will delete these AOF files when AOFRW finish. */
    long long curr_base_file_seq; /* The sequence number used by the current BASE file. */
    long long curr_incr_file_seq; /* The sequence number used by the current INCR file. */
    int dirty;                    /* 1 Indicates that the aofManifest in the memory is inconsistent with
                                     disk, we need to persist it immediately. */
} aofManifest;

This is done in the aofLoadManifestFromDisk() procedure.

Then, client data is loaded. First, Redis checks which persistence mechanism is currently used. If AOF is disabled, it loads an RDB snapshot. Otherwise, AOF files are loaded: first the base file, and second, all the incr files. Usually, there is a single incr file. But if a previous AOF rewrite process ended with an error, Redis can have more than one incr file. Incoming commands are written to a single incr file at any given time.

An important consequence: if you've been using only the RDB mechanism and you stop Redis, enable AOF, and then start your instance, your data from the RDB file will not be loaded. Like almost any setting in Redis, you can enable AOF on the fly instead: CONFIG SET appendonly yes.

loadSingleAppendOnlyFile() is the procedure that loads an AOF file. It has a backward-compatibility check, so it can work with files created by previous Redis versions. Earlier, if aof-use-rdb-preamble was on, an AOF file contained a so-called preamble: just the initial Redis dataset in binary format, no different from an RDB snapshot. So, if Redis starts for the first time after an upgrade, this procedure first reads this preamble if it's present, and then goes on with the so-called AOF tail. In Redis 7, a base file can contain either raw binary data or a compacted list of commands, not both.

When data is loaded, Redis opens an incr file from a manifest so it is ready for writing client commands. If it’s the first Redis 7 start, neither a base nor an incr file exists. In this case, they’re created, and a manifest file is updated.

Finally, if history files are left after the last AOF rewrite, Redis deletes them now, in the aofDelHistoryFiles() procedure. Like any file deletion in Redis, this is done asynchronously in a bio thread.

Event Loop

Overview

I’ve already mentioned an event loop concept when I talked about an epoll set. But epoll mechanism is an implementation detail of an event loop, which is, in and of itself, just an abstract concept. It is a programming construct that enables a program to efficiently handle multiple input/output operations without blocking or wasting CPU cycles. Typical cases are network I/O and file I/O. What powers any event loop is a specific multiplexing mechanism. A concrete kernel implementation tracks registered file descriptors and reports their state when you ask. Epoll is one of those implementations, the latest and the most performant one. It’s now used in Redis when it’s built on Linux systems.

Regardless of an implementation mechanism, an event loop has these basic steps:

  1. Register interest in I/O events for each file descriptor using the appropriate system call. In other words, you subscribe to an event. I/O event is basically a change in the state of a file descriptor. For instance, if you have a listening socket and some client connects to an address bound with it, the state of this socket changes: now it has a connection request in its TCP buffer, waiting to be read and accepted.
  2. Wait for I/O events on the registered file descriptors by issuing a blocking system call that depends on your multiplexing mechanism. You can wait indefinitely until at least one event happens, or with some timeout.
  3. When a blocking system call from a previous step returns, you get events that have happened so far.
  4. Determine the type of event that occurred (read, write, etc.) and handle it appropriately by executing the associated callback function.
  5. Repeat.

Event loop with an epoll mechanism

First, Redis creates a so-called epoll instance: epoll_create(1024);. This system call effectively creates a container for an epoll set and a ready list and returns a file descriptor associated with it. In early Linux versions, its single argument hinted to the kernel what size to expect the epoll set to be; now it is ignored. To start tracking the state of a file descriptor, you issue an epoll_ctl system call; besides adding an item to the epoll set, it can also modify and remove one. And when you want to get all the events that have happened to the registered file descriptors, you call epoll_wait.
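Here's what those three calls look like together in a bare-bones loop (plain Linux API, not Redis code; listen_fd stands for an already created, non-blocking listening socket):

#include <sys/epoll.h>
#include <sys/socket.h>

static void run_event_loop(int listen_fd) {
    int epfd = epoll_create(1024);                   /* the size argument is ignored nowadays */

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  /* register interest in read events */

    struct epoll_event fired[64];
    for (;;) {
        /* wait with a timeout so the loop never blocks indefinitely */
        int n = epoll_wait(epfd, fired, 64, 100);
        for (int i = 0; i < n; i++) {
            if (fired[i].data.fd == listen_fd) {
                int conn_fd = accept(listen_fd, NULL, NULL);
                (void)conn_fd;   /* register conn_fd in the epoll set the same way */
            } else {
                /* read a request from fired[i].data.fd and handle it */
            }
        }
    }
}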

Event loop in Redis

I’ve built, installed, and run Redis on Linux, so I’ll refer to an epoll as a multiplexing mechanism. As usual, you can combine reading this text with this diagram.

First, an epoll instance is created in the aeApiCreate procedure, which is called during event loop initialization in aeCreateEventLoop. To track some file descriptor's state, you call aeApiAddEvent(); aeApiDelEvent() removes a file descriptor from the interest list.

As the last step of server start-up, Redis runs an infinite loop around the aeProcessEvents procedure, each call of which implements an entire event loop iteration. It has several logical parts.

  1. The first one is all the logic that should be run before waiting on events. Since this wait blocks with a specific timeout, the logic executed before epoll_wait is collected in a procedure called beforeSleep. It's implemented here. Among other things, it calls procedures dealing with reading clients' commands and writing responses to them. I'll write a separate post about several scenarios where a client sends a request and Redis handles it. Nevertheless, you can already find the corresponding diagrams here.
  2. After that, Redis asks for all the events that have happened since the latest event loop iteration. This logic is implemented in the aeApiPoll procedure. In it, Redis issues an epoll_wait system call. It's always made with a specific timeout, so it never blocks indefinitely. Then, Redis collects the fired events in aeEventLoop->fired.
  3. Then, Redis loops over aeEventLoop->fired and runs the corresponding handlers. They are stored in the aeEventLoop->events array. Handlers are added by calling the aeCreateFileEvent procedure.
  4. Finally, so-called time events are handled. Those are events indicating that a specific time interval has passed. On Linux, you can subscribe to them natively by creating a timer with timerfd_create, putting the resulting file descriptor into an epoll set, and then setting a timeout with timerfd_settime (a minimal sketch follows this list). It comes in handy for logic that needs to run periodically. However, there is actually no such thing as a classical timer event in Redis. Instead, it emulates them. First, you register a handler with the aeCreateTimeEvent procedure, where you effectively specify a timestamp at which your handler should run. Then, at the end of each event loop iteration, processTimeEvents is called. There, Redis iterates over each handler and checks whether the specified timestamp has come. If it has, the corresponding handler is executed.
    That’s how cron is implemented in Redis, for instance. We’ll see shortly what happens there, but first a quick tour of a beforeSleep() procedure.

beforeSleep()

There are some procedures that should be executed before receiving new events. As I’ve already written, among them are procedures dealing with clients’ input and output; there will be a separate post covering those. Here, I’ll briefly walk through procedures related to blocking clients. Hopefully, someday I’ll update this part with key invalidation tracking feature that uses a pub/sub mechanism.

Clients’ blocking

There is a set of procedures executed in beforeSleep() that all relate to the scenario when a client connection is blocked. Blocking means that Redis doesn't parse and, hence, doesn't execute any commands that the client sends, though it still reads and stores those raw requests in the client's buffer.

Overview
First, let’s consider how this blocking generally works using BRPOP command as an example.

When a client sends a BRPOP command and none of the specified keys exist, Redis blocks that client on those keys with the specified timeout. This logic is implemented in the blockForKeys() procedure. There, the client first records the keys it blocks on. Then, each key is added to the redisDb->blocking_keys dict, whose value is a list of clients blocked on that key. Thus, Redis has access both to the keys a given client is blocked on and to the clients blocked on a given key. The timeout is handled in a similar manner: it's stored on the blocked client and also added to the server.clients_timeout_table radix tree. Finally, the client is marked as blocked.

When some other client adds the first item to any of the non-existing lists the first client blocks on, that key is added to the database. The dbAdd() procedure checks whether there are any clients blocked on it, and if there are, the key is added to the server.ready_keys list.
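Here's what this scenario looks like from two redis-cli sessions (an illustrative transcript):

# client A: block for up to 10 seconds waiting for an element in mylist
127.0.0.1:6379> BRPOP mylist 10
(client A is now blocked)

# client B, on another connection, pushes an element
127.0.0.1:6379> LPUSH mylist "hello"
(integer) 1

# client A is unblocked right away and receives:
1) "mylist"
2) "hello"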

handleBlockedClientsTimeout()
This procedure walks through the server.clients_timeout_table radix tree and gets all clients whose timeout has expired. Each of them is unblocked, meaning the client is no longer marked as blocked, so from now on all of its further commands will be processed. Finally, Redis puts it in the server.unblocked_clients list, which is processed in the next step.

processUnblockedClients()
Here, Redis iterates over the server.unblocked_clients list. As I've written above, it holds all the recently unblocked clients. If any of those sent commands while being blocked, those commands were put on hold; now it's time to parse and execute them. I'll write a separate post about the command execution scenario, but you can check out a Miro diagram right away.

handleClientsBlockedOnKeys()
This procedure iterates through the recently added keys that at least one client blocks on. As I've mentioned earlier, they are stored in the server.ready_keys list. For each such key, every client blocked on it is unblocked. As I've already written, from now on all of its further commands will be processed, both parsed and executed. Besides, each such client ends up in the server.unblocked_clients list that Redis iterates over in the processUnblockedClients() procedure described above.

Redis Cron

Redis cron refers to the serverCron procedure. It runs hz times per second. But if dynamic-hz is enabled (it is by default), the cron frequency is adapted to the number of active clients:

if (server.dynamic_hz) {
    while (listLength(server.clients) / server.hz >
           MAX_CLIENTS_PER_CLOCK_TICK)
    {
        server.hz *= 2;
        if (server.hz > CONFIG_MAX_HZ) {
            server.hz = CONFIG_MAX_HZ;
            break;
        }
    }
}

There is a lot going on there: Redis stats are updated, shutdown is handled gracefully, some info is written to logs, some postponed logic is executed — I won't dive into those. I'll describe what I consider the most significant parts.

clientsCron()

Actually, the comment section of this procedure already does a great job of describing what’s going on here. In short, it takes all the active clients and processes 1 / server.hz of them, so all the clients are supposed to be processed within about one second. For each client, Redis runs the following procedures:

  1. If a client timeout is specified, Redis calculates how long the current client has been idle. If that exceeds the configured setting, the client is disposed of and the connection is closed — the connection socket's file descriptor is first removed from the event loop and then closed.
  2. The client struct's querybuf field is shrunk if needed. This is where a client's request body is stored until it is parsed. The same is done to the client's output buffer, which stores data waiting to be written to the connection socket.
  3. Client memory usage statistics are updated (clientsCronTrackExpansiveClients deals with clients that are expensive in terms of the memory they use).
  4. If client-output-buffer-limit is reached, the client is closed immediately.

Historically, Redis has used the following trick here: at the beginning of each iteration, the tail of server.clients is moved to the head. Why? If this client has to be removed, O(n) complexity is avoided: Redis searches for an item starting from the list's head, and since the client is always the first one, the search takes constant time. However, as far as I understand, this is not needed anymore. When a client connection request is accepted and the client struct is created, the server.clients list node is cached in it:

void linkClient(client *c) {
    listAddNodeTail(server.clients,c);
    /* Note that we remember the linked list node where the client is stored,
     * this way removing the client in unlinkClient() will not require
     * a linear scan, but just a constant time operation. */
    c->client_list_node = listLast(server.clients);
    /* ... */
}
Moreover, since new clients are added at the tail of the server.clients list, clientsCron() will process them first, leaving old clients neglected for some time.

databasesCron()

Here are the procedures needed for Redis storage level maintenance.

activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW)
You can find a visual counterpart here.

This procedure removes expired keys. They are also removed upon retrieval, but what if they are never retrieved? They can fill all the available memory. That’s why keys are expired proactively.

Keys with an expiration time set are stored in a redisDb's expires field, which is a dict struct. There are actually two hash tables in each dict, used for gradual resizing. Resizing is needed when the number of items in a hash table grows too large, so the table size must be adjusted accordingly; or the number of elements may get too low, so the hash table is shrunk for efficiency. I'll write more on that later. For now, it's enough to know that there are two of them and both are implemented as arrays.
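For reference, here's roughly what the dict struct looks like in Redis 7 (trimmed; the explanatory comments are mine):

struct dict {
    dictType *type;
    dictEntry **ht_table[2];    /* the two hash tables used for gradual resizing */
    unsigned long ht_used[2];   /* number of elements stored in each table */
    long rehashidx;             /* -1 when no rehashing is in progress */
    int16_t pauserehash;        /* >0 when rehashing is temporarily paused */
    signed char ht_size_exp[2]; /* size of each table, as a power-of-two exponent */
};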

That’s how a hash table in an redisDb’s expires field could look like:

Those violet circles are bucket entries, or hash table elements. When they belong to the same bucket, they form a singly linked list, with each element pointing to the next. Each element is represented as a dictEntry struct:

struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;   /* Next entry in the same hash bucket. */
    void *metadata[];         /* An arbitrary number of bytes (starting at a
                               * pointer-aligned address) of size as returned
                               * by dictType's dictEntryMetadataBytes(). */
};

The expiration time is stored in the v.s64 field as a Unix timestamp. The key pointer used in a redisDb->expires item is the same key pointer used in the redisDb->dict's dictEntry that stores the client value. Now, let's suppose that the circles marked in red are expired:

Redis walks through each hash bucket in the hash table, and if the current bucket has a linked list of bucket entries, that list is iterated as well. This iteration logic is implemented in the dictScanDefrag() procedure. It's more complicated than that, but for now this approximation is enough. For each visited bucket entry, Redis invokes the expireScanCallback() procedure. It checks whether the expiration time has passed, and if it has, the key is deleted from both dictionaries — the one storing the expiration time itself and the one storing the actual client data:

Redis continues this scanning until it hits either a bucket-entries limit or a hash-buckets limit, so it won't take too long. In reality, the former limit is several dozen on average, but in this example it's 3. The latter is twenty times larger.

By the end of this iteration, redisDb->expires’s hash table looks like this:

And redisDb->dict’s hash table looks like that:

This process will continue on the next cron iteration from where it left off.

activeDefragCycle()
Defragmentation is the process of reducing memory fragmentation, which occurs when free memory in a program is divided into small pieces that are too small to be useful for allocating new objects or data. This can lead to inefficient memory usage and slower program performance over time. Defragmentation helps to consolidate free memory into larger contiguous blocks, which can then be more effectively used for new allocations. This process is quite costly and is disabled in Redis by default. Typically, you can identify that you need defragmentation by examining info memory output periodically.

Defragmentation itself boils down to reallocating memory for a given pointer when it's worthwhile. Redis performs this process incrementally, the same way it approaches key expiration. That is, it doesn't do all the required work in a single run, because that might take a while and Redis would be unresponsive during that period. Instead, such operations are carried out in small steps, in a controlled manner with tight limits, but often. It iterates hash tables with the dictScanDefrag() procedure, walks through the hash table buckets, and, for each bucket, defragments three types of pointers: the dictEntry, its key, and v.val. It starts with dictEntrys:

  1. First, dictDefragBucket() is invoked for the current bucket. There, activeDefragAlloc() is executed for each bucket entry. It boils down to two things: first, Redis asks the jemalloc memory allocator whether it's worthwhile to reallocate the current pointer; if it is, new memory is allocated, the old pointer's contents are copied into it, and the old pointer is freed.
  2. After that, Redis walks through each dictEntry in the current bucket again and, for each one, executes defragScanCallback(). This procedure boils down to calling defragKey(). First, it defragments the key. Then, Redis defragments the value, using a procedure that depends on the type of the current dictEntry.v.val.

tryResizeHashTables()
If the number of elements in a hash table becomes too low relative to its size, this can lead to several problems. First, a large number of buckets means that a lot of memory is allocated for the hash table, but most of it remains unused. Second, performance degrades when a sparse hash table is scanned. That's why Redis periodically shrinks underutilized hash tables. The tryResizeHashTables() procedure does exactly that: it reduces the size of hash tables that are less than ten percent full.

The resizing process never modifies the hash table holding client data in place. Imagine a hash table with a million keys: if it were resized in place, each key would have to be rehashed and moved to a new slot, which might take a while. Instead, a hash table of the new size is allocated in dict->ht_table[1]:

Its size is set to the smallest power of two greater than or equal to the number of elements in the hash table.

Resizing here only shrinks a dictionary. Expanding takes place in several other spots, most notably while inserting a new value, namely in the dictFindPositionForInsert() procedure.

Note that creating a new hash table and moving data there from an old one, that is, dict->ht_table[0], are two different processes. The process of moving data into a new hash table is called rehashing. It’s done incrementally while interacting with a dictionary, both in read and write operations. For instance, check out dictFind(), dictGetRandomKey(), dictFindPositionForInsert(), etc. Besides, it’s performed in each cron iteration with incrementallyRehash() procedure.

incrementallyRehash()
So, we have a dict that has just been resized, that is, a second hash table with a new size:

It’s time to move the first hash table’s data there. When it’s performed in a cron iteration, Redis moves bucket entries in a hundred-size batches, and with a specific time limit which is one millisecond. So, let’s start with a first bucket:

When the second hash table is larger than the first one, the new bucket index is calculated from the item key's hash, so the distribution in the new hash table has a good chance of being even. This is not the case when the second hash table is smaller. Then, a new hash is not calculated; instead, the new bucket index is the current index in the first hash table masked with the size of the smaller hash table. Since hash table sizes are always powers of two, all the elements residing in the first bucket of the first hash table end up in the first bucket of the second hash table, and so on:

That’s barely a problem though. If a hash table is shrinking, it means it’s really sparse and a scenario when any single bucket has lots of elements is unlikely.

Then Redis continues with the next non-empty bucket:

And so on. When done, hash tables swap places and the second hash table is reset:

Until then, when rehashing is still in progress, reads are done in both hash tables. New keys are added in the second one, so they are not lost when rehashing is over.

Forking a new process and the copy-on-write technique
That's about all that happens in the databasesCron() procedure. One thing to note is that all those manipulations of in-memory data are allowed only when there is no child process. Why? When a child process is forked, it has access to all the data the parent process has. This works thanks to the copy-on-write technique implemented by the OS. But if either process mutates some shared data, the underlying memory page is copied, and the change becomes visible only to the process that made it. The other process still sees the data as it was at the moment of forking.

This CoW technique has associated costs, such as increased memory usage and potential performance overhead. That's why Redis avoids maintenance tasks when a child process is running, and that's why you occasionally see this check throughout the codebase. Besides, data mutation in a child process is pointless: the parent won't see it anyway. That's why resizing — and, hence, the rehashing that follows — is disabled while there is a child process.

rdbSaveBackground()

This procedure creates an RDB snapshot. By default, it is triggered once an hour if there has been at least one change, every five minutes with at least 100 changes, and once a minute with at least 10,000 changes. Those thresholds can be changed, though.

All of the heavy lifting takes place in a child process. There, the rdbSave() procedure is executed, and the process exits. Its implementation is quite straightforward but tedious. The whole instance's data is first written to a file. For performance reasons, some of this data might be buffered within the C library's I/O subsystem. So when the write is over, Redis first needs to make sure all of this data is moved to the file. That's what the fflush procedure does. However, this still does not guarantee that the data has been fully written to disk: it can be cached by the OS or the disk controller. To actually move this data to disk, an fsync call is needed.
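The flush-then-sync sequence itself is just standard C and POSIX calls; here's a small sketch:

#include <stdio.h>
#include <unistd.h>

/* Move data out of the C library's buffer, then force it onto the storage device. */
static int persist_file(FILE *fp) {
    if (fflush(fp) != 0) return -1;          /* C library buffer -> kernel page cache */
    if (fsync(fileno(fp)) == -1) return -1;  /* kernel page cache -> disk */
    return 0;
}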

The rdbSaveRio() procedure in Redis is responsible for writing the contents of the Redis database to disk. It starts by writing some metadata and function definitions, and then proceeds to write the contents of each database. For each entry in the database, the procedure writes the key and value. The key is preceded by a value type identifier, and the value is written in its raw binary form. In addition to the key and value, the procedure writes the expiration time of the key, as well as either the LRU or LFU eviction information, depending on the configured eviction policy.

rewriteAppendOnlyFileBackground()

I’ve already wrote about the process of loading data from files in case of AOF mechanism. Here is a rewrite process described.

It’s triggered when all of the following conditions hold true:

  1. AOF is on (that is, the appendonly setting is yes)
  2. a rewrite process has not already started
  3. the current AOF file size exceeds the auto-aof-rewrite-min-size setting
  4. the AOF file size grew by more than the auto-aof-rewrite-percentage config value since the latest rewrite
  5. and a rewrite is not postponed because of repeated failures

This process is performed in several steps.

Flush Redis buffer into an AOF file
First of all, if there are commands buffered in Redis memory that have not yet been written to the AOF file, they are written right away.

A new incr file is created where the current commands are written
Then, a new incr file with the next sequence number is created and added to the manifest file. For instance, say the manifest file looked like this before the rewrite process started:

file my_appendonly_file.aof.2.base.aof seq 2 type b

# If a previous AOF rewrite ended with an error, history files can be left here. They are deleted at the end of a successful AOF rewrite.
file my_appendonly_file.aof.1.base.aof seq 1 type h
file my_appendonly_file.aof.1.incr.aof seq 1 type h

# This is a currently active incr file. All the current commands are logged here.
file my_appendonly_file.aof.2.incr.aof seq 2 type i

It can contain history files if the latest rewrite ended in failure. Then, the previous incr file is fsynced and closed. And, finally, the new incr file is set to receive the current commands. Now, the manifest file looks like this:

file my_appendonly_file.aof.2.base.aof seq 2 type b

file my_appendonly_file.aof.1.base.aof seq 1 type h
file my_appendonly_file.aof.1.incr.aof seq 1 type h

file my_appendonly_file.aof.2.incr.aof seq 2 type i
file my_appendonly_file.aof.3.incr.aof seq 3 type i

The entire Redis instance snapshot is written into a temporary base file
After the new incr file is ready to receive all mutating commands, a new child process is forked. There, the actual rewrite happens; it's implemented in the rewriteAppendOnlyFile() procedure. The whole Redis dataset is written to a temporary file called temp-rewriteaof-bg-<current pid>.aof. If aof-use-rdb-preamble is enabled (it is by default), the data is written in raw binary format — the same one an RDB snapshot has. If it's disabled, the data is written as a human-readable compacted list of commands: Redis iterates through all dictEntrys in each dict and, depending on the value type, writes a corresponding write command. For instance, for a key key5 holding a simple string value value5, it writes:

*3
$3
SET
$4
key5
$6
value5

It has the form of a command, but unlike an incr file, this file contains no history: each key is written only once, with its current value.

It's worth noting that during the snapshot creation, the parent process continues to handle client commands, including mutating ones, which Redis writes to the new incr file. However, the child process that takes the snapshot does not see these changes: it only sees the data as it was at the moment of forking. As a result, the data in the base file and the commands in the incr file neither overlap nor leave gaps. Hence, there is no data duplication or data loss when they are loaded into memory at the next start-up.

When the snapshot is ready, it becomes the new base file
On each cron iteration, Redis checks whether the snapshot creation has completed. If it has, the temporary file is renamed to become the new base file, while the previous base file becomes historical. Additionally, all the incr files except for the currently active one become historical as well. When a file becomes historical, it is not renamed; rather, it is marked as such in the manifest file, which is persisted at this point. Now it looks like this:

# this is a new base file
file my_appendonly_file.aof.3.base.aof seq 3 type b

# this is a previous incr file
file my_appendonly_file.aof.2.incr.aof seq 2 type h
# this is a previous base file
file my_appendonly_file.aof.2.base.aof seq 2 type h
# this is a base file left from the latest non-successful aof rewrite
file my_appendonly_file.aof.1.base.aof seq 1 type h
# this is an incr file left from the latest non-successful aof rewrite
file my_appendonly_file.aof.1.incr.aof seq 1 type h

# this is a currently active incr file where incoming commands are written
file my_appendonly_file.aof.3.incr.aof seq 3 type i

Finally, the historical files are deleted both from disk and from the manifest file, which is updated again:

file my_appendonly_file.aof.3.base.aof seq 3 type b

file my_appendonly_file.aof.3.incr.aof seq 3 type i

In closing

That’s what Redis server start-up looks like in a standalone mode. I haven’t considered cluster capability, replication, or sentinel mode. However, I hope that if you dive into those, you’ll have some background helping you with that.
