On Building Scalable Systems
Understanding scalability
In software engineering, scalability is the idea that a system should be able to handle an increase in workload by employing more computing resources without significant changes to its design.
Why don’t systems scale?
Software, though “virtual”, needs physical machines to run. And physical machines are bound by the laws of physics. The speed of light limits how fast a CPU can read data from memory. The information-carrying capacity of wires determines how much data can be moved from one part of a system to another. Material science dictates how fast a hard disk can spin to let data be read/written.
Collectively, all this means that there are hard limits to how fast our programs can run or how much work they can do. We can be very efficient within these limits, but we cannot breach them. Therefore, we are forced to use software or hardware design tricks to get our programs to do more work.
The problem of scalability is one of designing a system that can work around the physical limits of individual pieces of hardware to serve greater workloads. Systems fail to scale either because they use the hardware they are given poorly, or because they cannot use all the hardware available to them, e.g. programs written for CPUs won’t run on GPUs.
To build a scalable system, we must analyze how the software plays with hardware. Scalability lives at the intersection of the real and the virtual.
The key axes of scalability
Latency
This is the time taken to fulfil a single request of the workload. The lower the latency, the higher the net output of the system can be, since finishing each request faster lets us process more requests per unit time. Improving latency can be understood in terms of “speed-up” (processing each unit of workload faster) and is typically achieved by splitting the workload into multiple parts which execute in parallel.
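To make the split-and-parallelize idea concrete, here is a minimal Python sketch. The work function and the request sizes are made-up stand-ins; the point is only that a workload can be fanned out across workers and reassembled in order:

```python
import concurrent.futures

def process(request: int) -> int:
    """Stand-in for one unit of work (hypothetical; any pure function fits)."""
    return sum(i * i for i in range(request))

def run_serial(requests):
    """Baseline: handle each request one after another."""
    return [process(r) for r in requests]

def run_parallel(requests, workers: int = 4):
    """Fan the same requests out across a pool of workers.

    pool.map preserves input order, so the results line up with run_serial.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, requests))
```

Note that for CPU-bound work in CPython, a `ProcessPoolExecutor` is usually the better choice than threads (it side-steps the global interpreter lock); the thread pool above keeps the sketch self-contained and illustrates the structure either way.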
Throughput
The other axis of scalability is how much work can be done per unit time. If our system can serve only one request at a time, then throughput = 1 / latency. But most modern CPUs are multicore. What if we use all of them at…
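The reciprocal relationship between latency and throughput can be sketched as back-of-the-envelope arithmetic. The 50 ms latency and the worker counts below are made-up numbers, chosen only to show the scaling:

```python
def ideal_throughput(latency_seconds: float, workers: int = 1) -> float:
    """Requests per second for a system where each worker handles one
    request at a time: each worker completes 1/latency requests per second,
    so the ideal ceiling is workers / latency."""
    return workers / latency_seconds

# One worker with 50 ms per request can serve at most 20 requests/second.
single = ideal_throughput(0.050)            # 20.0
# Eight workers in parallel raise the ideal ceiling eightfold.
eight = ideal_throughput(0.050, workers=8)  # 160.0
```

This is an upper bound, not a prediction: real systems lose some of that ideal throughput to coordination, queueing, and shared resources.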