
Why We Over-Engineer Software (and How to Break the Habit)


With ready access to public cloud computing, container orchestrators, and microservices architecture, it has become trivial to create distributed systems of nearly limitless scale and complexity. While all of these tools have their purpose, it’s important for engineers to carefully consider when and whether to use them, especially in smaller organizations. Making the wrong choices can make you less agile, less financially sound, and less successful. This article explores some potential causes and proposes some remedies for the common anti-pattern of unnecessary complexity.

The Problem

We Begin With Complexity

When we interview for software engineering jobs, we must navigate a synthetic, challenging, and often stressful interview process to prove to potential employers that we are qualified to continue doing the things we’ve already been doing for the past several years. During this process, we may be asked to solve multiple medium or hard algorithmic questions in a short time in what feels more like a spectator sport than coding. We are usually required to design a novel system in under an hour whose requirements we must first extract from the interviewer. None of these exercises is much like what we do in our actual jobs; they are instead challenges designed to produce varying degrees of “signal” that hiring committees can use to level and differentiate candidates.

The system design interview, in particular, is a construct where candidates are asked to demonstrate wide and deep understanding of systems at a scale you might find only in the very largest companies. We use concepts like microservices, consistent hashing, event buses, service mesh, protocol buffers, WebSockets, Kubernetes, API gateways, data pipelines, data lakes, and other technology buzzwords to show that we know what they are and how they are used. This gives potential employers confidence that we are up to date on all of the latest techniques and tools and can do what might be asked of us.

Complexity Breeds Complexity

If we get the job, we are given the opportunity to use all of the advanced knowledge we demonstrated during the hiring process to solve problems for the business. We dutifully employ these technologies and often end up creating rather complex systems based on the architectural patterns pioneered by Google, LinkedIn, Meta, Amazon, Apple, and Netflix.

Corporate leadership is often happy to see this complexity as it demonstrates that the company is now mature, the “engineering bar” is high, and the company’s systems will be able to scale with the business no matter how much the business grows. This complexity is further leveraged as a recruiting tool to show candidates that they will get to work with all of the latest technologies and that their future coworkers are bright and up-to-date on modern tools and techniques.

What’s Wrong With That?

The problem with this approach is that many companies will never grow large enough or achieve enough scale for this complexity to be beneficial. The largest companies developed these novel technologies and patterns to deal with serious problems of scale that they encountered as their user base and transaction volumes grew exponentially over extended periods of time. In many cases, no good solutions existed, so they built their own for the scale their businesses had achieved, often at great initial and ongoing cost. Many of these companies graciously open-sourced some of their technologies for the good of the broader technology ecosystem, making them “free to use” for other companies.

When not operating at scale, many of the above patterns and technologies are dangerous because they can slow progress, dramatically increase costs, and multiply the cognitive load on engineering staff. This is a huge problem for smaller companies because it makes the business perform worse in the ways that matter most: agility and profitability. This creates a trap for businesses that find themselves in an economic downturn or a suddenly more competitive market sector as margins tighten. This mismatch is exacerbated because reward systems for engineering staff are often not sufficiently linked to the business performing better (more on that in an upcoming article).

Two Examples

Let’s start with a simple system that’s relatively easy to understand and compare its minimal form to what can happen when we over-engineer it. For the sake of this discussion, let’s assume that we have a generic website where users can register and purchase things.

The Simple Version

The minimal architecture for a simple e-commerce website is often referred to as a monolith, meaning the entire codebase is encapsulated in a single source repository and is generally deployed as a single unit. Many startups begin here because it allows them to move quickly when the codebase and engineering teams are small. An added benefit is that the entire system can often fit on a laptop for easy local development.

For this minimal architecture, we can choose some sensible defaults from a single cloud provider (in this case Amazon Web Services, or AWS) to keep infrastructure tooling consistent:

  • Primary Development Language/Framework: Ruby+Rails
    we could also choose Python+Django, Go+Buffalo, or any other sensible language/framework
  • Frontend Language/Framework: JavaScript/TypeScript + React
    nearly all modern web frameworks will require some JavaScript framework, and React is very popular
  • Batch/Background Processing: Ruby+Sidekiq
    Python+Celery or Go+Faktory would be alternatives if we picked a different primary language
  • Source Control: Git/GitHub
  • CI/CD: GitHub Actions + Terraform
  • Image Repository: AWS ECR
  • Container Orchestration System: AWS ECS
  • Keys: AWS KMS
  • Secrets: AWS Secrets Manager
  • Firewall/Tiered Cache: Cloudflare
  • DNS: Cloudflare
  • Load Balancer: AWS ELB or ALB
  • File/Object Storage: AWS S3
  • Cache+Cache Nodes: AWS ElastiCache for Redis
  • Database + Read Replicas: Postgres on AWS RDS
  • Messaging: Sendgrid
  • Payment Processing: Stripe
  • Identity/SSO: Okta
  • Logs & Observability: DataDog

We will assume that external services are exposed via versioned REST APIs since most third-party services offer that option. We can keep the cognitive load lower by doing the same for internal API calls so we can use tools like Postman to test all of our APIs. All internal service-to-service calls will be made synchronously, though it’s possible to achieve asynchronous functionality with background processing via Sidekiq if needed. Any APIs exposed to external users will be exposed through the same ingress that is used for general web browser traffic. There is no “event bus” nor are there any data pipelines, data warehouses, or data lakes. All analytics are done against database read replicas when required.
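
To make this concrete, here is a minimal sketch of what “asynchronous functionality via Sidekiq” looks like inside the monolith. The job, model, and mailer names (OrderConfirmationJob, Order, OrderMailer) are hypothetical placeholders:

```ruby
# A minimal Sidekiq job. The web request that enqueues it returns
# immediately; Sidekiq retries the job if it raises.
class OrderConfirmationJob
  include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

  def perform(order_id)
    order = Order.find(order_id)
    OrderMailer.confirmation(order).deliver_now
  end
end

# Enqueued from a controller or service object after checkout:
# OrderConfirmationJob.perform_async(order.id)
```

Because the job runs in the same codebase against the same database, there is no RPC contract, schema registry, or separate deployment to coordinate.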

This design includes 16 critical components whose failure could cause user-facing, site-wide outages or degradation. Because the system is minimal, most of the components are critical, but redundant deployments help mitigate the risk of failures causing downtime.

The Complex Version

We could argue that the simple version above isn’t really simple, but we can flex our architectural muscle and make the system quite a bit more impressive.

Let’s begin by (partially) breaking up the monolith into microservices. Let’s also introduce an API gateway, Kubernetes, an event bus (using Kafka), service mesh, multiple DNS systems, multiple CI/CD pipelines, data pipelines (using a separate Kafka cluster), and a data lake. In addition to REST APIs, let’s add support for gRPC APIs (which includes protocol buffers) because we know it’s a lot faster than REST and we desire high performance. We will add a second CI/CD system (Jenkins) because some of the staff prefer it to GitHub Actions.

Let’s add an additional language and framework (Python/Django/Celery) to give developers more options, and make sure we remember to add Go and Groovy to our list of supported languages: we’re using Jenkins (with its Groovy DSL) and Kubernetes, so we’re likely to need Go-based Kubernetes operators to manage a few custom resources we’ll create. For better performance and decoupling, we will fully separate our front-end UI deployment from our backend monolith. Let’s also add another option for logging and observability (Sumo Logic) because our first choice (DataDog) was deemed too expensive to ship all of our logs to.

Let’s take inventory of how these architectural flourishes have changed our system:

If we assume our Python service and data platform are not mission-critical, we now have 33 critical components whose individual outages could cause site-wide issues. Based on how we calculate availability, this makes outages much more likely because the overall system can only be as available as its least available dependent component. We have more dependent components, so keeping the same availability as a simpler system requires much greater rigor. In practice, the more components there are, the harder it is to enforce consistently high standards.
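
A back-of-the-envelope calculation shows why the component count matters. If every critical component must be up for the site to be up, and we assume (purely for illustration) 99.9% availability per component, the overall availability is roughly the product of the parts:

```ruby
# Rough serial-availability math: the site is only up when every critical
# component is up, so overall availability is roughly the product of the parts.
per_component  = 0.999 # 99.9% per component, assumed for illustration
hours_per_year = 8760

[16, 33].each do |n|
  availability   = per_component**n
  downtime_hours = ((1 - availability) * hours_per_year).round
  puts format("%d components: %.1f%% available, ~%d hours of downtime/year",
              n, availability * 100, downtime_hours)
end
# => 16 components: 98.4% available, ~139 hours of downtime/year
# => 33 components: 96.8% available, ~285 hours of downtime/year
```

Keeping three nines overall with 33 serial components requires each component to be meaningfully better than three nines, which is exactly the “much greater rigor” described above.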

We have replaced reliable local library calls with over-the-network API calls. This has a twofold impact (a short sketch follows the list below):

  • The network adds latency, making every invocation of a remote service a little bit slower than a local library call
  • Any network call can fail or be rate-limited, meaning we need to be more careful about error handling than if we were merely calling code in a locally-available library
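
Here is a hedged sketch of what the second point means in practice: a call that used to be an in-process method now needs timeouts, error handling, and a retry budget. The host, path, and response shape are hypothetical:

```ruby
require "net/http"
require "json"

# What used to be a local call like Inventory.stock_level(sku) now has to
# cope with timeouts, transient network failures, and rate limiting.
def remote_stock_level(sku, attempts: 3)
  uri = URI("https://inventory.internal.example.com/v1/stock/#{sku}")

  attempts.times do |attempt|
    begin
      response = Net::HTTP.start(uri.host, uri.port, use_ssl: true,
                                 open_timeout: 1, read_timeout: 2) do |http|
        http.get(uri.request_uri)
      end
      return JSON.parse(response.body)["quantity"] if response.is_a?(Net::HTTPSuccess)
    rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNREFUSED
      # transient failure; fall through to the retry below
    end
    sleep(0.1 * (attempt + 1)) # simple linear backoff
  end

  nil # the caller must now decide what "inventory unavailable" means
end
```

None of this code was necessary when the same logic was a method call inside the monolith.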

3 new languages have been introduced: Python, Groovy, and Go
While this increases options, it reduces the ability to freely reallocate staff to different teams. In many cases, new languages will need to be learned, which requires ramp-up time and reduces agility.

We now have 3 integration patterns to support:

  • REST APIs
  • gRPC
  • Event Bus

This means that when an API changes, we may need to update the schema registry in the case of event bus integrations or get new client “stubs” with the changes reflected in the case of gRPC. Regardless, we now need to know three different patterns to work across our distributed ecosystem. There is also a risk, if we are sharing a central schema registry, that breaking changes could be introduced that may impact other services.

11 net new Components:

  • API Gateway (Amazon API Gateway)
  • Frontend Platform+Edge Compute (Vercel)
  • Internal DNS (AWS Route53)
  • Kafka for Event Bus (AWS MSK)
  • Kafka for Data Pipelines (AWS MSK)
  • Multiple Kafka Connect Source and Sink deployments to read/write from various databases and topics
  • Data Lake: we’ll assume AWS Lake Formation with Amazon Athena for querying and S3+Parquet for data storage
  • Istio Service Mesh for mTLS and routing between the API gateway and our Python microservice
  • Multiple Kubernetes Clusters (AWS EKS)
  • Ingress Load Balancers (AWS ALB)
  • Jenkins CI/CD

Each additional component requires new or increased staffing to ensure business continuity. Each new component also requires monitoring and regular updates to ensure vulnerabilities are addressed and the component remains healthy. Each component brings with it the risk of misconfiguration, which can increase the cost of operating it through over-provisioning, or increase the risk of failure through under-provisioning.

Approximately 16 Duplicated Components:

  • 3 Databases+read replicas
  • 3 Redis clusters (one for the monolith plus one per microservice)
  • 2 Kafka Clusters
  • 2 DNS systems
    Cloudflare external + Route53 Internal. This does not include Kubernetes internal cluster DNS
  • 2 CI/CD Systems
  • 2 Logging & Observability solutions
  • 2 Kubernetes Clusters

In the best case, this duplication limits the blast radius for outages and allows per-service tuning, and all deployments are managed consistently well across teams and services. In the worst case, they are all managed differently with differing levels of quality and tooling. Regardless of how these duplicated concerns are managed, it usually takes more effort to manage more resources.

Approximately 10–15x increased operating cost
Counting the additional containers, deployments, managed services, and staff required, we can expect operating costs to be substantially higher. Even Amazon, one of the most ardent promoters of microservices and the public cloud, has found the costs to be prohibitive for some internal purposes.

Too Big for Full-Stack Local Development
Even with downsized deployment configurations, there are likely too many moving parts to allow the full system to be run on a developer’s laptop in Docker or minikube. There are ways to enable development in the cloud, but developers will need mocks or shared development instances of services in order to simulate a fully functional system.

How Do We Avoid These Mistakes?

There is technically nothing wrong with the simple or complex system examples above. They both provide similar functionality and the complex system, while it costs much more to build and operate, might be a good fit for some companies at a certain size and scale. That said, the complex example is probably not the best starting point for most companies. To avoid premature optimization, the following approaches are helpful:

Develop a Culture that Rewards Simplicity

When interviewing candidates, instead of encouraging them to impress you with their breadth and depth of knowledge and to design systems at web scale, consider architectural simplification as one of the key criteria you use to evaluate them. Ask questions like, “Can you think of ways to make this system simpler or more reliable?” If they can identify things in their design that they don’t need, reward them for it. If they don’t add unnecessary components in the first place, even better.

After hiring an individual contributor, especially at more senior levels, make architectural simplification a part of their performance evaluation. Reward employees who take systems that are complex and fragile and make them simpler and more robust. Having technical staff on the lookout for efficiencies and simplification will ensure that systems are as nimble and reliable as they can be when you need to move quickly.

Minimize The Number of Simultaneous Changes

When a company begins a difficult exercise like moving bounded contexts out of a monolith, the temptation is to extract, translate, and refactor in a single “big bang” operation. If teams feel that they would like to see a concern implemented in a different language or using different tooling (that they also want to learn), it’s common to sell this translation exercise as a valid and necessary part of the extraction effort.

Assuming the existing language and framework are still adequate for the workload, it’s usually better to first extract the code to an external service and, after the extraction is complete, consider refactoring, porting, or otherwise restructuring the service if something is still lacking. I’ve seen many migration efforts try to do too many things at once, risking the success of the overall effort or making very slow progress. If the existing code is written in Ruby, it’s usually best to keep that constant until the service is truly a separate concern. You can get more help from other engineers if you’re changing as little as possible and they can still understand the code you’re moving.
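
One low-risk way to keep the language constant during an extraction is to hide the bounded context behind a small adapter first, so call sites don’t change when the implementation eventually moves over the network. A minimal sketch, with hypothetical names (BillingService, Billing::Local, Billing::RemoteClient) and an environment variable standing in for whatever feature-flag system you use:

```ruby
# Call sites keep using BillingService.charge(order) before, during, and
# after the extraction; only the backend behind the adapter changes.
class BillingService
  def self.charge(order)
    backend.charge(order)
  end

  def self.backend
    if ENV["BILLING_SERVICE_EXTRACTED"] == "true"
      Billing::RemoteClient.new # thin HTTP client for the newly separate service
    else
      Billing::Local.new # the existing in-process Ruby implementation
    end
  end
  private_class_method :backend
end
```

Once the remote path has proven itself, the local implementation and the flag are deleted; only then is a rewrite in another language even worth discussing.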

Encourage Standards and Consensus

I’ve seen battles between directors, engineers, and architects that are resolved by all parties deciding to do it their own way and eschewing conventions and standards in favor of greenfield development with novel tools and technologies. While this can be empowering to the intrepid engineer, it usually ends in bespoke solutions that don’t work well with other parts of the ecosystem and increase operational cost because they are “exceptional” in all of the wrong ways.

Instead of decoupling, encourage consensus that results in a single way of addressing a concern that can be adopted broadly. Rather than developing tools for yourself or your immediate team, think of everything you build as something that you will share with your entire engineering organization. If the problem is common to many engineers but you don’t think your solution will be adopted, consider enhancing something that’s already being widely used or discussing the concern with other engineers until you can get agreement and buy-in that your approach can be widely used and that they will help you encourage adoption across the organization.

Prefer Doing Nothing Over Doing The Wrong Thing

Many organizations become obsessed with velocity, and use Agile methodologies like Scrum to “maximize flow” and throughput of teams. In many cases, a business doesn’t know exactly what it needs to do so it simply tries to ensure that engineers are kept busy so they remain in a constant mode of rapid execution.

The problem with this approach is that things often get built that aren’t needed, but instead of throwing them away, they are treated as a going concern that requires support and ongoing investment. The reason for this is that most engineers take pride in their work and do not want to work on “throwaway” software.

When it’s unclear what needs to be done, use the time to allow engineers to experiment, update skills, or collaborate with product managers and other leadership to determine what should be done next. This will prevent the creation of “cruft” and will increase ownership when it’s time to execute rapidly once you figure out what to build.

Defer Optimization Until Just Before It’s Needed

It’s always better to be proactive and solve performance and availability problems before they impact the business. It can be tempting to aggressively “future-proof” solutions by making them support massive scale-out before it’s needed, but this is usually a mistake.

Instead, make SLI definition, monitoring, and capacity planning a part of service ownership. When a team is responsible for ensuring service level indicators are well-defined and meeting target objectives, optimization is part of the contract. If the usage patterns for a service will change substantially based on some new product launch or partnership, service owners should be consulted and allowed to factor that into their planning to ensure they can keep indicators in the proper ranges with increased traffic. If a team keeps SLOs consistent while taking on a lot more traffic, reward them (financially), because more volume should mean more revenue.
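
As a concrete (and purely illustrative) example of what well-defined indicators can look like, here is a minimal availability SLI compared against a target objective; the request counts are made up:

```ruby
# A simple availability SLI over a reporting window: the fraction of requests
# that succeeded, compared against the target objective (SLO).
good_requests  = 999_300     # illustrative counts from your metrics system
total_requests = 1_000_000
slo_target     = 0.999       # 99.9% availability objective

sli = good_requests.to_f / total_requests

# Error budget: the errors the SLO allows vs. the errors actually spent.
allowed_errors   = (1 - slo_target) * total_requests # 1,000 requests
actual_errors    = total_requests - good_requests    # 700 requests
budget_remaining = (allowed_errors - actual_errors) / allowed_errors

puts format("SLI: %.4f (target %.4f), error budget remaining: %.0f%%",
            sli, slo_target, budget_remaining * 100)
# => SLI: 0.9993 (target 0.9990), error budget remaining: 30%
```

A team that absorbs a product launch’s worth of new traffic and still has error budget left is doing exactly the kind of work worth rewarding.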

Ensure That Engineering Success and Company Success are Strongly Correlated

Too many times, engineers are rewarded for creating technically impressive solutions, and this “bar raising” is how they justify moving up within the organization. This is often referred to as “resume building”: engineers can show future employers impressive systems they’ve worked on as a means to move out and move up. This is unfortunate because your subject matter expert leaves the building after extolling the virtues of something they built that you may not have needed. Now you must decide whether to support this creation with additional staffing or decommission it due to a lack of internal knowledge.

Instead, encourage engineers to put the needs of the business ahead of technically impressive solutions by ensuring that they always do better when the company does better.

I’ll dedicate another article to this topic, but the TL;DR is to make sure you are (financially) rewarding the engineering behavior you actually need to grow and improve the business.

Written by Isaac Adams

Software engineer, former startup founder, dedicated audiophile, and fan of most music and all cats. Runner, cyclist, and lover of the outdoors.
