Late Stage Microservices

Mike Cvet
Published in Better Programming
5 min read · May 8, 2023


I read about the Amazon Prime Video team’s effort to scale and cut costs by 90% by combining their microservices into a monolith. While I’m sure there’s more to the story here, similar stories have been published around the Internet: a kind of rebuttal to the microservices craze of the last decade.

It reminded me of the last few years I spent at Twitter, where the number of microservices (in the thousands) peaked and then began to fall, even as traffic increased. All kinds of application logic were absorbed into monolithic managed platforms of one sort or another. The principle of creating purpose-built microservices is not intrinsically good, and past a certain point, it didn’t seem to yield positive ROI for many of our teams either.

Midjourney: Sisyphus pushing a boulder, barely strapped together with cables, with visible schematics and computer code contained within it

A microservice architecture can be costly. There has been a narrative over the last decade that this is a natural policy for large engineering organizations, because:

  • Teams can iterate on their services with autonomy from the broader organization and its code
  • Small components of the serving architecture can be modified without disruption of the larger system
  • The architecture allows for the definition of clear, manageable application and failure domains via independent services
  • The architecture allows for independently tunable or scalable workloads

While these attributes can be true, they aren’t exclusive to microservice systems. Relatedly, microservices don’t solve problems of organizational scale for free.

One has to account for the complexity of work that is offloaded to small service-owning teams and the broader ecosystem as a result, in contrast to a single monolith or a small number of them. The offloaded work depends on the maturity of the organization’s technology stack, but it can include:

Distributed operational responsibilities and overhead amongst all service-owning teams (a sketch of this per-team boilerplate follows the list)

  • The understanding and configuration of service infrastructure components: job scheduling and deployment tools, service discovery, RPC frameworks, VMs and associated tuning, service monitoring and logging
  • Distributed definitions and development processes for SLOs and SLIs, service failure and fault handling, backpressure handling and error signaling
  • Distributed and often uncoordinated capacity planning, load testing, integration testing implementation and responsibilities
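
To make that duplication concrete, here’s a minimal sketch (in Python, with invented names like ServiceOpsConfig and timeline-scorer) of the per-service operational configuration each team ends up defining and maintaining on its own; none of it refers to a real system.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SLO:
    # A single service-level objective, e.g. 99.9% of requests under 200 ms.
    name: str
    target: float     # fraction of requests that must meet the objective
    latency_ms: int   # latency threshold the objective is measured against


@dataclass
class ServiceOpsConfig:
    # What a single service-owning team typically has to define and keep
    # current: scheduling, discovery, RPC settings, monitoring, and SLOs.
    service_name: str
    cpu_cores: float
    memory_gb: float
    replicas: int
    jvm_flags: List[str] = field(default_factory=list)          # VM tuning, where applicable
    discovery_path: str = ""                                     # service-discovery registration
    rpc_timeout_ms: int = 500
    retry_budget: float = 0.1                                    # fraction of requests allowed to retry
    dashboards: List[str] = field(default_factory=list)
    alert_routes: Dict[str, str] = field(default_factory=dict)   # alert name -> oncall rotation
    slos: List[SLO] = field(default_factory=list)


# Multiply this by hundreds or thousands of services, each with its own
# slightly different copy, and the coordination cost becomes clear.
timeline_scorer = ServiceOpsConfig(
    service_name="timeline-scorer",
    cpu_cores=4.0,
    memory_gb=16.0,
    replicas=48,
    jvm_flags=["-Xmx12g", "-XX:+UseG1GC"],
    discovery_path="/prod/timeline/scorer",
    dashboards=["https://monitoring.internal/timeline-scorer"],
    alert_routes={"high_latency": "timeline-oncall"},
    slos=[SLO(name="read_latency", target=0.999, latency_ms=200)],
)

print(timeline_scorer.service_name, "with", len(timeline_scorer.slos), "SLO(s)")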

Uncoordinated service API definitions and serving architecture design

  • Application logic which interacts with many backend services is often navigating a series of bespoke service API definitions, semantics, data models, versioning systems and occasionally even differing protocols (see the sketch after this list)
  • Engineers must understand and reason about the overall service architecture and traffic patterns in order to build functional things
  • Organic workarounds for suboptimal or bottlenecked serving paths, distributed data replication and consumption, circular request dependencies
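
As a hypothetical illustration of the first point above, here are two invented backends exposing “the same” user data with different id types, field names and pagination semantics. The service and field names are made up; the reconciliation work that lands on the caller is the familiar part.

from dataclasses import dataclass
from typing import List


@dataclass
class ProfileServiceUser:   # team A's model: integer ids, display_name
    user_id: int
    display_name: str


@dataclass
class GraphServiceUser:     # team B's model: string ids, pagination leaks in
    id: str
    name: str
    cursor: str


def fetch_profiles(ids: List[int]) -> List[ProfileServiceUser]:
    # Stand-in for an RPC to the hypothetical profile service.
    return [ProfileServiceUser(user_id=i, display_name=f"user-{i}") for i in ids]


def fetch_graph_neighbors(user_id: str) -> List[GraphServiceUser]:
    # Stand-in for an RPC to the hypothetical social-graph service.
    return [GraphServiceUser(id=str(n), name=f"user-{n}", cursor="opaque") for n in (2, 3)]


def hydrate_connections(user_id: int) -> List[ProfileServiceUser]:
    # The caller's effort goes into reconciling models: converting id types,
    # mapping field names, and discarding pagination details it doesn't need.
    neighbors = fetch_graph_neighbors(str(user_id))
    return fetch_profiles([int(n.id) for n in neighbors])


print(hydrate_connections(1))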

Fragmented approaches to common classes of functional work

  • Localized approaches to problems of inference and ranking, event processing, HTTP/internal protocol API translation and data model mapping (a small example follows this list)
  • General fragmentation and difficulty holding up standards across the stack
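
A small, contrived example of the localized API translation and data-model mapping mentioned above: two teams independently writing the same internal-model-to-HTTP translation, with conventions that quietly drift apart. Both functions are invented for illustration.

# Team A: camelCase keys, missing fields become empty strings.
def team_a_to_json(user: dict) -> dict:
    return {"userId": user["id"], "displayName": user.get("name", "")}


# Team B: camelCase keys too, but nulls instead of empty strings,
# wrapped in an envelope no other team uses.
def team_b_to_json(user: dict) -> dict:
    return {"data": {"userId": user["id"], "displayName": user.get("name") or None}}


print(team_a_to_json({"id": 1}))   # {'userId': 1, 'displayName': ''}
print(team_b_to_json({"id": 1}))   # {'data': {'userId': 1, 'displayName': None}}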

But in particular, there is a cyclical, self-imposed dysfunction of microservice architecture. I mentioned above that a core principle of this design pattern is the compartmentalization of application and failure domains into independent services.

The problem is that the process of creating, building and operationalizing a new service is expensive. The integration of that service into the production serving architecture is difficult. Tuning another service for efficient operation, setting up the corresponding service accounts, observability and logging profiles, integration tests, the oncall rotation and everything else just adds to the team’s maintenance and operational overhead. Most organizations lack automation and end-to-end scaffolding for this kind of complex boilerplate work.
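
If that scaffolding did exist, it might look something like the sketch below: one entry point that provisions everything a new service needs. The steps are hypothetical placeholders for what is usually a pile of tickets, PRs and manual console work, not a real internal API.

def scaffold_service(name: str, team: str, oncall: str) -> None:
    # Each step below is, in many organizations, a separate ticket or manual
    # task owned by the new service's team rather than an automated call.
    steps = [
        f"create service account svc-{name}",
        f"register {name} with the job scheduler and deploy pipeline",
        f"announce {name} at /prod/{team}/{name} in service discovery",
        f"provision dashboards, logging and alerting profiles for {name}",
        f"add {name} to the staging integration-test suite",
        f"route {name} alerts to the {oncall} rotation",
    ]
    for step in steps:
        print("TODO:", step)


scaffold_service("timeline-scorer", team="timeline", oncall="timeline-oncall")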

Midjourney: A massive tall decrepit monolith surrounded by people looking upwards at it

This overhead is a major disincentive for teams to stick to the UNIX-inspired microservice architectural principle of “do one thing well”. It’s just easier to build new use cases into existing services. At the furthest end of the spectrum, teams operate “macro” services that begin to encapsulate everything the team is responsible for, which might include a variety of products or capabilities. Ownership quality falls over time, because reorganizations are a relatively common occurrence; eventually you’re working with a ball of mud.

Macro-service outages aren’t limited to one specific feature set, but instead take out an unpredictable or unprincipled swath of the system, one which might resemble historical org charts. The site may or may not tolerate these services being unavailable. It’s harder to optimize services for heterogeneous workloads; call paths cannot be disentangled. A former colleague coined a term for this: the “microlith”.
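
A caricature of the shape this takes, with invented endpoints: one deployment, one thread pool and one failure domain shared by latency-sensitive reads, batch work and inherited legacy traffic.

class TeamMacroService:
    # Everything the owning team is responsible for, absorbed into one service.
    def get_timeline(self, user_id: int):
        # Latency-sensitive read path.
        return ["tweet-1", "tweet-2"]

    def rescore_all_timelines(self):
        # CPU-heavy batch job competing with the read path for resources.
        return "scheduled"

    def export_compliance_report(self, month: str):
        # Slow, I/O-bound, rarely called.
        return f"report-{month}.csv"

    def handle_legacy_partner_webhook(self, payload: dict):
        # Inherited during a reorg; nobody remembers the callers.
        return {"ok": True}


# One bad deploy or one saturated endpoint degrades all of these together,
# and capacity has to be provisioned for the union of their workloads.
print(TeamMacroService().get_timeline(42))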

Microservices encode and incentivize local optimization into the serving architecture. It is hard to hold teams accountable to a principled design for serving architecture when the goal is the federation of service development. Similarly, it is difficult to solve backend-wide optimization problems when the properties of dependent services are bespoke. Without a carefully designed serving architecture and supporting service platforms, leverage is hard to find: reusable product platform code is scattered across a series of large-ish services or shared libraries, and the frontend services are usually bespoke.

Microservice leverage has been found at the compute layer (whatever is orchestrating and executing service jobs or containers), service discovery, the VM (applicable in certain contexts), the service framework (like gRPC, Finagle or Rest.Li) and centralized developer tooling. Of course, service meshes nicely encapsulate some of these concerns.
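
As a rough sketch of what framework-level leverage buys, here is a toy client wrapper that centralizes retries and a latency metric. Real frameworks like Finagle, gRPC interceptors or a sidecar mesh do far more (deadlines, circuit breaking, load balancing, tracing), and nothing here reflects their actual APIs.

import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_defaults(rpc: Callable[[], T], *, retries: int = 2) -> T:
    # Retries plus a latency metric, implemented once for every caller in the
    # organization instead of once per service-owning team.
    last_err: Optional[Exception] = None
    for attempt in range(retries + 1):
        start = time.monotonic()
        try:
            result = rpc()
            print(f"rpc_latency_ms={(time.monotonic() - start) * 1000:.1f} attempt={attempt}")
            return result
        except Exception as err:
            last_err = err
    raise RuntimeError("rpc failed after retries") from last_err


print(call_with_defaults(lambda: "ok"))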

Another way to approach these problems is to reduce the functional overhead. Rather than trying to reduce the repeated infrastructure overhead of running each service, you can reduce the cost of teams re-implementing common classes of work. This is why managed platforms are so successful — even if creating or operating a service is expensive, if we can thin out its application logic by outsourcing to a public or private managed platform, it becomes less of a burden on its owning team.
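
A hedged sketch of what that thinning-out looks like: the team’s entire footprint becomes a handler registered with a hypothetical managed event-processing platform, which owns the scaling, retries, monitoring and infrastructure oncall. The platform class and topic names below are invented, not a real API.

from typing import Callable, Dict


class ManagedEventPlatform:
    # Stand-in for a public or private managed platform (a hosted consumer or
    # stream-processing service); scaling and retries are elided here.
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def register(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic] = handler

    def deliver(self, topic: str, event: dict) -> None:
        self._handlers[topic](event)


platform = ManagedEventPlatform()


# The owning team's footprint shrinks to this handler, not a whole service.
def on_new_follow(event: dict) -> None:
    print(f"fan out notification: {event['follower']} -> {event['followee']}")


platform.register("social.follows", on_new_follow)
platform.deliver("social.follows", {"follower": 1, "followee": 2})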

It’s important for organizations to carefully evaluate architectural trade-offs based on their specific needs, goals, and technology stack maturity. Microservices provide certain advantages. They also introduce complexities and operational overhead which seem manageable at first, but compound over time. Failure to address those problems leaves the end-to-end development process no easier than working in a monolith.

A few things to keep in mind, especially if you’re working with a legacy microlith:

  • “Big services” or monoliths are not necessarily bad; they require a different kind of tooling, infra support and set of principles than microservices, but they should be built to capture well-defined common elements of the serving architecture
  • Advocating for the principle of “one task, one service” tends to be much less practical than drawing intuitive boundaries around areas of tight collaboration, performance concerns, and distinct failure domains — and pushing teams to stick to the plan
  • You can alleviate the complexities and overhead of “too many services” by reducing the vertical depth of the stack service owners are responsible for (service frameworks, service mesh, compute), and by narrowing the span of the application space through managed services or platforms for common workflows

--

I’m a former Distinguished Engineer at LinkedIn and Twitter, was an early engineer at a couple startups with successful exits, and hacked around at Red Hat