Better Programming

Can We Stop With Those Horrible “System Overview” Dashboards Already?

Boris Cherkasky
Published in Better Programming · 7 min read · Jan 23, 2023

It’s 2 AM, you’re not asleep, and your phone keeps ringing with production issues.

You are staring at your endless “system overview” dashboard, scrolling up and down through dozens of charts, showing different metrics and SLIs — some you’ve seen before, some are totally new to you, some are broken altogether.

Some charts are spiking, some are dropping, and the “big picture” is just cryptic.

It’s hard to reason about what is going on in the system.

You’re tired. You’re frustrated.

If you are not familiar with this late-night debacle and with such “system overview” dashboards, first, lucky you. Second, let me explain: these are the huge dashboards with too many charts that try to show the technical state of a very complex production system in a single view.

In this post I’ll try to break down why “system overview” dashboards tend to do more harm than good, and if you still choose to use them — I’ll try to nudge you into doing them a bit differently.

Why We’re Eager To Have “System Overview” Dashboards

Every engineer who’s been on call wishes for two things:

The first is for the night to pass without production incidents. The second, if there is an incident, is to have a silver bullet that’ll “tell us” what went wrong.

“System Overview” dashboards try to answer the second wish: a single place that shows the whole state of the system at a glance, and tells the barely awake, very tired and frustrated on-call engineer what happened in the system.

They usually fail to do so, and in the worst cases, when they are poorly built — they can put the miserable oncaller on the wrong path altogether.

Let’s try to understand this by exploring two angles:

  • How dashboards evolve
  • The limitations of building such a dashboard

Evolution of a system overview dashboard

We usually start with a small application: a few microservices, a very homogeneous stack, and very little diversity in the connections between the systems. We build a dashboard that gives us a good understanding of how it operates.

With time, the system evolves with additional subsystems, more diversity, and different stacks, and we usually just add more charts to cover the new parts.

And just like that, little by little, a “system overview” dashboard is born, with little intention and limited planning.

Now that you have an unintentional “system overview” dashboard, engineers adopt it, use it, and like it. Since the system is still simple enough to comprehend, the dashboard is relatively useful!

With time, as more and more incidents happen (with scale and business success of course), additional charts are added ad hoc to cover our blind spots.

The already unplanned and undesigned dashboard is getting out of hand.

You need to scroll to get the intended “overview”, you see charts describing different subsystems, and you may even no longer be familiar with all the subsystems within that dashboard.

The system keeps on evolving, components change, and the overview dashboard starts to be unmaintained and unreliable — it has no owners who clean it up, or too many owners, either way — it’s a mess.

The dashboard now spreads over too many contexts, with limited connection between charts. You get different (and at times contradictory) signals from it, and your most talented engineers build dashboards for themselves, ones that suit their needs.

The dashboard is now a glorified, unmaintainable mess.

The lack of ownership, planning and design is how those dashboards evolve to their worst state, but they have additional inherent limitations.

Limitations

The reason system overview dashboards tend to fail is that they have two major limitations:

First, they usually try to summarize a very complex system into a single dashboard. What worked when your system was 2 microservices using HTTP is not likely to work with 200 microservices using numerous communication protocols. At that phase, most of your engineers don’t even have a full mental model of the system in their heads, let alone the ability to summarize it into one dashboard.

Second, the dashboard is bound to a very small physical area: your computer monitor, or sometimes even worse, a 13-inch or 15-inch laptop screen.

Compare that with a nuclear reactor control room. It monitors a very complex system too, but its “control room” or “system overview” isn’t limited to a screen: the operators have a physical room, with multiple screens, gauges, and other instruments!

Under those limitations, it’s infeasible to both show a very wide “overview” of the system and pinpoint every specific outage cause. Even trying to do so is a lost cause, because the system is just too complex to show in one view.

One of the common ways engineers try to fight this is by adding context: comments, free-text widgets explaining the different charts in the dashboard, and so on. But with the limited physical space of the dashboard, this just adds clutter. What works for simplifying a production dashboard of a single system doesn’t scale that well to a “system overview” use case.

Imagine that every gauge in that nuclear control room had an instruction book next to it (to be precise, I’m sure they have runbooks, just not next to each gauge).

In the final part of this post, I’ll try to explain how some additional context might be useful.

As with all complex systems that have limitations, addressing and solving them requires design, intent, engineering effort, and time.

“System Overview” Dashboards Aren’t Evil — How We Use Them Is

There’s a reason we design our systems in a modular way: code is split into repositories and services, each feature is implemented using many classes, and classes are broken into methods, each with smaller responsibilities.

Debugging code takes time and means going through many layers and components. You don’t just read a single 100K-line “main” method hoping to find the bug.

The same goes for your health. If you have a stomach ache, you go to a physician, who does a first check, sends you for additional tests, and refers you to other doctors for their expertise. There’s a process.

For some reason with the “system overview” dashboard — we try to skip that process, and have this one view to answer everything! And this is the root of all evil with those dashboards.

So the bottom line is: a “system overview” dashboard should provide a wide, product-centric overview of the system. Nothing else.

Doing it right: keep it an overview

Remember the purpose of a “system overview” dashboard — it should answer two questions:

  1. “What is the state of your product?”
  2. “Which components / capabilities of the product are impaired?”

It’s not a silver bullet, and it shouldn’t point to the root cause.
It’s a tool for initial triage, and “one view” of system liveness.

To do so, the dashboard can’t be too in-depth. It should summarize the state of each component into a single signal, “is the component OK?”, most often visualized as a red or green “traffic light”. To make this excellent, add a link to a detailed sub-system dashboard below each such “traffic light”.
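Here is a minimal Python sketch of that roll-up, to make the idea concrete. Everything in it is hypothetical and only for illustration: the names `Light`, `ComponentStatus`, `summarize` and `render`, the check names, and the `dashboards.example.com` URLs. It also assumes each component already exposes a few boolean health checks; how those are computed is out of scope here.

```python
from dataclasses import dataclass
from enum import Enum


class Light(Enum):
    GREEN = "green"
    RED = "red"


@dataclass(frozen=True)
class ComponentStatus:
    name: str
    light: Light
    details_url: str  # link to the detailed sub-system dashboard


def summarize(name, checks, details_url):
    """Collapse a component's health checks into one red/green signal.

    `checks` maps a (hypothetical) check name to True (healthy) or False.
    The component is green only if it has at least one check and all pass;
    a component with no checks is shown as red rather than silently green.
    """
    healthy = bool(checks) and all(checks.values())
    return ComponentStatus(name, Light.GREEN if healthy else Light.RED, details_url)


def render(statuses):
    """Return one text line per component: light, name, drill-down link."""
    return [
        f"[{s.light.value.upper():5}] {s.name} -> {s.details_url}"
        for s in statuses
    ]


if __name__ == "__main__":
    statuses = [
        summarize("checkout", {"error_rate_ok": True, "latency_ok": True},
                  "https://dashboards.example.com/checkout"),
        summarize("search", {"error_rate_ok": True, "latency_ok": False},
                  "https://dashboards.example.com/search"),
    ]
    print("\n".join(render(statuses)))
```

The overview deliberately hides which check failed. That detail belongs on the linked sub-system dashboard, which is where the engineer goes once triage points at a component.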

The only technical metrics (i.e. numbers rather than green/red signals) this dashboard is allowed to have are the main customer-facing SLIs of the product as a whole. For example, if you’re building an ecommerce site, these could be the latency of the website and the rate of placed orders.
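For the ecommerce example, the two numbers could be computed roughly as in the sketch below. It rests on assumptions that are not from the text above: the choice of the 95th percentile for latency, the nearest-rank percentile method, and the function names `percentile` and `orders_per_minute` are all illustrative. In practice your metrics backend would most likely compute these for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def orders_per_minute(order_timestamps, window_start, window_seconds):
    """Rate of placed orders inside [window_start, window_start + window_seconds).

    Timestamps and window_start are in seconds (for example, Unix time).
    """
    window_end = window_start + window_seconds
    in_window = [t for t in order_timestamps if window_start <= t < window_end]
    return len(in_window) * 60 / window_seconds


if __name__ == "__main__":
    latencies_ms = [120, 80, 95, 300, 110]      # made-up page latencies
    order_times = [0, 10, 30, 59, 60, 125]      # made-up order timestamps
    print("p95 latency (ms):", percentile(latencies_ms, 95))
    print("orders/min:", orders_per_minute(order_times, 0, 120))
```

Both numbers describe what the customer experiences, which is what keeps them on the overview. Anything that describes the internals of one subsystem belongs on that subsystem's own dashboard.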

Assign the planning and design of that dashboard to someone on your team. It can be a single person, the oncall engineer, or one of the managers; it doesn’t matter who, as long as it is clear who’s responsible for keeping this important tool alive and maintained.
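One way to keep these rules from eroding over time is to check the dashboard definition automatically. The sketch below is only an illustration under stated assumptions: the `spec` dictionary layout, the panel kinds `traffic_light` and `sli`, the function name `review_overview_dashboard`, and the default limit of 12 panels are all invented here. They are not the format of any real dashboard tool, and the panel limit is an arbitrary stand-in for "fits on one screen without scrolling".

```python
def review_overview_dashboard(spec, max_panels=12):
    """Return a list of problems found in a (hypothetical) dashboard spec."""
    problems = []

    # Someone must be clearly responsible for the dashboard.
    if not spec.get("owner"):
        problems.append("no owner assigned")

    # max_panels is an assumed proxy for "no scrolling needed".
    panels = spec.get("panels", [])
    if len(panels) > max_panels:
        problems.append(f"{len(panels)} panels exceed the limit of {max_panels}")

    for panel in panels:
        title = panel.get("title", "<untitled>")
        kind = panel.get("kind")
        if kind == "traffic_light":
            # Every traffic light should link to a detailed dashboard.
            if not panel.get("details_url"):
                problems.append(f"'{title}' has no link to a detailed dashboard")
        elif kind == "sli":
            # Numeric charts are allowed only for customer-facing SLIs.
            if not panel.get("customer_facing"):
                problems.append(
                    f"'{title}' is a numeric chart that is not a customer-facing SLI"
                )
        else:
            problems.append(f"'{title}' has unsupported kind {kind!r}")

    return problems


if __name__ == "__main__":
    spec = {
        "owner": "",
        "panels": [
            {"title": "Search", "kind": "traffic_light"},
            {"title": "Queue consumer lag", "kind": "sli", "customer_facing": False},
        ],
    }
    for problem in review_overview_dashboard(spec):
        print("-", problem)
```

A check like this could run whenever the dashboard definition changes, so a chart added ad hoc after an incident gets questioned before it becomes permanent.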

Want to know whether you’ve done this right? A colleague, Oleg C, described it best: “If every oncall engineer can easily read and understand your dashboard during a 2AM SEV1 outage, and get an initial triage going, you’ve probably done something right”.

Final Thoughts

A well-designed “system overview” dashboard can be an amazing and useful tool to triage and start understanding production issues. An ad hoc, unplanned one can do more harm than good over time.

Treat your main observability tools and your key dashboards as projects: design them, put in the effort, and make sure they have an owner.

As always, thoughts and comments are welcome on Twitter at @cherkaskyb.

Written by Boris Cherkasky

Software engineer, clean coder, scuba diver, and a big fan of a good laugh. @cherkaskyb on Twitter