The Path of Getting Comfortable in Production

My path to getting comfortable in production

Boris Cherkasky
Better Programming


Image source: https://unsplash.com/photos/feXpdV001o4 by Anas Alshanti

Four years ago, I was quite a decent coder. I had been doing desktop development for about six years and had even managed a small team, but I had never had the chance to work with highly available distributed systems and short delivery cycles.

Four years ago I made the switch, going “all in” at a SaaS company and moving away from my comfort zone to new technologies. It was then that I learned of this thing called “production”: what happens after the last keystroke in the IDE, after my code is shipped to “the cloud”, which at the time was completely unknown to me.

Looking back, I can confidently say that getting comfy in production is a journey that takes time, requires quite a different skill set from coding, and involves a lot of collaborative work.

So if you are a developer who has never gotten an alert from production, or doesn’t fully understand what happens after the CI finishes its part, this post is exactly for you! I won’t explain all of it in detail, but I’ll describe the path I’ve taken, and hopefully inspire you and make your own journey a bit easier.

Step 1 — Make some friends in your Ops/SRE teams

You’re about to NOT understand a lot of things, and you’re going to be lost at times! You’ll need “someone on the inside” to talk to.

Since a lot of the processes, tools, and practices will be company-internal and specific, Googling will only get you so far. You’ll need a mentor.

Step 2 — Revisit your Definition of Done (DoD)

A working feature is good, a tested feature is better, and an automatically tested feature is amazing. You’re not done, though: how do you verify, over time, that your feature still works in production? How will you investigate failures within that feature six months from now? How do you make sure overall production stability remains intact?

You need to change your mindset: software stability evolves and changes over time, so your work is done only when all of the above is covered. And that includes:

  1. Monitoring — your feature’s behavior is monitored, and it does not degrade the overall behavior of the system (the four golden signals are a good starting point).
  2. Logs — make sure you have relevant logs to trace back what goes on in the system, especially when something goes wrong, and make sure the exception and stack trace end up in a system you can search.
    You can start with verbose logging at feature launch, then reduce and refine it over time (a minimal sketch of points 1 and 2 follows this list).
  3. Tracing — make sure your spans and traces are marked and labeled correctly to cover all flows of interest.
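
To make the monitoring and logging parts of this list concrete, here is a minimal sketch, in Python, of what instrumenting a single feature might look like. It assumes the prometheus_client library and the standard logging module; the feature name, the metric names, and the process() function are hypothetical placeholders for the example.

```python
import logging
import time

from prometheus_client import Counter, Histogram  # assumed metrics SDK; use whatever your company runs

logger = logging.getLogger("checkout")  # "checkout" is a made-up feature name

REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["outcome"])
LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency in seconds")


def process(order):
    # Stub standing in for the feature's real business logic.
    return {"order_id": order.get("id"), "status": "confirmed"}


def handle_checkout(order):
    start = time.perf_counter()
    try:
        result = process(order)
        REQUESTS.labels(outcome="success").inc()
        logger.info("checkout succeeded order_id=%s", order.get("id"))
        return result
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        # logger.exception records the stack trace, so failures can be traced back months later
        logger.exception("checkout failed order_id=%s", order.get("id"))
        raise
    finally:
        # Latency is observed for every request, successful or not
        LATENCY.observe(time.perf_counter() - start)
```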

Lastly — understand the difference between positive and negative monitoring, and make sure you’re positively monitoring your feature change.

  • Positive monitoring means you’re actively asserting your feature is working correctly.
  • Negative monitoring (usually) means you’re passively monitoring that nothing was broken.
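
To make the distinction tangible, here is a rough sketch of positive monitoring: a synthetic probe that actively exercises the feature and reports whether it worked, rather than waiting for an error-rate alert to stay quiet. The endpoint, payload, and metric name are assumptions for the example, not anything prescribed above.

```python
import requests
from prometheus_client import Gauge

# 1 if the last synthetic checkout succeeded, 0 otherwise; alert when it stays at 0
PROBE_OK = Gauge("checkout_probe_success", "Result of the last synthetic checkout probe")


def probe_checkout(base_url: str) -> bool:
    """Actively assert the feature works (positive monitoring)."""
    try:
        resp = requests.post(f"{base_url}/api/checkout", json={"sku": "probe-item"}, timeout=5)
        ok = resp.status_code == 200 and resp.json().get("status") == "confirmed"
    except requests.RequestException:
        ok = False
    PROBE_OK.set(1 if ok else 0)
    return ok

# Negative monitoring, by contrast, would be an alert on the service's overall
# error rate or latency: it tells you nothing obvious broke, not that the feature works.
```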

Step 3 — Understand your infra

You have to understand what takes your code to production: how it’s built and how it’s deployed. Those are as important as understanding the code itself, since they can cause problems, or solve them, just as easily.

Being able to understand and control this process gives you superpowers: you can create custom images, interact with environments, and even create environments of your own!

Understand your build and CI process. Ideally, add some functionality to it, and make sure you can change it when the opportunity arises. Learn and get comfortable with the rollout and revert process.

Understand your production layout: where your instances are, what their runtime is, how they are configured and connected to the world, how new software gets delivered to them, and how you can divert traffic to or away from them.
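
As one concrete illustration, assuming (purely for the sake of the example) that production runs on Kubernetes and that you have read access through the official kubernetes Python client, here is a rough sketch of exploring that layout programmatically. The “payments” namespace is made up, and kubectl get deployments gives you the same picture.

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig credentials
apps = client.AppsV1Api()

# List every deployment in a (hypothetical) namespace with its replica count and images
for dep in apps.list_namespaced_deployment(namespace="payments").items:
    images = [c.image for c in dep.spec.template.spec.containers]
    print(
        f"{dep.metadata.name}: "
        f"{dep.status.ready_replicas or 0}/{dep.spec.replicas} replicas ready, "
        f"images={images}"
    )
```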

Step 4 — Monitor, Monitor, Monitor

I used to start every workday by checking the 3–4 main dashboards for the systems in my domain. I was getting to know the traffic patterns and understanding what “normal” looks like (check out this talk that coins the term “intuition engineering”).

So, what is “normal”? What was I looking at?

The obvious metrics were latency, throughput, and error rates (yep, some systems have errors as a “normal” state), but I went deeper: common CPU utilization patterns, network, memory, latency distributions, customer traffic distribution. You could have pinged me at any time of day, and I would have told you, with a high degree of confidence, the throughput of our system, which customers generated most of the volume, and the number of replicas in the fleet. Knowing the “normal”, I could spot the “anomalies”.
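
If your metrics happen to live in Prometheus, one way to build that intuition is to pull the headline numbers every morning until you know them by heart. A minimal sketch follows; the Prometheus address and the metric and label names in the queries are assumptions about a typical setup, not a description of any particular stack.

```python
import requests

PROM = "http://prometheus.internal:9090"  # placeholder address

QUERIES = {
    "throughput (req/s)": 'sum(rate(http_requests_total[5m]))',
    "error rate (5xx/s)": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "p99 latency (s)": 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name}: {value:.2f}")
```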

At some point, this constant monitoring and experimenting with metrics, logs, and observability tools will lead you to being able to monitor anything, with different tools and from different angles, making you independent. I’ve learned the hard way that observability tools can fail during outages too, and being able to monitor your systems with a level of redundancy is a gift.

Lastly, investigate abnormalities, even ones that didn’t cause any alerts or visible impact. You’ll learn so much from them! You might learn that some of your customers have traffic spikes, or that your system goes through cleanup cycles at certain times of the day (Postgres VACUUM or GC runs, for example).

In one interesting case, I managed to discover a bad integration with one of our customers by investigating a single abnormal traffic pattern. The system became better for it.

Step 5 — Remove “not my responsibility” from your vocabulary

To really learn and feel the system, you can’t limit yourself to the 4–5 services or components under your domain. You need to understand how other systems work too, what makes them tick.

It doesn’t have to be a thorough, 360-degree understanding of every bit and byte, but a general scheme of things can be very insightful.

After you’ve been monitoring production for long enough (step 4), “diving into” other systems, and moreover understanding their problems, becomes that much easier. You can tell from the logs when the DB is returning errors even if you’re not familiar with that DB or that service, and you’re already an expert at digging out those error logs.

HTTP error codes are the same regardless of service. Maybe you won’t fully understand the root cause or impact — but it’s totally within your power to understand what’s going on, and what’s failing.
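
For example, if logs are shipped to Elasticsearch, the same query that digs out your own service’s 5xx errors digs out everyone else’s. A sketch under that assumption (the index pattern and field names are invented for the example):

```python
import requests

ES = "http://elasticsearch.internal:9200"  # placeholder address

# Count 5xx log entries from the last 15 minutes, grouped by service
query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"range": {"status": {"gte": 500}}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
}

resp = requests.post(f"{ES}/logs-*/_search", json=query, timeout=10)
for bucket in resp.json()["aggregations"]["by_service"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} 5xx responses in the last 15m')
```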

Hopefully, most systems are stable enough that other services are just another learning opportunity. Moreover, understanding more systems makes you far more confident and capable of assessing business impact (wait for step 7).

Step 6 — Collect incidents — investigate each one

Jump on incident war-rooms, sit at the back. Just watch and listen. Learn, feel the excitement (or terror).

Go through or attend postmortems. Understand what happened, which SLIs were impacted, and how to monitor them. Those SLIs are later translated into business impact!
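
If “SLI” is still an abstract term for you, a toy example: availability over a window is just good requests divided by total requests, and that single number is what gets translated into business impact. The request counts below are made up.

```python
# In practice this is usually a PromQL expression such as:
#   sum(rate(http_requests_total{status!~"5.."}[30m])) / sum(rate(http_requests_total[30m]))
total_requests = 1_200_000   # requests served in the window (made-up number)
failed_requests = 1_800      # 5xx responses in the same window (made-up number)

availability_sli = (total_requests - failed_requests) / total_requests
print(f"availability SLI over the window: {availability_sli:.4%}")  # -> 99.8500%
```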

You’ll learn so much just from being there and seeing how engineers operate. Each such incident will be embedded into your professional self, and the next time something similar happens (and it will), you’ll have a bit more context than everyone else, and a head start.

You’ll understand the difference between the impact (what happens to the business), the symptoms (what happens to your systems), and the root cause (what started the debacle).

Eventually, you’ll just get comfortable with production incidents. They’re part of the job, just like giving a code review or talking through your design.

Step 7 — Get to know the business

Your average sales engineer does not understand what “we’re having latency issues around Elasticsearch” means for their customers! There’s a domain barrier between the technical implementation of the system and the value it generates for the end consumer.

As an engineer, you have to understand the bottom-line value your components provide, and be able to explain the impact of any issue in the system from the end user’s perspective.

A few years ago, when we were a small startup, in one famous incident an on-call engineer reported to the business that “there’s a backlog of jobs”. The business, obviously, had no clue what those jobs were, which raised a lot of questions and caused plenty of unnecessary panic.

Step 8 — Ask for guidance once, do it yourself the next time

Try to be as independent as possible; learn to do anything within your permission set and your role’s boundaries:

  • Need to change how your image is built? Amazing! Ask for guidance, and later make sure you can redo it independently.
  • Need to ssh or kubectl into some instance to check something? Amazing! Sit next to your favorite colleague, write down what they are doing, and later make sure you can redo it independently (a sketch of the programmatic equivalent follows this list).
  • Need to create some elaborate metric in Prometheus or a dashboard in Kibana? Amazing! Take one of your observability experts by the hand, learn how they do it and what they focus on, and later make sure you can redo it independently.
  • Need to debug something complex in production? Amazing! Sit down with your team lead, jot down how it should be done, and later make sure you can do it independently.
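
For the ssh/kubectl bullet above, here is a minimal sketch of the programmatic equivalent, assuming Kubernetes and the official kubernetes Python client; the namespace, label selector, and error string are placeholders for the example.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Pull the last 200 log lines from each pod of a (hypothetical) service and count ERROR lines
pods = core.list_namespaced_pod(namespace="payments", label_selector="app=checkout")
for pod in pods.items:
    logs = core.read_namespaced_pod_log(
        name=pod.metadata.name, namespace="payments", tail_lines=200
    )
    errors = [line for line in logs.splitlines() if "ERROR" in line]
    print(f"{pod.metadata.name}: {len(errors)} ERROR lines in the last 200 log lines")
```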

Once you’re able to do all of those things yourself, you can start taking more responsibility within the team and delivering value faster. Moreover, you’ll start understanding the DevOps domain and jargon, and you’ll be able to communicate with the operations team that much better.

Retrospective

There you have it: this is the path I’ve taken. It’s an intensive journey, involving dozens of services, lots of colleagues, a few mentors, and the occasional outage. The skills gathered along the way are priceless, and the possibilities are endless.

Want to Connect? Hope you enjoyed this, and feel free to reach out with anything on Twitter @cherkaskyb
