What Changes Do Most Observability Tools Miss?
If every service outage can ultimately be traced back to a change, why don’t our monitoring and observability tools provide a better understanding of change impact, so we can prevent outages and shorten the time to mitigate problems? Put more simply, why don’t modern tools provide DevOps change monitoring? Human errors, software bugs, and automated responses have caused major outages over the past year (Cloudflare, Fastly, Akamai, Facebook, AWS, Azure, Google, Atlassian). For a more complete list of causes of failures over the years, check out Dan Luu’s list of post-mortems. There’s a common thread in these failures: planned changes caused unexpected consequences. This can happen for a variety of reasons: microservices, containerization, and orchestration strategies can produce combinations of service interactions that negatively impact service performance.
What most post-mortems are not telling you
Most post-mortems identify several areas of improvement to prevent similar problems from recurring: improving change procedures and automation to ensure there are no unintended consequences, redesigning the system architecture to allow for mitigations, and adding automation to mitigate some or all of the impact. The focus is mostly on improving processes and automation in pre-production. But when a simple configuration error can cause a domino effect, there’s a bigger lesson to be learned: if you can’t tell whether anything has changed in your environment, you can’t start the troubleshooting process reliably and efficiently.
In today’s complex IT environments, we’re dealing with unknown unknowns in production. No single DevOps engineer can possibly know all the infrastructure, all the interconnected services, and all the changes happening around them. If you can’t see how your code deployment or configuration change impacts production, how can you reduce change management risk? With changes happening at an ever faster pace, DevOps teams can protect themselves from major service outages by monitoring changes in production environments.
Unlearning old habits
If you’re a swimmer, then you’ve probably heard about the Shaw Method of swimming. It’s considered very effective in making swimming feel effortless, comfortable, and efficient. The real challenge of the Shaw Method is unlearning the bad habits we’ve been practicing for years. In order to move with ease and without strain, you have to become more mindful and pay attention to the relationship between the head, neck, and back throughout the stroke cycle.
This method is applicable to observability. To move forward with observability and troubleshoot better, you need to unlearn the traditional way of using logs, metrics, and events for monitoring. Imagine swimming in an ocean of data. Instead of zigzagging through the water (data) with frantic arm, head, and leg movements (separate tools, each with its own UI and query language) during an incident response, how much more efficient would it be if you could see the connections between the data? When you can make associations between operational data and understand the relationships between the components in your IT environment, first-level triage becomes easier. Start by asking: What caused the change? What are the useful logs or traces that service owners can use to troubleshoot effectively? How can I provide access to that data without making them switch between tools?
Connecting the right data, and tracing forward
With traditional monitoring tools, we got used to finding and searching for logs, metrics, and traces after something goes wrong. To practice better observability, we need to move from finding and searching to building meaningful data connections from the start. Then, with the right associations and application topology visualizations, we can proactively monitor changes in our environment and trace forward and backward at any time.
Tracing forward tracks all changes, such as a code commit or configuration change, along a timeline. But time alone is not enough to pinpoint a cause. To accurately monitor how a change impacts a service’s upstream and downstream dependencies, we have to link symptoms to services. For example, assume you committed code in Git and need to monitor whether the change impacts the service. The timeline would need to show the commit and pipeline runs on a particular branch, change events on related infrastructure, such as Kubernetes and AWS, and upstream service status.
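The timeline described above can be sketched as a single ordered stream of change events tagged by source and service. This is a minimal illustration, not any particular product’s data model; the event sources, service names, and timestamps are made up:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(order=True)
class ChangeEvent:
    # Ordering by timestamp lets us merge commits, pipeline runs,
    # and infrastructure events into one timeline.
    at: datetime
    source: str = field(compare=False)   # e.g. "git", "ci", "kubernetes", "aws"
    service: str = field(compare=False)  # service the change applies to
    detail: str = field(compare=False)

timeline = sorted([
    ChangeEvent(datetime(2022, 3, 1, 9, 0), "git", "checkout", "commit a1b2c3 on main"),
    ChangeEvent(datetime(2022, 3, 1, 9, 5), "ci", "checkout", "pipeline run deployed"),
    ChangeEvent(datetime(2022, 3, 1, 9, 6), "kubernetes", "checkout", "rollout of new spec"),
    ChangeEvent(datetime(2022, 3, 1, 9, 9), "aws", "payments", "load balancer target unhealthy"),
])

def trace_forward(timeline, since, services):
    """All change events on the given services from a starting change onward."""
    return [e for e in timeline if e.at >= since and e.service in services]

# Follow the 09:00 commit into events on the service and its dependencies:
events = trace_forward(timeline, datetime(2022, 3, 1, 9, 0), {"checkout", "payments"})
for e in events:
    print(e.at.time(), e.source, e.service, e.detail)
```

The key design point is that every source writes into the same stream, so one query spans Git, CI, Kubernetes, and cloud events instead of four separate tools.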
For the developer, this speeds up troubleshooting in production. When investigating the impact of a code change, developers and service owners can trace forward to:
- Infrastructure change
- Kubernetes spec change
- Kubernetes image change
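The spec-versus-image distinction above can be made concrete by diffing two Deployment manifests. This is a simplified sketch that operates on manifests already parsed into dicts (for example, from `kubectl get -o json`); the manifests themselves are invented:

```python
import copy

def classify_change(old, new):
    """Label a Deployment manifest change as an image change, a spec change,
    or both, by comparing container images separately from everything else."""
    def images(manifest):
        containers = manifest["spec"]["template"]["spec"]["containers"]
        return {c["name"]: c["image"] for c in containers}

    def without_images(manifest):
        # Compare the rest of the pod template (env, resources, probes, ...)
        m = copy.deepcopy(manifest)
        for c in m["spec"]["template"]["spec"]["containers"]:
            c.pop("image", None)
        return m

    kinds = []
    if images(old) != images(new):
        kinds.append("image change")
    if without_images(old) != without_images(new):
        kinds.append("spec change")
    return kinds or ["no change"]

old = {"spec": {"template": {"spec": {"containers": [
    {"name": "app", "image": "registry.example/app:1.0", "env": []}]}}}}
new = {"spec": {"template": {"spec": {"containers": [
    {"name": "app", "image": "registry.example/app:1.1", "env": []}]}}}}

print(classify_change(old, new))  # ['image change']
```

An image bump and a spec edit often warrant different triage paths (rollback the image versus revert the config), which is why labeling them separately on the timeline is useful.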
This timeline is just one part of the DevOps dashboard we’re building. For DevOps change monitoring, we’re thinking about how DevOps teams can see:
- The blast radius of a change
- Certificate expiration dates as events and root causes
- Config file changes as events and root causes
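One way to reason about the blast radius of a change is to walk the service dependency graph from the changed component to everything that depends on it. The graph below is hypothetical, and real topologies would be discovered rather than hand-written:

```python
from collections import deque

# Hypothetical dependency graph: service -> services that call it (dependents).
dependents = {
    "database": ["orders", "inventory"],
    "orders": ["checkout"],
    "inventory": ["checkout"],
    "checkout": ["storefront"],
    "storefront": [],
}

def blast_radius(changed):
    """BFS from the changed service to every service that could be affected."""
    seen, queue = set(), deque([changed])
    while queue:
        svc = queue.popleft()
        for dep in dependents.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# A schema change on the database could ripple to all four dependents:
print(sorted(blast_radius("database")))
```

Scoping the blast radius this way turns “something changed somewhere” into a short list of services whose alerts and dashboards are worth checking first.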
The beauty of this timeline is that you can also trace backward through changes for root cause analysis.
Why DevOps change monitoring matters
If 76% of all performance problems can eventually be traced back to changes in the environment, and most of the downtime is spent looking for the change culprit, it’s time to connect the dots and track changes and relationships to troubleshoot better. If we as an industry require developers to own their code, then we need to give them tools to easily track changes in production. At the end of the day, developers need to be able to answer:
- Did I break anything? Where?
- Did my changes behave the way I expected them to?
With a timeline that serves as a system of record for changes and all the different streams of activity, it becomes easier for developers and DevOps teams to trace forward and backward for real-time troubleshooting. You no longer need to hunt through log files to figure out who made a change, or wait on a response in Slack. Logs and distributed traces are still useful, but they should be reserved for more detailed investigation once you know where to look, not used as a starting point.
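Tracing backward is the mirror image of tracing forward: start from the symptom and walk the timeline in reverse to the most recent changes that could explain it. A minimal sketch, with illustrative event data:

```python
from datetime import datetime

# Each change event: (timestamp, service, description). Illustrative data.
changes = [
    (datetime(2022, 3, 1, 8, 50), "payments", "config map updated"),
    (datetime(2022, 3, 1, 9, 5), "checkout", "deployed commit a1b2c3"),
    (datetime(2022, 3, 1, 9, 20), "search", "index rebuild"),
]

def trace_backward(changes, symptom_at, suspect_services, limit=3):
    """Most recent changes on the suspect services before the symptom,
    newest first -- the natural first-level triage order."""
    candidates = [c for c in changes
                  if c[0] <= symptom_at and c[1] in suspect_services]
    return sorted(candidates, key=lambda c: c[0], reverse=True)[:limit]

# An alert fires on checkout at 09:10; checkout also depends on payments.
suspects = trace_backward(changes, datetime(2022, 3, 1, 9, 10),
                          {"checkout", "payments"})
for at, svc, what in suspects:
    print(at.time(), svc, what)
```

The 09:05 deploy surfaces first and the earlier payments config change second, while the unrelated 09:20 search event is excluded, which is exactly the short list a responder wants before opening any logs.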
If you feel change anxiety every time you push the deploy button (even for small changes), let us know how we can help. We’ll add you to the list for early access to our beta.