Observability in the time of microservices
If you’d like to get early access to the observability platform, shout out here!
Traditionally, software was written as a monolithic application comprising a frontend service, a backend service, and a database. From a monitoring perspective, you needed to watch a limited number of VMs or servers. Traditional APM (Application Performance Management) tools provided code-level visibility into performance bottlenecks, since the infrastructure consisted primarily of a small number of large application components with limited cross-service interactions.
The overhaul of application architecture, with monolithic applications broken down into hundreds of microservices in the wake of containerization, calls for a paradigm shift in monitoring. This new way of monitoring microservices-driven infrastructure is popularly called Observability.
In a microservices environment, the number of containers to monitor is very large, and these containers live and die dynamically on the order of minutes or hours. Each container is also lightweight and runs only a small piece of code. Traditional code-level monitoring by APM tools no longer suffices, because the real performance bottlenecks often occur in service-to-service interactions. And unlike traditional setups, where you can run a monitoring agent on each server, you cannot run an agent for every container: the overhead would be huge.
Monitoring before microservices architecture was largely about collecting signals such as logs and metrics from systems and analyzing them to find the underlying cause of an issue. When requests went through monolithic apps, this approach, however manual, was manageable given the small number of service hops.
Increased adoption of public cloud, microservices development, and the need for agility have led to an explosion in the number of endpoints and services to monitor, generating more data than humans can digest and diagnose.
We’re reaching a point where each additional data point hinders root cause analysis more than it helps.
It is no surprise that the vast majority of users rely on an average of 3+ tools, mostly because different tools offer different capabilities and collect different data sources.
Monitoring is alerting on the basis of core symptoms. Observability is surfacing the root cause after automatically correlating across various data sources. Current monitoring tools fall short of surfacing those events within your deployments in the following ways:
- SRE teams and developers usually have plenty of monitoring tools and tons of signals, yet the vast majority of the metrics are never looked at. A lot of tribal knowledge goes into debugging an error.
- Alerts are either too noisy or too sparse. Despite teams sometimes being buried under alert storms, issues are frequently escalated by users themselves.
In effect, observability isn’t a substitute for monitoring; the two are complementary. Observability products use the signals produced by monitoring products and give you insights out of the box. As per the SRE book*, “monitoring of complex systems should itself be simple”. A comprehensive observability platform ties all these aspects into one coherent story so that your developers and SREs can take the actions they need to. That means collecting the different data sources from your deployments at various collection points and combining them to give you insights for anomaly detection and root cause analysis.
An observability platform would enable you to derive insights by analyzing the traffic patterns within your deployments. This includes (see the sketch after this list):
- understanding which metrics matter
- understanding the relationship between several data sources
- identifying which of those are anomalous
- escalating to the right stakeholders
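To make that list concrete, here is a minimal, hypothetical Python sketch of such a pipeline: it flags anomalous metric series with a simple z-score and routes them to the owning team. The names (MetricSeries, SERVICE_OWNERS, the print standing in for a pager call) are illustrative assumptions, not any product’s API.

```python
# Hypothetical sketch of the insight pipeline described above.
# MetricSeries, SERVICE_OWNERS, and the print-as-page are illustrative only.
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class MetricSeries:
    service: str          # emitting service, e.g. "checkout"
    name: str             # metric name, e.g. "p99_latency_ms"
    values: list[float]   # recent samples, oldest first


# Illustrative ownership map: which team gets paged for which service.
SERVICE_OWNERS = {"checkout": "payments-oncall", "catalog": "storefront-oncall"}


def is_anomalous(series: MetricSeries, z_threshold: float = 3.0) -> bool:
    """Flag the latest sample if it sits more than z_threshold standard
    deviations from its history; a stand-in for 'identifying which of
    those are anomalous'."""
    history, latest = series.values[:-1], series.values[-1]
    if len(history) < 2 or stdev(history) == 0:
        return False
    return abs(latest - mean(history)) / stdev(history) > z_threshold


def escalate(anomalies: list[MetricSeries]) -> None:
    """Route each anomaly to the owning team -- 'escalating to the right
    stakeholders'. The print would be a pager/chat integration in practice."""
    for series in anomalies:
        owner = SERVICE_OWNERS.get(series.service, "sre-oncall")
        print(f"page {owner}: {series.service}/{series.name} looks anomalous")


if __name__ == "__main__":
    fleet = [
        MetricSeries("checkout", "p99_latency_ms", [110, 118, 112, 109, 640]),
        MetricSeries("catalog", "error_rate", [0.01, 0.02, 0.01, 0.02, 0.01]),
    ]
    escalate([s for s in fleet if is_anomalous(s)])
```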
A monitoring platform provides the raw signals about the health of your application, while observability has the potential to quantify end-user impact and be more nuanced in pinpointing the root cause (such as a release).
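As a toy example of what that nuance can look like, the sketch below assumes a stream of per-request records tagged with the user and the release that served them: the same data both counts impacted users and points at the suspect release. The field names are illustrative assumptions.

```python
# Toy illustration: quantify end-user impact and hint at the offending release
# from the same request records. Field names are assumptions for the example.
from collections import Counter

requests = [
    {"user": "u1", "release": "v42", "ok": True},
    {"user": "u2", "release": "v43", "ok": False},
    {"user": "u3", "release": "v43", "ok": False},
    {"user": "u4", "release": "v42", "ok": True},
    {"user": "u2", "release": "v43", "ok": False},
]

# End-user impact: distinct users who hit at least one failed request.
impacted_users = {r["user"] for r in requests if not r["ok"]}

# Root-cause hint: error counts per release point at the suspect rollout.
errors_by_release = Counter(r["release"] for r in requests if not r["ok"])

print(f"{len(impacted_users)} users impacted; "
      f"suspect release: {errors_by_release.most_common(1)[0][0]}")
```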
A common analogy is the journey from the cars of yesterday to a self-driving car.
The holistic way to achieve microservices observability is to base it on the interactions between those microservices. Every other data source, whether metrics, logs, or other data points, has to be correlated with the path a request takes. The trace thus becomes the substrate of modern application debugging, with everything else correlated to it. Moreover, with the explosion of endpoints and the massive amounts of data available, the only way to debug modern applications is through insights derived from that data.
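As a toy illustration of this trace-centric view, the sketch below assumes every span, log line, and metric point carries a trace_id, and simply groups them by trace so one request’s whole story can be pulled up in one place. The record shapes and the signals_for_trace helper are assumptions for the example, not a real API.

```python
# Minimal sketch of trace-centric correlation: every signal carries the
# trace_id of the request it belongs to, and debugging starts by pulling
# everything recorded for one trace. Record shapes are assumptions.
from collections import defaultdict
from typing import Any


def signals_for_trace(trace_id: str,
                      spans: list[dict[str, Any]],
                      logs: list[dict[str, Any]],
                      metrics: list[dict[str, Any]]) -> dict[str, list]:
    """Group spans, log lines, and metric points by the trace they belong to,
    then return the bundle for trace_id -- 'everything else correlated to the
    trace' in code."""
    by_trace: dict[str, dict[str, list]] = defaultdict(
        lambda: {"spans": [], "logs": [], "metrics": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for line in logs:
        by_trace[line["trace_id"]]["logs"].append(line)
    for point in metrics:
        # Metric points are usually keyed by service and time; here we assume
        # the pipeline already attached a trace_id to each point.
        by_trace[point["trace_id"]]["metrics"].append(point)
    return by_trace[trace_id]


if __name__ == "__main__":
    spans = [{"trace_id": "t1", "service": "frontend", "duration_ms": 480},
             {"trace_id": "t1", "service": "checkout", "duration_ms": 455}]
    logs = [{"trace_id": "t1", "service": "checkout",
             "msg": "payment provider timeout"}]
    metrics = [{"trace_id": "t1", "service": "checkout",
                "name": "db_conn_wait_ms", "value": 430}]
    print(signals_for_trace("t1", spans, logs, metrics))
```

The design choice worth noting is simply that the trace_id acts as the join key: which metrics matter and which are anomalous are questions asked of signals already grouped this way.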
Would a platform that ties together the three pillars of monitoring and surfaces insights automagically be interesting? Please let us know here!
PS: *The author of this post is also an author of the SRE book, so yes, a bit of shameless promotion for the book.