It is hard to dispute that the people who know the most about what a service does and how it behaves are the folks who designed and built it. It follows that if you want to monitor that service to ensure that it behaves as expected, those same folks are best positioned to identify what to watch. This matters because the need for monitoring is well established: no one wants their service to fail and not be aware of it until customers start complaining. However, in many organizations, monitoring is an afterthought, often tacked on as an additional function as the service moves out of development and into the production environment – the proverbial “tossing over the wall”.
As a result, these organizations often employ a separate team (monitoring group, surveillance team, etc. – lots of names, same function) made up of monitoring specialists who are tasked with retroactively providing this key operational capability. Unfortunately, they are generally not the designers or implementers of the services being monitored, and so they are not the subject matter experts who understand the inner workings of the systems they are monitoring. The outcome is often monitoring that has little relevance to the services it is supposed to cover. The symptoms will be familiar to many of you:
- Frustration at the lack of understanding and visibility into the state of key services
- Resentment towards spend on enterprise monitoring tools that appear to have limited effectiveness
- Perception of a lack of skills and/or capabilities on the part of the monitoring team
Attempting a workaround…
To address these problems, you may find that individual service teams set up parallel monitoring efforts focused on their own services. In principle, I’m not against this. I’ve already pointed out that the folks best positioned to understand what needs to be monitored for a service are the same ones who designed and implemented it. This is a primary feature of DevOps and SRE practices. But running parallel, separate monitoring infrastructures has its own challenges that I’ll discuss a bit later.
Given this situation, it is certainly fair to ask the question: do you even need a monitoring team? Now, before anyone gets too upset, I believe the answer is yes. But the role of this team is different than what is likely in place today.
First, let me get one thing out of the way. I don’t believe that the problem here is a lack of desire or capability on the part of monitoring teams. In almost every case, they are trying to do the right things. The problem is a mismatch between capabilities and expectations. It arises because the monitoring team is responsible both for running the platforms that provide the monitoring capabilities (infrastructure instrumentation, APM, log aggregation, event correlation, etc.) and for implementing how those capabilities are used (definition of metrics, assignment of thresholds, design of dashboards, etc.). This is at the core of the problem.
Platform versus Content
At the risk of slipping into hyperbole, asking the monitoring team to be the maintainers of the shared monitoring platforms as well as the authors of their content is akin to asking the folks over at YouTube not only to maintain the video sharing platform but also to populate it with content. YouTube would be a much less effective environment if this were the case. The variety, quality, and specificity, not to mention the volume, of content available on YouTube is a direct outcome of delegating content production away from the platform developers/maintainers to the larger consumer community.
On the other hand, asking all those individual content authors to also provide their own video platform has major challenges. Sure, you can argue that it offers an opportunity for the specific content producer to tailor the delivery platform to their needs, but few will have the expertise or resources to do this. Even those that do will now be unable to take advantage of the benefits of a common platform such as easily identifying correlation between other content producers based on shared interests.
The same holds true for monitoring. You can envision an environment in which the service owners each provide their own specific set of monitoring tools and apply them to their own specific needs. But the running of these tools to provide monitoring capabilities is not, nor should it be, the core business of these service owners. Even in the case of DevOps and SRE, where monitoring is very much a first-class citizen, it is the utilization of the monitoring capabilities, not creation of the platform, that is central to the paradigm.
Furthermore, if these team-specific initiatives proceed independently of a more holistic cultural approach, the result at the organizational level can be a proliferation of independent but overlapping, and even conflicting, capabilities and tooling. The financial impacts are self-evident, but there are also negative impacts on value-added functions that rely on the dependencies between different services and shared infrastructure, such as correlation and incident management. I would even go so far as to say that this lack of a shared platform largely precludes the possibility of leveraging some of the most promising aspects of modern technologies such as AI.
Using Experts in the areas they are expert in
If only it were possible to assemble a group of monitoring domain experts who could take responsibility for running the shared components of a monitoring platform and make monitoring capabilities available to the service teams that use them. Oh, wait, it is possible – it’s the monitoring team!
So, as I said earlier: yes, you do need a monitoring team, but it likely shouldn’t have the mandate that it has today. Begin by separating the concept of the monitoring platform from its content. On one side you’ll have the monitoring team responsible for implementation and support of the shared monitoring platform components. On the other side you’ll have the service owners using the provided monitoring platform to realize their service-specific monitoring content. Together, they can realize a monitoring practice that is more complete, efficient, and successful than either could on their own.
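To make the platform/content split a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented purely for illustration: `MonitoringPlatform`, `AlertRule`, the service names, the metric names, and the thresholds do not refer to any real monitoring product. The idea is simply that the monitoring team owns the class that stores and evaluates rules, while each service team owns the rules it registers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    """Monitoring *content*: written by the team that owns the service."""
    service: str
    metric: str
    threshold: float
    description: str


@dataclass
class MonitoringPlatform:
    """Shared *platform*: run by the monitoring team for every service."""
    rules: List[AlertRule] = field(default_factory=list)

    def register(self, rule: AlertRule) -> None:
        # Service owners contribute content; the platform only stores it.
        self.rules.append(rule)

    def evaluate(self, service: str, samples: Dict[str, float]) -> List[str]:
        # Compare the latest samples for one service against its own rules.
        alerts = []
        for rule in self.rules:
            if rule.service != service:
                continue
            value = samples.get(rule.metric)
            if value is not None and value > rule.threshold:
                alerts.append(
                    f"{rule.service}: {rule.description} ({rule.metric}={value})"
                )
        return alerts


# One shared platform instance, operated by the monitoring team.
platform = MonitoringPlatform()

# Each service team registers the rules only it knows how to write.
platform.register(AlertRule("checkout", "error_rate", 0.05, "error rate too high"))
platform.register(AlertRule("search", "p95_latency_ms", 800.0, "search is slow"))

# The platform evaluates incoming samples against the owners' rules.
checkout_alerts = platform.evaluate("checkout", {"error_rate": 0.12})
search_alerts = platform.evaluate("search", {"p95_latency_ms": 350.0})
```

Because every rule lives in one shared place, the monitoring team can build cross-service capabilities such as correlation on top of it without needing to understand the inner workings of any individual service.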