[part 2] Approaching your Observability Strategy

Darren Boyd
Dec 8, 2022

Updated: Dec 13, 2022

I want to start this post by paraphrasing a quote from a truly revolutionary figure, Carlota Perez, who stated,

"Technological revolutions require existing businesses to master a new means of production. For the digital revolution that we firmly sit in now, that new means of production is software."

This ongoing digital revolution demands new mastery of software delivery, which in turn is driving the need for Observability. Some of the contributing factors include:

Software architectural complexity, composition, operating costs, and time-to-market pressures are ever increasing
Managing data ingest costs with existing observability tooling, while preserving the value of the data, requires constant evaluation and consideration
Tool sprawl and tool dependence are creating solutions that are often siloed and therefore brittle
Resilience engineering is an imperative for any team providing Platform Services as a centralized function
Product Teams often need to introduce their own tooling to meet evolving and novel needs not met by current corporate strategy
End-to-end (E2E) testing complexity and testing economics are normalizing Testing in Production (TiP) and Chaos Engineering as accepted standards, which by their very nature require the ability to ask arbitrary questions of your services

What is the state of Observability today?

The concept of Observability is forever evolving, and there is quite a bit of legacy ballast to deal with. Here is the state of Observability that we see in the market:

Most of the customers we engage with have deployments that leverage disparate libraries, collection agents/daemons, protocols, data models, and backends, all of which make correlation challenging and brittle
Applications are being instrumented with various language-dependent client libraries primarily using vendor-based solutions, though more and more we are seeing the introduction of open standards
Tracing data introduced the notion of request context propagation, which moves context between services and processes, and we are seeing enterprises experiment more heavily in this space. High-fidelity data capture is becoming the norm, negating the need to develop sampling strategies (constant probability, recency, key:value collection, event triggers, dynamic sampling, head/tail-based, rate limited, adaptive)
Log data is increasing in observability utility through the use of log linking, custom attributes, exemplars, etc.
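
To make request context propagation and log linking a little more concrete, here is a minimal sketch in Python. It assumes the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags); the function names `new_traceparent`, `child_traceparent`, and `log_fields` are hypothetical and exist only for illustration. In practice an instrumentation library would normally handle this for you.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def log_fields(traceparent: str) -> dict:
    """Extract the IDs to attach to a log line so it can be linked to its trace."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        return {}
    return {"trace_id": match.group(1), "span_id": match.group(2)}


incoming = new_traceparent()             # header received by service A
outgoing = child_traceparent(incoming)   # header service A sends to service B
print(log_fields(outgoing))              # attributes for a linked log record
```

The point of the sketch is that the trace ID survives every hop while the span ID changes, which is what lets logs and traces from different services be correlated later.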

Though strategy is highly dependent on characteristics that are unique to each individual organization, there are common components and starting points we advocate for when building an Observability strategy. That said, all strategies must be calibrated against a backdrop of organizational direction, organizational dynamics (team alignment, team boundaries), existing team capabilities, and a general appetite for continuous improvement.

The first thing to discuss is the high-level taxonomy of what makes up an Observability strategy. This taxonomy comprises the following:

Domain: Observability Operating Model (OOM)

Establishing an Observability Operating Model (OOM) is easily the most fundamental and critical element of success when building a strategy. Observability is a journey, and establishing clear operational motives and common goals will allow many teams to self-identify and co-create value.

Domain: Observability Pipeline

An Observability Pipeline establishes guardrails (a compliant facility that aligns with IT control narratives) and advanced data controls (e.g., obfuscation, removal of superfluous data, enrichment), all while promoting vendor neutrality for observability platforms. After establishing an OOM, this is the first set of initiatives we typically promote.
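
As a rough illustration of the data controls mentioned above, the following Python sketch chains three pipeline stages: removal of superfluous data, obfuscation, and enrichment. Everything named here (the field names, the static context values, and the `process` function) is an assumption made for the example and does not reflect any particular vendor's pipeline product.

```python
import hashlib

SENSITIVE_KEYS = {"email", "user_id"}   # hypothetical fields to obfuscate
SUPERFLUOUS_KEYS = {"debug_blob"}       # hypothetical fields to drop
STATIC_CONTEXT = {"env": "prod", "region": "us-east-1"}  # hypothetical enrichment


def drop_superfluous(event: dict) -> dict:
    """Remove fields that add ingest cost but little value."""
    return {k: v for k, v in event.items() if k not in SUPERFLUOUS_KEYS}


def obfuscate(event: dict) -> dict:
    """Replace sensitive values with a short hash.

    An unsalted, truncated hash is only a placeholder for a real control.
    """
    out = dict(event)
    for key in SENSITIVE_KEYS:
        if key in out:
            digest = hashlib.sha256(str(out[key]).encode("utf-8")).hexdigest()
            out[key] = digest[:12]
    return out


def enrich(event: dict) -> dict:
    """Add deployment context; values already on the event win on conflict."""
    return {**STATIC_CONTEXT, **event}


PIPELINE = (drop_superfluous, obfuscate, enrich)


def process(event: dict) -> dict:
    """Run an event through every stage in order."""
    for stage in PIPELINE:
        event = stage(event)
    return event


print(process({"msg": "login", "email": "a@example.com", "debug_blob": "x" * 10}))
```

Because each stage is a plain function over an event, the same controls apply regardless of which observability backend ultimately receives the data, which is the vendor-neutrality argument in miniature.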

Domain: Telemetry Collection & Context Propagation

Establish architectural tenets that commoditize telemetry collection and propagate context to backend systems. This can be the long pole in the tent, and you will need to think carefully about the developer experience as you focus on capturing signal emissions. Build vs. buy is a common theme that gets debated heavily here. If open standards are of interest, you may need to start incentivizing developers to (re)instrument their code, which can be challenging.
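
One way to think about the developer experience is a thin instrumentation facade that hides the backend from application code. The Python sketch below is only an illustration under that assumption: the `traced` decorator, `set_exporter`, and the in-memory `collected_spans` list are hypothetical names, and a real deployment would more likely lean on an existing open-standard SDK than on hand-rolled code.

```python
import functools
import time
from typing import Callable

# Hypothetical in-memory sink; a real exporter would ship spans to a backend.
collected_spans: list = []
_exporter: Callable[[dict], None] = collected_spans.append


def set_exporter(exporter: Callable[[dict], None]) -> None:
    """Swap the backend without touching instrumented application code."""
    global _exporter
    _exporter = exporter


def traced(name: str):
    """Decorator that emits one span-like record per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                _exporter({
                    "name": name,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("checkout.price")
def price(quantity: int, unit_cost: int) -> int:
    return quantity * unit_cost


price(2, 5)
print(collected_spans[-1]["name"], collected_spans[-1]["status"])
```

Developers add one line per function, and the platform team can change where the telemetry goes by calling `set_exporter`, which keeps the (re)instrumentation cost low.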

Domain: Platform(s) & Data Interpretation

Aggregation of disparate observability tools (COTS/custom) creates a common interpretation layer for all teams to operate from, and correlation improves overall central tendency metrics (e.g., MTTD, MTTR) while simplifying tool rationalization efforts. We often see multiple observability platforms at play in organizations, and rationalization is a challenging proposition. Building this layer will further decouple the organization from its observability platforms and position it not only for a rationalization event (if so desired), but also for an era of rapid experimentation.

Domain: Incident Analysis & Feedback

Maturing an observability discipline promotes value extraction beyond baseline central tendency metrics (e.g., human factors, agile metrics, regression/performance management, resilience engineering, chaos engineering, telemetry verification automation). This is a people/process element that can yield incredible engagement and results. Spend considerable time here, as this represents a powerful feedback mechanism into improvement cycles.

Beginning to develop the strategy

To develop the strategy, we first need to establish a body of knowledge. Here are some of the areas we focus the organization on.

Establish focus - Services Observability is very different from other forms of observability such as Business Observability, Model Observability, and Data Quality Observability. We therefore need to establish focus and intention, otherwise we cannot deliver a strategy. Ensure all stakeholders agree with what you are initially establishing.
Evaluate current ad-hoc capabilities - Traditional monitoring is considered a post-hoc capability, and one the organization has likely already established. Complex systems demand analysis at runtime and reaction to erroneous behavior in near real-time, making ad-hoc capability a requirement for dealing with complex failure modes. Determine the organization's appetite in terms of ad-hoc expectations.
Understand current measurement capabilities - Assess where the organization stands with respect to central tendency metrics such as Mean Time to ‘X’ (MTTx). If you cannot measure, you cannot improve, making this a gate to starting. As you onboard teams, ensure they have some ability to measure central tendency metrics and establish target operating goals.
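
As a small worked example of measuring MTTx, the Python sketch below computes Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) from a handful of incident records. The records and field names are invented for illustration, and the sketch assumes both metrics are measured from the moment the fault started; some organizations instead measure MTTR from the moment of detection, so agree on the definition before setting targets.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the fault began, was detected, and was resolved.
INCIDENTS = [
    {"started": "2022-11-01T10:00", "detected": "2022-11-01T10:12", "resolved": "2022-11-01T11:00"},
    {"started": "2022-11-09T02:30", "detected": "2022-11-09T02:38", "resolved": "2022-11-09T03:30"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mttd(incidents: list) -> float:
    """Mean minutes from fault start to detection."""
    return mean(minutes_between(i["started"], i["detected"]) for i in incidents)


def mttr(incidents: list) -> float:
    """Mean minutes from fault start to resolution (assumed definition)."""
    return mean(minutes_between(i["started"], i["resolved"]) for i in incidents)


print(f"MTTD: {mttd(INCIDENTS):.1f} min")  # (12 + 8) / 2 = 10.0
print(f"MTTR: {mttr(INCIDENTS):.1f} min")  # (60 + 60) / 2 = 60.0
```

Even a spreadsheet-level calculation like this is enough to establish a baseline and a target operating goal for a team that is onboarding.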

For a glimpse into what Observability is and why it's important, please review our previous post, [part 1] What is Observability and why do Enterprises need it

In an upcoming article, we will delve deeper into each of the domains identified above. In the meantime, let us know where you have begun your journey and whether this guide was helpful!