WHAT IS OBSERVABILITY AND WHY DO CUSTOMERS NEED IT
If someone hears Enterprise Observability they may understand it, support it, and even grasp its complexity. But will they realize it is the Magnum Opus of IT?
— AVP Enterprise Data, Financial Services Customer
Observability is an emergent property of complex systems. Without this discipline, as systems mature in complexity so too will entropy in the environment.
The discipline of Observability is fundamentally about improving software delivery and operations. This is a socio-technical domain requiring cross-functional commitment from multiple teams.
When implemented correctly Observability represents an amplifying function for forward-thinking Product Teams and is necessary to meet the needs of modern architecture.
WHEN DO CUSTOMERS NEED OBSERVABILITY
THE VALUE CONTINUUM
Monitoring end of the continuum:
Single stateful data store
System metrics are relevant, and emphasis is on post-hoc analysis
Fairly static and unchanging set of resources with known thresholds of expectation
Focus is uptime and failure prevention; post-incident analysis is adequate to meet goals
Correlation occurs across a limited number of dimensions
Telemetry insufficient for ad-hoc understanding
Observability end of the continuum:
Infrastructure is dynamic and elastic with loosely coupled services across Product Teams
Focus on reliability and resiliency with constructs such as error budgets, SLOs, SLIs
Correlation occurs across an unlimited number of dimensions & cardinality
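The error budget constructs mentioned above reduce to simple arithmetic: an SLO target implies a bounded amount of allowed unreliability over a compliance window. A minimal Python sketch (all numbers and function names are illustrative, not any team's actual tooling):

```python
# Error-budget arithmetic: an SLO target implies a bounded amount of
# allowed unreliability over a compliance window (numbers illustrative).

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Minutes of allowed downtime for a given SLO over a window."""
    return (1.0 - slo_target) * window_minutes

def budget_remaining(slo_target: float, window_minutes: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo_target, window_minutes)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over a 30-day window allows about 43.2 minutes of downtime.
MONTH = 30 * 24 * 60
print(error_budget_minutes(0.999, MONTH))    # ≈ 43.2
print(budget_remaining(0.999, MONTH, 10.0))  # ≈ 0.77 of the budget left
```

When the remaining fraction approaches zero, the usual SLO-driven response is to shift a team's effort from feature delivery toward reliability work.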
COMMON CHALLENGES DRIVING OBSERVABILITY
Technological revolutions require existing businesses to master a new means of production. For the digital revolution that we firmly sit in now, that new means of production is software.
— Honorary Professor, University College London; Honorary Professor, University of Sussex; Adjunct Professor, TalTech, Estonia
Software architectural complexity, composition, operating costs, and time-to-market are ever increasing
Managing data ingest costs with existing observability tooling, while preserving value of data, requires constant evaluation
Tool sprawl and tool dependence are creating solutions that are often siloed and therefore brittle
Resilience engineering is an imperative for teams providing platform services
Product Teams often need to introduce their own tooling to meet evolving and novel needs not met by current strategy
E2E testing complexity and testing economics are normalizing Testing in Production (TiP) and Chaos Engineering as accepted standards
Applications instrumented with various language-dependent client libraries
Metric data is typically stored in a Time-Series Database (TSDB) and helps to answer ‘known unknowns’
Tracing data introduced the notion of request context propagation to move context between services and processes.
Log data has a long legacy with limited support for tracing and monitoring (log linking, custom attributes)
As current deployments often leverage disparate libraries, collection agents/daemons, protocols, data models, and backends, correlation becomes challenging and brittle
Commoditize telemetry collection and propagate context objects (key:value) at time of collection. In this way, if both the tracing and metrics signals are enabled, recording a metric may automatically create a trace exemplar
Develop an observability pipeline to route full-fidelity data to low(er) cost mediums, while passing on propagated context to any backend platform
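A toy illustration of that routing (hypothetical; the sinks, fields, and rules are invented): every event lands in a low-cost archive at full fidelity, while a trimmed copy with its propagated context intact is forwarded to the backend platform.

```python
# Hypothetical observability-pipeline routing: full-fidelity events go to
# a low-cost archive, while a slimmer copy (context preserved) goes to the
# analysis backend. Sink and field names are illustrative only.

archive, backend = [], []   # stand-ins for object storage / platform

SLIM_FIELDS = {"timestamp", "severity", "message", "context"}

def route(event: dict) -> None:
    archive.append(event)                      # full fidelity, cheap medium
    slim = {k: v for k, v in event.items() if k in SLIM_FIELDS}
    backend.append(slim)                       # context propagated onward

route({"timestamp": 1700000000, "severity": "ERROR",
       "message": "payment timeout", "context": {"trace_id": "abc123"},
       "stack": "full stack trace here", "payload_bytes": 9182})

print(backend[0])  # no 'stack' or 'payload_bytes', but context survives
```

The key design point is that trimming never discards the correlation keys, so the backend can always rehydrate the full record from the archive when an investigation demands it.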
Solution-specific instrumentation not only promotes vendor lock-in, but the surface area of software diversity is too great for a commercial team to keep up with
Domain: Observability Operating Model (OOM)
Establishing an Observability Operating Model (OOM) is easily the most fundamental and critical element to success. As Observability is a journey, establishing clear operational motives and common goals will allow many teams to self-identify and co-create value.
Domain: Telemetry Collection & Context Propagation
Establish architectural tenets that commoditize telemetry collection and propagate context to backend systems. This can be a long pole to attack, and you will need to think carefully about the developer experience as you focus on signal emission capture.
Domain: Observability Pipeline
An Observability Pipeline establishes guardrails (a compliant facility that aligns with IT control narratives) and advanced data controls (e.g., obfuscation, removal of superfluous data, enrichment), while promoting vendor neutrality for observability platforms.
Domain: Platform(s) & Data Interpretation
Aggregation of disparate observability tools (COTS/custom) creates a common interpretation layer for all teams to operate from, and correlation improves overall central tendency metrics (e.g., MTTD, MTTR) while simplifying tools rationalization efforts.
Domain: Incident Analysis & Feedback
Maturing an observability discipline promotes value extraction beyond baseline central tendency metrics (e.g., human factors, agile metrics, regression/performance management, resilience engineering, chaos engineering, telemetry verification automation, etc.).
Products to Watch
EraDB - Analysis of hyper-cardinality data at scale. EraDB has married its underlying time-series database and machine learning-driven indexing functionality with support for the Elasticsearch API and cloud object storage services to deliver EraSearch
Observe Inc. - Workflow focus with Snowflake backend. Observe invests in UI and workflow to simplify troubleshooting and discovery while leveraging the cloud-based data warehouse Snowflake as its data platform
Lightrun - Works via an agent-based approach wherein developers use an IntelliJ IDEA plug-in which inserts the necessary code into a production platform
Nobl9 - Focuses on SLO-related automation built on data ingested from the observability tooling already being leveraged in a customer's environment
Bionic - Platform automatically reverse engineers applications, providing an inventory with architecture and dataflows, monitoring critical changes in production, and enabling developer guardrails to enforce architecture
DEFINING A STRATEGY
When done correctly, Observability represents an amplifying function that elevates many different domains in an organization (rising tide phenomenon).
Though strategy is highly dependent on characteristics that are unique to each individual organization, there are common starting points we typically advocate for.
Each strategic element is expanded below:
Strategy starts with understanding organization directionality, dynamics, and cohorts
Strategy typically starts with an evaluation of organizational direction, organizational dynamics (team alignment, team boundaries), existing team capabilities, and appetite for continuous improvement across applicable cohorts.
It is important to establish focus: Services Observability is very different from other forms of observability such as Business Observability, Model Observability, and Data Quality Observability
Evaluate current ad-hoc analysis capabilities at runtime; prompt reaction to erroneous behavior in near real time is a requirement for handling complex failure modes
Understand current maturity with respect to central tendency metrics such as Mean Time to 'X' (MTTx)
Understand TCO expectations as a dimension of planning: the cost baseline to plan against, how those costs should be captured, and where those costs should be allocated
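Central tendency metrics such as MTTx are straightforward to compute once incident timestamps are captured consistently. A minimal sketch (the incident records and timestamps are illustrative, expressed in epoch minutes):

```python
# Baseline central-tendency metrics from incident records (illustrative
# data; times are in epoch minutes). MTTD = mean(detected - started),
# MTTR = mean(resolved - detected).

from statistics import mean

incidents = [
    {"started": 0,   "detected": 12,  "resolved": 55},
    {"started": 100, "detected": 104, "resolved": 190},
    {"started": 300, "detected": 330, "resolved": 360},
]

mttd = mean(i["detected"] - i["started"] for i in incidents)
mttr = mean(i["resolved"] - i["detected"] for i in incidents)
print(f"MTTD={mttd:.1f} min, MTTR={mttr:.1f} min")
```

Establishing this baseline early matters because every later domain (pipeline, interpretation layer, incident analysis) is evaluated against how far it moves these numbers.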
Establish an Observability Operating Model
Establishing an Observability Operating Model (OOM) is easily the most fundamental and critical element to success. As Observability is a journey, establishing clear operational motives and common goals will allow many teams to self-identify and co-create value.
Establish clear boundaries of responsibility between Product Teams, Operations/Global Command Center (OCC), Strategy & Architecture, Site Reliability Engineering (SRE), and Platform Engineering/Shared Services. Here we consider the role of centralized and decentralized behaviors as it relates to Observability
Establish prioritization regarding Product Team operationalization, including alignment with Developer and/or Platform Engineering advocacy and adoption programs, architectural goals, and integration with operational capabilities around notification, ticketing systems, dashboard creation, KPI onboarding/gates, etc.
Determine guidance around Operational Support Patterns that provide select Product Teams greater autonomy to introduce/manage observability tooling and processes, and define what contracts must be established with the organization to support such patterns
Develop maturity and discipline regarding observability practice capability (introduction and deprecation) through introduction and refinement of architecture patterns, framework(s), tools, and ready-to-use code libraries
Develop a formal and consistent commitment to measuring lost engineering cycles across Product Teams (e.g., partnering with Dev/Platform advocacy) for continuous process improvement/refinement
Build an Observability Pipeline to decouple source-to-sink, providing control over data
An Observability Pipeline establishes guardrails (a compliant facility that aligns with IT control narratives) and advanced data controls (e.g., obfuscation, removal of superfluous data, enrichment), while promoting vendor neutrality for observability platforms. This is often a starting point in terms of implementation, and like all other domains it should align with the Observability Operating Model (OOM), where careful consideration was given to the buy vs. build approach and centralized vs. decentralized deployment patterns. The benefits of creating this beachhead are fundamental if building toward vendor independence: this domain not only prevents vendor lock-in, but allows for safe and rapid tools experimentation while potentially reducing the costs of existing tooling.
Note: Establish and manage this Observability Pipeline as a Product, and build a robust set of advocacy programs to ensure community engagement. Building this agnostic facility to route MELT data with guardrails opens the proverbial spigot to expanded tools usage. Per the Observability Operating Model (OOM), clear definitions of 'acceptable' have to be established. Will you allow any Product Team to instantiate their own tools, or will this be owned by Platform Engineering? Establish cohorts based on maturity/capability/needs and, just as with a preventative vs. detect/correct path for security, either allow for tools through a measurable discipline and program or provide a measure of control via a gated process. The better you know your constituents, the more prescriptive you can be with your tools alignment. If you allow for bespoke tooling, the Observability Pipeline lets you audit usage per cohort/Product Team. Pay close attention to the 'why' behind usage decisions. With groups that are less mature, Platform Engineering tends to provide more tools leadership and ownership; with groups that are more mature, the opposite is true. The more advanced groups will provide drag for the teams just starting out and should be considered as you develop your advocacy programs.
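The advanced data controls a pipeline applies can be sketched as a single stage (a hypothetical illustration; the field names, rules, and enrichment metadata are invented): obfuscate sensitive values, drop superfluous keys, and enrich with deployment metadata before anything reaches a backend.

```python
# Sketch of pipeline data controls (illustrative field names and rules):
# obfuscate sensitive values, remove superfluous data, enrich with
# deployment metadata. Not a real pipeline product's configuration.

import hashlib

SENSITIVE = {"user_email"}                          # fields to obfuscate
SUPERFLUOUS = {"debug_blob"}                        # fields to drop
ENRICHMENT = {"env": "prod", "region": "us-east-1"} # assumed static metadata

def apply_controls(event: dict) -> dict:
    out = {}
    for key, value in event.items():
        if key in SUPERFLUOUS:
            continue                                # removal of superfluous data
        if key in SENSITIVE:                        # obfuscation: one-way hash
            value = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        out[key] = value
    out.update(ENRICHMENT)                          # enrichment
    return out

event = {"message": "login ok", "user_email": "a@example.com",
         "debug_blob": "x" * 4096}
print(apply_controls(event))
```

Hashing rather than deleting the sensitive field preserves its value for correlation (the same user produces the same token) while keeping the raw value out of every downstream platform.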
Focus on Telemetry Collection with Context Propagation
Evaluate and establish architectural tenets that commoditize telemetry collection and propagate context to backend systems. This can be a long pole to attack, and you will need to think carefully about the developer experience as you focus on signal emission capture.
Establish a Common Data Interpretation layer with Central Tendency Metric baselines
Focus on aggregation and correlation as a Common Data Interpretation layer and move teams to the common interface (Global Command Centers, SREs, Developers, etc.). Aggregation of disparate observability tools (COTS/custom) creates a common interpretation layer for all teams to operate from, and correlation improves overall central tendency metrics (e.g., MTTD, MTTR) while simplifying tools rationalization. Without this approach, unwinding platform usage can become a protracted effort.
Unlock value beyond central tendency measures
Maturing an observability discipline promotes value extraction beyond baseline central tendency metrics (e.g., Agile metrics, regression/performance management, resilience engineering, chaos engineering, etc.)
FROM FIELD PERSPECTIVES
TO FIELD SUCCESSES
When considering Observability as a large-depth-of-field domain, we 'think globally but act locally'.
We consider the whole when devising strategy but act pragmatically to deliver value with every engagement on our customer’s overall journey!
CO-CREATING VALUE VIA
Overlaying gap analysis from Platform Engineering Team & Product Team roadmaps will help craft viable pilots
Product Teams/Developer community may be evaluated against an Observability Maturity Model (OMM) to help prioritize value streams
HOW WE CAN HELP!
As with many domains that have broad applicability to an organization, the option always exists to allow Product Teams to address individual activities on their own. We always caution that, without a cultural shift to proactively think about observability in every business and architectural decision, organizations tend to lose efficiencies and accretive benefits.
Observability truly fits the aphorism, "a rising tide lifts all boats". Many areas are direct or indirect beneficiaries of getting observability right, and we specialize in activating a company around a common set of goals.
Please contact us to learn more about, or exchange ideas on, this domain!