Warning: Hot Take
Observability (with “arbitrarily wide log events”) is a solution to a problem that most fields and businesses don’t have.
Most businesses are not pushing the boundaries of science or research. Most businesses are built to solve a particular problem, and focus on providing services or products at a lower cost or with higher satisfaction than their competitors.
This matters because it means the questions you will be asking of your data are already known or knowable before you begin.
When you provide a service to a customer, you’re probably going to want to know about errors in service, latency or delays in processing, overall throughput of the service to customers, and your cost in resources to support it. There will be additional questions you’ll want to ask, but you’ll want those baseline questions answered in a meaningful way before you find yourself able to answer higher-order questions.
This applies whether you’re talking about a restaurant, a car manufacturer, or a software shop. Your key performance indicators (KPIs) are a well-known set of values and you don’t have to look very far to apply them to your current process. That’s not to say you aren’t doing something unique in your business – you probably are! But what you are not doing is something beyond the realm of process and procedure.
Structured events allow for storing data with an often-arbitrary organization, of unlimited cardinality, and without necessarily needing to act on the input on receipt.
These are all wonderful properties. We like this; we want this; it is an unmitigated good for a business to manage its events in this way.
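As a concrete sketch of what a “wide” structured event looks like in practice (field names here are hypothetical, not from any particular vendor’s schema):

```python
import json
import time

# A hypothetical wide structured event: arbitrary fields, high-cardinality
# values (user IDs, order IDs), and no fixed schema required at write time.
event = {
    "timestamp": time.time(),
    "service": "checkout",        # assumed service name, for illustration
    "event": "order.completed",
    "user_id": "u-48213",         # unbounded cardinality is fine here
    "order_id": "o-99x7",
    "items": ["X", "Y", "Z"],
    "duration_ms": 16,
    "retries": 0,
}

# Stored as one self-describing line; nothing needs to act on it at receipt.
line = json.dumps(event)
```

Nothing about the pipeline needs to know these fields in advance, which is exactly the property being praised above.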
When looking at telemetry, you must split the relevant information into two categories: business events and operational telemetry. The former is always high-value data and can and should be stored as structured events. Operational telemetry should not.
Business events are things like “Person A bought X, Y, and Z” or “Customer R modified a deployment in zone Q” – these are your business operations, turned into events. These events are part of your business history and are always worth the time and money to retain.
What isn’t worth retaining are the operational events. The fact that a particular operation took 16ms instead of 23ms on the backend service is not something your business is going to care about months or years from now. The fact that a particular operation took three retries to complete is not something the business will care about in six weeks. If they do want to ask arbitrary questions about those particular operations, then they need to be willing to spend exponentially increasing volumes of money to retain and process said data.
This will take time and effort, but once you do successfully categorize your data into “operational” and “business” buckets, you can treat them differently. When new services are built, they can cleanly flag events generated for each.
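One minimal way to flag events at emission, sketched here with made-up class names and payloads – the point is only that the retention class travels with the event, so downstream routing needs no guessing:

```python
import json
from enum import Enum

class EventClass(Enum):
    BUSINESS = "business"        # long retention, schema'd, audited
    OPERATIONAL = "operational"  # weeks of retention, then compacted

def emit(event_class, payload):
    """Attach the retention class at emission so cheap vs. durable
    storage can be chosen by a router without inspecting the payload."""
    record = {"class": event_class.value, **payload}
    return json.dumps(record)

# A business event: part of company history, kept indefinitely.
order = emit(EventClass.BUSINESS, {"event": "order.completed", "user": "A"})

# An operational event: useful for weeks, then compacted or deleted.
timing = emit(EventClass.OPERATIONAL, {"event": "rpc.latency", "ms": 16})
```

New services then only need to pick the right `EventClass` when they emit, and the storage tier falls out automatically.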
Business events get long, expensive storage options. They get schemas and requirements, with clear reasons for the data to exist.
Operational events get weeks or months of lifetime before being compacted or deleted.
No need for expensive historical Splunk queries to figure out why your service is slow – collapse your operational data into point metrics instead. Even if cardinality is high, you’re storing a small dataset, and adding more points for tracking is a small addition to the set.
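The collapse itself is cheap. A sketch, with made-up latency numbers and bucket edges (pick edges that match your own SLOs):

```python
from collections import defaultdict

# Raw per-request latencies in ms: what you'd otherwise pay to store verbatim.
latencies_ms = [16, 23, 16, 18, 210, 17, 19, 23, 16, 95]

# Collapse into a fixed-size point metric: a count per bucket plus a sum.
buckets = [20, 50, 100, float("inf")]
histogram = defaultdict(int)
total = 0
for ms in latencies_ms:
    total += ms
    for edge in buckets:
        if ms <= edge:
            histogram[edge] += 1
            break

# Ten raw points became four counters and a sum. Adding a new dimension
# (say, per-zone) multiplies this small set, not the raw event volume.
summary = {"count": len(latencies_ms),
           "sum_ms": total,
           "buckets": dict(histogram)}
```

The raw events can then age out on schedule while the summary stays queryable forever at trivial cost.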
Then literally anything will work for what you have. You don’t need Honeycomb, Splunk, Datadog, or any of that – you need grep and a handful of refurbished 26TB disk drives for storage.