0/ When large eng orgs rely on metrics for both monitoring *and* observability, they struggle with cardinality.

This is a thread about “the two drivers of cardinality.” And which one of those we should kill. :)

🧵👇

1/ Okay, first off: “what is cardinality, anyway?” And why is it such a big deal for metrics?

“Cardinality” is a mathematical term: it’s *the number of elements in a set*... boring! So why tf does anybody care??

Well, because people think they need it, and then suddenly: “$$$$$$$.”
2/ When a developer adds a (custom) metric to their code, they might think they’re just adding, well, *one metric*. …
3/ … But when they add “tags” to that metric – like , , or (shiver) – they are actually creating a *set* of metric time series, with the *cardinality* of that set being the total number of unique combinations of those tags.
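
To make that concrete, here is a minimal Python sketch of "cardinality = number of unique tag combinations." The tag names (`region`, `status`) and their values are made up for illustration; they are not from any real system.

```python
from math import prod

def series_cardinality(samples):
    """Count distinct time series: one per unique combination of tag values."""
    return len({tuple(sorted(tags.items())) for tags in samples})

def max_cardinality(tag_values):
    """Worst case: the product of the number of distinct values per tag."""
    return prod(len(values) for values in tag_values.values())

# Hypothetical samples reported for one metric.
samples = [
    {"region": "us-east", "status": "200"},
    {"region": "us-east", "status": "500"},
    {"region": "eu-west", "status": "200"},
    {"region": "us-east", "status": "200"},  # same combination as the first sample
]

observed = series_cardinality(samples)
upper_bound = max_cardinality(
    {"region": ["us-east", "eu-west"], "status": ["200", "500"]}
)
print(observed, upper_bound)  # 3 4
```

One "metric" in the code, three time series observed so far, four possible. Every extra tag multiplies that upper bound.
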
4/ The problem is that some of those tags have many distinct values.

E.g., when I was working on Monarch at Google, there was a single Gmail metric that became over 300,000,000 (!!!) distinct time series. In a TSDB, that cardinality is the unit of cost.

So again, “$$$$$$$.”
5/ Okay, so that’s why people care about metrics cardinality. Now, what are the two *drivers* of that cardinality?

A) Monitoring: more detailed *health measurements*
B) Observability: more detailed *explanations of changes*

Let’s take these one at a time…
6/ First, “More Detailed Health Measurements” (monitoring):

Consider an RPC metric that counts errors. You need to independently monitor the error rate for different methods. And so – voila – someone adds a “method” tag, and now (with, say, 10 methods) there’s 10x the cardinality for that metric.
7/ … And also 10x the cost. But that’s justifiable, as there’s a business case for independently monitoring the reliability of distinct RPC methods.

Put another way, you might rightly have different error budgets for different RPC methods, so their statistics must be separable.
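
A rough sketch of why those statistics need to be separable, in Python. The method names (`Get`, `Delete`), the traffic numbers, and the 1% threshold are all assumptions chosen for illustration.

```python
from collections import Counter

requests = Counter()  # request count per RPC method
errors = Counter()    # error count per RPC method

def record(method, total, failed):
    """Add a batch of requests (and how many of them failed) for one method."""
    requests[method] += total
    errors[method] += failed

def error_rate(method):
    return errors[method] / requests[method]

# Hypothetical traffic: a busy, healthy method and a rare, broken one.
record("Get", total=1000, failed=1)
record("Delete", total=10, failed=5)

aggregate_rate = sum(errors.values()) / sum(requests.values())
print(f"aggregate: {aggregate_rate:.4f}")       # 6 errors / 1010 requests
print(f"Delete:    {error_rate('Delete'):.4f}")  # 5 errors / 10 requests
```

Without the "method" tag, the aggregate stays under a 1% threshold while `Delete` fails half the time. That is the business case for paying for the extra series.
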
8/ Bottom line: When it comes to “measuring health,” we often *need* cardinality in order to home in on the signals we actually care about. Increasing cardinality to *proactively* monitor the signals we care most about is usually a worthwhile tradeoff.
9/ Now what about using cardinality for “More Detailed Explanations of Changes”?

This is the real culprit! And, frankly, it should be abolished. :) Metrics cardinality is the wrong way to do observability – to explain changes.

More on that…
10/ Say your monitoring tells you that there’s a problem with a critical symptom – e.g., you’re suddenly burning through an SLO error budget at an unsustainable clip.
11/ After a painful outage, say you realize a single customer DOSed your service. So someone adds a `customer` tag “for next time.”

But this is unsustainable: each incident reveals a new failure mode, devs keep adding tags, and before long your metrics bill is out of control.
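
A back-of-the-envelope Python sketch of how that compounds. The tags beyond `method` and `customer`, and every count here, are hypothetical numbers picked to show the multiplication, not measurements.

```python
from itertools import accumulate
from operator import mul

# (tag, number of distinct values), in the order the tags were added.
# All counts are illustrative assumptions.
tags_added = [
    ("method", 10),
    ("customer", 5_000),  # added after the DOS incident
    ("region", 20),       # added after some later, hypothetical incident
    ("build", 50),        # ...and another one
]

# Worst-case series count after each tag is added: a running product.
growth = list(accumulate((count for _, count in tags_added), mul))

for (tag, _), total in zip(tags_added, growth):
    print(f"after adding {tag!r}: up to {total:,} time series")
```

Under these assumptions, four tags take one metric from 10 series to a worst case of 50,000,000. Real systems rarely hit every combination, but the growth is multiplicative rather than additive, which is what makes the bill run away.
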
12/ The problem, in a nutshell:

Distributed systems can fail for a staggeringly large number of reasons. *You can't use metrics cardinality to isolate each one.*
13/ How *should* we explain changes to production systems?

Understanding change is the central problem of observability. Effective workflows might *start* with metrics, but they must pivot towards a multi-telemetry, multi-service guided analysis.
14/ So, to sum up: spend your limited cardinality budget on *monitoring*, and then look for observability that (a) naturally explains changes and (b) relies on transactional data sources that do not penalize you for high/unbounded cardinality.
15/ PS: For more on how to distinguish monitoring and observability, see this thread: https://t.co/2RHs36Nknc

PPS: If you’d like to discuss/debate/request-more-detail any of the above, reply to this thread or DM me!
