tags : Kubernetes, PromQL, Logging, Prometheus

FAQ

What about Performance?

  • I’ll have another page when I have it, but for now, this be it.
  • Outside scope of Observability but consider things like
    • Universal Scalability Law
    • Amdahl’s law (system limited by the seq processes)
    • Little’s law
    • Kernel/Compiler level optimizations
  • See Perf Little Book

Monitoring from business prespective

  • Looking at the customer experience

What?

  • It’s about bring better visibility into the system
  • It’s a property of a system. Degree of system’s observability is the degree to which it can be debugged.
  • Failure needs to be embraced
  • Once we have the info, we also need to know how to examine the info
  • We want to understand health, we want to understand change
    • We want to know “What caused that change?”

Checklist

  • Observability is not purely an operational concern.
    • Allows test in prod.
    • Reproducible failures
    • Rollback/Forward flexibility
    • Allows us to understand it when it runs
  • Let’s us dig deeper v/s just let us know about the issue
  • Let’s us debug better w evidence v/s conjecture/hypothesis

The levels

Primary

  • Logs, Metrics, Traces
  • These need to be used in the right way to attain needed observability
  • Logs

    • See Logging
    • We don’t tolerate loss here. Because we need this shit so we can query a needle in the haysack
  • Metrics

    • System centric
  • Traces

    • Request centric
    • Most other tooling suggests you take proactive approaches which are useful for known issues. When you are debugging production systems you are trying to understand previously unknown issues. This is why tracing is useful.

Secondary

  • Events

    • Similar to logging, but more interesting to a human’s specific usecase than normal logs.
    • Difference (Eg. webserver)
      • Log: every request (whether to log every http request or no can be debatable)
      • Event: unhandled exceptions, config file changes, or 5xx error codes etc.
    • I think, with structured logging, this distinction is not really necessary if we’re looking for the correct events in the logs in an automated way
    • Unlike logs, events (either from somewhere or extracted out of logs) can be rendered in dashboards on top of visualized metric data. Might help w relating things, eg. metric spikes with an overlay of config file change event. That should be helpful?
    • Exceptions (Error Tracking)

      • Events that indicate programming errors should be recorded in a ticket tracking system, then assigned to a engineer for diagnosis and correction.
      • Exceptions
        • Pass down the Thread local storage, stack trace etc
        • Use some error tracking system like sentry etc.

Observability in dev/testing time

  • Strive to write debuggable code (being able to ask questions in the future, via metrics, logs, traces, exceptions, combination etc)
  • We should be testing for failure aswell
    • Best effort failure mode simulation, we can’t catch all failure modes.
    • We can be aware of the presence of failures
    • We can’t be aware of absence of failures ever
  • Assume the system will fail
  • Dev should be aware of things like
    • How deployed? envars?. How it gets loaded/unloaded
    • How it interacts w network? how it disconnects, exposed?
    • How it handles IPC, configs
    • How it discovers other stuff etc etc
  • Understand leaky abstractions of dependencies
    • Default configs of dependencies
    • Caching guarantees of dependencies
    • Threading model of dependencies

Testing in prod

  • Essentially means we can check something on a live system and the system allows us to see what’s happening to the system when we want to check it.

Pillars

Monitoring

Boxes

  • Blackbox: Symptom based(effect), less trigger(cause) based
  • Whiebox: We get data from inside of the system. (Detailed stuff)

What it need to give

  • Show the failure (health)
  • Impact of the failure (health)
  • Ways to dig deeper (health)
  • Effect of fix deployed (change)

What metrics to use?

  • USE (System Performance)

    • U: Utilization
    • S: Saturation
    • E: Error
  • RED (Request Driven Applications)

    • R: Req. Rate
    • E: Req. Error Rate
    • D: Req. Duration (Histogram)
  • LETS (For Alerting)

    • L: Latency
    • E: Error
    • T: Traffic
    • S: Saturation
  • Databases

    • No. of queries made v/s no. of rows returned
  • Other practices

    • We want to drop timeseries data that we don’t need to save bandwidth and space
    • We don’t want to monitoring everything really, we want to properly define our SLOs and monitor those
    • Closely track the signals that best express and predict the health of each component in our systems.
  • Instrumentation

    • We can see what we can monitor out of libraries that our program use. Eg. I think pgx exposes a stats function which can be nice data to instrument and send to prometheus.
    • For any sort of threadpool, the key metrics are the number of queued requests, the number of threads in use, the total number of threads, the number of tasks processed, and how long they took. It is also useful to track how long things were waiting in the queue.

Cardinality

  • We need to keep cardinality low but in certain cases tradeoffs are worth it
  • For health

    • We often need cardinality to hone in on signals we actually care about.
    • i.e if there’s cardinality increase for proactive monitor useful for indicating health, we probably want it. (proactive and health are the keyword here)
  • For change

    • Increasing cardinality so that we can you can explain the cause of something(explain changes) is probably the wrong way to go.
    • Eg. After a painful outage, say you realize a single customer DOSed your service. So someone adds a `customer` tag “for next time.”. We don’t want to do this because it’s unsustainable, both the process of adding another label to find cause and the customer tag.
    • Distributed systems can fail for a staggeringly large number of reasons. You can’t use metrics cardinality to isolate each one.
    • Instead look for observability that
      • (a) naturally explains changes
      • (b) relies on transactional data sources that do not penalize you for high/unbounded cardinality.

Types of metrics

  • Counters

    • Can only increase during the lifetime of the system.
  • Gauges

    • Can vary freely across its possible value range.
  • Distributions

    • Binning

      • client-side binning
        • Reporter decides buckets
        • Usually configurable per-metric
        • Changing the binning can cause vertical aberrations in visualisations.
      • collector-side binning
        • Client reports the events as-is, collector aggregates before storing.
        • Eg: collector receives raw distribution samples from its clients, and records {50,90,95,99}th percentiles over a trailing window.
        • Less flexible

Logging

See Logging

  • Sometimes we log with the idea that when something goes wrong, maybe i’ll come dig the log and find something useful. A better way for some of those cases would be to send a error report of something right from the program if in any-case that log needs to be “seen”.
  • Log derived metrics are okay sometimes

Tracing

########################  GET /user/messages/inbox
 ######                   User permissions check
    ####                  Read template from disk
    #########             Query database
             ###          Render page to HTML
                ##        Compress response body
                  ######  Write response body

Traces are used to understand the relationship between the parts of a system that processed a particular request. This is more useful in Distributed Systems.

Concepts

Analogy:

  • A trace => stack trace
  • A span => A single stack frame.
  • stack frame push-pop => span begin-end
  • Trace
    • Tree of spans
    • A trace is constructed by walking the tree of spans to link parents with their children.
    • It’s the timing of spans that is interesting when analysing a trace.
  • Spans
    • Single operation within a trace
    • Logical region of the system’s execution time. (A duration distribution)
    • Spans are nested
      • Root span : No parent
      • Other spans: Have parent
      • All spans except the root span have a parent span
    • spans have attributes
  • Trace Context
  • Examplars

Grafana Tempo

Auto vs Manual instrumentation for otel

  • Auto

    • Straightforward process that involves setting some environment variables (or via another configuration format)
    • Generates telemetry for underlying runtime components, but it cannot see inside your custom application logic to understand what’s going on.
    • Helps you understand the context surrounding your application
  • Manual

    • Enrich spans

      • Adding attributes to the current span (even ones created by auto-instrumentation) to capture useful information.
      • Eg. adding a userID to a span to understand which users are calling a particular endpoint
      • Eg. recording cache hit/miss
    • New spans

      • Eg. a web application may create a span when making a database call to encapsulate when the call is made and enrich it with the query parameters.
  • In practice

    • In practice we need both

Learn more about tracing

  • Headers, a lot of headers. That’s how it works :p when your application receives a request it needs to extract the trace parent and the rest of the trace context from headers. So whatever makes the request to your API should generate a parent trace ID and attach that to the request. If the request caused subsequent requests the trace parent should be passed in headers. There are multiple specs for those. We’ve chosen the W3C one as it’s vendor agnostic and has a lot of community support. I’d recommend going through that spec.

Concerns on Tracing

  • Trace size

    • We need not worry about any cardinality issues when adding attributes to traces
    • APM provider(sentry) does not bill by the size of the unit but by the number of units. So feel free to chuck-in as many attributes to traces that you want.
    • Only concern is egress traffic, which should be fine for now

Concepts

Latency

Percentiles

Reminded me about a story when Google tried to optimise their response times in Africa. Because of poor infra, many users were getting timeouts when searching on G, so they worked to improve it. They managed to cut number of bytes, change geo-location of some switches, and what they saw that the average response time increased, but their p99 stayed the same. What really happened is that users who never could connect before, became their >p95 and p99, and users who were p99 became p60-p80. So G engis made positive changes, but the numbers didn’t reflect it. Is average and p90 BS, and what’s the alternative?

The lesson here is that all models are wrong, but some are useful It’s not about what is BS and isn’t, it’s just a simple number, the problem is how you use it. If it’s useful for you and you know how to use it, then a blog post shouldn’t stop you from doing it.

  • Some orange side guy

See Statistics

Dashboards

  • Dashboards are nice but do read this
  • Debugging: “You come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching.

OpenTelemetry

  • Basically this is hot shit. In some sense, OTEL is very much the k8 for observability (good and bad)
  • OTel lets the open source projects use an abstraction layer so that you have an option to buy instead of self-host in any cases.
    • “oh shit, all this open source software emits Prometheus metrics and Jaeger traces, but we want to sell our proprietary alternatives to these and don’t want to upstream patches to every project”. - Some guy
    • Otel is great for avoiding vendor lock-in
    • A vendor-agnostic replacement for the client side of DataDog, New Relic, or Azure App Insights.

Pros

  • otel-collector it’s easy to get logs, tracing and metrics in a standardised manner that allows you to forward any metrics/tracing to SRE teams or partners.
    • Eg. send logs to Splunk, traces to New Relic & metrics to Prometheus. Also send a filtered traces and logs to a partner that wants the details.
    • devs can add whatever observability to their code and ops team can enforce certain filtering in a central place as well as only needing one central ingress path that applications talk to.
  • Despite of the cons, there seems to be no better alternative that otel. The alternative of not using otel is you’ll have to export data in certain exposition format which will not be compatible w other backends handling that data(monitoring/logs/traces), but sometimes that’s all you need.

Cons

  • OT might good for a certain language/stack and completely suck for another. Eg. exemplars are not supported in JavaScript/Node.
  • “It doesn’t know what the hell it is. Is it a semantic standard? Is a protocol? It is a facade? It is a library? What layer of abstraction does it provide? Answer: All of the above! All the things! All the layers!” - Someone on orange site
  • “Otel markets itself as a universal tracing/metrics/logs format and set of plug and play libraries that has adapters for everything you need. It’s actually a bunch of half/poorly implemented libraries with a ton of leaky internals, bad adapters, and actually not a lot of functionality.” - Another user on orange site
  • “OTel tries to assert a general-purpose interface. But this is exactly the issue with the project. That interface doesn’t exist.” - Another guy

OTel Collector Implementation

  • It usually does not need to listen on any ports
  • OpenTelemetry collector implements a Prometheus remote write exporter.
  • Collector is a common metrics sink in collection pipelines where metric data points are received and quickly “forwarded” to exporters.
  • Components of a “Collector”

    • Receiver
    • Processor
    • Exporter
  • Collector vs Distro?

Resources

APM

  • APM it’s usually via a vendor SDK or agent. You have to invest in that SDK or agent and it’s not portable.
  • But you can instead use OpenTelemetry, which is portable

War Stories

Scaling observability

  • At a large enough scale, it is simply not feasible to run an observability infrastructure that is the same size as your production infrastructure.

When overloaded you should be doing as little work as possible

Basically what’s happened for us on some very high transaction per second services is that we only log errors. Or Trace errors. And the service basically never has errors. So imagine a service that is getting 800,000 to 3 million request a second. And this is happily going along basically not logging or tracing anything. Then all the sudden a circuit opens on redis and for every single one of those requests that was meant to use that open circuit to redis you log or trace an error. You went from a system that is doing basically no logging or tracing to one that is logging or tracing at 800,000 to 3 million times a second. What actually happens is you open the circuit on redis because red is a little bit slow or you’re a little bit slow calling redis and now you’re logging or tracing 100,000 times a second instead of zero and that bit of logging makes the rest of the requests slow down and now you’re actually within a few seconds logging or tracing 3 million requests a second. You have now toppled your tracing system your logging system and the service that’s doing the work. Death spiral ensues. Now the systems that are calling this system starts slowing down and start tracing or logging more because they’re also only tracing or logging mainly on error. Or sadly you have a better code that assumes that the tracing are logging system is up always and that starts failing causing errors and you get into doing extra special death loop that can only be recovered from by only attempting to log or error during an outage like this and you must push to fix. All the scenarios have happened to me in production.

  • So prefer sampling errors, See Logging

Tactical stuff

What to monitor for?

Some notes on metrics

  • What we monitor should ideally describe customer experience / describe our system
  • Following are some base/ideal metrics to have. Also see Metrics For Your Web Application?
  • Ideally this list wouldn’t exist and we’ll have appropriate dashboards/tools for all

Application

MetricDescription
Realtime API Endpoint statsWhich endpoints are being hit, how many times etc.
Synthetic API Endpoint statsPreemptive checks to ensure correctness, delivery speeds etc
RED MetricsReq(Rate,Error,Duration)
Cache hit/miss rate

Database

MetricDescription
Availability
Connections
Database size/growth rate
queries made/rows returned
Connection Pool metricsDon’t use a pool at the moment
Response Latency
Cache hit/miss
Calls to the db/min
Client side DB pool
Server side DB pool

System

MetricDescription
Service Availability
Service Health
Node/Host metricsfd,io,mem,cpu,threads etc. for USE

What to alert on?

Some notes on alerts

  • We want to be alerted
    • When something’s broken
    • When something might break
    • When something is extra unusual
  • Things we want to take care of
    • False positives, Duplicate alerts etc.
    • Need to define proper failure modes and alert on them
    • Ensure receivers are correctly set and right people are notified. We currently only plan on using email and slack, so provision the receiver as per need. Ideally, we want every significant alert to end up on slack at the moment.
    • Think about the source of the alert. We can capture the signal for the alert from multiple places, eg. The alert for “too many 4XX” can be triggered from CW ALB metrics as-well as from Sentry. We want to determine the best source and alert from one place only.
    • Ideally for every alert we send, we want a dashboard entry/issue for the handler to inspect further

Some alert candidates

Component
Too many NXX in last M-mins
latency is higher than Xs