tags : Observability, PromQL

FAQ

Tactical stuff

Measuring latency

  • A system like Prometheus aggregates across time to work efficiently. So you can calculate things like mean latency, but you can’t see individual request latency. Eg. with python we can use prometheus histograms.

Service Discovery

AWS ECS

Exposing endpoints

What do we want to expose, prometheus or grafana?

  • Assuming a user would need to access the prometheus UI
  • This really is not a question about prometheus/grafana but more of a question about Infrastructure Security. (when Selfhosting)
  • In some cases companies keep their grafana instance publicly accessible from the internet but behind some auth. (Usually behind reverse Proxies) and the prometheus instance sits in a private subnet etc.
    • In other cases, both prometheus and grafana might be in a private subnet and you need to be in the VPC through a VPN or something similar to be in the same network. Otherwise, things can be public and firewalls etc can be used.

Can we expose on trusted networks?

  • Depends on security posture
  • Exposing that port on a “trusted” network is a possible attack vector for bad actors.
  • Exposing that port on the open internet (as is often the case) is an open invitation for attack.

Exposing prometheus/grafana

This is about protecting the endpoints provided by the prometheus binary

Exposing applications exposing prometheus metrics

  • Exposing /metrics endpoint to public internet
    • Blocking access to path at load balancer/reverse proxy seems like a good way to do it
    • The service would be accessible from the VPC anyway, so the prometheus collector instance will be able to scrape it using the private IP/whatever it gets from service discovery.

Instrumentation

Go (client-go)

Meta notes

  • When you simply use the metrics primitives, registering them and then exposing using promhttp.HandlerFor, you just get those metrics and some error metric. You don’t get the go and process collector metrics etc.
  • Cause of inconsistencies
    • If a metric inconsistent
    • If a metric is equal to the new metric already registered.
  • Registry

    • Collector & Metric have to describe themselves to the registry.
      • Because inconsistencies are supposed to be detected at registration time, not at collect time.
    • Usually you’d use the DefaultRegistry
      • Comes registered with NewGoCollector and NewProcessCollector
    • Custom Registry
      • Implement your own Register/Gatherer
      • Special properties, see NewPedanticRegistry.
      • Avoid global state, as imposed by DefaultRegisterer.
      • Use multiple registries at the same time to expose different metrics in different ways.
      • Separate registries for testing purposes.
  • Gatherer

    • Registry implements the Gatherer interface.
    • The caller of the Gather method can then expose the gathered metrics in some way.
  • Metric Types & Vector metric types

    • You need <MetricName>Opts for creation
    • Fundamental metric types
      • Gauge, Counter, Summary, and Histogram
      • Implement the Metric interface
      • Gauge, Counter, Summary, Histogram are interfaces themselves (These embed the Metric and Collector interfaces)
    • Vector metric types
      • GaugeVec, CounterVec, SummaryVec, and HistogramVec
      • This is when we partition samples along dimensions called “labels”
      • Implement the Metric interface and the Collector interface
      • GaugeVec, CounterVec, SummaryVec, HistogramVec are NOT interfaces themselves
  • Collector

    • A Collector manages the collection of a number of Metrics
    • For convenience, a Metric can also “collect itself”.
    • Creation of the Metric instance happens in the Collect method.
    • Usually you would not implement custom Metric but might implement custom Collector
    • Why custom collector?

      • Usecase1: A custom Collector seems handy to bundle Metrics for “common registration”
      • Usecase2: Already have metrics outside of prometheus context, “mirror” existing numbers into Metrics during collection.
        • Metric instances “on the fly” using NewConstMetric, NewConstHistogram, and NewConstSummary (also Must versions)
          • NewConstMetric for a float64 value: Counter, Gauge, and a special “type” called Untyped (When unsure of Counter or Gauge)
          • Convenience functions: GaugeFunc, CounterFunc, or UntypedFunc
        • The processCollector, goCollector, expvarCollector are collectors used for “mirroring” purposes.
    • Unchecked vs check collector

      • scrape inconsistencies
      • When exact metrics to be returned by a Collector cannot be predicted at registration time, but the implementer has sufficient knowledge of the whole system to guarantee metric consistency.
  • Instrumented Handler vs Uninstrumented Handler

Packages

  • prometheus

    • What it offers
      • Primitives to instrument
        • Eg. NewGauge, NewCounterVec etc.
      • Registry for metrics
        • Eg. You use prometheus.Registerer.MusRegister for each metric
  • promhttp

    • Expose registered metrics via http
    • Has
      • Handler()
        • Can think of it as a convenience function
        • sort of same was using InstrumentMetricHandler with HandlerFor
      • HandlerFor
        • Takes in prometheus.Gatherer and promhttp.HandlerOpts
        • HandlerOpts defines the behavior of the returned Handler
        • Custom instrumentation/Gathers etc.
  • promauto

    • Metrics constructors with automatic registration.
    • When you are not concerned with fine-grained control of when and how to register metrics with the registry
    • Allows you to ignore registration altogether in simple cases.
    • Provides constructors
      • Top-level functions
        • Return Collectors registered with the global registry (prometheus.DefaultRegisterer)
      • Set are methods of the Factory type
        • Return Collectors registered with the custom registry the Factory was constructed with.
        • Using prometheus.NewRegistry() and promauto.With(reg)
    • Registration Panicking

      • Input provided by the caller that is invalid on its own.
      • Incompatible metric is added to an unrelated part of the code
      • With global registry
        • Where simply importing another package can trigger a panic
        • If the newly imported package registers metrics in its init function
      • Steps with vanilla prometheus
        • You first create the metric/collectors
        • Decide explicitly if you want to register it with a local or the global registry
        • Decide if you want to handle the error or risk a panic.
      • Steps with vanilla promauto
        • Registration is automatic
        • If it fails, it will always panic.
        • Constructors will often be called in the var section of a file
          • Which means that panicking will happen as a side effect of merely importing a package.
  • push

    • Push registered metrics

Instrumenting Batch Jobs

  • Instrumenting long running services is straight-forward to think about. Batch jobs are somewhat harder to wrap the head around what to include, how to include etc.
  • See the “what to monitor/batch job” section in Observability for some more concrete reference

Gotchas/Tips

  • Cardinality
    • Don’t have job-run-id as a label due to cardinality, you can have by job-name.
      • But now how do you differentiate between different runs? For many things we’ll be using the gauge instrument in which only what’s the last value matters. In this case if you have multiple jobs and no differentiating label, the resulting metric for that series will be confusing.
        • In most of these cases you’ll have gauge with no labels, eg. outcome of a job can be indicated with a gauge with 0/1 value with no labels.
      • However, if it’s a type of job and there’s a guarantee that there will be no concurrent runs of it then not having the job-run-id is not a problem because the data will be time series and we’ll see gaps between runs when we visualize the data and things would make sense
    • If you want per row metric or something when running the job, you most of the time want to be logging that and process it later using recorded queries or something similar.
    • So basically two ways
      • Either you push the metrics for each run whenever it is done and are OK with losing some of the metrics
      • Or go with a separate tool that parses the output files(eg. logs) of your script runs and aggregates everything somehow
  • Model
    • It has to follow a push model instead of a pull one. Eg. As the script completes, it can push the metrics (e.g. req. duration histogram, CPU utilization, peak memory etc.)
    • This push model can be implemented via pushgateway and if you’re using OTEL(See Observability) then you can just use the OTEL Collector(eg. Grafana Alloy)

OTEL instrumentation

Metrics API

  • I think usually you’d just need one MeterProvider and is expected to be global
    • All configuration should happen at MeterProvider not in the Meter
  • Meter provides creation methods for various Instrument (s)
    • an Instrument is mapped to a Meter during “creation”

Instruments

  • NOTE: Whether an instrument is int/Float is identifying
  • Instrument names can be max 255 chars and case-insensitive.
  • unit is optional
  • Variants
    • Sync: Invoked inline with application/business processing logic.
      • Eg. How many bytes were transmitted
      • The Measurement can be associated with Context
    • Async: Register callbacks that invoked on demand (SDK collection)
      • Eg. Async gauge can collect info from source every 15s (i.e callback invoked every 15s)
      • The Measurement CANNOT be associated with Context
      • Callbacks
        • Callback functions will be called only when the Meter is being observed.
        • Callbacks can be registered/unregistered after after asynchronous instrumentation creation also
        • Callbacks should be reentrant safe
        • Multi-instrument callback
          • Callback that’s associated w multiple instruments
        • The API MUST treat observations from a single callback as logically taking place at a single instant, such that when recorded, observations from a single callback MUST be reported with identical timestamps.
        • The Async Instrument has register_callback and the callback itself will have unregister
      • Attributes
        • These can be registered at Instrument creation time
        • API should also be flexible to provide ad-hoc attributes at invocation time
    • UpDown
Instrument KindVariantOperationsEg.
Counter (+ve increment)SyncAdd (does not return anything)No. of req, No. of bytes
Counter (+ve increment)AsyncCallback (returns absolute value: list/observable result)CPU time by program, Total process heap size
Gauge (non-additive)SyncRecord, on-change -> Record (Based on change events)Background Noise
Gauge (non-additive)AsyncCallback, accessors -> Observable recordRoom temperature
UpDownCounter (additive)SyncAdd (takes in increment/delta value)Total heap size, No. of “active” req, No. of items in queue
UpDownCounter (additive)AsyncCallbackTotal heap size
HistogramSyncRecordReq. Duration, Size of Payload

Measurement

  • Consist of value:(attributes)

TODO Metrics SDK

https://opentelemetry.io/docs/specs/otel/metrics/sdk/ (TODO: To read, read only like 30% of it)

  • MeterProvider MUST provide a way to allow a Resource to be specified
    • shutdown method
    • ForceFlush method
      • Works only on push metrics
      • Does not work on pull metrics (by design)
  • Components
    • MetricReaders
    • MetricExporters
    • Views
      • Determines the aggregation type
        • (Drop, ~Default, Sum, Last Value, Explicit Bucket Histogram)
      • An instrumented library can provide both temperature and humidity, but the application developer might only want temperature.
      • If MeterProvider has no View registered, take the Instrument and apply the default Aggregation

Concepts

Relabeling

List of possible actions

Rules

  • source_labels specific
    • regex
      • When using regex you must have a replacement config set. It can be the capture group or a static string.
      • regex is matched against the value contained in the source_label
    • separator
      • Use this as a utility to join different label values into the target Eg.
    • target_label will be the final label and replacement will be the value of the target_label
  • action: labeldrop

Gotchas/Tips

  • Labels starting with a double underscore (__) are automatically removed after the last relabeling rule.
  • After relabeling, Prometheus sets the instance label to the value of __address__ if you do not set the instance label explicitly to another value before that.
  • service discovery(SD) mechanisms can provide a set of labels starting with __meta_ that contain discovery-specific metadata about a target.
  • labelmap is nice if you want to preserve the SD __meta labels with a different prefix