tags : System Design, Systems, Technical Postmortems, Observability

High Availability

Resilience vs. Availability

See CAP

  • E.g. if there is no hot standby server to swap the workload onto, the orchestrator will probably reschedule the workload when something goes wrong with it. The workload recovers, but it is down while it is being rescheduled, so we can say the setup has resilience but not availability.

Topologies

Traditional Passive-Active

  • Two or more identical systems running simultaneously, but only one system (the active node) is actively processing data and serving requests at any given time
    • The others (passive nodes) remain in standby mode, ready to take over if the active node fails.
  • This failover process typically involves transferring the workload, state, and connections from the failed node to the backup node
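The failover step described above can be sketched as a toy model. This is a minimal sketch, not how any particular HA product works: the `Node` and `ActivePassiveCluster` names are hypothetical, and state is copied from the failed node at failover time purely for simplicity (a real system would typically replicate state to the standbys continuously).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    state: dict = field(default_factory=dict)


class ActivePassiveCluster:
    """Only the active node serves requests; the others wait on standby."""

    def __init__(self, nodes):
        self.active = nodes[0]
        self.standbys = list(nodes[1:])

    def handle(self, key, value):
        # Fail over lazily: check the active node before serving the request.
        if not self.active.healthy:
            self.failover()
        self.active.state[key] = value
        return self.active.name

    def failover(self):
        # Promote the first healthy standby and hand it the workload state.
        # Assumption: the failed node's state is still readable here.
        failed = self.active
        for i, node in enumerate(self.standbys):
            if node.healthy:
                self.active = self.standbys.pop(i)
                self.active.state = dict(failed.state)
                return
        raise RuntimeError("no healthy standby available")
```

Marking the active node unhealthy and calling `handle` again makes the next healthy standby take over with the state it inherited.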

Application-specific HA (Nomad specific)

See Queues, Scheduling and orchestrator. Orchestrators such as Nomad offer alternative strategies:

  • reschedule: reschedule the group or job if a task fails or dies
  • restart: similar to reschedule, but just restarting in place — see also check_restart
  • migrate: move the job to a different node if the client is marked for draining
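How the three strategies relate can be sketched as a small decision function. This is an illustration only, not Nomad's scheduler logic: the function name, its parameters, and the rule "reschedule once in-place restarts are used up" are assumptions made for the sketch.

```python
def recovery_action(node_draining: bool, task_failed: bool, restarts_left: int) -> str:
    """Pick a recovery strategy in the spirit of migrate / restart / reschedule."""
    if node_draining:
        return "migrate"      # move the work off a node marked for draining
    if task_failed and restarts_left > 0:
        return "restart"      # retry in place on the same node
    if task_failed:
        return "reschedule"   # in-place retries used up: place it again
    return "none"             # nothing to recover from
```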

Managing Environments

A cookbook recipe to define an environment, including its state, is crucial for smooth transitions from development to production. See git

TODO Deployments

AWS ECS and B&G / Canary Deployments

Meta notes

  • Vendors can have opinionated ways of deploying things
  • Some kind of service mesh helps. Heard good things about Linkerd

Rolling deployments

  • A rolling deploy starts sending traffic to the new application instances immediately, from the time that the first app instance starts.
  • Traffic flows to both versions at once, and the traffic naturally shifts from old versions to new versions as the old versions get killed off and the new versions get launched.
  • Because traffic starts shifting as soon as the first new instance is up, rolling deployments are usually faster than B&G.
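The way traffic shifts during a rolling deploy can be sketched with a small simulation. It assumes the load balancer spreads traffic evenly across all running instances, which is a simplification; `rolling_deploy` and its parameters are hypothetical names.

```python
def rolling_deploy(total: int, batch_size: int):
    """Yield (old, new) instance counts after each batch is replaced."""
    old, new = total, 0
    while old > 0:
        step = min(batch_size, old)
        old -= step   # kill a batch of old instances
        new += step   # launch the same number of new instances
        yield old, new


for old, new in rolling_deploy(total=4, batch_size=1):
    # With even load balancing, traffic share follows the instance share.
    share = new / (old + new)
    print(f"old={old} new={new} new-version share={share:.0%}")
```

Both versions serve traffic until the last old instance is gone, so the two versions must be able to coexist.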

Rolling with additional batches

Blue-Green deployment

  • You deploy your new version to Green and test it thoroughly. Once testing is done, you switch all traffic to Green, which becomes the new production environment. You then delete and redeploy Blue (the old production), which becomes the new dev environment where you test the next version.
  • What?
    • Serve the current app on one half of your environment (Blue)
    • Deploy your new application to the other (Green) without affecting the Blue environment.
  • Components
    • load balancer directs the traffic between B&G
    • Once deployment and testing on Green are finished, we switch all traffic to it.
  • Pros/Cons
    • After the switch is finished, Blue (the previous production) can be kept as a hot standby.
    • By definition, B&G launches an entire parallel set of new application instances first, and only once they are all up does it start shifting traffic from the old instances to the new ones. So it needs 2x the resources.
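The load balancer's role in B&G can be sketched as a router that sends all traffic to exactly one environment at a time. This is a minimal sketch with hypothetical names (`BlueGreenRouter`, `deploy`, `switch`); a real setup would flip a load balancer target or DNS record instead of a Python attribute.

```python
class BlueGreenRouter:
    """Load balancer stand-in: all traffic goes to exactly one environment."""

    def __init__(self, blue_version: str):
        self.envs = {"blue": blue_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # Deploy to the idle environment; live traffic is unaffected.
        self.envs[self.idle] = version

    def switch(self) -> None:
        if self.envs[self.idle] is None:
            raise RuntimeError("nothing deployed to the idle environment")
        # The previously live environment becomes the hot standby.
        self.live = self.idle

    def route(self) -> str:
        return self.envs[self.live]
```

Calling `switch()` a second time is the rollback: traffic returns to the environment that was kept as a standby.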

Canary deployment

  • Canary is a bit different and more complex than blue/green.
  • Cut over just a small subset of servers or nodes first, before finishing the others.
  • Feature toggles can be seen as a form of canary deployment.
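Cutting over a small subset of nodes first can be sketched as below. The function, its parameters, and the roll-back-on-unhealthy behaviour are assumptions for illustration; `is_healthy` stands in for whatever health signal is actually used (error rate, latency, manual check).

```python
def canary_rollout(nodes, new_version, canary_count, is_healthy):
    """Upgrade a small subset first; continue only if the canaries look healthy.

    nodes maps node name -> current version. Returns (versions, completed).
    """
    versions = dict(nodes)  # work on a copy, leave the input untouched
    names = list(versions)
    canaries, rest = names[:canary_count], names[canary_count:]
    previous = {name: versions[name] for name in canaries}

    for name in canaries:
        versions[name] = new_version

    if not all(is_healthy(name) for name in canaries):
        versions.update(previous)  # roll the canaries back
        return versions, False

    for name in rest:
        versions[name] = new_version
    return versions, True
```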

Service Discovery

ECS

See AWS

Consul

  • Can run with only servers and no client agents
  • Debatable whether we need it when we are already running inside ECS

Istio

Linkerd

Load Testing

Some notes when using the k6 tool

Terms

  • SUT : System Under Test

Whats and Whys

  • A load test will tell you how scalable your stuff is (See Scaling Databases)
  • What do we get out of a load test?
    • Reliability: Validate reliability under expected traffic
    • Discover: Discover problems and system limits under unusual traffic.

Load tests

Primary load test types

  • Unit load test
    • Testing a single unit, like an API endpoint, in isolation.
    • Isolated API: Test isolated API endpoints. (E.g. similar to Apache Benchmark)
  • Scenario load test
    • Testing a real-world flow of interactions
    • eg. a user logging in, starting some activity, waiting for progress, and then logging out.
    • Essentially combining and re-ordering unit load tests
    • Subgroups
      • Integrated API: Test APIs that interact with other internal/external API
      • E2E API Flow: Test interaction between APIs
    • Goal
      • Test the target system with traffic that is consistent with what you’d see in the real world in terms of URLs/endpoints being hit.
      • Usually, this means making sure the most critical flows through your app are performant.

Check types

Name                | Desc
Smoke test          | Verify the system functions with minimal load.
"Average" load test | Discover how the system functions with typical traffic.
Stress test         | Discover how the system functions with the load of peak traffic.
Spike test          | Discover how the system functions with sudden and massive increases in traffic.
Breakpoint test     | Progressively ramp traffic to discover system breaking points.
Soak test           | Discover whether or when the system degrades under loads of longer duration.
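The test types above differ mostly in the shape of the load over time. One way to picture this is as a list of ramp stages, each a (duration, target) pair, with the load interpolated linearly between targets. The sketch below is a plain Python model of that idea, not k6 code, and every number in the example profiles is made up for illustration.

```python
def load_at(stages, t, start=0):
    """Linearly interpolate the target load at time t (seconds).

    stages is a list of (duration_seconds, target) pairs, applied in order.
    """
    current = start
    for duration, target in stages:
        if duration > 0 and t <= duration:
            return current + (target - current) * t / duration
        t -= duration
        current = target
    return current  # past the last stage: hold the final target


# Hypothetical profiles; the numbers are illustrative only.
average_load = [(300, 100), (1800, 100), (300, 0)]   # ramp up, hold, ramp down
spike = [(120, 2000), (60, 0)]                       # fast ramp, no plateau
```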

Performance Tests

  • We want to check for
    • Latency
      • How fast the system responds
      • http_req_duration
    • Availability
      • How often the system returns errors
      • http_req_failed

Checklist

  • Know the traffic pattern we want to test for
  • Need to have a goal
    • E.g. we might just want our API, app, or site to respond instantly (<=100ms)
    • Eg. Above what level is a response time not acceptable, and/or what is an acceptable request failure rate.
  • Decide on test type
    • Know if we want to do Unit load test or Scenario load test
    • Goal -> Test Type -> Test Load

Testing with k6

Workloads

  • by Virtual users: vus, duration, iterations
  • by Req rate: per second or per minute
    • Check the constant-arrival-rate executor
    • Check the ramping-arrival-rate executor
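The two ways of describing a workload are related by simple arithmetic: if each iteration takes some time to finish, a target request rate implies a minimum number of concurrent virtual users, and a fixed number of virtual users implies a maximum rate. The helpers below are a back-of-the-envelope sketch with hypothetical names, assuming one request per iteration and a roughly constant iteration time.

```python
import math


def vus_needed(rate_per_second: float, iteration_seconds: float) -> int:
    """Concurrent VUs needed to sustain a target iteration rate."""
    return math.ceil(rate_per_second * iteration_seconds)


def max_rate(vus: int, iteration_seconds: float) -> float:
    """Upper bound on iterations per second for a fixed number of VUs."""
    return vus / iteration_seconds
```

If the SUT slows down under load, iterations take longer, so a rate-based test needs more virtual users than this estimate while a VU-based test simply produces a lower rate.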

Other features

  • Checks
    • Can be used to verify application logic
    • Can check for API responses and status code etc.
  • Thresholds
    • Can be used to test SLO/Reliability
    • set the test pass/fail criteria.
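The pass/fail idea behind thresholds can be sketched in a few lines: compare each measured metric against a limit and fail the run if any comparison fails. This is a plain Python model, not k6's threshold syntax; the metric names and all numbers are hypothetical.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def evaluate(metrics, thresholds):
    """Return {metric: passed} for thresholds given as (metric, op, limit)."""
    return {name: OPS[op](metrics[name], limit) for name, op, limit in thresholds}


# Hypothetical measurements and SLO-style limits.
metrics = {"http_req_duration_p95_ms": 180.0, "http_req_failed_rate": 0.02}
thresholds = [
    ("http_req_duration_p95_ms", "<", 200.0),  # latency goal
    ("http_req_failed_rate", "<", 0.01),       # availability goal
]

results = evaluate(metrics, thresholds)
passed = all(results.values())  # the run fails if any threshold is missed
```

Expressing the goal from the checklist as explicit limits like this is what lets a load test run unattended, for example in CI.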