tags : System Design, Systems, Technical Postmortems, Observability
High Availability
- See Distributed Systems, See Data Replication
Resilience v/s Availability
See CAP
- Eg. If you don’t have a hot server running that you can hotswap your workload with, then the orchestrator will probably reschedule the workload if anything bad happens to it. In this case, we can say it has resilience but not availability.
Topologies
Traditional Passive-Active
- Two or more identical systems running simultaneously, but only one system (the active node) is actively processing data and serving requests at any given time
- The others (passive nodes) remain in standby mode, ready to take over if the active node fails.
- This failover process typically involves transferring the workload, state, and connections from the failed node to the backup node
Application specific HA (Nomad specificc)
See Queues, Scheduling and orchestrator Orchestators such as Nomad offer alternative strategies
- reschedule: reschedule the group or job if a task fails or dies
- restart: similar to reschedule, but just restarting in place — see also check_restart
- migrate: move the job to a different node if the client is marked for draining
Managing Environments
A cookbook recipe to define an environment, including its state, is crucial for smooth transitions from development to production. See git
TODO Deployments
TODO Links
- Elastic beanstalk
- Others
- https://www.reddit.com/r/kubernetes/comments/10din8a/questions_about_bluegreen_canary_deployments/
- https://www.reddit.com/r/kubernetes/comments/g645k6/how_is_everyone_doing_basic_application_level/
- https://circleci.com/blog/canary-vs-blue-green-downtime/
- https://www.reddit.com/r/kubernetes/comments/13bx8o1/how_to_decide_bluegreen_canary_and_rolling/
- https://www.reddit.com/r/aws/comments/18dk1xh/bluegreen_deployment_with_aws_codepipeline/
- https://www.reddit.com/r/aws/comments/zr5qbs/why_does_this_aws_whitepaper_say_that_rolling/
- https://www.reddit.com/r/aws/comments/mz4q2m/ecs_canarybluegreen_deployments_for_a_monolith/
- https://aws.amazon.com/blogs/containers/aws-codedeploy-now-supports-linear-and-canary-deployments-for-amazon-ecs/
- https://aws.amazon.com/blogs/containers/create-a-pipeline-with-canary-deployments-for-amazon-ecs-using-aws-app-mesh/
- https://github.com/aws/copilot-cli/issues/5326
- https://octopus.com/blog/ecs-canary-deployments
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html
AWS ECS and B&G / Canary Deployments
- AWS CodeDeploy now supports linear and canary deployments for Amazon ECS | Containers
- Create a pipeline with canary deployments for Amazon ECS using AWS App Mesh | Containers
- Canary rollout · Issue #5326 · aws/copilot-cli · GitHub
- Support Blue/Green Deployment · Issue #1758 · aws/copilot-cli · GitHub
Meta notes
- Vendors can have opinionated ways of deploying things
- Some kind of service mesh helps. Heard good things about Linkerd
Rolling deployments
- A rolling deploy starts sending traffic to the new application instances immediately, from the time that the first app instance starts.
- Traffic flows to both versions at once, and the traffic naturally shifts from old versions to new versions as the old versions get killed off and the new versions get launched.
- Rolling deployments will be faster than B&G hence
Rolling with additional batches
Blue-Green deployment
- You deploy your new version to green, test and test it. Once you are done testing, you switch all traffic to green and green is the new production environment. You delete blue (the old production) and redeploy it, blue is now the new dev environment and you start testing the next version
- What?
- Serve the current app on one half of your environment (Blue)
- Deploy your new application to the other (Green) without affecting the Blue environment.
- Components
load balancer
directs the traffic between B&G- Once deployment and test on
green
is finished, we switch entire traffic to it.
- Pros/Cons
- After switching and finishing deployment, Green(previously blue) can be a hot standby
- By definition, launches an entire parallel set of new application instances first, and then once they are all up it starts shifting traffic over from the old application instances to the new application instances. So will need 2x the resources.
Canary deployment
- Canary is a bit different and more complex than blue/green.
- Cut over just a small subset of servers or nodes first, before finishing the others.
- Feature toggle comes under canary dev
Service Discovery
ECS
See AWS
Consul
- Can run with only servers and no agent
- Debatable if we need it when we running inside ECS already
Istio
Linkerd
Load Testing
Some notes when using the k6 tool
Terms
- SUT : System Under Test
Whats and Whys
- A load test will tell you how scalabale you stuff is (See Scaling Databases)
- What we get out of a load test?
- Reliability: Validate reliability under expected traffic
- Discover: Discover problems and system limits under unusual traffic.
Load tests
Primary load test types
- Unit load test
- Testing a single unit, like an API endpoint, in isolation.
- Isolated API: Test isloated API endpoints. (Eg. similar to apache benchmark)
- Scenario load test
- Testing a real-world flow of interactions
- eg. a user logging in, starting some activity, waiting for progress, and then logging out.
- Essentially combining and re-ordering unit load tests
- Subgroups
- Integrated API: Test APIs that interact with other internal/external API
- E2E API Flow: Test interaction between APIs
- Goal
- Test the target system with traffic that is consistent with what you’d see in the real world in terms of URLs/endpoints being hit.
- Usually, this means making sure the most critical flows through your app are performant.
Check types
Name | Desc |
---|---|
Smoke test | Verify the system functions with minimal load. |
“Average” load test | Discover how the system functions with typical traffic. |
Stress test | Discover how the system functions with the load of peak traffic. |
Spike test | Discover how the system functions with sudden and massive increases in traffic. |
Breakpoint test | Progressively ramp traffic to discover system breaking points. |
Soak test | Discover whether or when the system degrades under loads of longer duration. |
Performance Tests
- We want to check for
- Latency
- How fast the system responds
http_req_duration
- Availability
- How often the system returns errors
http_req_failed
- Latency
Checklist
- Know the traffic pattern we want to test for
- Need to have a goal
- Eg. We might just want our API, app, or site to respond instantly (<=100ms
- Eg. Above what level is a response time not acceptable, and/or what is an acceptable request failure rate.
- Decide on test type
- Know if we want to do Unit load test or Scenario load test
- Goal -> Test Type -> Test Load
Testing with k6
Workloads
- by Virtual users:
vus, duration, iteration
- by Req rate: per second or per minute
- Check
constant arrival rate executor
- Check
ramping-arrival-rate executor
- Check
Other features
Checks
- Can be used to verify application logic
- Can check for API responses and status code etc.
Thresholds
- Can be used to test SLO/Reliability
- set the test pass/fail criteria.