Infrastructure

tags : System Design, Systems, Technical Postmortems, Observability

High Availability

See Distributed Systems, See Data Replication

Resilience vs. Availability

See CAP

E.g. if you don't have a hot standby running that you can swap your workload onto, the orchestrator will probably reschedule the workload if anything bad happens to it. The workload comes back, but only after some downtime, so we can say it has resilience but not availability.

Topologies

Traditional Passive-Active

Two or more identical systems run simultaneously, but only one system (the active node) is processing data and serving requests at any given time.
- The others (passive nodes) remain in standby mode, ready to take over if the active node fails.
- The failover process typically involves transferring the workload, state, and connections from the failed node to the backup node.
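The failover described above can be sketched as a toy model. This is only an illustration: `Node`, `ActivePassiveCluster`, and the in-memory `state` dict are hypothetical names, and copying state at failover time stands in for whatever replication a real system would already be doing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    state: dict = field(default_factory=dict)


class ActivePassiveCluster:
    """One active node serves requests; the rest wait in standby."""

    def __init__(self, nodes: list[Node]) -> None:
        self.active = nodes[0]
        self.standbys = list(nodes[1:])

    def handle(self, key: str, value: str) -> str:
        # Fail over lazily when a request finds the active node down.
        if not self.active.healthy:
            self.failover()
        self.active.state[key] = value
        return self.active.name

    def failover(self) -> None:
        # Promote the first healthy standby and hand the state over to it.
        # (Assumption: a real system replicates state ahead of time.)
        failed = self.active
        for candidate in self.standbys:
            if candidate.healthy:
                candidate.state = dict(failed.state)
                self.standbys.remove(candidate)
                self.active = candidate
                return
        raise RuntimeError("no healthy standby available")


cluster = ActivePassiveCluster([Node("a"), Node("b")])
cluster.handle("x", "1")               # served by "a"
cluster.active.healthy = False         # simulate the active node failing
served_by = cluster.handle("y", "2")   # triggers failover, served by "b"
```

The passive node does no work until it is promoted, which is the cost of this topology: capacity sits idle in exchange for a fast takeover.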

Application-specific HA (Nomad-specific)

See Queues, Scheduling and orchestrator. Orchestrators such as Nomad offer alternative strategies:

reschedule: reschedule the group or job onto another node if a task fails or dies
restart: similar to reschedule, but restarts the task in place on the same node (see also check_restart)
migrate: move the job to a different node if the client is marked for draining
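How the three strategies relate can be sketched as a small decision function. This is a simplified, hypothetical model and not Nomad's scheduler logic: `choose_action`, the event names, and the `restart_attempts` parameter are made up for illustration, assuming that in-place restarts are tried first and rescheduling happens once they are exhausted.

```python
def choose_action(event: str, restarts_used: int, restart_attempts: int) -> str:
    """Pick a recovery strategy for a task (simplified, hypothetical model)."""
    if event == "node_draining":
        return "migrate"        # move the work off the draining client
    if event == "task_failed":
        if restarts_used < restart_attempts:
            return "restart"    # retry in place on the same node
        return "reschedule"     # local retries exhausted, place it elsewhere
    return "none"
```

For example, `choose_action("task_failed", 0, 2)` returns `"restart"`, while the same failure after two used restarts returns `"reschedule"`.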

HA architecture for some platform

For high availability, Redpanda Cloud uses a control plane and data plane architecture.

Control plane: This is where most cluster management, operations, and maintenance takes place. The control plane enforces rules in the data plane.

Data plane: This is where your cluster lives. The term data plane is used interchangeably with cluster.

Agent: Redpanda uses an agent to manage the data plane from the control plane.

Clusters are configured and maintained in the control plane, but they remain available even if the network connection to the control plane is lost.

Managing Environments

A cookbook recipe to define an environment, including its state, is crucial for smooth transitions from development to production. See git

TODO Deployments

TODO Links

AWS ECS and B&G / Canary Deployments

Meta notes

Vendors can have opinionated ways of deploying things
Some kind of service mesh helps. Heard good things about Linkerd

Rolling deployments

A rolling deploy starts sending traffic to the new application instances immediately, from the time that the first app instance starts.
Traffic flows to both versions at once, and the traffic naturally shifts from old versions to new versions as the old versions get killed off and the new versions get launched.
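Hence rolling deployments will usually be faster than B&G, since they do not wait for a full parallel set of instances to come up before shifting traffic.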
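A minimal sketch of how the traffic mix moves during a rolling deploy, assuming the load balancer spreads requests evenly across instances so that the share of new instances equals the share of traffic on the new version. `rolling_deploy` and its parameters are illustrative names.

```python
def rolling_deploy(total: int, batch_size: int) -> list[float]:
    """Return the share of instances on the new version after each batch."""
    new = 0
    shares = []
    while new < total:
        # Replace one batch of old instances with new ones.
        new = min(total, new + batch_size)
        shares.append(new / total)
    return shares


shares = rolling_deploy(total=4, batch_size=1)
# shares == [0.25, 0.5, 0.75, 1.0]
```

Both versions serve traffic at every step before the last one, so the two versions need to be compatible with each other for the duration of the rollout.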

Rolling with additional batches

Blue-Green deployment

You deploy your new version to green and test it thoroughly. Once you are done testing, you switch all traffic to green, and green becomes the new production environment. You then delete blue (the old production) and redeploy it; blue is now the new dev environment, where you start testing the next version.

What?
- Serve the current app on one half of your environment (Blue)
- Deploy your new application to the other (Green) without affecting the Blue environment.
Components
- A load balancer directs the traffic between Blue and Green.
- Once deployment and testing on Green are finished, we switch all traffic to it.
Pros/Cons
- After the switch, Blue (the previous production) can stay up as a hot standby.
- By definition, it launches an entire parallel set of new application instances first and starts shifting traffic from the old instances to the new ones only once they are all up, so it needs 2x the resources.
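The switch can be sketched as a router that points at exactly one environment at a time. This is a toy model under stated assumptions: `BlueGreenRouter` and the version strings are hypothetical, and a real setup would do the switch in a load balancer or DNS rather than in application code.

```python
class BlueGreenRouter:
    """Load balancer that sends all traffic to exactly one environment."""

    def __init__(self) -> None:
        self.versions = {"blue": "v1", "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # Deploy to the idle environment; live traffic is untouched.
        self.versions[self.idle] = version

    def switch(self) -> None:
        if self.versions[self.idle] is None:
            raise RuntimeError("nothing deployed to the idle environment")
        # Cut all traffic over at once; the old live side stays up as standby.
        self.live = self.idle

    def route(self):
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy("v2")
before = router.route()   # "v1", green is deployed but not live yet
router.switch()
after = router.route()    # "v2"
```

Rolling back is just another `switch()`, as long as the previous environment has not been torn down yet.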

Canary deployment

Canary is a bit different and more complex than blue/green.
Cut over just a small subset of servers or nodes first, before finishing the others.
Feature toggles fall under canary deployments.
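A minimal sketch of the "small subset first" idea. Note the swap: instead of cutting over a subset of servers, this splits by user, which is closer to how a percentage-based feature toggle behaves. `pick_version` and the user ids are hypothetical.

```python
import zlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Route a stable subset of users to the canary."""
    # crc32 gives the same bucket for the same user on every request.
    bucket = zlib.crc32(user_id.encode()) % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(1000)]
canary_at_5 = {u for u in users if pick_version(u, 5) == "canary"}
canary_at_20 = {u for u in users if pick_version(u, 20) == "canary"}
```

Because the bucket is stable, raising the percentage only adds users to the canary group; nobody who already saw the new version gets flipped back.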

Service Discovery

ECS

See AWS

Consul

Can run with only servers and no agent
Debatable whether we need it when we are already running inside ECS

Istio

Linkerd

Load Testing

Some notes when using the k6 tool

Terms

SUT : System Under Test

Whats and Whys

A load test will tell you how scalable your stuff is (See Scaling Databases)
What do we get out of a load test?
- Reliability: Validate reliability under expected traffic
- Discover: Discover problems and system limits under unusual traffic.

Load tests

Primary load test types

Unit load test
- Testing a single unit, like an API endpoint, in isolation.
- Isolated API: Test isolated API endpoints. (E.g. similar to Apache Benchmark)
Scenario load test
- Testing a real-world flow of interactions
- eg. a user logging in, starting some activity, waiting for progress, and then logging out.
- Essentially combining and re-ordering unit load tests
- Subgroups
  - Integrated API: Test APIs that interact with other internal/external API
  - E2E API Flow: Test interaction between APIs
- Goal
  - Test the target system with traffic that is consistent with what you’d see in the real world in terms of URLs/endpoints being hit.
  - Usually, this means making sure the most critical flows through your app are performant.

Check types

Name	Description
Smoke test	Verify the system functions with minimal load.
“Average” load test	Discover how the system functions with typical traffic.
Stress test	Discover how the system functions with the load of peak traffic.
Spike test	Discover how the system functions with sudden and massive increases in traffic.
Breakpoint test	Progressively ramp traffic to discover system breaking points.
Soak test	Discover whether or when the system degrades under loads of longer duration.
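The test types in the table differ mostly in the shape of the load over time. A rough sketch, assuming each profile is a list of ramp stages with linear interpolation between targets; the `PROFILES` numbers and the `load_at` helper are made up for illustration and are not recommendations.

```python
# (duration_seconds, target_load) stages; all numbers are illustrative only.
PROFILES = {
    "smoke": [(60, 3)],                                # minimal load, short
    "average": [(300, 100), (1800, 100), (300, 0)],    # ramp, hold, ramp down
    "spike": [(60, 1000), (60, 0)],                    # sudden jump, quick drop
    "soak": [(300, 100), (28800, 100), (300, 0)],      # same load, much longer
}


def load_at(stages: list[tuple[int, int]], t: float, start: float = 0) -> float:
    """Linearly interpolate the target load at time t (seconds)."""
    previous = start
    elapsed = 0
    for duration, target in stages:
        if t <= elapsed + duration:
            fraction = (t - elapsed) / duration
            return previous + (target - previous) * fraction
        elapsed += duration
        previous = target
    return previous  # after the last stage, hold the final target


halfway_up_spike = load_at(PROFILES["spike"], 30)   # 500.0
```

A stress test would look like the "average" profile with a higher plateau, and a breakpoint test like a single long ramp with no plateau at all.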

Performance Tests

We want to check for
- Latency
  - How fast the system responds
  - http_req_duration
- Availability
  - How often the system returns errors
  - http_req_failed
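What these two numbers boil down to can be sketched from raw samples. This is only the idea, using a nearest-rank percentile and made-up samples; k6 computes its own aggregates for `http_req_duration` and `http_req_failed`, possibly with a different percentile method.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# (duration_ms, failed) samples; made-up numbers for illustration.
samples = [(80, False), (95, False), (110, False), (120, False), (900, True)]

durations = [duration for duration, _ in samples]
p95 = percentile(durations, 95)                               # latency
failure_rate = sum(failed for _, failed in samples) / len(samples)  # availability
```

With only five samples the single slow request dominates the p95, which is one reason to look at percentiles over a reasonably long run rather than at averages.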

Checklist

Know the traffic pattern we want to test for
Need to have a goal
- E.g. we might just want our API, app, or site to respond instantly (<=100ms).
- E.g. above what level is a response time not acceptable, and/or what is an acceptable request failure rate?
Decide on test type
- Know if we want to do Unit load test or Scenario load test
- Goal -> Test Type -> Test Load

Testing with k6

Workloads

By virtual users: vus, duration, iterations
By request rate: per second or per minute
- Check the constant-arrival-rate executor
- Check the ramping-arrival-rate executor
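The two ways of defining a workload are linked by simple arithmetic (Little's law). A sketch, assuming every iteration takes the same time; `vus_needed` and `rate_from_vus` are illustrative helpers, not k6 functions.

```python
import math


def vus_needed(rate_per_second: float, iteration_seconds: float) -> int:
    """Open model: VUs required to sustain a fixed arrival rate."""
    # Little's law: concurrent iterations = arrival rate x iteration duration.
    return math.ceil(rate_per_second * iteration_seconds)


def rate_from_vus(vus: int, iteration_seconds: float) -> float:
    """Closed model: throughput that a fixed pool of VUs can produce."""
    return vus / iteration_seconds


needed = vus_needed(rate_per_second=50, iteration_seconds=0.5)   # 25
produced = rate_from_vus(vus=10, iteration_seconds=0.5)          # 20.0
```

The practical difference: with a fixed number of VUs, a slower SUT lowers the request rate, so the test backs off exactly when the system struggles. With an arrival-rate workload the rate stays fixed and more VUs are needed instead.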

Other features

Checks
- Can be used to verify application logic
- Can check for API responses and status code etc.
Thresholds
- Can be used to test SLO/Reliability
- Set the test pass/fail criteria.
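The split between the two can be sketched in plain Python: checks look at individual responses and only record an outcome, while a threshold looks at an aggregate and decides pass/fail for the whole run. The response dicts, `run_checks`, and `thresholds_pass` are hypothetical stand-ins, not the k6 API.

```python
def run_checks(response: dict) -> dict[str, bool]:
    """Checks: per-response assertions. A failed check is recorded, not fatal."""
    return {
        "status is 200": response["status"] == 200,
        "body has id": "id" in response["body"],
    }


def thresholds_pass(responses: list[dict], min_check_rate: float) -> bool:
    """Threshold: an aggregate pass/fail criterion over the whole run."""
    outcomes = [ok for r in responses for ok in run_checks(r).values()]
    return sum(outcomes) / len(outcomes) >= min_check_rate


# Made-up responses for illustration: 6 of 8 checks pass (rate 0.75).
responses = [
    {"status": 200, "body": {"id": 1}},
    {"status": 200, "body": {"id": 2}},
    {"status": 500, "body": {}},
    {"status": 200, "body": {"id": 4}},
]

lenient = thresholds_pass(responses, min_check_rate=0.7)   # True
strict = thresholds_pass(responses, min_check_rate=0.9)    # False
```

In k6, failed checks alone do not fail the test; a threshold on the check pass rate is what turns them into a pass/fail signal.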

Networking (Data Center)

Others

I’ll think twice before using GitHub Actions again | Hacker News

🐏 mogoz
