tags : Kubernetes, Infrastructure

These notes are from when I was trying to run Nomad on Hetzner

Intro

  • Nomad server is just a workload orchestrator. It is only concerned with things like bin packing and scheduling decisions.
  • Nomad doesn’t interfere with your DNS setup, service discovery, secrets management mechanisms, or pretty much anything else

Concepts

Availability of “Platform” v/s Availability of “Application”

  • When we talk about nomad fault tolerance, we’re talking about “availability of platform/infra”
  • Application availability is controlled by the migrate stanza in the job configuration

Implications of Raft consensus for Nomad servers

See Consensus Protocols

  • Raft is used between servers
  • Nomad, like many other distributed systems, uses the raft consensus algorithm to determine who is the leader of the server cluster.
  • In nomad, the IP address is a component of the member ID in the raft data.
    • So if the IP address of a node changes, you will either have to do peers.json recovery or wipe your state.
  • Fault Tolerance

    With FT, the state is replicated across servers

    • WE WANT TO MAINTAIN QUORUM AT ALL COSTS
    • If we lose quorum, then
      • You will still be able to perform read actions
      • But you won’t be able to change the state until quorum is re-established and a new leader is elected.
    • Ideally you’d want to run the servers in a different failure domain than the client(s)
    • 2 servers

      • 2 servers = quorum of 2 (i.e. we need at least 2 servers to elect a leader and maintain quorum)
      • If you have 2 servers and one fails, it’s impossible to elect a leader, since the surviving server alone never holds a winning (majority) vote.
        • For this reason, running two servers gives you zero fault tolerance
      • If one server fails, we lose quorum, hence no leader for the entire cluster.
        • Without a leader: no state writes, no re-schedules, no cluster state changes.
    • 3 servers

      • 3 servers = quorum of 2 (i.e. we need at least 2 servers to elect a leader and maintain quorum)
      • In this case, if we lose 1 server agent, quorum is still maintained and things keep running as expected.

FAQ

Sidecar pattern inside a task group

  • Main task: a task that does not have a lifecycle block.
  • Sidecar tasks: tasks with a lifecycle block (init/post/sidecar)
    • sidecar set to true means a long-running sidecar, else an ephemeral task
  • For the log shipper pattern, we also want to set leader:true in the main task (see the sketch below)
    • When the main task completes, all other tasks within the group will be gracefully shut down.
    • The log shipper should set a high enough kill_timeout such that it can ship any remaining logs before exiting.
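
A minimal sketch of this pattern, assuming the docker task driver; the task and image names are illustrative, not from these notes:

group "app" {
  task "server" {
    driver = "docker"
    leader = true # when this task exits, sibling tasks are gracefully shut down
    config {
      image = "example/app:latest"
    }
  }

  task "log-shipper" {
    driver = "docker"
    lifecycle {
      hook    = "poststart"
      sidecar = true # runs for the life of the allocation
    }
    kill_timeout = "30s" # time to flush remaining logs before being killed
    config {
      image = "example/log-shipper:latest"
    }
  }
}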

Allocation and Port collision for static port

  • When we use a static port, we occupy that port on the host machine/node
  • When we set a task group count > 1 with a static port, and our cluster has fewer clients than that count, there is no way to place all the allocations. We either use dynamic ports, or go add a new client (node), or something (see the sketch below).
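
A sketch of the colliding case; the group name and port number are illustrative:

# With count = 2 and a static port, each allocation reserves host port 8080,
# so placement needs at least 2 clients.
group "web" {
  count = 2
  network {
    port "http" {
      static = 8080
    }
  }
}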

service block & network block (can there be multiple?)

| Block Name | Level | What | Multiple |
| service | task and task group (the latter added later for Consul Connect support) | How other services will find the task | Yes; multiple network:port and multiple services is a good combo, eg. expose metrics and web traffic on different ports |
| network | group | How the TG connects to the host/node | Not yet, but can have multiple ports |
  • Note
    • If you have multiple services in a task group then you need to explicitly specify names for all of them. You can omit the name of at most one service definition inside a Task block.

Hierarchy

  • Control

    - Job
      - Task Group(s) / Allocation(s)
        - Task(s)
    | Type | Control | How many |
    | Job | Controlled externally (job runner/Airflow/CI-CD etc.) | As many as the business needs |
    | Task Group | No Nomad-specific ordering; they start in parallel, but we can wait on things | Can be a singleton (eg. PostgreSQL), or multiple independent TGs (web & db), or dependent on some other service by waiting |
    | Task | lifecycle blocks; these live inside the task group and don’t control how TGs themselves are ordered | Based on the idea of a main task and supporting tasks (init/cleanup/sidecar etc.) |
  • TODO Communication

    - Federation
    - Gossip
    - RPC
    - See the architecture doc

TODO Allocation(task group) single or multiple tasks

  • This really depends, but stick to “group them only if they really need to be on the same client”
  • If you have a singleton PostgreSQL, say, then 1 TG : 1 task (plus sidecars) would be good
  • If you have another application where you run a custom db and a web app, these two can be separate task groups

Terms

  • Task Group
    • Group of tasks that need to run on the same client agent.
    • Usually a host/node will only run one client, so put things in the same task group when we need them to run together. (Eg. shared filesystem, low latency etc)
  • Allocation
    • It’s what a Task Group becomes once submitted and scheduled, i.e. the "count" of a task group = no. of allocations for that task group. It’s the same thing in different forms, ice and water kind.
    • An allocation is the space/resources dedicated to a task group on a machine.
    • Allocations can land on different machines/hosts/nodes/clients but, as the task group promises, each individual allocation will run on a single node/host
      • Eg. If we set count=100, there will be 100 allocations across nodes
  • Deployment
    • A deployment is the actual object that represents a change or update to a job.
    • At the moment, the concept of deployment only really applies to service jobs
    • update block
      • Update strategy for new allocation
  • Evaluation
    • It’s the submission to the scheduler. Eg. After successful evaluation of the job, a task group becomes an allocation

Configuration

Setup Config

See https://gist.github.com/thimslugga/f3bbcaf5a173120de007e2d60a2747f7

Nomad

Server agent

Client agent

Workload Config

  • Workload configuration is the JobSpec
    • This can be managed using Terraform? (Q: Do we really need TF here?)

    • The JobSpec itself sometimes needs templating

      • eg. We cannot directly provide a local file from the laptop. In these cases we can either use Terraform variables as seen here, use HCL2, or maybe something else.
      • I am of the opinion that we should use Terraform only for infra provisioning (the backbone) and not for deployments, so we’d try to avoid Terraform for deploying services as much as we can, unless we don’t have an option.
    • Note about HCL2 file function usage in nomad

      Functions are evaluated by the CLI during configuration parsing rather than job run time, so this function can only be used with files that are already present on disk on operator host.

Application arguments

  • This is done via task > config > args, which also supports “variable interpolation” (see the sketch below)
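
A small sketch, assuming the docker driver and a group network port labelled http; the image and flag are illustrative:

task "app" {
  driver = "docker"
  config {
    image = "example/app:latest"
    args  = ["--port", "${NOMAD_PORT_http}"] # interpolated at placement time
  }
}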

application configuration file

  • The template stanza helps in writing files to NOMAD_TASK_DIR (via destination), which is isolated for each task and accessible to that task
  • This can later be supplied to the task in various ways
    • bind mounts for the docker task driver
    • Directly accessing local (NOMAD_TASK_DIR) if using the exec task driver
  • For the docker task driver the flow is like this (see the sketch below)
    • First get the file into NOMAD_TASK_DIR using template, or a combination of template and artifact
    • Then bind mount the file from local (NOMAD_TASK_DIR) into the container
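
A sketch of that flow; the image, file contents, and paths are illustrative:

task "web" {
  driver = "docker"

  # render the config file into NOMAD_TASK_DIR (local/)
  template {
    data        = "listen_port = {{ env \"NOMAD_PORT_http\" }}"
    destination = "local/app.conf"
  }

  config {
    image = "example/web:latest"
    # bind mount the rendered file into the container
    mount {
      type     = "bind"
      source   = "local/app.conf"
      target   = "/etc/app/app.conf"
      readonly = true
    }
  }
}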

Env Vars and Secrets Management

  • Environment variables

    • env stanza
      • Directly specify, supports interpolation
    • template stanza
      • Can be used to generate env vars from files on disk, Consul keys, or secrets from Vault
  • Nomad variables

    • Securely stores encrypted and replicated secrets in Nomad’s state store.
    • Nomad Variables are not meant to be a Vault replacement. They store small amounts of data which also happen to be encrypted and replicated. My advice is to treat Nomad Variables as a stopgap until you are ready to transition to Vault.
    • As of now these can be consumed using the template stanza, writing them into a file in the nomad task dir and setting env=true in the template stanza (see the sketch after this list)
    • ACLs are preconfigured for certain paths
    • Two ways of using nomad variables
      • Using an explicit path
        • {{ with nomadVar "nomad/jobs/redis" }}{{ .maxconns }}{{ end }}
      • Using all of the nomad variables available to the task via nomadVarList
      • I prefer explicitly writing the path (1st choice)
  • Consul

    • Using it with the {{ key }} and {{ secret }} template functions if you know the variable path
    • Vault
    • KV
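
A sketch of the Nomad variables flow mentioned above; the variable path and key are illustrative:

template {
  data        = <<EOF
{{ with nomadVar "nomad/jobs/redis" }}MAXCONNS={{ .maxconns }}{{ end }}
EOF
  destination = "secrets/env"
  env         = true # export the rendered KEY=value pairs as env vars
}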

Nomad x GPU / cuda

  • We’ll first need the NVIDIA Container Toolkit installed on the host machine.
    • On nixos it’s just: virtualisation.docker.enableNvidia = true; (also needs 32bit support in graphics for some reason)

Auxiliaries

Plugins

Volumes and Persistence

  • We have
    • Nomad volumes (host volumes, filesystem or mounted network filesystem)
    • Container Storage Interface (CSI) plugins
    • Docker volumes (supported but not recommended, plus the nomad scheduler will not be aware of them)
  • So we either want to use Nomad volumes or CSI
    • We have to declare it in the client agent config and then use that volume in our job (see the sketch after this list)
  • Docker driver specific

    • As mentioned before, better to use nomad volumes instead of docker volumes
    • But the docker mount config in the nomad docker driver can be useful sometimes. Currently supports the volume, bind, and tmpfs types
      • A usecase could be config loading, but the artifact stanza also helps with that
        • the template stanza is enough for local files; for remote files, we might want to use artifact
      • Eg. The official Caddy image takes in mount points, in which case we can use mount for the data directory and the config file.
        • CONFIRM: This can be a combination of volume (data directory) and bind mount (config file via the artifact and template stanzas)
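
A sketch of the Nomad host volume flow; the volume name, paths, and image are illustrative:

# client agent config
client {
  host_volume "pgdata" {
    path      = "/srv/pgdata"
    read_only = false
  }
}

# jobspec
group "db" {
  volume "pgdata" {
    type      = "host"
    source    = "pgdata"
    read_only = false
  }

  task "postgres" {
    driver = "docker"
    config {
      image = "postgres:16"
    }
    volume_mount {
      volume      = "pgdata"
      destination = "/var/lib/postgresql/data"
    }
  }
}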

Features

Restarts & Checks

update > healthcheck > check_restart > restart > fail? > reschedule

restart

# example

# if things don't come up based on check_restart:
# - attempts: restart the task 2 times
# - interval: within a 5m window, with 25s (delay) gaps
# - mode: if the service still doesn't come up, declare it failed
restart {
  interval = "5m"
  attempts = 2
  delay    = "25s"
  mode     = "fail"
}
  • This can be set in the task group and task level. Values are inherited and merged.
  • The set number of restart “attempts” may happen within the set “interval” before nomad does what “mode” says
  • mode : delay makes sense when re-running the job after the interval would possibly make it succeed; otherwise, we’d go with fail

check_restart

  • When the restart happens is controlled by the check_restart stanza
  • grace : This is important because we want to wait for the container to come up before we evaluate the healthcheck results

check (Health Checks for Services)

  • This is just the health check; by itself it does not trigger any activity
  • The check_restart stanza can be set inside a check block to specify what happens when the health check fails (see the sketch below)
    • The restart block then determines how often and until when restarts are attempted.
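
A sketch tying the pieces together; the service name, port label, and path are illustrative:

service {
  name = "web"
  port = "http"

  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"

    check_restart {
      limit = 3     # restart the task after 3 consecutive failed checks
      grace = "30s" # let the container come up before evaluating the check
    }
  }
}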

Shutdown Delay

  • For services, we have shutdown_delay
    • Useful if the application itself doesn’t handle graceful shutdown on the kill_signal
    • The configured delay provides a period of time in which the service is no longer registered in the provider
    • Thus not receiving additional requests
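
A minimal sketch; the group name and value are illustrative:

group "web" {
  shutdown_delay = "10s" # deregister from service discovery, wait, then stop tasks
}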

Service Discovery

Nomad Service

  • They can be at the task group or task level. I usually like to put them at task group level because it then is next to the network block and makes things easier to read.
  • Health checks
    • Are set by check
    • Can be multiple checks for the same service
    • The provider used determines what kinds of checks are possible

Providers

  • Native Nomad Service Discovery

    • This was added later, prior to 1.3 only consul was there
    • It’ll work for simpler usecases, for complex usecases, we might want to adopt consul
  • Consul
  • Consul Connect

    • lets you use service mesh between nomad services.
      • Q: If this does this, what does only Consul do?
  • Others

    • Worth noting that Traefik also supports discovery of Nomad services without Consul since Nomad 1.3.

Networking

  • See Containers
  • For each task group we have a network stanza
    • In the taskgroup:network we specify the port(s) that we want to use in our tasks. In other words, we “allocate” the ports for the task.
  • But each task driver specified in the task stanza can also have its own networking configuration.

On network:port

  • to : which port inside the container the traffic should be mapped to
  • static : which host port nomad should map to the container’s allocated (to) port
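
A sketch of both together; the numbers are illustrative:

network {
  mode = "bridge"
  port "http" {
    to     = 8080 # port the task listens on inside the allocation
    static = 80   # fixed host port; drop this line for a dynamic host port
  }
}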

Detour of modes

TODO Networking modes

See network Block - Job Specification | Nomad | HashiCorp Developer

  • You set these in the network stanza
  • You can use
  • none
  • bridge
    • This is useful with nomad service discovery
    • Has a “mapping with a certain interface” via the port stanza
    • This uses CNI plugins
    • Things would NOT show up in netstat -tulnp; you’ll have to switch network namespace if you need to check this.
    • If you have access to the interface, even if it????????
  • host
    • No shared namespace
    • Uses the host machine’s network namespace
    • Things would show up in netstat -tulnp
  • cni/<cni network name>
    • This needs to be configured per client

TODO Service address modes

Docker Task Driver Networking notes

  • Nomad uses bridged networking by default, like Docker.
    • You can configure bridge mode both in the network stanza and in the docker task driver’s network config; the two conflict. See the nomad documentation on this. Ideally, you’d only want to use nomad’s network bridge.
  • The Docker driver configures ports on both the tcp and udp protocols. No way to change it.
  • with task:config:ports the port name is used to create a dynamic env var, eg. something like NOMAD_PORT_<port_name>, where port_name is an item in config:ports. This can be used inside the container (see the sketch below).
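
A sketch of the docker driver side; the image is illustrative, and "http" must match a port label from the group’s network block:

task "web" {
  driver = "docker"
  config {
    image = "example/web:latest"
    ports = ["http"] # makes NOMAD_PORT_http available inside the container
  }
}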

Service Discovery and Networking

  • The network stanza has different modes; host is the simplest, bridge mode needs various CNI plugins.
  • Suppose you’re using host networking mode and using an SD mechanism to, say, populate the configuration file for some job. If you update the upstream job and the upstream job uses a dynamic port, your configuration file has now become stale! (This happened to me with nomad native SD and host mode for a job which was being used in a Caddy reverse proxy)

More on bridge networking

  • Once you
  • How do you access jobs running on bridge from machine localhost??
    • Interestingly, even if the ports don’t show up in netstat -tulnp, you will be able to access them from your local machine.
  • Inspecting bridge traffic

TODO Networking Q

  • Does SD really not work in nomad if I have set mode to host?
  • same port, same ip address, same interface, different namespaces
    • netstat is not showing port
    • but i am able to access it from host
    • I can listen for something on the same port on my server, curl still follows the old route(???)

Deployment

Production setup

How nomad is supposed to be run

  • Hashicorp docs sometimes use the terms node and client interchangeably, which is confusing.
    • In the docs: “A more generic term used to refer to machines running Nomad agents in client mode.”
  • Here’s what I’ll follow for my understanding
    • node is a bare metal server or a cloud server(eg. hetzner cloud vm(s)), can also be called the host
    • nomad agent : can be either nomad server or client
    • client: agent that is responsible for “running” tasks
    • server: agent that is responsible for scheduling jobs based on allocation & group to certain nodes that are running a client agent

Following the above terms, here are the guidelines

  • Nomad is supposed to be HA and quorum-based, i.e. you run multiple agents on different nodes (preferably in different regions)
  • You’re supposed to run one agent (server/client) on each node
  • You’re not supposed to run multiple agents on the same node
    • Not supposed to run multiple client agents on the same node
    • Not supposed to run multiple server agents on the same node
    • Not supposed to run a server agent and a client agent on the same node (there’s demand around this)
  • For true HA, there are certain recommendation around how many server agent(s) should you run.
  • More on client

    • A single client agent process can handle running many allocations on a single node, and nomad server will schedule against it until it runs out of schedulable resources
  • More on server

    • The server itself is not managed, i.e. if the server runs out of memory it’s not going to get re-scheduled etc.
      • So we need to make sure we run the server in an environment where it can easily do what’s needed.
    • A server is lightweight. For a starting cluster, you could likely colocate a server process with another application; clients less so because of the workload aspect. (But it depends on whether you need HA)
      • A server can run with other workloads (other selfhosted tools), even consul / vault etc.

Different topologies

  • single node nomad cluster

    Run just one node; on it, run both client and server.

    On dev-agent vs single node cluster

    • Nomad has something called dev-agent, which is for quickly testing out nomad etc. It runs the server and client both in the same node.
      • This essentially means running the agent with -dev flag.
      • It enables a pre-configured dual-role agent (client + server) which is useful for developing or testing Nomad.
    • A single node nomad cluster is essentially the same as dev mode (-dev), but here we explicitly specify it in our config and do not use the -dev flag.
    • Ways to do it

      • With the -dev flag and a single agent 🚫
        • It won’t save your data and it turns off ACLs
      • With two agents using different ports. You could do this, but it’s additional hassle.
      • With a single agent and both the client and server configs in the same config file
        • This is the suggested way (see the config sketch at the end of this list)
    • Warnings & Issues

      • It is strongly recommended not to run both client and server on the same node
      • HA only at application level, not at node level
        • Suppose we have 2 allocations of the same app in a single-node setup. If one allocation fails, the second comes up because the server is up and running and notices what happened. But if the node itself goes down, everything goes down.
      • Because we’ll be running just 1 client, drains and rolling upgrades will not work as documented, since there is no place to move the active workload to.
      • You can’t run more than one instance of a workload that uses static ports.
      • Ideally in a nomad setup you’d run the server as non-root, and the client as root since it needs OS isolation mechanisms that require root privileges. But in a single node setup, where client and server run from the same agent with a combined configuration, we’ll run both as root and hence not provide any User/Group directive in the systemd unit.
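
A sketch of the combined single-agent config; the data_dir path is illustrative:

data_dir = "/var/lib/nomad"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}
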
  • multi node but 1:1 server:client

    • Run 3 nodes; run a server:client pair on each.
  • Recommended topology

    • See deployment docs

TODO Don’t understand / Doubts

  • When we say 3 node cluster, do we mean 3 servers?
    • I am confused about whether these people are talking about running servers or clients, given the assumption that we only run either a server or a client on one node/host
  • An even number of server nodes (outside of the case of a temporary failure) can make the raft cluster unhappy.
  • If each client is isolated from the others though, then any number is fine :).
  • Pro Tip Idea: be sure to specify a different node_class for the client that runs on the nomad server.
  • The documentation recommends running the client as root and the server as the nomad user?
  • Does it mean that a minimal nomad cluster requires a minimum of 5 instances? 3 servers + 2 clients?

Failure Modes

  • clients losing their jobs
  • servers losing quorum

Backup of nomad data?

https://mrkaran.dev/posts/home-server-nomad/

Other notes

  • If you do not run Nomad as root, make sure you add the Nomad user to the Docker group so Nomad can communicate with the Docker daemon.

Single server nomad setup

In a single nomad server cluster, the raft state is the following; the server seems to be in follower state.

λ nomad operator raft list-peers

Node       ID                                    Address           State     Voter  RaftProtocol
(unknown)  9ba36f29-1538-eb7e-3a1e-a24dfb805b21  100.124.6.7:4647  follower  true   unknown

Understanding bootstrap_expect

  • Quorum requires at least ⌊(n/2)+1⌋ members. Once we have Quorum we can do leader election.
  • bootstrap_expect just makes the servers wait until there are n servers before it starts the leader election between these servers.
  • The deployment table in the docs is super useful but here’s a simplified summary
    • bootstrap_expect = 0 : Not allowed
    • bootstrap_expect = 1 : Single node cluster (self-elect)
    • bootstrap_expect = 2 : 2 nomad servers, forms a quorum of 2 but no fault tolerance.
      • one server agent failing will result in the cluster losing quorum and not being able to elect a leader.
      • without a leader,
        • the cluster cannot make state writes
        • the cluster cannot make any changes to the cluster state
        • the cluster cannot reschedule workloads
    • bootstrap_expect = 3 : 3 nomad servers, we have a fault-tolerant nomad quorum
  • When does it get used
    • bootstrap_expect is only used when the server starts for the “very first time” and attempts to join a raft cluster that has not been initialized yet
    • After that, the bootstrap_expect setting has no effect, even on subsequent restarts of the server.
      • Removing a server node is a longer procedure (They’ll have to be removed from the Raft peer list)

Understanding server-join

  • server_join is both a config block and a cli param
  • In config it can be start_join/retry_join based on the usecase
  • server_join is how the nomad agent (server) finds the other servers to connect to and validate against bootstrap_expect (see the sketch below)
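
A sketch of a three-server setup; each server runs with a config like this, and the addresses are illustrative:

server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
  }
}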

Migrating a bootstrap cluster (single node server)

  • v1 [THIS DID NOT WORK]

    AFTER SOME EXPERIMENTATION, WE CONCLUDE THAT GOING THIS ROUTE IS NOT SAFE

    • Have the existing cluster running with bootstrap_expect=1
    • Create a new host, add the following nomad server config. (It does not really need any client config)
        server = {
          enabled = true;
          bootstrap_expect = 2; # must be 2, even if the original single node server has set it to 1
          server_join = {
            start_join = ["<dns/ip of original bootstrap server>"];
          };
        };
    • After the new server is up, we’ll see 2 servers in our nomad cluster. (This gives a quorum of 2 with no fault tolerance, but that’s alright since we only need it as part of the migration)
    • Comment out the server block in the “original single node server”
      • At this point everything should crash; nothing is supposed to work. Both old and new nomad servers will be down with different errors.
      • Now change
        • Old server
          • the bootstrap_expect on the new server to 1.
        • New server
          • TODO
    • Things should start working again
  • v2

    • Steps
      • Cleanup (if you’ve attempted v1)
        • rm -rf /var/lib/nomad if you messed up on the new server, never do it on the original server.
        • umount -R alloc/*/*/* if needed
      • Take the snapshot backup
        • export NOMAD_ADDR="http://$(ip -br -j a show tailscale0| jq -r '.[].addr_info[] | select(.prefixlen == 32).local'):4646"
        • nomad operator snapshot save nomad_state.backup (old server)
      • Run the new server as an individual server with bootstrap_expect=1
        • No need to have any relation with the older server yet
      • Restore
        • nomad operator snapshot restore nomad_state.backup (new server)
        • Copy over the keyring from the working server/backup at /var/lib/nomad/server/keystore to the same location on the new server as well. (You’d now have two keystore files)
        • Then do: nomad operator root keyring rotate --full so that you don’t have to maintain the older keys.
        • sudo systemctl restart nomad
        • You can delete the old keys from the keystore now.
      • Make the old clients point to the new nomad server (including the old nomad server)
        • The servers field in the client block takes care of this
      • Now everything should be set

Fault Tolerance in Single Cluster Nomad

  • Availability

    • An available system will also be recoverable
  • Recoverability

    • After a fix, the system has to recover while making sure correctness is ensured
    • Eg. WAL helps with this

Autoscaling Nomad Clients (AWS/others)

  • Boot time is the number one factor in your success with auto-scaling. The smaller your boot time, the smaller your prediction window needs to be. Ex. If your boot time is five minutes, you need to predict what your traffic will be in five minutes, but if you can boot in 20 seconds, you only need to predict 20 seconds ahead. By definition your predictions will be more accurate the smaller the window is.
  • https://depot.dev/blog/faster-ec2-boot-time

  • https://developer.hashicorp.com/nomad/tools/autoscaling/concepts
  • The Nomad Autoscaler is modeled around the concept of a closed-loop control system.
  • Components
    • autoscaling policy
      • How users define their desired outcome
      • control the Nomad Autoscaler.
    • Target
      • What users want to scale.
      • Eg. job group/allocations/Nomad clients(eg. EC2)
    • Strategy plugins
      • receive current status of the target
      • compute what actions need to be taken.
    • Target plugins
      • communicate with targets: Read/Write(mutate state, eg. increase ec2 count)
    • APM plugins
      • read application performance metrics from external sources.
  • Concepts
    • Node Selector Strategy: mechanism the Nomad Autoscaler uses to identify nodes for termination when performing horizontal cluster scale-in actions.
  • Resources

autoscaler config

  • config file
    • .. other config
    • Registering plugins
      • Plugins are of 3 kinds: apm, strategy and target
      • We have correspondingly named blocks in the top-level autoscaler config.
      • These plugins can then be used as needed in the different policy files.
    • points to policy file
  • policy file
    • referred from the config file
    • The scaling block (see scaling Block)
      • Can be specified as file pointed by the autoscaler config
      • Can also be a block in a normal nomad jobspec!
        • But the scaling:policy block is parsed by the nomad-autoscaler only even if it’s defined in the jobspec.
      • What goes in the block depends on “what you’re scaling”: application/node(cluster)/ or dynamic thingy
    • The two main things are check and target:node_selector_strategy. They determine when things scale and when things die, respectively.
      • evaluation_interval is what determines how often this policy is evaluated/checked (see the policy sketch after this list)
    • A policy can contain multiple checks
      • checks get executed at the time of policy evaluation
    • Each check determines a scaling suggestion (there can be one or more)
      • a metric source
      • a strategy to apply based on the metric source “outcome”
      • If there are multiple outcomes, one is picked based on some rules. Mostly it’s ScaleOut
    • outcome
    • target:node_selector_strategy: How things are terminated
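
A sketch of a scaling policy as it could appear in a jobspec; the numbers, query, and plugin names are illustrative assumptions:

scaling {
  enabled = true
  min     = 1
  max     = 10

  policy {
    evaluation_interval = "30s"
    cooldown            = "2m"

    check "avg-cpu" {
      source = "nomad-apm"
      query  = "avg_cpu"

      strategy "target-value" {
        target = 70 # try to keep average CPU around 70%
      }
    }
  }
}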

Gotchas

Plugins

| What? | Description | Context |
| APMs | Query some API for the status of the system, eg. the expected state of the system | This becomes the source in the check block |
| Strategies | Implement the logic of the scaling strategy: compare the APM value and the target value and decide | The result is the outcome (ScaleIn/ScaleOut/None etc.) |
| Targets | Get info about the status of the target and then make changes to the target too | Eg. make changes to an AWS ASG |

Learning resources/Doubts

Ongoing ref

Concepts

Nomad Cluster on Production

ACL

What are the different tokens?

Service Discovery