tags : System Design, Distributed Systems, Web Development, Message Passing, Queues, Scheduling and orchestrator
What?
A queue of tasks. It’s a specific usecase of Message Queue
Distributed task queue
See Distributed Task Queue :: SysDesignMeetup :: 2022-July-02 - YouTube
What is a distributed task queue?
This tripped me a little in the start. People be using the word “distributed” to mean multiple things.
“Sare milke humko …”
- Distributed what?
Worker
: When you distribute the task over to multiple workers- workers can be processes in the same machine/node(same core, multiple cores etc)
- workers can be processes running on different machine/node
- Challenges: Maybe you want execute exactly once guarantee, maybe you want to make sure no task is being scheduled concurrently on two workers etc.
Broker(queue manager)
: When the broker itself is distributed.Queue
: This is only valid if the broker is distributed aswell, the queue is replicated and there is some shared consensus with the brokers around how to proceed with the items in the queue.
- Not all distributed queue implementation out there do all 3(client/broker/queue), some do it with combination of something or some setting etc. So it’s hard to classify.
Distributed transactions and Distributed queue
- If the usecase is really this, you might want to look at something like temporal (see Orchestrators and Scheduling)
- Distributed task queues are not supposed to handle distributed transactions! (The naming confused me)
The exactly once guarantee
- If your usecase allows for idempotence, dropped tasks etc. You don’t really need consensus.
- But if you need to make sure of exactly once execution of the task, we need to have some kind of distributed Consensus
- contrast: At-least-once Delivery | Cloud Computing Patterns
Durability vs Distributed
- Whether the task is durable(in-memory) or persisted is different thing that whether the task queue is called “distributed” (as “distributed” can mean many things)
- Levels of durability of items in the queue
- In-memory
- In-memory+persisted to disk: Survives program crash
- In-memory+persisted to disk+shared across machines: Survives machine crash
Delivery Gurantees
See Data Delivery
FAQ
TODO Components of a Task Queue (Doubt)
- Components
- Queue (the store)
- Broker (the queue manager)
- Sort of middleware that can handle things like routing, validate, store, handle backpressure, deliver stuff etc.
- The <thing>. (thing can be something like celery, hatchet etc.)
- I am confused around, if the
queue
andbroker
does everything, what doe the<thing>
do? - Who’s responsibility is task scheduling?
Broker
or the<thing>
This additionally might do task sceduling(trigger)
Job Queue vs Task Queue vs Task Scheduler
- A job queue/task queue is just the thing that stores the tasks
- A task scheduler is the thing that does things with the items (usually the broker is involved in all this)
How to choose one?
A task queue may be simple or complext, based on your use case you might or might not need the following or more, see what fits.
- Task scheduling (this is the only feature we need)
- Delaying or priority (schedule at time - not needed)
- Output (return values - we do not need as well)
- Re-submitting failed task (we do not want)
- Error fallback handlers (not needed)
- Message headers (not needed)
- Grouping (dependant tasks, again, not needed)
- Rate limiting (ditto)
- Compression (ditto)
Async task vs Background Tasks
- In terms of an incoming request, an
async task
will operate in an out-of-band manner but the request cannot complete unless the async task itslef returns. - If we want the the request to respond faster, we will have to use a in-memory/external job queue. i.e We need a background task.
Handling task failures
Level 1: Writing tasks to handle failure
- You write/design your tasks expecting some failures and be able to re-run safely.
- So even if the
Level 2: Using a distributed queue (w broker redundancy)
Level 3: Using a workflow engine
Task Queue vs Workflow Engine(Eg. Temporal)?
See Orchestrators and Scheduling
- They serve are different usecases
- You can be using a job queue and can build out a workflow like thing on top of it. In other cases you might want to combine a job queue with an workflow engine for certain things.
- In many cases, just a Task Queue might be enough, you might not need a workflow thing at all. Tradeoffs.
- A workflow engine is better suited for more complex workflow-oriented tasks, while a task queue is better suited for simpler tasks such as sending emails.
- If you need a high throughput of small tasks that need to be run ASAP, a message broker(task queue) is the better choice.
Debate around Database as Job Queue
It’s done widely in the industry, but there are edge cases we need to be aware of
Case of Redis
- This is what most actual job queues use
Case of PostgreSQL
- Postgres Job Queues & Failure By MVCC
- What is SKIP LOCKED for in PostgreSQL 9.5? - 2ndQuadrant | PostgreSQL
Task/Job Queue Patterns
- “Concept Job Queue”
- Eg. in this blogpost slack mentions that they use the concept of job queue to refer to ”Kafka+Redis” in which Kafka is being used for persistence and redis as a store for the actual executor. But this is just one example of an usecase, the combination can be infinite only.
- Job Queue with an additional staging area (aka Transactional enqueueing)
- Eg. With something like only sidekiq, in in case of a transaction rollback, we could end up with an invalid background job in the queue!
- To get around this, we put things into a staging db table before things get into the actual Job Queue.
- Instead of sending jobs to the queue directly, they’re sent to a staging table first, and an enqueuer pulls them out in batches and puts them to the job queue. This gives us nice ACID support for our workloads that go into the queue.
- See https://brandur.org/job-drain and Pattern: Transactional outbox
Pub/Sub Patterns
Fan in
1 sub, Multiple broadcasters
Fan out
Multiple subscribers, 1 broadcaster
Comparison of Background Task/Job Queue projects
Name | Distributed | Backend | Broker | Language | Comments |
---|---|---|---|---|---|
Celery | Yes | Redis | RabbitMQ/Redis | Mostly python | The queue itself is async but doesn’t support async functions |
RQ | No | Redis | Python | ||
SAQ | No | Redis | Python | like RQ but supports async using asyncio (See Python Concurrency) | |
Huey | No | Redis/in-mem/fs/sqlite | in-mem/redis | Python | |
APS | Yes | PG/Sqlite/mongo | PG/Redis | Python | |
tasiq | Yes | Python | |||
wakaq | Yes(?) | redis | redis | Python | |
BullMQ | NodeJS | ||||
asynq | Yes | Go | |||
machinery | Go | ||||
Hatchet | Yes | Postgres | RabbitMQ, NATS | Go | More than just a task queue, overrlaps with Temporal usecases |
River | Postgres based | Go | |||
Sidekiq | Ruby | ||||
GoodJob | Postgres based | Ruby, RoR | |||
Que | Postgres based | Ruby | Alternative to trying to write your custom db based queue system | ||
Custom DB based | DB of choice | - |
Celery
- Celery is a distributed task queue
- It dispatches tasks/jobs to others servers and gets result back
- It handles the task part of it but for the “distributed” part, it needs a Message Queue (MQ)
- Needs a
broker
(Message Transports) and abackend
(Result Store).- RabbitMQ (as the
broker
) and Redis (as thebackend
) are very commonly used together. - Can also be used in other combinations.
- RabbitMQ (as the
What does celery doesn’t support async mean?
- The queue itself is async
- But it doesn’t support async operations in the sense that the task operation itself will be blocking (no asyncio support)
- In that case we want to use someting like SAQ
Against Celery
There’s lot to be said against celery
- Lack of guarantee
- I don’t think you need something like Celery in Go. Goroutines + Stdlib Timers/Tickers + Some minor libraries for additional time based stuff is all you need. And NATS if you’re distribution tasks over multiple instances of your application.
- The Many Problems with Celery | Log Blog Kebab
- Suggestion for blog post about covering celery problems - Show & Tell - Temporal
Hatchet
Comparison of Message Brokers (can be used by a Task/Job/Message queue)
- See Message Queue
Name | Comments |
---|---|
RabbitMQ | |
NATS | Lot of good stuff |
NSQ | Lesser options than NATS |
RabbitMQ
NATS
What?
-
NATS Core
- Standard messaging API
- Has default support for queues using queue groups
-
NATS Jetstream
Builds on top of core
Usage
- With NATS as your queueing system then write some workers to take tasks from it and process them. You can keep job status in the DB.
- Usecase
- Pub/Sub
- Persistent Log Stream (like Kafka)
-
- Fanning out messages 2) Standard pub/sub messages 3) As a quick, insecure RPC mechanism 4) As a KV store, replacing redis
Alternatives
NATS documentation also has a nice comparision page
- RabbitMQ
- Kafka / Redpanda
- Redis Streams
- Redis steams is something that you use when you want pub/sub to be persistent
- But with NATS you get better ergonomics around this. Eg. better abstractions over acknowledging messages and then deleting them from queue, which redis stream expect the client to handle.
NATS Jetstream
- Limits
- a typical stream.
- N number of readers can be defined on the stream (called Consumers) each with independent stateful read pointers and delivery QoS. Messages in the stream can be read and re-read many times (does not change the stream). The “Limits” part is that you can put various growth limits on the stream (size, bytes, etc.).
- WorkQueue
- it turns it into a traditional Queue. Each acknowledge message delivery from a consumer, removes the message in the stream.
- Interest policy
- it turns it into a traditional multi-consumer Queue (sometimes called a Topic in some brokers). If there are no Consumers registered on the Stream, the stream message received is discarded. If there are 1 or more Consumers on the stream, then stream message discarded only after all have had a chance to read and acknowledge delivery.