tags : System Design, Distributed Systems, Web Development, Message Passing, Queues, Scheduling and orchestrator

What?

A queue of tasks. It’s a specific use case of a Message Queue

Distributed task queue

See Distributed Task Queue :: SysDesignMeetup :: 2022-July-02 - YouTube

What is a distributed task queue?

This tripped me up a little at the start. People use the word “distributed” to mean multiple things.

“Sare milke humko …” (Hindi: “everyone together … us”)

  • Distributed what?
    • Worker: When you distribute the task over to multiple workers
      • workers can be processes in the same machine/node(same core, multiple cores etc)
      • workers can be processes running on different machine/node
      • Challenges: Maybe you want an exactly-once execution guarantee, maybe you want to make sure no task is scheduled concurrently on two workers, etc.
    • Broker (queue manager): When the broker itself is distributed.
    • Queue: This is only valid if the broker is distributed as well; the queue is replicated and there is some shared consensus among the brokers around how to proceed with the items in the queue.
  • Not all distributed queue implementations out there do all 3 (worker/broker/queue); some do a subset, or only in certain configurations, etc. So it’s hard to classify.

Distributed transactions and Distributed queue

  • If the use case is really this, you might want to look at something like Temporal (see Orchestrators and Scheduling)
  • Distributed task queues are not supposed to handle distributed transactions! (The naming confused me)

The exactly once guarantee

Durability vs Distributed

  • Whether the task is merely in-memory or durably persisted is a different thing from whether the task queue is called “distributed” (as “distributed” can mean many things)
  • Levels of durability of items in the queue
    • In-memory
    • In-memory+persisted to disk: Survives program crash
    • In-memory+persisted to disk+shared across machines: Survives machine crash

Delivery Guarantees

See Data Delivery

FAQ

TODO Components of a Task Queue (Doubt)

  • Components
    • Queue (the store)
    • Broker (the queue manager)
      • Sort of middleware that can handle things like routing, validation, storage, backpressure, delivery, etc.
    • The <thing> (the thing can be something like Celery, Hatchet, etc.)
  • I am confused: if the queue and broker do everything, what does the <thing> do?
  • Whose responsibility is task scheduling? The broker or the <thing>? The <thing> might additionally do task scheduling (triggering). (A toy sketch of how these pieces can fit together follows below.)
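
To make the split concrete, here is a toy sketch in plain Python (not any real library; all names are made up for illustration): the queue is just a store, the broker pulls items off the store and delivers them to workers, and the <thing> is the layer that registers and executes your task functions.

```python
import queue
import threading

task_queue = queue.Queue()   # the Queue: just the store

TASKS = {}                   # the <thing>'s task registry

def task(fn):
    """Register a function so workers can look it up by name (the <thing>'s job)."""
    TASKS[fn.__name__] = fn
    return fn

@task
def send_email(to):
    print(f"sending email to {to}")

def broker_loop(worker_id):
    """The Broker's job here: pull items off the store and deliver them to this worker."""
    while True:
        name, args = task_queue.get()          # blocks until an item is available
        print(f"worker {worker_id} picked up {name}")
        TASKS[name](*args)                     # the <thing> executes the registered task
        task_queue.task_done()

# two workers as threads on the same machine (the simplest "distributed worker" case)
for i in range(2):
    threading.Thread(target=broker_loop, args=(i,), daemon=True).start()

# a producer enqueues work by task name + arguments
task_queue.put(("send_email", ("user@example.com",)))
task_queue.join()                              # wait until everything is processed
```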

Job Queue vs Task Queue vs Task Scheduler

  • A job queue/task queue is just the thing that stores the tasks
  • A task scheduler is the thing that decides when the items get executed (usually the broker is involved in all this)

How to choose one?

A task queue may be simple or complex; based on your use case you might or might not need the following features (or more), so see what fits.

  • Task scheduling (this is the only feature we need)
  • Delaying or priority (schedule at time - not needed)
  • Output (return values - we do not need as well)
  • Re-submitting failed task (we do not want)
  • Error fallback handlers (not needed)
  • Message headers (not needed)
  • Grouping (dependent tasks, again, not needed)
  • Rate limiting (ditto)
  • Compression (ditto)

Async task vs Background Tasks

  • In terms of an incoming request, an async task will operate in an out-of-band manner, but the request cannot complete unless the async task itself returns.
  • If we want the request to respond faster, we will have to use an in-memory/external job queue, i.e. we need a background task (see the sketch below).
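
A minimal asyncio sketch of the difference (no web framework; the handler functions and queue here are stand-ins for illustration): the first handler awaits the slow work before responding, the second only enqueues it and lets a separate worker drain the queue.

```python
import asyncio

async def slow_work(label):
    await asyncio.sleep(1)                     # stand-in for sending an email, etc.
    print(f"{label}: work done")

async def handle_request_with_async_task():
    # the request cannot complete until the awaited task itself returns
    await slow_work("async task")
    return "done (after ~1s)"

async def handle_request_with_background_task(job_queue):
    # only enqueue the work; respond without waiting for it
    job_queue.put_nowait("background task")
    return "accepted (immediately)"

async def worker(job_queue):
    while True:
        label = await job_queue.get()
        await slow_work(label)
        job_queue.task_done()

async def main():
    job_queue = asyncio.Queue()                # in-memory job queue
    asyncio.create_task(worker(job_queue))
    print(await handle_request_with_async_task())
    print(await handle_request_with_background_task(job_queue))
    await job_queue.join()                     # let the worker drain the queue

asyncio.run(main())
```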

Handling task failures

Level 1: Writing tasks to handle failure

  • You write/design your tasks expecting some failures, so that they can be re-run safely (i.e. they are idempotent).
  • So even if a task fails midway and is retried, re-running it doesn’t repeat its side effects (see the sketch below).
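
A small sketch of what “safe to re-run” can look like in practice, using an idempotency key; the in-memory `done_keys` store is a stand-in for something persistent like a DB table with a unique constraint on the key.

```python
done_keys = set()   # stand-in for a persistent store

def already_done(key: str) -> bool:
    return key in done_keys

def mark_done(key: str) -> None:
    done_keys.add(key)

def charge_customer(order_id: str, amount: int) -> None:
    key = f"charge:{order_id}"
    if already_done(key):
        return                                         # safe no-op on retry
    print(f"charging {amount} for order {order_id}")   # the real side effect
    mark_done(key)

# the queue/worker can now retry freely on failure:
charge_customer("order-42", 100)
charge_customer("order-42", 100)    # retried delivery; charges only once
```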

Level 2: Using a distributed queue (w broker redundancy)

Level 3: Using a workflow engine

Task Queue vs Workflow Engine (e.g. Temporal)?

See Orchestrators and Scheduling

  • They serve different use cases
  • You can use a job queue and build out a workflow-like thing on top of it. In other cases you might want to combine a job queue with a workflow engine for certain things.
  • In many cases, just a Task Queue might be enough, you might not need a workflow thing at all. Tradeoffs.
  • A workflow engine is better suited for more complex workflow-oriented tasks, while a task queue is better suited for simpler tasks such as sending emails.
  • If you need a high throughput of small tasks that need to be run ASAP, a message broker (task queue) is the better choice.

Debate around Database as Job Queue

It’s done widely in the industry, but there are edge cases we need to be aware of

Case of Redis

  • This is what most actual job queues use

Case of PostgreSQL
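
A commonly cited approach for a Postgres-backed queue is claiming jobs with `SELECT ... FOR UPDATE SKIP LOCKED`, so concurrent workers never grab the same row (some libraries use advisory locks instead). A minimal sketch, assuming psycopg2 and a hypothetical `jobs` table with `id`, `payload`, and `status` columns:

```python
import time
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # assumed DSN

def run_job(payload):
    print("processing", payload)                 # stand-in for the real task logic

def claim_and_run_one_job() -> bool:
    with conn:                                   # one transaction per claimed job
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload FROM jobs
                WHERE status = 'queued'
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
                """
            )
            row = cur.fetchone()
            if row is None:
                return False                     # nothing to do
            job_id, payload = row
            run_job(payload)
            cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,))
    return True

while True:
    if not claim_and_run_one_job():
        time.sleep(1)                            # simple polling; LISTEN/NOTIFY could avoid this
```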

Task/Job Queue Patterns

  • “Concept Job Queue”
    • E.g. in this blog post Slack mentions that they use the concept of a job queue to refer to “Kafka+Redis”, in which Kafka is used for persistence and Redis as a store for the actual executor. But this is just one example of a use case; the possible combinations are endless.
  • Job Queue with an additional staging area (aka Transactional enqueueing)
    • E.g. with something like Sidekiq alone, in case of a transaction rollback, we could end up with an invalid background job in the queue!
    • To get around this, we put things into a staging db table before things get into the actual Job Queue.
    • Instead of sending jobs to the queue directly, they’re sent to a staging table first, and an enqueuer pulls them out in batches and puts them into the job queue. This gives us nice ACID support for the workloads that go into the queue (see the sketch below).
    • See https://brandur.org/job-drain and Pattern: Transactional outbox
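
A hedged sketch of the staging-table idea, assuming psycopg2; the `users` table, the `staged_jobs` table, and `push_to_job_queue()` are all hypothetical names for illustration:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # assumed DSN

def push_to_job_queue(kind: str, payload: str) -> None:
    print("enqueue", kind, payload)              # stand-in for the real push (Sidekiq/Redis/...)

def create_user(email: str) -> None:
    # Business write and job staging happen in ONE transaction: if the
    # transaction rolls back, no orphan job ever reaches the queue.
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO users (email) VALUES (%s)", (email,))
        cur.execute(
            "INSERT INTO staged_jobs (kind, payload) VALUES (%s, %s)",
            ("send_welcome_email", json.dumps({"email": email})),
        )

def drain_staged_jobs(batch_size: int = 100) -> None:
    # The enqueuer: pull staged rows in batches, push them to the real job
    # queue, and delete them only after the push succeeds.
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, kind, payload FROM staged_jobs ORDER BY id LIMIT %s",
            (batch_size,),
        )
        for job_id, kind, payload in cur.fetchall():
            push_to_job_queue(kind, payload)
            cur.execute("DELETE FROM staged_jobs WHERE id = %s", (job_id,))
```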

Pub/Sub Patterns

Fan in

1 sub, Multiple broadcasters

Fan out

Multiple subscribers, 1 broadcaster
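
A toy in-process sketch of the two shapes (pure Python, illustration only): fan out delivers a copy of each message to every subscriber, fan in funnels messages from many producers into one place for a single consumer.

```python
import queue

# Fan out: 1 broadcaster, multiple subscribers; each subscriber gets a copy.
subscribers = [lambda m: print("billing saw:", m),
               lambda m: print("shipping saw:", m)]

def broadcast(message):
    for callback in subscribers:
        callback(message)

broadcast("order #42 placed")

# Fan in: multiple broadcasters push into one place, a single subscriber drains it.
inbox = queue.Queue()
for producer in ("web", "mobile", "api"):
    inbox.put(f"event from {producer}")

while not inbox.empty():
    print("single consumer got:", inbox.get())
```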

Comparison of Background Task/Job Queue projects

| Name | Distributed | Backend | Broker | Language | Comments |
| --- | --- | --- | --- | --- | --- |
| Celery | Yes | Redis | RabbitMQ/Redis | Mostly Python | The queue itself is async but doesn’t support async functions |
| RQ | No | Redis | | Python | |
| SAQ | No | Redis | | Python | Like RQ but supports async using asyncio (See Python Concurrency) |
| Huey | No | Redis/in-mem/fs/sqlite | in-mem/redis | Python | |
| APS | Yes | PG/Sqlite/mongo | PG/Redis | Python | |
| tasiq | Yes | | | Python | |
| wakaq | Yes(?) | redis | redis | Python | |
| BullMQ | | | | NodeJS | |
| asynq | Yes | | | Go | |
| machinery | | | | Go | |
| Hatchet | Yes | Postgres | RabbitMQ, NATS | Go | More than just a task queue, overlaps with Temporal use cases |
| River | | Postgres based | | Go | |
| Sidekiq | | | | Ruby | |
| GoodJob | | Postgres based | | Ruby, RoR | |
| Que | | Postgres based | | Ruby | Alternative to trying to write your custom DB-based queue system |
| Custom DB based | | DB of choice | | - | |

Celery

  • Celery is a distributed task queue
  • It dispatches tasks/jobs to other servers and gets results back
  • It handles the task part of it but for the “distributed” part, it needs a Message Queue (MQ)
  • Needs a broker (Message Transports) and a backend (Result Store).
    • RabbitMQ (as the broker) and Redis (as the backend) are very commonly used together (a minimal sketch follows after this list).
    • Can also be used in other combinations.
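
A minimal Celery sketch of the broker + backend split, assuming a local RabbitMQ (broker) and Redis (result backend) with default connection URLs:

```python
# tasks.py: a minimal Celery app; run a worker with `celery -A tasks worker`.
from celery import Celery

app = Celery(
    "tasks",
    broker="amqp://guest:guest@localhost:5672//",   # RabbitMQ as the message transport
    backend="redis://localhost:6379/0",             # Redis as the result store
)

@app.task
def add(x, y):
    return x + y

if __name__ == "__main__":
    # Dispatch the task to whichever worker picks it up, then fetch the
    # return value from the result backend.
    result = add.delay(2, 3)
    print(result.get(timeout=10))
```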

What does “Celery doesn’t support async” mean?

Against Celery

There’s a lot to be said against Celery

Hatchet

Comparison of Message Brokers (can be used by a Task/Job/Message queue)

| Name | Comments |
| --- | --- |
| RabbitMQ | |
| NATS | Lot of good stuff |
| NSQ | Fewer options than NATS |

RabbitMQ

NATS

What?

  • NATS Core
    • Standard messaging API
    • Has default support for queues using queue groups
  • NATS Jetstream
    • Builds on top of Core

Usage

  • Use NATS as your queueing system, then write some workers to take tasks from it and process them. You can keep job status in the DB (see the sketch after this list).
  • Use cases
    • Pub/Sub
    • Persistent Log Stream (like Kafka)
      1. Fanning out messages
      2. Standard pub/sub messages
      3. As a quick, insecure RPC mechanism
      4. As a KV store, replacing Redis
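
A small sketch of the worker idea, assuming the nats-py client and a local NATS server; the subject name and queue group are made up. The queue group means each published task is delivered to only one of the subscribed workers:

```python
import asyncio
import nats  # nats-py client

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def worker(msg):
        print("processing task:", msg.data.decode())
        # job status could be written to the DB here

    # Subscribers sharing the queue group "workers" split the load:
    # each message on "tasks" goes to exactly one of them.
    await nc.subscribe("tasks", queue="workers", cb=worker)

    await nc.publish("tasks", b"send-email:42")
    await nc.flush()

    await asyncio.sleep(1)   # give the callback a moment to run
    await nc.drain()

asyncio.run(main())
```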

Alternatives

NATS documentation also has a nice comparison page

  • RabbitMQ
  • Kafka / Redpanda
  • Redis Streams
    • Redis Streams is something that you use when you want pub/sub to be persistent
    • But with NATS you get better ergonomics around this. E.g. better abstractions over acknowledging messages and then deleting them from the queue, which Redis Streams expects the client to handle.

NATS Jetstream

  • Limits
    • This is the default retention policy; the stream behaves like a typical stream.
    • N readers can be defined on the stream (called Consumers), each with independent stateful read pointers and delivery QoS. Messages in the stream can be read and re-read many times (reading does not change the stream). The “Limits” part is that you can put various growth limits on the stream (number of messages, bytes, age, etc.).
  • WorkQueue
    • It turns the stream into a traditional queue. Each acknowledged message delivery from a consumer removes that message from the stream.
  • Interest policy
    • It turns the stream into a traditional multi-consumer queue (sometimes called a Topic in some brokers). If there are no Consumers registered on the Stream, a received stream message is discarded. If there are one or more Consumers on the stream, a stream message is discarded only after all of them have had a chance to read and acknowledge delivery. (A sketch of creating streams with each retention policy follows below.)
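
A hedged sketch of creating streams with each of the three retention policies, assuming the nats-py JetStream API; the stream names and subjects are made up:

```python
import asyncio
import nats
from nats.js.api import StreamConfig, RetentionPolicy

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # Limits: the default; messages stay until size/age/count limits are hit.
    await js.add_stream(StreamConfig(
        name="EVENTS", subjects=["events.*"],
        retention=RetentionPolicy.LIMITS, max_msgs=100_000))

    # WorkQueue: each message is removed once a consumer acknowledges it.
    await js.add_stream(StreamConfig(
        name="TASKS", subjects=["tasks.*"],
        retention=RetentionPolicy.WORK_QUEUE))

    # Interest: messages are kept only while registered consumers still need
    # them, and dropped once all of them have acknowledged delivery.
    await js.add_stream(StreamConfig(
        name="NOTIFY", subjects=["notify.*"],
        retention=RetentionPolicy.INTEREST))

    await nc.close()

asyncio.run(main())
```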