tags : Data Engineering, Batch Processing Patterns
A problem statement for restaurant orders :)
Let’s assume we are building a data pipeline for Uber Eats where we keep getting orders. We want a dashboard that shows each restaurant owner:
- How many orders?
- What are the Top 3 ordered dishes (with order count)?
- What’s the total sale amount?
There are 3 buttons on the UI which allow the users (restaurant owners) to see these numbers for different time periods:
- Last 1 hour
- Last 24 hours (= 1 day)
- Last 168 hours (= 24h * 7 = 1 week)
The dashboard should get new data every 5 minutes. How do you collect, store and serve data?
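Before picking any system, it helps to pin down what the serving side actually computes. Here is a minimal sketch of the three metrics over a lookback window, in plain Python; the records, field names, and `dashboard` function are all illustrative, not part of any real system:

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative order records; in production these would stream in from the order service.
ORDERS = [
    {"restaurant": "r1", "dish": "pad thai", "amount": 12.0, "ts": datetime(2024, 5, 1, 10, 0)},
    {"restaurant": "r1", "dish": "pad thai", "amount": 12.0, "ts": datetime(2024, 5, 1, 10, 30)},
    {"restaurant": "r1", "dish": "green curry", "amount": 14.0, "ts": datetime(2024, 5, 1, 11, 0)},
    {"restaurant": "r1", "dish": "spring rolls", "amount": 6.0, "ts": datetime(2024, 4, 28, 9, 0)},
]

def dashboard(restaurant, now, lookback):
    """Order count, top-3 dishes (with counts), and total sales over a lookback window."""
    window = [o for o in ORDERS
              if o["restaurant"] == restaurant and now - lookback <= o["ts"] <= now]
    dishes = Counter(o["dish"] for o in window)
    return {
        "orders": len(window),
        "top_dishes": dishes.most_common(3),
        "total_sales": sum(o["amount"] for o in window),
    }

now = datetime(2024, 5, 1, 12, 0)
print(dashboard("r1", now, timedelta(hours=24)))   # the Apr 28 order falls outside 24h
print(dashboard("r1", now, timedelta(hours=168)))  # the 1-week window includes it
```

The rest of this note is really about where this computation lives: rescanned on every refresh (batch over raw data), or maintained incrementally (continuous/stream processing).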
What is it really?
Stream processing and streaming ingestion are different things. Some systems do one, some do both, partially or fully, with all the usual tradeoffs.
Streaming ingestion
- This simply means ingesting into something in a streaming fashion (could be an OLAP database).
- It has to deal with things like de-duplication (OLAP systems don’t enforce a primary key), and with schema and datatype changes in the source table.
- Alongside streaming ingestion, a lot of OLAP databases also allow for continuous processing (different from stream processing). The underlying concept is that these systems store metric state and update it as new data comes in.
- Some examples: Dynamic Tables on Flink, Dynamic Tables on Snowflake, Materialized Views on ClickHouse, Delta Live Tables on Databricks, etc.
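The de-duplication point is worth a tiny sketch: since the OLAP store won’t reject duplicate rows, the ingestion layer drops rows whose business key was already seen. The `dedupe` function and `order_id` key below are illustrative; real systems bound the seen-set with a TTL or lean on engine features such as ClickHouse’s ReplacingMergeTree:

```python
# Minimal sketch of ingest-time de-duplication. "seen" would normally be a
# bounded/TTL'd structure, not an unbounded set.
def dedupe(events, key="order_id", seen=None):
    seen = set() if seen is None else seen
    out = []
    for e in events:
        if e[key] not in seen:
            seen.add(e[key])
            out.append(e)
    return out

batch = [{"order_id": 1, "amount": 10},
         {"order_id": 1, "amount": 10},   # duplicate delivery, dropped
         {"order_id": 2, "amount": 5}]
print(dedupe(batch))
```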
Stream processing
Stream processing systems are designed to give quick updates over a pre-defined query. To achieve that, incremental computation lies at the core of the design principle of stream processing. Instead of rescanning the whole data repository, stream processing systems maintain some intermediate results to compute updates based only on the new data. Such intermediate results are also referred to as states in the system. In most cases, the states are stored in memory to guarantee the low latency of stateful computation. Such a design makes stream processing systems vulnerable to failures. The whole stream processing pipeline will collapse if the state on any node is lost. Hence, stream processing systems use checkpoint mechanisms to periodically back up a consistent snapshot of global states on an external durable storage system. If the system fails, it can roll back to the last checkpoint and recover its state from the durable storage system.
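The incremental-state-plus-checkpoint idea above can be shown in a few lines of toy Python. `RunningSum` and its methods are illustrative, not any real engine’s API; the point is that each event touches only the state, and recovery rolls back to the last snapshot and replays:

```python
import copy

class RunningSum:
    """Toy stateful operator: keeps running aggregates instead of rescanning history."""
    def __init__(self):
        self.state = {"count": 0, "total": 0.0}

    def process(self, amount):
        # Incremental update: only the new event touches the state.
        self.state["count"] += 1
        self.state["total"] += amount

    def checkpoint(self):
        # In a real system this snapshot goes to durable external storage.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)

op = RunningSum()
for amt in [10.0, 5.0]:
    op.process(amt)
snap = op.checkpoint()   # durable snapshot of state
op.process(7.5)          # events after the checkpoint are lost on failure...
op.restore(snap)         # ...so recovery rolls back to the snapshot
op.process(7.5)          # and the source replays the event
print(op.state)          # {'count': 3, 'total': 22.5}
```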
- In most cases, stream processors are more flexible and capable than the continuous processing features of OLAP DBs. So they sit in front of the streaming ingestion into OLAP, making sure the OLAP store gets cleaned and processed data.
- But in many cases you might not even need stream processing; whatever continuous processing features the OLAP DBs have might be enough. They’re great imo in 2024.
Architectures
- Lambda: maintain a batch layer over historical data and a real-time (speed) layer, and merge their results at query time.
- Kappa: treat everything as a stream; works if you don’t need a lot of historical data and only need streaming data.
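The Lambda query-time merge is simple enough to sketch. The two views and the `serve` function below are illustrative: the batch view is what a nightly job precomputed, the speed view covers only events since the last batch run:

```python
# Hedged sketch of Lambda-style serving: merge a precomputed batch view
# (historical) with a small speed-layer view (recent events).
batch_view = {"r1": {"orders": 1000, "sales": 12000.0}}   # from the nightly batch job
speed_view = {"r1": {"orders": 12, "sales": 150.0}}       # streamed since the last batch

def serve(restaurant):
    empty = {"orders": 0, "sales": 0.0}
    b = batch_view.get(restaurant, empty)
    s = speed_view.get(restaurant, empty)
    return {"orders": b["orders"] + s["orders"], "sales": b["sales"] + s["sales"]}

print(serve("r1"))  # {'orders': 1012, 'sales': 12150.0}
```

Kappa avoids maintaining the two code paths by replaying the stream itself when reprocessing is needed.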
Tech Comparison
| RT Stream Processor | DBMS | Tech Name | RT Stream Ingestion | What? |
|---|---|---|---|---|
| Yes | No | Apache Flink | Yes, but not meant to be a DB | Can be used in front of an OLAP store |
| Yes | No | Benthos/Redpanda Connect | No | Apache Flink / RisingWave alternative |
| Yes | No | Proton | No | Stream processing for ClickHouse? |
| Yes | Yes (PostgreSQL-based) | RisingWave | Yes, but not meant to be a DB | Apache Flink alternative; pretty unique in what it does. See comparison with Materialize |
| Yes | Yes (OLAP) | Materialize | Yes | Sort of combines stream processing + OLAP (more powerful than ClickHouse in that case); supports PostgreSQL logical replication! |
| ? | Yes (OLAP) | BigQuery | Yes | It’s BigQuery :) |
| Not really | Yes (OLAP) | Apache Druid | True real-time ingestion | Alternative to ClickHouse |
| No | Yes (OLAP) | Apache Pinot | Yes | Alternative to ClickHouse |
| No | Yes (OLAP) | StarTree | Yes | Managed Apache Pinot |
| No | Yes (OLAP) | ClickHouse | Yes, batched real-time ingestion | See ClickHouse |
| No | Quacks like it! (OLAP) | DuckDB | No | See DuckDB |
| No | Yes (OLAP) | Rockset | Yes | Easy-to-use alternative to Druid/ClickHouse. Handles updates :) |
| No | Yes (OLAP) | Snowflake | Yes, but not very good | - |
| No | No | NATS | - | Message broker |
| No | No | NATS JetStream | - | Kafka alternative |
| No | No | Kafka | - | Kafka is Kafka |
Druid vs Clickhouse
ClickHouse doesn’t guarantee newly ingested data is included in the next query result. Druid, meanwhile, does – efficiently, too, by storing the newly streamed data temporarily in the data nodes whilst simultaneously packing and shipping it off to deep storage.
- Druid is more “true” real-time than ClickHouse in this sense, but that degree of real-time is usually not needed in most of my cases.
- What Is Apache Druid And Why Do Companies Like Netflix And Reddit Use It? - YouTube
Risingwave vs Clickhouse
- They can be used together
Benthos(Redpanda Connect) vs Risingwave
- Both are stream processors
- RisingWave is a stateful stream processor
- Benthos sits somewhere between stateless and stateful systems
NATS jetstream vs Benthos
- NATS JetStream seems like a persistent queue (e.g. Kafka)
- Benthos is something that would consume from that stream, do the stream processing, and put the output elsewhere.
things ppl say
- We use debezium to stream CDC from our DB replica to Kafka.
- We have various services reading the CDC data from Kafka for processing
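On the consumer side of that Debezium-to-Kafka setup, a service mostly unpacks the change-event envelope and applies it. The envelope shape below (`payload.op`, `payload.before`, `payload.after`) follows Debezium’s standard format; the `apply_change` helper and the in-memory `table` dict are purely illustrative:

```python
import json

def apply_change(table, raw_event):
    """Apply one Debezium change event to an in-memory materialization."""
    payload = json.loads(raw_event)["payload"]
    op = payload["op"]  # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(payload["before"]["id"], None)
    return table

table = {}
apply_change(table, json.dumps(
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "status": "placed"}}}))
apply_change(table, json.dumps(
    {"payload": {"op": "u", "before": {"id": 1, "status": "placed"},
                 "after": {"id": 1, "status": "delivered"}}}))
print(table)  # {1: {'id': 1, 'status': 'delivered'}}
```

In production the raw events would come off a Kafka consumer loop rather than literal strings, but the envelope handling is the same.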