tags : Data Engineering, Batch Processing Patterns

FAQ

How does differential dataflow relate to stream processing?

A problem statement for restaurant orders :)

Let’s assume we are building a data pipeline for Uber Eats where we keep getting orders. We want a dashboard which shows each restaurant owner

  • How many orders?
  • What are the Top 3 ordered dishes (with order count)?
  • What’s the total sale amount?

There are 3 buttons on the UI which allow the users (restaurant owners) to see these numbers for different time periods:

  • Last 1 hour
  • Last 24 hours (= 1 day)
  • Last 168 hours (= 24h * 7 = 1 week)

The dashboard should get new data every 5 minutes. How do you collect, store and serve data?

See https://www.reddit.com/r/dataengineering/comments/18ciwxo/how_do_streaming_aggregation_pipelines_work/
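
One common shape of answer (roughly what the linked thread converges on) is to pre-aggregate incoming orders into fixed 5-minute buckets, then answer each button's query by merging the buckets inside its window. A minimal in-memory sketch, assuming hypothetical order events with `restaurant_id`, `dish`, `amount`, and a unix `ts`:

```python
from collections import Counter, defaultdict

BUCKET_SECS = 300  # 5-minute buckets, matching the dashboard refresh interval

# state[restaurant_id][bucket_start] -> partial aggregate for that 5-min bucket
state = defaultdict(lambda: defaultdict(lambda: {
    "orders": 0, "sales": 0.0, "dishes": Counter()
}))

def ingest(order):
    """Fold one order event into its 5-minute bucket (incremental, no rescan)."""
    bucket = order["ts"] - order["ts"] % BUCKET_SECS
    agg = state[order["restaurant_id"]][bucket]
    agg["orders"] += 1
    agg["sales"] += order["amount"]
    agg["dishes"][order["dish"]] += 1

def dashboard(restaurant_id, now, window_secs):
    """Merge the buckets covering the last window_secs to serve one dashboard tile."""
    orders, sales, dishes = 0, 0.0, Counter()
    for bucket, agg in state[restaurant_id].items():
        if bucket > now - window_secs:
            orders += agg["orders"]
            sales += agg["sales"]
            dishes += agg["dishes"]
    return {"orders": orders, "sales": sales, "top3": dishes.most_common(3)}

# e.g. the "last 24 hours" button:
# dashboard("resto-42", now=1_700_000_000, window_secs=24 * 3600)
```

Buckets older than 168 hours can be dropped or rolled up, and in a real pipeline this state would live in a durable store (or an OLAP table keyed by bucket) rather than process memory.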

What is it really?

Rethinking Stream Processing and Streaming Databases - RisingWave: Open-Source Streaming SQL Platform

Stream processing and streaming ingestion are different things. Some systems do one, some do both, partially or fully, with all the usual tradeoffs.

Streaming ingestion

  • This simply means ingesting into something in a streaming fashion (could be an OLAP database)
  • It has to deal with things like de-duplication (OLAP systems typically don’t have a primary key), plus schema and datatype changes in the source table, etc.
  • Along with streaming ingestion, a lot of OLAP databases allow for continuous processing (different from stream processing); a sketch of the idea follows this list
    • The underlying concept is that these systems store metric state and update it as new data comes in.
    • Some examples: Dynamic tables on Flink, Dynamic tables on Snowflake, Materialized views on Clickhouse, Delta Live Tables on Databricks, etc.
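
A toy sketch of that underlying concept (incremental view maintenance: fold each new row into stored state instead of re-running the query), with made-up column names:

```python
# Toy version of what a dynamic table / materialized view does internally.
view_state: dict[str, float] = {}  # e.g. restaurant_id -> running revenue

def on_new_row(row: dict) -> None:
    """Apply only the delta from the newly ingested row to the stored state."""
    key = row["restaurant_id"]
    view_state[key] = view_state.get(key, 0.0) + row["amount"]

def full_rescan(all_rows: list[dict]) -> dict[str, float]:
    """Equivalent result computed from scratch; what incremental upkeep avoids."""
    out: dict[str, float] = {}
    for row in all_rows:
        out[row["restaurant_id"]] = out.get(row["restaurant_id"], 0.0) + row["amount"]
    return out
```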

Stream processing

Stream processing systems are designed to give quick updates over a pre-defined query. To achieve that, incremental computation lies at the core of the design principle of stream processing. Instead of rescanning the whole data repository, stream processing systems maintain some intermediate results to compute updates based only on the new data. Such intermediate results are also referred to as states in the system. In most cases, the states are stored in memory to guarantee the low latency of stateful computation. Such a design makes stream processing systems vulnerable to failures. The whole stream processing pipeline will collapse if the state on any node is lost. Hence, stream processing systems use checkpoint mechanisms to periodically back up a consistent snapshot of global states on an external durable storage system. If the system fails, it can roll back to the last checkpoint and recover its state from the durable storage system.
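
A minimal sketch of that state-plus-checkpoint idea, assuming a single word-count operator and a local file standing in for the durable storage system (path and interval are made up for illustration):

```python
import json, os

CHECKPOINT_PATH = "/tmp/wordcount.ckpt"   # stand-in for durable storage (e.g. S3)
CHECKPOINT_EVERY = 1000                   # events between checkpoints (arbitrary)

def load_state() -> dict:
    """On startup/recovery, roll back to the last durable checkpoint."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {}

def run(events):
    counts = load_state()          # the operator's in-memory "state"
    for i, word in enumerate(events, 1):
        counts[word] = counts.get(word, 0) + 1   # incremental update, no rescan
        if i % CHECKPOINT_EVERY == 0:
            tmp = CHECKPOINT_PATH + ".tmp"
            with open(tmp, "w") as f:
                json.dump(counts, f)
            os.replace(tmp, CHECKPOINT_PATH)     # atomic swap = consistent snapshot
    return counts
```

A real system also records the input offset in the checkpoint, so that after recovery the source can be replayed from exactly that point instead of double-counting events.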

  • In most cases, they’re more flexible and capable than OLAP DBs’ continuous processing features. So they sit in front of the streaming ingestion into OLAP, making sure the OLAP store gets cleaned and processed data.
    • But in many cases you might not even need stream processing; whatever continuous processing features the OLAP DBs have might be enough, and they’re great imo in 2024.

Architectures

Tech Comparison

| RT Stream Processor | DBMS | Tech Name | RT Stream Ingestion | What? |
| --- | --- | --- | --- | --- |
| Yes | No | Apache Flink | Yes, but not meant to be a DB | Can be used in front of an OLAP store |
| Yes | No | Benthos/Redpanda Connect | No | Apache Flink, RisingWave alternative |
| Yes | No | Proton | No | Stream processing for Clickhouse? |
| Yes | Yes (PostgreSQL based) | RisingWave | Yes, but not meant to be a DB | Apache Flink alternative, pretty unique in what it does. See comparison w/ Materialize DB |
| Yes | Yes (OLAP) | Materialize DB | Yes | Sort of combines stream processing + OLAP (more powerful than Clickhouse in that case), supports PostgreSQL logical replication! |
| ? | Yes (OLAP) | BigQuery | Yes | It’s BigQuery :) |
| Not really | Yes (OLAP) | Apache Druid | True realtime ingestion | Alternative to Clickhouse |
| No | Yes (OLAP) | Apache Pinot | Yes | Alternative to Clickhouse |
| No | Yes (OLAP) | StarTree | Yes | Managed Apache Pinot |
| No | Yes (OLAP) | Clickhouse | Yes, batched realtime ingestion | See Clickhouse |
| No | Quacks like it! (OLAP) | DuckDB | No | See DuckDB |
| No | Yes (OLAP) | Rockset | Yes | Easy-to-use alternative to Druid/Clickhouse. Handles updates :) |
| No | Yes (OLAP) | Snowflake | Yes, but not very good | - |
| No | No | NATS | - | Message broker |
| No | No | NATS Jetstream | - | Kafka alternative |
| No | No | Kafka | - | Kafka is Kafka |

Druid vs Clickhouse

ClickHouse doesn’t guarantee newly ingested data is included in the next query result. Druid, meanwhile, does – efficiently, too, by storing the newly streamed data temporarily in the data nodes whilst simultaneously packing and shipping it off to deep storage.

Risingwave vs Clickhouse

  • They can be used together, e.g. RisingWave doing the incremental stream processing and sinking the results into Clickhouse, which serves the analytical queries

Benthos(Redpanda Connect) vs Risingwave

  • Both are stream processors
  • RisingWave is a stateful stream processor
  • Benthos sits somewhere between stateless and stateful systems

NATS jetstream vs Benthos

  • NATS Jetstream seems like a persistent queue (e.g., Kafka)
  • Benthos is something that would consume from that stream, do the stream processing, and put the output elsewhere.

things ppl say

  • We use Debezium to stream CDC from our DB replica to Kafka.
    • We have various services reading the CDC data from Kafka for processing (a consumer sketch follows below)
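
For a concrete picture, a minimal consumer for such a Debezium topic might look like this (broker address, group id, and topic name are made-up placeholders; uses the confluent-kafka client):

```python
import json
from confluent_kafka import Consumer  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "cdc-processor",             # placeholder consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.public.orders"])  # Debezium topic: <server>.<schema>.<table>

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.value() is None:
        continue  # no message, broker error, or tombstone record
    event = json.loads(msg.value())
    payload = event.get("payload", event)  # envelope layout depends on converter config
    op = payload["op"]                     # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    row = payload["after"] if op != "d" else payload["before"]
    # ...apply `row` to a downstream store / trigger processing here...
```

Debezium's envelope carries `before`/`after` row images plus the `op` code, which is what lets each downstream service treat the topic as a stream of row-level changes.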

Resources