tags : DuckDB, Database, Data Replication, Data Engineering

FAQ

Refresh-able materialized views

  • For example, you may want to regularly perform a complete recomputation of a view over the full dataset because it uses a complex join, which is incompatible with an incremental approach.
  • Eg. batch processes performing tasks such as denormalization.

Replication

See Data Replication

  • Clickhouse in itself doesn’t support replication but can support based on Zookeeper with Clickhouse Keeper.

Serving/User facing data delivery

  • Eg. Snowflake is a OLAP like clickhouse aswell but it’s not meant to be usefacing
  • Eg. I computed a MV and I want to make it accessible to the user

DuckDB vs Clickhouse

  • I think DuckDB is awesome but i consider more of a tool than a DB.
  • It’s more like a analysis tool. You can probably use clickhouse for all things that duckdb does but you can’t do all things that clickhouse does in duckdb. Plus there’s clickhouse-local which is a more apple-apple with duckdb.
  • DuckDB is in-process, not like a proper DB
  • DuckDB compares closest to other single node query engines like Polars, Data fusion, pandas.
  • Clickhouse stores the data on-disk like a normal database

Materialized Views

NOTE: This is about normal Materialized Views in CH, not refreshable MV, those are not of our concern

Building Real-time Applications with ClickHouse Materialized Views - YouTube

How does it work for Sync and Async ingest? and delay.

Understanding the usecase of MV in CH

  • This means we need stream processing prior to ingestion, if that’s the case what’s the use of materlizaton even????
  • You can create MV with aggregations such as last 1h, last 1d, last 1w etc for certain something.
  • Using Materialized Views in ClickHouse
  • Many others

TODO Creating a MV

TODO JOINs and Materialized Views

Others

Issues with JOIN

MV with multiple source tables

Materialiezed view is updated when new block of data is inserted in the left-most table.

TODO De-duplication with Materialized Views

TODO Performance w Materialized Views

Stream Processing v/s CH Materialized Views

See Ingestion Patterns

What CH MV offers can be categorized as continuous processing in-contrast to stream processing. It works with some usecases but might not work with others, it’s less powerful than true stream processing so we might find ourselves combining CH with some stream processing software if our needs become complex.

Do we need stream processing if materalized views are there?

Combining stream processing with MV

  • “The key point is not to analyze data fast, but to analyze fresh and well-formatted data fast. That means data needs to be ingested and transformed in real time before analytics is performed.”
  • We usually would want the ingest “cleaned” data into our OLAP, in this case we’d put a stream processor infront of our OLAP system.

TODO De-duplication

TODO Schema changes/ Data type changes