tags : DuckDB, Database, Data Replication, Data Engineering
FAQ
Refresh-able materialized views
- For example, you may want to regularly perform a complete recomputation of a view over the full dataset because it
uses a complex join
, which is incompatible with an incremental approach.- Eg. batch processes performing tasks such as denormalization.
- These are not suitable for realtime needs
- This gives you the PostgreSQL materialized view behaviour to clickhouse
- I don’t think they’re needed for me at the moment, but can be useful in certain cases and is an experimental feature in CH.
- Refreshable Materialized View (experimental) | ClickHouse Docs
Replication
See Data Replication
- Clickhouse in itself doesn’t support replication but can support based on Zookeeper with Clickhouse Keeper.
Serving/User facing data delivery
- Eg. Snowflake is a OLAP like clickhouse aswell but it’s not meant to be usefacing
- Eg. I computed a MV and I want to make it accessible to the user
- “ClickHouse is good to serve direct queries from users.”
DuckDB vs Clickhouse
- I think DuckDB is awesome but i consider more of a tool than a DB.
- It’s more like a analysis tool. You can probably use clickhouse for all things that duckdb does but you can’t do all things that clickhouse does in duckdb. Plus there’s
clickhouse-local
which is a more apple-apple with duckdb.
- DuckDB is in-process, not like a proper DB
- DuckDB compares closest to other single node query engines like Polars, Data fusion, pandas.
- Clickhouse stores the data on-disk like a normal database
Materialized Views
NOTE: This is about normal Materialized Views in CH, not refreshable MV, those are not of our concern
Building Real-time Applications with ClickHouse Materialized Views - YouTube
How does it work for Sync
and Async
ingest? and delay.
- ClickHouse uniquely can efficiently update materialized views upon new data within a delay threshold.
- Normal materialized views are realtime materilized views only but they’re
not true realtime
like Druid. - Materialized View is an insert trigger the FROM table is the inserted buffe
Sync Ingestion
andAsync Ingestion
work the same for CH materliazed views.- See Are Materialized Views inserted synchronously? | ClickHouse Docs
- i.e It’s batched
- Computation of MV is NOT ATOMIC. So there might be inconsistency. (rare case)
Understanding the usecase of MV in CH
- This means we need stream processing prior to ingestion, if that’s the case what’s the use of materlizaton even????
- You can create MV with aggregations such as last 1h, last 1d, last 1w etc for certain something.
- Using Materialized Views in ClickHouse
- Many others
TODO Creating a MV
- MV never reads the source table!
- https://stackoverflow.com/questions/68656704/clickhouse-materialized-view-generating-too-slow
- https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf 🌟
AggregatingMergeTree
- AggregatingMergeTree — have to be used with AggregateFunctions + -State + -Merge combinators
TODO JOINs and Materialized Views
Others
- CH doesn’t support Multi source MV?????
- Performance of materialized view with join · ClickHouse/ClickHouse · Discussion #59187 · GitHub
- https://stackoverflow.com/questions/51233488/update-materialized-view-with-join-statement
- https://github.com/ClickHouse/ClickHouse/issues/6213
- https://altinity.com/blog/2020-07-14-joins-in-clickhouse-materialized-views
- https://stackoverflow.com/questions/75243697/joining-large-tables-in-clickhouse-out-of-memory-or-slow
- https://www.propeldata.com/blog/materialized-views-in-clickhouse
Issues with JOIN
- something to do with shuffle joins not being supported
- https://stackoverflow.com/questions/51233488/update-materialized-view-with-join-statement
- https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse
- https://github.com/ClickHouse/ClickHouse/issues/6213
- https://stackoverflow.com/questions/75243697/joining-large-tables-in-clickhouse-out-of-memory-or-slow
- https://www.propeldata.com/blog/materialized-views-in-clickhouse
- https://news.ycombinator.com/item?id=22362534
- https://www.reddit.com/r/dataengineering/comments/1c1swf6/are_click_house_joins_that_bad/
- https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse#joins
- https://clickhouse.com/blog/clickhouse-fully-supports-joins-part1
- https://celerdata.com/blog/from-denormalization-to-joins-why-clickhouse-cannot-keep-up
- https://clickhouse.com/docs/en/sql-reference/dictionaries
- https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views
- https://github.com/ClickHouse/ClickHouse/issues/20873
- https://github.com/ClickHouse/clickhouse-docs/issues/975
MV with multiple source tables
Materialiezed view is updated when new block of data is inserted in the left-most table.
TODO De-duplication with Materialized Views
TODO Performance w Materialized Views
Stream Processing v/s CH Materialized Views
What CH MV offers can be categorized as
continuous processing
in-contrast tostream processing
. It works with some usecases but might not work with others, it’s less powerful than truestream processing
so we might find ourselves combining CH with somestream processing
software if our needs become complex.
Do we need stream processing
if materalized views are there?
- Depends on usecase. But usually stea
- Also see Ingestion Patterns
Combining stream processing
with MV
- “The key point is not to analyze data fast, but to analyze fresh and well-formatted data fast. That means data needs to be ingested and transformed in real time before analytics is performed.”
- We usually would want the ingest “cleaned” data into our OLAP, in this case we’d put a stream processor infront of our OLAP system.
- Real-Time Data Enrichment and Analytics With RisingWave and ClickHouse
- CH MV might not work best with complex JOINs, in those cases we might need a
stream processing
thing before we ingest the data into CH.