tags : Data Engineering
User defined functions (UDF)
- Whenever possible we want to avoid this.
- See this for why: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.map_rows.html#polars.DataFrame.map_rows
- We can play around with the native stuff to get most things done.
Syntax FAQ
Debugging
Query Explain
print(my_df.explain(optimized=False))
print(my_df.explain(optimized=True))
- Query plan - Polars user guide
Optimizations
Pushdown
-
Predicate pushdown (Row)
- It seems like predicate pushdown when downloading from s3 is not yet properly supported.
- “This might be a separate issue but seems related: Polars doesn’t yet support predicate pushdown into datasets nor streaming output. It eagerly calls to_table() on them instead. Both DataFusion and DuckDB can query datasets lazily with predicate pushdown; it would be nice if Polars did too.”
- https://github.com/pola-rs/polars/issues/9002
- Filter a polars LazyFrame column lowercase without materialize to DataFrame - Stack Overflow
- https://github.com/pola-rs/polars/issues/6395
- It seems like predicate pushdown when downloading from s3 is not yet properly supported.
-
Projection pushdown (Column)
Others
- Crucial parameters for streaming in Polars | Rho Signal
- RAM usage and predicate pushdown · Issue #3974 · pola-rs/polars · GitHub
Resources
- Understand Polars’ Lack of Indexes | by Carl M. Kadie | Towards Data Science
- Introduction to Polars - Practical Business Python
polars, loops, partitions, delta merges and python memory management
Issues:
- Memory release
- polars doesn’t release memory
- python doesn’t release memory
- parquet row groups also matter (think for us this is handled by DL)
- How allocation works with polars and python, this is expected
- https://github.com/pola-rs/polars/issues/3972
- https://www.reddit.com/r/Python/comments/5sf6h5/python_does_not_return_memory_to_linux/ddel5j6/?context=3
- https://stackoverflow.com/questions/76061800/polars-df-takes-up-lots-of-ram
- https://stackoverflow.com/questions/71540618/general-question-about-polars-memory-management
- https://stackoverflow.com/questions/77606023/polars-dataframe-is-keep-on-holding-memory
- https://stackoverflow.com/questions/64368565/delete-and-release-memory-of-a-single-pandas-dataframe
- Memory load
- delta-rs merge doesn’t seem to take partitioning info
What to do?
- We should focus more on reducing the memory load part of it, last resort if we’re not able to release memory we can restart and it’s a python/polars problem
- Loading not needed data into memory is a delta-rs issue that we need to fix