tags : Data Engineering
User defined functions (UDF)
- Whenever possible we want to avoid this.
- See this for why: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.map_rows.html#polars.DataFrame.map_rows
- We can play around with the native stuff to get most things done.
Resources
- Understand Polars’ Lack of Indexes | by Carl M. Kadie | Towards Data Science
- Introduction to Polars - Practical Business Python
- dataframe - Difference between selecting columns using data frame name and pl.col() in polars - Stack Overflow
polars, loops, partitions, delta merges and python memory management
Issues:
- Memory release
- polars doesn’t release memory
- python doesn’t release memory
- parquet row groups also matter (think for us this is handled by DL)
- How allocation works with polars and python, this is expected
- https://github.com/pola-rs/polars/issues/3972
- https://github.com/pola-rs/polars/issues/3974
- https://www.reddit.com/r/Python/comments/5sf6h5/python_does_not_return_memory_to_linux/ddel5j6/?context=3
- https://stackoverflow.com/questions/76061800/polars-df-takes-up-lots-of-ram
- https://stackoverflow.com/questions/71540618/general-question-about-polars-memory-management
- https://stackoverflow.com/questions/77606023/polars-dataframe-is-keep-on-holding-memory
- https://stackoverflow.com/questions/64368565/delete-and-release-memory-of-a-single-pandas-dataframe
- Memory load
- delta-rs merge doesn’t seem to take partitioning info
What to do?
- We should focus more on reducing the memory load part of it, last resort if we’re not able to release memory we can restart and it’s a python/polars problem
- Loading not needed data into memory is a delta-rs issue that we need to fix