tags : Data Engineering
Incremental batch Jobs
Meta
- It’s given that for the inserts themselves we’ll be using upsert or handling de-duplication in some way
- Secondly, it’s given that all of these jobs will be idempotent by nature
- There’s also CDC and SCD Type 2
Patterns
Process Indicator / High Watermark pattern
- What: Fetch some data, keep a mark of the last (highest) fetched record, and on the next run pick up from there.
- It assumes all chunks are processed sequentially
- The last chunk processed has the largest set of ids
- This will not work when trying to run the job in parallel
- When running in parallel, there is no sequence and “order” does not matter, so last != largest in the source dataset
- So this can only be useful if we are fetching sequentially
- Examples of this pattern