Batch Processing Patterns

Incremental Batch Jobs

It’s a given that the inserts themselves will use upsert or handle de-duplication in some way.
Secondly, it’s a given that all of these jobs will be idempotent by nature.
There are also CDC and SCD Type 2.
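
As a rough illustration of the upsert assumption above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `target`, its columns, and the helper `upsert_rows` are hypothetical, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer. It only shows why replaying the same batch leaves the data unchanged, which is what makes a job idempotent.

```python
import sqlite3


def upsert_rows(conn, rows):
    """Write rows keyed by id; re-running with the same rows changes nothing."""
    conn.executemany(
        "INSERT INTO target (id, payload) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
        rows,
    )
    conn.commit()


# Hypothetical in-memory target table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT)")

batch = [(1, "a"), (2, "b")]
upsert_rows(conn, batch)
upsert_rows(conn, batch)  # replaying the same batch does not create duplicates

row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Because the write is keyed on `id`, a retried or repeated run converges to the same final state instead of inserting duplicates.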

What: Fetch some data, keep a mark of the last (highest) record fetched, and on the next run pick up from there.
It assumes all chunks are processed sequentially.
The last chunk processed holds the largest ids.
This will not work at all when trying to run the job in parallel.
- When running in parallel there is no guaranteed sequence, so the last chunk to finish is not necessarily the one with the largest ids (last != largest in the source dataset).
- So this is only useful if we are fetching sequentially.
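
The sequential version of this pattern can be sketched as follows. This is a minimal illustration, not a definitive implementation: the source is a plain in-memory list, and the names `fetch_after`, `run_job`, `state`, and `last_id` are all hypothetical. It assumes ids only ever increase and that chunks are processed strictly one after another.

```python
def fetch_after(source, last_id, limit):
    """Return up to `limit` records with id > last_id, in ascending id order."""
    newer = sorted(
        (rec for rec in source if rec["id"] > last_id),
        key=lambda rec: rec["id"],
    )
    return newer[:limit]


def run_job(source, target, state, chunk_size=2):
    """One sequential run: process chunks in id order, advancing the mark after each."""
    while True:
        chunk = fetch_after(source, state["last_id"], chunk_size)
        if not chunk:
            break
        for rec in chunk:
            target[rec["id"]] = rec  # keyed write stands in for an upsert
        # Sequential processing guarantees the last record of the latest
        # chunk carries the highest id seen so far.
        state["last_id"] = chunk[-1]["id"]
    return state["last_id"]


source = [{"id": i, "value": f"v{i}"} for i in range(1, 6)]
target, state = {}, {"last_id": 0}

first_mark = run_job(source, target, state)   # processes ids 1..5
source.append({"id": 6, "value": "v6"})       # new data arrives between runs
second_mark = run_job(source, target, state)  # picks up only id 6
```

If chunks were handed to parallel workers instead, the worker that finishes last could hold a lower id range, so writing its final id as the mark would cause the next run to re-fetch or skip records.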
Examples of this pattern