This article explores the challenges of running batch jobs and provides an in-depth guide on possible approaches to make them more scalable and efficient.

It's the first day of the month, and you expect your interest payment to appear in your banking app — only to find it's missing. You refresh. Still nothing. Hours pass, and the payment is nowhere to be found 💸.

Interest payments may seem like a simple task: fetch the balance, apply an interest rate, calculate taxes, and credit the account. But when you're handling millions of users, this task turns into a massive batch job — requiring precision, speed, and reliability across the system 🎯.

So, what happens when this batch job doesn't run smoothly?

  • Customers lose trust when payments are delayed or inconsistent.
  • A single failure could result in millions of missed payments.
  • Scaling becomes a challenge when a single machine is responsible for handling everything.

At Trade Republic, we process millions of interest payouts each month. Initially, the process took a considerable amount of time, and deployments had to be postponed because they restarted pods, forcing everything to start from scratch. Additionally, if any part of it failed, we had to intervene manually. However, by rethinking how we distribute the workload across multiple machines, we were able to achieve fast and reliable payouts 💡.

In this article, we'll explore how to design scalable batch jobs that can be applied to any large-scale task such as interest payouts. While there are many ways to scale batch jobs, our focus will be on a practical approach that doesn't require complex technologies or deep technical expertise, ensuring your jobs are completed on time — every time ⏰.

Goal

Let's define the requirements for the task we want to implement.

Functional requirements:

  1. Fetch 1 million records from the database that need processing.
  2. Simulate work by pausing the thread for 10 ms.
  3. Mark each record as processed to track completion.

Non-functional requirements:

  1. Fault Tolerance: The system must be able to recover if one of the pods fails.
  2. Scalability: The system should efficiently handle an increase in the number of records to process.
  3. Performance: The system should maintain high performance throughout the process.

Brute Force Approach: A Single Pod

Let's start with the simplest approach to get the job done: running everything on a single pod. This method is easy to set up and doesn't involve complex configurations, making it a good starting point.

⚙️ Step 1: Creating the Database Table

CREATE TABLE IF NOT EXISTS test_account (
   id        UUID PRIMARY KEY,
   processed BOOLEAN DEFAULT FALSE NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_t_account_processed ON test_account (processed);

The processed column helps track which accounts have already been handled, ensuring continuity in case of failures or redeployments.
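For load testing, the table can be seeded with a million unprocessed rows. A minimal sketch on PostgreSQL (13+, where `gen_random_uuid()` is built in; older versions need the pgcrypto extension):

```sql
-- Insert 1 million rows with random UUIDs; processed defaults to FALSE
INSERT INTO test_account (id)
SELECT gen_random_uuid()
FROM generate_series(1, 1000000);
```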

⚙️ Step 2: Implementing the Batch Job

fun runBatchJob() {
    // Flag to check if any accounts were processed in the current iteration
    var anyAccountsProcessed = true
    logger.info("Batch job has been started")
    // Continue processing accounts until none are left to process
    while (anyAccountsProcessed) {
        // Wrap account processing in a transaction for consistency
        transactionProvider.transaction {
            anyAccountsProcessed = processAccounts()
        }
    }
    logger.info("Batch job has been finished")
}

private fun processAccounts(): Boolean {
    // Fetch up to 10 unprocessed accounts from the database
    val accounts = dslContext
        .select()
        .from(TEST_ACCOUNT)
        .where(TEST_ACCOUNT.PROCESSED.eq(false))
        .limit(10)
        .fetchInto(TestAccountRecord::class.java)
    // If no accounts are returned, end the batch job
    if (accounts.isEmpty()) return false
    accounts.forEach { _ ->
        Thread.sleep(10) // simulating processing
    }
    // Update the accounts as processed in the database
    dslContext.update(TEST_ACCOUNT)
        .set(TEST_ACCOUNT.PROCESSED, true)
        .where(TEST_ACCOUNT.ID.`in`(accounts.map { it.id }))
        .execute()
    // Indicate that there are more accounts to process in the next iteration
    return true
}

⚙️ Step 3: Scheduling a Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cron-job
spec:
  schedule: "0 1 * * *"
  timeZone: "Europe/Berlin"
#  ... and so on

Performance Analysis 📈

We estimate that at 10ms per record, processing 1 million records sequentially would take at least 2 hours 46 minutes:

  • 10ms per record = 0.01 seconds.
  • 1,000,000 records * 0.01 seconds = 10,000 seconds.
  • 10,000 / 60 = 166.67 minutes or 2h46m.

While this approach works, several challenges should be considered:

📌 Performance Bottleneck: Processing records one by one on a single pod is inefficient.

📌 Reliability: If the pod fails, no other pods are available to continue processing records.

It's worth noting that, to keep the example simple, the code does not handle exceptions.

Concurrent Processing

Coroutines

To improve performance, we can leverage coroutines to process multiple accounts concurrently.

private fun processAccountsAsync(): Boolean {
    val accounts = dslContext
        .select()
        .from(TEST_ACCOUNT)
        .where(TEST_ACCOUNT.PROCESSED.eq(false))
        .limit(10)
        .fetchInto(TestAccountRecord::class.java)
    if (accounts.isEmpty()) return false
    // runBlocking blocks the current thread and waits for all child coroutines
    runBlocking {
        // process each account in its own coroutine on the IO dispatcher;
        // launch (rather than async) because we don't need a result
        accounts.forEach { _ ->
            launch(Dispatchers.IO) {
                delay(10) // simulating processing
            }
        }
    }
    dslContext.update(TEST_ACCOUNT)
        .set(TEST_ACCOUNT.PROCESSED, true)
        .where(TEST_ACCOUNT.ID.`in`(accounts.map { it.id }))
        .execute()
    return true
}

If we process 10 accounts concurrently using coroutines, the total time required to process 1 million accounts would be:

  1. Time per batch of 10 accounts: Each account takes 10 ms to process. Since we are processing 10 accounts concurrently, the total time per batch remains around 10 ms.
  2. Total number of batches: Since we process 10 accounts at a time, the number of batches required is: 1,000,000 / 10 = 100,000 batches.
  3. Total processing time: Each batch takes 10 ms = 0.01sec, so for 100,000 batches: 100,000 * 0.01 sec = 1,000 sec or 16.67 minutes.

So, with 10 accounts being processed concurrently in each batch on a single pod, the process should take at least 17 minutes (excluding time for selecting and updating accounts). This is a significant improvement over our initial implementation ⚡.

You might ask: why not process 100, 1000, or even more accounts per batch?

If we increase the batch size to 1,000 records, then a single batch would take:

1,000 * 10 ms = 10 seconds.

This results in long-running transactions, which:

  • Increase contention and blocking for other processes accessing the TEST_ACCOUNT table.
  • Make rollbacks more expensive.
  • Block autovacuum from cleaning up dead rows.
  • Reduce parallelism potential (which we will explore later).

A best practice is to keep transactions short to maximize throughput.

Performance Analysis 📈

Let's deploy this and measure the actual execution time. We are going to use one db.r6gd.xlarge database instance with 4 vCPUs and 32 GiB of memory, and pod resources set to 1000m CPU and 2Gi memory.

Results: 25 minutes to process all accounts — a significant improvement over the initial iteration!

What if a single pod goes down? Can we scale this batch job across multiple pods? Of course!

Parallel Processing

Kubernetes CronJobs provide built-in parallelism and completion mechanisms. We can define this in our job specification:

spec:
  schedule: "0 1 * * *"
  timeZone: "Europe/Berlin"
  jobTemplate:
    spec:
      parallelism: 3 # allows up to three pods to run simultaneously
      completions: 3 # the job is complete once three pods successfully finish

If any pod fails, Kubernetes will restart it until all three pods have completed.

The Challenge: Avoiding Duplicate Processing and Locks

Running multiple pods introduces a race condition — if more than one pod attempts to process the same accounts, we could face:

📌 Duplicate processing — leading to inconsistent results and wasted resources.

📌 Database locks — slowing down performance due to transaction blocking.

Approach 1: Assign Account Ranges to Each Pod

One way to prevent duplicate processing is to assign a distinct range of accounts to each pod.

If we have 2 pods and 1 million accounts,

  • Pod 1 processes accounts 1–500,000,
  • Pod 2 processes accounts 500,001–1,000,000.
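A sketch of how that split might look, assuming a hypothetical sequential account number column and a Kubernetes Indexed Job, which exposes each pod's index via the JOB_COMPLETION_INDEX environment variable:

```kotlin
// Hypothetical sketch: derive each pod's account range from its job index.
// Assumes accounts carry a sequential number — our schema uses UUIDs, so
// this would require a schema change.
fun rangeForPod(podIndex: Int, totalPods: Int, totalAccounts: Long): LongRange {
    val chunk = (totalAccounts + totalPods - 1) / totalPods // ceiling division
    val start = podIndex * chunk + 1
    val end = minOf(start + chunk - 1, totalAccounts)
    return start..end
}

fun main() {
    // Indexed Jobs set JOB_COMPLETION_INDEX to 0, 1, 2, ...
    val podIndex = System.getenv("JOB_COMPLETION_INDEX")?.toInt() ?: 0
    println(rangeForPod(podIndex, 2, 1_000_000)) // pod 0 → 1..500000
}
```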

Problems with This Approach

  1. Requires schema changes — Our current IDs are random (UUIDs), so we'd need to introduce sequence numbers.
  2. Not flexible — If the number of pods or accounts changes dynamically, we need to manually adjust the range distribution.
  3. Handling pod failures — If a pod fails, another pod must detect it and pick up the unfinished range, or the restarted pod must figure it out.

Approach 2: Using FOR UPDATE to Lock Records

Another option is to use the FOR UPDATE clause in SQL, which locks the selected rows until the transaction is committed or rolled back.

✅ Prevents duplicate processing.

📌 Causes delays — Other transactions trying to update the same rows must wait for the lock to be released. If multiple pods select the same batch, some will be blocked, reducing parallelism.
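For reference, the plain locking variant of our fetch looks like this in raw SQL:

```sql
SELECT id
FROM test_account
WHERE processed = false
ORDER BY id
LIMIT 10
FOR UPDATE;  -- rows stay locked until the transaction ends
```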

Approach 3: FOR UPDATE SKIP LOCKED

A better approach is to allow pods to automatically pick unprocessed accounts without blocking each other. The SQL clause FOR UPDATE SKIP LOCKED achieves this:

✅ Locks selected rows, preventing duplicate processing.

✅ Skips already locked rows, allowing other transactions (pods) to process remaining records without waiting.

✅ Ensures each pod picks only unprocessed records, avoiding conflicts and deadlocks.
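In raw SQL the query looks like this (roughly what the jOOQ code below renders):

```sql
SELECT id
FROM test_account
WHERE processed = false
ORDER BY id
LIMIT 10
FOR UPDATE SKIP LOCKED;  -- skip rows another pod has already locked
```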

Here's how we can implement this approach:

private fun processAccountsAsyncSkipLock(): Boolean {
    val accounts = dslContext
        .select()
        .from(TEST_ACCOUNT)
        .where(TEST_ACCOUNT.PROCESSED.eq(false))
        .orderBy(TEST_ACCOUNT.ID)
        .limit(10)
        .forUpdate()
        .skipLocked()
        .fetchInto(TestAccountRecord::class.java)
    if (accounts.isEmpty()) return false
    runBlocking {
        accounts.forEach { _ ->
            launch(Dispatchers.IO) {
                delay(10) // simulating processing
            }
        }
    }
    dslContext.update(TEST_ACCOUNT)
        .set(TEST_ACCOUNT.PROCESSED, true)
        .where(TEST_ACCOUNT.ID.`in`(accounts.map { it.id }))
        .execute()
    return true
}

Performance Analysis 📈

Let's run the test and observe the impact of parallelism.

Results: 8 minutes to process all accounts. 3 pods added — 3x faster 🚀.

By increasing the number of pods, we significantly reduce the total processing time while maintaining reliability. However, can we infinitely scale by just adding more pods? That's the next challenge.

Scaling Limits: When Parallelism Isn't Enough

At first glance, adding more workers to process records in parallel seems like the perfect solution. However, scaling batch jobs isn't just about increasing the number of machines — at some point, the bottleneck shifts elsewhere.

Even with FOR UPDATE SKIP LOCKED, database contention can become a limiting factor. As more workers attempt to lock and update records, the database starts experiencing:

1. Increased Locking Overhead.

  • Multiple workers competing for locks can slow down transactions.
  • Even with SKIP LOCKED, frequent updates on large datasets can create contention.

2. Extreme Number of Database Connections.

  • Each worker needs a separate database connection, consuming memory and CPU.
  • Too many connections lead to context switching overhead.
  • Databases have a connection limit — exceeding it can degrade performance or cause failures.

3. Higher Load on the Database.

  • Frequent queries and updates put stress on CPU, memory, and disk I/O.
  • High write activity increases WAL (Write-Ahead Logging) pressure.
  • Read-heavy workloads can overwhelm the buffer cache, causing performance degradation.

Mitigating Scaling Bottlenecks

To cope with these issues, we can do the following:

1. Database Optimizations

Tuning the database can reduce contention and improve efficiency:

  1. Indexing & Partitioning.
  2. PostgreSQL Parameter Tuning:
   • shared_buffers — Set to 25–40% of RAM to optimize caching.
   • wal_buffers — Adjust to improve write performance.
   • max_wal_size — Increase for high write workloads.
   • and so on.
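As an illustration only — the right values depend on the instance and workload and must be benchmarked — on a 32 GiB machine those settings might start out roughly like:

```ini
# postgresql.conf — illustrative starting points for a 32 GiB instance
shared_buffers = 8GB    # ~25% of RAM to optimize caching
wal_buffers = 64MB      # larger buffer for write-heavy batches
max_wal_size = 4GB      # fewer forced checkpoints under heavy writes
```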

While database tuning helps, it has limitations — eventually, the database will become the bottleneck again.

2. Offloading to a Message Queue

Instead of processing accounts directly, we can use an event-driven approach:

  1. Workers publish account records as events to a message queue.
  2. Consumer services subscribe to these events and process them asynchronously.
  3. The database load drops significantly since accounts are processed outside the main transaction scope.
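A minimal sketch of the publishing side, with a hypothetical EventPublisher interface standing in for a real broker client (Kafka, SQS, and so on):

```kotlin
// Hypothetical sketch: the batch job only claims IDs and publishes them;
// the slow per-account work moves to consumers. EventPublisher is
// illustrative, not a real library API.
interface EventPublisher {
    fun publish(topic: String, payload: String)
}

fun publishUnprocessedAccounts(accountIds: List<String>, publisher: EventPublisher) {
    accountIds.forEach { id ->
        // Consumers must be idempotent: the same ID may be delivered twice
        publisher.publish("account-processing", id)
    }
}
```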

This gives us the following advantages:

  • Reduces lock contention — No direct updates within a tight loop.
  • Handles failures gracefully — Failed messages can be retried instead of locking rows.

However, moving to an event-driven architecture introduces additional complexity:

  • Requires message broker setup and monitoring.
  • Needs idempotency handling to prevent duplicate processing.
  • Introduces network delays.

There are other approaches as well; treat the options above as advice, not the only way to build batch jobs.

Conclusion — Lessons Learned

Scaling batch jobs is a challenge that requires balancing performance, reliability, and simplicity. We started with a brute-force approach using a single cron job, then introduced concurrent processing within one pod, then parallel processing with multiple workers, and finally explored optimizations like database locking strategies.

✍ Key takeaways from our journey:

✅ Parallelism is crucial — Distributing tasks across multiple workers speeds up processing and improves resilience.

✅ Database locking matters — FOR UPDATE SKIP LOCKED helps prevent contention but can still become a bottleneck at scale.

✅ Failure handling is critical — Retries, deduplication, and monitoring ensure smooth execution without duplicate processing.

In practice, the best approach depends on your data volume, infrastructure, and reliability requirements. A simple cron job might work for small workloads, but as scale increases, moving toward distributed processing is necessary.

By applying these principles, you can build batch jobs that are efficient, reliable, and scalable, ensuring that critical processes — whether payments, interest calculations, or data migrations — run smoothly, every time 🎯.