
avibhawnani/asyncJobEngine

Async Job Processing System

A production-grade, DB-backed asynchronous job processing engine built with Spring Boot 3, PostgreSQL, and ThreadPoolExecutor. Designed for reliability, horizontal scalability, and clean extensibility — without the operational overhead of Kafka or Redis.


Architecture Overview

┌─────────────────┐     ┌──────────────┐     ┌──────────────────┐
│  REST API Layer │────▶│  JobService  │────▶│  JobRepository   │
│  (Controller)   │     │  (Business)  │     │  (JPA/Postgres)  │
└─────────────────┘     └──────┬───────┘     └──────────────────┘
                               │                       ▲
                               │                       │
┌──────────────────┐     ┌─────┴────────┐     ┌───────┴──────────┐
│  HandlerRegistry │◀────│  JobPoller   │────▶│  ThreadPool      │
│  (Strategy)      │     │  (Scheduled) │     │  Executor        │
└──────┬───────────┘     └──────────────┘     └──────────────────┘
       │
       ├── EmailJobHandler
       └── ReportJobHandler

Flow:

  1. Client submits a job via POST /jobs → persisted as PENDING
  2. Scheduled poller fetches a batch of PENDING jobs every N seconds
  3. Each job is atomically claimed (PENDING → PROCESSING) using optimistic locking
  4. Claimed jobs are submitted to a ThreadPoolExecutor for async execution
  5. Handler executes → job transitions to SUCCESS or retries until DEAD
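The flow above can be sketched with plain JDK concurrency primitives. This is an illustrative in-memory stand-in, not the project's actual classes: the `Job` class, `claim()`, and the executor wiring here only mimic steps 3–5, with `compareAndSet` playing the role of the optimistic-lock UPDATE.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class FlowSketch {
    enum Status { PENDING, PROCESSING, SUCCESS, DEAD }

    static class Job {
        final String id;
        final AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);
        Job(String id) { this.id = id; }
        // Stand-in for the version-guarded UPDATE: succeeds for exactly one caller.
        boolean claim() { return status.compareAndSet(Status.PENDING, Status.PROCESSING); }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Job> batch = List.of(new Job("a"), new Job("b")); // step 2: fetched batch
        for (Job job : batch) {
            if (job.claim()) {                               // step 3: atomic claim
                pool.submit(() -> job.status.set(Status.SUCCESS)); // steps 4-5
            }
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        batch.forEach(j -> System.out.println(j.id + " -> " + j.status.get()));
    }
}
```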

Retry Strategy

Jobs use a bounded retry policy (exponential back-off is left as an extension point):

Attempt   retryCount   Result on failure
1st       0 → 1        Status → PENDING (re-queued)
2nd       1 → 2        Status → PENDING (re-queued)
3rd       2 → 3        Status → DEAD (stopped)

Why this approach:

  • Bounded retries prevent poison-pill jobs from consuming resources indefinitely
  • DEAD status creates a clear audit trail — operators can inspect and manually retry via POST /jobs/{id}/retry
  • No exponential backoff in this version to keep complexity honest. In production, you'd add a nextRetryAt timestamp and filter WHERE next_retry_at <= NOW() in the poller query
  • Max retry count is externalized (job.max-retry=3) — tunable per environment without code changes
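The bounded-retry transition above fits in a few lines. A minimal sketch — `nextStatus` and `backoffSeconds` are illustrative names, with `maxRetry` mirroring the `job.max-retry=3` property; the back-off helper shows the extension noted above, not something this version implements:

```java
class RetrySketch {
    // Decide the job's next status after a failed attempt.
    static String nextStatus(int retryCount, int maxRetry) {
        return retryCount + 1 >= maxRetry ? "DEAD" : "PENDING";
    }

    // Possible future extension (per the notes above): exponential back-off,
    // base * 2^retryCount seconds until nextRetryAt.
    static long backoffSeconds(int retryCount, long baseSeconds) {
        return baseSeconds << retryCount;
    }

    public static void main(String[] args) {
        int maxRetry = 3; // mirrors job.max-retry=3
        for (int rc = 0; rc < maxRetry; rc++) {
            System.out.println("attempt " + (rc + 1) + " -> retryCount "
                    + (rc + 1) + " -> " + nextStatus(rc, maxRetry));
        }
        System.out.println("backoff after 2nd failure: " + backoffSeconds(1, 5) + "s");
    }
}
```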

How Optimistic Locking Prevents Double-Processing

The Job entity has a @Version field that JPA auto-increments on every update.

Race condition scenario:

  1. Instance A and Instance B both poll and fetch Job X (status=PENDING, version=1)
  2. Instance A calls markProcessing(jobX) → UPDATE ... SET status='PROCESSING', version=2 WHERE id=X AND version=1 → succeeds
  3. Instance B calls markProcessing(jobX) → UPDATE ... WHERE id=X AND version=1 → 0 rows affected → OptimisticLockException
  4. Instance B catches the exception and silently skips the job
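The race above reduces to a compare-and-set on the version column. In this sketch, `compareAndSet(1, 2)` stands in for `UPDATE ... SET version = 2 WHERE id = X AND version = 1` — exactly one of the two "instances" wins:

```java
import java.util.concurrent.atomic.AtomicInteger;

class ClaimSketch {
    public static void main(String[] args) {
        AtomicInteger version = new AtomicInteger(1); // Job X at version 1
        boolean instanceA = version.compareAndSet(1, 2); // 1 row affected
        boolean instanceB = version.compareAndSet(1, 2); // 0 rows -> OptimisticLockException
        System.out.println("A claimed: " + instanceA + ", B claimed: " + instanceB);
    }
}
```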

Why optimistic over pessimistic locking (SELECT FOR UPDATE):

  • No row-level DB locks held during job execution — much better throughput
  • No risk of deadlocks from long-held row locks
  • Works cleanly across multiple app instances sharing the same database
  • The conflict window is tiny (only the status transition moment), so collisions are rare
  • SELECT FOR UPDATE would hold locks for the entire polling transaction, blocking other instances from even reading those rows

Horizontal Scaling

This system is designed to scale horizontally out of the box:

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Instance1│     │ Instance2│     │ Instance3│
│  Poller  │     │  Poller  │     │  Poller  │
│  Workers │     │  Workers │     │  Workers │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
              ┌───────┴───────┐
              │  PostgreSQL   │
              │  (shared DB)  │
              └───────────────┘

Deploy N instances pointing at the same database. Each instance runs its own poller and executor. Optimistic locking guarantees that:

  • No two instances process the same job
  • No distributed coordination (ZooKeeper, etc.) is needed
  • Adding instances increases throughput roughly linearly, until the shared database becomes the bottleneck

Tuning per instance:

  • Adjust job.executor.pool-size based on instance CPU/memory
  • Stagger job.poller.interval across instances to reduce DB contention (optional)
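The per-instance executor can be wired as a fixed-size pool with a bounded queue for back-pressure. An illustrative sketch — the `job.executor.pool-size` property is from this README, but the concrete values, the queue capacity, and the rejection policy here are placeholder assumptions:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExecutorConfigSketch {
    static ThreadPoolExecutor jobExecutor(int poolSize, int queueCapacity) {
        return new ThreadPoolExecutor(
                poolSize, poolSize,                       // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),  // bounded queue: back-pressure
                new ThreadPoolExecutor.CallerRunsPolicy() // overflow runs on the poller thread
        );
    }

    public static void main(String[] args) {
        ThreadPoolExecutor ex = jobExecutor(5, 100); // e.g. job.executor.pool-size=5
        System.out.println("core=" + ex.getCorePoolSize() + " max=" + ex.getMaximumPoolSize());
        ex.shutdown();
    }
}
```

CallerRunsPolicy is one reasonable choice here: when the queue fills, the poller thread executes the job itself, which naturally slows down polling instead of dropping work.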

Design Tradeoffs: DB Polling vs Message Queue

Aspect                  DB Polling (this project)            Message Queue (Kafka/RabbitMQ)
Latency                 Seconds (poll interval)              Milliseconds (push-based)
Throughput              Hundreds/sec                         Tens of thousands/sec
Infra cost              Zero (reuse existing DB)             Additional broker to operate
Delivery guarantees     At-least-once via optimistic lock    Built-in (offsets, acks)
Ordering                FIFO via ORDER BY created_at         Partition-level ordering
Operational complexity  Low — just Postgres                  High — broker tuning, monitoring
Best for                Low-to-medium volume, internal jobs  High-volume event streaming

This project chose DB polling because:

  • For most internal job systems (emails, reports, notifications), sub-second latency isn't needed
  • Eliminates an entire infrastructure component and its operational burden
  • The DB you already have is your queue — simpler deployment, fewer failure modes

Replacing DB Polling with Kafka (Future Migration Path)

If throughput requirements grow beyond what DB polling can handle:

  1. Keep the JobHandler interface and HandlerRegistry unchanged — they're decoupled from the delivery mechanism
  2. Replace JobPoller with a @KafkaListener that consumes from a topic per job type
  3. Replace the POST /jobs endpoint to publish to Kafka instead of (or in addition to) writing to the DB
  4. Keep the jobs table as a persistent audit log — write after successful processing
  5. Remove @Version optimistic locking — Kafka consumer groups handle partition assignment and prevent double-processing

The Strategy Pattern in the handler layer means zero changes to business logic — only the delivery mechanism changes.
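The decoupling claimed above comes from the handler-layer Strategy pattern: a registry maps job type to handler, so the delivery mechanism (poller or Kafka listener) just looks up and invokes. A minimal plain-Java sketch — `JobHandler`, `EmailJobHandler`, and `ReportJobHandler` are named in this README, but the method signature and registry wiring here are illustrative assumptions:

```java
import java.util.Map;

class HandlerSketch {
    interface JobHandler { String handle(String payload); }

    static class EmailJobHandler implements JobHandler {
        public String handle(String payload) { return "email sent: " + payload; }
    }
    static class ReportJobHandler implements JobHandler {
        public String handle(String payload) { return "report built: " + payload; }
    }

    public static void main(String[] args) {
        // Stand-in for HandlerRegistry: type -> strategy.
        Map<String, JobHandler> registry = Map.of(
                "EMAIL", new EmailJobHandler(),
                "REPORT", new ReportJobHandler());
        // The caller (poller today, Kafka listener tomorrow) only does this:
        System.out.println(registry.get("EMAIL").handle("welcome"));
    }
}
```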


API Reference

Submit a Job

curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{"type": "EMAIL", "payload": {"to": "user@example.com", "subject": "Welcome"}}'

Response: 201 Created

{ "jobId": "550e8400-e29b-41d4-a716-446655440000" }

Check Job Status

curl http://localhost:8080/jobs/{jobId}

Response: 200 OK

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "EMAIL",
  "status": "SUCCESS",
  "retryCount": 0,
  "createdAt": "2026-03-04T10:15:30Z",
  "updatedAt": "2026-03-04T10:15:35Z"
}

List Dead Jobs (Paginated)

curl "http://localhost:8080/jobs/dead?page=0&size=20"

Retry a Dead Job

curl -X POST http://localhost:8080/jobs/{jobId}/retry

Health Check

curl http://localhost:8080/actuator/health

Response: 200 OK

{
  "status": "UP",
  "components": {
    "job": {
      "status": "UP",
      "details": {
        "pendingJobs": 3,
        "deadJobs": 0,
        "activeThreads": 2,
        "poolSize": 5
      }
    }
  }
}

Running Locally

Prerequisites

  • Java 17+
  • Maven 3.8+
  • Docker & Docker Compose

With Docker

# Build the JAR
./mvnw clean package -DskipTests

# Start PostgreSQL + App
docker-compose up --build

Without Docker

# Start PostgreSQL separately, then:
./mvnw spring-boot:run

Run Tests

./mvnw test

Tests use H2 in-memory database via the test profile.


Project Structure

com.example.asyncjob
├── controller/       # REST API layer
├── service/          # Business logic, transaction boundaries
├── worker/           # Scheduled poller — the async engine
├── handler/          # Strategy pattern — one handler per job type
├── repository/       # Spring Data JPA interfaces
├── model/            # JPA entities and enums
├── config/           # ThreadPoolExecutor, @ConfigurationProperties
├── exception/        # Global exception handling, API error format
├── health/           # Custom Actuator health indicator
└── dto/              # Request/response records

Tech Stack

Component    Technology
Framework    Spring Boot 3.2
Language     Java 17
Database     PostgreSQL 16
ORM          Hibernate 6 / JPA
Concurrency  ThreadPoolExecutor
Monitoring   Spring Boot Actuator
Logging      SLF4J + Logback
Build        Maven
Container    Docker + Docker Compose
Test DB      H2 (in-memory)