kotlin in depth advanced patterns for java engineers

Structured Concurrency and Failure Propagation

11 min read Chapter 6 of 21

The Problem Structured Concurrency Solves

In Java, concurrent work is fire-and-forget by default. You submit a Runnable to an ExecutorService, and the relationship between the submitting code and the running task is a Future reference you have to manage manually:

ExecutorService pool = Executors.newFixedThreadPool(4);

// Typed futures: submit(Callable<T>) returns Future<T>, so no casts are needed
Future<Pricing> task1 = pool.submit(() -> fetchPricing());
Future<Inventory> task2 = pool.submit(() -> fetchInventory());
Future<Reviews> task3 = pool.submit(() -> fetchReviews());

// What if fetchPricing() fails?
// task2 and task3 keep running; you have to cancel them manually
// What if this method throws before reaching task1.get()?
// The tasks are orphaned: still running, nobody waiting for them

try {
    Pricing pricing = task1.get();
    Inventory inventory = task2.get();
    Reviews reviews = task3.get();
    return buildProductPage(pricing, inventory, reviews);
} catch (ExecutionException | InterruptedException e) {
    task1.cancel(true);  // Hope you remembered all three
    task2.cancel(true);
    task3.cancel(true);
    throw e;
}

Even ExecutorService.shutdownNow() doesn't guarantee task cancellation: it interrupts the worker threads, and tasks can ignore interrupts. CompletableFuture is weaker still: cancel() completes that one future exceptionally, but it neither interrupts the computation behind it nor propagates to upstream futures.

Kotlin’s structured concurrency makes a hard guarantee: when a scope ends, all coroutines launched within it have completed or been cancelled. No orphans, no leaks, no manual cleanup.
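As a minimal sketch of that guarantee (the function name runChildren and the event log are illustrative, not part of any library), coroutineScope does not return to its caller until both children have finished:

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: launches two children and records the order of events.
suspend fun runChildren(log: MutableList<String>) {
    coroutineScope {
        launch { delay(100); log.add("slow child done") }
        launch { delay(10); log.add("fast child done") }
        log.add("scope body done")
    }
    // Execution reaches this point only after both children have completed.
}

fun main() = runBlocking {
    val log = mutableListOf<String>()
    runChildren(log)
    log.add("after scope")
    println(log)
    // [scope body done, fast child done, slow child done, after scope]
}
```

The scope body finishes first, yet "after scope" is always last: the scope waits for its children without any explicit join.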

The Job Tree


Every coroutine has a Job. Every Job (except the root) has a parent. This forms a tree:

val scope = CoroutineScope(Dispatchers.Default)

scope.launch {                          // Job A (child of scope's Job)
    launch {                            // Job B (child of A)
        launch {                        // Job D (child of B)
            delay(1000)
        }
    }
    launch {                            // Job C (child of A)
        delay(2000)
    }
}

The hierarchy for this code:

scope.Job (root)
  └── Job A
       ├── Job B
       │    └── Job D
       └── Job C

When you call launch or async inside a coroutine, the new coroutine’s Job becomes a child of the current coroutine’s Job. This is automatic — it happens through the CoroutineContext inheritance. The child’s Job is registered with the parent, forming the tree edge.

You can inspect this relationship:

scope.launch {
    val parentJob = coroutineContext[Job]
    val child = launch {
        val myJob = coroutineContext[Job]
        println(myJob?.parent === parentJob) // true (Job.parent is marked @ExperimentalCoroutinesApi)
    }
    println(parentJob?.children?.toList()?.contains(child)) // true
}
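Because the tree is explicit, you can also cancel just one subtree. The following sketch rebuilds the A/B/C/D hierarchy inside runBlocking (the function name cancelSubtree is made up for this example), cancels Job B, and reports what happened to the others:

```kotlin
import kotlinx.coroutines.*

// Builds the A/B/C/D tree, cancels B, and reports each Job's state.
fun cancelSubtree(): String = runBlocking {
    lateinit var jobB: Job
    lateinit var jobC: Job
    lateinit var jobD: Job
    val jobA = launch {
        jobB = launch {
            jobD = launch { delay(Long.MAX_VALUE) }
            delay(Long.MAX_VALUE)
        }
        jobC = launch { delay(200) }
    }
    delay(50) // runBlocking is single-threaded, so every job has started by now
    jobB.cancelAndJoin() // cancels B and its child D, nothing else
    val snapshot = "B cancelled=${jobB.isCancelled}, D cancelled=${jobD.isCancelled}, " +
        "C active=${jobC.isActive}"
    jobA.join() // A completes once C finishes
    "$snapshot, A cancelled=${jobA.isCancelled}"
}

fun main() {
    println(cancelSubtree())
    // B cancelled=true, D cancelled=true, C active=true, A cancelled=false
}
```

Cancelling B takes D down with it, while A and C are untouched because cancellation flows down the tree, not up.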

Cancellation Propagation: Child Fails → Parent Cancels → Siblings Cancel

The defining behavior of structured concurrency is how failure propagates. The default rule: if a child coroutine fails with an exception (other than CancellationException), the parent is cancelled, which cancels all other children.

Walk through a concrete scenario:

suspend fun loadProductPage(productId: String): ProductPage = coroutineScope {
    val pricing = async { fetchPricing(productId) }      // child 1
    val inventory = async { fetchInventory(productId) }  // child 2
    val reviews = async { fetchReviews(productId) }      // child 3

    ProductPage(
        pricing = pricing.await(),
        inventory = inventory.await(),
        reviews = reviews.await()
    )
}

Suppose fetchPricing() throws a ServiceUnavailableException after 200ms, while fetchInventory() and fetchReviews() are still in flight. Here’s the cascade:

1. pricing's Job fails with ServiceUnavailableException
2. pricing's Job notifies its parent (coroutineScope's Job)
3. Parent Job transitions to "cancelling" state
4. Parent Job cancels all other children: inventory, reviews
5. inventory and reviews receive CancellationException at their next suspension point
6. Once all children complete (cancelled or finished), parent throws
   ServiceUnavailableException to the caller of loadProductPage()

No manual cancellation. No try/catch/cancel boilerplate. The structure of the code defines the lifecycle.
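The cascade can be observed in a small runnable sketch. Everything here is a stand-in: ServiceUnavailableException is declared locally, loadPage is a hypothetical two-child version of loadProductPage, and delay simulates the remote calls.

```kotlin
import kotlinx.coroutines.*

class ServiceUnavailableException(message: String) : RuntimeException(message)

// Hypothetical reduced version of loadProductPage with simulated services.
suspend fun loadPage(events: MutableList<String>): String = coroutineScope {
    val pricing = async<String> {
        delay(50)
        throw ServiceUnavailableException("pricing down")
    }
    val inventory = async {
        try {
            delay(1_000) // still in flight when pricing fails
            "inventory"
        } finally {
            events.add("inventory cancelled=${!isActive}")
        }
    }
    pricing.await() + inventory.await()
}

fun main() = runBlocking {
    val events = mutableListOf<String>()
    try {
        loadPage(events)
    } catch (e: ServiceUnavailableException) {
        events.add("caller saw: ${e.message}")
    }
    println(events)
    // [inventory cancelled=true, caller saw: pricing down]
}
```

The sibling's finally block runs before the caller sees the exception, which matches step 6: the scope rethrows only after every child has completed.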

CancellationException Is Special

Cancellation is not treated as failure. When a coroutine is cancelled, it throws CancellationException, but this exception propagates differently:

scope.launch {                              // parent
    val child1 = launch {
        delay(5000)
        println("Child 1 done")             // never reached
    }
    val child2 = launch {
        delay(1000)
        println("Child 2 done")
        child1.cancel()                     // cancel sibling
    }
    // parent is NOT cancelled — CancellationException doesn't propagate upward
}

The rule: CancellationException signals cooperative cancellation, not failure. A cancelled child completes normally from the parent’s perspective. Other exceptions (anything not a CancellationException) signal failure and propagate.

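This distinction matters for your suspend functions. Cancellation is cooperative: code that never suspends never notices it. In long-running loops, call coroutineContext.ensureActive() or check coroutineContext.isActive:

suspend fun processLargeDataset(items: List<Item>) {
    for (item in items) {
        // A plain suspend function has no CoroutineScope receiver, so go through
        // coroutineContext (kotlin.coroutines.coroutineContext)
        coroutineContext.ensureActive()  // throws CancellationException if cancelled
        process(item)
    }
}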
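To see why the check matters, here is a sketch of a CPU-bound loop that never suspends (spinUntilCancelled is an illustrative name). Inside launch a CoroutineScope receiver is available, so ensureActive() can be called directly:

```kotlin
import kotlinx.coroutines.*

// Spins on a background thread until cancelled, then returns how far it got.
fun spinUntilCancelled(): Long = runBlocking {
    var iterations = 0L
    val job = launch(Dispatchers.Default) {
        while (true) {
            ensureActive() // the only place this loop can observe cancellation
            iterations++
        }
    }
    delay(50)
    job.cancelAndJoin() // returns only because the loop checks for cancellation
    iterations
}

fun main() {
    println("stopped after ${spinUntilCancelled()} iterations")
}
```

Without the ensureActive() call, the loop would have no suspension point and no check, so cancelAndJoin() would never return.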

If you catch exceptions broadly inside a coroutine, be careful not to swallow CancellationException:

// WRONG — swallows cancellation
launch {
    try {
        riskyOperation()
    } catch (e: Exception) {
        log.error("Failed", e)  // This catches CancellationException too!
    }
}

// CORRECT — rethrow cancellation
launch {
    try {
        riskyOperation()
    } catch (e: CancellationException) {
        throw e  // Don't swallow cancellation
    } catch (e: Exception) {
        log.error("Failed", e)
    }
}
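If this pattern recurs, it can be wrapped once. The helper below is hypothetical (runCatchingCancellable is not a kotlinx.coroutines function); it is a sketch of a runCatching variant that lets cancellation through, assuming Kotlin 1.5 or later so that Result can be used as a return type:

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: like runCatching, but never swallows cancellation.
inline fun <T> runCatchingCancellable(block: () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e
    } catch (e: Throwable) {
        Result.failure(e)
    }

// Returns (ordinary failure was captured, code after a cancelled call ran).
fun demoRunCatchingCancellable(): Pair<Boolean, Boolean> = runBlocking {
    val failed = runCatchingCancellable<Int> { error("boom") }
    var continuedAfterCancel = false
    val job = launch {
        runCatchingCancellable { delay(10_000) }
        continuedAfterCancel = true // skipped: the cancellation was rethrown
    }
    delay(50)
    job.cancelAndJoin()
    failed.isFailure to continuedAfterCancel
}

fun main() {
    println(demoRunCatchingCancellable()) // (true, false)
}
```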

Launch vs Async Exception Behavior

launch and async handle exceptions differently, and confusing the two is a common source of bugs.

launch: exceptions propagate immediately to the parent.

scope.launch {
    throw RuntimeException("boom")
    // Exception propagates UP to scope immediately
    // scope's Job is cancelled
}

async: exceptions are stored in the Deferred and rethrown on .await(). It is tempting to conclude that nothing happens until you await:

// MISLEADING: the comments describe a common but incorrect mental model
scope.launch {
    val deferred = async {
        throw RuntimeException("boom")
        // Assumed: the exception is NOT propagated yet
    }

    delay(1000)
    println("Assumed: this line executes because async hasn't propagated yet")

    deferred.await()  // Assumed: the exception first surfaces here
}

That description is only partly right, and this is the subtlety that catches experienced developers. The exception is stored in the Deferred, but the failing child also notifies its parent. If the async is a child of a launch or coroutineScope, the parent is cancelled as soon as the child fails, whether or not you call .await(). In the snippet above, delay(1000) throws CancellationException and the println never runs.

scope.launch {
    val deferred = async {
        delay(100)
        throw RuntimeException("boom")
    }

    delay(5000)
    // You might think you have 5 seconds before dealing with the error
    // But after 100ms, the async child fails and cancels this parent too
    println("This never prints")
}

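The takeaway: a failing async cancels its parent whether or not you await it. await() is how you observe the result, not what triggers propagation. Only under a supervisor (SupervisorJob or supervisorScope) does the failure stay inside the Deferred until you await.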
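A runnable sketch of the same behavior, with an async that is never awaited. The function name asyncWithoutAwait is illustrative, and the no-op handler is there only to keep the failure off stderr; it does not change how the exception propagates.

```kotlin
import kotlinx.coroutines.*

// Returns (parent reached its last line, parent ended up cancelled).
fun asyncWithoutAwait(): Pair<Boolean, Boolean> = runBlocking {
    val quiet = CoroutineExceptionHandler { _, _ -> } // silence only, no recovery
    val scope = CoroutineScope(Job() + quiet)
    var reachedEnd = false
    val parent = scope.launch {
        async<Unit> {
            delay(50)
            throw IllegalStateException("boom")
        }
        delay(500) // cancelled after roughly 50 ms, long before it would finish
        reachedEnd = true
    }
    parent.join()
    reachedEnd to parent.isCancelled
}

fun main() {
    println(asyncWithoutAwait()) // (false, true)
}
```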

CoroutineExceptionHandler

When an uncaught exception reaches a root coroutine (one launched directly on a CoroutineScope, or a direct child of a supervisor), the CoroutineExceptionHandler from its context is invoked:

val handler = CoroutineExceptionHandler { _, exception ->
    log.error("Unhandled coroutine exception", exception)
    metrics.increment("coroutine.unhandled_exception")
}

val scope = CoroutineScope(Dispatchers.Default + handler)

scope.launch {
    launch {
        throw DatabaseException("connection reset")
        // Propagates to root → handler is called
    }
}

Critical constraints to understand:

  1. Only works on root coroutines. Installing a handler on a child coroutine has no effect — the exception propagates to the parent before the handler can intercept it.

  2. Does NOT prevent cancellation. The handler is a notification mechanism, not a recovery mechanism. By the time it’s called, the failing coroutine has already completed and, under a regular Job, the scope is already cancelled.

  3. async exceptions don’t trigger it. A root async stores its exception in the Deferred for await() to rethrow, so the handler never sees it. An async nested inside a launch still fails that launch, and that failure does reach the handler.

Compare with Java’s Thread.UncaughtExceptionHandler — similar role, similar limitation (notification only, no recovery).
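The first and third constraints can be checked with a short experiment. This is a sketch with made-up names (whichHandlerRuns, rootHandler, childHandler); the scope uses a SupervisorJob only so that the first failure does not cancel the scope before the second experiment runs.

```kotlin
import kotlinx.coroutines.*
import java.util.Collections

// Records which handler, if any, sees each failure.
fun whichHandlerRuns(): List<String> = runBlocking {
    val seen = Collections.synchronizedList(mutableListOf<String>())
    val rootHandler = CoroutineExceptionHandler { _, e -> seen.add("root: ${e.message}") }
    val childHandler = CoroutineExceptionHandler { _, e -> seen.add("child: ${e.message}") }
    val scope = CoroutineScope(SupervisorJob() + rootHandler)

    // A handler installed on a nested launch is ignored; the root one runs.
    scope.launch {
        launch(childHandler) { throw IllegalStateException("nested launch") }
    }.join()

    // A root async keeps its exception in the Deferred; no handler runs.
    scope.async<Unit> { throw IllegalStateException("root async") }.join()

    seen.toList()
}

fun main() {
    println(whichHandlerRuns()) // [root: nested launch]
}
```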

SupervisorJob and supervisorScope

Default behavior (child failure cancels parent and siblings) is correct for most use cases. But sometimes you want isolation — one child’s failure shouldn’t affect others.

SupervisorJob: a Job where child failure does NOT propagate upward.

val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

scope.launch {
    delay(100)
    throw RuntimeException("child 1 failed")
    // This does NOT cancel scope or sibling coroutines
}

scope.launch {
    delay(5000)
    println("Child 2 completes normally despite child 1's failure")
}

supervisorScope: like coroutineScope, but with supervisor behavior.

suspend fun loadDashboard(): Dashboard = supervisorScope {
    val pricing = async { fetchPricing() }       // can fail independently
    val analytics = async { fetchAnalytics() }   // can fail independently
    val alerts = async { fetchAlerts() }         // can fail independently

    Dashboard(
        pricing = runCatching { pricing.await() }.getOrNull(),
        analytics = runCatching { analytics.await() }.getOrNull(),
        alerts = runCatching { alerts.await() }.getOrNull()
    )
}

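Here, if fetchPricing() fails, the other two continue running. Each runCatching block handles the exception locally, and you return a Dashboard with partial data. Note that runCatching also catches CancellationException; that is tolerable here only because supervisorScope itself rethrows cancellation when the enclosing coroutine is cancelled. This pattern is ideal for degradable UIs where some panels can show “unavailable” without killing the whole page.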

The cancellation flow with a supervisor:

SupervisorJob (root)
  ├── Child A (fails with exception)
  │    → exception does NOT propagate to parent
  │    → sibling B is NOT cancelled
  └── Child B (continues running)

Contrast with a regular Job:

Regular Job (root)
  ├── Child A (fails with exception)
  │    → exception propagates to parent
  │    → parent cancels Child B
  └── Child B (cancelled via CancellationException)
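The two diagrams can be compared directly by running the same pair of children under each kind of scope. In this sketch, ChildFailure and siblingSurvives are illustrative names, and the no-op handler only keeps the supervised child's failure off stderr.

```kotlin
import kotlinx.coroutines.*

class ChildFailure : RuntimeException("child A failed")

// Runs a failing child and a slower sibling; reports whether the sibling finished.
suspend fun siblingSurvives(supervised: Boolean): Boolean {
    var siblingFinished = false
    val children: suspend CoroutineScope.() -> Unit = {
        launch { delay(10); throw ChildFailure() }
        launch { delay(100); siblingFinished = true }
    }
    val quiet = CoroutineExceptionHandler { _, _ -> }
    try {
        withContext(quiet) {
            if (supervised) supervisorScope(children) else coroutineScope(children)
        }
    } catch (e: ChildFailure) {
        // Only the regular scope rethrows child A's failure to its caller.
    }
    return siblingFinished
}

fun main() = runBlocking {
    println("regular scope: ${siblingSurvives(supervised = false)}")   // false
    println("supervisor scope: ${siblingSurvives(supervised = true)}") // true
}
```

The example catches a dedicated exception type on purpose: on the JVM, CancellationException extends IllegalStateException, so catching IllegalStateException would also swallow cancellation.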

Real-World Pattern: Parallel API Calls With Partial Failure Handling

Here’s a production-grade pattern for making parallel calls where some are required and others are optional:

suspend fun assembleUserProfile(userId: String): UserProfile = coroutineScope {
    // Required data: if these fail, the whole operation fails
    val user = async { userService.getUser(userId) }
    val permissions = async { authService.getPermissions(userId) }

    // Optional data: failures are caught inside the coroutine and produce defaults,
    // so the coroutine itself never fails and nothing propagates to the scope
    val recommendations = async {
        try {
            recommendationService.getRecommendations(userId)
        } catch (e: CancellationException) {
            throw e  // cancellation must still propagate
        } catch (e: Exception) {
            log.warn("Recommendations unavailable for $userId", e)
            emptyList()
        }
    }

    val recentActivity = async {
        try {
            activityService.getRecent(userId, limit = 10)
        } catch (e: CancellationException) {
            throw e  // cancellation must still propagate
        } catch (e: Exception) {
            log.warn("Activity feed unavailable for $userId", e)
            emptyList()
        }
    }

    UserProfile(
        user = user.await(),                     // throws if user service fails
        permissions = permissions.await(),       // throws if auth service fails
        recommendations = recommendations.await(),
        recentActivity = recentActivity.await()
    )
}

The design: user and permissions are plain children of coroutineScope, so their failure cancels everything (you can’t build a profile without them). recommendations and recentActivity catch their own failures and return fallback values, so they complete normally and the scope never sees an error. Avoid the tempting alternative of passing a fresh SupervisorJob(coroutineContext[Job]) to async: that job becomes a child of the scope that nothing ever completes, so coroutineScope would wait on it forever.

Java’s ExecutorService.shutdownNow() vs Structured Cancellation

Java’s closest equivalent to “cancel everything in this scope”:

ExecutorService pool = Executors.newFixedThreadPool(4);

Future<?> f1 = pool.submit(() -> longRunningTask1());
Future<?> f2 = pool.submit(() -> longRunningTask2());

// Cancel everything
pool.shutdownNow();
// Returns List<Runnable> of tasks that never started
// Running tasks receive Thread.interrupt() — which they can ignore

// Wait for completion with timeout
if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
    // Tasks are STILL RUNNING. What now?
    log.warn("Tasks did not terminate in time");
}

The problems:

  • shutdownNow() sends interrupts, but tasks can catch InterruptedException and continue
  • There’s no hierarchy: shutdownNow() is all-or-nothing, and cancelling a subset of tasks means tracking individual Futures yourself
  • There’s no automatic propagation — child tasks spawned inside running tasks are invisible to the pool
  • The pool itself is a mutable shared resource; shutdown affects all callers

Kotlin’s structured cancellation addresses each of these issues, as long as the coroutines cooperate with cancellation:

val scope = CoroutineScope(Dispatchers.Default)

scope.launch {
    launch { longRunningTask1() }
    launch { longRunningTask2() }
}

// Cancel everything, including nested children at any depth
scope.cancel()
// Each coroutine gets CancellationException at its next suspension point
// Cancellation is still cooperative: code that never suspends must check isActive
// Nested children are cancelled automatically through the Job tree
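Where the Java version needs awaitTermination with a timeout, a scope's Job can be cancelled and joined. The sketch below assumes delay as a stand-in for the long-running tasks and uses made-up names (cancelScopeAndWait, started, cleaned); the CompletableDeferred signals only make sure both tasks are running before the scope is cancelled.

```kotlin
import kotlinx.coroutines.*
import java.util.Collections

// Cancels a scope with nested children and waits until their cleanup has run.
fun cancelScopeAndWait(): List<String> = runBlocking {
    val cleaned = Collections.synchronizedList(mutableListOf<String>())
    val started = List(2) { CompletableDeferred<Unit>() }
    val scope = CoroutineScope(Dispatchers.Default)

    scope.launch {
        repeat(2) { i ->
            launch {
                try {
                    started[i].complete(Unit)
                    delay(Long.MAX_VALUE) // stands in for a long-running task
                } finally {
                    cleaned.add("task${i + 1} cleaned up")
                }
            }
        }
    }

    started.awaitAll()                         // both tasks are running
    scope.coroutineContext.job.cancelAndJoin() // cancel the tree and wait for it
    cleaned.sorted()
}

fun main() {
    println(cancelScopeAndWait()) // [task1 cleaned up, task2 cleaned up]
}
```

scope.cancel() alone returns immediately; joining the scope's Job is what gives you the "everything has stopped" guarantee.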

Exception Handling Best Practices for Production

1. Install CoroutineExceptionHandler at the scope level, not on individual coroutines.

// Application-level scope with centralized error handling
val appScope = CoroutineScope(
    SupervisorJob() +
    Dispatchers.Default +
    CoroutineExceptionHandler { _, e ->
        logger.error("Unhandled coroutine failure", e)
        errorTracker.report(e)
    }
)

2. Use coroutineScope for groups of related work that should fail together.

suspend fun transferFunds(from: Account, to: Account, amount: BigDecimal) = coroutineScope {
    val debit = async { accountService.debit(from, amount) }
    val credit = async { accountService.credit(to, amount) }
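    // If either fails, the other is cancelled if it is still running.
    // A call that already completed is NOT undone: atomicity still needs
    // a transaction or a compensating action in accountService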
    debit.await()
    credit.await()
}

3. Use supervisorScope when partial failure is acceptable.

suspend fun sendNotifications(users: List<User>, message: String) = supervisorScope {
    users.map { user ->
        launch {
            try {
                notificationService.send(user, message)
            } catch (e: CancellationException) {
                throw e  // see rule 4: never swallow cancellation
            } catch (e: Exception) {
                log.warn("Failed to notify ${user.id}", e)
                // Don't propagate: other users should still get notified
            }
        }
    }.joinAll()
}

4. Never catch CancellationException without rethrowing.

// Use this pattern for cleanup-on-cancellation
launch {
    try {
        longRunningWork()
    } catch (e: CancellationException) {
        cleanup()   // release resources
        throw e     // always rethrow
    }
}
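One caveat with cleanup code: once a coroutine is cancelled, any suspending call in its cleanup path throws CancellationException immediately. A common approach is a finally block wrapped in withContext(NonCancellable), sketched here with delay standing in for a suspending release call (cleanupOnCancel is an illustrative name):

```kotlin
import kotlinx.coroutines.*

// Cancels a job and returns what its cleanup path managed to record.
fun cleanupOnCancel(): List<String> = runBlocking {
    val log = mutableListOf<String>()
    val job = launch {
        try {
            delay(Long.MAX_VALUE) // stands in for longRunningWork()
        } finally {
            // Without NonCancellable, the delay below would throw right away
            // because this coroutine is already cancelled.
            withContext(NonCancellable) {
                delay(10) // stands in for a suspending release call
                log.add("resources released")
            }
        }
    }
    delay(50)
    job.cancelAndJoin()
    log
}

fun main() {
    println(cleanupOnCancel()) // [resources released]
}
```

A finally block also covers normal completion and ordinary failures, and it needs no explicit rethrow.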

5. Avoid GlobalScope. It’s unstructured — coroutines launched in GlobalScope have no parent, no automatic cancellation, no lifecycle management.

// AVOID — this coroutine outlives everything, can leak
GlobalScope.launch { backgroundSync() }

// PREFER — scoped to the component's lifecycle
class UserRepository(private val scope: CoroutineScope) {
    fun startSync() = scope.launch { backgroundSync() }
}
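When nothing upstream hands you a scope, the component can own one and tie it to its own lifecycle. This is a sketch under that assumption; SyncService and its close() contract are hypothetical, and delay stands in for one round of backgroundSync():

```kotlin
import kotlinx.coroutines.*

// Hypothetical component that owns its scope and cancels it on close().
class SyncService : AutoCloseable {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun startSync(): Job = scope.launch {
        while (isActive) {
            delay(10) // stands in for one round of backgroundSync()
        }
    }

    // Cancels every coroutine this component ever launched.
    override fun close() = scope.cancel()
}

fun main() = runBlocking {
    val service = SyncService()
    val job = service.startSync()
    service.close()
    job.join()
    println(job.isCancelled) // true
}
```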

Structured concurrency isn’t a feature you opt into — it’s the default behavior that you must deliberately opt out of (via SupervisorJob or GlobalScope). When you find yourself fighting against the parent-child cancellation model, that’s usually a signal that your concurrency structure doesn’t match your failure domain. Restructure the scope hierarchy to match how your system should actually respond to failure, and the exception handling code writes itself.