Supervision and Restart Strategies
This guide demonstrates how to use supervision and restart strategies in Spring Boot Starter Actor to build fault-tolerant, self-healing applications.
Overview
Supervision is a core concept in the actor model where parent actors monitor and respond to failures in their child actors. Spring Boot Starter Actor supports multiple supervision strategies that determine how the system handles failures.
The supervision example provides:
- Interactive visualization of actor hierarchies and supervision behavior
- Real-time failure tracking with visual indicators
- Hands-on testing of different supervision strategies
- Multi-level hierarchies demonstrating supervision at arbitrary depth
Demo
Source Code
Complete source code: https://github.com/seonWKim/spring-boot-starter-actor/tree/main/example/supervision
Supervision Strategies
Spring Boot Starter Actor provides four supervision strategies for handling child actor failures:
1. Restart (Unlimited)
Strategy: `SupervisorStrategy.restart()`
Behavior:
- Child actor is stopped when it fails
- New instance is immediately created with fresh state
- No limit on number of restarts
- Failure counts continue to be tracked across restarts
When to use:
- Transient failures that are likely to succeed on retry
- Stateless actors where losing state is acceptable
- Services that connect to external resources (databases, APIs)
- Worker actors processing independent tasks
Example scenario: A worker actor fetching data from an external API fails due to network timeout. Restarting gives it another chance to succeed.
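To make this concrete, here is a minimal sketch using the Pekko Typed `Behaviors.supervise` API, which the starter is assumed to build on (the starter may expose its own wrapper). The `apiWorker` behavior, its messages, and the actor names are hypothetical.

```java
import org.apache.pekko.actor.typed.ActorSystem;
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

public class UnlimitedRestartSketch {

    // Hypothetical worker: "fail" simulates a network timeout, anything else is processed.
    static Behavior<String> apiWorker() {
        return Behaviors.receive(String.class)
            .onMessageEquals("fail", () -> {
                throw new RuntimeException("simulated network timeout");
            })
            .onAnyMessage(msg -> {
                System.out.println("fetched: " + msg);
                return Behaviors.same();
            })
            .build();
    }

    public static void main(String[] args) {
        Behavior<Void> guardian = Behaviors.setup(ctx -> {
            // Any exception in the child triggers an unlimited restart with fresh state.
            var child = ctx.spawn(
                Behaviors.supervise(apiWorker())
                    .onFailure(RuntimeException.class, SupervisorStrategy.restart()),
                "api-worker");
            child.tell("page-1");
            child.tell("fail");    // child fails and is restarted
            child.tell("page-2");  // processed by the new instance
            return Behaviors.empty();
        });
        ActorSystem.create(guardian, "unlimited-restart-demo");
    }
}
```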
2. Restart (Limited)
Strategy: `SupervisorStrategy.restart().withLimit(maxRetries, withinTimeRange)`
Common configurations:
// Restart up to 3 times within 1 minute
SupervisorStrategy.restart().withLimit(3, Duration.ofMinutes(1))

// Restart up to 10 times within 1 hour
SupervisorStrategy.restart().withLimit(10, Duration.ofHours(1))

// Restart up to 5 times within 30 seconds
SupervisorStrategy.restart().withLimit(5, Duration.ofSeconds(30))
Behavior:
- Child actor restarts after failure, but only up to the specified limit
- If limit is exceeded within the time window, the child is stopped permanently
- Counter resets after the time window expires
- Prevents infinite restart loops
When to use:
- Preventing resource exhaustion from failing actors
- Distinguishing between transient and persistent failures
- Production systems where infinite retries could cause problems
- Critical actors where repeated failures indicate serious issues
Example scenario: A payment processing actor should retry a few times for network issues, but stop permanently if consistently failing (indicating a configuration or system problem).
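A minimal sketch of attaching the limited strategy to a child behavior, under the same Pekko Typed assumption as above; `paymentWorker()` is a hypothetical stand-in for a real child behavior.

```java
import java.time.Duration;
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

final class LimitedRestartSketch {

    // Hypothetical payment worker; replace with your real child behavior.
    static Behavior<String> paymentWorker() {
        return Behaviors.receiveMessage(msg -> Behaviors.same());
    }

    // Retry up to 3 times within 1 minute; beyond that the child stays stopped.
    static Behavior<String> supervisedPaymentWorker() {
        return Behaviors.supervise(paymentWorker())
            .onFailure(RuntimeException.class,
                SupervisorStrategy.restart().withLimit(3, Duration.ofMinutes(1)));
    }
}
```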
3. Stop
Strategy: `SupervisorStrategy.stop()`
Behavior:
- Child actor is stopped immediately on failure
- No restart attempted
- Parent is notified of termination
- Resources are cleaned up
When to use:
- Fatal errors that cannot be recovered by restart
- Validation failures indicating bad configuration
- Actors that should fail fast
- Development/testing to catch errors early
Example scenario: An actor receives invalid configuration at startup. There's no point restarting - the configuration must be fixed first.
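The sketch below, again assuming the Pekko Typed API, combines the Stop strategy with death watch so the parent learns about the termination; `configuredWorker`, the message types, and the log text are hypothetical.

```java
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.Terminated;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

final class StopStrategySketch {

    // Hypothetical child that validates its configuration at startup.
    static Behavior<String> configuredWorker(String config) {
        return Behaviors.setup(ctx -> {
            if (config.isBlank()) {
                throw new IllegalArgumentException("invalid configuration");
            }
            return Behaviors.receiveMessage(msg -> Behaviors.same());
        });
    }

    // The child is stopped on failure; the parent watches it to learn about the termination.
    static Behavior<Void> parent(String config) {
        return Behaviors.setup(ctx -> {
            var child = ctx.spawn(
                Behaviors.supervise(configuredWorker(config))
                    .onFailure(IllegalArgumentException.class, SupervisorStrategy.stop()),
                "configured-worker");
            ctx.watch(child);
            return Behaviors.receive(Void.class)
                .onSignal(Terminated.class, sig -> {
                    ctx.getLog().warn("Child {} stopped; fix the configuration first", sig.getRef());
                    return Behaviors.same();
                })
                .build();
        });
    }
}
```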
4. Resume
Strategy: `SupervisorStrategy.resume()`
Behavior:
- Exception is logged but ignored
- Actor continues processing next message
- Current state is preserved
- No restart occurs
When to use:
- Non-critical failures that shouldn't disrupt operation
- Actors processing independent messages where one failure shouldn't stop others
- Logging/monitoring actors
- Best-effort processing scenarios
Example scenario: A logging actor fails to write one log entry to a remote service. It should continue processing other log entries rather than stopping.
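A minimal sketch of the Resume strategy under the same Pekko Typed assumption; the `remoteLogger` behavior and its trigger message are hypothetical.

```java
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

final class ResumeStrategySketch {

    // Hypothetical logging actor: a rejected remote write throws, but state is kept.
    static Behavior<String> remoteLogger() {
        return Behaviors.setup(ctx -> Behaviors.receiveMessage(entry -> {
            if (entry.startsWith("broken")) {
                throw new RuntimeException("remote log service rejected the entry");
            }
            ctx.getLog().info("forwarded log entry: {}", entry);
            return Behaviors.same();
        }));
    }

    // Resume: the exception is logged, the failing message is dropped,
    // and the same actor instance keeps processing subsequent entries.
    static Behavior<String> supervisedLogger() {
        return Behaviors.supervise(remoteLogger())
            .onFailure(RuntimeException.class, SupervisorStrategy.resume());
    }
}
```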
Choosing the Right Strategy
Use this decision tree to select an appropriate strategy:
Is the failure recoverable?
├─ NO → Stop
└─ YES
   └─ Should we retry?
      ├─ NO → Resume (log and continue)
      └─ YES
         └─ Could infinite retries cause problems?
            ├─ YES → Restart (Limited)
            └─ NO → Restart (Unlimited)
Strategy Comparison
| Strategy | State After Failure | Continues Processing | Automatic Retry | Use Case |
|---|---|---|---|---|
| Restart (Unlimited) | Reset | ✅ (new instance) | ✅ Infinite | Transient failures, stateless workers |
| Restart (Limited) | Reset | ✅ (until limit) | ✅ Limited | Production systems, prevent loops |
| Stop | Lost | ❌ | ❌ | Fatal errors, fail fast |
| Resume | Preserved | ✅ (same instance) | ❌ | Non-critical failures, best effort |
Interactive Demo
Starting the Application
Start the example application (for example, via your build tool's Spring Boot run task for the supervision module), then open your browser to: http://localhost:8080
Using the Web Interface
The interactive UI provides:
- Visual Hierarchy Tree
  - Green circles represent actors
  - Darker green = supervisors (manage children only)
  - Lighter green = workers (can process tasks and supervise)
- Failure Count Badges
  - Red badges show how many times each worker has failed
  - Updated in real time
  - Counts persist across restarts
- Interactive Actions
  - Add Child: Create a child actor with a chosen strategy
  - Send Work: Test that the actor processes messages correctly
  - Trigger Failure: Deliberately fail the actor to test supervision
  - Stop: Manually terminate an actor
- Live Event Log
  - Real-time stream of all actor events
  - Shows spawns, failures, restarts, and stops
  - Helps understand supervision behavior
State Management
Understanding how state behaves across supervision actions:
| Strategy | Actor Instance | Actor State | Message Mailbox |
|---|---|---|---|
| Restart | New instance created | Lost (reset to initial) | Preserved (pending messages are delivered to the new instance) |
| Stop | Terminated | Lost | Lost |
| Resume | Same instance | Preserved | Preserved (continues from next message) |
Important: If you need to preserve state across restarts, consider:
- Persisting state to external storage (database, etc.); see the sketch after this list
- Using event sourcing patterns
- Implementing custom snapshot/recovery logic
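One way to apply the first suggestion is sketched below, assuming the Pekko Typed API: because `Behaviors.setup` runs again after every restart, the actor can reload its state from an external store. The `CounterStore` interface is hypothetical and stands in for a real repository.

```java
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

final class RestartSafeCounterSketch {

    // Hypothetical external store (e.g. a database-backed repository).
    interface CounterStore {
        long load(String id);
        void save(String id, long value);
    }

    // setup() runs on the initial start and again after every restart,
    // so the counter is reloaded from the store instead of being lost.
    static Behavior<String> counter(String id, CounterStore store) {
        return Behaviors.setup(ctx -> counting(id, store, store.load(id)));
    }

    private static Behavior<String> counting(String id, CounterStore store, long value) {
        return Behaviors.receiveMessage(msg -> {
            long next = value + 1;
            store.save(id, next);             // persist on every update
            return counting(id, store, next); // "become" with the new value
        });
    }

    // Restarts reset the in-memory behavior, but setup() restores the persisted count.
    static Behavior<String> supervisedCounter(String id, CounterStore store) {
        return Behaviors.supervise(counter(id, store))
            .onFailure(RuntimeException.class, SupervisorStrategy.restart());
    }
}
```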
Architecture Benefits
Supervision provides:
- ✅ Fault Isolation - Failures don't cascade to unrelated actors
- ✅ Self-Healing - Automatic recovery without manual intervention
- ✅ Graceful Degradation - System continues operating despite component failures
- ✅ Clear Error Boundaries - Supervision hierarchy defines failure domains
- ✅ Observability - Real-time visibility into system health
Best Practices
- Choose strategies per actor type, not globally
- Use limited restart in production to prevent resource exhaustion
- Monitor restart rates to detect systemic issues
- Design for restart - assume actors can restart anytime
- Keep critical state external to survive restarts
- Use hierarchy for isolation - group related actors under supervisors (see the sketch after this list)
- Test failure scenarios before production deployment
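As a sketch of the hierarchy-for-isolation practice (same Pekko Typed assumption as the earlier examples), a supervisor can group related children and pick a strategy per child type; all child behaviors here are hypothetical placeholders.

```java
import java.time.Duration;
import org.apache.pekko.actor.typed.Behavior;
import org.apache.pekko.actor.typed.SupervisorStrategy;
import org.apache.pekko.actor.typed.javadsl.Behaviors;

final class DomainSupervisorSketch {

    // Hypothetical child behaviors; each would be defined elsewhere in your application.
    static Behavior<String> apiWorker()     { return Behaviors.receiveMessage(m -> Behaviors.same()); }
    static Behavior<String> paymentWorker() { return Behaviors.receiveMessage(m -> Behaviors.same()); }
    static Behavior<String> auditLogger()   { return Behaviors.receiveMessage(m -> Behaviors.same()); }

    // One supervisor per failure domain, with a strategy chosen per child type.
    static Behavior<Void> domainSupervisor() {
        return Behaviors.setup(ctx -> {
            ctx.spawn(Behaviors.supervise(apiWorker())
                    .onFailure(RuntimeException.class, SupervisorStrategy.restart()),
                "api-worker");
            ctx.spawn(Behaviors.supervise(paymentWorker())
                    .onFailure(RuntimeException.class,
                        SupervisorStrategy.restart().withLimit(3, Duration.ofMinutes(1))),
                "payment-worker");
            ctx.spawn(Behaviors.supervise(auditLogger())
                    .onFailure(RuntimeException.class, SupervisorStrategy.resume()),
                "audit-logger");
            return Behaviors.empty();
        });
    }
}
```

A failure in any one of these children is handled inside this supervisor's subtree and does not affect actors elsewhere in the hierarchy.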
Common Pitfalls
- ❌ Using unlimited restart everywhere - Can hide persistent problems
- ❌ Using resume for critical failures - May continue with corrupted state
- ❌ Not monitoring failure rates - Miss signs of systemic issues
- ❌ Storing critical state in-memory only - Lost on restart
- ❌ Flat hierarchies - No fault isolation