Fault Tolerance and Error Handling
Overview
Xians leverages Temporal workflow capabilities to provide robust fault tolerance and error handling for AI agents. Temporal automatically manages workflow state, retries, and failure recovery, eliminating the need for custom implementation of complex error handling patterns.
Temporal-Based Fault Tolerance
Core Principle
To utilize fault tolerance features, critical business logic must run within workflows, not activities. Activities should only perform stateless operations and delegate complex error handling to workflows.
Workflow Configuration
Enable fault-tolerant processing by passing processInWorkflow: true when registering a processor:
// Data processing in workflow
flow.SetDataProcessor<MyDataProcessor>(processInWorkflow: true);
// Schedule processing in workflow
flow.SetScheduleProcessor<MyScheduleProcessor>(processInWorkflow: true);
Starting Workflows from Activities
When code is running inside an activity, explicitly start a workflow so that the work executes with workflow-level fault tolerance.
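A minimal sketch of starting a workflow from an activity with the Temporal .NET client follows. It does not show a Xians-specific API. The OrderRequest type, OrderWorkflow class, task queue name, and the constructor-injected ITemporalClient are all assumptions made for illustration.

```csharp
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Client;
using Temporalio.Workflows;

// Hypothetical request type used only for this illustration.
public record OrderRequest(string OrderId);

// Hypothetical workflow that would hold the fault-tolerant business logic.
[Workflow]
public class OrderWorkflow
{
    [WorkflowRun]
    public Task RunAsync(OrderRequest request) => Task.CompletedTask;
}

public class OrderActivities
{
    private readonly ITemporalClient _client;

    // Assumption: the host application supplies the client, for example through DI.
    public OrderActivities(ITemporalClient client) => _client = client;

    [Activity]
    public async Task<string> StartOrderWorkflowAsync(OrderRequest request)
    {
        // A workflow ID derived from the request means a retried activity cannot
        // start a second concurrent run for the same order; the repeated start is rejected.
        var options = new WorkflowOptions(
            $"order-{request.OrderId}",
            "orders-task-queue"); // assumed task queue name

        var handle = await _client.StartWorkflowAsync(
            (OrderWorkflow wf) => wf.RunAsync(request),
            options);

        return handle.Id;
    }
}
```

The activity only hands the work off and returns the workflow ID. Retries, state, and recovery for the business logic then belong to the started workflow.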
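Retry Policies
Custom Retry Configuration
Always configure an explicit retry policy. Temporal's default policy for activities has no attempt limit (1-second initial interval, backoff coefficient of 2.0, interval capped at 100 times the initial interval), so a persistently failing activity keeps retrying until a timeout stops it: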
var retryPolicy = new RetryPolicy
{
    InitialInterval = TimeSpan.FromSeconds(1),
    MaximumInterval = TimeSpan.FromMinutes(2),
    BackoffCoefficient = 2.0,
    MaximumAttempts = 5,
    NonRetryableErrorTypes = new[] { "ValidationException", "AuthenticationException" }
};
// Apply to activity execution
await Workflow.ExecuteActivityAsync(
    () => MyActivity.ProcessAsync(data),
    new ActivityOptions
    {
        ScheduleToCloseTimeout = TimeSpan.FromMinutes(10),
        RetryPolicy = retryPolicy
    });
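To see what a policy like the one above means in practice, the following sketch computes the wait before each retry. It is plain C# with no Temporal dependency, and the BackoffSchedule helper is a hypothetical name introduced here. It assumes the usual exponential rule: initial interval multiplied by the coefficient for each retry, capped at the maximum interval, with MaximumAttempts counting the first attempt.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSchedule
{
    // Returns the wait before each retry: initial * coefficient^(retry - 1), capped at maximum.
    public static List<TimeSpan> Delays(
        TimeSpan initialInterval,
        TimeSpan maximumInterval,
        double backoffCoefficient,
        int maximumAttempts)
    {
        var delays = new List<TimeSpan>();
        // maximumAttempts includes the first attempt, so at most (maximumAttempts - 1) retries occur.
        for (var retry = 1; retry < maximumAttempts; retry++)
        {
            var seconds = initialInterval.TotalSeconds * Math.Pow(backoffCoefficient, retry - 1);
            delays.Add(TimeSpan.FromSeconds(Math.Min(seconds, maximumInterval.TotalSeconds)));
        }
        return delays;
    }
}
```

For the values shown above (1 second, 2 minutes, 2.0, 5 attempts), this yields waits of 1, 2, 4, and 8 seconds. The activity is given up after roughly 15 seconds of backoff plus the time spent in the attempts themselves.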
Timeout Configuration
Use ScheduleToCloseTimeout to cap the total time an activity may take, including all retries, and StartToCloseTimeout to cap each individual attempt:
var activityOptions = new ActivityOptions
{
    ScheduleToCloseTimeout = TimeSpan.FromHours(1), // Maximum total time
    StartToCloseTimeout = TimeSpan.FromMinutes(10), // Per-attempt timeout
    RetryPolicy = retryPolicy
};
Exception Handling
Recoverable Failures
Activities should throw exceptions for recoverable failures so that Temporal retries them according to the configured retry policy:
[Activity]
public async Task<ProcessResult> ProcessDataActivity(ProcessingRequest request)
{
    try
    {
        var result = await _externalService.ProcessAsync(request);
        return result;
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("timeout"))
    {
        // Let Temporal retry transient network issues
        throw new ApplicationFailureException("Network timeout occurred", ex);
    }
    catch (SqlException ex) when (IsTransientError(ex))
    {
        // Let Temporal retry database connection issues
        throw new ApplicationFailureException("Database connection failed", ex);
    }
}
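The IsTransientError helper used above is not defined in this document. One possible shape, sketched here as an assumption, is a lookup over SQL Server error numbers that are commonly treated as transient. The SqlTransientErrors class and the exact set of codes are illustrative and should be adjusted to the errors your database actually returns.

```csharp
using System.Collections.Generic;

public static class SqlTransientErrors
{
    // Error numbers commonly treated as transient for SQL Server and Azure SQL.
    // Assumption: this list is illustrative, not exhaustive.
    private static readonly HashSet<int> TransientNumbers = new()
    {
        -2,     // client-side command timeout
        1205,   // chosen as deadlock victim
        4060,   // cannot open the requested database
        40197,  // service error while processing the request
        40501,  // service is busy
        40613,  // database is currently unavailable
    };

    // Returns true when the error number suggests a retry may succeed.
    public static bool IsTransient(int errorNumber) => TransientNumbers.Contains(errorNumber);
}
```

An IsTransientError(SqlException ex) method could then delegate to SqlTransientErrors.IsTransient(ex.Number).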
Non-Retryable Failures
For data validation and other permanent failures, pass nonRetryable: true:
[Activity]
public async Task ValidateAndProcess(ProcessingRequest request)
{
    if (string.IsNullOrEmpty(request.UserId))
    {
        // Don't retry validation failures.
        // The named argument avoids a positional null, which could match
        // either the errorType or the inner-exception parameter.
        throw new ApplicationFailureException(
            "User ID is required",
            nonRetryable: true);
    }
    if (!await _authService.IsAuthorized(request.UserId))
    {
        // Don't retry authorization failures
        throw new ApplicationFailureException(
            "User not authorized",
            nonRetryable: true);
    }
    // Process the request...
}
Workflow Implementation Examples
Data Processing Workflow
[Workflow]
public class DataProcessingWorkflow
{
    private readonly RetryPolicy _defaultRetryPolicy = new()
    {
        InitialInterval = TimeSpan.FromSeconds(2),
        MaximumInterval = TimeSpan.FromMinutes(5),
        BackoffCoefficient = 2.0,
        MaximumAttempts = 3,
        NonRetryableErrorTypes = new[] { "ValidationException" }
    };

    [WorkflowRun]
    public async Task<ProcessingResult> RunAsync(DataProcessingRequest request)
    {
        // Validate input (non-retryable)
        await Workflow.ExecuteActivityAsync(
            () => ValidationActivity.ValidateAsync(request),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(1),
                RetryPolicy = new RetryPolicy { MaximumAttempts = 1 }
            });

        // Process data (retryable)
        var processResult = await Workflow.ExecuteActivityAsync(
            () => ProcessingActivity.ProcessAsync(request),
            new ActivityOptions
            {
                ScheduleToCloseTimeout = TimeSpan.FromHours(2),
                StartToCloseTimeout = TimeSpan.FromMinutes(30),
                RetryPolicy = _defaultRetryPolicy
            });

        // Store results (retryable)
        await Workflow.ExecuteActivityAsync(
            () => StorageActivity.StoreAsync(processResult),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(5),
                RetryPolicy = _defaultRetryPolicy
            });

        return processResult;
    }
}
Error Categories and Handling
Transient Errors (Retryable)
- Network timeouts
- Database connection failures
- Temporary service unavailability
- Rate limiting (429 errors)
// Let Temporal handle retries automatically
throw new ApplicationFailureException("Service temporarily unavailable", ex);
Permanent Errors (Non-Retryable)
- Data validation failures
- Authentication/authorization errors
- Business rule violations
- Malformed requests
// Prevent unnecessary retries
throw new ApplicationFailureException("Invalid data format", ex, nonRetryable: true);
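When the failure comes from an HTTP call, the two categories above can be separated by status code. The sketch below is one possible classification, not a rule taken from Xians or Temporal. The FailureClassifier name and the choice to treat every unlisted status as permanent are assumptions.

```csharp
using System.Net;

public static class FailureClassifier
{
    // Returns true when a retry has a reasonable chance of succeeding.
    // Assumption: any status not listed here is treated as permanent.
    public static bool IsRetryable(HttpStatusCode status) => status switch
    {
        HttpStatusCode.RequestTimeout => true,      // 408
        HttpStatusCode.TooManyRequests => true,     // 429 (rate limiting)
        HttpStatusCode.BadGateway => true,          // 502
        HttpStatusCode.ServiceUnavailable => true,  // 503
        HttpStatusCode.GatewayTimeout => true,      // 504
        _ => false,
    };
}
```

An activity could then pass nonRetryable: !FailureClassifier.IsRetryable(status) when constructing the ApplicationFailureException, so one code path serves both categories.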
Timeout Types
- StartToCloseTimeout: Maximum time for a single activity attempt
- ScheduleToCloseTimeout: Maximum total time from scheduling to completion, including all retries
- HeartbeatTimeout: Maximum time allowed between heartbeats from a long-running activity
For more details, see Temporal Failure Handling.
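HeartbeatTimeout only has an effect if the activity actually reports heartbeats. The following sketch shows a long-running activity that does so through the Temporal .NET SDK's ActivityExecutionContext. The ImportActivities class, the batch list, and the ProcessBatchAsync placeholder are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Temporalio.Activities;

public class ImportActivities
{
    [Activity]
    public async Task<int> ImportBatchesAsync(List<string> batches)
    {
        var context = ActivityExecutionContext.Current;
        var processed = 0;

        foreach (var batch in batches)
        {
            // Stop promptly if the activity has been cancelled or the worker is shutting down.
            context.CancellationToken.ThrowIfCancellationRequested();

            await ProcessBatchAsync(batch);
            processed++;

            // Report liveness, with the progress count attached as heartbeat details.
            context.Heartbeat(processed);
        }

        return processed;
    }

    // Hypothetical placeholder for the real per-batch work.
    private static Task ProcessBatchAsync(string batch) => Task.Delay(TimeSpan.FromSeconds(1));
}
```

With HeartbeatTimeout set in ActivityOptions to a value comfortably longer than one batch, a crashed worker is detected after that interval rather than after the full StartToCloseTimeout.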
Best Practices
- Run Critical Logic in Workflows: Place fault-tolerant business logic in workflows, not activities
- Configure Custom Retry Policies: Never rely on default infinite retry behavior
- Use Appropriate Timeouts: Set reasonable timeouts to prevent hanging processes
- Mark Non-Retryable Errors: Use nonRetryable: true for validation and permanent failures
- Implement Compensation: Use workflow capabilities for rollback and cleanup operations
- Monitor Workflow Execution: Leverage Temporal's UI and metrics for observability
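The compensation practice in the list above can be sketched as a stack of undo steps that runs in reverse when a later step fails. The example is plain C# so it stays self-contained. CompensationStack and SagaExample are hypothetical names, and in a real workflow each forward and undo step would be an activity call rather than a log entry.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class CompensationStack
{
    private readonly Stack<Func<Task>> _undo = new();

    // Register the undo step right after its forward step succeeds.
    public void Add(Func<Task> compensation) => _undo.Push(compensation);

    // Run the registered undo steps, most recent first.
    public async Task CompensateAsync()
    {
        while (_undo.Count > 0)
        {
            await _undo.Pop()();
        }
    }
}

public static class SagaExample
{
    // Two steps succeed, a third fails, and the first two are rolled back.
    public static async Task<List<string>> RunAsync()
    {
        var log = new List<string>();
        var saga = new CompensationStack();
        try
        {
            log.Add("reserve");
            saga.Add(() => { log.Add("release"); return Task.CompletedTask; });

            log.Add("charge");
            saga.Add(() => { log.Add("refund"); return Task.CompletedTask; });

            // Simulated failure of a third step.
            throw new InvalidOperationException("shipping failed");
        }
        catch (InvalidOperationException)
        {
            await saga.CompensateAsync();
        }
        return log;
    }
}
```

Running SagaExample.RunAsync() records reserve, charge, refund, release, in that order. Registering each undo step only after its forward step succeeds ensures that work which never happened is never rolled back.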
Temporal's workflow engine provides automatic state management, retry logic, and failure recovery, making fault tolerance straightforward and reliable without custom implementation complexity.