Error Handling

FlowState provides structured error handling through exit codes and the on-error mapping. This enables workflows to recover gracefully from failures rather than terminating immediately.

Exit Codes

Every tool execution produces an exit code:

Code	Meaning
`0`	Success - proceed to `next` state
`1`	General error
`2`	Misuse or invalid arguments
`124`	Timeout exceeded
`130`	User cancelled (Ctrl+C)
`-1`	Pause requested (internal)

Tools may define additional codes. Check individual tool documentation for specifics.

The on-error Map

The on-error field maps exit codes to error-handling states:

risky-operation:
  tool: bash
  arguments:
    command: ./might-fail.sh
  on-error:
    1: handle-general-error
    2: handle-invalid-args
    _: handle-unknown-error
  next: success-path

When the tool exits with a non-zero code:

FlowState checks on-error for that specific code
If found, transitions to the mapped state
If not found, checks for the _ wildcard
If no handler matches, the workflow halts with that exit code

Wildcard Matching

The _ key is a catch-all that matches any exit code not explicitly mapped:

fetch-data:
  tool: bash
  arguments:
    command: curl https://api.example.com/data
  on-error:
    _: handle-network-error
  next: process-data

Best practice: Always include a _ handler unless you want the workflow to halt on unexpected errors.

Error Recovery Patterns

Retry with Backoff

fetch-data:
  tool: bash
  arguments:
    command: curl https://api.example.com/data
  output: var(data)
  on-error:
    _: retry-fetch
  next: process-data

retry-fetch:
  tool: bash
  arguments:
    command: sleep 5
  next: fetch-data

For more sophisticated retry logic, track attempt count:

variables:
  max_retries: "3"
  attempt: "0"

fetch-data:
  tool: bash
  arguments:
    command: curl https://api.example.com/data
  output: var(data)
  on-error:
    _: check-retry
  next: process-data

check-retry:
  tool: bash
  arguments:
    command: |
      attempt=$(({{ attempt }} + 1))
      echo $attempt
  output: var(attempt)
  next: evaluate-retry

evaluate-retry:
  tool: switch
  arguments:
    value: "{{ attempt == '{{ max_retries }}' }}"
  goto:
    "true": give-up
    _: wait-and-retry

wait-and-retry:
  tool: bash
  arguments:
    command: sleep 5
  next: fetch-data

give-up:
  tool: bash
  arguments:
    command: 'echo "Failed after {{ max_retries }} attempts" >&2'

Fallback Values

get-config:
  tool: bash
  arguments:
    command: cat /etc/myapp/config.json
  output: var(config)
  on-error:
    _: use-default-config
  next: apply-config

use-default-config:
  tool: bash
  arguments:
    command: echo '{"mode": "default"}'
  output: var(config)
  next: apply-config

Cleanup on Error

create-temp-resources:
  tool: bash
  arguments:
    command: mkdir -p ./temp && touch ./temp/lock
  next: risky-operation

risky-operation:
  tool: bash
  arguments:
    command: ./might-fail.sh
  on-error:
    _: cleanup-and-fail
  next: cleanup-and-succeed

cleanup-and-succeed:
  tool: bash
  arguments:
    command: rm -rf ./temp
  next: done

cleanup-and-fail:
  tool: bash
  arguments:
    command: rm -rf ./temp
  next: report-failure

report-failure:
  tool: bash
  arguments:
    command: 'echo "Operation failed" >&2 && exit 1'

Notification on Failure

important-task:
  tool: bash
  arguments:
    command: ./critical-operation.sh
  on-error:
    _: notify-and-fail
  next: complete

notify-and-fail:
  tool: bash
  arguments:
    command: |
      curl -X POST https://hooks.slack.com/... \
        -d '{"text": "Workflow failed in important-task"}'
  # No next - workflow ends after notification

Timeout Handling

Exit code 124 indicates a timeout. Handle it explicitly:

long-running-task:
  tool: claude
  arguments:
    prompt: "Analyze this large codebase..."
    timeout: 10m
  on-error:
    124: handle-timeout
    _: handle-other-error
  next: use-result

handle-timeout:
  tool: bash
  arguments:
    command: echo "Task timed out - using cached result"
  next: use-cached-result

User Cancellation

Exit code 130 indicates the user pressed Ctrl+C during interactive prompts:

get-user-input:
  tool: ask-user
  arguments:
    question: "Enter the deployment target:"
  output: var(target)
  on-error:
    130: user-cancelled
    _: input-error
  next: deploy

user-cancelled:
  tool: bash
  arguments:
    command: echo "Deployment cancelled by user"

Error State Best Practices

Name clearly: Use prefixes like handle-, recover-, fallback-

Log context: Include enough information to diagnose issues

handle-error:
  tool: bash
  arguments:
    command: 'echo "Error in fetch-data: check network connectivity" >&2'

Preserve state: Avoid modifying variables that might be needed for debugging
Consider idempotency: Error handlers may run multiple times if retrying
Exit cleanly: Terminal error states should produce meaningful exit codes
```
fatal-error:
  tool: bash
  arguments:
    command: exit 1
```

Debugging Failed Workflows

When a workflow halts due to an unhandled error:

Check the instance’s context.json for the current state
Review the exit code from the failed tool
Examine any captured output for error messages
Resume with --resume after fixing the issue, or clean up and start fresh