WattSwarm FAQ: Troubleshooting Common Operator Issues

My task is stuck in TASK_CLAIMED — what's wrong?

A task stays in TASK_CLAIMED when the worker process that claimed it stops making progress. Check these three things in order:

No worker process is running. The run queue requires at least one active worker. If you started the kernel directly (not via Docker Compose), you must start the worker separately:

wattswarm run worker --concurrency 16 --pg-url postgres://postgres:postgres@127.0.0.1:55432/wattswarm

Docker Compose starts a dedicated worker container automatically. Verify it is healthy:

docker compose ps

No executor is registered. If the worker has nothing to dispatch to, it cannot execute the step. List registered executors:

wattswarm executors list

If the list is empty, add your runtime:

wattswarm executors add rt http://127.0.0.1:8787

Executor health check is failing. Even if an executor is registered, the worker will not dispatch to it if its health check fails. Run a health check directly:

wattswarm executors check <name>

A healthy executor must return {"status": "ok"} with HTTP 200 from its /health endpoint.
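The health rule above can be sketched as a small shell helper, which is handy when you want to test a runtime before registering it. This is a minimal sketch, not the kernel's actual check: the function names `is_healthy` and `probe_health` are hypothetical, and it assumes the body must equal `{"status": "ok"}` exactly, ignoring whitespace (whether extra fields are tolerated is not stated here).

```shell
# Hypothetical helpers; they are not part of the wattswarm CLI.

# Usage: is_healthy HTTP_CODE BODY
# Succeeds only for HTTP 200 with a body of {"status": "ok"}.
is_healthy() {
  # Strip whitespace so {"status": "ok"} and {"status":"ok"} compare equal.
  body=$(printf '%s' "$2" | tr -d '[:space:]')
  [ "$1" = "200" ] && [ "$body" = '{"status":"ok"}' ]
}

# Usage: probe_health BASE_URL
# Fetches BASE_URL/health with curl and applies the rule above.
probe_health() {
  tmp=$(mktemp) || return 1
  # -o writes the body to a file; -w prints only the HTTP status code.
  code=$(curl -s -o "$tmp" -w '%{http_code}' "$1/health")
  body=$(cat "$tmp")
  rm -f "$tmp"
  is_healthy "$code" "$body"
}
```

For example, `probe_health http://127.0.0.1:8787 && echo healthy` prints `healthy` only when the runtime at that address satisfies both conditions.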

I get a lease conflict error

Lease conflict errors have two common causes:

Two nodes share the same node_seed.hex. Each node derives its node_id and public key from its seed file. If you copy the same seed to multiple nodes, the registry bootstrap detects a duplicate identity and raises a conflict error. Every node must have a unique seed generated in its own state directory. To fix this, stop the conflicting node, delete its state directory, and start fresh so a new seed is generated.

Clock skew between nodes exceeds CLOCK_SKEW_TOLERANCE_MS. Lease validity is checked against wall-clock time. If two nodes have clock drift larger than the tolerance window, a worker on one node may treat an active lease held by another node as expired and attempt to reclaim it. Synchronize system clocks across all nodes using NTP or equivalent, and ensure the difference stays well below the configured tolerance.
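To see why clock skew produces a conflict, it can help to work through the arithmetic. The sketch below is an illustrative model only: it assumes a lease counts as expired once the observer's clock passes the expiry time plus the tolerance, which may not match the kernel's exact rule. The function name `lease_view` and all numbers are made up for the example.

```shell
# Hypothetical model of a lease check; all values are in milliseconds.
# Usage: lease_view OBSERVER_NOW_MS EXPIRES_AT_MS TOLERANCE_MS
lease_view() {
  if [ "$1" -gt $(( $2 + $3 )) ]; then
    echo expired
  else
    echo active
  fi
}

true_now=9800      # actual time; the lease below still has 200 ms left
expires_at=10000
tolerance=1000     # stand-in for CLOCK_SKEW_TOLERANCE_MS

# The holder's clock is accurate.
lease_view "$true_now" "$expires_at" "$tolerance"
# An observer whose clock runs 500 ms fast stays inside the tolerance window.
lease_view $(( true_now + 500 )) "$expires_at" "$tolerance"
# An observer whose clock runs 1500 ms fast falls outside it.
lease_view $(( true_now + 1500 )) "$expires_at" "$tolerance"
```

Under these assumptions the script prints `active`, `active`, `expired`: the third observer would try to reclaim a lease that its holder still considers valid.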

Run result shows QUORUM_NULL or FINALIZE_NULL

These outcomes mean the aggregation policy could not reach a non-null decision.

Not enough agents reached agreement. Check the null_policy chain in your run spec. The default chain is ["REEXPLORE", "FINALIZE_NULL"], which gives agents one extra attempt before falling back to a null result.

Add REEXPLORE before FINALIZE_NULL. If your spec reaches FINALIZE_NULL too quickly, ensure REEXPLORE appears first in the chain:

{
  "aggregation": {
    "null_policy": {
      "enabled_on": ["EMPTY", "QUORUM_NULL"],
      "chain": ["REEXPLORE", "FINALIZE_NULL"]
    }
  }
}

Quorum threshold is too high or agent count is too low. If you require more agreeing votes than you have agents, quorum can never be satisfied. Lower the aggregation.quorum threshold or increase the number of agents in the run spec.
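A quick sanity check before submitting a run spec is to compare the two numbers directly. The sketch below assumes aggregation.quorum is an integer count of agreeing votes, as the wording above suggests; `quorum_status` is a hypothetical helper, not a wattswarm command.

```shell
# Hypothetical helper: compare a quorum threshold with the agent count.
# Usage: quorum_status QUORUM AGENT_COUNT
quorum_status() {
  if [ "$1" -gt "$2" ]; then
    echo "unreachable: quorum $1 exceeds $2 agents"
  elif [ "$1" -eq "$2" ]; then
    echo "fragile: every agent must agree"
  else
    echo "reachable: $(( $2 - $1 )) agent(s) may disagree or fail"
  fi
}

quorum_status 5 3
quorum_status 3 3
quorum_status 2 3
```

The middle case is worth noticing: a quorum equal to the agent count is satisfiable, but a single dissenting or missing vote is enough to produce a null result.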

The decision keeps getting re-explored (TASK_RETRY_SCHEDULED)

Repeated re-exploration means the aggregation policy keeps finding the round insufficient before it can close.

Vote reveals are insufficient before expiry_ms. If agents do not complete their commit-reveal cycle within the expiry window, the kernel emits TASK_RETRY_SCHEDULED and opens a new round. Increase expiry_ms in your task contract to give agents more time, or reduce the number of required verifiers so the threshold is reachable in the available window.

All agents are returning the same output with low confidence. If confidence scores are below the verification policy threshold, the round looks inconclusive even when all agents agree. Adjust your verification policy (vp.schema_thresholds.v1 or vp.crosscheck.v1) to lower the required confidence floor, or supply additional evidence to raise agent confidence.

PostgreSQL connection fails on startup

Check the connection URL first. The most common mistake is using the wrong host or port depending on whether the process is running on the host or inside a container.

From the host machine (e.g. the wattswarm CLI running locally against Docker Compose):

postgres://postgres:postgres@127.0.0.1:55432/wattswarm

From inside a Docker container (e.g. the worker or kernel containers):

postgres://postgres:postgres@postgres:5432/wattswarm

Set the URL with WATTSWARM_PG_URL or pass it directly:

wattswarm --pg-url postgres://postgres:postgres@127.0.0.1:55432/wattswarm node status

Also verify that the PostgreSQL container or service is actually running and has passed its health check before the kernel starts.
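One way to avoid mixing up the two URLs is to pick the value by context in a single place. The helper below is a hypothetical convenience wrapper, not part of WattSwarm; it only returns the two URLs shown above and assumes the default credentials and ports from the Docker Compose setup.

```shell
# Hypothetical helper: return the connection URL for a given context.
# Usage: pg_url_for host|container
pg_url_for() {
  case $1 in
    host)
      # Processes on the host reach PostgreSQL through the published port.
      echo "postgres://postgres:postgres@127.0.0.1:55432/wattswarm" ;;
    container)
      # Containers reach PostgreSQL by service name on the internal port.
      echo "postgres://postgres:postgres@postgres:5432/wattswarm" ;;
    *)
      echo "unknown context: $1" >&2
      return 1 ;;
  esac
}

# Example usage from the host (commented out so the sketch has no side effects):
#   export WATTSWARM_PG_URL=$(pg_url_for host)
#   wattswarm node status
# If the PostgreSQL client tools are installed, readiness can be probed with:
#   pg_isready -d "$WATTSWARM_PG_URL"
```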

My executor fails the capabilities check

The kernel checks two endpoints on every registered executor.

/capabilities must list the task type. The response must include a task_types array that contains the task_type value from your task contract. If your task uses task_type: "research" but /capabilities returns only ["summarize"], the executor will be skipped.

/health must return HTTP 200 with {"status": "ok"}. Any other response code or body format causes the health check to fail. Verify this with:

curl -s http://127.0.0.1:8787/health
# expected: {"status":"ok"}

curl -s http://127.0.0.1:8787/capabilities
# expected: {"task_types":["your-task-type", ...]}

Fix both endpoints in your runtime implementation, then re-run wattswarm executors check <name> to confirm.
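The task-type match can also be checked from the shell without extra tooling. The sketch below is a rough approximation that uses grep on the raw response, so it assumes a flat task_types array of simple names without regex metacharacters; a proper JSON parser is more robust. The function name `supports_task_type` is hypothetical.

```shell
# Hypothetical helper: does a /capabilities body list the given task type?
# Usage: supports_task_type TASK_TYPE BODY
supports_task_type() {
  task=$1
  # Remove whitespace so the pattern does not depend on JSON formatting.
  compact=$(printf '%s' "$2" | tr -d '[:space:]')
  # Look for the quoted task type inside the task_types array.
  printf '%s' "$compact" | grep -q "\"task_types\":\\[[^]]*\"$task\""
}

# Example usage against a live runtime (commented out; requires curl):
#   body=$(curl -s http://127.0.0.1:8787/capabilities)
#   supports_task_type research "$body" || echo "task type not advertised"
```

Because the pattern includes the surrounding quotes, a task type of `search` does not match an executor that only advertises `research`.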

I see p2p_foundation = iroh but nodes are not syncing

When the diagnostics endpoint shows p2p_foundation: "iroh" but events are not flowing between nodes, work through these checks:

Verify bootstrap contacts are correct. An incorrect or stale contact string means Iroh cannot establish a QUIC session to the peer. Check the contacts currently stored:

wattswarm node bootstrap-contacts

Re-export the contact from the genesis or bootstrap node and update joining nodes if the address has changed.

Check that P2P is enabled. The worker container intentionally runs with WATTSWARM_P2P_ENABLED=false to avoid file-lock contention. Confirm the kernel container has it enabled (it defaults to true):

docker compose exec kernel env | grep P2P_ENABLED

Check firewall rules. Iroh uses WATTSWARM_P2P_PORT (default 4001) for both TCP and UDP. Ensure this port is open inbound on every node that needs to accept direct connections. Cloud providers typically require an explicit inbound security-group rule.

How do I wipe node state and start fresh?

To reset a node completely and generate a new identity:

Stop the node

docker compose down
# or, if running directly:
wattswarm node down

Delete the state directory

# Docker volume
docker volume rm wattswarm_state_data

# Local state dir
rm -rf .wattswarm/

Drop the PostgreSQL database

psql postgres://postgres:postgres@127.0.0.1:55432/postgres \
  -c "DROP DATABASE IF EXISTS wattswarm;"
psql postgres://postgres:postgres@127.0.0.1:55432/postgres \
  -c "CREATE DATABASE wattswarm;"

Restart

docker compose up -d --build

A new node_seed.hex is generated automatically, giving the node a fresh identity.

This operation is permanent. Deleting the state directory destroys the node’s Ed25519 identity (node_seed.hex) and all locally stored event history. Any network participants that held a relationship or sync state with this node ID will need to be updated.