Troubleshooting

Startup failures

`production security validation failed`

The gate refuses to start outside dev mode when the configuration is not production-safe. The error message identifies the specific violation. Common causes:

Identity provider is none — set [identity] provider = "peercred" and configure UID-to-principal mappings. See Configuration.

Shared operator API key without DPoP — replace the single operator_api_key with named [operator_credentials.NAME] entries, each with a dpop_jkt. Run latchgate config add-operator --name <n> (generates keypair + updates TOML in one step).

unsafe_expose_http not set but TCP listener configured — either remove listen_http_addr (use UDS only) or explicitly set unsafe_expose_http = true.

Ephemeral signing keys — set receipt_signing_key_path, grant_signing_key_path, and receipt_keys_jwks_path to persistent file paths.

response_schema_enforcement = "warn" — set to "deny" for production (the warn mode allows schema violations through).

allow_unmapped = true — remove this from [identity.peercred] or set it to false. Unmapped UIDs must be denied in production.

Actions require egress proxy but egress_proxy_url not set — if any action uses egress.profile = "proxy_allowlist", configure egress_proxy_url and start Squid. See Egress Proxy.

Fix: run latchgate doctor before starting to catch these issues early.

`failed to initialise replay cache: Is Redis running?`

The gate cannot connect to Redis at startup. Redis is required for replay protection, budgets, approvals, and revocation epoch.

Check: redis-cli -u <redis_url> PING. Verify redis_url in latchgate.toml. If using Docker, ensure the Redis container is healthy before starting the gate (depends_on with condition: service_healthy).

`failed to load action manifests`

Manifest YAML files in manifests_dir are invalid or the directory does not exist.

Check the manifests_dir path in latchgate.toml and confirm the directory exists. Verify manifest digests against the compiled .wasm modules (run make providers to rebuild and rehash). Common causes: a missing manifests_dir, a YAML syntax error, or a JSON Schema reference (../schemas/foo.json) pointing to a nonexistent file.

`WASM runtime initialisation failed`

The wasmtime engine could not start. This typically means the host does not support the required WASM features or wasm_providers_dir contains corrupt .wasm files.

Check: run make providers to rebuild all provider modules and rehash digests.

All requests return deny (503)

When a dependency is unreachable, LatchGate fails closed — every action is denied with a 503 status code. This is by design: there is no code path that produces ALLOW when a dependency is in an unknown state.
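
The fail-closed rule can be pictured as a decision function in which ALLOW is reachable only after every dependency has positively reported healthy. The following is a minimal sketch for illustration, not LatchGate's implementation: `DependencyState`, `decide`, and the returned tuple shape are hypothetical names.

```python
from enum import Enum


class DependencyState(Enum):
    HEALTHY = "healthy"
    DOWN = "down"
    UNKNOWN = "unknown"


def decide(dependencies: dict[str, DependencyState], policy_allows: bool) -> tuple[int, str]:
    """Return (http_status, decision). Hypothetical shape, for illustration only."""
    # Anything other than an explicit HEALTHY report is treated as a failure,
    # so an UNKNOWN dependency denies exactly like a DOWN one.
    for name, state in dependencies.items():
        if state is not DependencyState.HEALTHY:
            return 503, f"deny:{name}_unavailable"
    # Only reached when every dependency is known-good.
    if policy_allows:
        return 200, "allow"
    return 403, "deny:policy_denied"
```

Note that the policy result is never consulted while a dependency is unhealthy, which is why a permissive policy cannot turn an outage into an ALLOW.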

Redis is down

Symptoms: 503 replay_cache_unavailable or budget_store_unavailable on every execute() call. Lease issuance may still work (it does not require Redis), but execution will fail.

What Redis stores: DPoP jti replay cache (anti-replay), per-session budget counters, approval state, revocation epoch.

Fix: Restore Redis connectivity. Verify with redis-cli PING. If Redis was restarted without persistence, budget counters and the replay cache are reset. Budgets simply start fresh, but the anti-replay guarantee is temporarily weakened: a previously issued jti could, in principle, be replayed while its proof is still inside the replay TTL window.

Redis persistence: For production, configure at minimum appendonly yes (AOF) on your Redis instance. The quickstart Docker stack uses --save "" --appendonly no (volatile) — this is for dev only.

OPA is down

Symptoms: 503 policy_engine_unavailable or policy_engine_timeout on every execute() call.

Fix: Restore OPA. Verify with curl http://<opa_url>/health. Check that the policy bundle is loaded: curl http://<opa_url>/v1/data/latchgate/decision. If OPA returns {} (empty), the policy is not loaded — check the volume mount and the OPA command arguments.
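
The empty-response check can be automated. OPA's Data API omits the `result` key when the queried path is undefined, so the test reduces to looking for that key. The helper name `policy_loaded` below is hypothetical, and the sketch assumes the response body has already been fetched as a string.

```python
import json


def policy_loaded(response_body: str) -> bool:
    """Interpret the body returned by GET /v1/data/latchgate/decision."""
    document = json.loads(response_body)
    # OPA leaves out "result" when nothing is defined at the queried path,
    # so a bare {} means the policy is not loaded.
    return isinstance(document, dict) and "result" in document
```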

Timeout tuning: If you see policy_engine_timeout but OPA is reachable, investigate OPA bundle size or network latency. The timeout is a compile-time constant (1000 ms) — raising it is not the correct fix.

Egress proxy is down

Symptoms: Actions with egress.profile = "proxy_allowlist" return 502 action_execution_failed.

Fix: Check proxy reachability: curl -x <egress_proxy_url> https://api.github.com/. Restart Squid if needed. /readyz returns degraded (not not_ready) when the proxy is down — the gate keeps routing actions that don’t need the proxy.

See Egress Proxy for configuration and setup.

Authentication errors (401)

`lease_expired`

The lease JWT has expired (default TTL: 5 minutes). The SDKs auto-renew leases when fewer than 60 seconds remain, but long gaps between calls (or clock skew) can cause expiry.

Fix (SDK): With lazy-connect (agent_id set in the constructor), the SDK reconnects automatically on the next execute() call. With explicit connect(), catch LatchGateAuthError and call connect() again before retrying.

Fix (manual): Obtain a new lease from POST /v1/leases.
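
The renewal threshold and the reconnect-on-error pattern can be sketched as follows. This is an illustration under assumptions rather than SDK code: the `client` object, the `execute(action_id, params)` signature, and the local `LatchGateAuthError` class are stand-ins for whatever the SDK you use actually exposes.

```python
import time
from typing import Optional

RENEW_THRESHOLD_SECONDS = 60  # from the text: SDKs renew when fewer than 60 s remain


def needs_renewal(expires_at: float, now: Optional[float] = None) -> bool:
    """True when the lease has fewer than RENEW_THRESHOLD_SECONDS left."""
    now = time.time() if now is None else now
    return expires_at - now < RENEW_THRESHOLD_SECONDS


class LatchGateAuthError(Exception):
    """Stand-in for the SDK exception of the same name."""


def execute_with_reconnect(client, action_id, params, agent_id):
    """Run execute(); on an auth error, reconnect once and retry once."""
    try:
        return client.execute(action_id, params)
    except LatchGateAuthError:
        # connect() obtains a fresh lease (assumed keyword argument: agent_id).
        client.connect(agent_id=agent_id)
        return client.execute(action_id, params)
```

The retry is deliberately limited to one attempt so that a persistent authentication problem surfaces to the caller instead of looping.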

`invalid_dpop`

The DPoP proof failed verification. Common causes: the proof’s htu does not match the server’s expected URI, the proof was signed with the wrong key, or the iat timestamp is outside the allowed clock skew window.

Debug: Set RUST_LOG=debug and check the log for the sanitized DPoP rejection reason. Ensure public_base_url in latchgate.toml matches the URL that clients use to reach the gate.
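
To see why a mismatched public_base_url produces this error, it can help to compute the htu the server would expect and compare it with what the client signed. The sketch below is a simplified illustration: `expected_htu` and `check_proof_claims` are hypothetical helpers, and `MAX_CLOCK_SKEW_SECONDS` is a placeholder value because the actual skew window is not stated here. Per RFC 9449, htu excludes the query and fragment.

```python
from urllib.parse import urlsplit, urlunsplit

MAX_CLOCK_SKEW_SECONDS = 60  # placeholder; the gate's real window may differ


def expected_htu(public_base_url: str, request_path: str) -> str:
    """Build the URI a proof's htu claim is compared against."""
    base = urlsplit(public_base_url)
    path = base.path.rstrip("/") + "/" + request_path.lstrip("/")
    # Scheme and host are case-insensitive; query and fragment are dropped.
    return urlunsplit((base.scheme.lower(), base.netloc.lower(), path, "", ""))


def check_proof_claims(htu: str, iat: int, public_base_url: str,
                       request_path: str, now: int) -> list[str]:
    """Return the list of claim problems found (empty when both checks pass)."""
    problems = []
    if htu != expected_htu(public_base_url, request_path):
        problems.append("htu_mismatch")
    if abs(now - iat) > MAX_CLOCK_SKEW_SECONDS:
        problems.append("iat_outside_skew_window")
    return problems
```

A client that reaches the gate through `http://localhost:8080` while public_base_url names a different host will fail the first check even though its key and signature are valid.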

`replay_detected`

The DPoP proof’s jti was already seen. The SDKs generate a fresh jti for every request, so this typically indicates a retry of the exact same HTTP request (e.g., a load balancer retrying on timeout).

Fix: Do not retry the exact same HTTP request with the same DPoP proof. If using an SDK and seeing this error, check for network-level retries (proxy, load balancer) that LatchGate cannot distinguish from an attack.
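
The reason a network-level retry is indistinguishable from an attack becomes clear from how a jti cache behaves. The class below is an in-memory stand-in for the Redis-backed cache, written only to illustrate the check; `ReplayCache` and its method names are hypothetical.

```python
import time
from typing import Optional


class ReplayCache:
    """In-memory illustration of a jti replay cache (not the Redis implementation)."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._seen: dict[str, float] = {}  # jti -> expiry timestamp

    def check_and_record(self, jti: str, now: Optional[float] = None) -> bool:
        """Return True if the jti is fresh, False if it is a replay."""
        now = time.monotonic() if now is None else now
        # Drop entries whose replay window has passed.
        self._seen = {j: exp for j, exp in self._seen.items() if exp > now}
        if jti in self._seen:
            return False
        self._seen[jti] = now + self.ttl_seconds
        return True
```

The cache sees only the jti, so a load balancer resending the same bytes and an attacker resending a captured proof produce the same result.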

Policy denials (403)

`policy_denied: principal 'X' is not authorised for action 'Y'`

The OPA policy ACL does not grant this principal access to the requested action.

Fix: Grant the action to the principal via latchgate policy grant <principal> <action_id>, then reload OPA. Or edit policies/data.json directly if using a custom ACL structure.

Debug: Query OPA directly to see the full evaluation:

curl -X POST http://localhost:8181/v1/data/latchgate/decision \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "principal": "dev:anonymous",
      "action_id": "http_fetch",
      "action_trust_verdict": "digest_ok",
      "risk_level": "low",
      "scopes": ["tools:call"],
      "required_scopes": [],
      "requested_sinks": ["http_read"],
      "budgets_before": {"calls_remaining": 100}
    }
  }'

`budget_exhausted`

The session’s call budget is depleted. Budgets are set at lease issuance time and tracked atomically in Redis.

Fix: Obtain a new lease (the SDK’s connect() method does this). To increase the budget, pass a larger max_calls value at lease issuance.

`action_digest_mismatch`

The WASM provider module’s SHA-256 digest does not match the value declared in the action manifest. This means the .wasm file was recompiled but the manifest was not updated.

Fix: Run make providers to rebuild and update digests, then restart the gate.
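
To confirm a mismatch by hand before rebuilding, the digest of a module can be recomputed and compared with the manifest value. This is a minimal sketch: the function names are hypothetical, and the `sha256:<hex>` format is assumed from the digest strings shown elsewhere in this guide.

```python
import hashlib
from pathlib import Path


def wasm_digest(path: Path) -> str:
    """Compute the SHA-256 digest of a file as 'sha256:<hex>'."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large modules are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def digest_matches(path: Path, declared: str) -> bool:
    """Compare the file's digest with the value declared in the manifest."""
    return wasm_digest(path) == declared.strip().lower()
```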

Approval issues

Approval expired before the operator could review it

Pending approvals have a TTL (5 minutes, a compile-time constant). After expiry, the approval cannot be consumed and the original execute() call’s 202 response is final — the agent must re-submit the action.

Fix: Set up webhook notifications so operators are alerted immediately when approvals are pending.

`approval not found` when approving

The approval either expired, was already consumed (approved or denied), or the approval_id is wrong. Approval consumption is atomic and one-shot: concurrent approve requests on the same ID are serialized and only the first succeeds (enforced by a Redis Lua script plus a durable SQLite outcome marker).

Debug: latchgate approvals list --all shows completed approvals with their final status.
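
One-shot consumption explains why a second approve request sees `approval not found` even though the ID was valid moments earlier. The sketch below shows the idea with an in-process lock. It is not the Redis Lua and SQLite mechanism the gate uses, and `ApprovalStore` is a hypothetical name.

```python
import threading


class ApprovalStore:
    """In-memory illustration of atomic, one-shot approval consumption."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending: set[str] = set()
        self._outcomes: dict[str, str] = {}

    def create(self, approval_id: str) -> None:
        with self._lock:
            self._pending.add(approval_id)

    def consume(self, approval_id: str, outcome: str) -> bool:
        """Move a pending approval to a final outcome; succeeds at most once."""
        with self._lock:
            if approval_id not in self._pending:
                return False  # expired, already consumed, or unknown ID
            self._pending.remove(approval_id)
            self._outcomes[approval_id] = outcome
            return True
```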

Unresolved intents (evidence gaps)

What is an unresolved intent?

An ExecutionIntent is a pre-dispatch durable evidence marker. It is written to SQLite before the WASM provider dispatches. In normal operation, a matching receipt is written within milliseconds after dispatch completes.

An unresolved intent is an intent without a matching receipt. This indicates that the process either crashed between dispatch and receipt write, or that the receipt write itself failed (disk full, SQLite error, signal).

Unresolved intents are visible in:

The SQLite ledger (query it directly for intents without a matching receipt)
/v1/admin/status — unresolved_intents field
/metrics — latchgate_unresolved_intents gauge
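
Conceptually, finding unresolved intents is an anti-join: every intent row with no receipt row pointing at it. The sketch below demonstrates that query shape against an in-memory database. The table and column names are entirely hypothetical, since the real ledger schema is not described here, so treat it as an illustration of the idea rather than a query to run against a production ledger.

```python
import sqlite3

# Hypothetical schema for illustration; the real ledger layout may differ.
SCHEMA = """
CREATE TABLE execution_intents (
    intent_id TEXT PRIMARY KEY, trace_id TEXT, action_id TEXT, dispatched_at TEXT
);
CREATE TABLE receipts (receipt_id TEXT PRIMARY KEY, intent_id TEXT);
"""

# Anti-join: keep intents for which the LEFT JOIN found no receipt.
UNRESOLVED_QUERY = """
SELECT i.intent_id, i.trace_id, i.action_id, i.dispatched_at
FROM execution_intents AS i
LEFT JOIN receipts AS r ON r.intent_id = i.intent_id
WHERE r.intent_id IS NULL
ORDER BY i.dispatched_at
"""


def unresolved_intents(conn: sqlite3.Connection) -> list:
    return conn.execute(UNRESOLVED_QUERY).fetchall()


def demo() -> list:
    """Two intents, one receipt: the intent without a receipt is returned."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO execution_intents VALUES (?, ?, ?, ?)",
        [
            ("intent_a", "trc_a", "http_fetch", "2025-03-28T14:00:00Z"),
            ("intent_b", "trc_b", "gmail_send", "2025-03-28T14:30:00Z"),
        ],
    )
    conn.execute("INSERT INTO receipts VALUES (?, ?)", ("rcpt_a", "intent_a"))
    return unresolved_intents(conn)
```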

Investigating an unresolved intent

# Query the ledger SQLite database for execution intents without matching receipts

Output (one entry per unresolved intent):

{
  "intent_id": "intent_01J...",
  "trace_id": "trc_01J...",
  "action_id": "gmail_send",
  "principal": "agent-support",
  "dispatched_at": "2025-03-28T14:30:00Z",
  "age_seconds": 3600,
  "grant_id": "grant_01J...",
  "provider_module": "sha256:abc..."
}

For each intent:

Check whether the side effect actually occurred. Look at the target system (the email inbox, the database, the issue tracker) for evidence of the operation around dispatched_at. The trace_id and any unique payload fields from the action manifest help correlate.
Check server logs around dispatched_at. Look for panics, OOM kills, SIGKILL, or disk errors. RUST_LOG=debug logs include the full execution pipeline for that trace.
Decide whether to replay. If the side effect did NOT occur, the agent can safely re-submit the action (fresh request, fresh approval if needed). If the side effect DID occur, DO NOT replay — instead, manually record the outcome in your audit system.
Mark the intent as investigated. There is no automated “resolve” — intents remain in the ledger as permanent evidence of a gap. Track resolution in your incident system.

Preventing unresolved intents

Persistent SQLite storage. Use a local SSD or NVMe disk, not a network volume. SQLite on NFS/SMB is not durable.
Monitor latchgate_unresolved_intents. Alert when the gauge is non-zero for more than a few minutes.
Back up the ledger SQLite database on a schedule. Use SQLite’s online backup API for crash-consistent backups without stopping the gate.
Graceful shutdown. On SIGTERM, the gate drains in-flight executions before exiting. Orchestrators (systemd, Kubernetes) should allow at least 30 seconds for graceful shutdown.
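
For the scheduled backup, Python's standard `sqlite3` module exposes SQLite's online backup API through `Connection.backup`. The sketch below is one possible way to use it. The function name and file paths are placeholders, and opening the source with `mode=ro` is a precaution chosen for this example.

```python
import sqlite3


def backup_ledger(source_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    # Open the source read-only so the backup job cannot modify the ledger.
    source = sqlite3.connect(f"file:{source_path}?mode=ro", uri=True)
    dest = sqlite3.connect(dest_path)
    try:
        # Copy 256 pages per step; the source lock is released between steps.
        source.backup(dest, pages=256)
    finally:
        dest.close()
        source.close()
```

Run it from a scheduler (cron, a systemd timer) and write to a different disk than the one holding the ledger.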

`evidence_persistence_failed` in response

If a client receives 500 evidence_persistence_failed, the provider executed (side effect may have occurred) but the receipt + audit transaction did not commit. The corresponding ExecutionIntent remains unresolved until an operator investigates.

The budget is NOT refunded in this case — since the side effect may have happened, refunding would allow the client to re-trigger it for free. Manual budget adjustment via operator CLI is required if investigation confirms the side effect did not occur.

Webhook issues

Webhooks not firing

The startup banner shows Webhooks: active or Webhooks: disabled. If disabled, no [[webhooks]] sections are present in latchgate.toml.

Fix: Add at least one [[webhooks]] entry and restart the gate. See Webhooks for configuration. Run latchgate doctor to validate webhook configuration before starting.

Webhook delivery failures in logs

Look for webhook delivery failed (dead-letter) warnings. The log includes endpoint name, event type, event ID, HTTP status, and error message.

Common causes:

DNS resolution failed — the webhook URL hostname cannot be resolved
SSRF blocked — the URL resolved to a private/reserved IP address
HTTP 4xx (not retried) — the receiving endpoint rejected the payload
HTTP 5xx / timeout (retried) — the receiver is unhealthy; LatchGate retries with backoff (default: 1s, 5s, 30s). If all retries fail, the event is dead-lettered.
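
The retry rules in the list above can be summarized in a few lines. This sketch reads the default backoff as one initial attempt followed by three retries after 1 s, 5 s, and 30 s, which is an interpretation rather than a confirmed detail. `deliver_with_retry` and the `send` callback are hypothetical.

```python
import time
from typing import Callable, Optional

RETRY_DELAYS_SECONDS = (1, 5, 30)  # default backoff from the text


def deliver_with_retry(send: Callable[[], Optional[int]],
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """send() returns an HTTP status code, or None on timeout or connection error."""
    for delay in (0,) + RETRY_DELAYS_SECONDS:
        if delay:
            sleep(delay)
        status = send()
        if status is not None and 200 <= status < 300:
            return "delivered"
        if status is not None and 400 <= status < 500:
            return "rejected"  # 4xx is not retried
        # Everything else (5xx, timeout, connection error) falls through to a retry.
    return "dead_lettered"
```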

Signature verification fails on receiver

Check that: (1) the secret in latchgate.toml matches the secret your receiver uses, (2) verification uses the raw request body bytes — not a re-serialized or decoded version, (3) the timestamp tolerance is at least 300 seconds to account for clock skew.

See Webhooks — Signing and verification for code examples.
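
Independent of the exact scheme, the three checks above have a common shape in receiver code. The sketch below assumes HMAC-SHA256 over `<timestamp>.<raw body>` with a hex-encoded signature. That layout is an assumption made for illustration, and the authoritative format and header names are those in the Webhooks documentation.

```python
import hashlib
import hmac

TOLERANCE_SECONDS = 300  # from the text: allow at least 300 s of clock skew


def verify_webhook(secret: bytes, raw_body: bytes, timestamp: int,
                   signature_hex: str, now: int) -> bool:
    # Reject stale or future-dated deliveries first.
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    # Assumed signed string: "<timestamp>." followed by the raw body bytes.
    signed = str(timestamp).encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected, signature_hex)
```

Passing a re-serialized body (for example, JSON parsed and dumped again) changes the bytes and therefore the digest, which is the most common cause of verification failures.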

WASM provider errors (502)

`action_execution_failed`

The WASM provider returned an error or hit a resource limit.

Common causes:

Fuel exhaustion — the provider exceeded resource_limits.fuel. Increase the fuel limit or optimize the provider logic.
Memory limit — the provider allocated more linear memory than resource_limits.memory_mb allows.
Timeout — the provider (or a host I/O call it made) exceeded resource_limits.timeout_seconds. Enforced by epoch-based wall-clock deadline.
I/O call budget — the provider made more host I/O calls than resource_limits.max_io_calls allows.
Host I/O failure — the provider’s HTTP call failed (target unreachable, DNS failure, domain not in egress allowlist, or egress proxy rejected).

Debug: Set RUST_LOG=debug to see provider dispatch details, host I/O call traces, and resource consumption.

Secrets issues

`required secret 'X' not found in sops file`

The action manifest declares a secret with required: true but the key is missing from the SOPS-encrypted file.

Fix: Add the key via latchgate secrets set X <value>, or manually: SOPS_AGE_KEY_FILE=/path/to/key sops secrets.enc.yaml, add the key, save.

`sops binary not found`

sops_secrets_file is configured but the sops binary is not on $PATH.

Fix: Install SOPS so the sops binary is on $PATH. The binary name is a compile-time constant. Verify with latchgate doctor.

Secrets rotated but old value still used

Decrypted secrets are cached in memory for 30 seconds (a compile-time constant), keyed by file mtime + inode. SOPS updates the file’s mtime on re-encrypt, which invalidates the cache entry immediately. If you see stale values, verify that you edited the correct file and that SOPS re-encrypted it.
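
The interaction between the 30-second TTL and the mtime + inode key can be sketched as a small cache. This is an illustration of the described behavior, not the gate's code: `SecretsCache` is a hypothetical name and `decrypt` stands in for whatever invokes sops.

```python
import os
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 30  # from the text


class SecretsCache:
    def __init__(self, decrypt: Callable[[str], dict]):
        self._decrypt = decrypt  # hypothetical callable wrapping the sops invocation
        self._entry: Optional[tuple] = None  # (key, loaded_at, secrets)

    def get(self, path: str, now: Optional[float] = None) -> dict:
        now = time.monotonic() if now is None else now
        stat = os.stat(path)
        key = (stat.st_mtime_ns, stat.st_ino)
        if self._entry is not None:
            cached_key, loaded_at, secrets = self._entry
            # Serve from cache only if the file is unchanged and the TTL has not elapsed.
            if cached_key == key and now - loaded_at < CACHE_TTL_SECONDS:
                return secrets
        secrets = self._decrypt(path)
        self._entry = (key, now, secrets)
        return secrets
```

Because the key includes mtime, editing a different copy of the file leaves the cached entry valid until the TTL expires, which matches the stale-value symptom described above.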

SDK issues

`LatchGateNotConnected: call connect() before execute()`

execute() was called without a prior connect() and without agent_id in the constructor (which enables lazy-connect).

Fix: Either pass agent_id to the constructor for lazy-connect, or call connect(agent_id=...) explicitly before execute().

SSL/TLS errors connecting to LatchGate

LatchGate does not terminate TLS; it listens on a UDS (production) or plaintext TCP (dev). If you see TLS errors, ensure the SDK’s base_url uses http:// rather than https://, or connect via the UDS.

Diagnostic commands

latchgate doctor              # pre-flight check: config, Redis, OPA, providers, SOPS, webhooks, egress proxy
latchgate doctor --json       # machine-readable output

latchgate status              # is the gate running? what is it serving?
latchgate status --json

latchgate actions             # list registered actions
latchgate actions http_fetch  # full manifest for a specific action

make providers               # rebuild providers and update manifest digests

latchgate audit --limit 5     # last 5 audit events
latchgate audit --decision deny --limit 10   # recent denials
latchgate audit --action http_fetch          # events for a specific action

# Ledger operations are performed directly on the SQLite database
# Use sqlite3 or the gate's built-in chain verification at startup

RUST_LOG=debug latchgate serve          # verbose logging
RUST_LOG=trace latchgate serve          # maximum verbosity (includes DPoP details)

For the full CLI reference, see CLI Reference. For the security model and fail-closed guarantees, see Security Model.