Runbook

This runbook covers the operational procedures for teams running txfence in production.


Pre-Deployment Checklist

Before deploying an agent to production, verify:

  • validateConfig(policy) returns { valid: true, errors: [], warnings: [] } — see Config Validation
  • All action kinds the agent will use (transfer, swap, contract_call) are covered by Policy.allowedContracts and maxSpendPerTx
  • humanApprovalThreshold is set and a working ApprovalProvider (memory for staging, createWebhookApprovalProvider for production) is wired in
  • humanApprovalTimeoutMs is at least a few minutes — too-short windows surface as approval_timeout rejections in normal operation
  • A CapLockProvider is configured — createMemoryCapLockProvider for single-process, @txfence/redis for multi-process
  • simulationStalenessMs is set (typically 30,000 ms) so stale simulations are rejected
  • temporalRules cover spend velocity, consecutive failures, and approval floods — see Temporal Rules
  • If using Intent execution, intentPolicy.maxTotalGrossSpend and maxDurationMs are set
  • A ReceiptStore is configured with a persistent backend (@txfence/storage-pg or @txfence/storage-sqlite)
  • An AuditLog is configured with createFileAuditLog(path) or @txfence/audit’s file backend; for cryptographic tamper evidence, also wire a ProvenanceChain
  • @txfence/monitor is running with agentAddresses, a CheckpointStore, and onUnrecordedTransaction set
  • The Signer is backed by an HSM/KMS, not a plaintext private key
  • A separate cold wallet holds the treasury; the agent wallet holds only the working balance
  • agent.dryRun(input) has been run against a representative action sample and wouldProceed === true for every legitimate action
  • Formal verification and stress testing (stressTest(policy)) have been run in CI and exit 0
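
As a rough illustration of how the config-validation and dry-run items above might be folded into a single go/no-go gate, the sketch below works on plain result objects. Only the valid/errors/warnings shape and the wouldProceed flag come from this checklist. The DryRunResult.label field and the preDeployBlockers helper are hypothetical, and treating warnings as blockers is an assumption that mirrors the empty warnings array expected above.

```typescript
// Shapes below are assumptions for illustration. Only `valid`, `errors`,
// `warnings`, and `wouldProceed` are named in the checklist.
interface ValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
}

interface DryRunResult {
  label: string; // hypothetical: a human-readable name for the sample action
  wouldProceed: boolean;
}

// Returns every reason deployment should be blocked; an empty array means "go".
function preDeployBlockers(
  validation: ValidationResult,
  dryRuns: DryRunResult[],
): string[] {
  const blockers: string[] = [];
  if (!validation.valid) blockers.push("config invalid");
  for (const e of validation.errors) blockers.push(`config error: ${e}`);
  // Assumption: warnings also block, since the checklist expects warnings: [].
  for (const w of validation.warnings) blockers.push(`config warning: ${w}`);
  if (dryRuns.length === 0) blockers.push("no dry-run sample provided");
  for (const d of dryRuns) {
    if (!d.wouldProceed) blockers.push(`dry run blocked: ${d.label}`);
  }
  return blockers;
}
```

In a deploy script, the results of validateConfig(policy) and agent.dryRun(input) would be mapped into these shapes and the deploy aborted if the returned array is non-empty.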

Monitoring

Key Metrics

Wire a TelemetryProvider and a NotificationProvider. The pipeline emits these spans and events automatically; export them to your observability stack.

| Metric | Source | Alert Threshold | Meaning |
| --- | --- | --- | --- |
| Rate of policy_rejected | txfence.pipeline.status attribute | > 10/min | Agent is repeatedly proposing blocked actions |
| Rate of simulation_failed | txfence.pipeline.status attribute | > 5% of submissions | RPC instability or contract reverts |
| Rate of approval_timeout | txfence.pipeline.status attribute | any | Humans aren’t responding; webhook/escalation is broken |
| cap_warning events | NotificationProvider | any | A cap is at the configured warning threshold |
| monitor_unrecorded (severity: critical) | NotificationProvider | any | Tx on-chain from agent address that txfence did not record |
| monitor_reorg | NotificationProvider | any | A recorded receipt’s tx moved blocks |
| Circuit breaker closed → open | CircuitBreaker.onStateChange | any | RPC failures clustered; agent paused |
| txfence.pipeline span duration (p99) | TelemetryProvider | > 500 ms | Policy + simulation latency above target |
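
The rate-based thresholds in the table can be checked with very little logic once status counts have been aggregated over a window. The sketch below is a minimal illustration, not part of txfence: the WindowCounts shape and the firingAlerts helper are hypothetical, and in practice these rules would more likely live in your alerting system than in application code.

```typescript
// Hypothetical aggregate of pipeline statuses over one time window.
interface WindowCounts {
  windowMinutes: number;
  submissions: number;
  policyRejected: number;
  simulationFailed: number;
  approvalTimeout: number;
}

// Returns the names of the status-based alerts from the table that should fire.
function firingAlerts(c: WindowCounts): string[] {
  const alerts: string[] = [];
  // More than 10 policy rejections per minute.
  if (c.windowMinutes > 0 && c.policyRejected / c.windowMinutes > 10) {
    alerts.push("policy_rejected");
  }
  // More than 5% of submissions failed simulation.
  if (c.submissions > 0 && c.simulationFailed / c.submissions > 0.05) {
    alerts.push("simulation_failed");
  }
  // Any approval timeout at all.
  if (c.approvalTimeout > 0) {
    alerts.push("approval_timeout");
  }
  return alerts;
}
```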

Setting Up Alerts

    import {
      createWebhookNotificationProvider,
      createConsoleNotificationProvider,
      createCompositeNotificationProvider,
    } from "@txfence/core";

    const notificationProvider = createCompositeNotificationProvider(
      createConsoleNotificationProvider({ prefix: "[txfence]" }),
      createWebhookNotificationProvider("https://hooks.pagerduty.com/...", {
        secret: process.env.WEBHOOK_SECRET,
      }),
    );

    // Pass as notificationProvider param to createAgent (slot 13)

Events fired: policy_rejected, execution_success, execution_failed, approval_requested, approval_decision, cap_warning, monitor_unrecorded, monitor_reorg.


Incident Response

Agent Is Repeatedly Rejected

Identify the trigger

Query the audit log for the recent rejections:

    const rejected = await auditLog.query({
      status: "policy_rejected",
      from: Date.now() - 3_600_000, // last hour
    });

    for (const entry of rejected) {
      console.log(entry.outcome.reason, entry.action.kind, entry.action.chain);
    }
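
When the last hour contains many rejections, a per-reason tally usually points at the trigger faster than reading entries one by one. The sketch below assumes entries carry the outcome.reason, action.kind, and action.chain fields used in the query above. The RejectedEntry type and tallyByReason helper are hypothetical and not part of txfence.

```typescript
// Assumed minimal shape, based on the fields logged in the query example.
interface RejectedEntry {
  outcome: { reason: string };
  action: { kind: string; chain: string };
}

// Counts rejections per reason and returns [reason, count] pairs,
// most frequent first.
function tallyByReason(entries: RejectedEntry[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    counts.set(e.outcome.reason, (counts.get(e.outcome.reason) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A single dominant reason suggests a policy gap or an agent bug; a spread across many reasons more often points to a misconfigured deployment.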

Determine if the rejection is correct

Each PolicyRejectionReason maps to a specific check. Ask:

  • chain_not_allowed — should we extend Policy.chains?
  • contract_not_allowed — should we add the contract to allowedContracts, or is the agent talking to the wrong protocol?
  • spend_exceeds_cap / cap_lock_unavailable — is the cap correct, or is the agent misjudging amount?
  • gas_buffer_insufficient — is the chain congested? Increase gasBufferMultiplier.
  • slippage_not_declared — agent submitted a swap with maxSlippage: 0; fix the agent.
  • bytecode_hash_mismatch / owner_address_mismatch — the on-chain contract changed; re-pin the metadata after review.
  • contract_entry_expired — the ContractEntry.expiresAt lapsed; re-attest the contract.
  • temporal_rule_triggered — sliding-window pattern detected; check temporalRules triggers.

Run replay before adjusting policy

    # Test the proposed change against the historical audit log
    txfence replay \
      --audit-log ./audit.jsonl \
      --config ./txfence.config.proposed.ts \
      --only-changed

The CLI exits 1 if any historical action becomes newly rejected under the proposed config. See Replay & Backtesting.


Suspicious Transactions Detected

Drain in-flight work

    const result = await agent.shutdown(30_000);
    console.log(result); // { completed, abandoned, capLocksReleased }

agent.shutdown() refuses new submit() calls immediately and waits up to timeoutMs for in-flight pipelines to complete. See Agent Health & Shutdown.

Verify the audit log integrity

If you’re using @txfence/provenance:

    const result = await chain.verify();
    if (!result.valid) {
      for (const v of result.violations) {
        console.error(`${v.entryId}: ${v.violation} - ${v.details}`);
      }
    }

Or via CLI:

txfence provenance verify --chain ./provenance.jsonl

Exit 0 = chain intact, exit 1 = tampering detected.
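
To make "chain intact" concrete, here is a generic hash-chain check. It is a sketch under stated assumptions and not the @txfence/provenance format: the ChainEntry fields, the all-zero genesis value, and the way fields are joined before hashing are all hypothetical, and the Merkle verification mentioned elsewhere in this runbook is not modeled.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout; the real provenance format may differ.
interface ChainEntry {
  entryId: string;
  payload: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64); // assumed placeholder for "no previous entry"

// SHA-256 over the previous hash plus this entry's content.
function entryHash(prevHash: string, entryId: string, payload: string): string {
  return createHash("sha256")
    .update(`${prevHash}\n${entryId}\n${payload}`)
    .digest("hex");
}

// Returns a new chain with one entry appended and linked to the previous one.
function appendEntry(chain: ChainEntry[], entryId: string, payload: string): ChainEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  const hash = entryHash(prevHash, entryId, payload);
  return [...chain, { entryId, payload, prevHash, hash }];
}

// Walks the chain and reports every broken link or content mismatch.
function verifyChain(chain: ChainEntry[]): string[] {
  const violations: string[] = [];
  let expectedPrev = GENESIS;
  for (const e of chain) {
    if (e.prevHash !== expectedPrev) violations.push(`${e.entryId}: broken link`);
    if (e.hash !== entryHash(e.prevHash, e.entryId, e.payload)) {
      violations.push(`${e.entryId}: hash mismatch`);
    }
    expectedPrev = e.hash;
  }
  return violations;
}
```

Because each hash covers the previous one, editing or removing an earlier entry is detectable without trusting the storage layer.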

Reconcile against the chain

Query the monitor for unrecorded transactions:

    import { createMonitor } from "@txfence/monitor";

    // onUnrecordedTransaction is required; keep its handler logging at minimum
    const monitor = createMonitor({
      chains: ["ethereum"],
      agentAddresses: { ethereum: [agentAddress] },
      rpcUrls,
      receiptStore,
      checkpointStore,
      onUnrecordedTransaction: (event) => {
        if (event.severity === "critical") {
          console.error("CRITICAL: unrecorded tx", event);
        }
      },
    });

    await monitor.start();

A critical event means a tx from the agent’s address landed on-chain without txfence recording it — investigate immediately. Either the signer was used out-of-band, or the agent is running in a configuration that bypasses runPipeline.
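
Conceptually, reconciliation is a set difference: every transaction sent from the agent address on-chain should have a matching recorded receipt. The sketch below illustrates only that idea. The findUnrecorded helper is hypothetical, and how the monitor actually fetches and compares transactions is not described here.

```typescript
// Returns on-chain tx hashes that have no matching recorded receipt.
// Hashes are compared case-insensitively, since hex casing can differ
// between data sources.
function findUnrecorded(
  onChainTxHashes: string[],
  recordedTxHashes: string[],
): string[] {
  const recorded = new Set(recordedTxHashes.map((h) => h.toLowerCase()));
  return onChainTxHashes.filter((h) => !recorded.has(h.toLowerCase()));
}
```

Any hash this returns corresponds to the critical case described above and should be treated as a potential out-of-band use of the signer.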

Rotate the signing key

    # Generate new key via cast (or your KMS)
    cast wallet new

    # Update environment / KMS reference
    export AGENT_PRIVATE_KEY=0xNewKey...

    # Restart agent; pre-restart audit + provenance entries remain hash-verifiable

Policy Updates

Zero-Downtime Policy Change

The Policy is passed per submission: agent.submit({ action, policy }). To roll out a new policy, pass it to subsequent submit() calls. No agent restart is required, and in-flight pipelines complete under the old policy.
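
One way to picture this behavior is a small holder that callers read at submit time. The sketch below is hypothetical throughout: PolicyHolder and submitWith are not txfence APIs, the Policy type is reduced to a single numeric maxSpendPerTx field for brevity, and the returned closure merely stands in for a pipeline that finishes later.

```typescript
// Reduced, assumed policy shape for illustration only.
interface Policy {
  maxSpendPerTx: number;
}

// Holds the current policy; roll() swaps it for future readers only.
class PolicyHolder {
  private current: Policy;
  constructor(initial: Policy) {
    this.current = Object.freeze({ ...initial });
  }
  get(): Policy {
    return this.current;
  }
  roll(next: Policy): void {
    this.current = Object.freeze({ ...next });
  }
}

// Stand-in for a submission: the policy is captured once, at submit time,
// and the returned function represents the pipeline completing later.
function submitWith(holder: PolicyHolder, amount: number): () => boolean {
  const policy = holder.get();
  return () => amount <= policy.maxSpendPerTx;
}
```

A submission started before roll() keeps evaluating against the policy it captured, while a submission started afterwards sees the new one.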

For changes to the providers (cap lock, audit log, telemetry), do a blue-green agent swap:

  1. Start a new createAgent instance with the updated providers
  2. Verify health() on the new instance, then switch traffic to it
  3. Call agent.shutdown(timeoutMs) on the old instance; it refuses new submits and drains in-flight work
  4. Verify health() on both before fully retiring the old instance

Testing a Policy Change

Always run diff, replay, stress test, and property verification before deploying:

    # 1. Diff: synthetic actions, future-looking
    txfence diff \
      --config-a ./current.config.ts \
      --config-b ./proposed.config.ts \
      --generate-actions

    # 2. Replay: historical audit log, retrospective
    txfence replay \
      --audit-log ./audit.jsonl \
      --config ./proposed.config.ts \
      --only-changed

    # 3. Stress test: adversarial scenarios
    txfence stress-test --config ./proposed.config.ts --agents 10 --transactions 20

    # 4. Verify properties hold
    txfence verify rolling-window --config ./proposed.config.ts
    txfence verify absolute-cap --config ./proposed.config.ts

All four exit non-zero on regressions — drop them into CI as required gates.

Pinning Known-Good Policy Versions

Use Policy Versioning to pin a SHA-256 fingerprint in CI:

    EXPECTED="a3f2c1d8e9b7..."
    ACTUAL=$(txfence policy-snapshot --config ./txfence.config.ts --json | jq -r '.id')

    if [ "$ACTUAL" != "$EXPECTED" ]; then
      echo "Policy has changed: review and update the pinned hash"
      exit 1
    fi

The audit log records policyVersionId on every entry, so compliance teams can reconstruct exactly which policy was active for any historical decision.
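
For intuition about why a fingerprint makes a good pin, the sketch below hashes a canonical JSON form of a policy object with SHA-256. How txfence policy-snapshot actually derives its id is not specified here, so the canonicalize and policyFingerprint helpers, and the choice of sorted-key JSON as the canonical form, are assumptions for illustration. The sketch handles JSON-compatible values only.

```typescript
import { createHash } from "node:crypto";

// Serializes a JSON-compatible value with object keys sorted, so that
// logically equal objects always produce the same string.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => canonicalize(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hex-encoded SHA-256 of the canonical form.
function policyFingerprint(policy: unknown): string {
  return createHash("sha256").update(canonicalize(policy)).digest("hex");
}
```

Key order does not affect the result, but any change to a value does, which is the property the CI pin relies on.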


Audit Log Verification

Run the integrity check on a schedule (e.g., daily cron) plus on demand:

    # Plain audit log (no cryptographic chain): only checks parse + ordering
    txfence dry-run --config ./txfence.config.ts <action>

    # Provenance chain: full hash + Merkle verification
    txfence provenance verify --chain ./provenance.jsonl --json

Run this:

  • After any incident
  • Before using audit logs as evidence in a dispute
  • On a recurring schedule (at least weekly) for compliance

Upgrading txfence

txfence follows semver. Minor and patch releases are backward compatible. Major releases may require config changes — check CHANGELOG.md and MIGRATION.md at the repo root.

    # Check current versions
    pnpm list @txfence/core @txfence/evm @txfence/audit @txfence/monitor

    # Upgrade
    pnpm update @txfence/core @txfence/evm @txfence/audit @txfence/monitor

    # Verify nothing broke
    pnpm test
    txfence verify rolling-window --config ./txfence.config.ts
    txfence stress-test --config ./txfence.config.ts

After upgrading, re-run validateConfig, stress-test, and provenance verify before resuming production traffic.
