Runbook

This runbook covers the operational procedures for teams running txfence in production.

Pre-Deployment Checklist

Before deploying an agent to production, verify:

Monitoring

Key Metrics

Wire a TelemetryProvider and a NotificationProvider. The pipeline emits these spans and events automatically; export them to your observability stack.

Metric	Source	Alert Threshold	Meaning
Rate of `policy_rejected`	`txfence.pipeline.status` attribute	> 10/min	Agent is repeatedly proposing blocked actions
Rate of `simulation_failed`	`txfence.pipeline.status` attribute	> 5% of submissions	RPC instability or contract reverts
Rate of `approval_timeout`	`txfence.pipeline.status` attribute	any	Humans aren’t responding; the webhook or escalation path is likely broken
`cap_warning` events	NotificationProvider	any	A cap is at the configured warning threshold
`monitor_unrecorded` (severity: critical)	NotificationProvider	any	Tx on-chain from agent address that txfence did not record
`monitor_reorg`	NotificationProvider	any	A recorded receipt’s tx moved blocks
Circuit breaker `closed → open`	`CircuitBreaker.onStateChange`	any	RPC failures clustered — agent paused
`txfence.pipeline` span duration (p99)	TelemetryProvider	> 500 ms	Policy + simulation latency above target
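
The thresholds in the table can be expressed as a small check that runs over a one-minute snapshot of counters. This is a minimal sketch, not part of txfence: the `WindowStats` shape and the alert names are hypothetical, and it assumes your observability stack can already produce these counts from the `txfence.pipeline.status` attribute and the `txfence.pipeline` span.

```typescript
// Hypothetical one-minute snapshot of counters from your observability stack.
interface WindowStats {
  submissions: number;      // total submissions in the window
  policyRejected: number;   // submissions with status policy_rejected
  simulationFailed: number; // submissions with status simulation_failed
  approvalTimeouts: number; // submissions with status approval_timeout
  pipelineP99Ms: number;    // p99 duration of the txfence.pipeline span, in ms
}

// Returns the (hypothetical) names of alerts whose table thresholds are breached.
function breachedAlerts(s: WindowStats): string[] {
  const alerts: string[] = [];
  // > 10 rejections per minute
  if (s.policyRejected > 10) alerts.push("policy_rejected_rate");
  // > 5% of submissions; skip empty windows to avoid dividing by zero
  if (s.submissions > 0 && s.simulationFailed / s.submissions > 0.05) {
    alerts.push("simulation_failed_rate");
  }
  // any approval timeout at all
  if (s.approvalTimeouts > 0) alerts.push("approval_timeout");
  // p99 above the 500 ms target
  if (s.pipelineP99Ms > 500) alerts.push("pipeline_p99_latency");
  return alerts;
}
```

All comparisons are strict, so a window sitting exactly on a threshold (for example 5 failures out of 100 submissions) does not alert.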

Setting Up Alerts


import {
  createWebhookNotificationProvider,
  createConsoleNotificationProvider,
  createCompositeNotificationProvider,
} from "@txfence/core";
 
const notificationProvider = createCompositeNotificationProvider(
  createConsoleNotificationProvider({ prefix: "[txfence]" }),
  createWebhookNotificationProvider("https://hooks.pagerduty.com/...", {
    secret: process.env.WEBHOOK_SECRET,
  }),
);
 
// Pass as notificationProvider param to createAgent (slot 13)

Events fired: policy_rejected, execution_success, execution_failed, approval_requested, approval_decision, cap_warning, monitor_unrecorded, monitor_reorg.

Incident Response

Agent Is Repeatedly Rejected

Identify the trigger

Query the audit log for the recent rejections:


const rejected = await auditLog.query({
  status: "policy_rejected",
  from:   Date.now() - 3_600_000,   // last hour
});
 
for (const entry of rejected) {
  console.log(entry.outcome.reason, entry.action.kind, entry.action.chain);
}
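
When there are many rejections, a tally by reason usually points at the trigger faster than reading entries one by one. The sketch below assumes only the three fields used in the loop above (`outcome.reason`, `action.kind`, `action.chain`); the `RejectedEntry` type and `tallyByReason` helper are hypothetical and not part of txfence.

```typescript
// Hypothetical minimal shape: only the fields read in the query loop above.
interface RejectedEntry {
  outcome: { reason: string };
  action: { kind: string; chain: string };
}

// Counts rejections per reason and returns them sorted, most frequent first.
function tallyByReason(entries: RejectedEntry[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const reason = entry.outcome.reason;
    counts.set(reason, (counts.get(reason) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

The first element of the result is the dominant rejection reason for the queried window.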

Determine if the rejection is correct

Each PolicyRejectionReason maps to a specific check. Ask:

chain_not_allowed — should we extend Policy.chains?
contract_not_allowed — should we add the contract to allowedContracts, or is the agent talking to the wrong protocol?
spend_exceeds_cap / cap_lock_unavailable — is the cap correct, or is the agent misjudging the amount?
gas_buffer_insufficient — is the chain congested? Increase gasBufferMultiplier.
slippage_not_declared — agent submitted a swap with maxSlippage: 0; fix the agent.
bytecode_hash_mismatch / owner_address_mismatch — the on-chain contract changed; re-pin the metadata after review.
contract_entry_expired — the ContractEntry.expiresAt lapsed; re-attest the contract.
temporal_rule_triggered — sliding-window pattern detected; check temporalRules triggers.

Run replay before adjusting policy


# Test the proposed change against the historical audit log
txfence replay \
  --audit-log ./audit.jsonl \
  --config ./txfence.config.proposed.ts \
  --only-changed

The CLI exits with code 1 if any historical action would be newly rejected under the proposed config. See Replay & Backtesting.

Suspicious Transactions Detected

Drain in-flight work


const result = await agent.shutdown(30_000);
console.log(result);   // { completed, abandoned, capLocksReleased }

agent.shutdown() refuses new submit() calls immediately and waits up to timeoutMs for in-flight pipelines to complete. See Agent Health & Shutdown.

Verify the audit log integrity

If you’re using @txfence/provenance:


// `chain` is your provenance chain instance from @txfence/provenance
const result = await chain.verify();
if (!result.valid) {
  for (const v of result.violations) {
    console.error(`${v.entryId}: ${v.violation} — ${v.details}`);
  }
}

Or via CLI:


txfence provenance verify --chain ./provenance.jsonl

Exit 0 = chain intact, exit 1 = tampering detected.
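
For intuition about what a hash-chain check detects, here is a generic sketch of the idea. It is not the @txfence/provenance format or algorithm: the `ChainEntry` shape, the hashing layout, and the helper names are all assumptions made for illustration, and it omits the Merkle verification entirely.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout: each entry commits to the hash of the previous one.
interface ChainEntry {
  id: string;
  payload: string;
  prevHash: string; // hash of the previous entry, "" for the first entry
  hash: string;     // SHA-256 over prevHash, a newline, and payload
}

function entryHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(payload).digest("hex");
}

// Returns a new chain with one more entry linked to the current tail.
function appendEntry(chain: ChainEntry[], id: string, payload: string): ChainEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : "";
  return [...chain, { id, payload, prevHash, hash: entryHash(prevHash, payload) }];
}

// Walks the chain and reports every entry whose link or content hash is wrong.
function findViolations(chain: ChainEntry[]): string[] {
  const violations: string[] = [];
  let expectedPrev = "";
  for (const entry of chain) {
    if (entry.prevHash !== expectedPrev) {
      violations.push(`${entry.id}: broken link`);
    }
    if (entry.hash !== entryHash(entry.prevHash, entry.payload)) {
      violations.push(`${entry.id}: hash mismatch`);
    }
    expectedPrev = entry.hash;
  }
  return violations;
}
```

Editing a payload after the fact changes the recomputed hash, so the entry is flagged; rewriting the stored hash to match would instead break the link from the next entry.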

Reconcile against the chain

Query the monitor for unrecorded transactions:


import { createMonitor } from "@txfence/monitor";
 
// onUnrecordedTransaction is required — keep its handler logging at minimum
const monitor = createMonitor({
  chains: ["ethereum"],
  agentAddresses: { ethereum: [agentAddress] },
  rpcUrls,
  receiptStore,
  checkpointStore,
  onUnrecordedTransaction: (event) => {
    if (event.severity === "critical") {
      console.error("CRITICAL: unrecorded tx", event);
    }
  },
});
await monitor.start();

A critical event means a tx from the agent’s address landed on-chain without txfence recording it — investigate immediately. Either the signer was used out-of-band, or the agent is running in a configuration that bypasses runPipeline.
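
Conceptually, reconciliation is a set difference: transactions seen on-chain from the agent address, minus transactions txfence recorded. The sketch below illustrates only that idea with plain arrays of transaction hashes; `findUnrecorded` is a hypothetical helper, and how you obtain the two lists (an indexer export, your receipt store) is left open.

```typescript
// Returns on-chain transaction hashes that have no matching recorded receipt.
// Hashes are compared case-insensitively because hex casing varies between tools.
function findUnrecorded(onChainHashes: string[], recordedHashes: string[]): string[] {
  const recorded = new Set(recordedHashes.map((h) => h.toLowerCase()));
  return onChainHashes.filter((h) => !recorded.has(h.toLowerCase()));
}
```

Any non-empty result corresponds to the situation described above and warrants the same investigation.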

Rotate the signing key


# Generate new key via cast (or your KMS)
cast wallet new
 
# Update environment / KMS reference
export AGENT_PRIVATE_KEY=0xNewKey...
 
# Restart agent — pre-restart audit + provenance entries remain hash-verifiable

Policy Updates

Zero-Downtime Policy Change

The Policy is passed per submission: agent.submit({ action, policy }). To roll out a new policy, pass it to subsequent submit() calls. No agent restart is required, and in-flight pipelines complete under the policy they were submitted with.
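
The reason in-flight work is unaffected is that each submission carries its own policy reference. The toy model below illustrates that property only; `Policy`, `Submission`, `PolicyHolder`, and the `maxSpend` field are hypothetical stand-ins and do not reflect txfence's real types or internals.

```typescript
// Hypothetical, heavily simplified stand-ins for illustration.
interface Policy {
  maxSpend: number;
}

interface Submission {
  amount: number;
  policy: Policy; // captured at submit time
}

class PolicyHolder {
  private current: Policy;

  constructor(initial: Policy) {
    this.current = initial;
  }

  // Swap the policy used by future submissions.
  roll(next: Policy): void {
    this.current = next;
  }

  // Each submission keeps a reference to the policy that was current when it was made.
  submit(amount: number): Submission {
    return { amount, policy: this.current };
  }
}

function isAllowed(s: Submission): boolean {
  return s.amount <= s.policy.maxSpend;
}

const holder = new PolicyHolder({ maxSpend: 100 });
const inFlight = holder.submit(80); // evaluated under maxSpend 100
holder.roll({ maxSpend: 50 });      // tighter policy for everything after this point
const later = holder.submit(80);    // evaluated under maxSpend 50
```

Here `inFlight` is still judged against the policy it was submitted with, while `later` is judged against the new one.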

For changes to the providers (cap lock, audit log, telemetry), do a blue-green agent swap:

Start a new createAgent instance with the updated providers
agent.shutdown(timeoutMs) on the old instance — drains in-flight work, refuses new submits
Switch traffic to the new instance
Verify health() on both before fully retiring the old

Testing a Policy Change

Always run both diff and replay before deploying:


# 1. Diff — synthetic actions, future-looking
txfence diff \
  --config-a ./current.config.ts \
  --config-b ./proposed.config.ts \
  --generate-actions
 
# 2. Replay — historical audit log, retrospective
txfence replay \
  --audit-log ./audit.jsonl \
  --config ./proposed.config.ts \
  --only-changed
 
# 3. Stress test — adversarial scenarios
txfence stress-test --config ./proposed.config.ts --agents 10 --transactions 20
 
# 4. Verify properties hold
txfence verify rolling-window --config ./proposed.config.ts
txfence verify absolute-cap   --config ./proposed.config.ts

All four exit non-zero on regressions — drop them into CI as required gates.
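
If your CI is driven from a Node script rather than shell, the gates can be sketched as a list of commands whose exit codes are collected. The commands and flags below are copied from the steps above; the `Gate` and `Runner` types, `runGates`, and the idea of an injectable runner are assumptions of this sketch, not part of the txfence CLI.

```typescript
import { spawnSync } from "node:child_process";

// A runner executes a command and returns its exit code.
type Runner = (command: string, args: string[]) => number;

// Default runner: spawn the process and inherit stdio so CI logs show its output.
const spawnRunner: Runner = (command, args) => {
  const result = spawnSync(command, args, { stdio: "inherit" });
  // A null status means the process was killed or never started; treat as failure.
  return result.status ?? 1;
};

interface Gate {
  name: string;
  command: string;
  args: string[];
}

// Gates mirroring the four steps listed above.
const gates: Gate[] = [
  { name: "diff", command: "txfence", args: ["diff", "--config-a", "./current.config.ts", "--config-b", "./proposed.config.ts", "--generate-actions"] },
  { name: "replay", command: "txfence", args: ["replay", "--audit-log", "./audit.jsonl", "--config", "./proposed.config.ts", "--only-changed"] },
  { name: "stress-test", command: "txfence", args: ["stress-test", "--config", "./proposed.config.ts", "--agents", "10", "--transactions", "20"] },
  { name: "verify rolling-window", command: "txfence", args: ["verify", "rolling-window", "--config", "./proposed.config.ts"] },
  { name: "verify absolute-cap", command: "txfence", args: ["verify", "absolute-cap", "--config", "./proposed.config.ts"] },
];

// Runs every gate (no early exit, so one CI run reports all failures)
// and returns the names of the gates that exited non-zero.
function runGates(list: Gate[], run: Runner = spawnRunner): string[] {
  const failed: string[] = [];
  for (const gate of list) {
    if (run(gate.command, gate.args) !== 0) failed.push(gate.name);
  }
  return failed;
}
```

A CI entry point would call `runGates(gates)` and exit non-zero when the returned list is not empty.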

Pinning Known-Good Policy Versions

Use Policy Versioning to pin a SHA-256 fingerprint in CI:


EXPECTED="a3f2c1d8e9b7..."
ACTUAL=$(txfence policy-snapshot --config ./txfence.config.ts --json | jq -r '.id')
if [ "$ACTUAL" != "$EXPECTED" ]; then
  echo "Policy has changed — review and update the pinned hash"
  exit 1
fi

The audit log records policyVersionId on every entry, so compliance teams can reconstruct exactly which policy was active for any historical decision.
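
To see why a fingerprint is a reliable change detector, consider hashing a canonical serialization of a policy object. This is a generic illustration, not the algorithm behind `txfence policy-snapshot`: the `canonicalize` and `fingerprint` helpers and the sorted-key JSON encoding are assumptions of this sketch, and it only handles JSON-compatible values.

```typescript
import { createHash } from "node:crypto";

// Serializes a JSON-compatible value with object keys sorted,
// so that key order does not affect the output.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    const fields = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical form, as a 64-character hex string.
function fingerprint(policy: unknown): string {
  return createHash("sha256").update(canonicalize(policy)).digest("hex");
}
```

Two policies that differ only in key order produce the same fingerprint, while any change to a value produces a different one.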

Audit Log Verification

Run the integrity check on a schedule (e.g., a cron job) and on demand:


# Plain audit log (no cryptographic chain) — only checks parse + ordering
txfence dry-run --config ./txfence.config.ts <action>
 
# Provenance chain — full hash + Merkle verification
txfence provenance verify --chain ./provenance.jsonl --json

Run this:

After any incident
Before using audit logs as evidence in a dispute
As a weekly scheduled job for compliance

Upgrading txfence

txfence follows semver. Minor and patch releases are backward compatible. Major releases may require config changes — check CHANGELOG.md and MIGRATION.md at the repo root.


# Check current versions
pnpm list @txfence/core @txfence/evm @txfence/audit @txfence/monitor
 
# Upgrade
pnpm update @txfence/core @txfence/evm @txfence/audit @txfence/monitor
 
# Verify nothing broke
pnpm test
txfence verify rolling-window --config ./txfence.config.ts
txfence stress-test --config ./txfence.config.ts

After upgrading, re-run validateConfig, stress-test, and provenance verify before resuming production traffic.
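
Since only major releases may require config changes, it can help to classify a version jump before upgrading. The helper below is a hypothetical sketch for plain `x.y.z` versions; it does not handle pre-release or build suffixes and is not part of txfence.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Parses a plain x.y.z version; throws on anything else (including pre-release tags).
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (match === null) {
    throw new Error(`not a plain x.y.z version: ${version}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Reports the most significant component that differs between two versions.
// A difference in either direction counts, so downgrades are classified the same way.
function classifyUpgrade(from: string, to: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = parseVersion(from);
  const [toMajor, toMinor, toPatch] = parseVersion(to);
  if (fromMajor !== toMajor) return "major";
  if (fromMinor !== toMinor) return "minor";
  if (fromPatch !== toPatch) return "patch";
  return "none";
}
```

A result of `"major"` is the cue to read CHANGELOG.md and MIGRATION.md before proceeding.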