Runbook
This runbook covers the operational procedures for teams running txfence in production.
Pre-Deployment Checklist
Before deploying an agent to production, verify:
- validateConfig(policy) returns { valid: true, errors: [], warnings: [] } (see Config Validation)
- All action kinds the agent will use (transfer, swap, contract_call) are covered by Policy.allowedContracts and maxSpendPerTx
- humanApprovalThreshold is set and a working ApprovalProvider (memory for staging, createWebhookApprovalProvider for production) is wired in
- humanApprovalTimeoutMs is at least a few minutes; too-short windows surface as approval_timeout rejections in normal operation
- A CapLockProvider is configured: createMemoryCapLockProvider for single-process, @txfence/redis for multi-process
- simulationStalenessMs is set (typically 30,000 ms) so stale simulations are rejected
- temporalRules cover spend velocity, consecutive failures, and approval floods (see Temporal Rules)
- If using Intent execution, intentPolicy.maxTotalGrossSpend and maxDurationMs are set
- A ReceiptStore is configured with a persistent backend (@txfence/storage-pg or @txfence/storage-sqlite)
- An AuditLog is configured with createFileAuditLog(path) or @txfence/audit’s file backend; for cryptographic tamper evidence, also wire a ProvenanceChain
- @txfence/monitor is running with agentAddresses, a CheckpointStore, and onUnrecordedTransaction set
- The Signer is backed by an HSM/KMS, not a plaintext private key
- A separate cold wallet holds the treasury; the agent wallet holds only the working balance
- agent.dryRun(input) has been run against a representative action sample and wouldProceed === true for every legitimate action
- Formal verification and stress testing (stressTest(policy)) have been run in CI and exit 0
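Two of the checklist items (the validateConfig result and the dryRun sample) lend themselves to an automated gate. The sketch below is one way such a gate might look. It does not call txfence; ValidationResult and DryRunResult are hypothetical shapes that mirror only the fields named in the checklist, the string type for errors and warnings is an assumption, and deployBlockers is an illustrative helper rather than part of the library.

```typescript
// Hypothetical shapes: only the fields the checklist mentions.
// The element type of errors/warnings is an assumption.
interface ValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
}

interface DryRunResult {
  wouldProceed: boolean;
}

// Returns a list of human-readable reasons not to deploy; empty means the
// two checklist items covered here are satisfied.
function deployBlockers(
  validation: ValidationResult,
  dryRuns: DryRunResult[],
): string[] {
  const blockers: string[] = [];
  if (!validation.valid || validation.errors.length > 0) {
    blockers.push(`config invalid (${validation.errors.length} error(s))`);
  }
  // The checklist expects warnings to be empty, so warnings block as well.
  if (validation.warnings.length > 0) {
    blockers.push(`config has ${validation.warnings.length} warning(s)`);
  }
  if (dryRuns.length === 0) {
    blockers.push("no dry-run sample provided");
  }
  const blocked = dryRuns.filter((r) => !r.wouldProceed).length;
  if (blocked > 0) {
    blockers.push(
      `${blocked} of ${dryRuns.length} sample action(s) would not proceed`,
    );
  }
  return blockers;
}

const clean = deployBlockers(
  { valid: true, errors: [], warnings: [] },
  [{ wouldProceed: true }],
);
const dirty = deployBlockers(
  { valid: true, errors: [], warnings: ["example warning"] },
  [{ wouldProceed: true }, { wouldProceed: false }],
);
```

In CI, a non-empty result would fail the job; the remaining checklist items (signer custody, wallet separation, monitoring) still need manual review.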
Monitoring
Key Metrics
Wire a TelemetryProvider and a NotificationProvider. The pipeline emits these spans and events automatically; export them to your observability stack.
| Metric | Source | Alert Threshold | Meaning |
|---|---|---|---|
| Rate of policy_rejected | txfence.pipeline.status attribute | > 10/min | Agent is repeatedly proposing blocked actions |
| Rate of simulation_failed | same | > 5% of submissions | RPC instability or contract reverts |
| Rate of approval_timeout | same | any | Humans aren’t responding; webhook or escalation is broken |
| cap_warning events | NotificationProvider | any | A cap is at the configured warning threshold |
| monitor_unrecorded (severity: critical) | NotificationProvider | any | Tx on-chain from agent address that txfence did not record |
| monitor_reorg | NotificationProvider | any | A recorded receipt’s tx moved blocks |
| Circuit breaker closed → open | CircuitBreaker.onStateChange | any | RPC failures clustered; agent paused |
| txfence.pipeline span duration (p99) | TelemetryProvider | > 500 ms | Policy + simulation latency above target |
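The first three rows of the table are rate thresholds, which are easy to get subtly wrong (per-minute versus per-window, ratio versus count). As a rough illustration, the sketch below evaluates them over one observation window. WindowCounts and firingAlerts are hypothetical names, not txfence APIs; in practice this logic would more likely live in your observability stack's alert rules.

```typescript
// Hypothetical shape: pipeline outcome counts observed in one window.
interface WindowCounts {
  windowMinutes: number;
  submissions: number;
  policyRejected: number;
  simulationFailed: number;
  approvalTimeout: number;
}

// Thresholds copied from the Key Metrics table.
const POLICY_REJECTED_PER_MIN = 10;
const SIMULATION_FAILED_RATIO = 0.05;

// Returns the names of the status-based alerts that should fire.
function firingAlerts(c: WindowCounts): string[] {
  const alerts: string[] = [];
  if (
    c.windowMinutes > 0 &&
    c.policyRejected / c.windowMinutes > POLICY_REJECTED_PER_MIN
  ) {
    alerts.push("policy_rejected");
  }
  if (
    c.submissions > 0 &&
    c.simulationFailed / c.submissions > SIMULATION_FAILED_RATIO
  ) {
    alerts.push("simulation_failed");
  }
  // "any" threshold: a single timeout is enough.
  if (c.approvalTimeout > 0) {
    alerts.push("approval_timeout");
  }
  return alerts;
}

const quiet = firingAlerts({
  windowMinutes: 5,
  submissions: 200,
  policyRejected: 12, // 2.4 per minute
  simulationFailed: 4, // 2% of submissions
  approvalTimeout: 0,
});
const noisy = firingAlerts({
  windowMinutes: 5,
  submissions: 200,
  policyRejected: 60, // 12 per minute
  simulationFailed: 11, // 5.5% of submissions
  approvalTimeout: 1,
});
```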
Setting Up Alerts
import {
  createWebhookNotificationProvider,
  createConsoleNotificationProvider,
  createCompositeNotificationProvider,
} from "@txfence/core";

const notificationProvider = createCompositeNotificationProvider(
  createConsoleNotificationProvider({ prefix: "[txfence]" }),
  createWebhookNotificationProvider("https://hooks.pagerduty.com/...", {
    secret: process.env.WEBHOOK_SECRET,
  }),
);
// Pass as notificationProvider param to createAgent (slot 13)

Events fired: policy_rejected, execution_success, execution_failed, approval_requested, approval_decision, cap_warning, monitor_unrecorded, monitor_reorg.
Incident Response
Agent Is Repeatedly Rejected
Identify the trigger
Query the audit log for the recent rejections:
const rejected = await auditLog.query({
  status: "policy_rejected",
  from: Date.now() - 3_600_000, // last hour
});
for (const entry of rejected) {
  console.log(entry.outcome.reason, entry.action.kind, entry.action.chain);
}

Determine if the rejection is correct
Each PolicyRejectionReason maps to a specific check. Ask:
- chain_not_allowed: should we extend Policy.chains?
- contract_not_allowed: should we add the contract to allowedContracts, or is the agent talking to the wrong protocol?
- spend_exceeds_cap / cap_lock_unavailable: is the cap correct, or is the agent misjudging the amount?
- gas_buffer_insufficient: is the chain congested? Increase gasBufferMultiplier.
- slippage_not_declared: the agent submitted a swap with maxSlippage: 0; fix the agent.
- bytecode_hash_mismatch / owner_address_mismatch: the on-chain contract changed; re-pin the metadata after review.
- contract_entry_expired: the ContractEntry.expiresAt lapsed; re-attest the contract.
- temporal_rule_triggered: a sliding-window pattern was detected; check the temporalRules triggers.
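When the last hour contains many rejections, it helps to see which reason dominates before working through the questions above. The sketch below tallies entries by reason. RejectedEntry is a hypothetical minimal shape containing only the fields the earlier query example reads, and countByReason is an illustrative helper, not a txfence function.

```typescript
// Hypothetical minimal shape of a rejected audit entry.
interface RejectedEntry {
  outcome: { reason: string };
  action: { kind: string; chain: string };
}

// Tally rejections per reason, most frequent first.
// Ties are broken alphabetically so the output order is stable.
function countByReason(entries: RejectedEntry[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    counts.set(e.outcome.reason, (counts.get(e.outcome.reason) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}

// Made-up sample data for illustration only.
const sample: RejectedEntry[] = [
  { outcome: { reason: "contract_not_allowed" }, action: { kind: "swap", chain: "ethereum" } },
  { outcome: { reason: "spend_exceeds_cap" }, action: { kind: "transfer", chain: "ethereum" } },
  { outcome: { reason: "contract_not_allowed" }, action: { kind: "contract_call", chain: "ethereum" } },
  { outcome: { reason: "chain_not_allowed" }, action: { kind: "transfer", chain: "ethereum" } },
];

const ranked = countByReason(sample);
```

A single dominant reason usually points at one misconfiguration or one agent bug; a broad spread suggests the policy and the agent's behavior have drifted apart more generally.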
Run replay before adjusting policy
# Test the proposed change against the historical audit log
txfence replay \
  --audit-log ./audit.jsonl \
  --config ./txfence.config.proposed.ts \
  --only-changed

The CLI exits 1 if any historical action becomes newly rejected. See Replay & Backtesting.
Suspicious Transactions Detected
Drain in-flight work
const result = await agent.shutdown(30_000);
console.log(result); // { completed, abandoned, capLocksReleased }

agent.shutdown() refuses new submit() calls immediately and waits up to timeoutMs for in-flight pipelines to complete. See Agent Health & Shutdown.
Verify the audit log integrity
If you’re using @txfence/provenance:
const result = await chain.verify();
if (!result.valid) {
  for (const v of result.violations) {
    console.error(`${v.entryId}: ${v.violation} - ${v.details}`);
  }
}

Or via CLI:
txfence provenance verify --chain ./provenance.jsonl

Exit 0 = chain intact, exit 1 = tampering detected.
Reconcile against the chain
Query the monitor for unrecorded transactions:
import { createMonitor } from "@txfence/monitor";

// onUnrecordedTransaction is required; keep its handler logging at minimum
const monitor = createMonitor({
  chains: ["ethereum"],
  agentAddresses: { ethereum: [agentAddress] },
  rpcUrls,
  receiptStore,
  checkpointStore,
  onUnrecordedTransaction: (event) => {
    if (event.severity === "critical") {
      console.error("CRITICAL: unrecorded tx", event);
    }
  },
});
await monitor.start();

A critical event means a tx from the agent’s address landed on-chain without txfence recording it. Investigate immediately: either the signer was used out-of-band, or the agent is running in a configuration that bypasses runPipeline.
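Conceptually, an unrecorded transaction is whatever remains after subtracting the recorded receipts from the transactions seen on-chain for the agent address. The sketch below shows that set difference in isolation. It is an illustration only, not how @txfence/monitor is implemented; findUnrecorded and its inputs are hypothetical, and the hashes are shortened placeholders.

```typescript
// Returns on-chain tx hashes that have no matching recorded receipt.
// Hex hashes are compared case-insensitively.
function findUnrecorded(onChain: string[], recorded: string[]): string[] {
  const known = new Set(recorded.map((h) => h.toLowerCase()));
  return onChain.filter((h) => !known.has(h.toLowerCase()));
}

// Placeholder hashes for illustration only.
const unrecorded = findUnrecorded(
  ["0xAA", "0xbb", "0xcc"], // seen on-chain from the agent address
  ["0xaa", "0xCC"], // present in the receipt store
);
```

Any hash left in the result is a candidate for the out-of-band signer or pipeline-bypass investigation described above.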
Rotate the signing key
# Generate new key via cast (or your KMS)
cast wallet new
# Update environment / KMS reference
export AGENT_PRIVATE_KEY=0xNewKey...
# Restart agent; pre-restart audit + provenance entries remain hash-verifiable

Policy Updates
Zero-Downtime Policy Change
The Policy is passed per submission: agent.submit({ action, policy }). To roll out a new policy, pass it to subsequent submit() calls. No agent restart is required, and in-flight pipelines complete under the policy they were submitted with.
For changes to the providers (cap lock, audit log, telemetry), do a blue-green agent swap:
- Start a new createAgent instance with the updated providers
- Call agent.shutdown(timeoutMs) on the old instance; it drains in-flight work and refuses new submits
- Switch traffic to the new instance
- Verify health() on both before fully retiring the old instance
Testing a Policy Change
Always run diff and replay before deploying, and follow them with the stress test and property checks:
# 1. Diff: synthetic actions, future-looking
txfence diff \
  --config-a ./current.config.ts \
  --config-b ./proposed.config.ts \
  --generate-actions
# 2. Replay: historical audit log, retrospective
txfence replay \
  --audit-log ./audit.jsonl \
  --config ./proposed.config.ts \
  --only-changed
# 3. Stress test: adversarial scenarios
txfence stress-test --config ./proposed.config.ts --agents 10 --transactions 20
# 4. Verify properties hold
txfence verify rolling-window --config ./proposed.config.ts
txfence verify absolute-cap --config ./proposed.config.ts

All four steps exit non-zero on regressions, so add them to CI as required gates.
Pinning Known-Good Policy Versions
Use Policy Versioning to pin a SHA-256 fingerprint in CI:
EXPECTED="a3f2c1d8e9b7..."
ACTUAL=$(txfence policy-snapshot --config ./txfence.config.ts --json | jq -r '.id')
if [ "$ACTUAL" != "$EXPECTED" ]; then
  echo "Policy has changed; review and update the pinned hash"
  exit 1
fi

The audit log records policyVersionId on every entry, so compliance teams can reconstruct exactly which policy was active for any historical decision.
Audit Log Verification
Run the integrity check on a schedule (e.g., daily cron) plus on demand:
# Plain audit log (no cryptographic chain): only checks parse + ordering
txfence dry-run --config ./txfence.config.ts <action>
# Provenance chain: full hash + Merkle verification
txfence provenance verify --chain ./provenance.jsonl --json

Run this:
- After any incident
- Before using audit logs as evidence in a dispute
- As a weekly scheduled job for compliance
Upgrading txfence
txfence follows semver. Minor and patch releases are backward compatible. Major releases may require config changes — check CHANGELOG.md and MIGRATION.md at the repo root.
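Whether an upgrade needs a CHANGELOG.md and MIGRATION.md review can be decided mechanically from the version numbers. The sketch below is a minimal check under the semver rules stated above; needsMigrationReview is a hypothetical helper, and the version strings in the example are made up rather than real txfence releases. It also treats minor bumps within 0.x as review-worthy, since the semver specification makes no compatibility promise for 0.y.z versions.

```typescript
// Parse the major and minor components of a semver string such as "1.4.2".
function parseVersion(version: string): { major: number; minor: number } {
  const match = /^v?(\d+)\.(\d+)\.\d+/.exec(version.trim());
  if (!match) {
    throw new Error(`not a semver version: ${version}`);
  }
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// True when moving from `current` to `target` may require config changes.
function needsMigrationReview(current: string, target: string): boolean {
  const c = parseVersion(current);
  const t = parseVersion(target);
  if (t.major > c.major) {
    return true;
  }
  // Semver gives no stability guarantee for 0.y.z, so treat a 0.x minor
  // bump as potentially breaking.
  return c.major === 0 && t.major === 0 && t.minor > c.minor;
}

// Made-up version numbers for illustration only.
const minorBump = needsMigrationReview("1.4.2", "1.5.0");
const majorBump = needsMigrationReview("1.9.3", "2.0.0");
const zeroMinorBump = needsMigrationReview("0.3.1", "0.4.0");
```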
# Check current versions
pnpm list @txfence/core @txfence/evm @txfence/audit @txfence/monitor
# Upgrade
pnpm update @txfence/core @txfence/evm @txfence/audit @txfence/monitor
# Verify nothing broke
pnpm test
txfence verify rolling-window --config ./txfence.config.ts
txfence stress-test --config ./txfence.config.ts

After upgrading, re-run validateConfig, stress-test, and provenance verify before resuming production traffic.