Kill Switches¶
The kill switch is the most important safety mechanism in agent monitoring. It provides instant circuit-breaking at three levels.
Why Kill Switches?¶
Standard monitoring alerts a human who then decides what to do. For AI agents, that's too slow. A runaway agent can:
- Spend $1,000 in API costs in minutes
- Send hundreds of unauthorized emails
- Exfiltrate sensitive data before anyone notices
Kill switches stop the agent immediately. No human in the loop for the emergency stop -- humans handle the investigation after the agent is safe.
Three Levels¶
Agent Kill¶
Block all events for a specific agent.
Session Kill¶
Block all events for a specific session.
Global Kill¶
Block all events for all agents. Emergency use only.
Manual Kill/Revive¶
Kill¶
# Kill a specific agent
monitor.kill_agent("sales-agent", reason="Investigation in progress")
# Kill a session
monitor.kill_session("sess-abc-123")
# Kill everything
monitor.kill_global(reason="Emergency shutdown")
Check Status¶
monitor.is_killed("sales-agent") # True/False
monitor.is_killed("sales-agent", session_id="sess-123") # Also checks session
Revive¶
# Revive a specific agent
monitor.revive(agent="sales-agent")
# Revive a specific session
monitor.revive(session_id="sess-abc-123")
# Deactivate global kill
monitor.revive_global()
CLI¶
# Kill an agent
agent-monitor -c monitor.yaml kill sales-agent --reason "Cost spike"
# Kill a session
agent-monitor -c monitor.yaml kill sess-abc-123 --session --reason "Suspicious"
# Global kill
agent-monitor -c monitor.yaml kill ALL --global-kill --reason "Emergency"
# Revive an agent
agent-monitor -c monitor.yaml revive sales-agent
# Revive a session
agent-monitor -c monitor.yaml revive sess-abc-123 --session
# Deactivate global kill
agent-monitor -c monitor.yaml revive ALL --global-revive
Auto-Kill Policies¶
Auto-kill policies trigger automatically when a metric exceeds a threshold. No human intervention needed.
kill_switch:
enabled: true
policies:
- name: auto-kill-on-high-cost
metric: cost_per_minute
operator: ">"
threshold: 5.0
action: kill_agent
severity: critical
- name: emergency-shutdown
metric: event_count
operator: ">"
threshold: 10000
action: kill_global
severity: critical
Policy Fields¶
| Field | Type | Description |
|---|---|---|
name |
string | Unique policy identifier |
metric |
string | Which metric to evaluate |
operator |
string | Comparison: >, <, >=, <=, == |
threshold |
float | Value that triggers the kill |
action |
string | kill_agent, kill_session, or kill_global |
severity |
string | Alert severity for the kill event |
message |
string | Optional custom message |
How Policies Are Evaluated¶
After every metric snapshot:
- For each policy, check if
metric <operator> thresholdis true - If yes, execute the action (kill agent/session/global)
- Dispatch a kill alert to all configured channels
Example: Cost Guard¶
kill_switch:
enabled: true
policies:
- name: cost-guard
metric: cost_per_minute
operator: ">"
threshold: 1.0
action: kill_agent
severity: critical
If any agent's cost exceeds $1.00/minute, that agent is automatically killed. Other agents continue operating normally.
Example: Global Emergency¶
kill_switch:
enabled: true
policies:
- name: flood-protection
metric: event_count
operator: ">"
threshold: 50000
action: kill_global
severity: critical
If the total event count across any single agent exceeds 50,000, all agents are killed. This prevents runaway loops.
Persistence¶
Kill state can survive restarts:
When configured, the kill switch saves its state to disk after every kill/revive operation (via save()). On startup, call load() to restore saved state.
This means:
- An auto-killed agent stays killed after a restart
- A manually killed agent stays killed after a restart
- Only an explicit
revive(agent=...)orrevive_global()restores the agent
How Events Are Rejected¶
When monitor.record() is called for a killed agent:
- The kill switch is checked first, before any other processing
- If the agent is killed, the event is not stored, metrics are not updated
record()returnsNone(it always returnsNone)
This is the fastest possible circuit breaker -- no metrics computation, no baseline updates, no anomaly detection. Just a set lookup (O(1)).
Kill State Inspection¶
state = monitor.kill_switch_engine.get_state()
print(state.killed_agents) # {"sales-agent"}
print(state.killed_sessions) # {"sess-abc-123"}
print(state.global_kill) # False
print(state.reasons) # {"agent:sales-agent": "Cost spike detected"}
Best Practices¶
-
Always set auto-kill on cost. LLM API costs can spiral. A $5/min threshold is a reasonable starting point.
-
Use agent-level kills, not global. A misbehaving agent should not take down the entire system.
-
Enable persistence. Without persistence, a restart clears all kill states -- the agent you just killed comes back alive.
-
Log kill reasons. The
reasonparameter is stored and exported in compliance reports. Future-you will thank past-you. -
Test revive regularly. Make sure your operators know how to revive agents. A killed agent that nobody knows how to revive is a production incident.