Configuration Reference
A monitor configuration is a single YAML file that defines agent tracking, metrics, baselines, anomaly rules, kill switch policies, alert channels, and storage settings.
This page is the complete reference.
File Structure
Every config file has these top-level sections:
version: "1.0" # Required. Always "1.0" for now.
metadata: # Optional. Name, description, author.
variables: # Optional. Shared key-value pairs.
agents: # Optional. Per-agent tracking config.
storage: # Optional. Event storage settings.
metrics: # Optional. Metrics engine settings.
baselines: # Optional. Baseline tracking settings.
anomaly_detection: # Optional. Anomaly detection rules.
kill_switch: # Optional. Kill switch policies.
alerts: # Optional. Alert channel configuration.
Only version is strictly required. All other sections use sensible defaults.
Version
The version field must be "1.0" in the current release. The engine rejects any other value.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| version | string | Yes | Config format version. Must be "1.0". |
Metadata
metadata:
  name: my-monitor
  description: Production agent monitoring
  author: ops-team
Optional metadata for documentation and identification.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metadata.name | string | "" | Monitor name. |
| metadata.description | string | "" | Human-readable description. |
| metadata.author | string | "" | Config author or team. |
Variables
variables:
  alert_webhook: "https://hooks.slack.com/services/xxx"
  cost_threshold: 5.0
Key-value pairs available for reference. Variables support ${ENV_VAR} interpolation.
Agents
agents:
  sales-agent:
    enabled: true
    event_types:
      - action
      - denial
      - error
    tags:
      - production
      - sales
  finance-agent:
    enabled: true
Per-agent tracking configuration. If no agents are configured, all agents are tracked by default.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| agents.<name>.enabled | bool | true | Whether to track this agent. |
| agents.<name>.event_types | list | [] (all) | Only track these event types. Empty = track all. |
| agents.<name>.tags | list | [] | Tags for this agent. |
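As a rough illustration of how these fields could combine, the sketch below decides whether a single event should be tracked. The `AgentConfig` class, the `should_track` function, and the `track_unlisted` parameter are hypothetical names introduced here, not part of the tool. The reference does not say what happens to agents that are missing from a non-empty `agents` section, so the sketch exposes that choice as a parameter rather than guessing.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Hypothetical in-memory form of one agents.<name> entry."""
    enabled: bool = True
    event_types: list[str] = field(default_factory=list)  # empty means "track all"
    tags: list[str] = field(default_factory=list)


def should_track(agents: dict[str, AgentConfig], agent_name: str,
                 event_type: str, track_unlisted: bool = False) -> bool:
    # No agents configured at all: every agent is tracked by default.
    if not agents:
        return True
    cfg = agents.get(agent_name)
    if cfg is None:
        # Assumption: behaviour for unlisted agents is not documented,
        # so the caller chooses it explicitly.
        return track_unlisted
    if not cfg.enabled:
        return False
    # An empty event_types list places no restriction on event types.
    return not cfg.event_types or event_type in cfg.event_types
```

With the example config above, `sales-agent` would be tracked for `action`, `denial`, and `error` events only, while `finance-agent` would be tracked for every event type.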
Valid Event Types
| Value | Description |
| --- | --- |
| action | An agent action (LLM call, tool call, etc.) |
| guardrail_trigger | A guardrail evaluation (non-denial) |
| denial | A guardrail denial |
| approval_request | An action requiring human approval |
| approval_response | A human approval response |
| cost | A cost record |
| error | An error or exception |
| session_start | Session start |
| session_end | Session end |
Storage
storage:
  path: .agent_monitor/events.jsonl
  retention_days: 90
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.path | string | .agent_monitor/events.jsonl | Path to the JSONL event store. |
| storage.retention_days | int | 90 | Days to retain events. Must be >= 1. |
Metrics
metrics:
  default_window_seconds: 300
  max_window_seconds: 3600
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metrics.default_window_seconds | int | 300 | Default rolling window size in seconds. |
| metrics.max_window_seconds | int | 3600 | Maximum allowed window size. |
Valid Metrics
These are the metrics computed by the engine:
| Metric | Description | Source |
| --- | --- | --- |
| event_count | Total events in window | -- |
| action_count | Count of action events | event_type |
| denial_count | Count of denial events | event_type |
| denial_rate | Fraction of action+denial events that are denials | event_type |
| approval_count | Count of approval events | event_type |
| approval_rate | Fraction of events that are approvals | event_type |
| error_count | Count of error events | event_type |
| cost_total | Sum of cost_usd in window | cost_usd |
| cost_per_minute | cost_total / (window_seconds / 60) | cost_usd |
| avg_latency_ms | Mean latency for events with latency | latency_ms |
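To make the definitions in the table concrete, here is a minimal sketch that computes most of these metrics over one window of events. It assumes each event is a plain dict carrying the `event_type`, `cost_usd`, and `latency_ms` fields named in the Source column; the `compute_metrics` function is hypothetical, and returning 0.0 when a denominator is empty is an assumption rather than documented behaviour. The approval metrics are left out because the table does not say which approval event types they count.

```python
def compute_metrics(events: list[dict], window_seconds: int = 300) -> dict[str, float]:
    """Compute window metrics from events that already fall inside the window."""
    actions = sum(1 for e in events if e.get("event_type") == "action")
    denials = sum(1 for e in events if e.get("event_type") == "denial")
    errors = sum(1 for e in events if e.get("event_type") == "error")

    # cost_total sums cost_usd over every event that carries a cost.
    cost_total = sum(e.get("cost_usd") or 0.0 for e in events)

    # avg_latency_ms only considers events that report a latency.
    latencies = [e["latency_ms"] for e in events if e.get("latency_ms") is not None]

    gated = actions + denials  # denominator for denial_rate
    return {
        "event_count": len(events),
        "action_count": actions,
        "denial_count": denials,
        "denial_rate": denials / gated if gated else 0.0,  # assumption: 0.0 when empty
        "error_count": errors,
        "cost_total": cost_total,
        "cost_per_minute": cost_total / (window_seconds / 60),
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
    }
```

For a 300-second window, a `cost_total` of 1.0 gives a `cost_per_minute` of 1.0 / 5 = 0.2.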
Baselines
baselines:
  enabled: true
  min_samples: 30
  metrics:
    - denial_rate
    - error_count
    - cost_per_minute
    - avg_latency_ms
  storage_path: .agent_monitor/baselines.json
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| baselines.enabled | bool | true | Enable baseline tracking. |
| baselines.min_samples | int | 30 | Minimum data points before z-scores are computed. |
| baselines.metrics | list | ["denial_rate", "error_count", "cost_per_minute", "avg_latency_ms"] | Which metrics to track baselines for. |
| baselines.storage_path | string | .agent_monitor/baselines.json | Path to persist baseline state. |
Anomaly Detection
anomaly_detection:
  enabled: true
  rules:
    - name: high-denial-rate
      metric: denial_rate
      z_threshold: 3.0
      severity: high
      cooldown_seconds: 300
Top-Level Fields
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| anomaly_detection.enabled | bool | true | Enable anomaly detection. |
| anomaly_detection.rules | list | [] | Anomaly detection rules. |
Rule Fields
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | -- | Unique rule identifier. |
| metric | string | Yes | -- | Which metric to monitor. |
| z_threshold | float | No | 3.0 | Z-score threshold for triggering. |
| severity | string | No | "high" | Alert severity: critical, high, medium, low. |
| cooldown_seconds | int | No | 300 | Minimum seconds between repeated alerts. |
| condition | string | No | "" | Optional condition expression. |
How Rules Are Evaluated
- For each agent's metric snapshot, iterate all anomaly rules
- Compute z-score for the rule's metric against the baseline
- If z-score exceeds z_threshold and cooldown has elapsed, trigger an alert
- Record the alert time for cooldown tracking
Kill Switch
kill_switch:
  enabled: true
  state_path: .agent_monitor/kill_state.json
  policies:
    - name: auto-kill-on-high-cost
      metric: cost_per_minute
      operator: ">"
      threshold: 5.0
      action: kill_agent
      severity: critical
Top-Level Fields
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| kill_switch.enabled | bool | true | Enable kill switch system. |
| kill_switch.state_path | string | .agent_monitor/kill_state.json | Path to persist kill state. |
| kill_switch.policies | list | [] | Auto-kill policies. |
Policy Fields
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | -- | Unique policy identifier. |
| metric | string | Yes | -- | Which metric to evaluate. |
| operator | string | Yes | -- | Comparison: >, <, >=, <=, ==. |
| threshold | float | Yes | -- | Metric value that triggers the kill. |
| action | string | No | "kill_agent" | What to do: kill_agent, kill_session, kill_global. |
| severity | string | No | "critical" | Alert severity for the kill event. |
| message | string | No | "" | Custom message for the kill alert. |
Valid Kill Actions
| Value | What it does |
| --- | --- |
| kill_agent | Kill the specific agent whose metric exceeded the threshold |
| kill_session | Kill the specific session (requires session_id on events) |
| kill_global | Kill all agents globally |
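A policy is a plain comparison of one metric against a threshold. The sketch below shows one way such policies might be matched against a metric snapshot. The `triggered_policies` function and the dict-based policy shape are hypothetical, and skipping policies whose metric is absent from the snapshot is an assumption.

```python
import operator

# The five comparison operators listed in the policy fields table.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def triggered_policies(policies: list[dict], metrics: dict[str, float]) -> list[tuple[str, str]]:
    """Return (policy name, action) for every policy whose condition holds."""
    hits = []
    for policy in policies:
        compare = OPERATORS[policy["operator"]]
        value = metrics.get(policy["metric"])
        # Assumption: a policy is skipped when its metric is not in the snapshot.
        if value is not None and compare(value, policy["threshold"]):
            hits.append((policy["name"], policy.get("action", "kill_agent")))
    return hits
```

With the example policy above and a snapshot where `cost_per_minute` is 6.2, the result would contain `("auto-kill-on-high-cost", "kill_agent")`.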
Alerts
alerts:
  channels:
    - type: console
    - type: file
      path: .agent_monitor/alerts.jsonl
      min_severity: medium
    - type: webhook
      url: "${ALERT_WEBHOOK_URL}"
      headers:
        Authorization: "Bearer ${WEBHOOK_TOKEN}"
      min_severity: high
Channel Types
| Type | Required Fields | Description |
| --- | --- | --- |
| console | -- | Print alerts to stderr |
| file | path | Append JSONL to a file |
| webhook | url | HTTP POST to a webhook endpoint |
Channel Fields
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | (required) | Channel type: console, file, webhook. |
| enabled | bool | true | Enable this channel. |
| path | string | "" | File path (for file channels). |
| url | string | "" | Webhook endpoint URL (for webhook channels). |
| min_severity | string | "low" | Minimum severity to dispatch. |
| headers | dict | {} | HTTP headers (for webhook channels). Supports ${ENV_VAR} interpolation. |
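The effect of `min_severity` and `enabled` can be pictured as a filter over the channel list. The following is a minimal sketch, assuming severities are ordered low < medium < high < critical; the `channels_for` function and the dict-based channel shape are hypothetical.

```python
# Assumed ordering of the four documented severities, lowest first.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def channels_for(alert_severity: str, channels: list[dict]) -> list[dict]:
    """Return the channels that should receive an alert of the given severity."""
    rank = SEVERITY_RANK[alert_severity]
    return [
        channel
        for channel in channels
        if channel.get("enabled", True)  # enabled defaults to true
        and rank >= SEVERITY_RANK[channel.get("min_severity", "low")]  # default "low"
    ]
```

Under this reading of the example config, a medium alert would reach the console and file channels, and only high or critical alerts would also reach the webhook.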
Environment Variable Interpolation
All string values in the config support ${ENV_VAR} and ${ENV_VAR:default} interpolation:
kill_switch:
  state_path: "${MONITOR_DATA_DIR:.agent_monitor}/kill_state.json"
alerts:
  channels:
    - type: webhook
      url: "${ALERT_WEBHOOK_URL}"
      headers:
        Authorization: "Bearer ${WEBHOOK_TOKEN}"
If the environment variable is not set and no default is provided, the placeholder is left as-is (no error, no expansion).
Complete Enterprise Example
version: "1.0"
metadata:
  name: acme-monitor
  description: Production monitoring for ACME AI agents
  author: platform-team
agents:
  sales-agent:
    enabled: true
    event_types: [action, denial, error, cost]
    tags: [production, sales]
  finance-agent:
    enabled: true
    tags: [production, finance]
storage:
  path: /var/lib/agent-monitor/events.jsonl
  retention_days: 365
metrics:
  default_window_seconds: 300
  max_window_seconds: 3600
baselines:
  enabled: true
  min_samples: 30
  metrics:
    - denial_rate
    - error_count
    - cost_per_minute
    - avg_latency_ms
  storage_path: /var/lib/agent-monitor/baselines.json
anomaly_detection:
  enabled: true
  rules:
    - name: high-denial-rate
      metric: denial_rate
      z_threshold: 3.0
      severity: high
      cooldown_seconds: 300
    - name: cost-spike
      metric: cost_per_minute
      z_threshold: 2.5
      severity: critical
      cooldown_seconds: 600
    - name: latency-anomaly
      metric: avg_latency_ms
      z_threshold: 3.0
      severity: medium
      cooldown_seconds: 120
kill_switch:
  enabled: true
  state_path: /var/lib/agent-monitor/kill_state.json
  policies:
    - name: auto-kill-on-high-cost
      metric: cost_per_minute
      operator: ">"
      threshold: 5.0
      action: kill_agent
      severity: critical
    - name: emergency-shutdown
      metric: event_count
      operator: ">"
      threshold: 10000
      action: kill_global
      severity: critical
alerts:
  channels:
    - type: console
    - type: file
      path: /var/log/agent-monitor/alerts.jsonl
      min_severity: medium
    - type: webhook
      url: https://hooks.slack.com/services/xxx
      headers:
        Content-Type: "application/json"
      min_severity: high
Validation
Always validate your config before deploying:
agent-monitor -c monitor.yaml validate
The validator checks:
- Version is supported ("1.0")
- All event types in agent configs are valid
- Storage retention_days >= 1
- Metrics window sizes are valid
- Baselines min_samples >= 1
- Anomaly rule names are unique, reference valid severities
- Kill policy operators, actions, and severities are valid
- Alert channels have required fields and valid types/severities
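For readers who want to run similar checks in their own tooling, a few of the items above might look like the following. This is a partial, hypothetical sketch operating on an already-parsed config dict; it is not the validator that `agent-monitor` ships, and the `validate_config` name and message wording are invented for illustration. Event type, window size, and alert channel checks are omitted.

```python
VALID_SEVERITIES = {"critical", "high", "medium", "low"}
VALID_OPERATORS = {">", "<", ">=", "<=", "=="}
VALID_ACTIONS = {"kill_agent", "kill_session", "kill_global"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if config.get("version") != "1.0":
        problems.append('version must be "1.0"')
    if config.get("storage", {}).get("retention_days", 90) < 1:
        problems.append("storage.retention_days must be >= 1")
    if config.get("baselines", {}).get("min_samples", 30) < 1:
        problems.append("baselines.min_samples must be >= 1")

    rules = config.get("anomaly_detection", {}).get("rules", [])
    names = [rule.get("name") for rule in rules]
    if len(names) != len(set(names)):
        problems.append("anomaly rule names must be unique")
    for rule in rules:
        if rule.get("severity", "high") not in VALID_SEVERITIES:
            problems.append(f"rule {rule.get('name')}: invalid severity")

    for policy in config.get("kill_switch", {}).get("policies", []):
        label = f"policy {policy.get('name')}"
        if policy.get("operator") not in VALID_OPERATORS:
            problems.append(f"{label}: invalid operator")
        if policy.get("action", "kill_agent") not in VALID_ACTIONS:
            problems.append(f"{label}: invalid action")
        if policy.get("severity", "critical") not in VALID_SEVERITIES:
            problems.append(f"{label}: invalid severity")
    return problems
```

Collecting every problem in a list, rather than stopping at the first one, lets a single run report everything that needs fixing.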