Budget Management¶
The budget system controls how much context is assembled for each query. It estimates token counts, scores relevance, ranks chunks, and trims to fit within the configured limit. The goal: maximize the signal-to-noise ratio of the context your agent receives.
How It Works¶
After the engine fetches chunks from all matched and permitted sources, the budget pipeline processes them:
```
All fetched chunks
        |
        v
1. Token estimation (how big is each chunk?)
        |
        v
2. Relevance scoring (how relevant is each chunk to the query?)
        |
        v
3. Ranking (sort by relevance, recency, or insertion order)
        |
        v
4. Budget trimming (keep chunks until budget is full)
        |
        v
Final chunks (fits within max_tokens - reserve_tokens)
```
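As a rough sketch of the whole pipeline (illustrative names and dict-based chunks, not the library's actual API):

```python
import math

def keyword_score(query: str, text: str) -> float:
    # Simplified scorer: fraction of query words present in the text.
    # (The real scorer also strips stopwords and punctuation; see below.)
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

def assemble(chunks: list[dict], query: str,
             max_tokens: int = 8000, reserve_tokens: int = 500) -> list[dict]:
    budget = max_tokens - reserve_tokens
    for c in chunks:
        c["token_count"] = math.ceil(len(c["content"]) / 4)        # 1. estimation (chars_div4)
        c["relevance_score"] = keyword_score(query, c["content"])  # 2. scoring
    ranked = sorted(chunks, key=lambda c: c["relevance_score"],    # 3. ranking (relevance)
                    reverse=True)
    kept, used = [], 0
    for c in ranked:                                               # 4. trimming ("drop" strategy)
        if used + c["token_count"] <= budget:
            kept.append(c)
            used += c["token_count"]
    return kept
```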
Config¶
```yaml
budget:
  max_tokens: 8000
  ranking: relevance
  truncation: drop
  estimator: chars_div4
  reserve_tokens: 500
```
| Field | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | `8000` | Maximum total tokens for the assembled context. Must be >= 1. |
| `ranking` | string | `"relevance"` | Ranking strategy: `"relevance"`, `"recency"`, `"manual"`, or `"embedding"`. |
| `embedding` | object | `null` | Embedding config (required when `ranking` is `"embedding"`). See below. |
| `truncation` | string | `"drop"` | Truncation strategy: `"drop"`, `"truncate_end"`, or `"truncate_middle"`. |
| `estimator` | string | `"chars_div4"` | Token estimation method: `"chars_div4"`, `"words"`, or `"whitespace"`. |
| `reserve_tokens` | int | `0` | Tokens to reserve (subtracted from the budget). Must be >= 0. |
Token Estimation¶
The engine estimates token counts without calling a tokenizer model. This keeps it fast and dependency-free.
Estimator Methods¶
| Method | Formula | Speed | Accuracy |
|---|---|---|---|
| `chars_div4` | `ceil(len(text) / 4)` | Fastest | Good for GPT-like tokenizers (~4 chars per token). Default. |
| `words` | `len(text.split())` | Fast | Rougher. Overestimates for short words, underestimates for long words. |
| `whitespace` | `len(text.split())` | Fast | Same as `words`. Whitespace-based splitting. |
All methods return at least 1 for non-empty text and 0 for empty text.
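A minimal sketch of the estimators, assuming the formulas in the table above (the function name is illustrative):

```python
import math

def estimate_tokens(text: str, method: str = "chars_div4") -> int:
    if not text:
        return 0  # empty text estimates to 0
    if method == "chars_div4":
        return math.ceil(len(text) / 4)
    # "words" and "whitespace" both split on whitespace
    return max(1, len(text.split()))

estimate_tokens("Employees may work remotely up to three days per week.")
# chars_div4: ceil(54 / 4) = 14; words/whitespace: 10
```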
Choosing an Estimator¶
- `chars_div4` is the best default. It approximates GPT-3/4 tokenization (which averages ~4 characters per token for English text) and is the fastest option.
- `words` and `whitespace` are useful if your content has many short tokens (code, URLs) or very long tokens (technical terms), where word count is a better proxy than character count.
The estimator is used both for setting `token_count` on chunks and for budget trimming calculations. Consistency matters more than absolute accuracy -- the engine uses the same estimator everywhere.
Relevance Scoring¶
Each chunk is scored for relevance to the query using keyword overlap. This is a lightweight, zero-dependency method that works well for routing scenarios.
How Scoring Works¶
1. Tokenize both the query text and the chunk content (lowercase, strip punctuation)
2. Remove stopwords (common English words: "the", "is", "and", "for", etc.)
3. Include title keywords -- the chunk's `title` field contributes additional keywords
4. Calculate overlap -- the score is the fraction of query keywords found in the chunk
The score ranges from 0.0 (no keyword overlap) to 1.0 (all query keywords found in the chunk).
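A minimal sketch of the scoring logic, assuming a simplified stopword list (the real list is longer; see below):

```python
import re

STOPWORDS = {"a", "an", "the", "and", "or", "is", "it", "that", "this",
             "what", "for", "of", "to", "in", "on", "was", "are"}  # partial

def keywords(text: str) -> set[str]:
    # Lowercase, strip punctuation, drop stopwords and single characters
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 1 and t not in STOPWORDS}

def relevance(query: str, content: str, title: str = "") -> float:
    q = keywords(query)
    if not q:
        return 0.0
    # Title keywords contribute alongside content keywords
    return len(q & (keywords(content) | keywords(title))) / len(q)

relevance("What is the remote work policy?",
          "## Remote Work Policy\nEmployees may work...")  # -> 1.0
```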
Example¶
Query: "What is the remote work policy?"
After stopword removal, query keywords: `{"remote", "work", "policy"}`

| Chunk | Keywords | Overlap | Score |
|---|---|---|---|
| "## Remote Work Policy\nEmployees may work..." | `{"remote", "work", "policy", "employees"}` | 3/3 | 1.0 |
| "## PTO Policy\nAll employees receive..." | `{"pto", "policy", "employees", "receive"}` | 1/3 | 0.33 |
| "## Office Hours\nThe office is open..." | `{"office", "hours", "open"}` | 0/3 | 0.0 |
Stopwords¶
The following common English words are excluded from scoring (partial list): a, an, the, and, or, but, in, on, at, to, for, of, with, by, from, is, it, that, this, was, are, be, have, has, had, do, does, did, will, would, could, should, may, might, can, not, so, if, then, than, about, what, which, who, when, where, how, ...
Single-character tokens are also excluded.
Limitations¶
- Keyword-based only -- no semantic understanding
- English stopwords only -- may not work optimally for other languages
- No term frequency weighting (TF-IDF) -- a keyword matching once scores the same as matching many times
- Case-insensitive -- "API" and "api" are treated the same
For most context routing scenarios (documentation retrieval, knowledge base search), keyword overlap provides sufficient signal. For semantic search, use `ranking: embedding` (described below) or an `http_api` source backed by a vector search service.
Ranking Strategies¶
After scoring, chunks are sorted by the configured ranking strategy. The ranking order determines which chunks are kept when the budget is tight.
relevance (Default)¶
Sorts by `relevance_score` descending. The most relevant chunks appear first.
Best for: search-like queries where the user wants the most relevant information.
recency¶
Sorts by `metadata.mtime` descending (most recently modified first). Uses the modification timestamp set by file-based sources (`directory`, `git_repo`).

Best for: "what changed recently" queries, changelog reviews, update notifications.

Chunks without `metadata.mtime` (e.g., inline or API sources) sort to the bottom (`mtime` defaults to 0).
manual¶
Preserves insertion order. Chunks appear in the order their sources were matched and fetched.
Best for: scenarios where source priority controls ordering, or when you want deterministic ordering based on route/source declaration order.
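The three non-embedding strategies amount to different sort keys. A sketch, using the dict-based chunks from the pipeline sketch above:

```python
def rank(chunks: list[dict], strategy: str = "relevance") -> list[dict]:
    if strategy == "relevance":
        return sorted(chunks, key=lambda c: c["relevance_score"], reverse=True)
    if strategy == "recency":
        # Missing mtime defaults to 0, so inline/API chunks sink to the bottom
        return sorted(chunks,
                      key=lambda c: c.get("metadata", {}).get("mtime", 0),
                      reverse=True)
    return list(chunks)  # "manual": preserve insertion order
```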
embedding¶
Uses OpenAI embeddings (or any compatible API) for semantic similarity scoring. Computes cosine similarity between the query embedding and each chunk embedding. Chunks closest in meaning to the query rank highest.
```yaml
budget:
  ranking: embedding
  embedding:
    model: text-embedding-3-small
    api_key_env: OPENAI_API_KEY
    url: https://api.openai.com/v1/embeddings   # optional, default is OpenAI
    cache_dir: .context_router_embeddings       # optional, caches embeddings on disk
```
Install the optional dependency:
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | `"text-embedding-3-small"` | Embedding model name |
| `api_key_env` | string | `"OPENAI_API_KEY"` | Environment variable holding the API key |
| `url` | string | `"https://api.openai.com/v1/embeddings"` | Embedding API endpoint |
| `cache_dir` | string | `".context_router_embeddings"` | Directory for caching document embeddings |
How it works:
- First query: embeds all fetched chunks and caches their embeddings on disk
- Subsequent queries: only the query is embedded (1 API call), chunk embeddings are loaded from cache
- Cosine similarity scores each chunk against the query
- Chunks are sorted by similarity (highest first), then budget trimming applies normally
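The similarity step itself is plain cosine similarity. A minimal sketch of the math only (API calls and the disk cache are omitted):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# query_vec: from one embeddings API call per query
# c["embedding"]: loaded from the on-disk cache
# chunks.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
```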
Performance characteristics:
| | Keyword (relevance) | Embedding |
|---|---|---|
| Latency | ~0.6 ms/query | ~200 ms/query (API call) |
| Cost | $0 | ~$0.0002/query |
| P@1 (benchmark) | 85% | 95% |
| Determinism | 100% | Near-deterministic (model is stable) |
| Setup | None | API key + optional dependency |
Best for: semantic queries where users don't use the exact terminology in the documents ("How do I take time off?" matching the PTO policy). The +10% P@1 improvement is most pronounced for ambiguous, natural-language queries.
**Start with keyword, upgrade to embedding.** Use `ranking: relevance` (the default) to get started. If you see queries that miss relevant documents because of vocabulary mismatch, switch to `ranking: embedding`. You can always switch back -- the config change is one line.
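For example (assuming the defaults listed in the table above cover the remaining fields):

```yaml
budget:
  ranking: embedding   # was: relevance
  embedding: {}        # defaults: text-embedding-3-small, OPENAI_API_KEY
```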
Truncation Strategies¶
When the budget is being filled and a chunk exceeds the remaining space, the truncation strategy determines what happens.
drop (Default)¶
Skip the chunk entirely. Move on to the next one.
Pros: Safe. No partial content. Every chunk in the response is complete.
Cons: May waste remaining budget if the next chunk is also too large.
truncate_end¶
Cut the chunk from the end to fit the remaining budget. Appends `[...]` to indicate truncation.

Pros: Fills the budget more completely. The beginning of a document (usually the most informative part) is preserved.

Cons: Chunks may end mid-sentence. The `[...]` marker takes a few tokens.
How it works:
- For `chars_div4`: truncates to `remaining_tokens * 4` characters
- For `words`/`whitespace`: binary searches for the right number of words that fit
truncate_middle¶
Keep the beginning and end of the chunk, cut the middle. Inserts `[...truncated...]` in the middle.
Pros: Preserves both the introduction (context) and conclusion (summary/key points) of a document.
Cons: Middle section is lost, which may contain critical details.
How it works: Splits the available budget in half. The first half is taken from the beginning, the second half from the end.
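A minimal sketch of both strategies for the `chars_div4` case, where the remaining token budget maps directly to a character count (marker overhead is ignored for brevity):

```python
def truncate_end(text: str, remaining_tokens: int) -> str:
    # Keep the beginning, cut the end, mark the cut
    return text[: remaining_tokens * 4] + " [...]"

def truncate_middle(text: str, remaining_tokens: int) -> str:
    # Split the budget in half: first half from the start, second from the end
    half = (remaining_tokens * 4) // 2
    return text[:half] + " [...truncated...] " + text[-half:]
```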
Reserve Tokens¶
The `reserve_tokens` setting subtracts from the total budget before any chunks are added. This is useful when you inject additional content (like a system prompt) separately from the context router.

If `reserve_tokens >= max_tokens`, the effective budget is 0 and no chunks are returned.
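With the config shown at the top of this page (`max_tokens: 8000`, `reserve_tokens: 500`), the effective chunk budget is 8000 - 500 = 7500 tokens.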
Tuning Guide¶
How to Choose max_tokens¶
The right budget depends on your LLM's context window and how much you need for the actual conversation:
| LLM Context Window | Suggested `max_tokens` | Reserve for conversation |
|---|---|---|
| 4K (GPT-3.5) | 2000-3000 | 1000-2000 |
| 8K | 4000-6000 | 2000-4000 |
| 16K | 8000-12000 | 4000-8000 |
| 128K+ (Claude, GPT-4 Turbo) | 16000-64000 | Plenty of room |
Start conservative and increase. Too much context dilutes relevance; too little misses information.
When to Use Each Ranking¶
| Scenario | Ranking | Why |
|---|---|---|
| General Q&A | `relevance` | Users want the most relevant answer |
| "What's new?" | `recency` | Most recent changes matter most |
| Structured prompts | `manual` | Control exact ordering via source priority |
| Debug/investigation | `relevance` | Find the most relevant error docs |
When to Use Each Truncation¶
| Scenario | Truncation | Why |
|---|---|---|
| Strict correctness | `drop` | Never show partial information |
| Maximum context | `truncate_end` | Fill the budget completely |
| Long documents | `truncate_middle` | Preserve intro and conclusion |
| Code snippets | `drop` | Partial code is worse than no code |
| Policy documents | `truncate_end` | The beginning usually has the key rules |
Estimator Accuracy¶
The `chars_div4` estimator is usually within 10-20% of actual GPT tokenization for English text. For precise token counting, you can:
- Use a smaller `max_tokens` as a safety margin
- Count tokens after assembly using `tiktoken` or your LLM's tokenizer
- Re-trim if necessary
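A sketch of the post-assembly check with `tiktoken` (the function name is illustrative; `cl100k_base` is the GPT-3.5/4 encoding):

```python
import tiktoken

def exact_token_count(text: str, encoding: str = "cl100k_base") -> int:
    # Exact count from a real tokenizer, unlike the built-in estimators
    return len(tiktoken.get_encoding(encoding).encode(text))

# After assembly: if exact_token_count(assembled_context) > max_tokens,
# drop the lowest-ranked chunk and re-assemble, or lower max_tokens.
```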
Optimizing Retrieval Quality¶
1. **Use markdown H2 splitting.** Structure your docs with clear H2 headings. The engine splits on these, giving the relevance scorer smaller, more focused chunks to rank.
2. **Set source priorities.** Use the `priority` field on sources to control which content appears first with `manual` ranking.
3. **Use multiple routes.** Instead of one route pointing to everything, create specific routes for specific query patterns. This naturally limits which sources are consulted.
4. **Tune the budget.** Start with `max_tokens: 4000` and increase until response quality plateaus. More context is not always better -- it can dilute the signal.
5. **Enable caching for API sources.** If you use `http_api` sources, enable caching to avoid redundant API calls and reduce latency.
Inspecting Budget Behavior¶
CLI Query Output¶
The JSON output includes:
- `total_tokens` -- how many tokens were used
- `was_truncated` -- whether any chunks were dropped or truncated
- each chunk's `token_count` and `relevance_score`
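The shape looks roughly like this (values and the `source`/`title` contents are illustrative; the exact output may include more fields):

```json
{
  "total_tokens": 412,
  "was_truncated": false,
  "chunks": [
    {
      "source": "hr_docs",
      "title": "Remote Work Policy",
      "relevance_score": 1.0,
      "token_count": 412
    }
  ]
}
```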
Python API¶
```python
# `router` is a previously constructed router instance; `Query` is the
# query type exposed by the library's Python API.
response = router.query(Query(text="remote work policy"))

print(f"Total tokens: {response.total_tokens}")
print(f"Was truncated: {response.was_truncated}")
print(f"Chunks: {len(response.chunks)}")

for chunk in response.chunks:
    print(f"  [{chunk.source}] {chunk.title}: "
          f"score={chunk.relevance_score:.2f}, tokens={chunk.token_count}")
```