Source Types¶
Sources are where context lives. The context router fetches data from sources, scores it for relevance, and assembles it into a response. This page covers every built-in source type and how to write your own.
Overview¶
| Source | What it does | Key fields | Best for |
|---|---|---|---|
| inline | Returns static text from config | content | System prompts, instructions, boilerplate |
| directory | Reads files from a local directory | path, patterns | Documentation, knowledge bases |
| git_repo | Reads files from a git ref | path, ref, patterns | Versioned docs, branch-specific context |
| http_api | Queries a REST API | url, method, body_template | Search APIs, knowledge services |
All sources implement the Source base class and its async fetch() method. Sources are fetched in parallel -- if a query matches routes pointing to three sources, all three are fetched concurrently.
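The concurrent fetch described above can be sketched with asyncio.gather. This is an illustration only: StubSource, its delay parameter, and fetch_all are hypothetical stand-ins, and the real Source.fetch() takes a Query and a SourceConfig rather than a plain string.

```python
import asyncio


class StubSource:
    """Hypothetical stand-in for a Source subclass (not the router's class)."""

    def __init__(self, name: str, delay: float) -> None:
        self.name = name
        self.delay = delay

    async def fetch(self, query: str) -> list[str]:
        # Simulate I/O latency, then return a single chunk of text.
        await asyncio.sleep(self.delay)
        return [f"{self.name}: context for {query!r}"]


async def fetch_all(sources: list[StubSource], query: str) -> list[str]:
    # gather() starts every fetch at once and returns results in input order.
    results = await asyncio.gather(*(s.fetch(query) for s in sources))
    # Flatten the per-source lists into one list of chunks.
    return [chunk for chunks in results for chunk in chunks]


sources = [
    StubSource("inline", 0.01),
    StubSource("directory", 0.02),
    StubSource("http_api", 0.03),
]
chunks = asyncio.run(fetch_all(sources, "install"))
```

Because the three fetches overlap, the total wait is roughly the slowest source's latency rather than the sum of all three.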
inline -- Static Content¶
The simplest source type. Returns the content field as a single context chunk. No I/O, no latency.
Config¶
sources:
  system_prompt:
    type: inline
    content: |
      You are an AI assistant for ACME Corp.
      Be helpful, accurate, and concise.
      Always cite the source of your information.
    priority: 10
    tags: [core]
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| content | string | Yes | "" | The text content. Supports YAML multiline strings (\| for literal, > for folded). |
Behavior¶
- Returns exactly one chunk containing the full content string
- The chunk's source and title are set to the source name
- Empty content returns zero chunks
- Token count is estimated at creation time
When to Use¶
- System prompts that every agent receives
- Instructions, guidelines, or boilerplate
- Static configuration context (company info, rules)
- Any content that doesn't change between queries
Example: Multiple Inline Sources with Priority¶
sources:
  system_prompt:
    type: inline
    content: "You are a helpful assistant."
    priority: 10
  safety_notice:
    type: inline
    content: "Never share personal data. Always verify before acting."
    priority: 9
  response_format:
    type: inline
    content: "Format all responses in markdown with headers and bullet points."
    priority: 8
With ranking: manual, these will appear in priority order (highest first).
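As a rough illustration of "highest first", the ordering amounts to a descending sort on priority. The InlineSource dataclass below is a hypothetical stand-in for the parsed config; how the router breaks ties between equal priorities is not covered here.

```python
from dataclasses import dataclass


@dataclass
class InlineSource:
    """Hypothetical stand-in for a parsed inline source entry."""

    name: str
    content: str
    priority: int


# Deliberately listed out of order to show that priority decides placement.
sources = [
    InlineSource("response_format", "Format all responses in markdown with headers and bullet points.", 8),
    InlineSource("system_prompt", "You are a helpful assistant.", 10),
    InlineSource("safety_notice", "Never share personal data. Always verify before acting.", 9),
]

# Highest priority first.
ordered = sorted(sources, key=lambda s: s.priority, reverse=True)
names = [s.name for s in ordered]
```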
directory -- Local Files¶
Reads files from a local directory with glob pattern matching, recursive traversal, and automatic markdown splitting.
Config¶
sources:
  knowledge_base:
    type: directory
    path: "./data/docs"
    patterns: ["**/*.md", "**/*.txt"]
    exclude_patterns: ["**/drafts/**", "**/archive/**"]
    recursive: true
    encoding: utf-8
    max_file_size: 1000000
    tags: [documentation]
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| path | string | Yes | "" | Directory path. Relative paths resolve from the working directory. |
| patterns | list | No | ["**/*"] | Glob patterns for files to include. Matched against relative paths. |
| exclude_patterns | list | No | [] | Glob patterns for files to exclude. |
| recursive | bool | No | true | Traverse subdirectories. |
| encoding | string | No | "utf-8" | File encoding. Undecodable files are skipped. |
| max_file_size | int | No | 1000000 | Max file size in bytes. Larger files are skipped. |
Behavior¶
- The directory is scanned (recursively if enabled)
- Each file is matched against patterns (include), then exclude_patterns (exclude)
- Files exceeding max_file_size are skipped
- Each file is read with the configured encoding
- Markdown files (.md, .markdown) are split by H2 headings -- each ## Section becomes a separate chunk
- Other files produce one chunk per file
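The include-then-exclude step can be sketched as follows. This uses fnmatchcase from the standard library purely for illustration; the router's actual glob semantics are not documented here and may differ (with fnmatch, `*` also matches `/`, and `**/*.md` does not match a file at the top level). The is_included helper is hypothetical.

```python
from fnmatch import fnmatchcase


def is_included(rel_path: str, patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Apply include patterns first, then exclude patterns, to a relative path."""
    # A file must match at least one include pattern.
    if not any(fnmatchcase(rel_path, p) for p in patterns):
        return False
    # A file matching any exclude pattern is dropped.
    return not any(fnmatchcase(rel_path, p) for p in exclude_patterns)


patterns = ["**/*.md", "**/*.txt"]
exclude_patterns = ["**/drafts/**", "**/archive/**"]

kept = is_included("guides/setup.md", patterns, exclude_patterns)
draft = is_included("guides/drafts/wip.md", patterns, exclude_patterns)
image = is_included("guides/logo.png", patterns, exclude_patterns)
```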
Markdown H2 Splitting¶
Markdown files are split at every ## heading. Each section becomes its own chunk:
# My Document
Introduction text here.
## Installation
Install with pip install my-package.
## Configuration
Edit the config file at ~/.config/my-package.yaml.
This produces three chunks:
- Title: filename, content: "# My Document\n\nIntroduction text here."
- Title: "Installation", content: "## Installation\n\nInstall with pip install my-package."
- Title: "Configuration", content: "## Configuration\n\nEdit the config file..."
This gives the relevance scorer fine-grained sections to rank, so the engine can return just the "Installation" section instead of the entire document.
If a markdown file has no H2 headings, it is returned as a single chunk (the filename becomes the title).
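A minimal sketch of H2 splitting, assuming sections are delimited by lines starting with `## ` and that surrounding whitespace is trimmed from each chunk. The function name split_markdown_h2 and the (title, content) tuple shape are illustrative, not the router's internal API, and this sketch does not special-case `## ` lines inside fenced code.

```python
import re


def split_markdown_h2(text: str, filename: str) -> list[tuple[str, str]]:
    """Return (title, content) pairs, one per H2 section."""
    # Split just before every line that starts with "## " (H2 only, not "###").
    parts = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for part in parts:
        body = part.strip()
        if not body:
            continue
        first_line = body.splitlines()[0]
        # Sections take their heading text as title; the preamble uses the filename.
        title = first_line[3:].strip() if first_line.startswith("## ") else filename
        chunks.append((title, body))
    return chunks


doc = (
    "# My Document\n\nIntroduction text here.\n\n"
    "## Installation\n\nInstall with pip install my-package.\n\n"
    "## Configuration\n\nEdit the config file at ~/.config/my-package.yaml.\n"
)
sections = split_markdown_h2(doc, "my-doc.md")
no_h2 = split_markdown_h2("Just a paragraph.\n", "notes.md")
```

A file without H2 headings falls through as a single chunk titled with the filename, matching the behavior described above.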
Chunk Metadata¶
Each chunk includes:
| Key | Type | Description |
|---|---|---|
| mtime | float | File modification timestamp (Unix epoch). Enables recency ranking. |
The chunk's path field is set to the relative path within the directory (e.g., "guides/setup.md"), which enables deny_paths filtering.
Security¶
- Path traversal protection: every file path is resolved via Path.resolve() and verified to be inside the base directory. Symlink escapes are blocked.
- File size limit: max_file_size (default 1 MB) prevents reading very large files.
- Note: the path config field can point to any directory the process has read access to. Restrict process permissions accordingly.
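The containment check in the first bullet can be sketched like this. The is_inside helper is hypothetical; it only shows the resolve-then-compare idea, not the router's exact implementation.

```python
from pathlib import Path


def is_inside(base_dir: Path, candidate: Path) -> bool:
    """Return True if candidate resolves to a location within base_dir."""
    # resolve() collapses ".." segments and follows symlinks, so a path
    # that escapes the base directory no longer has it as an ancestor.
    base = base_dir.resolve()
    target = candidate.resolve()
    return target == base or base in target.parents


base = Path("data/docs")
ok = is_inside(base, base / "guides" / "setup.md")
escaped = is_inside(base, base / ".." / ".." / "secrets.txt")
```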
Example: Multiple Directories with Different Patterns¶
sources:
  markdown_docs:
    type: directory
    path: "./docs"
    patterns: ["**/*.md"]
    tags: [docs, markdown]
  config_examples:
    type: directory
    path: "./examples"
    patterns: ["**/*.yaml", "**/*.json"]
    max_file_size: 50000
    tags: [examples]
  code_snippets:
    type: directory
    path: "./src"
    patterns: ["**/*.py"]
    exclude_patterns: ["**/tests/**", "**/__pycache__/**"]
    tags: [code]
git_repo -- Git Repository¶
Reads files from a git repository at a specific ref (branch, tag, or commit SHA). Uses git ls-tree and git show subprocess calls -- no working tree modifications, safe for production.
Config¶
sources:
  versioned_docs:
    type: git_repo
    path: "/repos/documentation"
    ref: "v2.1.0"
    patterns: ["docs/**/*.md", "README.md"]
    exclude_patterns: ["docs/internal/**"]
    tags: [documentation, versioned]
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| path | string | Yes | "" | Path to the git repository root. |
| ref | string | No | "HEAD" | Git ref: branch name, tag, or commit SHA. |
| patterns | list | No | ["**/*"] | Glob patterns for files to include. |
| exclude_patterns | list | No | [] | Glob patterns for files to exclude. |
| max_file_size | int | No | 1000000 | Max file size in bytes. |
Behavior¶
- git ls-tree -r --name-only <ref> lists all files at the specified ref
- Files are filtered by patterns and exclude_patterns
- git show <ref>:<file> reads each matching file's content
- Markdown files are split by H2 headings (same as the directory source)
- Files exceeding max_file_size are skipped
Chunk Metadata¶
| Key | Type | Description |
|---|---|---|
| ref | string | The git ref used to read the file. |
Security¶
- Ref validation: git refs are validated against ^[a-zA-Z0-9._/-]+$ before any subprocess call. Refs containing ;, $, `, or other shell metacharacters are rejected.
- Path validation: file paths from git ls-tree are validated against a safe character set. Files with special characters in their names are skipped.
- No shell=True: all subprocess calls use list syntax.
- Timeouts: git ls-tree has a 30-second timeout, git show has a 10-second timeout.
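The ref check and list-style subprocess call can be sketched as below. The regex and the 30-second timeout come from the list above; the helper names is_safe_ref and list_files, and the choice to return an empty list on failure, are assumptions made for this example.

```python
import re
import subprocess

# Pattern taken from the security notes above.
SAFE_REF = re.compile(r"^[a-zA-Z0-9._/-]+$")


def is_safe_ref(ref: str) -> bool:
    # fullmatch() also rejects a trailing newline, which "$" alone would allow.
    return SAFE_REF.fullmatch(ref) is not None


def list_files(repo_path: str, ref: str) -> list[str]:
    """List files at a ref, or return [] if the ref is unsafe or git fails."""
    if not is_safe_ref(ref):
        return []
    # Arguments are passed as a list, so no shell ever interprets the ref.
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", ref],
        cwd=repo_path,
        capture_output=True,
        text=True,
        timeout=30,
        check=False,
    )
    if result.returncode != 0:
        return []
    return result.stdout.splitlines()


accepted = [is_safe_ref(r) for r in ("v2.1.0", "feature/new-api", "HEAD")]
rejected = [is_safe_ref(r) for r in ("main; rm -rf /", "$(whoami)", "main`id`")]
blocked = list_files(".", "main; rm -rf /")  # rejected before any subprocess runs
```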
Requirements¶
- git must be available on the system PATH
- The path must be a valid git repository
- The ref must exist
Use Cases¶
- Versioned documentation: Pin to a release tag (ref: "v2.0.0") so context stays consistent with the deployed version
- Branch-specific context: Provide different context per branch (ref: "feature/new-api")
- Historical context: Read from any commit to understand past state
Example: Multiple Refs¶
sources:
  stable_docs:
    type: git_repo
    path: "/repos/docs"
    ref: "v2.0.0"
    patterns: ["**/*.md"]
    tags: [stable]
  latest_docs:
    type: git_repo
    path: "/repos/docs"
    ref: "main"
    patterns: ["**/*.md"]
    tags: [latest]
http_api -- REST API¶
Queries a REST API endpoint. Supports GET and POST, template substitution, custom headers, and JSON response parsing with dot-notation path navigation.
Config¶
sources:
  search_api:
    type: http_api
    url: "https://api.example.com/search?q={{query}}&limit=5"
    method: GET
    headers:
      Authorization: "Bearer ${API_TOKEN}"
      Accept: "application/json"
    response_path: "data.results"
    result_text_field: "content"
    result_title_field: "name"
    tags: [search, api]
Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | "" | API endpoint. {{query}} is replaced with the query text. |
| method | string | No | "GET" | HTTP method: GET or POST. |
| headers | dict | No | {} | HTTP headers. Values support ${ENV_VAR} interpolation. |
| body_template | string | No | "" | Request body for POST. {{query}} is replaced with the query text. |
| response_path | string | No | "" | Dot-notation path to the results array in the JSON response. |
| result_text_field | string | No | "text" | Field name for chunk text in each result object. |
| result_title_field | string | No | "title" | Field name for chunk title in each result object. |
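The ${ENV_VAR} interpolation for header values can be pictured roughly as follows. This is a sketch under assumptions: the interpolate_headers helper is hypothetical, and replacing an unset variable with an empty string is a choice made for the example, not documented router behavior.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_headers(
    headers: Mapping[str, str], env: Optional[Mapping[str, str]] = None
) -> dict[str, str]:
    """Replace ${NAME} references in header values with environment values."""
    env = os.environ if env is None else env

    def expand(value: str) -> str:
        # Assumption: an unset variable expands to an empty string.
        return ENV_REF.sub(lambda m: env.get(m.group(1), ""), value)

    return {name: expand(value) for name, value in headers.items()}


resolved = interpolate_headers(
    {"Authorization": "Bearer ${API_TOKEN}", "Accept": "application/json"},
    env={"API_TOKEN": "secret-123"},
)
```

Passing env explicitly keeps the example deterministic; omitting it reads from the process environment.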
Template Substitution¶
The {{query}} placeholder is replaced with the raw query text in both url and body_template:
# GET with query in URL
url: "https://api.example.com/search?q={{query}}"
# POST with query in body
url: "https://api.example.com/search"
method: POST
body_template: '{"query": "{{query}}", "top_k": 10, "filters": {"active": true}}'
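A minimal sketch of the substitution, assuming it is a plain string replacement. The render_template helper is hypothetical. Because the text above says the raw query is inserted, this sketch applies no URL or JSON escaping; whether the router escapes anything is not stated here.

```python
def render_template(template: str, query: str) -> str:
    """Replace every {{query}} placeholder with the raw query text."""
    return template.replace("{{query}}", query)


url = render_template("https://api.example.com/search?q={{query}}&limit=5", "install guide")
body = render_template('{"query": "{{query}}", "top_k": 10}', "install guide")
```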
Response Parsing¶
The engine parses JSON responses in three steps:
Step 1: Navigate to results. Use response_path to reach the array of results. For example, given this API response:
{
  "status": "ok",
  "data": {
    "results": [
      {"content": "First result", "title": "Doc A"},
      {"content": "Second result", "title": "Doc B"}
    ]
  }
}
Set response_path: "data.results" to navigate to the results array.
Step 2: Normalize. If the navigated value is a dict, it is wrapped in a list. If it is a scalar, it becomes a single chunk. If it is already a list, it is used as-is.
Step 3: Extract fields. For each item in the list, result_text_field and result_title_field specify which keys to extract. Items that are plain strings (not dicts) are used directly as chunk content.
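The three steps can be sketched as one small function. The names navigate and parse_response and the (title, text) tuple shape are illustrative. Returning no chunks when the path is missing or an item lacks the text field is an assumption of this sketch.

```python
import json
from typing import Any


def navigate(data: Any, response_path: str) -> Any:
    """Step 1: follow a dot-notation path; an empty path returns data as-is."""
    if not response_path:
        return data
    for key in response_path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None  # assumption: a missing path yields no results
        data = data[key]
    return data


def parse_response(raw: str, response_path: str,
                   text_field: str = "text", title_field: str = "title") -> list[tuple[str, str]]:
    value = navigate(json.loads(raw), response_path)
    if value is None:
        return []
    # Step 2: normalize dicts to a one-item list; scalars become one chunk.
    if isinstance(value, dict):
        items = [value]
    elif isinstance(value, list):
        items = value
    else:
        return [("", str(value))]
    # Step 3: extract (title, text); plain strings are used directly as content.
    chunks = []
    for item in items:
        if isinstance(item, str):
            chunks.append(("", item))
        elif isinstance(item, dict) and text_field in item:
            chunks.append((str(item.get(title_field, "")), str(item[text_field])))
    return chunks


raw = json.dumps({
    "status": "ok",
    "data": {"results": [
        {"content": "First result", "title": "Doc A"},
        {"content": "Second result", "title": "Doc B"},
    ]},
})
results = parse_response(raw, "data.results", text_field="content", title_field="title")
```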
Non-JSON Responses¶
If the response is not valid JSON, the entire response body is returned as a single chunk.
Chunk Metadata¶
| Key | Type | Description |
|---|---|---|
| url | string | The configured source URL (before template substitution). |
Security¶
- SSRF protection: URLs are validated before every request. Private ranges (10.x.x.x, 172.16-31.x.x, 192.168.x.x), loopback addresses (such as 127.0.0.1), and link-local addresses are blocked. Only http:// and https:// schemes are allowed.
- Validation after template substitution: the URL is validated after {{query}} replacement, preventing query-based SSRF bypasses.
- Timeout: all requests have a 30-second timeout.
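A simplified sketch of this kind of URL check, using the standard ipaddress and urllib.parse modules. The is_url_allowed helper is hypothetical and only inspects literal IP addresses. This sketch lets hostnames through without resolving them; a production check would also need to resolve DNS and validate the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose literal IP host is publicly routable."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP. This sketch allows hostnames without resolving them.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


# Validate AFTER substitution, so a query cannot smuggle in a blocked host.
template = "http://{{query}}/status"
substituted_ok = is_url_allowed(template.replace("{{query}}", "127.0.0.1"))

public_ok = is_url_allowed("https://api.example.com/search?q=x")
private_ok = is_url_allowed("http://192.168.1.10/admin")
scheme_ok = is_url_allowed("ftp://example.com/file")
```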
Error Handling¶
- HTTP errors (4xx, 5xx) result in zero chunks (no exception propagated)
- Invalid URLs result in zero chunks
- Timeouts (30 seconds) result in zero chunks
- SSRF-blocked URLs result in zero chunks
- For POST requests, Content-Type: application/json is set automatically if not provided in headers
Example: Vector Search API¶
sources:
  vector_search:
    type: http_api
    url: "https://embeddings.internal.acme.com/search"
    method: POST
    headers:
      Authorization: "Bearer ${EMBEDDINGS_API_KEY}"
    body_template: |
      {
        "query": "{{query}}",
        "top_k": 10,
        "threshold": 0.7,
        "namespace": "production"
      }
    response_path: "matches"
    result_text_field: "text"
    result_title_field: "metadata.title"
    tags: [search, vectors]
Writing a Custom Source¶
You can add custom source types by subclassing Source and registering with the @register_source decorator.
Step 1: Implement the Source¶
from theaios.context_router.sources import Source, register_source
from theaios.context_router.types import ContextChunk, Query, SourceConfig
from theaios.context_router.budget import estimate_tokens


@register_source("redis")
class RedisSource(Source):
    """Source that reads context from a Redis hash."""

    async def fetch(self, query: Query, config: SourceConfig) -> list[ContextChunk]:
        import redis.asyncio as redis

        # Use config.url for the Redis connection string
        client = redis.from_url(config.url)
        try:
            # Use config.path as the hash key
            data = await client.hgetall(config.path)
        finally:
            await client.aclose()

        chunks = []
        for key, value in data.items():
            text = value.decode("utf-8") if isinstance(value, bytes) else str(value)
            if not text.strip():
                continue
            chunks.append(ContextChunk(
                content=text,
                source=config.name,
                title=key.decode("utf-8") if isinstance(key, bytes) else str(key),
                token_count=estimate_tokens(text),
            ))
        return chunks
Step 2: Register the Import¶
Make sure your module is imported before the Router is created. You can do this in your application's startup code:
# Import to trigger @register_source decorator
import my_package.sources.redis_source # noqa: F401
from theaios.context_router import Router, load_config
config = load_config("context-router.yaml")
router = Router(config) # Now "redis" is available as a source type
Step 3: Use in Config¶
sources:
  session_context:
    type: redis
    url: "redis://localhost:6379/0"
    path: "agent:context:session-123"
    tags: [session, dynamic]
Source Base Class¶
class Source(ABC):
    @abstractmethod
    async def fetch(self, query: Query, config: SourceConfig) -> list[ContextChunk]:
        """Fetch context chunks from this source."""

    def fetch_sync(self, query: Query, config: SourceConfig) -> list[ContextChunk]:
        """Synchronous wrapper (uses asyncio.run internally)."""
Your fetch() method receives:
- query -- the incoming Query with text, agent, tags, and metadata
- config -- the SourceConfig parsed from YAML, with all common fields (name, type, path, url, content, headers, patterns, etc.)
Return a list of ContextChunk objects. Each chunk should have:
- content -- the text content (required for relevance scoring)
- source -- set to config.name
- title -- a human-readable title
- token_count -- use estimate_tokens(text) from the budget module
- path -- set if applicable (enables deny_paths filtering)
- metadata -- any source-specific metadata
Guidelines¶
- Handle errors gracefully. If your source is unavailable, return an empty list rather than raising an exception. The engine silently skips failed sources.
- Use asyncio.to_thread() for blocking I/O. If your source uses a synchronous library, wrap the call in asyncio.to_thread() to avoid blocking the event loop.
- Set token_count on chunks. Use estimate_tokens() from the budget module so the budget trimmer doesn't need to re-estimate.
- Set path for file-based sources. This enables the deny_paths permission feature.
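The first three guidelines can be combined in one sketch. To keep the example self-contained, it does not import the router package: Chunk, estimate_tokens_stub, blocking_lookup, and fetch_chunks are hypothetical stand-ins for ContextChunk, estimate_tokens(), a synchronous client call, and a Source.fetch() body.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for ContextChunk."""

    content: str
    source: str
    title: str
    token_count: int
    path: str = ""
    metadata: dict = field(default_factory=dict)


def estimate_tokens_stub(text: str) -> int:
    # Stand-in heuristic; the real estimate_tokens() may count differently.
    return max(1, len(text) // 4)


def blocking_lookup(key: str) -> str:
    # Pretend this is a call into a synchronous client library.
    if key == "missing":
        raise ConnectionError("backend unavailable")
    return f"value for {key}"


async def fetch_chunks(source_name: str, key: str) -> list[Chunk]:
    try:
        # Run the blocking call in a worker thread so the event loop stays free.
        text = await asyncio.to_thread(blocking_lookup, key)
    except Exception:
        # An unavailable source returns an empty list instead of raising.
        return []
    return [Chunk(content=text, source=source_name, title=key,
                  token_count=estimate_tokens_stub(text))]


ok = asyncio.run(fetch_chunks("legacy_db", "greeting"))
failed = asyncio.run(fetch_chunks("legacy_db", "missing"))
```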
Registry Functions¶
from theaios.context_router.sources import Source, register_source, get_source, list_sources
# Register a source class
@register_source("my_source")
class MySource(Source): ...
# Get an instance by name
source = get_source("my_source")
# List all registered source types
names = list_sources() # ["directory", "git_repo", "http_api", "inline", "my_source"]