API Reference¶

Complete Python API reference for TrustGate. All public symbols listed here are importable directly from the trustgate package unless otherwise noted.

Core Pipeline Functions¶

`trustgate.certify`¶

trustgate.certify(
    config: TrustGateConfig | None = None,
    config_path: str = "trustgate.yaml",
    questions: list[Question] | None = None,
    labels: dict[str, str] | None = None,
    ground_truth_file: str | None = None,
) -> CertificationResult

Run the full certification pipeline synchronously. This is the main entry point for most users.

Steps performed:

Load and validate configuration
Load evaluation questions
Initialize the response cache
Sample K responses per question (with optional sequential stopping)
Canonicalize all responses
Compute self-consistency profiles
Split data into calibration and test sets
Load or derive ground-truth labels
Run conformal calibration
Enrich the result with cost estimates and metadata

Parameters:

Parameter	Description
`config`	A pre-built `TrustGateConfig`. If `None`, loaded from `config_path`.
`config_path`	Path to a YAML config file. Defaults to `"trustgate.yaml"`. Ignored when `config` is provided.
`questions`	List of `Question` objects. If `None`, loaded from the config's `questions` section.
`labels`	Dict mapping question ID to the correct answer string. If `None`, derived from `ground_truth_file` or from each question's `acceptable_answers`.
`ground_truth_file`	Path to a CSV or JSON file containing ground-truth labels.

Returns: CertificationResult

Raises:

ConfigError -- if config validation fails.
LabelsRequired -- if no labels can be found from any source.
FileNotFoundError -- if config_path or ground_truth_file does not exist.

Example:

from theaios import trustgate

result = trustgate.certify(config_path="trustgate.yaml")
print(f"Reliability: {result.reliability_level:.0%}")

`trustgate.certify_async`¶

async trustgate.certify_async(
    config: TrustGateConfig | None = None,
    config_path: str = "trustgate.yaml",
    questions: list[Question] | None = None,
    labels: dict[str, str] | None = None,
    ground_truth_file: str | None = None,
) -> CertificationResult

Async version of certify(). Identical parameters, return type, and behavior. Use this when you are already inside an async context (e.g., an asyncio event loop or a framework like FastAPI).

Example:

import asyncio
from theaios import trustgate

result = asyncio.run(trustgate.certify_async(config_path="trustgate.yaml"))

`trustgate.calibrate`¶

trustgate.calibrate(
    profiles: dict[str, list[tuple[str, float]]],
    labels: dict[str, str],
    cal_ids: list[str],
    test_ids: list[str],
    alpha_values: list[float],
) -> CertificationResult

Low-level conformal calibration. Most users should call certify() instead; this function is useful when you have already computed self-consistency profiles and want direct control over the calibration/test split.

Steps performed:

Compute nonconformity scores on cal_ids.
For each alpha, find M* (the conformal quantile threshold).
Evaluate empirical coverage on test_ids.
Determine the reliability level (largest 1-alpha where coverage holds).
Compute conditional coverage and capability gap on the test set.

Parameters:

Parameter	Description
`profiles`	Dict mapping question ID to a self-consistency profile (list of `(answer, frequency)` tuples sorted by frequency descending).
`labels`	Dict mapping question ID to the correct canonical answer.
`cal_ids`	List of question IDs to use for calibration.
`test_ids`	List of question IDs to use for testing.
`alpha_values`	List of significance levels to evaluate (e.g., `[0.01, 0.05, 0.10]`).

Returns: CertificationResult (with k_used and api_cost_estimate set to 0; the caller should fill these in if needed).

`trustgate.compute_profile`¶

trustgate.compute_profile(
    canonical_answers: list[str],
) -> list[tuple[str, float]]

Compute the self-consistency profile from K canonicalized responses to a single question.

Parameters:

Parameter	Description
`canonical_answers`	List of K canonical answer strings for a single question. Must not be empty.

Returns: A list of (answer, frequency) tuples sorted by frequency descending, then alphabetically for tie-breaking. Frequencies sum to 1.0.

Raises: ValueError if the input list is empty.

Example:

from theaios import trustgate

profile = trustgate.compute_profile(["42", "42", "42", "43", "42"])
# [("42", 0.8), ("43", 0.2)]

`trustgate.sample`¶

trustgate.sample(
    config: TrustGateConfig,
    questions: list[Question],
    cache: DiskCache | None = None,
) -> dict[str, list[SampleResponse]]

Low-level synchronous sampling. Queries the configured AI endpoint K times per question, returning raw responses. Uses caching, concurrency limiting, and retries with exponential backoff internally.

Parameters:

Parameter	Description
`config`	A `TrustGateConfig` with endpoint and sampling parameters.
`questions`	List of `Question` objects to sample.
`cache`	Optional `DiskCache` instance. A default disk cache is used if not provided.

Returns: Dict mapping question ID to a list of SampleResponse objects.

`trustgate.load_config`¶

trustgate.load_config(
    path: str = "trustgate.yaml",
    overrides: dict[str, object] | None = None,
) -> TrustGateConfig

Load and validate a YAML configuration file.

Parameters:

Parameter	Description
`path`	Path to the YAML config file.
`overrides`	Optional dict of dot-notation overrides (e.g., `{"sampling.k_fixed": 5}`).

Returns: A validated TrustGateConfig instance.

Raises:

FileNotFoundError -- if the config file does not exist.
ConfigError -- if validation fails.

Example:

from theaios import trustgate

config = trustgate.load_config("trustgate.yaml")
print(config.endpoint.model)

Configuration Classes¶

`trustgate.TrustGateConfig`¶

@dataclass
class TrustGateConfig:
    endpoint: EndpointConfig
    sampling: SamplingConfig          # default: SamplingConfig()
    canonicalization: CanonConfig      # default: CanonConfig()
    calibration: CalibrationConfig    # default: CalibrationConfig()
    questions: QuestionsConfig        # default: QuestionsConfig()
    thresholds: ThresholdsConfig      # default: ThresholdsConfig()

Top-level configuration dataclass. Maps 1:1 to the sections in trustgate.yaml.

`trustgate.EndpointConfig`¶

@dataclass
class EndpointConfig:
    url: str
    model: str = ""
    temperature: float | None = 0.7
    api_key_env: str = ""
    max_tokens: int = 4096
    provider: str = ""  # "openai", "anthropic", "together", "generic_http", "generic" (auto-detected if empty)
    headers: dict[str, str] = {}
    request_template: dict | None = None
    response_path: str = ""
    cost_per_request: float | None = None

Configuration for an AI API endpoint. The provider field is auto-detected from the URL if left empty. Supported providers:

openai -- OpenAI Chat Completions API
anthropic -- Anthropic Messages API
together -- Together AI (OpenAI-compatible)
generic -- Any OpenAI-compatible endpoint (vLLM, Ollama, LiteLLM, etc.)

The api_key_env field specifies the name of the environment variable holding the API key (e.g., "LLM_API_KEY").

`trustgate.SamplingConfig`¶

@dataclass
class SamplingConfig:
    k_max: int = 20
    k_fixed: int | None = 10
    sequential_stopping: bool = True
    delta: float = 0.05
    max_concurrent: int = 10
    timeout: float = 120.0
    retries: int = 10

Field	Description
`k_max`	Maximum number of samples per question.
`k_fixed`	Fixed K to use. If set, overrides `k_max`. Set to `None` to use adaptive stopping up to `k_max`.
`sequential_stopping`	Enable early stopping when self-consistency stabilizes.
`delta`	Stability threshold for sequential stopping.
`max_concurrent`	Maximum concurrent API requests.
`timeout`	Per-request timeout in seconds.
`retries`	Number of retries on transient errors (429, 5xx, timeouts).

`trustgate.CanonConfig`¶

@dataclass
class CanonConfig:
    type: str = "mcq"
    judge_endpoint: EndpointConfig | None = None
    custom_class: str | None = None

Field	Description
`type`	Canonicalization strategy. One of: `"numeric"`, `"mcq"`, `"llm_judge"`, `"llm"`, `"embedding"`, `"custom"`.
`judge_endpoint`	Required when `type` is `"llm_judge"` or `"llm"`. Configures the LLM used as the judge.
`custom_class`	Required when `type` is `"custom"`. Dot-path to a `Canonicalizer` subclass (e.g., `"mypackage.MyCanonicalizer"`).

`trustgate.CalibrationConfig`¶

@dataclass
class CalibrationConfig:
    alpha_values: list[float] = [0.01, 0.05, 0.10, 0.15, 0.20]
    n_cal: int = 500
    n_test: int = 500
    bootstrap_splits: int = 100

Field	Description
`alpha_values`	Significance levels to evaluate. Smaller alpha = stricter reliability requirement.
`n_cal`	Number of questions for the calibration set.
`n_test`	Number of questions for the test set.
`bootstrap_splits`	Number of bootstrap re-splits for stability analysis.

Data Classes¶

`trustgate.Question`¶

@dataclass
class Question:
    id: str
    text: str
    acceptable_answers: list[str] | None = None
    metadata: dict[str, str] = {}

A single evaluation question.

Field	Description
`id`	Unique identifier for the question (e.g., `"gsm8k_42"`).
`text`	The question text sent to the model.
`acceptable_answers`	Optional list of correct answers. Used to derive labels when no external ground truth is provided.
`metadata`	Arbitrary key-value metadata (e.g., `{"source": "gsm8k", "subject": "algebra"}`).

`trustgate.CertificationResult`¶

@dataclass
class CertificationResult:
    reliability_level: float
    m_star: int
    coverage: float
    conditional_coverage: float
    capability_gap: float
    target_alpha: float = 0.05
    n_cal: int
    n_test: int
    k_used: int
    api_cost_estimate: float
    alpha_coverage: dict[float, float] = {}
    per_item: list[dict[str, object]] = []

Output of the certification pipeline.

Field	Description
`reliability_level`	The largest `1 - alpha` for which conformal coverage holds on the test set. Higher is better. For example, `0.90` means 90% reliability.
`m_star`	The size of the conformal prediction set. `m_star = 1` means the model's top answer is reliably correct.
`coverage`	Empirical coverage on the test set: fraction of questions where the top-M* answers contain the correct one.
`conditional_coverage`	Coverage on solvable items only (where the correct answer appears somewhere in the K samples).
`capability_gap`	Fraction of questions where the correct answer never appeared in any of the K samples. Indicates inherent model limitations.
`n_cal`	Number of calibration items used.
`n_test`	Number of test items used.
`k_used`	Average number of samples per question.
`api_cost_estimate`	Estimated API cost in USD (based on known model pricing).
`alpha_coverage`	Per-alpha coverage breakdown: `{alpha: empirical_coverage}`.
`per_item`	Optional per-question diagnostics.

Canonicalization¶

`trustgate.Canonicalizer`¶

class Canonicalizer(ABC):
    def canonicalize(self, question: str, answer: str) -> str: ...
    def validate(self, canonical: str) -> bool: ...
    def preprocess(self, answer: str) -> str: ...

Abstract base class for all canonicalizers. Subclass this to create custom canonicalization logic.

Method	Description
`canonicalize(question, answer)`	Abstract. Map a raw LLM answer to a canonical form. Two answers that mean the same thing should produce the same string.
`validate(canonical)`	Optional. Check that the canonical form is well-formed. Returns `True` by default.
`preprocess(answer)`	Shared preprocessing applied before canonicalization. Strips whitespace, removes markdown code fences, removes common LLM preambles, and applies Unicode NFC normalization.

Built-in canonicalizers:

Name	Use case
`"numeric"`	Math problems -- extracts and normalizes numeric answers.
`"mcq"`	Multiple-choice -- extracts letter answers (A, B, C, D).
`"llm"`	LLM Semantic -- extracts core factual answers via LLM prompt. Source: `theaios.trustgate.canonicalize.llm_semantic`.
`"llm_judge"`	Open-ended -- uses another LLM to judge equivalence.
`"embedding"`	Semantic similarity -- clusters answers by embedding distance.

`trustgate.register_canonicalizer`¶

trustgate.register_canonicalizer(name: str) -> Callable

Class decorator that registers a Canonicalizer subclass under the given name.

Example:

from theaios.trustgate import Canonicalizer, register_canonicalizer

@register_canonicalizer("my_custom")
class MyCanonicalizer(Canonicalizer):
    def canonicalize(self, question: str, answer: str) -> str:
        text = self.preprocess(answer)
        return text.lower().strip()

`trustgate.get_canonicalizer`¶

trustgate.get_canonicalizer(name: str, **kwargs) -> Canonicalizer

Instantiate a registered canonicalizer by name.

Parameters:

Parameter	Description
`name`	The registered name (e.g., `"numeric"`, `"mcq"`, `"my_custom"`).
`**kwargs`	Passed to the canonicalizer constructor. For `"llm_judge"`, pass `judge_config=EndpointConfig(...)`.

Raises: KeyError if the name is not registered.

Example:

from theaios import trustgate

canon = trustgate.get_canonicalizer("numeric")
result = canon.canonicalize("What is 2+2?", "The answer is 4.")
# "4"

Model Comparison¶

`trustgate.comparison.compare`¶

trustgate.comparison.compare(
    models: list[str],
    config: TrustGateConfig,
    questions: list[Question],
    labels: dict[str, str],
) -> list[tuple[str, CertificationResult]]

Compare multiple models on the same question set. Runs each model through the full certification pipeline and returns results sorted by reliability_level descending.

Models are evaluated sequentially (to respect per-model rate limits), but sampling within each model run is parallelized.

Parameters:

Parameter	Description
`models`	List of model name strings (e.g., `["gpt-4.1-mini", "gpt-4.1"]`). Each name is set as `config.endpoint.model` for that run.
`config`	Base configuration. The `endpoint.model` field is overridden for each model.
`questions`	Shared question set for all models.
`labels`	Shared ground-truth labels for all models.

Returns: List of (model_name, CertificationResult) tuples, sorted by reliability level descending (best model first).

There is also an async variant: trustgate.comparison.compare_async(...) with the same signature.

Example:

from theaios.trustgate.comparison import compare

results = compare(
    models=["gpt-4.1-mini", "gpt-4.1"],
    config=config,
    questions=questions,
    labels=labels,
)

for name, result in results:
    print(f"{name}: {result.reliability_level:.0%}")

Dataset Loaders¶

Built-in loaders for popular evaluation benchmarks. Import from trustgate.datasets.

`trustgate.datasets.load_gsm8k`¶

trustgate.datasets.load_gsm8k(
    n: int | None = None,
    seed: int = 42,
) -> list[Question]

Load GSM8K grade-school math problems. Downloads from GitHub on first use and caches locally in ~/.trustgate/datasets/.

Each returned Question has acceptable_answers set to the numeric answer extracted from the solution. Recommended canonicalization: "numeric".

Parameters:

Parameter	Description
`n`	Subsample to `n` questions. `None` returns all (~1319 questions).
`seed`	Random seed for reproducible subsampling.

`trustgate.datasets.load_mmlu`¶

trustgate.datasets.load_mmlu(
    subjects: list[str] | None = None,
    n: int | None = None,
    seed: int = 42,
    data: list[dict[str, object]] | None = None,
) -> list[Question]

Load MMLU multiple-choice questions. Uses the HuggingFace datasets library if installed; falls back to an empty list otherwise.

Each question is formatted with lettered options (A, B, C, D) and acceptable_answers set to the correct letter. Recommended canonicalization: "mcq".

Parameters:

Parameter	Description
`subjects`	Filter to specific MMLU subjects. `None` returns all subjects.
`n`	Subsample to `n` questions. `None` returns all.
`seed`	Random seed for reproducible subsampling.
`data`	Pre-loaded data rows (list of dicts). Useful for testing without downloading.

`trustgate.datasets.load_truthfulqa`¶

trustgate.datasets.load_truthfulqa(
    n: int | None = None,
    seed: int = 42,
    data: list[dict[str, object]] | None = None,
) -> list[Question]

Load TruthfulQA questions about common misconceptions. Uses the HuggingFace datasets library if installed; falls back to an empty list otherwise.

Open-ended questions. Recommended canonicalization: "llm_judge" or "embedding".

Parameters:

Parameter	Description
`n`	Subsample to `n` questions. `None` returns all.
`seed`	Random seed for reproducible subsampling.
`data`	Pre-loaded data rows (list of dicts). Useful for testing without downloading.

Reporting¶

Functions for outputting certification results. Import from trustgate.reporting.

`trustgate.reporting.console.print_certification_result`¶

trustgate.reporting.console.print_certification_result(
    result: CertificationResult,
    verbose: bool = False,
    console: Console | None = None,
) -> None

Print a formatted certification result table to the terminal using Rich. Shows reliability level, M*, coverage metrics, capability gap, and a PASS/UNCERTAIN status indicator.

When verbose=True, also prints a per-alpha coverage breakdown table.

`trustgate.reporting.console.print_comparison_result`¶

trustgate.reporting.console.print_comparison_result(
    results: list[tuple[str, CertificationResult]],
    console: Console | None = None,
) -> None

Print a side-by-side comparison table for multiple models. Shows reliability, M*, coverage, conditional coverage, capability gap, and K used for each model.

`trustgate.reporting.json_export.export_json`¶

trustgate.reporting.json_export.export_json(
    result: CertificationResult,
    path: str | None = None,
) -> str

Export a certification result as a JSON string. If path is provided, also writes the JSON to that file. Always returns the JSON string.

The JSON includes a trustgate_version field and a UTC timestamp.

`trustgate.reporting.csv_export.export_csv`¶

trustgate.reporting.csv_export.export_csv(
    result: CertificationResult,
    path: str | None = None,
) -> str

Export a certification result as a CSV string. If path is provided, also writes the CSV to that file. Always returns the CSV string.

The first row contains summary metrics. If per_item data is present, additional rows with per-question diagnostics are appended after a blank separator row.

API Reference¶

Core Pipeline Functions¶

trustgate.certify¶

trustgate.certify_async¶

trustgate.calibrate¶

trustgate.compute_profile¶

trustgate.sample¶

trustgate.load_config¶

Configuration Classes¶

trustgate.TrustGateConfig¶

trustgate.EndpointConfig¶

trustgate.SamplingConfig¶

trustgate.CanonConfig¶

trustgate.CalibrationConfig¶

Data Classes¶

trustgate.Question¶

trustgate.CertificationResult¶

Canonicalization¶

trustgate.Canonicalizer¶

trustgate.register_canonicalizer¶

trustgate.get_canonicalizer¶

Model Comparison¶

trustgate.comparison.compare¶

Dataset Loaders¶

trustgate.datasets.load_gsm8k¶

trustgate.datasets.load_mmlu¶

trustgate.datasets.load_truthfulqa¶

Reporting¶

trustgate.reporting.console.print_certification_result¶

trustgate.reporting.console.print_comparison_result¶

trustgate.reporting.json_export.export_json¶

trustgate.reporting.csv_export.export_csv¶

`trustgate.certify`¶

`trustgate.certify_async`¶

`trustgate.calibrate`¶

`trustgate.compute_profile`¶

`trustgate.sample`¶

`trustgate.load_config`¶

`trustgate.TrustGateConfig`¶

`trustgate.EndpointConfig`¶

`trustgate.SamplingConfig`¶

`trustgate.CanonConfig`¶

`trustgate.CalibrationConfig`¶

`trustgate.Question`¶

`trustgate.CertificationResult`¶

`trustgate.Canonicalizer`¶

`trustgate.register_canonicalizer`¶

`trustgate.get_canonicalizer`¶

`trustgate.comparison.compare`¶

`trustgate.datasets.load_gsm8k`¶

`trustgate.datasets.load_mmlu`¶

`trustgate.datasets.load_truthfulqa`¶

`trustgate.reporting.console.print_certification_result`¶

`trustgate.reporting.console.print_comparison_result`¶

`trustgate.reporting.json_export.export_json`¶

`trustgate.reporting.csv_export.export_csv`¶