Token Budget Enforcement

Compute token usage across the full request context and reject oversized requests before they reach the LLM.

Overview

Token budget enforcement prevents runaway context growth and protects against requests that exceed model or application limits. The guardrail calculates token usage from the complete context—including system prompts, conversation history, and the current user message—and compares it against configurable limits before processing begins.

Use this guardrail to control cost, stay within model context windows, and ensure predictable resource usage in production applications.

How It Works

The guardrail receives the full request context (system messages, history, and user input).
Token usage is computed for the entire context, not just the latest message.
reserved_output_tokens is set aside for the model response before the request budget is evaluated.
The total, including that reservation, is compared against max_request_tokens, and cumulative usage for the run is compared against max_run_tokens.
Requests that exceed configured limits are rejected before LLM invocation.
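
The flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, `check_budget` and `run_tokens_used` are made-up names, and the sketch assumes the reservation is added to the input count before comparing with max_request_tokens. How the library charges the reservation against the run budget is not specified here, so the run check below uses input tokens only.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def check_budget(
    messages: List[Dict[str, str]],
    max_request_tokens: int,
    max_run_tokens: int,
    reserved_output_tokens: int,
    run_tokens_used: int = 0,
) -> bool:
    """Return True when the full context fits both budgets (assumed semantics)."""
    # Count every message in the context, not just the latest one.
    input_tokens = sum(count_tokens(m["content"]) for m in messages)
    # Set aside room for the model response before checking the request limit.
    fits_request = input_tokens + reserved_output_tokens <= max_request_tokens
    # Check cumulative usage for the run (assumption: input tokens only).
    fits_run = run_tokens_used + input_tokens <= max_run_tokens
    return fits_request and fits_run


context = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Summarize the last meeting in two sentences."},
]
allowed = check_budget(
    context,
    max_request_tokens=50,
    max_run_tokens=80,
    reserved_output_tokens=10,
)
print("allowed:", allowed)
```

In a real pipeline the check would run before the model call, and a False result would short-circuit the request.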

Configuration

Enable Token Budget Enforcement

yaml

guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Parameters

Option	Type	Description
`enabled`	bool	Enable token budget enforcement
`input_checks`	bool	Enforce limits on incoming requests
`output_checks`	bool	Enforce limits on model output
`max_request_tokens`	int	Maximum tokens allowed for a single request context
`max_run_tokens`	int	Maximum total tokens allowed for an entire run or session
`reserved_output_tokens`	int	Tokens reserved for the model response when evaluating input budget

Complete Example

yaml

guardrails:
  input_checks: true
  output_checks: true
  check_toxicity: true
  check_sensitive_data: true
  check_semantic: true

  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Full Context Calculation

Token counts include the entire context passed to the guardrail:

System prompts and instructions
Full conversation history
Current user message
Any additional context injected by the application

This ensures that accumulated history cannot silently exceed your budget. A request with a short latest message but a long conversation history will still be evaluated against the full token count.
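
To make the point about accumulated history concrete, the sketch below totals a context by source. It is an illustration only: `count_tokens` is a hypothetical word-count stand-in for a real tokenizer, and `context_tokens` and its argument names are assumptions rather than library APIs.

```python
from typing import Dict, List, Optional


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def context_tokens(
    user_message: str,
    history: List[Dict[str, str]],
    system_prompt: str = "",
    extra_context: Optional[str] = None,
) -> Dict[str, int]:
    """Break the full context down by source and add a total."""
    parts = {
        "system": count_tokens(system_prompt),
        "history": sum(count_tokens(m["content"]) for m in history),
        "user": count_tokens(user_message),
        "extra": count_tokens(extra_context or ""),
    }
    parts["total"] = sum(parts.values())
    return parts


# A long history followed by a one-word message.
history = [
    {"role": "user", "content": "word " * 30},
    {"role": "assistant", "content": "word " * 25},
]
usage = context_tokens("Thanks!", history, system_prompt="You are helpful.")
print(usage)
```

Here the latest message contributes a single token, yet the total of 59 would already exceed a max_request_tokens of 50.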

Rejecting Oversized Requests

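When a request exceeds max_request_tokens (after accounting for reserved_output_tokens) or max_run_tokens, the guardrail rejects it before the LLM is called. As the examples under Usage show, LLMRails reports an input rejection as a blocked result with a block_reason of token_budget_input, while TokenBudgetEnforcer signals it with a TokenBudgetExceededError that carries the token breakdown.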

Typical rejection scenarios:

A single request context exceeds max_request_tokens
Cumulative usage across a run exceeds max_run_tokens
Insufficient budget remains after reserving tokens for the expected response
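
The three scenarios can be told apart with a few comparisons. The sketch below is hypothetical: `rejection_reason` and its labels are invented for illustration and are not the library's block reasons, and the order of the checks is an assumption.

```python
from typing import Optional


def rejection_reason(
    input_tokens: int,
    run_tokens_used: int,
    max_request_tokens: int,
    max_run_tokens: int,
    reserved_output_tokens: int,
) -> Optional[str]:
    """Return a label for the first violated limit, or None if the request fits.

    The labels are illustrative only; they are not the library's block reasons.
    """
    # The context alone is larger than the per-request limit.
    if input_tokens > max_request_tokens:
        return "request_context_too_large"
    # The context fits, but not once the response reservation is added.
    if input_tokens + reserved_output_tokens > max_request_tokens:
        return "no_room_for_reserved_output"
    # Earlier turns plus this request exceed the run-level cap.
    if run_tokens_used + input_tokens > max_run_tokens:
        return "run_budget_exceeded"
    return None


limits = dict(max_request_tokens=50, max_run_tokens=80, reserved_output_tokens=10)
print(rejection_reason(60, 0, **limits))   # context too large on its own
print(rejection_reason(45, 0, **limits))   # fits only without the reservation
print(rejection_reason(30, 60, **limits))  # run budget already mostly spent
print(rejection_reason(30, 0, **limits))   # within every limit
```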

Input vs Output Checks

Control where enforcement applies:

yaml

guardrails:
  token_budget:
    enabled: true
    input_checks: true   # Reject oversized input contexts
    output_checks: true  # Validate output token usage

For most applications, enabling input checks alone is sufficient to prevent oversized requests from reaching the model.

Usage

Token budget enforcement supports two integration approaches:

LLMRails — Automatic enforcement when token_budget.enabled: true is set in your YAML config. No separate import required beyond LLMRails and RailsConfig.
TokenBudgetEnforcer — Standalone class for custom LLM pipelines with pre-flight and post-flight token checks.

With LLMRails (Recommended)

When token budget is enabled in your configuration, LLMRails applies enforcement automatically during generate() and generate_async(). Input checks run before the LLM is called; output checks run after the response is received.

python

from elsai_guardrails.guardrails import LLMRails, RailsConfig

config = RailsConfig.from_content(config_path="config.yml")
rails = LLMRails(config=config)

messages = [{"role": "user", "content": "Hello"}]
result = rails.generate(messages, return_details=True)

if result["blocked"]:
    print(result["block_reason"])   # token_budget_input | token_budget_output
    print(result["token_budget_input_check"])
    print(result["token_budget_output_check"])
else:
    print(result["final_response"])

Ensure your config.yml includes:

yaml

guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Block reasons when token budget is exceeded:

`block_reason`	Description
`token_budget_input`	Request context exceeded the input token budget
`token_budget_output`	Response exceeded the output or run token budget

Additional result fields when return_details=True:

Field	Description
`token_budget_input_check`	Input token budget check result (present when input checks are enabled)
`token_budget_output_check`	Output token budget check result (present when output checks are enabled)

See LLMRails for full result structure details.

With TokenBudgetEnforcer (Standalone)

Use TokenBudgetEnforcer when you manage LLM invocation yourself and need explicit pre-flight and post-flight token validation.

python

from elsai_guardrails.guardrails import TokenBudgetEnforcer, TokenBudgetExceededError

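# Option 1: load the limits from YAML
enforcer = TokenBudgetEnforcer.from_config("config.yml")

# Option 2: set the limits programmatically
# (this replaces the enforcer created above; keep only one of the two)
enforcer = TokenBudgetEnforcer(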
    max_total_tokens=4096,
    reserved_output_tokens=512,
    max_run_total_tokens=8192,
)

messages = [{"role": "user", "content": "Hello"}]
user_input = "Hello"

try:
    # Pre-flight: validate input context before calling the LLM
    breakdown = enforcer.process_request(
        user_input,
        messages=messages,
        system_prompt="You are helpful.",
        retrieved_context="",  # optional RAG context
    )
    print("Within budget:", breakdown.within_budget)

    # Call your LLM (my_llm is a placeholder for your own client object)
    raw_response = my_llm.invoke(messages)

    # Post-flight: validate response token usage
    usage, breakdown, run_usage = enforcer.process_response(
        raw_response,
        user_input,
        messages=messages,
    )
    print(usage.to_dict())

except TokenBudgetExceededError as exc:
    print(exc)
    print(exc.breakdown.to_dict())

When to use each approach:

Approach	Best For
`LLMRails`	Standard guardrailed LLM calls with YAML-driven configuration
`TokenBudgetEnforcer`	Custom LLM integrations, RAG pipelines, or manual pre/post-flight control

Programmatic vs YAML Parameter Mapping

When using TokenBudgetEnforcer programmatically, parameter names differ slightly from YAML:

YAML (`config.yml`)	`TokenBudgetEnforcer` constructor
`max_request_tokens`	`max_total_tokens`
`max_run_tokens`	`max_run_total_tokens`
`reserved_output_tokens`	`reserved_output_tokens`
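
If the limits live in YAML but the enforcer is built in code, the renaming in the table above can be applied mechanically. The helper below is a hypothetical convenience, not part of the library; it works on an already-parsed dict so that it does not depend on any YAML package.

```python
from typing import Any, Dict

# YAML key -> TokenBudgetEnforcer constructor keyword, as listed in the table above.
YAML_TO_CONSTRUCTOR = {
    "max_request_tokens": "max_total_tokens",
    "max_run_tokens": "max_run_total_tokens",
    "reserved_output_tokens": "reserved_output_tokens",
}


def to_constructor_kwargs(token_budget: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rename the YAML limit keys and drop everything else."""
    return {
        YAML_TO_CONSTRUCTOR[key]: value
        for key, value in token_budget.items()
        if key in YAML_TO_CONSTRUCTOR
    }


# The token_budget section of config.yml, already parsed into a dict.
token_budget = {
    "enabled": True,
    "input_checks": True,
    "max_request_tokens": 4096,
    "max_run_tokens": 8192,
    "reserved_output_tokens": 512,
}
kwargs = to_constructor_kwargs(token_budget)
print(kwargs)
```

The resulting dict could then be passed as `TokenBudgetEnforcer(**kwargs)`.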

Use Cases

Cost Control

Set conservative per-request limits for high-volume applications:

yaml

guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 4096
    max_run_tokens: 16384
    reserved_output_tokens: 512

Short-Context Applications

Enforce tight limits for lightweight assistants or classification tasks:

yaml

guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Session-Level Limits

Use max_run_tokens to cap total usage across a multi-turn conversation:

yaml

guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 2048
    max_run_tokens: 8192
    reserved_output_tokens: 256
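
A run-level cap amounts to a running total that is checked on every turn. The sketch below shows that idea with a hypothetical `RunBudget` class; it is not the library's accounting, and the fixed per-turn sizes are chosen only to keep the arithmetic easy to follow.

```python
class RunBudget:
    """Hypothetical per-session accumulator; not part of the library."""

    def __init__(self, max_run_tokens: int) -> None:
        self.max_run_tokens = max_run_tokens
        self.used = 0

    def remaining(self) -> int:
        """Tokens still available in this run."""
        return self.max_run_tokens - self.used

    def record_turn(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one turn's usage; return False and record nothing if it would exceed the cap."""
        turn_total = input_tokens + output_tokens
        if self.used + turn_total > self.max_run_tokens:
            return False
        self.used += turn_total
        return True


run = RunBudget(max_run_tokens=8192)
# Five turns of 1800 input tokens and 240 output tokens each.
results = [run.record_turn(1800, 240) for _ in range(5)]
print(results, run.remaining())
```

With these numbers the first four turns fit and the fifth is refused, leaving 32 tokens unused in the run.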

Best Practices

Account for output tokens — Set reserved_output_tokens to match your expected response length so input budget calculations leave room for the model reply.
Size limits to your model — Align max_request_tokens with your LLM's context window minus headroom for the response.
Monitor rejections — Track how often requests are rejected to tune limits without blocking legitimate use.
Combine with other guardrails — Token budget enforcement works alongside PII/PHI Detection and content safety checks for comprehensive request validation.
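
One lightweight way to monitor rejections is to tally results by block reason. The sketch below assumes result dicts shaped like the return_details=True output shown under Usage (`blocked`, `block_reason`, `final_response`); `RejectionMonitor` itself is a hypothetical helper, not a library class.

```python
from collections import Counter


class RejectionMonitor:
    """Hypothetical helper that tallies outcomes by block reason."""

    def __init__(self) -> None:
        self.total = 0
        self.rejections: Counter = Counter()

    def record(self, result: dict) -> None:
        """Record one result dict shaped like the return_details=True output."""
        self.total += 1
        if result.get("blocked"):
            self.rejections[result.get("block_reason", "unknown")] += 1

    def rejection_rate(self) -> float:
        """Fraction of recorded requests that were blocked."""
        return sum(self.rejections.values()) / self.total if self.total else 0.0


monitor = RejectionMonitor()
for result in [
    {"blocked": False, "final_response": "Hi"},
    {"blocked": True, "block_reason": "token_budget_input"},
    {"blocked": False, "final_response": "Sure"},
    {"blocked": True, "block_reason": "token_budget_input"},
]:
    monitor.record(result)
print(monitor.rejections, monitor.rejection_rate())
```

A rate that stays high for token_budget_input suggests the request limit is too tight for real traffic, or that history should be trimmed before it reaches the guardrail.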

Next Steps

PII/PHI Detection and Data Masking — Detect and mask sensitive information
Input Rails — How input validation works
LLMRails — Integrated LLM and guardrail usage
Guardrails Configuration — Full configuration reference
YAML Configuration — Complete configuration examples
