Token Budget Enforcement

Compute token usage across the full request context and reject oversized requests before they reach the LLM.

Overview

Token budget enforcement prevents runaway context growth and protects against requests that exceed model or application limits. The guardrail calculates token usage from the complete context—including system prompts, conversation history, and the current user message—and compares it against configurable limits before processing begins.

Use this guardrail to control cost, stay within model context windows, and ensure predictable resource usage in production applications.

How It Works

  1. The guardrail receives the full request context (system messages, history, and user input).
  2. Token usage is computed for the entire context, not just the latest message.
  3. Output capacity is set aside via reserved_output_tokens so that the request budget leaves room for the model's reply.
  4. The total is compared against max_request_tokens and max_run_tokens.
  5. Requests that exceed either limit are rejected before the LLM is invoked.
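
The flow above can be sketched as a pre-flight gate around an LLM call. This is a minimal illustration under stated assumptions, not the library's implementation: `count_tokens` is a whitespace-based stand-in for a real tokenizer, `BudgetExceeded` and `guarded_call` are hypothetical names, and adding the reservation to the input count before comparing against `max_request_tokens` is one reading of the description above. The run-level limit is left out for brevity.

```python
from typing import Callable, Dict, List


class BudgetExceeded(Exception):
    """Hypothetical error raised when the context does not fit the budget."""


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def guarded_call(
    messages: List[Dict[str, str]],
    llm: Callable[[List[Dict[str, str]]], str],
    max_request_tokens: int,
    reserved_output_tokens: int,
) -> str:
    # Count tokens over the entire context, not just the last message.
    input_tokens = sum(count_tokens(m["content"]) for m in messages)
    # Reserve room for the reply before evaluating the request budget.
    required = input_tokens + reserved_output_tokens
    # Reject before the LLM is ever invoked.
    if required > max_request_tokens:
        raise BudgetExceeded(
            f"{required} tokens required, limit is {max_request_tokens}"
        )
    return llm(messages)
```

Because the check runs before `llm(messages)`, a rejected request never incurs model cost.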

Configuration

Enable Token Budget Enforcement

yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Parameters

| Option | Type | Description |
| --- | --- | --- |
| enabled | bool | Enable token budget enforcement |
| input_checks | bool | Enforce limits on incoming requests |
| output_checks | bool | Enforce limits on model output |
| max_request_tokens | int | Maximum tokens allowed for a single request context |
| max_run_tokens | int | Maximum total tokens allowed for an entire run or session |
| reserved_output_tokens | int | Tokens reserved for the model response when evaluating input budget |

Complete Example

yaml
guardrails:
  input_checks: true
  output_checks: true
  check_toxicity: true
  check_sensitive_data: true
  check_semantic: true

  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Full Context Calculation

Token counts include the entire context passed to the guardrail:

  • System prompts and instructions
  • Full conversation history
  • Current user message
  • Any additional context injected by the application

This ensures that accumulated history cannot silently exceed your budget. A request with a short latest message but a long conversation history will still be evaluated against the full token count.
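
To make the effect of accumulated history concrete, here is a small sketch that totals tokens per context component. The names `count_tokens` and `count_context_tokens` are hypothetical, and the whitespace tokenizer is only a stand-in; a real deployment would count with the tokenizer that matches its model.

```python
from typing import Dict, List, Optional


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def count_context_tokens(
    system_prompt: str,
    history: List[Dict[str, str]],
    user_message: str,
    injected_context: Optional[str] = None,
) -> Dict[str, int]:
    """Return a per-component token breakdown plus the total."""
    breakdown = {
        "system": count_tokens(system_prompt),
        "history": sum(count_tokens(m["content"]) for m in history),
        "user": count_tokens(user_message),
        "injected": count_tokens(injected_context or ""),
    }
    # The sum is evaluated before the "total" key is added.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


history = [
    {"role": "user", "content": "Tell me about token budgets in detail please"},
    {
        "role": "assistant",
        "content": "A token budget caps how much context a request may carry " * 4,
    },
]
result = count_context_tokens("You are helpful.", history, "Thanks", "")
print(result)
```

Here the latest user message counts as a single token under the stand-in tokenizer, yet the history pushes the total to 56, which would exceed a 50-token request limit.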

Rejecting Oversized Requests

When a request exceeds max_request_tokens (after accounting for reserved_output_tokens) or max_run_tokens, the guardrail rejects it before the LLM is called. The caller receives a clear failure indicating the request exceeded the configured budget.

Typical rejection scenarios:

  • A single request context exceeds max_request_tokens
  • Cumulative usage across a run exceeds max_run_tokens
  • Insufficient budget remains after reserving tokens for the expected response
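
The three scenarios can be told apart with a few comparisons. The sketch below is illustrative only: the function name and the returned labels are hypothetical (they are not the library's `block_reason` values), and it assumes run usage is evaluated as prior cumulative usage plus the current context.

```python
from typing import Optional


def rejection_reason(
    context_tokens: int,
    run_tokens_used: int,
    max_request_tokens: int,
    max_run_tokens: int,
    reserved_output_tokens: int,
) -> Optional[str]:
    """Return a label for the first violated limit, or None if the request fits."""
    # Scenario 1: the context alone is larger than the per-request limit.
    if context_tokens > max_request_tokens:
        return "request_context_too_large"
    # Scenario 2: this request would push the run past its cumulative cap.
    if run_tokens_used + context_tokens > max_run_tokens:
        return "run_budget_exhausted"
    # Scenario 3: the context fits, but not with the reserved response on top.
    if context_tokens + reserved_output_tokens > max_request_tokens:
        return "no_room_for_response"
    return None
```

With limits of 50, 80, and 10, a 45-token context on a fresh run falls into the third scenario: it fits on its own, but leaves only 5 tokens for a reply that needs 10.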

Input vs Output Checks

Control where enforcement applies:

yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true   # Reject oversized input contexts
    output_checks: true  # Validate output token usage

For most applications, enabling input checks alone is sufficient to prevent oversized requests from reaching the model.

Usage

Token budget enforcement supports two integration approaches:

  1. LLMRails — Automatic enforcement when token_budget.enabled: true is set in your YAML config. No separate import required beyond LLMRails and RailsConfig.
  2. TokenBudgetEnforcer — Standalone class for custom LLM pipelines with pre-flight and post-flight token checks.

With LLMRails

When token budget is enabled in your configuration, LLMRails applies enforcement automatically during generate() and generate_async(). Input checks run before the LLM is called; output checks run after the response is received.

python
from elsai_guardrails.guardrails import LLMRails, RailsConfig

config = RailsConfig.from_content(config_path="config.yml")
rails = LLMRails(config=config)

messages = [{"role": "user", "content": "Hello"}]
result = rails.generate(messages, return_details=True)

if result["blocked"]:
    print(result["block_reason"])   # token_budget_input | token_budget_output
    print(result["token_budget_input_check"])
    print(result["token_budget_output_check"])
else:
    print(result["final_response"])

Ensure your config.yml includes:

yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Block reasons when token budget is exceeded:

| block_reason | Description |
| --- | --- |
| token_budget_input | Request context exceeded the input token budget |
| token_budget_output | Response exceeded the output or run token budget |

Additional result fields when return_details=True:

| Field | Description |
| --- | --- |
| token_budget_input_check | Input token budget check result (present when input checks are enabled) |
| token_budget_output_check | Output token budget check result (present when output checks are enabled) |

See LLMRails for full result structure details.
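
A caller will usually want to branch on these fields. The helper below is a hypothetical sketch that relies only on the keys named above (`blocked`, `block_reason`, `final_response`, and the two check fields). It treats the check results as opaque values, since their internal structure is not described here.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Turn a detailed result dict into a short, loggable message."""
    if not result.get("blocked"):
        return str(result.get("final_response", ""))
    reason = result.get("block_reason")
    if reason == "token_budget_input":
        detail = result.get("token_budget_input_check")
        return f"Rejected before the LLM call (input budget): {detail}"
    if reason == "token_budget_output":
        detail = result.get("token_budget_output_check")
        return f"Rejected after the LLM call (output or run budget): {detail}"
    # Some other guardrail blocked the request.
    return f"Blocked: {reason}"
```

Using `.get()` keeps the helper safe when a check field is absent because the corresponding check is disabled.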

With TokenBudgetEnforcer (Standalone)

Use TokenBudgetEnforcer when you manage LLM invocation yourself and need explicit pre-flight and post-flight token validation.

python
from elsai_guardrails.guardrails import TokenBudgetEnforcer, TokenBudgetExceededError

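# Option 1: load limits from YAML
enforcer = TokenBudgetEnforcer.from_config("config.yml")

# Option 2: configure programmatically (this replaces the enforcer created
# above; use only one option in real code)
enforcer = TokenBudgetEnforcer(
    max_total_tokens=4096,
    reserved_output_tokens=512,
    max_run_total_tokens=8192,
)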

messages = [{"role": "user", "content": "Hello"}]
user_input = "Hello"

try:
    # Pre-flight: validate input context before calling the LLM
    breakdown = enforcer.process_request(
        user_input,
        messages=messages,
        system_prompt="You are helpful.",
        retrieved_context="",  # optional RAG context
    )
    print("Within budget:", breakdown.within_budget)

    # Call your LLM (my_llm is a placeholder for your own client)
    raw_response = my_llm.invoke(messages)

    # Post-flight: validate response token usage
    usage, breakdown, run_usage = enforcer.process_response(
        raw_response,
        user_input,
        messages=messages,
    )
    print(usage.to_dict())

except TokenBudgetExceededError as exc:
    print(exc)
    print(exc.breakdown.to_dict())

When to use each approach:

| Approach | Best For |
| --- | --- |
| LLMRails | Standard guardrailed LLM calls with YAML-driven configuration |
| TokenBudgetEnforcer | Custom LLM integrations, RAG pipelines, or manual pre/post-flight control |

Programmatic vs YAML Parameter Mapping

When using TokenBudgetEnforcer programmatically, parameter names differ slightly from YAML:

| YAML (config.yml) | TokenBudgetEnforcer constructor |
| --- | --- |
| max_request_tokens | max_total_tokens |
| max_run_tokens | max_run_total_tokens |
| reserved_output_tokens | reserved_output_tokens |
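
If you already hold the `token_budget` section as a dictionary, the renaming can be done mechanically. This is a small sketch, with `YAML_TO_CONSTRUCTOR` and `to_constructor_kwargs` as hypothetical helper names. It only renames keys according to the mapping above and does not import the library.

```python
from typing import Any, Dict

# Mapping taken from the table above: YAML key -> constructor argument.
YAML_TO_CONSTRUCTOR = {
    "max_request_tokens": "max_total_tokens",
    "max_run_tokens": "max_run_total_tokens",
    "reserved_output_tokens": "reserved_output_tokens",
}


def to_constructor_kwargs(token_budget: Dict[str, Any]) -> Dict[str, Any]:
    """Rename YAML keys to constructor argument names, dropping other keys."""
    return {
        YAML_TO_CONSTRUCTOR[key]: value
        for key, value in token_budget.items()
        if key in YAML_TO_CONSTRUCTOR
    }


kwargs = to_constructor_kwargs(
    {
        "enabled": True,
        "max_request_tokens": 4096,
        "max_run_tokens": 16384,
        "reserved_output_tokens": 512,
    }
)
print(kwargs)
```

The resulting dictionary could then be unpacked into the constructor, for example `TokenBudgetEnforcer(**kwargs)`, assuming the constructor accepts exactly the argument names listed in the table.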

Use Cases

Cost Control

Set conservative per-request limits for high-volume applications:

yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 4096
    max_run_tokens: 16384
    reserved_output_tokens: 512

Short-Context Applications

Enforce tight limits for lightweight assistants or classification tasks:

yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10

Session-Level Limits

Use max_run_tokens to cap total usage across a multi-turn conversation:

yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 2048
    max_run_tokens: 8192
    reserved_output_tokens: 256
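
To illustrate how a run-level cap behaves over several turns, here is a hypothetical tracker. `RunBudget` is not part of the library, and counting each turn as input plus output tokens is an assumption about how run usage accumulates.

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Hypothetical tracker for cumulative token usage across a run."""

    max_run_tokens: int
    used: int = 0

    def remaining(self) -> int:
        # Tokens still available in this run, never negative.
        return max(self.max_run_tokens - self.used, 0)

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one turn's usage; return False if the turn would exceed the cap."""
        turn_total = input_tokens + output_tokens
        if self.used + turn_total > self.max_run_tokens:
            return False
        self.used += turn_total
        return True


budget = RunBudget(max_run_tokens=8192)
accepted = [budget.record(2000, 200) for _ in range(4)]
print(accepted, budget.remaining())
```

With an 8192-token cap and turns of 2200 tokens each, the first three turns are accepted and the fourth is refused, even though every individual turn is well under a 2048-token request limit plus its reply.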

Best Practices

  1. Account for output tokens — Set reserved_output_tokens to match your expected response length so input budget calculations leave room for the model reply.
  2. Size limits to your model — Align max_request_tokens with your LLM's context window minus headroom for the response.
  3. Monitor rejections — Track how often requests are rejected to tune limits without blocking legitimate use.
  4. Combine with other guardrails — Token budget enforcement works alongside PII/PHI Detection and content safety checks for comprehensive request validation.
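
For the monitoring practice, a simple tally of blocked results is often enough to start tuning limits. The class below is a hypothetical sketch that reads only the `blocked` and `block_reason` keys from a detailed result dictionary; it is not provided by the library.

```python
from collections import Counter
from typing import Any, Dict


class RejectionMonitor:
    """Hypothetical helper that tallies how often requests are blocked."""

    def __init__(self) -> None:
        self.total = 0
        self.reasons: Counter = Counter()

    def record(self, result: Dict[str, Any]) -> None:
        # Count every request; tally the reason only when it was blocked.
        self.total += 1
        if result.get("blocked"):
            self.reasons[result.get("block_reason", "unknown")] += 1

    def rejection_rate(self) -> float:
        # Fraction of recorded requests that were blocked (0.0 when empty).
        if self.total == 0:
            return 0.0
        return sum(self.reasons.values()) / self.total
```

A rising share of `token_budget_input` rejections suggests the request limit is too tight for real conversations, or that history should be trimmed before the call.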

Released under the MIT License.