Token Budget Enforcement
Compute token usage across the full request context and reject oversized requests before they reach the LLM.
Overview
Token budget enforcement prevents runaway context growth and protects against requests that exceed model or application limits. The guardrail calculates token usage from the complete context—including system prompts, conversation history, and the current user message—and compares it against configurable limits before processing begins.
Use this guardrail to control cost, stay within model context windows, and ensure predictable resource usage in production applications.
How It Works
- The guardrail receives the full request context (system messages, history, and user input).
- Token usage is computed for the entire context, not just the latest message.
- The total is compared against max_request_tokens and max_run_tokens.
- Output capacity is reserved via reserved_output_tokens before evaluating the request budget.
- Requests that exceed configured limits are rejected before LLM invocation.
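The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: BudgetConfig, count_tokens, and preflight_check are hypothetical names, the whitespace tokenizer stands in for a real model tokenizer, and subtracting the reserved output from max_request_tokens is an assumption inferred from the description.

```python
from dataclasses import dataclass


@dataclass
class BudgetConfig:
    """Hypothetical mirror of the YAML options; not the library's own class."""
    max_request_tokens: int
    max_run_tokens: int
    reserved_output_tokens: int


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def preflight_check(messages, config, run_tokens_used=0):
    """Return (allowed, reason) for a full request context.

    Assumes the reserved output is subtracted from max_request_tokens.
    """
    # Count tokens over the whole context, not just the latest message.
    context_tokens = sum(count_tokens(m["content"]) for m in messages)

    # Reserve room for the model reply before evaluating the input.
    input_budget = config.max_request_tokens - config.reserved_output_tokens

    # Compare against both limits; a False result means "do not call the LLM".
    if context_tokens > input_budget:
        return False, "request budget exceeded"
    if run_tokens_used + context_tokens > config.max_run_tokens:
        return False, "run budget exceeded"
    return True, "within budget"


config = BudgetConfig(max_request_tokens=50, max_run_tokens=80, reserved_output_tokens=10)
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
]
print(preflight_check(messages, config))
```

A real deployment would swap count_tokens for the tokenizer that matches the target model, since word counts can differ substantially from model token counts.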
Configuration
Enable Token Budget Enforcement
yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10
Parameters
| Option | Type | Description |
|---|---|---|
| enabled | bool | Enable token budget enforcement |
| input_checks | bool | Enforce limits on incoming requests |
| output_checks | bool | Enforce limits on model output |
| max_request_tokens | int | Maximum tokens allowed for a single request context |
| max_run_tokens | int | Maximum total tokens allowed for an entire run or session |
| reserved_output_tokens | int | Tokens reserved for the model response when evaluating input budget |
Complete Example
yaml
guardrails:
  input_checks: true
  output_checks: true
  check_toxicity: true
  check_sensitive_data: true
  check_semantic: true
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10
Full Context Calculation
Token counts include the entire context passed to the guardrail:
- System prompts and instructions
- Full conversation history
- Current user message
- Any additional context injected by the application
This ensures that accumulated history cannot silently exceed your budget. A request with a short latest message but a long conversation history will still be evaluated against the full token count.
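To make the short-message, long-history case concrete, the sketch below counts tokens per context component. It is illustrative only: context_token_breakdown is a hypothetical helper, and the whitespace tokenizer is a rough stand-in for whatever tokenizer the guardrail actually uses.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def context_token_breakdown(system_prompt, history, user_message, extra_context=""):
    """Count tokens per context component and in total (hypothetical helper)."""
    breakdown = {
        "system": count_tokens(system_prompt),
        "history": sum(count_tokens(m["content"]) for m in history),
        "user": count_tokens(user_message),
        "extra": count_tokens(extra_context),
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# A one-word latest message on top of a long conversation history.
history = [
    {"role": "user", "content": "please summarize " * 10},
    {"role": "assistant", "content": "here is the summary " * 5},
]
result = context_token_breakdown("You are helpful.", history, "Thanks")
print(result)
```

Here the latest message contributes a single token, yet the total is dominated by the history, which is why the budget has to be evaluated against the full context.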
Rejecting Oversized Requests
When a request exceeds max_request_tokens (after accounting for reserved_output_tokens) or max_run_tokens, the guardrail rejects it before the LLM is called. The caller receives a clear failure indicating the request exceeded the configured budget.
Typical rejection scenarios:
- A single request context exceeds max_request_tokens
- Cumulative usage across a run exceeds max_run_tokens
- Insufficient budget remains after reserving tokens for the expected response
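The cumulative scenario is the easiest to overlook, because every individual request can fit its own limit while the run as a whole does not. The following sketch shows one way such tracking could work. RunBudget and try_consume are hypothetical names, and exactly which tokens the library counts toward a run (input, output, or both) is not specified here, so treat the accounting as an assumption.

```python
class RunBudget:
    """Hypothetical tracker for cumulative token usage across a run."""

    def __init__(self, max_run_tokens: int):
        self.max_run_tokens = max_run_tokens
        self.used = 0

    def try_consume(self, tokens: int) -> bool:
        """Record usage if it fits in the remaining run budget; otherwise reject."""
        if self.used + tokens > self.max_run_tokens:
            return False  # rejected: nothing is recorded
        self.used += tokens
        return True


run = RunBudget(max_run_tokens=80)
# Each turn is small on its own; the third one would push the run past its cap.
outcomes = [run.try_consume(tokens) for tokens in (30, 30, 30)]
print(outcomes, run.used)
```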
Input vs Output Checks
Control where enforcement applies:
yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true   # Reject oversized input contexts
    output_checks: true  # Validate output token usage
For most applications, enabling input checks alone is sufficient to prevent oversized requests from reaching the model.
Usage
Token budget enforcement supports two integration approaches:
- LLMRails: Automatic enforcement when token_budget.enabled: true is set in your YAML config. No separate import is required beyond LLMRails and RailsConfig.
- TokenBudgetEnforcer: Standalone class for custom LLM pipelines with pre-flight and post-flight token checks.
With LLMRails (Recommended)
When token budget is enabled in your configuration, LLMRails applies enforcement automatically during generate() and generate_async(). Input checks run before the LLM is called; output checks run after the response is received.
python
from elsai_guardrails.guardrails import LLMRails, RailsConfig
config = RailsConfig.from_content(config_path="config.yml")
rails = LLMRails(config=config)
messages = [{"role": "user", "content": "Hello"}]
result = rails.generate(messages, return_details=True)
if result["blocked"]:
    print(result["block_reason"])  # token_budget_input | token_budget_output
    print(result["token_budget_input_check"])
    print(result["token_budget_output_check"])
else:
    print(result["final_response"])
Ensure your config.yml includes:
yaml
guardrails:
  token_budget:
    enabled: true
    input_checks: true
    output_checks: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10
Block reasons when token budget is exceeded:
| block_reason | Description |
|---|---|
| token_budget_input | Request context exceeded the input token budget |
| token_budget_output | Response exceeded the output or run token budget |
Additional result fields when return_details=True:
| Field | Description |
|---|---|
| token_budget_input_check | Input token budget check result (present when input checks are enabled) |
| token_budget_output_check | Output token budget check result (present when output checks are enabled) |
See LLMRails for full result structure details.
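As an illustration of how an application might act on these fields, the sketch below maps a detailed result to a short user-facing message. describe_result is a hypothetical helper, the message wording is invented for the example, and the code assumes the result behaves like a dictionary with the keys shown above. Plain dictionaries stand in for real results so the snippet runs without the library.

```python
def describe_result(result: dict) -> str:
    """Turn a detailed result into a short message (hypothetical helper)."""
    if not result.get("blocked"):
        return result.get("final_response", "")

    reason = result.get("block_reason")
    if reason == "token_budget_input":
        return "Request too large: shorten the message or start a new conversation."
    if reason == "token_budget_output":
        return "Response exceeded the output or run token budget."
    # Other guardrails may block with reasons not covered on this page.
    return f"Request blocked ({reason})."


# Plain dictionaries standing in for real LLMRails results.
blocked = {"blocked": True, "block_reason": "token_budget_input"}
allowed = {"blocked": False, "final_response": "Hi there!"}
print(describe_result(blocked))
print(describe_result(allowed))
```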
With TokenBudgetEnforcer (Standalone)
Use TokenBudgetEnforcer when you manage LLM invocation yourself and need explicit pre-flight and post-flight token validation.
python
from elsai_guardrails.guardrails import TokenBudgetEnforcer, TokenBudgetExceededError
# From YAML
enforcer = TokenBudgetEnforcer.from_config("config.yml")
# Or programmatically
enforcer = TokenBudgetEnforcer(
    max_total_tokens=4096,
    reserved_output_tokens=512,
    max_run_total_tokens=8192,
)
messages = [{"role": "user", "content": "Hello"}]
user_input = "Hello"
try:
    # Pre-flight: validate input context before calling the LLM
    breakdown = enforcer.process_request(
        user_input,
        messages=messages,
        system_prompt="You are helpful.",
        retrieved_context="",  # optional RAG context
    )
    print("Within budget:", breakdown.within_budget)
    # Call your LLM (my_llm is a placeholder for your own client and is not defined here)
    raw_response = my_llm.invoke(messages)
    # Post-flight: validate response token usage
    usage, breakdown, run_usage = enforcer.process_response(
        raw_response,
        user_input,
        messages=messages,
    )
    print(usage.to_dict())
except TokenBudgetExceededError as exc:
    # Raised when either the pre-flight or post-flight check exceeds the budget
    print(exc)
    print(exc.breakdown.to_dict())
When to use each approach:
| Approach | Best For |
|---|---|
| LLMRails | Standard guardrailed LLM calls with YAML-driven configuration |
| TokenBudgetEnforcer | Custom LLM integrations, RAG pipelines, or manual pre/post-flight control |
Programmatic vs YAML Parameter Mapping
When using TokenBudgetEnforcer programmatically, parameter names differ slightly from YAML:
| YAML (config.yml) | TokenBudgetEnforcer constructor |
|---|---|
| max_request_tokens | max_total_tokens |
| max_run_tokens | max_run_total_tokens |
| reserved_output_tokens | reserved_output_tokens |
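If the same limits need to be shared between a YAML file and a programmatic enforcer, the mapping can be applied mechanically. The helper below is a hypothetical convenience, not part of the library. It only renames the three keys listed in the table and ignores everything else in the token_budget section.

```python
# Mapping taken from the table above: YAML key -> constructor keyword.
YAML_TO_CONSTRUCTOR = {
    "max_request_tokens": "max_total_tokens",
    "max_run_tokens": "max_run_total_tokens",
    "reserved_output_tokens": "reserved_output_tokens",
}


def to_enforcer_kwargs(token_budget_cfg: dict) -> dict:
    """Translate a parsed token_budget section into constructor keywords.

    Hypothetical helper; keys outside the table (enabled, input_checks, ...) are skipped.
    """
    return {
        ctor_name: token_budget_cfg[yaml_name]
        for yaml_name, ctor_name in YAML_TO_CONSTRUCTOR.items()
        if yaml_name in token_budget_cfg
    }


cfg = {
    "enabled": True,
    "max_request_tokens": 4096,
    "max_run_tokens": 8192,
    "reserved_output_tokens": 512,
}
kwargs = to_enforcer_kwargs(cfg)
print(kwargs)
```

The resulting dictionary could then be passed on as TokenBudgetEnforcer(**kwargs), matching the programmatic constructor call shown earlier.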
Use Cases
Cost Control
Set conservative per-request limits for high-volume applications:
yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 4096
    max_run_tokens: 16384
    reserved_output_tokens: 512
Short-Context Applications
Enforce tight limits for lightweight assistants or classification tasks:
yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 50
    max_run_tokens: 80
    reserved_output_tokens: 10
Session-Level Limits
Use max_run_tokens to cap total usage across a multi-turn conversation:
yaml
guardrails:
  token_budget:
    enabled: true
    max_request_tokens: 2048
    max_run_tokens: 8192
    reserved_output_tokens: 256
Best Practices
- Account for output tokens: Set reserved_output_tokens to match your expected response length so input budget calculations leave room for the model reply.
- Size limits to your model: Align max_request_tokens with your LLM's context window minus headroom for the response.
- Monitor rejections: Track how often requests are rejected to tune limits without blocking legitimate use.
- Combine with other guardrails: Token budget enforcement works alongside PII/PHI Detection and content safety checks for comprehensive request validation.
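For the rejection-monitoring practice, a simple in-memory counter is often enough to start with. The sketch below is one possible shape, not a library feature: RejectionMonitor is a hypothetical class, and it assumes results expose blocked and block_reason as in the LLMRails usage example. Plain dictionaries stand in for real results.

```python
from collections import Counter


class RejectionMonitor:
    """Hypothetical in-memory counter of outcomes keyed by block_reason."""

    def __init__(self):
        self.counts = Counter()

    def record(self, result: dict) -> None:
        # Assumes the result exposes "blocked" and "block_reason".
        key = result["block_reason"] if result.get("blocked") else "allowed"
        self.counts[key] += 1

    def rejection_rate(self) -> float:
        """Fraction of recorded results that were blocked for any reason."""
        total = sum(self.counts.values())
        return 0.0 if total == 0 else 1 - self.counts["allowed"] / total


monitor = RejectionMonitor()
for r in [
    {"blocked": False, "final_response": "ok"},
    {"blocked": True, "block_reason": "token_budget_input"},
    {"blocked": False, "final_response": "ok"},
    {"blocked": True, "block_reason": "token_budget_input"},
]:
    monitor.record(r)
print(monitor.counts["token_budget_input"], monitor.rejection_rate())
```

A persistently high rate for token_budget_input would suggest the request limit is too tight for legitimate traffic, while a rate near zero may mean the limits are looser than needed.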
Next Steps
- PII/PHI Detection and Data Masking — Detect and mask sensitive information
- Input Rails — How input validation works
- LLMRails — Integrated LLM and guardrail usage
- Guardrails Configuration — Full configuration reference
- YAML Configuration — Complete configuration examples
