Toxicity Detection

Toxicity detection identifies toxic, offensive, or harmful content in text.

Overview

The toxicity detection system classifies content into three categories:

  • Toxic: Clearly toxic or harmful content
  • Offensive: Offensive but less severe content
  • Non-toxic: Safe content
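
As an illustration, an application might map these labels onto its own handling policy. The mapping below is a hypothetical application-level convention (the labels are the ones described under Result Format), not part of the library:

python
# Hypothetical policy keyed on the labels returned by the toxicity check
# (see Result Format below); adjust to your own requirements.
POLICY = {
    "toxic": "block",        # clearly toxic or harmful content is rejected
    "offensive": "review",   # less severe content is flagged for human review
    "non-toxic": "allow",    # safe content passes through
}

def handle(label: str) -> str:
    # Unknown labels fall back to the most conservative action.
    return POLICY.get(label, "block")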

Configuration

Enable toxicity detection:

yaml
guardrails:
  check_toxicity: true
  toxicity_threshold: 0.7
  block_toxic: true

Parameters

  • check_toxicity: Enable/disable toxicity detection
  • toxicity_threshold: Confidence threshold for blocking (0.0-1.0)
  • block_toxic: Whether to block toxic content
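
These parameters can also be combined for a detect-only setup, for example keeping check_toxicity on while turning blocking off. A minimal sketch using the GuardrailConfig constructor shown under Standalone Check below; exactly how results are surfaced when block_toxic is false is up to the library:

python
from elsai_guardrails.guardrails import GuardrailSystem, GuardrailConfig

# Detect-only setup: toxicity is still scored, but block_toxic=False means
# content is not rejected outright (useful for logging or shadow testing).
config = GuardrailConfig(
    check_toxicity=True,
    toxicity_threshold=0.7,
    block_toxic=False,
)
guardrail = GuardrailSystem(config=config)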

Usage

Automatic Detection

When enabled, toxicity detection runs automatically on each request:

python
from elsai_guardrails.guardrails import LLMRails, RailsConfig

yaml_content = """
llm:
  engine: "openai"
  model: "gpt-4o-mini"
  api_key: "sk-..."

guardrails:
  check_toxicity: true
  toxicity_threshold: 0.7
  block_toxic: true
"""

config = RailsConfig.from_content(yaml_content=yaml_content)
rails = LLMRails(config=config)

result = rails.generate(
    messages=[{"role": "user", "content": "test"}],
    return_details=True
)

if result.get('input_check'):
    toxicity = result['input_check'].toxicity
    print(f"Label: {toxicity.get('label')}")
    print(f"Confidence: {toxicity.get('confidence')}")

Standalone Check

python
from elsai_guardrails.guardrails import GuardrailSystem, GuardrailConfig

config = GuardrailConfig(
    check_toxicity=True,
    toxicity_threshold=0.7,
    block_toxic=True
)
guardrail = GuardrailSystem(config=config)

result = guardrail.check_text("test message")
print(f"Toxicity detected: {result.passed}")

Result Format

Toxicity results are returned in the toxicity field:

python
{
    'label': 'toxic',  # or 'offensive', 'non-toxic'
    'confidence': 0.85  # Confidence score (0.0-1.0)
}
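
How the confidence score interacts with toxicity_threshold is decided inside the library. As a rough mental model only (the exact comparison is an assumption here):

python
def would_block(label: str, confidence: float, threshold: float = 0.7) -> bool:
    # Rough mental model: content is blocked when it is not labelled
    # non-toxic and the classifier's confidence reaches the threshold.
    # The library's internal comparison may differ in detail.
    return label != 'non-toxic' and confidence >= threshold

print(would_block('toxic', 0.85))      # True with the default threshold of 0.7
print(would_block('toxic', 0.55))      # False: below the confidence threshold
print(would_block('non-toxic', 0.99))  # False: safe content is not blocked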

Threshold Tuning

Adjust the threshold based on your needs:

  • Lower threshold (0.5): More strict, blocks more content
  • Higher threshold (0.9): More permissive, blocks less content
  • Default (0.7): Balanced approach

yaml
guardrails:
  toxicity_threshold: 0.7  # Adjust as needed
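
To find a threshold that suits your data, one option is to sweep a few candidate values over a small labelled sample and compare how much each setting blocks. A sketch using the standalone checker; sample_texts is a placeholder for your own data:

python
from elsai_guardrails.guardrails import GuardrailSystem, GuardrailConfig

# Placeholder sample; replace with representative texts from your application.
sample_texts = [
    "Hello, how are you?",
    "offensive content here",
]

for threshold in (0.5, 0.7, 0.9):
    config = GuardrailConfig(
        check_toxicity=True,
        toxicity_threshold=threshold,
        block_toxic=True,
    )
    guardrail = GuardrailSystem(config=config)
    blocked = sum(1 for text in sample_texts if not guardrail.check_text(text).passed)
    print(f"threshold={threshold}: blocked {blocked}/{len(sample_texts)}")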

Examples

Non-toxic Content

python
result = guardrail.check_text("Hello, how are you?")
print(result.passed)  # True

Toxic Content

python
result = guardrail.check_text("offensive content here")
print(result.passed)  # False
print(result.message)  # "Toxic content detected."

Best Practices

  1. Start with default threshold (0.7) and adjust based on results
  2. Monitor false positives and adjust the threshold accordingly (see the logging sketch after this list)
  3. Consider your use case - some applications may need stricter thresholds
  4. Test with your data to find the optimal threshold
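
For point 2, a lightweight way to spot false positives is to log every failed check together with the text that triggered it and review the log periodically. A minimal sketch built on the standalone checker from earlier; the logging setup is illustrative:

python
import logging

logger = logging.getLogger("toxicity_monitoring")
logging.basicConfig(level=logging.INFO)

def check_and_log(guardrail, text: str) -> bool:
    # Wrap the standalone check so that every blocked message is logged for
    # later review; recurring false positives suggest raising the threshold.
    result = guardrail.check_text(text)
    if not result.passed:
        logger.info("Blocked: %r (%s)", text, result.message)
    return result.passed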
