v4.2.0 • OpenEnv Compatible • Production Ready

Train AI to Stop
Hallucinating

The production-grade RL environment for training and evaluating LLMs on hallucination avoidance. Built on 1M+ real-world examples across 38 benchmark datasets.

1M+
Examples
38
Datasets
9
Reward Components
3
Task Levels

Why HallucinationGuard?

Research-grade evaluation for grounded AI systems

🎯

Factual Grounding

Rewards answers derived strictly from provided context

🔬

9-Component Reward

Factual correctness, grounding, calibration, NLI entailment, BERTScore, and more

📊

Real-World Datasets

SQuAD, HotpotQA, HaluEval, TruthfulQA, FEVER, and 33 more

⚡

Fast API

RESTful endpoints with OpenEnv compliance

🧠

NLI-Powered

Detects entailment and contradiction semantically

🏆

Leaderboard

Compare model performance across tasks
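As a concrete illustration of how a multi-component reward can yield a single scalar, the per-component scores could be collapsed with a weighted sum. This is a minimal sketch assuming each component is normalized to [0, 1]; the component names and weights below are illustrative, not the environment's documented configuration:

```python
# Hypothetical aggregation of per-component reward signals into one
# scalar in [0, 1]. Names follow the feature list above; the weights
# are illustrative assumptions, not documented values.

REWARD_WEIGHTS = {
    "factual_correctness": 0.25,
    "grounding": 0.20,
    "calibration": 0.15,
    "nli_entailment": 0.20,
    "bertscore": 0.20,
}

def aggregate_reward(components: dict) -> float:
    """Weighted sum of component scores, each clipped to [0, 1]."""
    total = sum(
        REWARD_WEIGHTS[name] * max(0.0, min(1.0, score))
        for name, score in components.items()
        if name in REWARD_WEIGHTS
    )
    return round(total, 4)

scores = {
    "factual_correctness": 1.0,
    "grounding": 0.8,
    "calibration": 0.5,
    "nli_entailment": 0.9,
    "bertscore": 0.7,
}
print(aggregate_reward(scores))  # → 0.805
```

Clipping each score before weighting keeps a single runaway component from dominating the total.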

Three Difficulty Levels

Progressive curriculum from basic to adversarial

🟢

Task 1: Factual Grounding

Answer straightforward factual questions from a short context passage. Single-hop retrieval with unambiguous ground truth. Perfect for initial training.

Beginner
Datasets: SQuAD, BoolQ, ARC, OpenBookQA
🟡

Task 2: Multi-Hop Synthesis

Synthesize evidence from multiple sentences. Connect disparate facts without fabricating bridging information. Requires reasoning chains.

Intermediate
Datasets: HotpotQA, CoQA, NQ-Open, MS-MARCO
🔴

Task 3: Adversarial Resistance

Resist adversarial prompts designed to elicit hallucinations. Many questions are unanswerable — confident refusals are rewarded.

Advanced
Datasets: HaluEval, TruthfulQA, FEVER, AdversarialQA
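The three levels above map naturally onto the `/reset` endpoint, which accepts an optional difficulty and seed. A hedged sketch of building those request bodies; the exact field names (`difficulty`, `seed`) are assumptions based on the endpoint description, not a confirmed schema:

```python
# Build /reset request payloads for each curriculum level.
# Field names are assumed from "/reset ... optional difficulty
# and seed"; verify against the live API before relying on them.
import json

DIFFICULTIES = ("beginner", "intermediate", "advanced")

def reset_payload(difficulty: str, seed=None) -> str:
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    body = {"difficulty": difficulty}
    if seed is not None:
        body["seed"] = seed  # fixed seed -> reproducible episodes
    return json.dumps(body)

print(reset_payload("advanced", seed=42))
# → {"difficulty": "advanced", "seed": 42}
```

Pinning a seed per difficulty level makes curriculum runs reproducible when comparing models.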

Interactive Playground

Test the API directly in your browser


All Endpoints

Complete API reference at a glance

POST /reset Start a new episode with optional difficulty and seed
POST /step Submit an answer with confidence and source citation
GET /state Get current episode state, accuracy, and skill rating
GET /tasks List all 3 tasks with complete action schema
POST /grader Score a completed episode (returns 0.0–1.0)
POST /baseline Run built-in heuristic baseline agent
POST /batch/evaluate Evaluate multiple Q&A pairs in one request
GET /leaderboard View ranked model performance
GET /health Service health check
GET /datasets Dataset statistics and distribution
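Tying the reference together, a single episode runs as reset → step → grade. A minimal client sketch: the endpoint paths come from the table above, but `BASE_URL` and the JSON field names (`answer`, `confidence`, `citation`) are assumptions for illustration:

```python
# End-to-end episode sketch against a locally running service.
# Only the endpoint paths are taken from the reference above;
# BASE_URL and the /step body fields are assumed.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default host/port

def step_payload(answer: str, confidence: float, citation: str) -> dict:
    """Body for POST /step: an answer with confidence and source citation."""
    return {"answer": answer, "confidence": confidence, "citation": citation}

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    post("/reset", {"difficulty": "beginner", "seed": 0})
    result = post("/step", step_payload("Paris", 0.9, "context sentence 2"))
    print(result)
```

For offline scoring of many answers at once, `/batch/evaluate` accepts multiple Q&A pairs in one request instead of stepping episode by episode.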