v4.2.0 • OpenEnv Compatible • Production Ready

Train AI to Stop
Hallucinating

The production-grade RL environment for training and evaluating LLMs on hallucination avoidance. Built on 1M+ real-world examples across 38 benchmark datasets.

1M+
Examples
38
Datasets
9
Reward Components
3
Task Levels

Why HallucinationGuard?

Research-grade evaluation for grounded AI systems

🎯

Factual Grounding

Rewards answers derived strictly from provided context

🔬

9-Component Reward

Factual correctness, grounding, calibration, NLI entailment, BERTScore, and more

📊

Real-World Datasets

SQuAD, HotpotQA, HaluEval, TruthfulQA, FEVER, and 33 more

⚡

Fast API

RESTful endpoints with OpenEnv compliance

🧠

NLI-Powered

Detects entailment and contradiction semantically

🏆

Leaderboard

Compare model performance across tasks
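As a concrete illustration of how a multi-component reward can yield a single scalar, the per-component scores could be collapsed with a weighted sum. This is a minimal sketch assuming each component is normalized to [0, 1]; the component names and weights below are illustrative, not the environment's documented configuration:

```python
# Hypothetical aggregation of per-component reward signals into one
# scalar in [0, 1]. Names follow the feature list above; the weights
# are illustrative assumptions, not documented values.

REWARD_WEIGHTS = {
    "factual_correctness": 0.25,
    "grounding": 0.20,
    "calibration": 0.15,
    "nli_entailment": 0.20,
    "bertscore": 0.20,
}

def aggregate_reward(components: dict) -> float:
    """Weighted sum of component scores, each clipped to [0, 1]."""
    total = sum(
        REWARD_WEIGHTS[name] * max(0.0, min(1.0, score))
        for name, score in components.items()
        if name in REWARD_WEIGHTS
    )
    return round(total, 4)

scores = {
    "factual_correctness": 1.0,
    "grounding": 0.8,
    "calibration": 0.5,
    "nli_entailment": 0.9,
    "bertscore": 0.7,
}
print(aggregate_reward(scores))  # → 0.805
```

Clipping each score before weighting keeps a single runaway component from dominating the total.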

Three Difficulty Levels

Progressive curriculum from basic to adversarial

🟢

Task 1: Factual Grounding

Answer straightforward factual questions from a short context passage. Single-hop retrieval with unambiguous ground truth. Perfect for initial training.

Beginner
Datasets: SQuAD, BoolQ, ARC, OpenBookQA
🟡

Task 2: Multi-Hop Synthesis

Synthesize evidence from multiple sentences. Connect disparate facts without fabricating bridging information. Requires reasoning chains.

Intermediate
Datasets: HotpotQA, CoQA, NQ-Open, MS-MARCO
🔴

Task 3: Adversarial Resistance

Resist adversarial prompts designed to elicit hallucinations. Many questions are unanswerable — confident refusals are rewarded.

Advanced
Datasets: HaluEval, TruthfulQA, FEVER, AdversarialQA
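The three levels above map naturally onto the `/reset` endpoint, which accepts an optional difficulty and seed. A hedged sketch of building those request bodies; the exact field names (`difficulty`, `seed`) are assumptions based on the endpoint description, not a confirmed schema:

```python
# Build /reset request payloads for each curriculum level.
# Field names are assumed from "/reset ... optional difficulty
# and seed"; verify against the live API before relying on them.
import json

DIFFICULTIES = ("beginner", "intermediate", "advanced")

def reset_payload(difficulty: str, seed=None) -> str:
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    body = {"difficulty": difficulty}
    if seed is not None:
        body["seed"] = seed  # fixed seed -> reproducible episodes
    return json.dumps(body)

print(reset_payload("advanced", seed=42))
# → {"difficulty": "advanced", "seed": 42}
```

Pinning a seed per difficulty level makes curriculum runs reproducible when comparing models.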

Interactive Playground

Test the API directly in your browser


All Endpoints

Complete API reference at a glance

POST /reset Start a new episode with optional difficulty and seed
POST /step Submit an answer with confidence and source citation
GET /state Get current episode state, accuracy, and skill rating
GET /tasks List all 3 tasks with complete action schema
POST /grader Score a completed episode (returns 0.0–1.0)
POST /baseline Run built-in heuristic baseline agent
POST /batch/evaluate Evaluate multiple Q&A pairs in one request
GET /leaderboard View ranked model performance
GET /health Service health check
GET /datasets Dataset statistics and distribution
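Tying the reference together, a single episode runs as reset → step → grade. A minimal client sketch: the endpoint paths come from the table above, but `BASE_URL` and the JSON field names (`answer`, `confidence`, `citation`) are assumptions for illustration:

```python
# End-to-end episode sketch against a locally running service.
# Only the endpoint paths are taken from the reference above;
# BASE_URL and the /step body fields are assumed.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default host/port

def step_payload(answer: str, confidence: float, citation: str) -> dict:
    """Body for POST /step: an answer with confidence and source citation."""
    return {"answer": answer, "confidence": confidence, "citation": citation}

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    post("/reset", {"difficulty": "beginner", "seed": 0})
    result = post("/step", step_payload("Paris", 0.9, "context sentence 2"))
    print(result)
```

For offline scoring of many answers at once, `/batch/evaluate` accepts multiple Q&A pairs in one request instead of stepping episode by episode.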