Datasets & Benchmarks
32 datasets and benchmarks for LLM evaluation.
| Name | Description | Category |
|---|---|---|
| Chatbot Arena / LM Arena | 6M+ user votes for Elo-rated pairwise LLM comparisons. De facto standard for human-preference evaluation. | Major Benchmarks (2024–2026) |
| MMLU-Pro | 12,000+ graduate-level questions across 14 domains. NeurIPS 2024 Spotlight. | Major Benchmarks (2024–2026) |
| GPQA | 448 "Google-proof" STEM questions; non-expert validators achieve only 34% accuracy. | Major Benchmarks (2024–2026) |
| SWE-bench Verified | Human-validated 500-task subset for real-world GitHub issue resolution. | Major Benchmarks (2024–2026) |
| SWE-bench Pro | 1,865 tasks across 41 professional repos; best models score only ~23%. | Major Benchmarks (2024–2026) |
| Humanity's Last Exam (HLE) | 2,500 expert-vetted questions; top AI systems score only ~10–30%. | Major Benchmarks (2024–2026) |
| BigCodeBench | 1,140 coding tasks across 7 domains; AI achieves ~35.5% vs. 97% human success. | Major Benchmarks (2024–2026) |
| LiveBench | Contamination-resistant benchmark with frequently updated questions. | Major Benchmarks (2024–2026) |
| FrontierMath | Research-level math; AI solves only ~2% of problems. | Major Benchmarks (2024–2026) |
| ARC-AGI v2 | Abstract reasoning benchmark measuring fluid intelligence. | Major Benchmarks (2024–2026) |
| IFEval | Instruction-following evaluation with formatting and content constraints. | Major Benchmarks (2024–2026) |
| MLE-bench | OpenAI's ML engineering evaluation via Kaggle-style tasks. | Major Benchmarks (2024–2026) |
| PaperBench | Evaluates AI's ability to replicate 20 ICML 2024 papers from scratch. | Major Benchmarks (2024–2026) |
| Hugging Face Open LLM Leaderboard v2 | Evaluates open models on MMLU-Pro, GPQA, IFEval, and MATH. | Leaderboards and Meta-Benchmarks |
| Artificial Analysis Intelligence Index v3 | Composite index aggregating 10 evaluations into a single score. | Leaderboards and Meta-Benchmarks |
| SEAL by Scale AI | Hosts SWE-bench Pro and agentic evaluations. | Leaderboards and Meta-Benchmarks |
| P3 (Public Pool of Prompts) | Prompt templates for 270+ NLP tasks, used to train T0 and similar models. | Prompt and Instruction Datasets |
| System Prompts Dataset | 944 system prompt templates for agent workflows (by Daniel Rosehill, Aug 2025). | Prompt and Instruction Datasets |
| OpenAssistant Conversations (OASST) | 161,443 messages in 35 languages with 461,292 quality ratings. | Prompt and Instruction Datasets |
| UltraChat / UltraFeedback | Large-scale synthetic instruction and preference datasets for alignment training. | Prompt and Instruction Datasets |
| SoftAge Prompt Engineering Dataset | 1,000 diverse prompts across 10 categories for benchmarking prompt performance. | Prompt and Instruction Datasets |
| Text Transformation Prompt Library | Comprehensive collection of text transformation prompts (May 2025). | Prompt and Instruction Datasets |
| Writing Prompts | ~300K human-written stories paired with prompts from r/WritingPrompts. | Prompt and Instruction Datasets |
| Midjourney Prompts | Text prompts and image URLs scraped from Midjourney's public Discord. | Prompt and Instruction Datasets |
| CodeAlpaca-20k | 20,000 programming instruction-output pairs. | Prompt and Instruction Datasets |
| ProPEX-RAG | Dataset for prompt optimization in RAG workflows. | Prompt and Instruction Datasets |
| NanoBanana Trending Prompts | 1,000+ curated AI image prompts from X/Twitter, ranked by engagement. | Prompt and Instruction Datasets |
| HarmBench | 510 harmful behaviors across standard, contextual, copyright, and multimodal categories. | Red Teaming and Adversarial Datasets |
| JailbreakBench | Open robustness benchmark for jailbreaking with 100 prompts. | Red Teaming and Adversarial Datasets |
| AgentHarm | 110 malicious agent tasks across 11 harm categories. | Red Teaming and Adversarial Datasets |
| DecodingTrust | 243,877 prompts evaluating trustworthiness across 8 perspectives. | Red Teaming and Adversarial Datasets |
| SafetyPrompts.com | Aggregator tracking 50+ safety and red-teaming datasets. | Red Teaming and Adversarial Datasets |
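Chatbot Arena's rankings are built from pairwise votes scored with an Elo-style rating system. A minimal sketch of the standard Elo update, assuming illustrative constants (a 1000-point starting rating and K=32, not LM Arena's exact methodology, which uses a Bradley-Terry fit over all votes):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains, loser loses, in proportion to how surprising the result was.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start even at 1000; model A wins one head-to-head vote.
a, b = elo_update(1000.0, 1000.0, a_won=True)
# Between evenly rated models, each vote moves both ratings by k/2 = 16 points.
```

Applied over millions of votes, ratings converge so that a 400-point gap corresponds to roughly a 10:1 expected win rate.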