Datasets & Benchmarks
32 datasets and benchmarks for LLM evaluation.
| Name | Description | Category |
|---|---|---|
| Chatbot Arena / LM Arena | 6M+ user votes for Elo-rated pairwise LLM comparisons. De facto standard for human-preference evaluation. | Major Benchmarks (2024–2026) |
| MMLU-Pro | 12,000+ graduate-level questions across 14 domains. NeurIPS 2024 Spotlight. | Major Benchmarks (2024–2026) |
| GPQA | 448 "Google-proof" STEM questions; non-expert validators achieve only 34% accuracy. | Major Benchmarks (2024–2026) |
| SWE-bench Verified | Human-validated 500-task subset for real-world GitHub issue resolution. | Major Benchmarks (2024–2026) |
| SWE-bench Pro | 1,865 tasks across 41 professional repos; best models score only ~23%. | Major Benchmarks (2024–2026) |
| Humanity's Last Exam (HLE) | 2,500 expert-vetted questions; top AI systems score only ~10–30%. | Major Benchmarks (2024–2026) |
| BigCodeBench | 1,140 coding tasks across 7 domains; AI achieves ~35.5% vs. 97% human success. | Major Benchmarks (2024–2026) |
| LiveBench | Contamination-resistant benchmark with frequently updated questions. | Major Benchmarks (2024–2026) |
| FrontierMath | Research-level math; AI solves only ~2% of problems. | Major Benchmarks (2024–2026) |
| ARC-AGI v2 | Abstract reasoning benchmark measuring fluid intelligence. | Major Benchmarks (2024–2026) |
| IFEval | Instruction-following evaluation with formatting and content constraints. | Major Benchmarks (2024–2026) |
| MLE-bench | OpenAI's ML engineering evaluation via Kaggle-style tasks. | Major Benchmarks (2024–2026) |
| PaperBench | Evaluates AI's ability to replicate 20 ICML 2024 papers from scratch. | Major Benchmarks (2024–2026) |
| Hugging Face Open LLM Leaderboard v2 | Evaluates open models on MMLU-Pro, GPQA, IFEval, and MATH. | Leaderboards and Meta-Benchmarks |
| Artificial Analysis Intelligence Index v3 | Composite index aggregating 10 evaluations into a single score. | Leaderboards and Meta-Benchmarks |
| SEAL by Scale AI | Hosts SWE-bench Pro and agentic evaluations. | Leaderboards and Meta-Benchmarks |
| P3 (Public Pool of Prompts) | Prompt templates for 270+ NLP tasks, used to train T0 and similar models. | Prompt and Instruction Datasets |
| System Prompts Dataset | 944 system prompt templates for agent workflows (by Daniel Rosehill, Aug 2025). | Prompt and Instruction Datasets |
| OpenAssistant Conversations (OASST) | 161,443 messages in 35 languages with 461,292 quality ratings. | Prompt and Instruction Datasets |
| UltraChat / UltraFeedback | Large-scale synthetic instruction and preference datasets for alignment training. | Prompt and Instruction Datasets |
| SoftAge Prompt Engineering Dataset | 1,000 diverse prompts across 10 categories for benchmarking prompt performance. | Prompt and Instruction Datasets |
| Text Transformation Prompt Library | Comprehensive collection of text transformation prompts (May 2025). | Prompt and Instruction Datasets |
| Writing Prompts | ~300K human-written stories paired with prompts from r/WritingPrompts. | Prompt and Instruction Datasets |
| Midjourney Prompts | Text prompts and image URLs scraped from Midjourney's public Discord. | Prompt and Instruction Datasets |
| CodeAlpaca-20k | 20,000 programming instruction-output pairs. | Prompt and Instruction Datasets |
| ProPEX-RAG | Dataset for prompt optimization in RAG workflows. | Prompt and Instruction Datasets |
| NanoBanana Trending Prompts | 1,000+ curated AI image prompts from X/Twitter, ranked by engagement. | Prompt and Instruction Datasets |
| HarmBench | 510 harmful behaviors across standard, contextual, copyright, and multimodal categories. | Red Teaming and Adversarial Datasets |
| JailbreakBench | Open robustness benchmark for jailbreaking with 100 prompts. | Red Teaming and Adversarial Datasets |
| AgentHarm | 110 malicious agent tasks across 11 harm categories. | Red Teaming and Adversarial Datasets |
| DecodingTrust | 243,877 prompts evaluating trustworthiness across 8 perspectives. | Red Teaming and Adversarial Datasets |
| SafetyPrompts.com | Aggregator tracking 50+ safety and red-teaming datasets. | Red Teaming and Adversarial Datasets |
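Chatbot Arena's rankings are built from pairwise votes scored with an Elo-style rating system. A minimal sketch of the standard Elo update, assuming illustrative constants (a 1000-point starting rating and K=32, not LM Arena's exact methodology, which uses a Bradley-Terry fit over all votes):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains, loser loses, in proportion to how surprising the result was.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start even at 1000; model A wins one head-to-head vote.
a, b = elo_update(1000.0, 1000.0, a_won=True)
# Between evenly rated models, each vote moves both ratings by k/2 = 16 points.
```

Applied over millions of votes, ratings converge so that a 400-point gap corresponds to roughly a 10:1 expected win rate.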