PromptsLab

Datasets & Benchmarks

32 datasets and benchmarks for LLM evaluation, plus 15 AI content detectors.

| Name | Description | Category |
| --- | --- | --- |
| Chatbot Arena / LM Arena | 6M+ user votes for Elo-rated pairwise LLM comparisons. De facto standard for human preference. | Major Benchmarks (2024–2026) |
| MMLU-Pro | 12,000+ graduate-level questions across 14 domains. NeurIPS 2024 Spotlight. | Major Benchmarks (2024–2026) |
| GPQA | 448 "Google-proof" STEM questions; non-expert validators achieve only 34%. | Major Benchmarks (2024–2026) |
| SWE-bench Verified | Human-validated 500-task subset for real-world GitHub issue resolution. | Major Benchmarks (2024–2026) |
| SWE-bench Pro | 1,865 tasks across 41 professional repos; best models score only ~23%. | Major Benchmarks (2024–2026) |
| Humanity's Last Exam (HLE) | 2,500 expert-vetted questions; top AI scores only ~10–30%. | Major Benchmarks (2024–2026) |
| BigCodeBench | 1,140 coding tasks across 7 domains; AI achieves ~35.5% vs. 97% human success. | Major Benchmarks (2024–2026) |
| LiveBench | Contamination-resistant with frequently updated questions. | Major Benchmarks (2024–2026) |
| FrontierMath | Research-level math; AI solves only ~2% of problems. | Major Benchmarks (2024–2026) |
| ARC-AGI v2 | Abstract reasoning measuring fluid intelligence. | Major Benchmarks (2024–2026) |
| IFEval | Instruction-following evaluation with formatting/content constraints. | Major Benchmarks (2024–2026) |
| MLE-bench | OpenAI's ML engineering evaluation via Kaggle-style tasks. | Major Benchmarks (2024–2026) |
| PaperBench | Evaluates AI's ability to replicate 20 ICML 2024 papers from scratch. | Major Benchmarks (2024–2026) |
| Hugging Face Open LLM Leaderboard v2 | Evaluates open models on MMLU-Pro, GPQA, IFEval, MATH. | Leaderboards and Meta-Benchmarks |
| Artificial Analysis Intelligence Index v3 | Aggregates 10 evaluations. | Leaderboards and Meta-Benchmarks |
| SEAL by Scale AI | Hosts SWE-bench Pro and agentic evaluations. | Leaderboards and Meta-Benchmarks |
| P3 (Public Pool of Prompts) | Prompt templates for 270+ NLP tasks used to train T0 and similar models. | Prompt and Instruction Datasets |
| System Prompts Dataset | 944 system prompt templates for agent workflows (by Daniel Rosehill, Aug 2025). | Prompt and Instruction Datasets |
| OpenAssistant Conversations (OASST) | 161,443 messages in 35 languages with 461,292 quality ratings. | Prompt and Instruction Datasets |
| UltraChat / UltraFeedback | Large-scale synthetic instruction and preference datasets for alignment training. | Prompt and Instruction Datasets |
| SoftAge Prompt Engineering Dataset | 1,000 diverse prompts across 10 categories for benchmarking prompt performance. | Prompt and Instruction Datasets |
| Text Transformation Prompt Library | Comprehensive collection of text transformation prompts (May 2025). | Prompt and Instruction Datasets |
| Writing Prompts | ~300K human-written stories paired with prompts from r/WritingPrompts. | Prompt and Instruction Datasets |
| Midjourney Prompts | Text prompts and image URLs scraped from MidJourney's public Discord. | Prompt and Instruction Datasets |
| CodeAlpaca-20k | 20,000 programming instruction-output pairs. | Prompt and Instruction Datasets |
| ProPEX-RAG | Dataset for prompt optimization in RAG workflows. | Prompt and Instruction Datasets |
| NanoBanana Trending Prompts | 1,000+ curated AI image prompts from X/Twitter, ranked by engagement. | Prompt and Instruction Datasets |
| HarmBench | 510 harmful behaviors across standard, contextual, copyright, and multimodal categories. | Red Teaming and Adversarial Datasets |
| JailbreakBench | Open robustness benchmark for jailbreaking with 100 prompts. | Red Teaming and Adversarial Datasets |
| AgentHarm | 110 malicious agent tasks across 11 harm categories. | Red Teaming and Adversarial Datasets |
| DecodingTrust | 243,877 prompts evaluating trustworthiness across 8 perspectives. | Red Teaming and Adversarial Datasets |
| SafetyPrompts.com | Aggregator tracking 50+ safety/red-teaming datasets. | Red Teaming and Adversarial Datasets |