Papers
100 research papers on prompt engineering techniques.
Showing 100 of 100 papers
The Prompt Report: A Systematic Survey of Prompting Techniques
Most comprehensive survey: taxonomy of 58 text and 40 multimodal prompting techniques from 1,500+ papers. Co-authored with OpenAI, Microsoft, Google, Stanford.
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
44 techniques across application areas with per-task performance summaries.
A Survey of Prompt Engineering Methods in LLMs for Different NLP Tasks
39 prompting methods across 29 NLP tasks.
A Survey of Automatic Prompt Engineering: An Optimization Perspective
Formalizes auto-PE methods as discrete/continuous/hybrid optimization problems.
Efficient Prompting Methods for Large Language Models: A Survey
Survey of efficiency-oriented prompting (compression, optimization, APE) for reducing compute and latency.
Navigate through Enigmatic Labyrinth: A Survey of Chain of Thought Reasoning
Systematic CoT survey.
Demystifying Chains, Trees, and Graphs of Thoughts
Unified framework for multi-prompt reasoning topologies.
Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey
oriented Prompt Engineering for Large Language Models: A Survey](https://arxiv.org/abs/2401.14043) [2024] — Focuses on prompts designed around explicit task goals.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning LLMs
of-Thought for Reasoning LLMs](https://arxiv.org/abs/2503.09567) [2025] — Distinguishes Long CoT from Short CoT in o1/R1-era models.
OPRO: Large Language Models as Optimizers
Uses LLMs as optimizers via meta-prompts; optimized prompts outperform human-designed ones by up to 50% on BBH.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Improving Pipelines](https://arxiv.org/abs/2310.03714) [2023, ICLR 2024] — Framework for programming (not prompting) LLMs with automatic prompt optimization.
MIPRO: Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Stage Language Model Programs](https://arxiv.org/abs/2406.11695) [2024, EMNLP 2024] — Bayesian optimization for multi-stage LM programs; up to 13% accuracy gains.
TextGrad: Automatic "Differentiation" via Text
Treats compound AI systems as computation graphs with textual feedback as gradients. Published in Nature.
EvoPrompt
Evolutionary algorithm approach for automatically optimizing discrete prompts.
Meta Prompting for AI Systems
Example-agnostic structural templates formalized using category theory.
Prompt Engineering a Prompt Engineer (PE²)
Uses LLMs to meta-prompt themselves, refining prompts with step-by-step templates to significantly improve reasoning.
Large Language Models Are Human-Level Prompt Engineers
Level Prompt Engineers](https://arxiv.org/abs/2211.01910) [2022] — Automatic prompt generation via APE.
Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning
Based Discrete Optimization for Prompt Tuning](https://arxiv.org/abs/2302.03668) [2023]
SPO: Self-Supervised Prompt Optimization
Supervised Prompt Optimization](https://arxiv.org/abs/2502.06855) [2025] — Competitive performance at 1–6% of the cost of prior methods.
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968) [2024, ACL 2024] — 3x–6x faster than LLMLingua with GPT-4 data distillation.
LongLLMLingua
Question-aware compression for long contexts; 21.4% performance boost with 4x fewer tokens.
Prompt Compression for Large Language Models: A Survey
Comprehensive survey of hard and soft prompt compression methods.
Scaling LLM Test-Time Compute Optimally
Time Compute Optimally](https://arxiv.org/abs/2408.03314) [2024] — Shows optimal test-time compute allocation can outperform 14x larger models.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948) [2025] — Pure RL-trained reasoning model matching o1; open-source with distilled variants.
s1: Simple Test-Time Scaling
Time Scaling](https://arxiv.org/abs/2501.19393) [2025] — SFT on just 1,000 examples creates competitive reasoning model via "budget forcing."
Reasoning Language Models: A Blueprint
Systematic framework organizing reasoning LM approaches.
Demystifying Long Chain-of-Thought Reasoning in LLMs
of-Thought Reasoning in LLMs](https://arxiv.org/abs/2502.03373) [2025] — Analyzes long CoT behavior in modern reasoning models.
Graph of Thoughts: Solving Elaborate Problems with LLMs
Models thoughts as arbitrary graphs; 62% quality improvement over ToT on sorting.
Tree of Thoughts: Deliberate Problem Solving with LLMs
Tree search over reasoning paths.
Everything of Thoughts
Integrates CoT, ToT, and external solvers via MCTS.
Skeleton-of-Thought
of-Thought](https://arxiv.org/abs/2307.15337) [2023] — Parallel decoding via answer skeleton generation for up to 2.69x speedup.
Chain of Thought Prompting Elicits Reasoning in Large Language Models
The foundational CoT paper.
Self-Consistency Improves Chain of Thought Reasoning
Consistency Improves Chain of Thought Reasoning](https://arxiv.org/abs/2203.11171) [2022] — Aggregating multiple CoT outputs for reliability.
Large Language Models are Zero-Shot Reasoners
Shot Reasoners](https://arxiv.org/abs/2205.11916) [2022] — "Let's think step by step" as a zero-shot reasoning trigger.
ReAct: Synergizing Reasoning and Acting in Language Models
Interleaving reasoning and tool use.
Many-Shot In-Context Learning
Shot In-Context Learning](https://arxiv.org/abs/2404.11018) [2024, NeurIPS 2024 Spotlight] — Significant gains scaling ICL to hundreds/thousands of examples; introduces Reinforced and Unsupervised ICL.
Many-Shot In-Context Learning in Multimodal Foundation Models
Shot In-Context Learning in Multimodal Foundation Models](https://arxiv.org/abs/2405.09798) [2024] — Scales multimodal ICL to ~2,000 examples across 14 datasets.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Context Learning Work?](https://arxiv.org/abs/2202.12837) [2022]
Fantastically Ordered Prompts and Where to Find Them
Overcoming few-shot prompt order sensitivity.
Calibrate Before Use: Improving Few-Shot Performance of Language Models
Shot Performance of Language Models](https://arxiv.org/abs/2102.09690) [2021]
Agentic Large Language Models: A Survey
Comprehensive survey organizing agentic LLMs by reasoning, acting, and interacting capabilities.
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Agents: A Survey of Progress and Challenges](https://arxiv.org/abs/2402.01680) [2024] — Covers profiling, communication, and growth mechanisms.
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Agent Collaboration Mechanisms: A Survey of LLMs](https://arxiv.org/abs/2501.06322) [2025] — Reviews debate and cooperation strategies in LLM-based multi-agent systems.
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Gen LLM Applications via Multi-Agent Conversation](https://arxiv.org/abs/2308.08155) [2023] — Microsoft's foundational multi-agent framework paper.
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs
World APIs](https://arxiv.org/abs/2307.16789) [2023, ICLR 2024] — Trains LLMs to use massive real-world API collections.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) [2023, ICLR 2024] — The benchmark driving agentic coding progress.
AgentBench: Evaluating LLMs as Agents
Benchmark across 8 environments.
PAL: Program-aided Language Models
aided Language Models](https://arxiv.org/abs/2211.10435) [2023] — Offloading computation to code interpreters.
Visual Prompting in Multimodal Large Language Models: A Survey
First comprehensive survey on visual prompting methods in MLLMs.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441) [2023] — Visual markers dramatically improve visual grounding.
A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks
Language Tasks](https://arxiv.org/abs/2411.06284) [2024] — Covers text, image, video, audio MLLMs.
Multimodal Chain-of-Thought Reasoning in Language Models
of-Thought Reasoning in Language Models](https://arxiv.org/abs/2302.00923) [2023]
From Prompt Engineering to Prompt Craft
Design-research view of prompt "craft" for diffusion models.
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of LLMs
Examines how constraining outputs to structured formats impacts reasoning performance.
Batch Prompting: Efficient Inference with LLM APIs
Structured Prompting: Scaling In-Context Learning to 1,000 Examples
Context Learning to 1,000 Examples](https://arxiv.org/abs/2212.06713) [2022]
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Formal framework with systematic evaluation of 5 attacks and 10 defenses across 10 LLMs.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
OpenAI's priority-level training for injection defense.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses
Realistic agent scenario benchmark.
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents
Integrated LLM Agents](https://arxiv.org/abs/2403.02691) [2024]
SecAlign: Defending Against Prompt Injection with Preference Optimization
DPO-based defense.
WASP: Benchmarking Web Agent Security Against Prompt Injection
Security benchmark for web/computer-use agents.
Many-Shot Jailbreaking
Shot Jailbreaking](https://www.anthropic.com/research/many-shot-jailbreaking) [2024] — Scaling harmful examples in long-context windows enables jailbreaking (Anthropic Technical Report).
Constitutional AI: Harmlessness from AI Feedback
Ignore Previous Prompt: Attack Techniques For Language Models
Artificial Intelligence and Cybersecurity: Documented Risks, Enterprise Guardrails, and Emerging Threats in 2024–2025
2025](https://www.ijfmr.com/research-paper.php?id=62200) [2025] — Survey of real prompt-injection incidents with practical governance prompt patterns.
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
Legal Prompt Engineering for Multilingual Legal Judgement Prediction
Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems
Commonsense-Aware Prompting for Controllable Empathetic Dialogue Generation
Aware Prompting for Controllable Empathetic Dialogue Generation](https://arxiv.org/abs/2302.01441) [2023]
PLACES: Prompting Language Models for Social Conversation Synthesis
Medical Image Segmentation Using Transformer Encoders and Prompt-Based Learning: A Systematic Review
Based Learning: A Systematic Review](https://ieeexplore.ieee.org/document/11313186/) [2025]
TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
SQL-based interface preserving tabular structure for multi-hop queries.
A Taxonomy of Prompt Modifiers for Text-To-Image Generation
To-Image Generation](https://arxiv.org/abs/2204.13988) [2022]
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
to-Image Generative Models](https://arxiv.org/abs/2109.06977) [2021]
High-Resolution Image Synthesis with Latent Diffusion Models
Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) [2021]
DALL·E: Creating Images from Text
Investigating Prompt Engineering in Diffusion Models
MusicLM: Generating Music From Text
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Music: Text-to-Waveform Music Generation with Diffusion Models](https://arxiv.org/pdf/2302.04456) [2023]
AudioLM: A Language Modeling Approach to Audio Generation
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models](https://arxiv.org/pdf/2301.12661.pdf) [2023]
Language Models are Few-Shot Learners (GPT-3)
Shot Learners (GPT-3)](https://arxiv.org/abs/2005.14165) [2020] — Demonstrated few-shot prompting at scale.
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/abs/2101.00190) [2021]
The Power of Scale for Parameter-Efficient Prompt Tuning
Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) [2021]
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
Shot Paradigm](https://arxiv.org/abs/2102.07350) [2021]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Generated Knowledge Prompting for Commonsense Reasoning
Making Pre-trained Language Models Better Few-shot Learners
trained Language Models Better Few-shot Learners](https://aclanthology.org/2021.acl-long.295) [2021]
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
How Can We Know What Language Models Know?
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for LLMs
of-Thought Demonstrations for LLMs](https://arxiv.org/abs/2302.00618) [2023]
Progressive Prompts: Continual Learning for Language Models
Successive Prompting for Decompleting Complex Questions
Decomposed Prompting: A Modular Approach for Solving Complex Tasks
PromptChainer: Chaining Large Language Model Prompts through Visual Programming
Ask Me Anything: A Simple Strategy for Prompting Language Models
me-anything-a-simple-strategy-for) [2022]
Prompting GPT-3 To Be Reliable
3 To Be Reliable](https://arxiv.org/abs/2210.09150) [2022]
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
Shot Reasoning](https://arxiv.org/abs/2212.08061) [2022]