Adversarial Prompting for Amazon: Benchmarking Safety in Large Language Models
Benchmarking vulnerabilities in LLMs that power Alexa and AWS
As Amazon integrates LLMs into Alexa, AWS Bedrock, and enterprise customer workflows, the stakes for safety are high. Adversarial prompting has emerged as a critical risk, exposing how even advanced models can be manipulated into harmful, biased, or restricted outputs — eroding customer trust and increasing compliance risk.
This original research from Appen introduces a novel evaluation dataset and benchmarks leading models across multiple harm categories. Results show how attackers exploit weaknesses with techniques like virtualisation, sidestepping, and prompt injection, revealing safety gaps that directly affect Amazon’s global-scale deployments.
What is adversarial prompting?
Adversarial prompting is the practice of crafting inputs that bypass LLM safety mechanisms, triggering unsafe or policy-violating outputs. These inputs often rely on linguistic subtlety rather than overt rule-breaking, making them difficult to detect with standard moderation tools.
Key techniques include:
- Virtualisation – Framing harmful content within fictional or hypothetical scenarios
- Sidestepping – Using vague or indirect language to circumvent keyword-based filters
- Prompt Injection – Overriding model instructions with embedded commands
- Persuasion and Persistence – Leveraging roleplay, appeals to logic or authority, and repeated rewording to wear down refusal behaviour
Understanding these techniques is critical for assessing model robustness and developing safe, trustworthy AI systems.
Why does this research matter?
This study offers a comprehensive benchmark of LLM safety performance under adversarial pressure, exposing meaningful differences between models. The findings show that:
- Safety outcomes varies significantly across models, even under identical testing conditions
- Prompting techniques and identity-related content can dramatically influence model outputs
- Deployment-time factors—like system prompts and moderation layers—play a crucial role in safety
Download the research paper
As LLMs are increasingly deployed in high-stakes environments, understanding their vulnerabilities is critical to responsible AI development. This paper delivers actionable insights into the effectiveness of current safety interventions and proposes strategies to mitigate emerging threats.
In this paper, you’ll learn:
- How adversarial prompts reveal vulnerabilities in LLMs
- What techniques (e.g., virtualization, sidestepping) were most effective at eliciting harm
- How identity-related prompts impacted safety outcomes
- Why safety-aligned LLM training data is essential to building robust LLMs
- What organisations can do to improve LLM safety in practice
Optimise for resilience, not just intelligence
Appen’s human-in-the-loop LLM red teaming approach helps leading AI developers stress-test their models against sophisticated attack strategies. By integrating ethical evaluation, adversarial testing, and real-time human judgment, we support clients in developing AI systems that are not only powerful—but also aligned, resilient, and ready for real-world deployment.
