Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Multimodal AI safety isn’t as stable as you think

New research from Appen reveals how model updates can introduce unexpected safety shifts across modalities, even in leading LLMs.

Multimodal LLMs are rapidly advancing, but their safety behavior isn’t static.

In this study, Appen evaluated multiple generations of leading multimodal models using adversarial prompts and human review. What we found challenges a common assumption:

Newer models don’t always behave more safely.

Some improved.
Some regressed.
Most changed in ways that aren’t obvious without structured evaluation.

Alignment can drift across model updates
Model upgrades introduced measurable shifts in how systems respond to adversarial inputs — not always in the direction you’d expect.

Multimodal doesn’t mean more robust
In some cases, text-only adversarial prompts were more effective at eliciting harmful responses than multimodal ones. In others, the gap disappeared entirely depending on the model generation.

The pattern isn’t consistent, and that’s the risk.

“Safer” outputs can be misleading
Some models appear safer because they refuse more often, not because they generate better responses.

Understanding the difference is critical for real-world deployment.
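To make the distinction concrete, here is a minimal sketch of reporting the two signals separately. The records, field names, and 1–5 harm scale are hypothetical placeholders for illustration, not Appen's actual rating scheme:

```python
from statistics import mean

# Hypothetical per-response records: did the model refuse, and when it
# answered, how harmful was the answer (illustrative 1-5 human rating)?
records = [
    {"refused": True,  "harm_score": None},
    {"refused": False, "harm_score": 1.0},
    {"refused": False, "harm_score": 4.0},
    {"refused": True,  "harm_score": None},
]

def safety_summary(records):
    """Report refusal rate separately from the harm of answered prompts.

    A model can look 'safer' on refusal rate alone while the answers it
    does give become more harmful.
    """
    refusal_rate = mean(r["refused"] for r in records)
    answered = [r["harm_score"] for r in records if not r["refused"]]
    return {
        "refusal_rate": refusal_rate,
        "mean_harm_when_answering": mean(answered) if answered else None,
    }

print(safety_summary(records))
# -> {'refusal_rate': 0.5, 'mean_harm_when_answering': 2.5}
```

Tracking only the first number hides movement in the second, which is exactly where drift can do damage.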

If you’re building or deploying AI, this affects you

  • Safety behavior can change between model versions
  • Vulnerabilities can shift across modalities
  • Improvements in one area may introduce risks in another

Without consistent evaluation, these changes are easy to miss.
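One lightweight way to catch these shifts is to re-run a frozen adversarial prompt set on every release and compare aggregate harm. Below is a minimal sketch under that assumption; `query_model` and `rate_harm` are hypothetical stand-ins for your inference client and your human or automated rater, not a specific API:

```python
from statistics import mean

# A frozen, versioned prompt set: the same adversarial inputs every release.
ADVERSARIAL_PROMPTS = [
    "adversarial prompt 1 ...",
    "adversarial prompt 2 ...",
]

def query_model(version: str, prompt: str) -> str:
    # Hypothetical stand-in: call your real inference endpoint here.
    return f"[{version}] response to: {prompt}"

def rate_harm(output: str) -> float:
    # Hypothetical stand-in: human or automated harm rating on a 1-5 scale.
    return 1.0

def mean_harm(version: str) -> float:
    """Average rated harm for one model version over the fixed prompt set."""
    return mean(rate_harm(query_model(version, p)) for p in ADVERSARIAL_PROMPTS)

def check_drift(old: str, new: str, tolerance: float = 0.25) -> None:
    """Flag a release whose mean harm rises by more than `tolerance`."""
    before, after = mean_harm(old), mean_harm(new)
    flag = "REGRESSION" if after - before > tolerance else "ok"
    print(f"{old} -> {new}: mean harm {before:.2f} -> {after:.2f} [{flag}]")

check_drift("model-v1", "model-v2")
```

Because the prompt set is fixed, any change in the output is attributable to the model release itself, which is the core idea behind longitudinal testing.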

Download the full paper to explore:

  • A two-phase evaluation across multiple model generations
  • Adversarial testing across text and multimodal inputs
  • Human-rated harmfulness at scale
  • Model-by-model comparison of safety behavior
  • What alignment drift means for enterprise AI teams

Measure what actually changes
As AI systems evolve, static benchmarks aren’t enough.

Appen helps teams evaluate real-world model behavior through:

  • human-led red teaming
  • multimodal evaluation
  • longitudinal testing frameworks

See how safety actually changes across models

Download the full research report

White paper from Appen