Multilingual LLM Translation for Amazon: Evaluating Cultural Nuance in Generative AI
Is AI ready to translate unsupervised?
For Amazon, where Alexa, AWS Bedrock, and global marketplace operations all depend on precise, culturally aware communication, multilingual translation is mission-critical. LLMs excel at grammar and literal meaning, but accurate localisation of tone, humour, and figurative language is what preserves customer trust across 500+ languages and regions.
This original research from Appen evaluates how leading multilingual LLMs translate culturally nuanced marketing copy across 20+ languages, from high-resource languages (Spanish, French) to lower-resource regional languages (Gujarati, Igbo). Human evaluators uncovered gaps in cultural alignment that highlight risks for Amazon’s global products and services.
LLM Translation vs Localisation
Localisation goes beyond translation, adapting content to resonate with specific cultural, regional, or linguistic audiences. A translation may be grammatically and literally accurate but fail from a localisation perspective if the tone, message, and intended outcome of the original communication are not preserved.
In many scenarios, such as the marketing emails in this study, poor localisation can have consequences ranging from comic miscommunication to offensive content and AI safety risks. Effective localisation, by contrast, builds trust and resonates with local audiences.
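To make the distinction concrete, here is a minimal sketch in Python. The prompt templates, example sentence, and audience notes are illustrative assumptions, not material from the study; they simply contrast a literal translation request with a localisation request.

```python
# Illustrative only: these prompt templates are assumptions for demonstration,
# not the prompts used in Appen's study.

SOURCE_TEXT = "This deal is a slam dunk, so don't let it slip through your fingers!"

def translation_prompt(text: str, target_language: str) -> str:
    """A literal translation request: grammar and surface meaning, nothing more."""
    return f"Translate the following text into {target_language}:\n\n{text}"

def localisation_prompt(text: str, target_language: str, locale_notes: str) -> str:
    """A localisation request: preserve tone, humour, and intent for the target audience."""
    return (
        f"Localise the following marketing copy for a {target_language}-speaking audience. "
        f"Preserve the playful tone and persuasive intent, and replace idioms or puns "
        f"with culturally equivalent expressions rather than translating them literally. "
        f"Audience notes: {locale_notes}\n\n{text}"
    )

if __name__ == "__main__":
    # The localisation prompt asks the model to adapt, not just translate.
    print(translation_prompt(SOURCE_TEXT, "French"))
    print()
    print(localisation_prompt(SOURCE_TEXT, "French", "casual, consumer-facing email"))
```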
Why this research matters
Traditional localisation is labour-intensive, requiring insight from translators experienced with the linguistic and cultural contexts of both the source and target languages. Because this work is costly and time-consuming, there is growing demand for multilingual LLMs that can reliably perform not only direct translation but also localisation.
Our findings show that, despite producing grammatically strong output, LLMs routinely mistranslate idioms and puns across all of the languages tested. Even high-resource languages such as French and Spanish suffered mistranslations that required human intervention. The pilot also introduces new approaches to evaluating multilingual LLMs. Rather than relying on conventional benchmarks, it focuses on localisation (the transfer of tone, figurative language, and similar features) and incorporates the expertise of human evaluators, a sharper test of real-world capability than literal translation that highlights the gap between “accurate translation” and effective localisation.
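As a rough illustration of why separating those axes matters, the sketch below rolls per-dimension human ratings up into distinct translation and localisation scores. The dimensions, 1 to 5 scale, and sample ratings are assumptions for demonstration, not the study’s actual rubric or data; the point is simply that output which scores as “accurate” can still surface as poorly localised.

```python
from statistics import mean

# Illustrative sketch: dimensions, scale, and ratings are invented for demonstration,
# not taken from Appen's evaluation rubric or results.

DIMENSIONS = ("grammar", "literal_accuracy", "tone", "figurative_language", "cultural_fit")

def summarise(ratings: list[dict]) -> dict:
    """Average human ratings per dimension, then split them into an
    'accurate translation' score and an 'effective localisation' score."""
    per_dim = {d: mean(r[d] for r in ratings) for d in DIMENSIONS}
    return {
        "per_dimension": per_dim,
        "translation_score": mean(per_dim[d] for d in ("grammar", "literal_accuracy")),
        "localisation_score": mean(per_dim[d] for d in ("tone", "figurative_language", "cultural_fit")),
    }

if __name__ == "__main__":
    # Two hypothetical evaluators rating one machine-translated marketing email.
    ratings = [
        {"grammar": 5, "literal_accuracy": 5, "tone": 2, "figurative_language": 1, "cultural_fit": 2},
        {"grammar": 4, "literal_accuracy": 5, "tone": 3, "figurative_language": 2, "cultural_fit": 2},
    ]
    # High translation_score, low localisation_score: grammatically "accurate" but not localised.
    print(summarise(ratings))
```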
Download the research paper
With multilingual LLMs increasingly used in translation and localisation workflows, this pilot exposes critical gaps in cultural alignment. Learn how LLMs handle nuance, where they fall short, and how combining AI with human expertise can unlock effective global communication.
In this paper, you’ll learn about:
- Opportunities for growth in state-of-the-art multilingual LLMs, despite their high performance on standard benchmarks
- Which types of language (e.g., idioms, puns, cultural references) cause the most consistent translation failures
- How linguistic features and LLM training data influence translation quality across languages
- Where human oversight remains essential to deliver high-quality, culturally relevant translations
