
Why language-specific models matter for customer feedback

Generic multilingual models miss nuance. Here's why we built dedicated German and English pipelines.

By Howzer Team, Research

When we started building Howzer's sentiment analysis, the obvious path was to use a large multilingual model and fine-tune it. It would handle German and English out of the box, and every other language too. We tried it. The results were acceptable on benchmarks and poor on real customer messages.

A multilingual model that scores 92% on academic benchmarks can still miss the difference between a frustrated customer and an angry one in German.

Where multilingual models fail

Customer feedback is not academic text. It is informal, often ungrammatical, full of domain-specific terms, and emotionally charged. The linguistic patterns that signal frustration, urgency, or risk differ between languages in ways that a shared embedding space cannot capture reliably.

  • German compound words: "Vertragskündigungsbestätigung" ("contract cancellation confirmation") packs its meaning into a single word that generic tokenizers shred into near-meaningless subword fragments (see the tokenizer sketch after this list).
  • Indirect complaints: German speakers often phrase dissatisfaction indirectly ("Ich hätte erwartet, dass...", roughly "I would have expected that...") in ways that score as neutral in multilingual models.
  • Formal register: the distinction between "Sie" and "du" (formal vs. informal "you") signals relationship context that matters for tone analysis.
  • English idioms: expressions like "the last straw" or "couldn't care less" need idiomatic understanding, not literal interpretation.
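
To see the compound-word issue concretely, here is a minimal sketch comparing a multilingual tokenizer with a German-only one. The checkpoints named below are public Hugging Face models chosen purely for illustration; they are not the models behind Howzer's pipelines.

```python
# Minimal sketch of the compound-word problem.
# Model names are public Hugging Face checkpoints used for illustration only;
# they are not the models Howzer runs in production.
from transformers import AutoTokenizer

multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
german = AutoTokenizer.from_pretrained("bert-base-german-cased")

word = "Vertragskündigungsbestätigung"  # "contract cancellation confirmation"

# The multilingual vocabulary typically breaks the compound into more, shorter
# pieces than the German vocabulary, whose chunks track German morphemes better.
print(multilingual.tokenize(word))
print(german.tokenize(word))
```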

Our approach

Instead of one model that handles everything, we run a dedicated model per language. Each is trained on customer feedback data (not Wikipedia, not news articles), with scoring thresholds calibrated to that language's communication patterns; a minimal dispatch sketch follows the comparison below.

Multilingual (generic):
  • Single model for all languages
  • Trained on general-purpose corpora
  • Same thresholds across languages
  • Tokenization compromises

Language-specific (Howzer):
  • Dedicated model per language
  • Trained on customer feedback data
  • Calibrated thresholds per language
  • Native tokenization
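
In code, the per-language approach boils down to routing each message to its language's model and applying that language's own threshold. The sketch below is illustrative only: the class, the stub scoring functions, and the threshold values are assumptions, not Howzer's production code.

```python
# Illustrative sketch of per-language dispatch with per-language thresholds.
# All names and numbers here are assumptions, not Howzer's production values.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LanguagePipeline:
    """One dedicated sentiment model plus its own calibrated decision threshold."""
    score: Callable[[str], float]   # maps text to a raw sentiment score in [0, 1]
    negative_threshold: float       # language-specific cut-off for flagging negativity


def _score_de(text: str) -> float:
    """Stand-in for the dedicated German model (v2.3-DE in this post)."""
    raise NotImplementedError


def _score_en(text: str) -> float:
    """Stand-in for the dedicated English model (v2.2 in this post)."""
    raise NotImplementedError


PIPELINES: Dict[str, LanguagePipeline] = {
    "de": LanguagePipeline(score=_score_de, negative_threshold=0.42),  # example value
    "en": LanguagePipeline(score=_score_en, negative_threshold=0.35),  # example value
}


def analyse(text: str, language: str) -> dict:
    """Route a message to its language's pipeline; downstream code sees one shape."""
    pipeline = PIPELINES[language]
    score = pipeline.score(text)
    return {
        "language": language,
        "score": score,
        "negative": score < pipeline.negative_threshold,
    }
```

The reason the thresholds are separate is the same one given above: the same raw score does not mean the same thing in both languages, because patterns like indirectly worded German complaints sit closer to neutral than their English counterparts.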

What this means in practice

The German pipeline (v2.3-DE) uses a native German language model for sentiment analysis. It handles compound words correctly, understands indirect complaint patterns, and distinguishes between formal and informal register. The English pipeline (v2.2) uses a separately calibrated model optimized for English idioms and tone patterns.

Both pipelines produce the same output structure (sentiment scores, emotion labels, risk levels) so downstream components (root cause analysis, response generation, routing) work identically regardless of language. The difference is in precision: each model is better at its own language than any single model would be at both.
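
As a sketch of what that shared output structure could look like, here is one possible schema. The field names and value sets are assumptions; the post only states that both pipelines emit sentiment scores, emotion labels, and risk levels.

```python
# One possible shape for the shared, language-agnostic result.
# Field names and value sets are assumptions based on the categories named above.
from typing import List, Literal, TypedDict


class FeedbackAnalysis(TypedDict):
    language: Literal["de", "en"]
    sentiment: float                          # e.g. -1.0 (very negative) to 1.0 (very positive)
    emotions: List[str]                       # e.g. ["frustration", "urgency"]
    risk: Literal["low", "medium", "high"]


def route(result: FeedbackAnalysis) -> str:
    """Downstream components branch on the shared fields, not on which pipeline produced them."""
    return "escalate" if result["risk"] == "high" else "standard-queue"
```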

We plan to apply the same language-specific approach when adding future languages. Each will get its own sentiment model trained on customer feedback data for that language.