Language Aggregation in OpenSearch: Selecting One Document Per Group by Language Preference

This content originally appeared on DEV Community and was authored by Alexey Vidanov

Multilingual content is common in documentation systems, product catalogs, and knowledge bases. When the same item exists in several languages, search results often become cluttered with multiple versions of the same document.

A typical requirement is to return one document per content group, chosen using a language preference order such as de > en > fr.

This blog post presents a practical pattern for handling language aggregation. The features it relies on (collapse, script-based sorting, and keyword fields) are part of the open-source OpenSearch project and are supported in Amazon OpenSearch Service, so the pattern suits both self-managed clusters and AWS-managed environments.

The Problem

If an article exists in German, English, and French, a standard search will return all three. You want:

  • One hit per crossLanguageGroup
  • The language with the highest user preference
  • Deterministic, predictable selection

Simple deduplication does not work because you must apply a ranking rule across the group.

Solution Overview

The solution relies on three capabilities:

  1. Field Collapse: groups all translations of the same document.
  2. Scripted Sort: applies an explicit language ranking.
  3. Keyword Fields: enable efficient sorting and scripting on language arrays.

Workflow

Figure: Six-step multi-language document search workflow using collapse. The process reduces 6 documents across German, English, and French to 3 results by grouping cross-language versions and applying language preference ranking.

Index Setup

PUT tmp_multi_lang
{
  "mappings": {
    "properties": {
      "crossLanguageGroup": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "languages": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "title": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "content": { "type": "text" }
    }
  }
}

Notes:

  • crossLanguageGroup stores the logical ID shared by all translations of the same item. Every language variant uses the same value, so collapse can group them reliably.
  • languages is an array of ISO language codes (e.g., ["de", "en"]). Using an array lets the script evaluate multiple languages if needed.
  • .keyword fields are essential because text fields are analyzed. The analyzer tokenizes and lowercases values, which breaks exact matching. Text fields also have no doc values by default, so sorting, collapsing, and script access are not available on them.
  • The .keyword subfield stores the raw, untouched value, enabling:
    • deterministic sorting
    • exact matches (e.g., term queries)
    • using values inside Painless scripts

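Without .keyword, both the collapse and the scripted sort would fail, because each reads doc values that text fields do not provide by default. Collapse additionally requires a single-valued field, which is why every document carries exactly one crossLanguageGroup value.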
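
To make the difference concrete, here is a rough Python sketch of what happens to a value in an analyzed text field versus a keyword field. The function `approximate_standard_analyzer` is a hypothetical simplification (lowercase, then split on non-word characters); the real standard analyzer follows Unicode text segmentation rules and differs in details. The ID "ABC-123" is an illustrative value, not one from this post.

```python
import re
from typing import List


def approximate_standard_analyzer(value: str) -> List[str]:
    """Hypothetical, simplified stand-in for an analyzed text field:
    lowercases the value and splits it on non-word characters."""
    return [token for token in re.split(r"[\W_]+", value.lower()) if token]


def keyword_value(value: str) -> List[str]:
    """A keyword field indexes the value as one unchanged term."""
    return [value]


group_id = "ABC-123"  # illustrative ID, not taken from the sample data
print(approximate_standard_analyzer(group_id))  # ['abc', '123']
print(keyword_value(group_id))                  # ['ABC-123']
```

The text field ends up with two lowercase tokens, so there is no single value left to group or sort by, while the keyword field keeps the original string intact.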

Sample Data

Load with POST tmp_multi_lang/_bulk using NDJSON:

{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["de"], "title": "Willkommen", "content": "Dies ist eine Einführung. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["en"], "title": "Welcome", "content": "This is an introduction. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["fr"], "title": "Bienvenue", "content": "Ceci est une introduction. Source: Reply" }

{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["en"], "title": "Search", "content": "How to search in OpenSearch. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["fr"], "title": "Recherche", "content": "Comment chercher dans OpenSearch. Source: Reply" }

{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["de"], "title": "Produktübersicht", "content": "Dies ist eine Produktbeschreibung." }
{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["en"], "title": "Product Overview", "content": "This is a product description." }
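
If the documents come from application code rather than a console, the bulk body can be generated. The following is a minimal sketch using only the Python standard library; the helper name `to_bulk_ndjson` is hypothetical, and sending the body to the cluster is left out.

```python
import json
from typing import Dict, Iterable, List


def to_bulk_ndjson(docs: Iterable[Dict]) -> str:
    """Serialize documents as a _bulk body: one action line and one
    source line per document."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps characters such as "ü" readable
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the body to end with a newline
    return "\n".join(lines) + "\n"


docs = [
    {"crossLanguageGroup": "def456", "languages": ["de"],
     "title": "Produktübersicht", "content": "Dies ist eine Produktbeschreibung."},
    {"crossLanguageGroup": "def456", "languages": ["en"],
     "title": "Product Overview", "content": "This is a product description."},
]
body = to_bulk_ndjson(docs)
print(body)
```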

The Core Query

POST tmp_multi_lang/_search
{
  "query": {
    "match": { "content": "Reply" }
  },
  "collapse": {
    "field": "crossLanguageGroup.keyword"
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "params": {
            "lang_order": { "de": 0, "en": 1, "fr": 2 }
          },
          "source": """
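            // Start with a fallback rank so documents in unlisted languages sort last.
            int best = 100;
            def order = params.lang_order;
            if (doc.containsKey('languages.keyword')) {
              // A document may list several languages; keep the lowest (best) rank.
              for (def l : doc['languages.keyword']) {
                if (order.containsKey(l)) {
                  int ord = (int) order.get(l);
                  if (ord < best) { best = ord; }
                }
              }
            }
            return best;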
          """
        }
      }
    },
    { "_score": "desc" }
  ]
}
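
Because the language order is passed through `params`, the same request can be built per user. The sketch below assembles the request body in Python from an ordered preference list; `build_search_body` is a hypothetical helper, and the Painless source is the script shown above. It only builds the dictionary and does not call a cluster.

```python
import json
from typing import Dict, List

# Same Painless script as in the query above
SCRIPT_SOURCE = """
int best = 100;
def order = params.lang_order;
if (doc.containsKey('languages.keyword')) {
  for (def l : doc['languages.keyword']) {
    if (order.containsKey(l)) {
      int ord = (int) order.get(l);
      if (ord < best) { best = ord; }
    }
  }
}
return best;
"""


def build_search_body(text: str, preferred_languages: List[str]) -> Dict:
    """Build the collapse + scripted-sort request for one preference order."""
    # List position becomes the rank: ["de", "en", "fr"] -> de=0, en=1, fr=2
    lang_order = {lang: rank for rank, lang in enumerate(preferred_languages)}
    return {
        "query": {"match": {"content": text}},
        "collapse": {"field": "crossLanguageGroup.keyword"},
        "sort": [
            {"_script": {
                "type": "number",
                "order": "asc",
                "script": {
                    "lang": "painless",
                    "params": {"lang_order": lang_order},
                    "source": SCRIPT_SOURCE,
                },
            }},
            {"_score": "desc"},
        ],
    }


body = build_search_body("Reply", ["de", "en", "fr"])
print(json.dumps(body["sort"][0]["_script"]["script"]["params"]))
```

Keeping the script source constant and changing only `params` also means the cluster can reuse the compiled script across requests.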

How It Works

Query Stage

Matches all documents containing "Reply".

Collapse Stage

Groups documents by crossLanguageGroup.keyword.

Script Sort Stage

  • Iterates the languages array
  • Checks each language in the priority map
  • Selects the lowest value (best match)
  • Uses 100 as the fallback rank for documents with no language in the priority map, so they sort last
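
The steps above can be mirrored in plain Python to check the ranking logic outside the cluster. This is a sketch of the same rule, not the Painless script itself; the function name `best_language_rank` is hypothetical.

```python
from typing import Dict, Iterable

FALLBACK_RANK = 100  # same fallback value as in the Painless script


def best_language_rank(languages: Iterable[str], lang_order: Dict[str, int]) -> int:
    """Return the lowest rank among the document's languages,
    or the fallback if none of them is in the priority map."""
    best = FALLBACK_RANK
    for lang in languages:
        if lang in lang_order and lang_order[lang] < best:
            best = lang_order[lang]
    return best


order = {"de": 0, "en": 1, "fr": 2}
print(best_language_rank(["fr", "en"], order))  # 1
print(best_language_rank(["it"], order))        # 100
print(best_language_rank([], order))            # 100
```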

Tie Breaking

If two documents share the same priority, _score decides.

Example Result

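Suppose a query matches six documents (an illustrative set that mirrors the figure; klm654 is not part of the sample data above):

abc123/de, abc123/en, abc123/fr, xyz789/en, xyz789/fr, klm654/de

After collapse with de > en > fr, one hit per group remains:

abc123/de, klm654/de, xyz789/en

German hits come first because the same sort also orders the final result list. With the sample data, the "Reply" query matches only the abc123 and xyz789 documents and returns abc123/de and xyz789/en.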
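
As a rough illustration, the example can be replayed in plain Python: sort by language rank, then by score, and keep the first hit per group. The scores are invented for illustration and `collapse_by_preference` is a hypothetical name; this shows the observable behavior, not how OpenSearch implements collapse internally.

```python
from typing import Dict, List, Tuple

LANG_ORDER = {"de": 0, "en": 1, "fr": 2}
FALLBACK_RANK = 100

Hit = Tuple[str, str, float]  # (crossLanguageGroup, language, score)

# Scores are made up for illustration
hits: List[Hit] = [
    ("abc123", "de", 1.0), ("abc123", "en", 1.2), ("abc123", "fr", 0.9),
    ("xyz789", "en", 0.8), ("xyz789", "fr", 1.1),
    ("klm654", "de", 0.7),
]


def collapse_by_preference(hits: List[Hit]) -> List[Hit]:
    """Sort like the query (rank ascending, score descending),
    then keep the first hit seen for each group."""
    ranked = sorted(
        hits, key=lambda h: (LANG_ORDER.get(h[1], FALLBACK_RANK), -h[2])
    )
    first_per_group: Dict[str, Hit] = {}
    for hit in ranked:
        first_per_group.setdefault(hit[0], hit)
    return list(first_per_group.values())


for group, lang, score in collapse_by_preference(hits):
    print(f"{group}/{lang}")
```

Note that the English version of abc123 has the higher invented score, yet the German version is selected, because the language rank is compared before the score.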

Why This Works

  • Keyword fields expose doc values, making sorting fast and predictable.
  • Scripted sort applies a strict language hierarchy, not a soft boost.
  • Collapse guarantees exactly one document per crossLanguageGroup.
  • Painless correctly handles multi-value arrays like languages.

Everything works together to deliver deterministic, language-aware selection.

Common Errors and Fixes

1. "Text fields are not optimised for operations"

Use .keyword:

doc['languages.keyword']

2. "unknown field [lang]"

lang must be inside the script object.

3. Casting errors

Use explicit casting:

int ord = (int) order.get(l);

4. "Illegal list shortcut value [values]"

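This error typically appears when a script accesses doc['languages.keyword'].values. Iterate over the doc values directly instead: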

for (def l : doc['languages.keyword']) { ... }

Alternative Approach: Score-Based Selection

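This method uses query-time boosts to encourage certain languages rather than enforcing a strict order. Each language clause gets a different boost: 3 for German, 2 for English, 1 for French. A boost multiplies the score of its own clause, and that clause score is then added to the text-match score.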

When OpenSearch calculates the score, documents that match higher-boosted languages naturally rise to the top.

After scoring, collapse picks the highest-scoring document per crossLanguageGroup.

What this means in practice

  • If a group has de, en, and fr, the German version usually wins because it has the highest boost.
  • But if the English document has stronger text relevance, its score may exceed the German one.
  • The boosted clause scores are added to the full-text score, so the effect is a soft preference, not a strict ranking.
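
A small worked example shows how the soft preference can flip. The sketch assumes, as a simplification, that each language clause contributes exactly its boost value to the final score; in a real query the contribution is the boost multiplied by the term score, so actual numbers will differ. All scores below are hypothetical.

```python
# Simplifying assumption: each language clause adds exactly its boost value
LANGUAGE_BONUS = {"de": 3.0, "en": 2.0, "fr": 1.0}


def combined_score(text_score: float, language: str) -> float:
    """Text relevance plus the (assumed) language contribution."""
    return text_score + LANGUAGE_BONUS.get(language, 0.0)


german = combined_score(1.0, "de")   # weak text match, strongest boost
english = combined_score(2.5, "en")  # strong text match, weaker boost

print(german)            # 4.0
print(english)           # 4.5
print(english > german)  # True: the English version would be selected
```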

Good fit: simple setups where speed matters and minor inconsistencies are acceptable.

Not ideal: cases requiring deterministic de > en > fr without exceptions.

POST tmp_multi_lang/_search
{
  "query": {
    "bool": {
      "must": { "match": { "content": "Reply" } },
      "should": [
        { "term": { "languages.keyword": { "value": "de", "boost": 3 } } },
        { "term": { "languages.keyword": { "value": "en", "boost": 2 } } },
        { "term": { "languages.keyword": { "value": "fr", "boost": 1 } } }
      ]
    }
  },
  "collapse": { "field": "crossLanguageGroup.keyword" },
  "sort": [ { "_score": "desc" } ]
}
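
The should clauses can also be generated from a weight table, which keeps the boosts in one place. A minimal sketch, assuming a hypothetical helper `build_boosted_body`; it only builds the request dictionary.

```python
from typing import Dict


def build_boosted_body(text: str, boosts: Dict[str, float]) -> Dict:
    """Build the score-based variant: one boosted term clause per language."""
    should = [
        {"term": {"languages.keyword": {"value": lang, "boost": boost}}}
        for lang, boost in boosts.items()
    ]
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": text}},
                "should": should,
            }
        },
        "collapse": {"field": "crossLanguageGroup.keyword"},
        "sort": [{"_score": "desc"}],
    }


boosted = build_boosted_body("Reply", {"de": 3, "en": 2, "fr": 1})
print(len(boosted["query"]["bool"]["should"]))  # 3
```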

Pros: fast and simple

Cons: boosts are additive, not strict priority

Conclusion

Language-aware document aggregation in OpenSearch is solved cleanly by combining collapse, a Painless script-based sort, and keyword-backed language fields. Script sorting provides reliable, deterministic selection, while score-based boosting offers a faster but less strict alternative.

This pattern is useful anywhere multilingual data creates noise in search results. By grouping documents, applying a clear language hierarchy, and keeping sorting deterministic, teams can deliver cleaner UX across documentation portals, product catalogs, and knowledge bases. It also works well with personalization and dynamic language preferences.

If you want to build cleaner multilingual search or explore how to apply language-aware ranking in Amazon OpenSearch Service, feel free to reach out.

At Reply, we help teams design scalable, predictable, and user-centred search workflows — from proof of concept to fully aligned production deployments.

