Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL

This content originally appeared on DEV Community and was authored by AI Viewz

Introduction

Building OCR models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, or Hebrew is often held back by a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG and extended to render RTL text correctly. This post walks through generating large-scale synthetic datasets in the format Donut expects.

What is SynthDoG-RTL?

SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:

  • Supporting RTL text direction and contextual script shaping.
  • Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
  • Allowing custom YAML configuration for layouts, distortions, and effects.

Installation and Setup

Clone the repository and install dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger

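Make sure libraqm is installed for proper Arabic/RTL shaping. On Debian/Ubuntu, the following installs it along with the FreeType, HarfBuzz, and FriBiDi development packages it depends on (package names may differ on other distributions):

sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev libraqm-dev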

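On macOS, set the following before generating with multiple workers, so that forked worker processes are not aborted by the Objective-C fork-safety check: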

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
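
Before generating anything, it can help to confirm that the Python imaging stack actually sees libraqm. The sketch below assumes text rendering goes through Pillow and uses Pillow's `features.check("raqm")`; the helper name `raqm_available` is made up for illustration.

```python
def raqm_available():
    """Return True/False for Pillow's raqm support, or None if Pillow is missing."""
    try:
        from PIL import features  # Pillow's feature-detection module
    except ImportError:
        return None
    return bool(features.check("raqm"))


if __name__ == "__main__":
    status = raqm_available()
    if status is None:
        print("Pillow is not installed in this environment.")
    elif status:
        print("raqm is available: complex text layout should work.")
    else:
        print("raqm is missing: RTL shaping may come out wrong.")
```

If this reports that raqm is missing after libraqm has been installed, Pillow may need to be reinstalled so that it can find the library.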

Preparing Resources

Each language needs:

  • Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
  • Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
  • Backgrounds: Optional textures under resources/backgrounds/.

Example structure:

resources/
 ├─ corpus/
 │   ├─ urdu.txt
 │   └─ arabic.txt
 └─ font/
     ├─ ur/
     │   └─ NotoNastaliq.ttf
     └─ ar/
         └─ NotoNaskh.ttf
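
A quick sanity check on a corpus file can catch encoding problems or stray Latin-only lines before they end up in rendered pages. This is a minimal sketch using only the standard library; the function names and the 0.5 threshold are assumptions for illustration, not part of SynthDoG-RTL.

```python
import unicodedata

# Bidirectional classes of strong RTL letters: "R" (e.g. Hebrew) and "AL" (Arabic script).
RTL_CLASSES = {"R", "AL"}


def rtl_ratio(text):
    """Fraction of alphabetic characters in `text` that are strong RTL letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(1 for ch in letters if unicodedata.bidirectional(ch) in RTL_CLASSES)
    return rtl / len(letters)


def check_corpus(path, min_ratio=0.5):
    """Return (non-empty line count, count of lines below `min_ratio` RTL letters).

    Opening with encoding="utf-8" raises UnicodeDecodeError on invalid bytes,
    which doubles as the UTF-8 check.
    """
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    suspicious = [line for line in lines if rtl_ratio(line) < min_ratio]
    return len(lines), len(suspicious)
```

For example, `check_corpus("resources/corpus/urdu.txt")` would report how many lines in the Urdu corpus are mostly non-RTL text and may deserve a manual look.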

Configuring Generation

YAML config files (e.g., config_ur.yaml) define page size, font range, distortions, and paths.

Example Urdu config:

corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
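
Simple mistakes in a config, such as a minimum font size larger than the maximum, are cheaper to catch before launching a long run. The sketch below checks only the flat keys shown in the example above; `validate_config` is a hypothetical helper, and the real config schema in the repository may be nested differently. The dictionary could come from PyYAML's `yaml.safe_load`, which is an assumption about your environment.

```python
def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Paths that the example config always sets.
    for key in ("corpus_path", "font_dir"):
        if not cfg.get(key):
            problems.append(f"missing required key: {key}")

    # Sizes must be positive integers.
    size_keys = ("page_width", "page_height", "min_font_size", "max_font_size")
    sizes_ok = True
    for key in size_keys:
        value = cfg.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
            sizes_ok = False

    if sizes_ok and cfg["min_font_size"] > cfg["max_font_size"]:
        problems.append("min_font_size is larger than max_font_size")

    # rotate_angle is optional, but if present it must be an ordered [low, high] pair.
    angle = cfg.get("rotate_angle")
    if angle is not None:
        if len(angle) != 2 or angle[0] > angle[1]:
            problems.append(f"rotate_angle must be [low, high], got {angle!r}")

    return problems


if __name__ == "__main__":
    example = {
        "corpus_path": "resources/corpus/urdu.txt",
        "font_dir": "resources/font/ur/",
        "page_width": 1240,
        "page_height": 1754,
        "min_font_size": 20,
        "max_font_size": 40,
        "rotate_angle": [-2, 2],
        "background_dir": "resources/backgrounds/paper/",
    }
    print(validate_config(example) or "config looks consistent")
```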

Generating Synthetic Data

Run the CLI:

synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml

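Here -o sets the output directory, -c the number of samples (1000), -w the number of worker processes (8), and -v enables verbose logging. Images and their text are written to ./outputs/synthdog_ur/.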

Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
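
Running the same command for several languages can be scripted. The sketch below only builds and prints the commands, using the flags and the naming pattern from the example above; the helper name and the list of language codes are assumptions, and each config file must already exist.

```python
import shlex


def build_command(lang, count=1000, workers=8):
    """Build the synthtiger argument list for one language code, e.g. "ur"."""
    return [
        "synthtiger",
        "-o", f"./outputs/synthdog_{lang}",  # output directory per language
        "-c", str(count),                    # number of samples
        "-w", str(workers),                  # worker processes
        "-v",
        "template.py", "SynthDoG", f"config_{lang}.yaml",
    ]


if __name__ == "__main__":
    for lang in ("ur", "ar", "fa"):
        # Print instead of executing; pass the list to
        # subprocess.run(cmd, check=True) to actually run it.
        print(shlex.join(build_command(lang)))
```

Passing the argument list to `subprocess.run` rather than a single shell string avoids quoting problems if an output path ever contains spaces.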

Formatting for Donut

Donut expects an image + JSON pair. Structure your dataset like:

my_dataset/
 ├─ train/
 │   ├─ metadata.jsonl
 │   ├─ 00000001.png
 │   └─ ...
 ├─ validation/
 │   └─ ...
 └─ test/
     └─ ...

Each line in metadata.jsonl:

{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"}

Donut will tokenize this internally. Ensure that file_name matches your image and text_sequence contains the RTL ground truth text.
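
Because ground_truth is itself a JSON string nested inside the outer JSON object, it is easy to get the escaping wrong by hand. A minimal sketch of writing such lines with the standard json module follows; the helper names are hypothetical, and `ensure_ascii=False` is a choice made here to keep the RTL text readable in the file.

```python
import json


def make_metadata_line(file_name, text):
    """Return one metadata.jsonl line for an image and its ground-truth text."""
    # Inner object is serialized first, then embedded as a string in the outer one.
    ground_truth = json.dumps({"gt_parse": {"text_sequence": text}}, ensure_ascii=False)
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth}, ensure_ascii=False)


def write_metadata(path, samples):
    """Write (file_name, text) pairs to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for file_name, text in samples:
            handle.write(make_metadata_line(file_name, text) + "\n")


if __name__ == "__main__":
    print(make_metadata_line("00000001.png", "یہ اردو کا متن ہے"))
```

Reading a line back takes two `json.loads` calls: one for the line and one for the ground_truth string.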

Advanced Tips

  • Layouts: Customize template.py for multi-column, headers, or tables.
  • Effects: Add noise, blur, or perspective distortion in YAML for realism.
  • Fonts: Use multiple fonts per language to avoid overfitting.
  • Mixed Scripts: Include English corpora to simulate bilingual documents.
  • Scaling: Generate 10k–100k samples to pre-train Donut effectively.
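
As one way to approach the mixed-scripts tip, English lines can be blended into an RTL corpus at a fixed proportion before generation. This is a sketch under the assumption that the corpus is consumed line by line; `mix_corpora` and its parameters are illustrative and not part of SynthDoG-RTL.

```python
import random


def mix_corpora(rtl_lines, english_lines, english_fraction=0.2, seed=0):
    """Return a shuffled list in which roughly `english_fraction` of lines are English.

    All RTL lines are kept; English lines are sampled with replacement.
    """
    if not 0.0 <= english_fraction < 1.0:
        raise ValueError("english_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible

    # Solve n_en / (n_rtl + n_en) = english_fraction for n_en.
    n_english = round(len(rtl_lines) * english_fraction / (1.0 - english_fraction))
    extra = [rng.choice(english_lines) for _ in range(n_english)] if english_lines else []

    mixed = list(rtl_lines) + extra
    rng.shuffle(mixed)
    return mixed
```

The result can be written to a new UTF-8 file under resources/corpus/ and referenced from a separate config, leaving the original single-language corpus untouched.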

Conclusion

With SynthDoG-RTL you can quickly bootstrap synthetic OCR datasets for RTL languages such as Arabic, Urdu, Persian, and Hebrew. Once arranged in the image and metadata.jsonl layout described above, the generated data can be used to train or fine-tune Donut document understanding models even in low-resource settings.

AI Viewz | Sciencx (2025-09-23T20:11:56+00:00) Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL. Retrieved from https://www.scien.cx/2025/09/23/generating-synthetic-rtl-ocr-data-for-donut-with-synthdog-rtl-4/
