Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL

This content originally appeared on DEV Community and was authored by AI Viewz

Introduction

Building OCR models for right-to-left (RTL) languages such as Arabic, Urdu, Persian, or Hebrew is often held back by a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG and extended to render RTL text correctly. In this post, we’ll walk through how to generate large-scale synthetic datasets compatible with Donut. The walkthrough is aimed at advanced developers who are comfortable with Python tooling and the command line.

What is SynthDoG-RTL?

SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:

Supporting RTL text direction and contextual script shaping.
Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
Allowing custom YAML configuration for layouts, distortions, and effects.

Installation and Setup

Clone the repository and install dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger

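Make sure libraqm is installed for proper Arabic/RTL shaping. It depends on FreeType, HarfBuzz, and FriBiDi, so on Debian/Ubuntu install all four:

sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev libraqm-dev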

On macOS, if generation with multiple workers fails with an Objective-C fork-safety error, set:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
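
Before generating anything, it can help to confirm that the Python environment actually sees libraqm. The snippet below is a minimal sketch that assumes text is rendered through Pillow (the library that loads libraqm); the helper name raqm_available is ours, not part of SynthDoG-RTL.

```python
def raqm_available():
    """Return True/False for Raqm support in Pillow, or None if Pillow is missing."""
    try:
        from PIL import features  # Pillow's runtime feature-detection module
    except ImportError:
        return None
    return bool(features.check("raqm"))


if __name__ == "__main__":
    status = raqm_available()
    if status is None:
        print("Pillow is not installed in this environment.")
    elif status:
        print("Pillow reports Raqm support; complex text layout is available.")
    else:
        print("Pillow was found but Raqm was not; Arabic-script text may render unshaped.")
```

If the last message appears, revisit the system packages above before moving on.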

Preparing Resources

Each language needs:

Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
Backgrounds: Optional textures under resources/backgrounds/.

Example structure:

resources/
 ├─ corpus/
 │   ├─ urdu.txt
 │   └─ arabic.txt
 └─ font/
     ├─ ur/
     │   └─ NotoNastaliq.ttf
     └─ ar/
         └─ NotoNaskh.ttf
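
A quick pre-flight check on this layout can catch a mis-encoded corpus or an empty font folder before a long run. The following is a sketch using only the standard library; the function names and the 0.5 threshold are illustrative assumptions, not something SynthDoG-RTL ships.

```python
import unicodedata
from pathlib import Path

RTL_CLASSES = {"R", "AL"}  # Unicode bidi classes for right-to-left letters


def rtl_ratio(text):
    """Fraction of letters in `text` whose bidi class is right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(1 for ch in letters if unicodedata.bidirectional(ch) in RTL_CLASSES)
    return rtl / len(letters)


def check_language(root, corpus_name, lang_code, min_rtl=0.5):
    """Return a list of problems found; an empty list means every check passed."""
    root = Path(root)
    problems = []

    corpus = root / "corpus" / corpus_name
    if not corpus.is_file():
        problems.append(f"missing corpus: {corpus}")
    else:
        try:
            text = corpus.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            problems.append(f"corpus is not valid UTF-8: {corpus}")
        else:
            # min_rtl is an arbitrary threshold chosen for this sketch.
            if rtl_ratio(text) < min_rtl:
                problems.append(f"corpus has few RTL letters: {corpus}")

    font_dir = root / "font" / lang_code
    fonts = []
    if font_dir.is_dir():
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in {".ttf", ".otf"}]
    if not fonts:
        problems.append(f"no .ttf/.otf fonts in {font_dir}")

    return problems


if __name__ == "__main__":
    for issue in check_language("resources", "urdu.txt", "ur"):
        print(issue)
```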

Configuring Generation

YAML config files (e.g., config_ur.yaml) define page size, font range, distortions, and paths.

Example Urdu config:

corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"
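
The 1240 x 1754 page in this example matches an A4 sheet at 150 DPI; whether that is why these numbers were chosen is our assumption. If you want a different paper size or resolution, the arithmetic is easy to reproduce:

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm, height_mm, dpi):
    """Convert a physical page size in millimetres to whole pixels at `dpi`."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


# A4 is 210 mm x 297 mm.
print(page_pixels(210, 297, 150))  # (1240, 1754)
print(page_pixels(210, 297, 300))  # (2480, 3508)
```

Font sizes are pixel values too, so if you change the resolution, scale min_font_size and max_font_size by the same factor.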

Generating Synthetic Data

Run the CLI:

synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml

This generates 1000 samples (-c) using 8 worker processes (-w) and writes the images and their text into ./outputs/synthdog_ur/ (-o). The -v flag prints error messages during generation.

Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
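
Running several languages by hand gets tedious, so a small driver script can help. This sketch rebuilds the exact command shown above for each language code; it assumes your configs follow the config_<lang_code>.yaml naming used in this post, and it only prints the commands unless you pass dry_run=False.

```python
import shlex
import subprocess


def synthtiger_command(lang_code, count=1000, workers=8):
    """Build the argument list for one language, mirroring the CLI call above."""
    return [
        "synthtiger",
        "-o", f"./outputs/synthdog_{lang_code}",
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", f"config_{lang_code}.yaml",
    ]


def run_all(lang_codes, dry_run=True, **kwargs):
    """Print each command; execute it only when dry_run is False."""
    for code in lang_codes:
        cmd = synthtiger_command(code, **kwargs)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop if one language fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(["ur", "ar", "fa"])
```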

Formatting for Donut

Donut expects an image + JSON pair. Structure your dataset like:

my_dataset/
 ├─ train/
 │   ├─ metadata.jsonl
 │   ├─ 00000001.png
 │   └─ ...
 ├─ validation/
 │   └─ ...
 └─ test/
     └─ ...

Each line in metadata.jsonl:

{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"}

Donut will tokenize this internally. Ensure that file_name matches your image and text_sequence contains the RTL ground truth text.
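
Note that ground_truth is a JSON string nested inside a JSON object, which is easy to get wrong when written by hand. If you need to build metadata.jsonl yourself, something like the following sketch works; the helper names are ours, and ensure_ascii=False keeps the RTL text readable in the file instead of turning it into \u escapes.

```python
import json


def metadata_line(file_name, text):
    """Serialize one metadata.jsonl record; ground_truth is itself a JSON string."""
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}}, ensure_ascii=False
    )
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth}, ensure_ascii=False
    )


def write_metadata(path, samples):
    """Write (file_name, text) pairs to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for file_name, text in samples:
            handle.write(metadata_line(file_name, text) + "\n")


# Round-trip check: decode the outer object, then the nested ground truth.
record = json.loads(metadata_line("00000001.png", "یہ اردو کا متن ہے"))
inner = json.loads(record["ground_truth"])
assert inner["gt_parse"]["text_sequence"] == "یہ اردو کا متن ہے"
```

The text is stored in logical order (the order it is typed and read), not reversed for display.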

Advanced Tips

Layouts: Customize template.py for multi-column, headers, or tables.
Effects: Add noise, blur, or perspective distortion in YAML for realism.
Fonts: Use multiple fonts per language to avoid overfitting.
Mixed Scripts: Include English corpora to simulate bilingual documents.
Scaling: Generate 10k–100k samples to pre-train Donut effectively.
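
For the mixed-scripts tip, one simple approach is to splice English lines into the RTL corpus before generation. This is a sketch of that idea, not necessarily how SynthDoG-RTL handles bilingual text; mix_corpora and latin_fraction are hypothetical names, and it assumes your corpus is organized as one passage per line.

```python
import random


def mix_corpora(rtl_lines, latin_lines, latin_fraction=0.2, seed=0):
    """Return rtl_lines with Latin-script lines spliced in at random positions.

    latin_fraction is the approximate share of Latin lines in the result.
    """
    if not 0 <= latin_fraction < 1:
        raise ValueError("latin_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_latin = round(len(rtl_lines) * latin_fraction / (1 - latin_fraction))
    n_latin = min(n_latin, len(latin_lines))  # cannot sample more than we have
    mixed = list(rtl_lines) + rng.sample(list(latin_lines), n_latin)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    urdu = ["یہ اردو کا متن ہے"] * 8
    english = ["Invoice number", "Total amount", "Date of issue", "Signature"]
    print(len(mix_corpora(urdu, english)))  # 10
```

Write the result back out as a UTF-8 file and point corpus_path at it.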

Conclusion

With SynthDoG-RTL you can rapidly bootstrap synthetic OCR datasets for RTL languages such as Arabic, Urdu, Persian, and Hebrew. Because the output can be arranged in the image + metadata.jsonl layout Donut expects, you can use it to train or fine-tune document understanding models even in low-resource settings.


APA

AI Viewz | Sciencx (2025-09-23T20:11:56+00:00) Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL. Retrieved from https://www.scien.cx/2025/09/23/generating-synthetic-rtl-ocr-data-for-donut-with-synthdog-rtl-4/
