
Introduction
The main challenge in building an OCR engine for historical books and manuscripts is the scarcity of training data. This project proposes using generative models such as GANs to generate synthetic Renaissance-era text and then using the augmented data to improve the performance of OCR models.

Overview
Here are the specific goals I aimed to achieve by the end of the project:
- Build a pipeline for dataset generation from the available manuscripts. This pipeline will be capable of applying various transformations, detecting text using pretrained models, and mapping word images to their corresponding labels in the transcripts.
- Experiment with different generative models, such as GANs, to develop a model capable of producing Renaissance-style printed text.
- Use the developed generative model to generate a synthetic dataset, and use that dataset to improve the performance of OCR models.
- Integrate the above steps into an easy-to-use pipeline.

The challenge of generating synthetic text samples varies significantly between printed and handwritten sources, primarily due to differences in consistency and data availability.
Generating synthetic samples for printed text is a more manageable task due to the inherent uniformity of the source material.
In essence, while synthesizing printed text involves modeling a consistent structure with predictable variations, synthesizing handwritten text requires modeling a vast and unstructured spectrum of individualistic styles, a challenge made more difficult by data scarcity.
Phase I: Data Preparation for Training Generative Models

Image Preprocessing
When working with historical documents, especially for tasks like Optical Character Recognition (OCR) and image generation, preprocessing isn’t just helpful — it’s crucial. The quality of preprocessing directly influences the accuracy and performance of deep learning models downstream.
In our project, we were provided with digitized versions of historical books in the form of PDFs, along with their corresponding transcripts. The first step involved splitting these multi-page PDFs into individual page images and aligning each one with its transcript. This ensured a reliable ground truth for training and evaluation.
To prepare the images for effective text detection, I applied several enhancement techniques:
- Skew Correction & Normalization: Many scanned pages were slightly tilted or poorly aligned. Correcting this skew helped models better identify line and word boundaries.
- DPI Adjustment: We ensured each image had a resolution of at least 300 DPI. Interestingly, increasing the resolution had a noticeable positive impact on text detection accuracy.
- Ink Bleed Removal & Denoising: Historical pages often suffer from ink bleeding through thin paper or noise due to aging. We carefully removed these artifacts without compromising character clarity.
- Sharpening & Contrast Enhancement: These techniques helped bring out faint text and faded ink, making the characters stand out more clearly.
- Binarization: By converting grayscale images to black-and-white using adaptive thresholds, we simplified the input for OCR models while retaining key features.
- Morphological Operations: Erosion, dilation, opening, and closing were selectively applied based on the document type to clean up background clutter and reinforce character structure.
Each book had its own unique challenges, so I had to experiment with different combinations of preprocessing techniques to achieve the best results.
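To make this concrete, here is a minimal sketch of one such preprocessing chain using OpenCV. The function names, parameter values, and the exact order of operations are illustrative assumptions; as noted above, each book needed its own combination in practice.

```python
import cv2
import numpy as np

def deskew(gray):
    """Estimate the dominant skew angle from ink pixels and rotate to correct it."""
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)  # ink pixels
    angle = cv2.minAreaRect(coords)[-1]  # angle of the tightest box around the ink
    if angle > 45:                       # fold OpenCV's (0, 90] range into [-45, 45]
        angle -= 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

def preprocess_page(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gray = deskew(gray)
    gray = cv2.fastNlMeansDenoising(gray, h=10)           # denoising / speckle removal
    gray = cv2.convertScaleAbs(gray, alpha=1.5, beta=0)   # contrast boost for faded ink
    # Adaptive thresholding copes with uneven illumination on aged paper
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    # Light morphological closing reconnects broken character strokes
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```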

Text Detection
For the crucial task of text detection, I used the CRAFT (Character Region Awareness for Text detection) model. CRAFT is a state-of-the-art deep learning approach known for its ability to accurately detect individual word or character-level bounding boxes — even in complex, irregular layouts, which are common in historical documents.
During experimentation, I tested several other models, including PSENet, PaddleOCR, and even the more traditional PyTesseract. CRAFT consistently outperformed the others in terms of precision and robustness, especially on noisy, skewed, or degraded manuscript scans.
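For illustration, word-level detection with CRAFT can be run through the community craft-text-detector wrapper as sketched below. The package, its options, and the return format are assumptions about that wrapper, not necessarily what this project's pipeline used.

```python
# Sketch using the community `craft-text-detector` wrapper
# (pip install craft-text-detector). The project may instead have used
# the original CRAFT repository directly.
from craft_text_detector import Craft

craft = Craft(
    output_dir="detections",  # crops and visualizations are written here
    crop_type="box",          # rectangular crops, convenient for word images
    cuda=False,               # set True if a GPU is available
)

result = craft.detect_text("data/pages/page_001.png")
print(f"Detected {len(result['boxes'])} text regions")  # each box is a 4-point polygon

# Free model weights once all pages are processed
craft.unload_craftnet_model()
craft.unload_refinenet_model()
```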

Aligning Detection with Ground Truth
Once the text regions were accurately detected, the next big challenge was mapping each bounding box to its correct label from the transcript. This turned out to be one of the most complex and time-consuming parts of the pipeline.
To tackle this, I used Tesseract OCR (via PyTesseract) fine-tuned for Spanish to extract raw text from each detected box. Then, I compared the extracted word with the corresponding transcript using text-similarity matching. To maintain high data integrity, I set a strict similarity threshold of 0.8: only matches above this threshold were accepted.
While this meant that only about 50% of the detected boxes could be confidently matched to a label, it significantly reduced the risk of introducing incorrect annotations. Despite the tradeoff in quantity, this method allowed me to build a clean, high-quality dataset of approximately 4,800 word-label pairs, laying a solid foundation for model training and evaluation.
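The matching logic itself is straightforward to sketch. Assuming candidate words from the transcript are available for each region, something like the following (using difflib from the Python standard library, with the 0.8 threshold described above) accepts a label only when the OCR output is close enough; the helper names are illustrative:

```python
from difflib import SequenceMatcher

import pytesseract
from PIL import Image

SIM_THRESHOLD = 0.8  # strict: only ~50% of boxes matched, but few bad labels

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_box_to_transcript(crop: Image.Image, candidates: list[str]):
    """OCR one word crop with the Spanish model and return the best transcript
    word, or None if nothing clears the threshold. (Illustrative helper.)"""
    raw = pytesseract.image_to_string(crop, lang="spa", config="--psm 8").strip()
    if not raw:
        return None
    best = max(candidates, key=lambda word: similarity(raw, word))
    return best if similarity(raw, best) >= SIM_THRESHOLD else None
```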

Preparing Data for Image-to-Image Translation (GAN)
To train an image-to-image GAN for handwriting generation, creating a clean and consistent input format was essential. I began by designing synthetic templates for each word using the RomanAntique font, which closely resembles the elegant typefaces of 17th-century Renaissance manuscripts. This not only provided visual consistency but also helped maintain the historical aesthetic of the generated outputs.
Each word was rendered as a 64×128 grayscale image. To improve GPU efficiency and enable a deeper and more expressive generator architecture, I stacked 8 such word images in a 4×2 grid, creating a single 256×256 composite image as input to the GAN.
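Below is a minimal Pillow sketch of that rendering-and-stacking step, assuming a local copy of the RomanAntique font file; the path and helper names are placeholders:

```python
from PIL import Image, ImageDraw, ImageFont

FONT = ImageFont.truetype("fonts/RomanAntique.ttf", 40)  # placeholder path

def render_word(word: str, size=(128, 64)) -> Image.Image:
    """Render one word as a 64x128 grayscale template (white page, black ink)."""
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    left, top, right, bottom = draw.textbbox((0, 0), word, font=FONT)
    x = (size[0] - (right - left)) // 2 - left   # center horizontally
    y = (size[1] - (bottom - top)) // 2 - top    # center vertically
    draw.text((x, y), word, font=FONT, fill=0)
    return img

def stack_grid(words: list[str]) -> Image.Image:
    """Tile 8 word templates in a 4-row x 2-column grid -> one 256x256 GAN input."""
    assert len(words) == 8
    grid = Image.new("L", (256, 256), color=255)
    for i, word in enumerate(words):
        row, col = divmod(i, 2)
        grid.paste(render_word(word), (col * 128, row * 64))
    return grid
```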
However, GANs are highly sensitive to noisy or inconsistent training data. To address this, I manually reviewed and cleaned the dataset — removing incorrectly mapped samples, duplicates, and any artifacts that could degrade model performance. This was a tedious but critical step to ensure stable training and high-quality results.
This structured, high-integrity layout allowed the model to capture spatial dependencies across multiple words, which ultimately led to richer, more authentic handwriting generation.

Phase II: Building Generative Models for Synthetic Image Generation
To generate realistic synthetic pages, I employed two distinct and complementary methodologies: a procedural approach focused on algorithmic degradation and a deep learning approach focused on data-driven style translation.
Approach 1: Deep Learning-Based Style Translation with a GAN
This method uses a data-driven approach to learn and replicate the subtle stylistic nuances of the historical text at the word level.
Architecture Overview
To achieve this, I used an image-to-image Generative Adversarial Network (GAN) inspired by the Pix2Pix architecture. This model was particularly suited for our task because it’s designed to learn a direct mapping from a source domain (clean, synthetic text) to a target domain (real, historical handwriting).
What Are GANs?
Generative Adversarial Networks (GANs) are a type of AI that can learn to create new images by studying examples. You can think of them as a game between two players: a generator, which tries to create fake images that look real, and a discriminator, which tries to tell whether an image is real or fake. As they compete, both get better — until the generator becomes so good that its images are almost indistinguishable from real ones. This technique is especially powerful for generating things like realistic handwriting, art, or even human faces.

The architecture consists of two key components:
- Generator (U-Net): The generator follows a U-Net architecture, which can be thought of as an autoencoder with skip connections between encoding and decoding layers. These skip connections allow low-level spatial features (like edges and contours) to bypass the bottleneck and flow directly into the decoding layers. This results in sharper and more realistic outputs, especially important for preserving fine text details in handwriting generation.
- Discriminator (PatchGAN): Instead of evaluating the entire image at once, the PatchGAN discriminator breaks it down into smaller patches (e.g., 70×70) and classifies each patch as real or fake. This approach forces the generator to produce locally coherent textures and style details, encouraging realism at a finer level. It's particularly effective for tasks like ours where local texture (e.g., ink strokes) matters more than global structure.
This adversarial setup allows the model to learn style transfer implicitly, without requiring explicit rules or handcrafted features.
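To make the PatchGAN idea concrete, here is a minimal PyTorch sketch of a Pix2Pix-style patch discriminator. The channel widths and layer counts follow the common 70×70 PatchGAN defaults, which may differ from the exact model trained here; it takes the (input, output) pair concatenated along the channel axis and returns a grid of per-patch real/fake scores:

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch, stride=2, norm=True):
    """Conv -> (InstanceNorm) -> LeakyReLU, the standard PatchGAN building block."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchDiscriminator(nn.Module):
    """Classifies overlapping patches of a (condition, image) pair as real/fake."""
    def __init__(self, in_channels=2):  # 1-channel input + 1-channel target, concatenated
        super().__init__()
        self.net = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # 1 logit per patch
        )

    def forward(self, condition, image):
        return self.net(torch.cat([condition, image], dim=1))

# For a 256x256 grayscale pair this yields a 30x30 grid of patch scores:
d = PatchDiscriminator()
scores = d(torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 30, 30])
```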
Training the GAN: Loss Functions & Strategy
Training a GAN requires careful coordination between two competing networks: the generator, which creates images, and the discriminator, which evaluates them. I trained the model for 100 epochs using the Adam optimizer with a batch size of 32 and a learning rate of 2e-4.
Generator Loss
The generator is trained using a hybrid loss function that combines Binary Cross-Entropy (BCE) loss and L1 loss. The BCE component, also known as the GAN loss, measures how effectively the generator fools the discriminator — if the discriminator classifies the generated image as real, the generator is rewarded. The L1 loss ensures pixel-wise similarity between the generated and ground truth handwritten images. To emphasize structural accuracy, the L1 loss is weighted by a factor of 100. This combination encourages the generator to produce outputs that are both realistic and closely aligned with the actual target images.
Total Generator Loss = GAN Loss + 100 × L1 Loss
Discriminator Loss
The discriminator is trained to distinguish between real (input paired with ground truth) and fake (input paired with generated output) image pairs using Binary Cross-Entropy loss. It assigns a label of 1 to real pairs and 0 to fake pairs, computing the loss for both cases. This adversarial setup sharpens the discriminator’s ability to detect fake images while simultaneously pushing the generator to produce more convincing and realistic outputs.
Total Discriminator Loss = 0.5 × (Loss_real + Loss_fake)
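Putting the two losses together, a single training step along these lines looks roughly like the following PyTorch sketch (assuming BCE-with-logits on raw PatchGAN outputs and the 100× L1 weighting; gen, disc, and the variable names are illustrative):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # GAN loss, applied to raw patch logits
l1 = nn.L1Loss()
LAMBDA_L1 = 100.0             # weight on pixel-wise similarity

def train_step(gen, disc, opt_g, opt_d, x, y):
    """x: clean synthetic template batch, y: matching historical target batch."""
    # --- Discriminator: real pairs -> 1, fake pairs -> 0 ---
    fake = gen(x)
    pred_real = disc(x, y)
    pred_fake = disc(x, fake.detach())  # don't backprop into the generator here
    loss_d = 0.5 * (bce(pred_real, torch.ones_like(pred_real))
                    + bce(pred_fake, torch.zeros_like(pred_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator: fool the discriminator while staying close to ground truth ---
    pred_fake = disc(x, fake)
    loss_g = bce(pred_fake, torch.ones_like(pred_fake)) + LAMBDA_L1 * l1(fake, y)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```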

Final Results
The model successfully generated high-quality synthetic handwritten images, making it suitable for training OCR systems. Given the limited size of the training dataset, I intentionally allowed slight overfitting to ensure strong performance on Spanish words, which were the primary focus of the task.

Approach 2: Procedural Generation with Algorithmic Degradation
This method involves building a synthetic page from the ground up by layering various computer-generated effects. The goal is to simulate the physical artifacts of historical printing through a controlled, step-by-step process.
- Start with Clean Digital Text: Begin with a plain text file containing the desired content, typically historical or domain-specific text.
- Custom Font Rendering: Render the text using a carefully selected font that closely mimics Renaissance-era handwriting or typefaces to match the historical style.
- Aged Background Simulation: Add a paper-like background that replicates the look of aged, yellowed, or stained parchment typically seen in historical documents.
- Ink Degradation & Bleed-through: Apply effects that simulate faded ink, irregular strokes, and subtle ink bleeding to imitate wear from age and printing limitations.
- Noise & Printing Artifacts: Introduce random visual noise, ink patches, blotches, or faded sections to mimic imperfections found in old manuscripts and printed pages.
- Layout Control: Adjust line spacing, text alignment, and margin irregularities to recreate the uneven, hand-aligned layout of historical pages.

This technique provides a high degree of control and consistency in the generation process, resulting in visually realistic and stylistically accurate synthetic pages that closely resemble historical documents. Additionally, since the text is rendered programmatically, it allows for the automatic extraction of precise word-level and line-level bounding boxes, making it especially well-suited for training OCR models.
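A condensed sketch of such a procedural pipeline with Pillow and NumPy is shown below; the font path, tone values, and noise parameters are illustrative assumptions rather than the project's exact settings. Note how word-level bounding boxes fall out of the rendering step for free:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def synth_page(lines, font_path="fonts/RomanAntique.ttf", size=(800, 1100)):
    """Render lines of text on a parchment tone, then degrade ink and paper."""
    page = Image.new("L", size, 210)                     # aged-paper brightness
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, 32)

    boxes = []                                           # free word-level ground truth
    y = 60
    for line in lines:
        x = 70 + np.random.randint(-4, 5)                # uneven hand-set margins
        for word in line.split():
            l, t, r, b = draw.textbbox((x, y), word, font=font)
            draw.text((x, y), word, font=font, fill=40)  # dark, slightly faded ink
            boxes.append((word, (l, t, r, b)))
            x = r + 14
        y += 48 + np.random.randint(-3, 4)               # irregular line spacing

    page = page.filter(ImageFilter.GaussianBlur(0.6))    # ink spread / bleed
    arr = np.asarray(page, dtype=np.float32)
    arr += np.random.normal(0, 8, arr.shape)             # grain and print noise
    page = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return page, boxes

page, boxes = synth_page(["En un lugar de la Mancha", "de cuyo nombre no quiero"])
page.save("synthetic_page.png")
```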

Conclusion
By combining both deep learning and procedural techniques, I was able to generate high-quality synthetic data that mimics the unique visual characteristics of Renaissance-era texts and can be used to train OCR models. The dual approach not only provided stylistic realism through GAN-based handwriting generation, but also offered control and annotation precision via algorithmic degradation.
You can also find more details about this project on the GSoC platform: GSoC 2025 Project.