Vision Transformer (ViT) from Scratch in PyTorch

For years, Convolutional Neural Networks (CNNs) ruled computer vision. But since the paper “An Image is Worth 16×16 Words”, the Vision Transformer (ViT) has challenged CNNs by treating an image as a sequence of patches, much as words form a sentence.

In this post, we’ll walk through a PyTorch implementation of ViT, trained on a small food classification dataset (pizza, steak, sushi).

Core Idea

(Figure: architecture of the ViT.)

  • Split an image into fixed-size patches (e.g., 16×16).
  • Flatten patches into vectors → feed them as tokens.
  • Add:

    • [CLS] Token → represents the entire image for classification.
    • Positional Embeddings → retain spatial info.
  • Process the sequence with a Transformer Encoder.
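Here is a minimal sketch of those steps in PyTorch. The class and parameter names are my own, not from the original code; the stride-16 Conv2d is a standard trick that is equivalent to flattening each 16×16 patch and applying a shared linear layer.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to a 768-d vector."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        x = self.proj(x)                         # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)      # (B, 196, 768)

class ViTEmbedder(nn.Module):
    """Prepend the [CLS] token and add learned positional embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # (B, 1, 768)
        x = torch.cat([cls, self.patch_embed(x)], dim=1)  # (B, 197, 768)
        return x + self.pos_embed
```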

ViT-Base Config

  • Image size: 224×224
  • Patch size: 16×16 → 196 tokens
  • Embedding dim: 768
  • Layers: 12
  • Attention heads: 12
  • Params: ~85.8M
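The token count follows directly from the config: 224 / 16 = 14 patches per side, so 14² = 196 patch tokens (197 with [CLS]). A minimal sketch of the encoder stack using PyTorch's built-in layers (the dict keys are illustrative, not from the original code):

```python
import torch.nn as nn

# ViT-Base hyperparameters.
cfg = dict(img_size=224, patch_size=16, embed_dim=768, depth=12,
           num_heads=12, mlp_dim=3072)  # mlp_dim = 4 * embed_dim

num_patches = (cfg["img_size"] // cfg["patch_size"]) ** 2   # 14 * 14 = 196

# Pre-norm transformer encoder, as used by ViT.
layer = nn.TransformerEncoderLayer(
    d_model=cfg["embed_dim"], nhead=cfg["num_heads"],
    dim_feedforward=cfg["mlp_dim"], activation="gelu",
    batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=cfg["depth"])
```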

Dataset

We used a 3-class dataset:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

All images resized to 224×224.
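Assuming the standard ImageFolder layout (one subdirectory per class; the paths below are placeholders), loading and resizing might look like:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # ViT expects a fixed 224x224 input
    transforms.ToTensor(),
])

# Placeholder paths: data/train/pizza, data/train/steak, data/train/sushi, ...
train_data = datasets.ImageFolder("data/train", transform=transform)
test_data  = datasets.ImageFolder("data/test",  transform=transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=32, shuffle=False)
```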

Training Setup

  • Optimizer: Adam
  • Loss: CrossEntropyLoss
  • Learning rate: 0.001
  • Batch size: 32
  • Epochs: 10
  • Device: GPU (CUDA)
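Putting that setup into code, a minimal training loop might look like this (`model` and `train_loader` stand in for the objects built earlier):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = loss_fn(model(images), labels)   # logits: (B, 3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```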

Results

(Figure: plots of the training and testing loss and accuracy.)

  • Training loss → drops quickly; ViT-Base has more than enough capacity to fit a small dataset.
  • Validation loss → may plateau or even rise, the classic sign of overfitting.
  • Accuracy → training accuracy approaches 100%, while validation accuracy reflects true generalization.

ViTs are large models, and on small datasets they overfit quickly. For real use, start from a pretrained ViT and fine-tune it.
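For example, with the timm library (the model name below is one of timm's pretrained ViT-Base checkpoints):

```python
import timm

# ImageNet-pretrained ViT-Base with a fresh 3-class head.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=3)

# Optionally freeze the backbone and fine-tune only the classification head.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False
```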

Takeaways

  • ViT proves attention works for vision, not just text.
  • Even a from-scratch implementation highlights the shift from pixels → patches → tokens.
  • Next steps:

    • Try on larger datasets (CIFAR-100, ImageNet subset).
    • Use pretrained weights (HuggingFace, timm).
    • Experiment with augmentations (Mixup, CutMix).

