Vision Transformer (ViT) from Scratch in PyTorch

For years, Convolutional Neural Networks (CNNs) ruled computer vision. But since the paper “An Image is Worth 16×16 Words”, the Vision Transformer (ViT) has challenged CNNs by treating an image as a sequence of patches, much as words form a sentence.

In this post, we’ll walk through a PyTorch implementation of ViT, trained on a small food classification dataset (pizza, steak, sushi).

Core Idea

(Figure: architecture of the ViT.)

  • Split an image into fixed-size patches (e.g., 16×16).
  • Flatten patches into vectors → feed them as tokens.
  • Add:

    • [CLS] Token → represents the entire image for classification.
    • Positional Embeddings → retain spatial info.
  • Process the sequence with a Transformer Encoder.
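Here is a minimal sketch of those steps in PyTorch. The class and parameter names are my own, not from the original code; the stride-16 Conv2d is a standard trick that is equivalent to flattening each 16×16 patch and applying a shared linear layer.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to a 768-d vector."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        x = self.proj(x)                         # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)      # (B, 196, 768)

class ViTEmbedder(nn.Module):
    """Prepend the [CLS] token and add learned positional embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # (B, 1, 768)
        x = torch.cat([cls, self.patch_embed(x)], dim=1)  # (B, 197, 768)
        return x + self.pos_embed
```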

ViT-Base Config

  • Image size: 224×224
  • Patch size: 16×16 → 196 tokens
  • Embedding dim: 768
  • Layers: 12
  • Attention heads: 12
  • Params: ~85.8M
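The token count follows directly from the config: 224 / 16 = 14 patches per side, so 14² = 196 patch tokens (197 with [CLS]). A minimal sketch of the encoder stack using PyTorch's built-in layers (the dict keys are illustrative, not from the original code):

```python
import torch.nn as nn

# ViT-Base hyperparameters.
cfg = dict(img_size=224, patch_size=16, embed_dim=768, depth=12,
           num_heads=12, mlp_dim=3072)  # mlp_dim = 4 * embed_dim

num_patches = (cfg["img_size"] // cfg["patch_size"]) ** 2   # 14 * 14 = 196

# Pre-norm transformer encoder, as used by ViT.
layer = nn.TransformerEncoderLayer(
    d_model=cfg["embed_dim"], nhead=cfg["num_heads"],
    dim_feedforward=cfg["mlp_dim"], activation="gelu",
    batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=cfg["depth"])
```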

Dataset

We used a 3-class dataset:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

All images resized to 224×224.
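Assuming the standard ImageFolder layout (one subdirectory per class; the paths below are placeholders), loading and resizing might look like:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # ViT expects a fixed 224x224 input
    transforms.ToTensor(),
])

# Placeholder paths: data/train/pizza, data/train/steak, data/train/sushi, ...
train_data = datasets.ImageFolder("data/train", transform=transform)
test_data  = datasets.ImageFolder("data/test",  transform=transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=32, shuffle=False)
```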

Training Setup

  • Optimizer: Adam
  • Loss: CrossEntropyLoss
  • Learning rate: 0.001
  • Batch size: 32
  • Epochs: 10
  • Device: GPU (CUDA)
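Putting that setup into code, a minimal training loop might look like this (`model` and `train_loader` stand in for the objects built earlier):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = loss_fn(model(images), labels)   # logits: (B, 3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```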

Results

(Figure: plots of the training and testing loss and accuracy.)

  • Training loss → drops quickly; ViT-Base has more than enough capacity to fit a small dataset.
  • Validation loss → may plateau or even rise, the classic sign of overfitting.
  • Accuracy → training accuracy approaches 100%, while validation accuracy reflects true generalization.

ViTs are large models, and on small datasets they overfit quickly. For real use, start from a pretrained ViT and fine-tune it.
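For example, with the timm library (the model name below is one of timm's pretrained ViT-Base checkpoints):

```python
import timm

# ImageNet-pretrained ViT-Base with a fresh 3-class head.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=3)

# Optionally freeze the backbone and fine-tune only the classification head.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False
```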

Takeaways

  • ViT proves attention works for vision, not just text.
  • Even a from-scratch implementation highlights the shift from pixels → patches → tokens.
  • Next steps:

    • Try on larger datasets (CIFAR-100, ImageNet subset).
    • Use pretrained weights (HuggingFace, timm).
    • Experiment with augmentations (Mixup, CutMix).

