Unveiling phi-3-vision: Architecture, Pre-training, and Post-training for Visual AI

Explore the technical specifications of phi-3-vision, detailing its CLIP + phi-3-mini-128K architecture, diverse multimodal pre-training dataset, and dual-stage post-training for strong image-text reasoning.


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

Abstract and 1 Introduction

2 Technical Specifications

3 Academic benchmarks

4 Safety

5 Weakness

6 Phi-3-Vision

6.1 Technical Specifications

6.2 Academic benchmarks

6.3 Safety

6.4 Weakness

References

A Example prompt for benchmarks

B Authors (alphabetical)

C Acknowledgements

6.1 Technical Specifications

Architecture The Phi-3-Vision (4.2B parameters) is a multimodal model designed to process an image and a textual prompt as inputs and generate textual outputs. The model is composed of two primary components: an image encoder, i.e., CLIP ViT-L/14 [RKH+21], and a transformer decoder, i.e., phi-3-mini-128K-instruct. Visual tokens extracted by the image encoder are combined with text tokens in an interleaved way, with no fixed ordering imposed on image and text tokens. To accommodate high-resolution images and various aspect ratios, a dynamic cropping strategy [DZZ+24b] is utilized to split the input image into a 2D array of blocks, where the tokens of the blocks are concatenated to represent the whole image.
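As a rough illustration of the dynamic-cropping idea, the sketch below splits a high-resolution image into a 2D grid of fixed-size blocks and concatenates the per-block visual tokens. The block size and the `encoder` callable are hypothetical placeholders, not the actual Phi-3-Vision implementation.

```python
# Minimal sketch (not the authors' code) of dynamic cropping: split a
# high-resolution image into a 2D array of fixed-size blocks, encode each
# block, and concatenate the block tokens to represent the whole image.
from PIL import Image

BLOCK = 336  # hypothetical crop size matching the vision encoder's input


def crop_into_blocks(image: Image.Image, block: int = BLOCK):
    """Split an image into a row-major 2D grid of block-sized crops.

    For simplicity the image is first resized to the nearest multiple of
    the block size; a real implementation would pick the grid to preserve
    the aspect ratio as closely as possible.
    """
    w, h = image.size
    cols, rows = max(1, w // block), max(1, h // block)
    image = image.resize((cols * block, rows * block))
    blocks = []
    for r in range(rows):
        for c in range(cols):
            box = (c * block, r * block, (c + 1) * block, (r + 1) * block)
            blocks.append(image.crop(box))
    return blocks, (rows, cols)


def image_tokens(image: Image.Image, encoder):
    """Encode each block and concatenate the resulting visual tokens.

    `encoder` is a hypothetical callable mapping a block to a list of
    visual tokens; the concatenated result is later interleaved with
    text tokens in the decoder's input sequence.
    """
    blocks, _ = crop_into_blocks(image)
    per_block = [encoder(b) for b in blocks]
    return [tok for block_tokens in per_block for tok in block_tokens]
```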

Pre-training The Phi-3-Vision model undergoes a pre-training phase using a diverse dataset, which consists of a combination of interleaved image-text documents (e.g., [LST+24]), image-text pairs from FLD-5B [XWX+24], synthetic data derived from Optical Character Recognition (OCR) of PDF files, datasets for chart/table comprehension, and text-only data. The next-token prediction objective is applied only to text tokens, while any loss associated with image tokens is disregarded during this phase. The pre-training process involves a total of 0.5T tokens spanning both visual and text elements. During pre-training, the maximum image resolution is capped at 1344 × 1344, as the majority of the training images are smaller than this resolution.
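The loss masking described above can be illustrated with a short PyTorch-style sketch: next-token prediction is applied only where the label is a text token, while positions holding image tokens are mapped to the ignore index. The tensor shapes and the `is_image_token` mask are assumptions for illustration, not the paper's exact interface.

```python
# Minimal sketch, assuming PyTorch, of next-token prediction where the loss
# is computed only on text tokens: positions that hold image tokens are set
# to the ignore index so they contribute no gradient.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # F.cross_entropy skips targets with this value


def text_only_lm_loss(logits: torch.Tensor,
                      input_ids: torch.Tensor,
                      is_image_token: torch.Tensor) -> torch.Tensor:
    """
    logits:         (batch, seq_len, vocab) decoder outputs
    input_ids:      (batch, seq_len) interleaved image/text token ids
    is_image_token: (batch, seq_len) bool mask, True where the token is visual
    """
    labels = input_ids.clone()
    labels[is_image_token] = IGNORE_INDEX          # disregard image-token loss
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```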

Post-training The Phi-3-Vision model goes through two post-training stages: supervised fine-tuning (SFT) and direct preference optimization (DPO). For SFT, we leveraged a text SFT dataset and public multimodal instruction-tuning datasets, along with large-scale multimodal instruction-tuning datasets that we built ourselves, covering diverse domains and tasks such as general natural-image understanding, chart/table/diagram understanding and reasoning, PowerPoint understanding, and model safety. The multimodal SFT data totals about 15B tokens. For DPO, we mainly use a text DPO dataset and a relatively smaller-scale multimodal DPO dataset. In both stages, we jointly train on multimodal tasks and text-only tasks so that the model achieves multimodal reasoning while retaining language capabilities as much as possible.
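A minimal sketch of the joint multimodal / text-only training mix might look like the following; the mixing ratio, loader names, and sampling scheme are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of jointly training on multimodal and text-only data during
# post-training: batches are drawn from both sources according to a mixing
# probability, so language ability is preserved while multimodal skills are
# learned. The 0.7 ratio below is an illustrative assumption.
import random
from itertools import cycle


def mixed_batches(multimodal_loader, text_loader,
                  multimodal_ratio: float = 0.7, steps: int = 1000):
    """Yield training batches, picking the multimodal source with
    probability `multimodal_ratio` and the text-only source otherwise."""
    mm_iter, txt_iter = cycle(multimodal_loader), cycle(text_loader)
    for _ in range(steps):
        if random.random() < multimodal_ratio:
            yield next(mm_iter)   # image + text instruction-tuning batch
        else:
            yield next(txt_iter)  # text-only batch to retain language skills
```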


:::info Authors:

(1) Marah Abdin;

(2) Sam Ade Jacobs;

(3) Ammar Ahmad Awan;

(4) Jyoti Aneja;

(5) Ahmed Awadallah;

(6) Hany Awadalla;

(7) Nguyen Bach;

(8) Amit Bahree;

(9) Arash Bakhtiari;

(10) Jianmin Bao;

(11) Harkirat Behl;

(12) Alon Benhaim;

(13) Misha Bilenko;

(14) Johan Bjorck;

(15) Sébastien Bubeck;

(16) Qin Cai;

(17) Martin Cai;

(18) Caio César Teodoro Mendes;

(19) Weizhu Chen;

(20) Vishrav Chaudhary;

(21) Dong Chen;

(22) Dongdong Chen;

(23) Yen-Chun Chen;

(24) Yi-Ling Chen;

(25) Parul Chopra;

(26) Xiyang Dai;

(27) Allie Del Giorno;

(28) Gustavo de Rosa;

(29) Matthew Dixon;

(30) Ronen Eldan;

(31) Victor Fragoso;

(32) Dan Iter;

(33) Mei Gao; 

(34) Min Gao;

(35) Jianfeng Gao;

(36) Amit Garg;

(37) Abhishek Goswami;

(38) Suriya Gunasekar;

(39) Emman Haider;

(40) Junheng Hao;

(41) Russell J. Hewett;

(42) Jamie Huynh;

(43) Mojan Javaheripi;

(44) Xin Jin;

(45) Piero Kauffmann;

(46) Nikos Karampatziakis;

(47) Dongwoo Kim;

(48) Mahoud Khademi;

(49) Lev Kurilenko; 

(50) James R. Lee;

(51) Yin Tat Lee;

(52) Yuanzhi Li;

(53) Yunsheng Li;

(54) Chen Liang;

(55) Lars Liden;

(56) Ce Liu;

(57) Mengchen Liu;

(58) Weishung Liu;

(59) Eric Lin;

(60) Zeqi Lin;

(61) Chong Luo;

(62) Piyush Madan;

(63) Matt Mazzola;

(64) Arindam Mitra;

(65) Hardik Modi;

(66) Anh Nguyen;

(67) Brandon Norick;

(68) Barun Patra;

(69) Daniel Perez-Becker;

(70) Thomas Portet; 

(71) Reid Pryzant;

(72) Heyang Qin;

(73) Marko Radmilac;

(74) Corby Rosset;

(75) Sambudha Roy; 

(76) Olatunji Ruwase;

(77) Olli Saarikivi;

(78) Amin Saied;

(79) Adil Salim;

(80) Michael Santacroce;

(81) Shital Shah;

(82) Ning Shang;

(83) Hiteshi Sharma;

(84) Swadheen Shukla;

(85) Xia Song;

(86) Masahiro Tanaka;

(87) Andrea Tupini;

(88) Xin Wang;

(89) Lijuan Wang; 

(90) Chunyu Wang;

(91) Yu Wang;

(92) Rachel Ward;

(93) Guanhua Wang;

(94) Philipp Witte; 

(95) Haiping Wu; 

(96) Michael Wyatt; 

(97) Bin Xiao;

(98) Can Xu; 

(99) Jiahang Xu; 

(100) Weijian Xu; 

(101) Sonali Yadav; 

(102) Fan Yang; 

(103) Jianwei Yang;

(104) Ziyi Yang;

(105) Yifan Yang; 

(106) Donghan Yu;

(107) Lu Yuan;

(108) Chengruidong Zhang; 

(109) Cyril Zhang; 

(110) Jianwen Zhang;

(111) Li Lyna Zhang;

(112) Yi Zhang;

(113) Yue Zhang;

(114) Yunan Zhang;

(115) Xiren Zhou.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
