MindEye2 unCLIP vs. Versatile Diffusion: Evaluating Image Generation from CLIP Latents


Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

A.6 UnCLIP Evaluation

Previous fMRI-to-image papers (Scotti et al., 2023; Ozcelik and VanRullen, 2023; Mai and Zhang, 2023) opted for Versatile Diffusion because it was state-of-the-art in reconstructing images from CLIP image latents with little variation. To compare the image generation capabilities of our unCLIP model with Versatile Diffusion, we computed Fréchet inception distance (FID) (Heusel et al., 2018) scores across 30,000 randomly sampled images from the COCO 2017 validation set. The images were center-cropped and scaled to 480 × 480 resolution. For Versatile Diffusion, we used Hugging Face's VersatileDiffusionDualGuidedPipeline with text_to_image set to 0 so that no input was taken from text.
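For readers who want to reproduce a comparable baseline, the sketch below shows one way to run this protocol with the Hugging Face diffusers pipeline and a torchmetrics FID implementation. It is a minimal sketch, not the authors' exact script: the image-path list (`coco_val_paths`), the crop helper, the choice of FID library, and the sampling defaults are assumptions.

```python
# Minimal sketch (assumptions noted above): regenerate COCO images with
# Versatile Diffusion's dual-guided pipeline conditioned only on the image
# (text_to_image_strength=0), then compute FID against the originals.
# Requires: pip install diffusers transformers torchmetrics[image]
import torch
from PIL import Image
from diffusers import VersatileDiffusionDualGuidedPipeline
from torchmetrics.image.fid import FrechetInceptionDistance
from torchvision.transforms import PILToTensor

device = "cuda"  # a GPU is assumed here

pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
pipe.remove_unused_weights()
pipe = pipe.to(device)


def center_crop_resize(img: Image.Image, size: int = 480) -> Image.Image:
    """Center-crop to a square, then scale to size x size (as described above)."""
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))


fid = FrechetInceptionDistance(feature=2048).to(device)  # expects uint8 tensors
to_uint8 = PILToTensor()

for path in coco_val_paths:  # hypothetical list of 30,000 sampled COCO image paths
    original = center_crop_resize(Image.open(path).convert("RGB"))

    # text_to_image_strength=0.0 means the text prompt contributes nothing,
    # so generation is guided purely by the CLIP image embedding.
    generated = pipe(
        prompt="",
        image=original,
        text_to_image_strength=0.0,
        height=480,
        width=480,
    ).images[0]

    fid.update(to_uint8(original).unsqueeze(0).to(device), real=True)
    fid.update(to_uint8(generated).unsqueeze(0).to(device), real=False)

print("FID:", fid.compute().item())
```

The same loop can score any unCLIP-style model by swapping the generation call, which is what makes the FID numbers for the two models directly comparable under identical preprocessing.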

Our unCLIP model fine-tuned from Stable Diffusion XL outperforms Versatile Diffusion in terms of returning the original image from CLIP latents (see Appendix 9). This difference is visually obvious as shown in Figure 9. Note that while we observed distortions in our unrefined fMRI-to-image reconstructions using our unCLIP model fine-tuned from SDXL, such distortions were rare when using the ground truth CLIP embeddings.

The ability of this unCLIP model to nearly perfectly return the original image also indicates that OpenCLIP ViT-bigG image embeddings effectively preserve the majority of the information inherent in the original pixel image, retaining both low-level structure and high-level semantic details.


:::info This paper is available on arXiv under a CC BY 4.0 DEED license.

:::

:::info Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC) (core contribution);

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC) (core contribution);

(4) Reese Kneeland, University of Minnesota (core contribution);

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).

:::


