High-resolution image reconstruction with latent diffusion models from human brain activity

🔥INFO

Blog: 2025/07/21 by IgniSavium

Title: High-resolution image reconstruction with latent diffusion models from human brain activity
Authors: Yu Takagi, Shinji Nishimoto (Osaka University)
Published: March 2023
Comment: CVPR
URL: https://www.biorxiv.org/content/10.1101/2022.11.18.517004v3

🥜TLDR: Image reconstruction from fMRI inputs with a pretrained latent diffusion model (Stable Diffusion); the generative model itself is neither trained nor fine-tuned.

Motivation

This research aims to reconstruct high-resolution, semantically accurate images from human brain activity (fMRI) using latent diffusion models (LDMs) without training or fine-tuning the generative model. Previous approaches based on deep generative models required complex training and fine-tuning. The proposed method is more efficient and more interpretable, and has minimal computational cost.

Model

Architecture

The overall framework is a Latent Diffusion Model (specifically Stable Diffusion): $\mathcal{E}$ denotes an image encoder and $D$ an image decoder (together they form an autoencoder).

Only two small linear parts are trained:

(1) fMRI -> latent $z$ of the original image (a simple model that gives only a coarse approximation)

(2) fMRI -> conditional "text" (i.e. semantic) representation $c$
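The two trained parts above can be sketched as a pair of linear regressions. This is a minimal illustration on synthetic data, not the authors' code: the ridge (L2) penalty, the toy dimensions, and all names (`fit_ridge`, `fmri`, `W_z`, `W_c`) are assumptions made for the example, and voxel selection is not modeled.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

rng = np.random.default_rng(0)
n_train, n_test, n_voxels = 200, 50, 40
z_dim, c_dim = 16, 8  # toy sizes; real latents are far larger

# Synthetic stand-ins: fMRI responses and the latents they should predict.
true_Wz = rng.standard_normal((n_voxels, z_dim))
true_Wc = rng.standard_normal((n_voxels, c_dim))
fmri = rng.standard_normal((n_train + n_test, n_voxels))
z_all = fmri @ true_Wz + 0.1 * rng.standard_normal((n_train + n_test, z_dim))
c_all = fmri @ true_Wc + 0.1 * rng.standard_normal((n_train + n_test, c_dim))

# (1) fMRI -> image latent z, (2) fMRI -> semantic condition c.
W_z = fit_ridge(fmri[:n_train], z_all[:n_train])
W_c = fit_ridge(fmri[:n_train], c_all[:n_train])

# At test time the predicted z and c would be handed to the frozen LDM.
z_pred = fmri[n_train:] @ W_z
c_pred = fmri[n_train:] @ W_c
z_corr = np.corrcoef(z_pred.ravel(), z_all[n_train:].ravel())[0, 1]
c_corr = np.corrcoef(c_pred.ravel(), c_all[n_train:].ravel())[0, 1]
print(f"held-out correlation: z={z_corr:.3f}, c={c_corr:.3f}")
```

Because only `W_z` and `W_c` are fitted, the cost of training stays small compared with fine-tuning a generative model.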

Evaluation

No comparable previous work exists, so the paper provides mostly ablation and visualization analyses.

Ablation

Perceptual Similarity Metrics (PSMs):

$z$ represents low-level (visual) information, while $c$ captures high-level, abstract semantics.

Reverse Encoding Visualization

Map the features or components of the LDM back onto the fMRI signals, i.e. predict voxel activity from them, to find semantic relationships.
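One way to picture this reverse mapping is a voxel-wise encoding model: fit a linear map from a feature to every voxel, then score each voxel by how well its held-out activity is predicted. The sketch below uses synthetic data; the ridge penalty, the Pearson-correlation score, and all names (`fit_ridge`, `voxelwise_pearson`, `features`) are assumptions for illustration.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(pred, actual):
    """Pearson correlation between prediction and data, one value per voxel."""
    pred_c = pred - pred.mean(axis=0)
    act_c = actual - actual.mean(axis=0)
    num = (pred_c * act_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (act_c ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(0)
n_train, n_test, feat_dim, n_voxels = 300, 100, 10, 20
half = n_voxels // 2

# Toy feature (standing in for z, c, or z_c) and toy voxel responses:
# the first half of the voxels is driven by the feature, the rest is noise.
features = rng.standard_normal((n_train + n_test, feat_dim))
true_W = rng.standard_normal((feat_dim, half))
voxels = rng.standard_normal((n_train + n_test, n_voxels))
voxels[:, :half] += features @ true_W

W = fit_ridge(features[:n_train], voxels[:n_train])
accuracy = voxelwise_pearson(features[n_train:] @ W, voxels[n_train:])

print("driven voxels:", accuracy[:half].mean().round(3))
print("noise voxels: ", accuracy[half:].mean().round(3))
```

Plotting `accuracy` on the cortical surface for each feature would give the kind of per-region comparison discussed in this section.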

$z,c,z_c$ features

The latent representations $z$, $c$, and $z_c$ differ in how well they predict activity across visual cortex regions: $z$ performs well in early visual cortex, $c$ excels in higher visual cortex, and $z_c$ closely resembles $z$ in its performance, even though the two represent visually different images.

$z,c$ semantic balancing with varying noise level

As the noise level increased, the noise-added latent representation ($z_c$) predicted voxel activity in higher visual cortex better than the original representation ($z$) did (🤔 thanks to the high-level condition $c$).
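The noise level controls how much of $z$ survives before denoising takes over. A minimal sketch of the standard DDPM-style forward noising step, $z_t = \sqrt{\bar\alpha_t}\, z + \sqrt{1-\bar\alpha_t}\, \epsilon$, makes this concrete. The linear beta schedule and the names (`add_noise`, `alpha_bars`) are assumptions for illustration and need not match the schedule used in the paper; the denoising step conditioned on $c$ is not shown.

```python
import numpy as np

def add_noise(z, alpha_bar, rng):
    """Forward diffusion: mix the latent z with Gaussian noise."""
    eps = rng.standard_normal(z.shape)
    return np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
z = rng.standard_normal(4096)  # toy latent

# Toy linear beta schedule (an assumption, chosen only for illustration).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

# How much of z is still present at low, medium, and high noise levels.
retained = {}
for t in (100, 500, 900):
    z_t = add_noise(z, alpha_bars[t], rng)
    retained[t] = np.corrcoef(z, z_t)[0, 1]
    print(f"t={t}: correlation with original z = {retained[t]:.3f}")
```

At high noise levels little of $z$ remains, so whatever the denoiser reconstructs must come mostly from the condition $c$, which is consistent with the 🤔 interpretation above.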

$z, c$ semantic balancing at different denoising timesteps

During the denoising process, the early stages were dominated by the original image representation ($z$), while in the middle stages the noise-added representation ($z_c$) better predicted activity in higher visual cortex (🤔 thanks to the high-level condition $c$).

Per-layer semantic interpretation of the denoising network

The U-Net bottleneck layer initially captures the most information across the cortex, but as denoising progresses, early U-Net layers become more predictive of early visual cortex activity while the bottleneck layer shifts to representing higher-level semantic information in higher visual areas.