MindDiffuser-Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
🔥INFO
Blog: 2025/07/21 by IgniSavium
- Title: MindDiffuser-Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
- Authors: Yizhuo Lu, Huiguang He et al. (CAS)
- Published: March 2023
- Comment: ACM Multimedia
- URL: https://arxiv.org/abs/2303.14139
🥜TLDR: Preserve structural information in fMRI-to-image reconstruction by iteratively backpropagating Stable Diffusion inputs guided by CLIP low-layer feature supervision.
Motivation
The research aims to address the challenge of reconstructing visual stimuli from fMRI data by developing a two-stage model (based on Stable Diffusion) that simultaneously ensures semantic richness and structural alignment, overcoming limitations in previous methods that either lacked clear semantic details or controlled structural features like position and orientation.
Model
Architecture
Stage-1: linear transform fMRI into original visual latent state \(z\) and text condition \(c\) in Stable Diffusion img2img process.
Stage-2: iteratively backpropagate \(z,c\) to align the lower CLIP layer features (which contains structural information such as position and orientation) between generated images and original images (extracted via trained linear transformation from fMRI)
Evaluation
Performance
- CLIP last layer cosine similarity for semantic comparison
SSIM (Structural Similarity Index Measure)
- From -1 to 1, but typically between 0 and 1;
- 1 indicates that the two images are structurally identical.
Where:
- \(\mu_x, \mu_y\): The mean values (brightness) of images \(x\) and \(y\);
- \(\sigma_x^2, \sigma_y^2\): Variance (contrast);
- \(\sigma_{xy}\): Covariance (structural similarity);
- \(C_1, C_2\): Stabilizing constants to prevent division by zero.
Per-pixel correlation coefficient (Pixel-wise Correlation Coefficient)
- From -1 to 1;
- 1 indicates perfect positive correlation (every pixel matches);
- 0 indicates no correlation;
- -1 indicates perfect negative correlation.
Assuming the two images are flattened into two vectors \(x\) and \(y\), the correlation coefficient is: $$ r= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$ Where:
- \(\bar{x}, \bar{y}\) are the mean pixel values of the two images;
- This is essentially the Pearson correlation coefficient from statistics, applied to the pixel arrays of the images.