BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction
🔥INFO
Blog: 2025/07/23 by IgniSavium
- Title: BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction
- Authors: Honghao Fu, Hao Wang, et al. (HKUST-Guangzhou)
- Published: December 2023
- URL: https://arxiv.org/abs/2312.14871
🥜TLDR: This paper introduces BrainVis, which reconstructs semantically accurate images from EEG by enhancing EEG representations with self-supervised learning (latent masked modeling) and CLIP-based alignment.
Motivation
This paper tackles the challenge of reconstructing semantically accurate images from noisy EEG signals. Previous methods (e.g., DreamDiffusion) suffer from weak EEG feature embeddings, reliance on large self-supervision datasets, and an inability to capture fine-grained semantics. BrainVis addresses these issues by enhancing the EEG representation through self-supervised learning (latent masked modeling) and frequency-domain features, and by improving cross-modal alignment with CLIP-based semantic interpolation, achieving superior results with significantly less training data.
Model
Architecture
The main goal is to map EEG signals to the conditional text-embedding space of Stable Diffusion.
Pretraining
Time Branch
The Latent Masked Modeling (LMM) pre-training method enhances EEG time-domain feature learning. The signal \(x \in \mathbb{R}^{c \times l}\) is divided into \(n\) units, each projected into a \(d\)-dimensional embedding \(z \in \mathbb{R}^{n \times d}\), and random masking (ratio \(r_m\)) is applied for self-supervised learning. The model optimizes two objectives:
- Regression loss \(L_{\text{reg}} = \frac{1}{d} \| f_m - f_{mp} \|^2_2\), reconstructing the masked embeddings from transformer-based predictions.
- Classification loss \(L_{\text{cls}} = -\mathbf{l}_m \cdot \log(\mathbf{p}_m)\), via codebook tokenization of the masked units.

The total loss is \(L_{\text{lmm}} = L_{\text{reg}} + L_{\text{cls}}\).
(channels \(c\) = 128, time steps \(l\) = 440, \(n\) = 110, \(d\) = 1024, \(r_m\) = 0.75, and \(n_{\text{code}}\) = 660)
✨Related work: Momentum Encoder; Vector-Quantized VAE (VQ-VAE)
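A minimal PyTorch sketch of the LMM objective, under my own assumptions: a small transformer encoder, learnable mask token, and a VQ-style codebook whose nearest entry supplies the classification target. Module names (`LMM`, `patch`, `cls_head`) are illustrative, not the authors' code.

```python
# Minimal sketch of Latent Masked Modeling (LMM) pre-training (assumed architecture details).
import torch
import torch.nn as nn
import torch.nn.functional as F

c, l, n, d, r_m, n_code = 128, 440, 110, 1024, 0.75, 660  # hyper-parameters from the paper

class LMM(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch = nn.Linear(c * (l // n), d)                   # project each time unit to d dims
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.codebook = nn.Embedding(n_code, d)                   # VQ-style tokenizer (assumption)
        self.cls_head = nn.Linear(d, n_code)

    def forward(self, x):                                         # x: (B, c, l)
        B = x.size(0)
        units = x.reshape(B, c, n, l // n).permute(0, 2, 1, 3).reshape(B, n, -1)
        z = self.patch(units)                                     # (B, n, d) clean embeddings
        mask = torch.rand(B, n, device=x.device) < r_m            # random masking, ratio r_m
        z_in = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, n, d), z)
        pred = self.encoder(z_in)                                 # transformer predictions f_mp

        # (1) regression loss on masked units: reconstruct the clean embeddings f_m
        l_reg = F.mse_loss(pred[mask], z[mask])

        # (2) classification loss: token id = nearest codebook entry of the clean embedding
        with torch.no_grad():
            target = torch.cdist(z[mask], self.codebook.weight).argmin(dim=-1)
        l_cls = F.cross_entropy(self.cls_head(pred[mask]), target)
        return l_reg + l_cls                                      # L_lmm

loss = LMM()(torch.randn(2, c, l))
```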
Frequency Branch
- Frequency Transformation: EEG signals are converted to the frequency domain using Fast Fourier Transform (FFT).
- Feature Extraction: An LSTM extracts frequency features, avoiding the overfitting risk of more complex networks (see the sketch below).
- Supervised Training: The LSTM is trained with visual classification labels using cross-entropy (CE) loss.
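A rough sketch of the frequency branch under these assumptions: rFFT magnitudes per channel fed to an LSTM, with a linear head trained against visual class labels (the 40-class setting, hidden size, and module names are my guesses).

```python
# Sketch of the frequency branch: FFT magnitudes -> LSTM -> class logits (assumed design).
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    def __init__(self, channels=128, hidden=256, num_classes=40):
        super().__init__()
        self.lstm = nn.LSTM(input_size=channels, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                          # x: (B, channels, time)
        spec = torch.fft.rfft(x, dim=-1).abs()     # frequency-domain magnitudes
        seq = spec.transpose(1, 2)                 # (B, freq_bins, channels) as the LSTM sequence
        _, (h, _) = self.lstm(seq)
        return self.head(h[-1])                    # logits for CE loss against visual labels

logits = FrequencyBranch()(torch.randn(2, 128, 440))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 40, (2,)))
```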
Unified Classify: The time and frequency branches are fine-tuned together using CE loss to form a unified time-frequency embedding.
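A tiny sketch of how this joint fine-tune could look, assuming the unified embedding is a simple concatenation of the two branch features (the fusion scheme and dimensions are assumptions):

```python
# Sketch of the unified classification fine-tune: concat time + frequency features, CE loss.
import torch
import torch.nn as nn

time_feat = torch.randn(2, 1024)   # pooled output of the (LMM-pretrained) time branch
freq_feat = torch.randn(2, 256)    # output of the frequency-branch LSTM
fuse = nn.Linear(1024 + 256, 40)   # unified time-frequency embedding -> class logits
loss = nn.functional.cross_entropy(
    fuse(torch.cat([time_feat, freq_feat], dim=-1)),
    torch.randint(0, 40, (2,)),
)
```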
CLIP Alignment Fine-tuning
The EEG embedding is aligned to both the label-induced coarse text feature and the description-induced fine text feature, balanced via a simple sum of the two losses. (🧐Possibly tuned together with Unified Classify.)
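A sketch of this alignment loss, assuming a cosine-similarity form pulling the EEG embedding toward the coarse (label) and fine (caption) CLIP text features and summing the two terms; the exact loss form in the paper may differ.

```python
# Sketch of CLIP-based alignment: sum of losses to coarse (label) and fine (caption) text features.
import torch
import torch.nn.functional as F

def clip_align_loss(eeg_emb, coarse_txt, fine_txt):
    """eeg_emb, coarse_txt, fine_txt: (B, clip_dim) features in CLIP text space."""
    l_coarse = 1 - F.cosine_similarity(eeg_emb, coarse_txt).mean()  # align to label text feature
    l_fine = 1 - F.cosine_similarity(eeg_emb, fine_txt).mean()      # align to caption text feature
    return l_coarse + l_fine                                         # simple sum, per the note

loss = clip_align_loss(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))
```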
Refined SD Generation
Img2Img refinement uses the EEG classification label (inferred from Unified Classify) as the only conditional text (the "label word") to enhance image quality.
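A sketch of this refinement step using the public diffusers img2img pipeline; the checkpoint, file paths, and label word are placeholders, and the paper's exact SD setup may differ.

```python
# Sketch of img2img refinement conditioned only on the predicted "label word".
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

coarse_image = Image.open("first_stage_output.png").convert("RGB")  # output of the first SD pass
label_word = "airliner"                                             # class predicted by Unified Classify
refined = pipe(prompt=label_word, image=coarse_image,
               strength=0.5, guidance_scale=7.5).images[0]
refined.save("refined_output.png")
```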
Evaluation
Performance
Ablation
🧐The Time Branch is the main information source compared with the Frequency Branch.
🧐It seems that the refinement stage (single-label-guided img2img SD refinement) dominates the output semantics, so the original image structure (size, position, orientation, action, etc.) is largely ignored.
🧐Reflections
- There's no experiment showing the effectiveness of the Latent Masked Modeling classification objective (codebook design).
- Even simple object-class recognition still seems somewhat difficult for EEG-signal analysis (acc. ~45%) compared with fMRI modeling (acc. ~75%).