BCI part 1: from EEG to CSP features

A first checkpoint on the BCI side project. I'm decoding motor imagery from EEG, classical pipeline first, deep models later. This post goes from raw recordings to CSP features — the point at which the data is ready for a linear classifier.

The dataset

BCI Competition IV Dataset 2a, loaded through MOABB as BNCI2014_001. Nine subjects, 22 EEG channels at 250 Hz, four motor-imagery classes:

left hand
right hand
feet
tongue

Each subject has one training session and one test session. The cue lands at $t = 0$ ; the imagery period runs from roughly $0.5$ to $4\,\text{s}$ after.

Preprocessing

EEG channels only, bandpass 8–30 Hz, epoch from $-0.5$ to $4.0\,\text{s}$ around each cue with a $(-0.5, 0)$ baseline.

eeg = run_raw.copy().pick("eeg").filter(l_freq=8.0, h_freq=30.0)
events, ev_id = mne.events_from_annotations(eeg)
epochs = mne.Epochs(eeg, events, event_id=class_ids,
                    tmin=-0.5, tmax=4.0, baseline=(-0.5, 0),
                    preload=True)

The 8–30 Hz band isn't a tunable hyperparameter — it's the mu (8–13 Hz) and beta (13–30 Hz) ranges where motor imagery actually lives. Below 8 Hz is mostly slow drift and ocular artifacts; above 30 Hz is mostly muscle noise.

One session out: a tensor of shape (n_trials, 22, 1126), where $1126 = 250 \times 4.5 + 1$ .

Sanity check: event-related desynchronization

Before any modeling, I want to see that the data actually contains motor imagery. The classic check is ERD: when a subject imagines a hand movement, mu/beta band power drops in the contralateral sensorimotor cortex.

Per trial and per channel, take the mean squared signal over a baseline window ( $-0.5 \to 0\,\text{s}$ ) and an imagery window ( $0.5 \to 2.5\,\text{s}$ ):

\mathrm{ERD}_c = \frac{P_c^{\text{imagery}} - P_c^{\text{baseline}}}{P_c^{\text{baseline}}}

Plot that as a topomap per class, blue for negative. For "left hand" you'd expect blue over right-hemisphere motor cortex, and the mirror for "right hand". If those topomaps line up, the rest of the pipeline is worth building.

CSP, with the actual math

CSP (Common Spatial Patterns) finds spatial filters — linear combinations of the 22 channels — that maximize the variance of one class while minimizing the variance of the other. Since the 8–30 Hz bandpass is already applied, post-filter variance is band power, which is exactly the signal ERD lives in.

For two classes with covariance matrices $\Sigma_1$ and $\Sigma_2$ , CSP solves the generalized eigenvalue problem:

\Sigma_1 \mathbf{w} = \lambda \, (\Sigma_1 + \Sigma_2) \, \mathbf{w}

The eigenvalue $\lambda \in [0, 1]$ is the share of total variance the filter assigns to class 1. The largest eigenvectors are the strongest "class 1 vs rest" directions; the smallest are the strongest "class 2 vs rest" directions. I take the top $k$ and bottom $k$ to get $2k$ spatial filters per binary problem.

T = X.shape[2]
cov1 = (X[y == 0] @ X[y == 0].transpose(0, 2, 1)).mean(0) / (T - 1)
cov2 = (X[y == 1] @ X[y == 1].transpose(0, 2, 1)).mean(0) / (T - 1)
_, V = eigh(cov1, cov1 + cov2)
filters = np.concatenate([V[:, :k], V[:, -k:]], axis=1).T  # (2k, C)

Features: log of normalized variance

Filtering a trial gives $Z = W X$ of shape (2k, T). Each filter collapses to one number via variance over time, then I take the log of the normalized variance:

f_i = \log \frac{\mathrm{Var}(z_i)}{\sum_j \mathrm{Var}(z_j)}

Normalization removes per-trial scale; the log pushes the distribution toward Gaussian, which is exactly what an LDA wants downstream.

Z = filters @ X                # (N, 2k, T)
v = (Z ** 2).mean(axis=2)      # (N, 2k)
features = np.log(v / v.sum(axis=1, keepdims=True))

Multi-class via one-vs-rest

CSP is binary by definition. For the four-class 2a problem I fit one CSP per class against the rest and concatenate the feature blocks:

features = np.concatenate(
    [csp_c.transform(X) for csp_c in csps_per_class],
    axis=1
)  # shape (N, num_classes * 2k)

Joint-diagonalization variants exist and probably do better on 2a, but one-vs-rest is the right baseline.

Where I am

End to end on a single subject: load → bandpass → epoch → CSP → log-variance features. The features are shaped to feed an LDA — that's next, then a per-subject accuracy table, then trying to actually beat the published baseline.