Common Spatial Patterns (CSP)

Math is hard, especially when it works in dimensions that the human brain has never been subject to. I'm going to try my best to explain CSP in an understandable way.

Your skull is a terrible conductor. Great for playing football, not so great for allowing the brain's information to escape. That being said, we need that information if we want to, say, give Stephen Hawking the ability to speak again. Woah — wouldn't that be something.

Listening to a single thought from the brain is a bit like trying to hear your friend whisper behind a concrete wall with a party on the other side…or something. The brain is full of these things called neurons. Billions of them. Your neurons "talk" to each other, moving charged ions around, resulting in small electrical fluctuations. When a neuron sends a signal, we say it's firing. Luckily, it's not completely random (as you can see in my blog on Visual Cortical Mapping). Neuronal firing is often synchronized with nearby neurons in the same brain region. The result of this? Detectable voltage fluctuations. We place electrodes around the brain to measure these fluctuations.

There are a few methods by which we actually do this. In this article, I'm referring to non-invasive EEG (electroencephalography). This means electrodes are placed in a pattern around the scalp. Here's an image of a 22-electrode pattern from above a patient's head.

The difficulty is, when a cluster of neurons in your motor cortex fires in sync, the electrical signal smears as it travels through tissue and bone. Every electrode placed on the scalp is essentially a microphone hanging from the ceiling of a crowded party. Every microphone in the room picks up everyone's voices at once, but the microphone directly above you will capture your conversation with Sally better than it will hear the conversation I'm having with your mom in the opposite corner.

Okay, onto Common Spatial Patterns — CSP. We've established that the conversations at the party are happening, they are just individually buried under a mountain of background noise. Mathematically, we can unmix this party using something called a spatial filter.

Now imagine a mixing board where you can adjust the volume of every single microphone in the room. The goal is to find the perfect combination of volume sliders to tune out the whole party except for a person whispering about moving their left hand. The question is: How do we mathematically find those slider positions? We'll get to some of the math in a second.
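To make the mixing board concrete, here's a minimal sketch of what applying one set of sliders looks like in Python. Everything in it is a stand-in: the trial is random noise rather than real EEG, the 250 Hz sampling rate is an assumption, and the weights are random placeholders rather than anything CSP would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 22 electrodes, 3 seconds at a hypothetical 250 Hz sampling rate.
n_channels, n_samples = 22, 750
X = rng.standard_normal((n_channels, n_samples))  # random stand-in for one EEG trial

# One "slider position" per electrode. Random placeholders, not a real CSP filter.
w = rng.standard_normal(n_channels)

# Applying the filter is a weighted sum across electrodes at every time point.
virtual_channel = w @ X  # shape: (n_samples,)

# The same thing written out as an explicit sum over electrodes.
manual = sum(w[c] * X[c] for c in range(n_channels))
```

The result is a single "virtual channel": one number per time point, built from all 22 microphones at once.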

It's important to know that the way we differentiate left-hand and right-hand signals is by measuring variance: how strongly the voltage at an electrode fluctuates over time. Once the signal has been filtered to a frequency band, its variance is the electrical power in that band.
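A quick sanity check on the variance-as-power idea, sketched with made-up sine waves rather than real EEG (the 250 Hz sampling rate and the 10 Hz rhythm are assumptions for illustration). A rhythm with four times the amplitude should have sixteen times the variance.

```python
import numpy as np

fs = 250                                   # assumed sampling rate in Hz
t = np.arange(3 * fs) / fs                 # 3 seconds of sample times
weak = 1.0 * np.sin(2 * np.pi * 10 * t)    # 10 Hz rhythm, amplitude 1
strong = 4.0 * np.sin(2 * np.pi * 10 * t)  # same rhythm, amplitude 4

# For a zero-mean sine wave, variance = amplitude**2 / 2, which is its average power.
weak_power = weak.var()
strong_power = strong.var()
ratio = strong_power / weak_power
```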

A few variables:

X: raw data from our 22 electrodes over a single trial. A grid of numbers representing the voltage at every electrode over time. Like a 3 second recording from all the microphones at the party.
$\Sigma$ : A covariance matrix. A grid of numbers that provides information on how each electrode relates to every other electrode. A relatively high number at row 2 column 4 indicates that electrodes 2 and 4 tend to fluctuate together, aka their signals are synchronized. The covariance matrix gives us an idea of the "shape" of the noise. If $\Sigma_0$ is the spatial pattern when the patient imagines moving their left hand, $\Sigma_1$ is the same for their right hand.
λ: A number that tells us how much of a signal's variance belongs to the left hand vs the right hand. More on this later.
w: Our weight vector — this is what CSP solves for. This is the arrangement of our volume sliders that we talked about earlier. Each weight vector w has a corresponding λ to show how well that weight vector isolates the variance of one class. If w is highly effective at only listening to left hand variance, λ may be 0.95 — 95% of the variance is from imagining the left hand. We call these weight vectors "filters" because they filter how we see the data.
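Here's a small sketch of how $\Sigma$ comes out of X for a single trial. The data is random noise standing in for EEG, and I've deliberately made electrodes 2 and 4 move together (an assumption for illustration) so that a large entry shows up at row 2, column 4.

```python
import numpy as np

rng = np.random.default_rng(1)

n_channels, n_samples = 22, 750
X = rng.standard_normal((n_channels, n_samples))  # random stand-in for one trial

# Electrodes 2 and 4 (indices 1 and 3) are forced to fluctuate together.
X[3] = X[1] + 0.1 * rng.standard_normal(n_samples)

# Remove each electrode's mean, then average the outer products over time.
X = X - X.mean(axis=1, keepdims=True)
sigma = X @ X.T / (n_samples - 1)  # shape: (22, 22)

print(sigma[1, 3])  # large: these two electrodes are synchronized
print(sigma[0, 5])  # near zero: these two are unrelated
```

The diagonal of `sigma` holds each electrode's own variance, and the matrix is symmetric because "2 relates to 4" is the same statement as "4 relates to 2".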

So we've established that our objective is to find the perfect slider position (w) so that the resulting audio track has a massive amount of variance when the user thinks "left" (Class 0) and almost zero variance when they think "right" (Class 1). Let's touch on the math.

$\max \frac{w^T \Sigma_0 w}{w^T \Sigma_1 w}$

Look at the numerator and the denominator of this fraction we are trying to maximize. In each, a covariance matrix is multiplied by our vector w and its transpose. In linear algebra, we call this a quadratic form. It compresses the matrix $\Sigma$ into a single number: the variance of the signal after it passes through the filter. When we multiply our matrix by our filter vector w, we change the perspective from which we are looking at our data.

Imagine 20 apples on a flat table forming a plus sign: 10 red, 10 blue. The red apples form a long horizontal line running left to right, and the blue apples form a long vertical line running up and down the table. They cross in the middle.

Now you shine a strong light from the left side of the table, so the apples cast shadows on the right wall. The red apples, which extend along the direction the light is traveling, all cast shadows in roughly the same spot on the wall; they pile up. The blue apples, which extend perpendicular to the light, throw shadows across the full width of the wall. Reds collapse onto 1 shadow, blues spread and each get their own. The shadow looks like this → **********

Move the light to the front of the table and shine it backward. Now it's reversed: red apples cast shadows scattered along the back wall, blues pile up.

Same apples, different light direction. The light direction sets the filter w (strictly speaking, w is the axis along the wall where the shadows land). The position of each shadow is the projection $w^Tx$. The spread of shadows is $w^T\Sigma w$. CSP finds the direction of the light that maximally spreads one class while collapsing the other.
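If you want to poke at the apples yourself, here's a rough sketch. The apple positions are randomly generated (and there are far more than 20 of them so the variance estimates are stable), and `shadow_spread` is a made-up helper name. It measures the spread of each color along two candidate directions using the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(2)

# Apple positions on the table as (x, y) pairs. Counts and spreads are made up.
red = rng.standard_normal((500, 2)) * np.array([5.0, 0.2])   # long horizontal line
blue = rng.standard_normal((500, 2)) * np.array([0.2, 5.0])  # long vertical line

def shadow_spread(points, w):
    """Variance of the projections w^T x, computed as the quadratic form w^T Sigma w."""
    sigma = np.cov(points, rowvar=False)  # 2x2 covariance of the cloud
    return w @ sigma @ w

w_x = np.array([1.0, 0.0])  # measure spread along the horizontal axis
w_y = np.array([0.0, 1.0])  # measure spread along the vertical axis

for name, w in [("w_x", w_x), ("w_y", w_y)]:
    print(name, "red:", shadow_spread(red, w), "blue:", shadow_spread(blue, w))

# The quadratic form agrees with directly measuring the variance of the shadows.
direct = (red @ w_x).var(ddof=1)
```

Along `w_x` the reds spread out and the blues collapse; along `w_y` it flips. Those two directions are exactly the kind of extremes CSP goes looking for.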

The equation above is called a generalized Rayleigh quotient, and it is just the mathematical form of what I just described. In calculus, to find the maximum value of a function, we take its derivative and set it equal to 0. I'm not going to explain the derivations in this article. If you're curious, learn calculus and linear algebra, or just paste it into Claude. Because scaling w doesn't change the ratio, we can constrain $w^T \Sigma_1 w = 1$ (fix the projected variance of class 1) and use Lagrange multipliers. Setting the gradient of $w^T \Sigma_0 w - \lambda(w^T \Sigma_1 w - 1)$ to zero gives:

$\Sigma_0 w = \lambda \Sigma_1 w$

To make the math cleaner, instead of comparing the variance of Class 0 directly to Class 1, we instead compare Class 0 to the total covariance, a combination of Class 0 and Class 1. Because we compare our desired class to a set that includes itself, our λ values (representing our variance ratios) are limited between 0 and 1.

$\Sigma_0 w = \lambda' (\Sigma_0 + \Sigma_1) w$
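For anyone wondering how $\lambda'$ relates to the original $\lambda$, here is a short derivation sketch (assuming both covariance matrices are positive definite, so $\lambda > 0$). Start from $\Sigma_0 w = \lambda \Sigma_1 w$ and add $\lambda \Sigma_0 w$ to both sides:

$(1 + \lambda) \Sigma_0 w = \lambda (\Sigma_0 + \Sigma_1) w$

Dividing by $1 + \lambda$:

$\Sigma_0 w = \frac{\lambda}{1 + \lambda} (\Sigma_0 + \Sigma_1) w \quad \Rightarrow \quad \lambda' = \frac{\lambda}{1 + \lambda}$

So the filters w are unchanged and only the eigenvalues are rescaled into the range 0 to 1. Multiplying by $w^T$ on the left shows what $\lambda'$ measures:

$\lambda' = \frac{w^T \Sigma_0 w}{w^T \Sigma_0 w + w^T \Sigma_1 w}$

That is the fraction of the filtered variance that comes from Class 0. The rescaled equation $\Sigma_0 w = \lambda' (\Sigma_0 + \Sigma_1) w$ is the one we actually solve.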

If you've taken linear algebra, you recognize this as a classic generalized eigenvalue problem. Math aside, this allows us to find the shared axes or "perspectives" between our two matrices. When we solve this, we are given a set of possible weight vectors w (our eigenvectors), corresponding to a set of scalars λ (our eigenvalues).

What this means practically: If we find a $w$ where $\lambda'$ is 0.99, that filter is incredibly pure for the left hand. If $\lambda'$ is 0.01, the filter is pure for the right hand. If $\lambda'$ is 0.50, the filter is picking up equal amounts of left and right-hand noise and is entirely useless.
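Here's a toy version you can run, using `scipy.linalg.eigh`, which solves this kind of generalized problem when given two matrices. The two covariance matrices are made-up numbers for a pretend 2-electrode cap, chosen so the answer is easy to check by hand.

```python
import numpy as np
from scipy.linalg import eigh

# Toy 2-electrode covariances (made-up numbers, not real EEG).
sigma_0 = np.array([[4.0, 0.0],
                    [0.0, 1.0]])  # "left": most power on electrode 1
sigma_1 = np.array([[1.0, 0.0],
                    [0.0, 4.0]])  # "right": most power on electrode 2

# Solves sigma_0 @ w = lam * (sigma_0 + sigma_1) @ w.
# Eigenvalues come back in ascending order; eigenvectors are the columns of W.
lams, W = eigh(sigma_0, sigma_0 + sigma_1)
print(lams)  # approximately 0.2 and 0.8

# Each eigenvalue is the share of filtered variance that belongs to class 0.
purity = [(w @ sigma_0 @ w) / (w @ (sigma_0 + sigma_1) @ w) for w in W.T]
```

The filter with eigenvalue 0.8 puts all of its weight on electrode 1, where the "left" power lives, and the filter with eigenvalue 0.2 does the same for electrode 2.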

What does it actually mean to solve this? Solving this eigenvalue problem is the engine of CSP: it is what determines the optimal filters for identifying "leftness" from our mountain of seemingly random data. Imagine the left-hand noise ( $\Sigma_0$ ) as a red cloud of smoke in the room, and the right-hand noise ( $\Sigma_1$ ) as a blue cloud of smoke. Because of the smearing effect of the skull, these clouds are heavily overlapping. If you look at them from the front, they just look like purple fog. Solving the eigenvalue problem is the mathematical equivalent of walking around the room until you find the exact, highly specific angle where the red cloud is completely separated from the blue cloud. You are finding a new line of sight where the overlap disappears.

Finding the filters is the meat of CSP, but afterwards, we manipulate the filters to make the math easier down the line.

Because the math solves for the whole room, it gives you multiple lines of sight (eigenvectors). We only want the extremes: the top few that perfectly isolate the left hand, and the bottom few for the right. The middle ones? Throw them out. They're just noise.

We apply this perfectly tuned lens to our raw data, but a machine learning model needs clean numbers. So, we extract the variance of the filtered signal, normalize it, and take the logarithm.

Why the log? Variance is strictly positive and right-skewed. The log stretches that clumped data into a smooth bell curve, making the distribution approximately Gaussian (necessary for our classifier down the line). Why normalize? If a user sweats, the electrode connection improves and the signal gets louder. Did they think about their left hand harder? Nope. Dividing each trial's variances by their total accounts for this.
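A small sketch of the sweat problem, using random numbers as a stand-in for one trial's filtered signals. The helper name `log_var_features` and the 2.5x gain are made up for illustration. The point is that a global volume change leaves the normalized features untouched.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_var_features(Z):
    """Z: filtered signals for one trial, shape (n_filters, n_samples)."""
    var = Z.var(axis=1)              # one variance per filter
    return np.log(var / var.sum())   # share of total variance, then log

# Four filtered signals with different loudness (random stand-in data).
Z = rng.standard_normal((4, 750)) * np.array([[3.0], [2.0], [0.5], [0.3]])
louder = 2.5 * Z  # the same trial after a hypothetical global gain change

features = log_var_features(Z)
features_louder = log_var_features(louder)
```

Scaling the signal by 2.5 multiplies every variance by 6.25, but the shares of the total stay the same, so both calls return the same features.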

If you're curious, here's a basic coded implementation in Python.

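    import numpy as np
    from scipy.linalg import eigh

    class CSP:
        # The class header and constructor are assumed; n_components=3 is a placeholder default.
        def __init__(self, n_components: int = 3):
            self.n_components = n_components  # filters kept from each end of the spectrum

        def fit(self, X: np.ndarray, y: np.ndarray) -> "CSP":
            # X: (n_trials, n_channels, n_times) band-passed EEG; y: labels in {0, 1}
            T = X.shape[2]  # number of time points per trial

            # e.g. (72, 22, 1126) @ (72, 1126, 22) -> (72, 22, 22) -> mean -> (22, 22), for each class
            def class_cov(trials):
                # per-trial covariance; assumes each channel is already roughly zero-mean
                covs = trials @ trials.transpose(0, 2, 1) / (T - 1)  # (n, C, C)
                traces = np.trace(covs, axis1=1, axis2=2)  # (n,)
                covs /= traces[:, None, None]  # trace-normalize so loud trials don't dominate
                return covs.mean(axis=0)

            cov_class0 = class_cov(X[y == 0])
            cov_class1 = class_cov(X[y == 1])

            # generalized eigenproblem; eigenvalues are returned in ascending order
            _, eigenvectors = eigh(cov_class0, cov_class0 + cov_class1)
            bottom_n = eigenvectors[:, :self.n_components]  # smallest eigenvalues (class 1)
            top_n = eigenvectors[:, -self.n_components:]  # largest eigenvalues (class 0)
            self.filters_ = np.concatenate((bottom_n, top_n), axis=1).T  # (2k, C)
            return self

        def transform(self, X: np.ndarray) -> np.ndarray:
            # apply spatial filters and return log-variance features
            Z = self.filters_ @ X  # (2k, C) @ (N, C, T) -> (N, 2k, T)
            variances = np.var(Z, axis=2)  # (N, 2k)
            # normalize within each trial, then log transform
            return np.log(variances / variances.sum(axis=1, keepdims=True))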

We've covered the math for separating 2 classes, but what about more? Let's say we have four classes (Left, Right, Feet, Tongue). Just use a One-vs-Rest strategy: run this math four separate times (Left vs Rest, Right vs Rest, and so on).
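Here's one way One-vs-Rest could look in code. This is a sketch under assumptions, not a reference implementation: the helper names (`mean_cov`, `csp_filters`, `one_vs_rest_csp`) are made up, the labels are assumed to be integers 0 to 3, and the data is random noise standing in for EEG.

```python
import numpy as np
from scipy.linalg import eigh

def mean_cov(trials):
    """Average trace-normalized covariance over trials of shape (n, C, T)."""
    covs = trials @ trials.transpose(0, 2, 1)  # (n, C, C); scale is removed below
    covs /= np.trace(covs, axis1=1, axis2=2)[:, None, None]
    return covs.mean(axis=0)

def csp_filters(X, is_target, k=2):
    """Binary CSP: target class vs everything else. Returns filters of shape (2k, C)."""
    cov_target = mean_cov(X[is_target])
    cov_rest = mean_cov(X[~is_target])
    _, vecs = eigh(cov_target, cov_target + cov_rest)  # ascending eigenvalues
    return np.concatenate((vecs[:, :k], vecs[:, -k:]), axis=1).T

def one_vs_rest_csp(X, y, k=2):
    """One filter bank per class label, keyed by label."""
    return {int(label): csp_filters(X, y == label, k) for label in np.unique(y)}

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 8, 200))  # random stand-in: 40 trials, 8 channels, 200 samples
y = np.repeat([0, 1, 2, 3], 10)        # assumed coding: left, right, feet, tongue
banks = one_vs_rest_csp(X, y)
```

Each entry in `banks` is a separate set of filters tuned to one class against the other three. One common choice (an assumption here, not something the math dictates) is to concatenate the log-variance features from all four banks before handing them to a classifier.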

As proof this math works, here's a plot of the filters that shows where they are "looking":

CSP spatial patterns showing filter weight distribution over the scalp

As seen above, the left-hand filter generates a hotspot perfectly over the right motor cortex. The algorithm mathematically rediscovers human neuroanatomy entirely on its own. Cool.

A few notes about CSP: First, the bandpass preprocessing that runs on the data before CSP assumes all useful info lives exactly between 8 and 30 Hz. Second, it assumes your brain's noise is stationary. It isn't; filters calibrated on Tuesday perform worse on Wednesday just because you slept differently. Third, it's strictly linear: each filter is a weighted sum of electrodes, so it can't capture nonlinear interactions between them.

CSP is not new. Modern deep learning models often outperform it on motor imagery benchmarks. Even so, many of the strongest recent architectures use CSP-style spatial filtering as a layer inside their network. CSP remains the cleanest example of how we can take physical truths about the brain, like how motor imagery lives in variance, and decode them algorithmically to extract useful information. The brain isn't a black box after all.