Locality, Learning, and the FFT: Why CNNs Avoid the Fourier Domain

Introduction

Convolution sits at the heart of modern machine learning—especially convolutional neural networks (CNNs)—yet the underlying mathematics is often hidden behind highly optimised implementations in PyTorch, TensorFlow, and other frameworks. As a result, many of the properties that make convolution such a powerful building block for deep learning become obscured, particularly when we try to reason about model behaviour or debug a failing architecture.

If you know the convolution theorem, a natural question arises:

Why don’t CNNs simply compute a Fourier transform of the input and kernel, multiply them in the frequency domain, and invert the result? Wouldn’t that be simpler and faster?

This blog post addresses exactly that question. We will see that:

  1. FFT-based convolution is not local.
    In the Fourier domain every coefficient depends on every input pixel. This destroys the locality structure that CNNs rely on to learn hierarchical, spatially meaningful features. As a result, it breaks the very inductive bias that makes CNNs effective.

  2. FFT-based convolution is not computationally cheaper in neural networks.
    Although FFTs are asymptotically efficient, they must be recomputed on every forward and backward pass—and the cost of repeatedly transforming inputs, kernels, and gradients outweighs any benefit from spectral multiplication.

By the end of this post, we’ll have a clear, explicit comparison—both in matrix form and via backpropagation—showing why CNNs deliberately perform convolution in the spatial domain. Any practitioner of signal processing should also be interested in knowing when the “locality” property is useful and when it is not!

1-D Convolution

Let us start with the most basic form of convolution, the 1D convolution. In this case you have a filter (which is nothing but a sequence of numbers) that you combine with your signal in order to produce another signal which is hopefully more interesting to you. For example, in your headphones, you want to apply a set of numbers to the music signal such that the resulting signal contains more music than the wailing baby one row behind you.

import numpy as np

def conv1d_direct(x, h):
    """Full 1-D convolution computed directly from the definition."""
    nx, nh = len(x), len(h)
    y = np.zeros(nx + nh - 1)
    for n in range(len(y)):
        for m in range(nx):
            k = n - m
            if 0 <= k < nh:
                y[n] += x[m] * h[k]
    return y

x = np.array([1., 2., 0., -1.])   # the recorded signal: music + wailing baby
h = np.array([0.5, 1., 0.5])      # a filter that, applied to x, keeps more of the music
print(conv1d_direct(x, h))

Convolution Theorem

This brings us to the convolution theorem, which states that convolving h and x (the window-wise multiply-and-sum above) is mathematically equivalent to an element-wise multiplication between the FFT of h and the FFT of x, followed by an inverse FFT.

def conv_via_fft(x, h):
    # Pad both signals to the full output length, multiply their spectra,
    # and transform back -- the convolution theorem in action.
    N = len(x) + len(h) - 1
    X = np.fft.rfft(x, n=N)
    H = np.fft.rfft(h, n=N)
    return np.fft.irfft(X * H, n=N)

print(np.max(np.abs(conv1d_direct(x, h) - conv_via_fft(x, h))))  # should be ~machine precision
print(conv1d_direct(x, h))
print(conv_via_fft(x, h))

2-D Convolution

Just like before, we will convolve a 2D filter with a 2D signal in the spatial domain. We will then try to do the same thing using the FFT, and verify that the convolution theorem does indeed work in 2D as well.

def conv2d_direct(img, ker):
    """Full 2-D convolution computed directly from the definition."""
    ih, iw = img.shape
    kh, kw = ker.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(ih):
                for n in range(iw):
                    km, kn = i - m, j - n
                    if 0 <= km < kh and 0 <= kn < kw:
                        out[i, j] += img[m, n] * ker[km, kn]
    return out

img = np.array([[0,0,0,0],[0,1,2,0],[0,3,4,0],[0,0,0,0]])
ker = np.array([[1,2,1],[2,4,2],[1,2,1]]) / 16   # 3x3 Gaussian-style blur kernel
print(conv2d_direct(img, ker))

Convolution Theorem 2D

Just as in the 1D case, instead of windowing and multiplying we can take the FFT of the signal and the kernel and simply multiply them element-wise.

def conv2d_fft(img, ker):
    # Pad both arrays to the full output size, multiply their 2-D spectra,
    # and invert -- the 2-D convolution theorem.
    H, W = img.shape
    Kh, Kw = ker.shape
    OH, OW = H + Kh - 1, W + Kw - 1
    IMG = np.fft.rfft2(img, s=(OH, OW))
    KER = np.fft.rfft2(ker, s=(OH, OW))
    return np.fft.irfft2(IMG * KER, s=(OH, OW))

out_d = conv2d_direct(img, ker)
out_f = conv2d_fft(img, ker)
print(np.max(np.abs(out_d - out_f)))  # should be ~machine precision

So why do NNs not use the FFT?

In a neural network, convolution is used to generate feature maps that feed into the next layer. At first glance, the convolution theorem suggests a tempting shortcut: instead of sliding a kernel spatially, we could transform both the image and kernel into the frequency domain, multiply them element-wise, and transform the result back. The output would be mathematically equivalent—so why not do this inside CNNs?

It turns out there are two fundamental reasons:

  1. Neural networks care about more than just the output—they care about how the output is produced.
    During backpropagation, each filter weight is updated using gradients derived from local spatial features. This locality enables CNNs to learn hierarchies of edges, textures, shapes, and patterns.
    In the Fourier domain, however, gradients flow through global Fourier coefficients. Every frequency component depends on every pixel, so the update for a single weight depends on the entire image. This destroys the spatial locality that CNNs rely on and eliminates the inductive bias that makes them effective.

  2. The FFT is not “simpler” computationally for neural networks.
    While FFTs are efficient in isolation, a CNN would need to repeatedly compute forward FFTs, spectral multiplications, and inverse FFTs—not just for the forward pass, but also for backpropagation.
    When you count actual multiplications and transforms, the FFT approach is often more expensive, especially for small kernels (e.g., 3×3, 5×5), which dominate modern architectures.

In short: CNNs avoid the Fourier domain because it removes locality and adds computational overhead—both of which undermine the very reasons convolution works so well in deep learning.
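
Before making these claims precise, here is a quick numerical sketch of the locality point, using an arbitrary random 8×8 image and 3×3 kernel (purely illustrative) together with the conv2d_direct helper from above. Perturbing a single pixel changes only a handful of spatial-convolution outputs, but it changes every Fourier coefficient of the image:

rng = np.random.default_rng(0)
img_r = rng.standard_normal((8, 8))
ker_r = rng.standard_normal((3, 3))

img_p = img_r.copy()
img_p[4, 4] += 1.0                                        # poke a single pixel

d_spatial = conv2d_direct(img_p, ker_r) - conv2d_direct(img_r, ker_r)
print(np.count_nonzero(np.abs(d_spatial) > 1e-12))        # at most 3*3 = 9 outputs move

d_freq = np.fft.fft2(img_p) - np.fft.fft2(img_r)
print(np.count_nonzero(np.abs(d_freq) > 1e-12))           # all 64 coefficients move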

2D Spatial Convolution as a Matrix Multiply

For our next trick we will show the exact way in which your hardware actually computes convolutions. Spoiler: it is some kind of matrix multiplication. This is quite different from how convolution is taught in the classroom, where you convolve the kernel with a patch of pixels and then slide it to the next patch. In practice, the whole process is expressed as one huge matrix multiply. It is very useful to think about convolution in this way, because it makes complex questions easier to approach: looping over pixels is hard to reason about analytically, whereas once convolution is written as a product of two matrices we can read its complexity off a formula. More importantly, GPUs are fast precisely because they can parallelize this matrix multiply (as opposed to parallelizing various kinds of for-loop structures).

In this section, $X \in \mathbb{R}^{4\times 4}$ denotes the input image. It’s worth noting that most deep-learning libraries treat the 2D and 1D cases in essentially the same way: the very first step is to reshape the image into a long vector, commonly written as $\mathrm{vec}(X) \in \mathbb{R}^{16}$. This operation—often implemented as im2col in the source code—unrolls local patches of the image so that convolution can be expressed as a matrix–vector multiplication.

Let the kernel we are interested in convolving be the same $3\times 3$ kernel as before:

$$K=\frac{1}{16}\begin{pmatrix}1&2&1\\2&4&2\\1&2&1\end{pmatrix},\qquad \text{with entries } k_{mn}.$$

The valid convolution output (size $2\times 2$) is written as a length-4 vector (again, im2col produces a long vector that can be reshaped back into an image on the other end):

$$y=\begin{pmatrix}y_{11}&y_{12}&y_{21}&y_{22}\end{pmatrix}^{\top}.$$

We can express the convolution as a matrix multiply:

$$y = T\,\mathrm{vec}(X),$$

where $T \in \mathbb{R}^{4\times 16}$ is the Block-Toeplitz with Toeplitz Blocks (BTTB) matrix built from the kernel entries $k_{mn}$.

Expanded, the output entries are:

$$y_{ij}=\sum_{m=1}^{3}\sum_{n=1}^{3}k_{mn}\,x_{i+m-1,\;j+n-1},\qquad i,j\in\{1,2\},$$

so, for example,

$$y_{11}=k_{11}x_{11}+k_{12}x_{12}+k_{13}x_{13}+k_{21}x_{21}+k_{22}x_{22}+k_{23}x_{23}+k_{31}x_{31}+k_{32}x_{32}+k_{33}x_{33}.$$

(The kernel here is symmetric, so flipping it, as strict convolution requires, gives the same result as this correlation form.)
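
To make this concrete, here is a minimal sketch of the im2col idea for our 4×4 image and 3×3 kernel (im2col_valid is a helper written for this post, not a library routine; since the kernel is symmetric, no flipping is needed):

def im2col_valid(img, kh, kw):
    ih, iw = img.shape
    rows = []
    for i in range(ih - kh + 1):
        for j in range(iw - kw + 1):
            rows.append(img[i:i+kh, j:j+kw].ravel())   # one kh x kw patch per row
    return np.array(rows)                              # shape: (num_patches, kh*kw)

P = im2col_valid(img, 3, 3)                  # 4 x 9 patch matrix for the 4x4 image
y_im2col = P @ ker.ravel()                   # convolution as one matrix-vector product
y_direct = conv2d_direct(img, ker)[2:4, 2:4].ravel()   # central "valid" 2x2 of the full output
print(y_im2col)
print(y_direct)                              # identical here (symmetric kernel, so no flip needed)

# The equivalent BTTB matrix T (4 x 16) acting on vec(X) can be read off column
# by column by pushing each standard basis image through the same valid convolution:
T = np.zeros((4, 16))
for p in range(16):
    basis = np.zeros(16); basis[p] = 1.0
    T[:, p] = im2col_valid(basis.reshape(4, 4), 3, 3) @ ker.ravel()
print(np.allclose(T @ img.ravel(), y_im2col))   # True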

Loss Backpropagation in Convolution

1D Convolution Example

Let the 1D convolution be:

$$y_i=\sum_{j=1}^{n_h}h_j\,x_{i+j-1},\qquad i=1,\dots,n_x-n_h+1,$$

where:

  • $x \in \mathbb{R}^{n_x}$ is the input
  • $h \in \mathbb{R}^{n_h}$ is the kernel
  • $y \in \mathbb{R}^{n_x-n_h+1}$ is the output (valid convolution)

Assume a scalar loss $L(y)$.

Step 1: Gradient w.r.t Output

Backpropagation hands us the upstream gradient

$$g=\frac{\partial L}{\partial y}\in\mathbb{R}^{n_x-n_h+1}.$$

Step 2: Gradient w.r.t Kernel

Construct the input Toeplitz matrix whose rows are the sliding windows of $x$, so that $y = X_{\text{Toep}}\,h$:

$$X_{\text{Toep}}=\begin{pmatrix}x_1&x_2&\cdots&x_{n_h}\\x_2&x_3&\cdots&x_{n_h+1}\\\vdots&&&\vdots\\x_{n_x-n_h+1}&x_{n_x-n_h+2}&\cdots&x_{n_x}\end{pmatrix}.$$

Then the gradient w.r.t the kernel is:

$$\frac{\partial L}{\partial h}=X_{\text{Toep}}^{\top}\,\frac{\partial L}{\partial y}.$$

Observation: Each kernel weight sees only the local patches of the input it touches, preserving locality.

Step 3: Gradient w.r.t Input

Writing $y = H x$ with the kernel Toeplitz matrix $H_{im}=h_{m-i+1}$ (zero when the index is out of range), we get

$$\frac{\partial L}{\partial x}=H^{\top}\,\frac{\partial L}{\partial y},\qquad\text{i.e.}\qquad \frac{\partial L}{\partial x_m}=\sum_{i}\frac{\partial L}{\partial y_i}\,h_{m-i+1}.$$

Again, each input element only receives gradient from the outputs it contributed to.
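
Here is a small numerical check of these formulas, using an arbitrary five-sample signal and a quadratic loss $L=\tfrac{1}{2}\lVert y\rVert^2$, chosen only so that $\partial L/\partial y = y$ (a sketch, not part of any training loop):

x1 = np.array([1., 2., 0., -1., 3.])
h1 = np.array([0.5, 1., 0.5])
nh, ny = len(h1), len(x1) - len(h1) + 1

X_toep = np.array([x1[i:i+nh] for i in range(ny)])   # rows = sliding windows of x
y1 = X_toep @ h1                                     # valid convolution
g = y1                                               # dL/dy for L = 0.5*||y||^2

grad_h = X_toep.T @ g                                # dL/dh
H_toep = np.array([[h1[m-i] if 0 <= m-i < nh else 0. for m in range(len(x1))]
                   for i in range(ny)])              # y = H_toep @ x
grad_x = H_toep.T @ g                                # dL/dx

# Finite-difference check on one kernel weight and one input sample
eps = 1e-6
def loss(x, h):
    Xt = np.array([x[i:i+len(h)] for i in range(len(x)-len(h)+1)])
    return 0.5 * np.sum((Xt @ h) ** 2)

h_p = h1.copy(); h_p[1] += eps
x_p = x1.copy(); x_p[2] += eps
print(grad_h[1], (loss(x1, h_p) - loss(x1, h1)) / eps)   # should match
print(grad_x[2], (loss(x_p, h1) - loss(x1, h1)) / eps)   # should match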

2D Convolution Example

Only for completeness: the 1D and 2D cases are handled the same way using im2col.

For 2D BTTB convolution:

$$\mathrm{vec}(Y)=T\,\mathrm{vec}(X),$$

with scalar loss $L$:

  • Gradient w.r.t kernel:

$$\frac{\partial L}{\partial k_{mn}}=\sum_{i,j}\frac{\partial L}{\partial y_{ij}}\,x_{i+m-1,\;j+n-1}$$

  • Gradient w.r.t input:

$$\frac{\partial L}{\partial \mathrm{vec}(X)}=T^{\top}\,\frac{\partial L}{\partial \mathrm{vec}(Y)}$$

Observation

  • Each kernel weight is influenced only by the input pixels in the patch it was applied to
  • Each input pixel receives gradients only from outputs it contributed to
  • This is why CNNs learn localized features efficiently.
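
As before, we can check both formulas numerically. The sketch below reuses img, ker, and conv2d_direct from earlier, picks an arbitrary upstream gradient G, and uses the linear loss $L=\sum_{i,j} G_{ij}\,y_{ij}$ so that $\partial L/\partial y = G$ exactly:

G = np.arange(1., 5.).reshape(2, 2)          # an arbitrary upstream gradient dL/dY

# Gradient w.r.t. the kernel: dL/dk[m, n] = sum_{i,j} G[i, j] * x[i+m, j+n] (0-based)
grad_k = np.array([[np.sum(G * img[m:m+2, n:n+2]) for n in range(3)] for m in range(3)])

# Gradient w.r.t. the input: a full convolution of G with the (flipped) kernel,
# which is exactly what conv2d_direct computes; it equals T^T vec(G) reshaped to 4x4.
grad_x = conv2d_direct(G, ker)

# Finite-difference checks with the linear loss L = sum(G * Y)
def loss2d(x_img, k):
    Y = np.array([[np.sum(k * x_img[i:i+3, j:j+3]) for j in range(2)] for i in range(2)])
    return np.sum(G * Y)

eps = 1e-6
k_p = ker.copy(); k_p[1, 2] += eps
x_p = img.astype(float); x_p[1, 2] += eps
print(grad_k[1, 2], (loss2d(img, k_p) - loss2d(img, ker)) / eps)   # should match
print(grad_x[1, 2], (loss2d(x_p, ker) - loss2d(img, ker)) / eps)   # should match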

2D Fourier Transform Convolution as Matrix Multiplies

Similar to the spatial convolution case, we will represent the Fourier transform as a sequence of matrix multiplies. The recipe is as follows:

  1. Fourier Transform of Kernel
  2. Fourier Transform of 2D Image
  3. Elementwise Multiply in the Frequency Domain
  4. Inverse Fourier Transform

These matrices can get quite large, but seeing them written out explicitly makes them easier to understand.

We assume:

  • a $4\times 4$ input image $X$ with entries $x_{mn}$
  • the same $3\times 3$ kernel $K$ as before

Flatten row-major:

$$\mathrm{vec}(X)=\begin{pmatrix}x_{11}&x_{12}&\cdots&x_{14}&x_{21}&\cdots&x_{44}\end{pmatrix}^{\top}\in\mathbb{R}^{16}.$$

The 2D DFT matrix for a 4×4 image (flattened row-major) is the Kronecker product of two 1D DFT matrices:

$$F_{2D}=F_4\otimes F_4,\qquad F_4=\begin{pmatrix}1&1&1&1\\1&\omega&\omega^{2}&\omega^{3}\\1&\omega^{2}&\omega^{4}&\omega^{6}\\1&\omega^{3}&\omega^{6}&\omega^{9}\end{pmatrix},$$

where

$$\omega=e^{-2\pi i/4}=-i.$$

1. Fourier Transform of the Kernel**

where is the 3×3 kernel zero-padded to 4×4. Explicitly:

Then:

Take the first row,

2. Fourier Transform of the Image

$$\widehat{X}=F_{2D}\,\mathrm{vec}(X),\qquad \widehat{X}_{uv}=\sum_{m,n}x_{mn}\,\omega^{(u-1)(m-1)+(v-1)(n-1)}.$$

Take the first row of $F_{2D}$ again: the first coefficient is the sum of all pixels,

$$\widehat{X}_{11}=\sum_{m,n}x_{mn}\;(=1+2+3+4=10\ \text{for our example image}).$$

Notice that already at this step every coefficient depends on every pixel.

3. Multiply (Elementwise) in Frequency Space

Define the frequency-domain product:

$$\widehat{Y}_{uv}=\widehat{X}_{uv}\,\widehat{K}_{uv}.$$

Written explicitly, every entry is a product of two global sums:

$$\widehat{Y}_{uv}=\Big(\sum_{m,n}x_{mn}\,\omega^{(u-1)(m-1)+(v-1)(n-1)}\Big)\Big(\sum_{m,n}k^{\text{pad}}_{mn}\,\omega^{(u-1)(m-1)+(v-1)(n-1)}\Big),$$

so each frequency component mixes every pixel with every kernel weight.

4. Inverse Fourier Transform

To return to the spatial domain:

$$\mathrm{vec}(Y)=F_{2D}^{-1}\,\widehat{Y},\qquad F_{2D}^{-1}=\tfrac{1}{16}\,\overline{F_{2D}}.$$

Explicitly, each output pixel is a weighted sum of all 16 frequency-domain products. Thus the first entry of the output looks like (the subscript is 11 because it will eventually be recast to an image):

$$y_{11}=\frac{1}{16}\sum_{u,v}\widehat{Y}_{uv}=\frac{1}{16}\sum_{u,v}\widehat{X}_{uv}\,\widehat{K}_{uv}.$$

We will focus on the RHS: every $\widehat{X}_{uv}$ contains every pixel $x_{11},\dots,x_{44}$, and every $\widehat{K}_{uv}$ contains every kernel weight.

Compare this to $y_{11}$ from the spatial case,

$$y_{11}=k_{11}x_{11}+k_{12}x_{12}+\cdots+k_{33}x_{33},$$

and notice that pixels outside the $3\times 3$ patch (for example $x_{44}$) are missing from the spatial expression, while in the FFT expression they enter through every Fourier coefficient.

Eventually the two routes produce numerically the same outputs, as the convolution theorem guarantees. In the next section we will see that which values contribute matters for gradient backpropagation, and that is where the two approaches differ.
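
The whole pipeline can be written out as actual matrix multiplies in a few lines. The sketch below builds $F_{2D}$ as the Kronecker product $F_4\otimes F_4$ (this assumes the row-major flattening used above) and cross-checks it against numpy's FFT routines; note that padding the kernel only to 4×4 makes this a circular convolution on the 4×4 grid:

F4 = np.fft.fft(np.eye(4))                 # 4-point DFT matrix
F2D = np.kron(F4, F4)                      # 2-D DFT on row-major flattened 4x4 arrays

K_pad = np.zeros((4, 4)); K_pad[:3, :3] = ker
K_hat = F2D @ K_pad.ravel()                # step 1: FFT of the padded kernel
X_hat = F2D @ img.ravel()                  # step 2: FFT of the image
Y_hat = X_hat * K_hat                      # step 3: elementwise product
y_circ = np.real(np.linalg.inv(F2D) @ Y_hat).reshape(4, 4)   # step 4: inverse FFT

# Same thing with numpy's FFT routines, as a cross-check
print(np.allclose(y_circ, np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(K_pad)))))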

Gradient Comparison

FFT Gradient

In the Fourier domain the kernel enters only through its spectral coefficients $\widehat{K}_{uv}$, and the chain rule gives

$$\frac{\partial L}{\partial \widehat{K}_{uv}}=\frac{\partial L}{\partial \widehat{Y}_{uv}}\,\widehat{X}_{uv},\qquad \widehat{X}_{uv}=\sum_{m,n}x_{mn}\,\omega^{(u-1)(m-1)+(v-1)(n-1)}.$$

Notice that every input pixel contributes to the gradient of $\widehat{K}_{uv}$.

Similarly for the other weights: EVERY pixel contributes to the gradient.

Gradient in the Spatial Convolution Case

$$\frac{\partial L}{\partial k_{mn}}=\sum_{i,j}\frac{\partial L}{\partial y_{ij}}\,x_{i+m-1,\;j+n-1}.$$

Notice that each update depends only on the pixel patch that it touches!

Gradient update for scalar loss $L$ (with learning rate $\eta$):

$$k_{mn}\leftarrow k_{mn}-\eta\,\frac{\partial L}{\partial k_{mn}}.$$
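
To see the contrast numerically, here is a small sketch using the 4×4 img from earlier: the sensitivity of a Fourier coefficient to a single pixel has magnitude one at every frequency, while the spatial kernel gradient never sees pixels outside the patches selected by the upstream gradient.

# The sensitivity of X_hat[u, v] to a single pixel is a complex exponential of
# magnitude 1, so EVERY frequency (and hence every spectral gradient
# dL/dK_hat[u, v] = dL/dY_hat[u, v] * X_hat[u, v]) feels every pixel.
one_hot = np.zeros((4, 4)); one_hot[3, 3] = 1.0         # pick an arbitrary pixel
print(np.abs(np.fft.fft2(one_hot)))                     # all ones: pixel (3, 3) reaches all 16 frequencies

# The spatial gradient dL/dk[m, n] = sum_{i,j} G[i, j] * x[i+m, j+n], by contrast,
# never sees pixel (3, 3) when the upstream gradient G is zero outside output (0, 0):
G = np.zeros((2, 2)); G[0, 0] = 1.0
grad_k = np.array([[np.sum(G * img[m:m+2, n:n+2]) for n in range(3)] for m in range(3)])
print(grad_k)                                           # just img[0:3, 0:3]: a local patch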

Computational Comparison

Spatial Convolution

Suppose:

  • Input image of size $N \times N$
  • Kernel of size $K \times K$
  • Output: $(N-K+1) \times (N-K+1)$ (valid convolution)

Number of multiplications

Each output pixel requires $K^2$ multiplications, so the total is $(N-K+1)^2 K^2 \approx N^2 K^2$ for $K \ll N$.

  • Linear in number of pixels and kernel size.
  • Memory access is local, cache-friendly.

FFT-based Convolution

Forward pass:

  1. Zero-pad the kernel to size $N \times N$
  2. Compute the 2D FFT of the input and the kernel: $O(N^2 \log N)$ each
  3. Elementwise multiplication in the Fourier domain: $O(N^2)$
  4. Inverse FFT: $O(N^2 \log N)$

Total computational cost

$$O(N^2\log N)\ \text{for the FFT route, versus}\ O(N^2K^2)\ \text{for direct convolution.}$$

  • For small kernels (e.g. 3×3, so $K^2=9$), $N^2K^2$ is comparable to, or smaller than, $N^2\log N$ once the FFT's constant factors, complex arithmetic, and extra memory traffic are taken into account.
  • Spatial convolution is therefore cheaper for small kernels, which is why CNNs prefer it.
  • FFT becomes advantageous only for very large kernels or very large images (see the back-of-envelope sketch below).
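
A rough count makes the crossover visible. The sketch below counts multiplications for the direct method and uses the common approximate $5M\log_2 M$ real-flop rule of thumb for a length-$M$ complex FFT; the constants are rough and memory traffic is ignored (accounting for it would favour the direct method even further):

def direct_mults(N, K):
    # multiplications for a valid spatial convolution (adds roughly double this)
    return (N - K + 1) ** 2 * K ** 2

def fft_flops(N):
    # one 2-D FFT = N row FFTs + N column FFTs, each ~5*N*log2(N) real flops;
    # two forward transforms, one inverse, plus the complex elementwise product
    fft_2d = 2 * N * 5 * N * np.log2(N)
    return 3 * fft_2d + 6 * N * N

for N, K in [(224, 3), (224, 7), (224, 63)]:
    print(f"N={N:4d} K={K:3d}  direct={direct_mults(N, K):>12,}  fft~{int(fft_flops(N)):>12,}")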

TL;DR

  1. Spatial convolution is efficient for small kernels and preserves locality which is crucial for CNNs to learn hierarchies.
  2. FFT convolution has global interactions, destroys the local inductive bias, and is only computationally advantageous for very large kernels.

Conclusion

We have seen that spatial convolution is not only computationally more efficient but also better suited to capturing the hierarchical structure inherent in most images. For instance, a face detection algorithm may rely on local patterns such as the triangle formed by the eyes and the nose. A kernel that focuses specifically on this local arrangement is highly effective because it preserves locality.

Conversely, in domains like recommendation systems, where data may be represented as a sparse matrix of product–user interactions, capturing global patterns can be more important. Here, the “local” interactions often correspond to users with strong connections, whereas broader, global patterns reveal trends across the entire system. In such contexts, FFT-based approaches—or methods that leverage global connectivity, like graph convolutional networks—can be more appropriate.

This contrast explains why spatial CNNs excel in image-based tasks, while GCNs or FFT-based methods are more suitable for graphs representing global interactions, such as those between users and products.
