Mixing Tokens with Fourier Transforms
This is a breakdown of the paper "FNet: Mixing Tokens with Fourier Transforms". The authors take a standard Transformer encoder and swap the self-attention layer for a Fourier transform.
So is the story really about sparsity rather than attention? Or is it, as Yannic Kilcher suggests, all about 'mixing'?
From the abstract:
"We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that 'mix' input tokens. These linear transformations, along with simple nonlinearities in feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs."
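To make the swap concrete: in FNet the mixing sublayer applies a 2D discrete Fourier transform over the sequence and hidden dimensions and keeps only the real part, with no learned parameters at all. Here is a minimal NumPy sketch of that operation (shapes and the toy input are illustrative, not from the paper):

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: a 2D DFT over the sequence and hidden
    dimensions, keeping only the real part. No learned parameters."""
    # x: (seq_len, d_model) array of token embeddings
    return np.fft.fft2(x).real

# Toy example: 4 tokens with 8-dimensional embeddings
x = np.random.randn(4, 8)
mixed = fourier_mixing(x)
assert mixed.shape == x.shape  # mixing preserves the input shape
```

Each output position is now a (fixed) linear combination of every input token, which is the sense in which the Fourier transform "mixes" tokens; the feed-forward sublayers that follow supply the nonlinearity.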
You can check out the paper here.