Attention is Not All You Need - the transformer backlash begins

With the end of my 45-day-straight, 16-hour-a-day campout in the debugger doing stack traces for a major software release, I finally have some time to get back to this blog.  And in the fast-paced world of deep learning, there are a ton of new things to cover.

A paper was just released called 'Attention is not all you need: pure attention loses rank doubly exponentially with depth'.  It tries to answer the question 'why do transformers work?'.  And it turns out our old friend the skip connection plays the key role (surprise).

Here's the abstract.

Attention-based architectures have become ubiquitous in machine learning. Yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.

Another fascinating quote.

Skip connections were first introduced in ResNets, ever since, it has been used to facilitate optimization in deep networks. In particular, skip connections tackle the vanishing gradient problem, by allowing the gradient to flow bypass the skipped layers during backpropagation. The original motivation of using skip connections in transformers follow the same reasoning on facilitating optimization. With the paths decomposition for transformers, we discover an additional surprising importance of skip connections: they prevent the transformer output from degenerating to rank one exponentially quickly with respect to network depth.
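
To get a feel for the rank collapse described in these quotes, here is a rough NumPy sketch of a toy experiment. It is not the paper's code or experimental setup: the single-head layers, the random untrained weights, the sizes (16 tokens, width 32, depth 6), and the helper names `relative_residual` and `run_stack` are all my own assumptions. I also measure the distance to "every token is identical" with a plain Frobenius norm after subtracting the mean token, which is a simplification of the residual the paper analyzes.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def relative_residual(x):
    # How far x is from "all rows identical" (a rank-1 matrix of the form 1 m^T),
    # relative to the size of x. Frobenius norm is an assumption of this sketch.
    res = x - x.mean(axis=0, keepdims=True)
    return np.linalg.norm(res) / np.linalg.norm(x)


def run_stack(x, depth, use_skip, rng):
    # Stack of single-head self-attention layers with random, untrained weights.
    # No MLPs and no layer norm, so only the skip connection differs between runs.
    d = x.shape[1]
    history = [relative_residual(x)]
    for _ in range(depth):
        wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        scores = (x @ wq) @ (x @ wk).T / np.sqrt(d)
        out = softmax(scores) @ (x @ wv)
        x = x + out if use_skip else out
        history.append(relative_residual(x))
    return history


x0 = np.random.default_rng(0).standard_normal((16, 32))  # 16 tokens, width 32

# Same seed for both runs so each layer draws the same weights.
pure = run_stack(x0, depth=6, use_skip=False, rng=np.random.default_rng(1))
skip = run_stack(x0, depth=6, use_skip=True, rng=np.random.default_rng(1))

for layer, (p, s) in enumerate(zip(pure, skip)):
    print(f"layer {layer}: pure attention {p:.3e} | with skip {s:.3e}")
```

In this toy setting the pure attention column should plunge toward zero within a few layers (every token ends up as nearly the same vector), while the skip connection column stays far from zero. Take it as an illustration of the claim, not a reproduction of the paper's experiments.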

