### Attention is Not All You Need - the transformer backlash begins

With my 45-day stretch of 16-hour days camped out in the debugger doing stack traces for a major software release finally over, I have some time to get back to this blog. And in the fast-paced world of deep learning, there is a ton of new material to cover.

A paper was just released called 'Attention is not all you need: pure attention loses rank doubly exponentially with depth'. It dives into the question 'why do transformers work?'. And it turns out our old friend the skip connection plays the key role (surprise).

Here's the abstract.

*Attention-based architectures have become ubiquitous in machine learning. Yet our understanding
of the reasons for their effectiveness remains limited. This work proposes a new way to understand
self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each
involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove
that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without
skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a
rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our
experiments verify the identified convergence phenomena on different variants of standard transformer
architectures.*
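The "converges doubly exponentially to a rank-1 matrix" claim is easy to see in a toy experiment. Below is a minimal numpy sketch (not the paper's code; the token count, width, head size, weight scaling, and seed are all arbitrary choices of mine): stack self-attention layers with random weights and no skips or MLPs, and track the ratio of the second to the first singular value of the output, which goes to zero as the matrix approaches rank one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pure_attention(X, d_head=16):
    """One self-attention layer with random weights: no skip, no MLP."""
    d = X.shape[1]
    Wq = rng.standard_normal((d, d_head)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_head)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_head))  # row-stochastic
    return A @ (X @ Wv)

def rank1_ratio(X):
    """sigma_2 / sigma_1 of X: near zero means X is numerically rank one."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[1] / s[0]

X = rng.standard_normal((8, 16))  # 8 tokens, model width 16
ratios = [rank1_ratio(X)]
for _ in range(8):
    X = pure_attention(X)
    ratios.append(rank1_ratio(X))
print([f"{r:.2e}" for r in ratios])  # ratio shrinks rapidly with depth
```

Intuitively, each attention matrix is row-stochastic, so every output row is a convex combination of the input rows; as the rows get closer together, the attention logits themselves flatten out, which accelerates the averaging — hence the "doubly" in doubly exponential.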

Another fascinating quote.

*Skip connections were first introduced in ResNets; ever since, they have been used to facilitate
optimization in deep networks. In particular, skip connections tackle the
vanishing gradient problem, by allowing the gradient flow to bypass the skipped layers during backpropagation.
The original motivation for using skip connections in transformers follows the same reasoning on
facilitating optimization. With the paths decomposition for transformers, we discover an additional
surprising importance of skip connections: they prevent the transformer output from degenerating to rank
one exponentially quickly with respect to network depth.*
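The skip-connection claim is also easy to check numerically. Here's a hedged sketch in the same toy numpy setup as above (random weights, arbitrary sizes and seed, fresh weights drawn per layer): run the same depth-8 stack twice from the same initial tokens, once as pure attention and once with a residual connection added around each layer, and compare how close each output is to rank one.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(X, d_head=16):
    """One self-attention layer with freshly drawn random weights."""
    d = X.shape[1]
    Wq = rng.standard_normal((d, d_head)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_head)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_head))
    return A @ (X @ Wv)

def rank1_ratio(X):
    """sigma_2 / sigma_1: near zero means numerically rank one."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[1] / s[0]

X0 = rng.standard_normal((8, 16))
Xp, Xs = X0.copy(), X0.copy()
for _ in range(8):
    Xp = attn(Xp)       # pure attention: rank collapses with depth
    Xs = Xs + attn(Xs)  # skip connection: identity path preserves rank
print(f"pure: {rank1_ratio(Xp):.2e}  skip: {rank1_ratio(Xs):.2e}")
```

The identity path in `Xs + attn(Xs)` carries the full-rank input forward unchanged, so even though the attention term keeps pulling toward a rank-1 matrix, the sum stays well away from it — which is the paper's "additional surprising importance" in miniature.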
