Attention is all you need: Discovering the Transformer paper

10 March 2023


Detailed implementation of a Transformer model in Tensorflow

Picture by Vinson Tan from Pixabay

In this post we will describe and demystify the relevant artifacts in the paper “Attention is all you need” (Vaswani et al., 2017) [1]. This paper was a major advance in the use of the attention mechanism, being the main improvement for a model called the Transformer. The most famous models currently emerging in NLP tasks are built from dozens of Transformer blocks or variants of them, for example GPT-2 or BERT.

We will describe the components of this model, analyze their operation and build a simple model that we will apply to a small-scale NMT (Neural Machine Translation) problem. To read more about the problem that we will address and to learn how the basic attention mechanism works, I recommend reading my previous post “A Guide on the Encoder-Decoder Model and the Attention Mechanism”.

Why we need the Transformer

In sequence-to-sequence problems such as neural machine translation, the initial proposals were based on RNNs in an encoder-decoder architecture. These architectures have a major limitation when working with long sequences: their ability to retain information from the first elements is lost as new elements are incorporated into the sequence. In the encoder, the hidden state at every step is associated with a certain word in the input sentence, usually one of the most recent. Therefore, if the decoder only accesses the last hidden state of the encoder, it loses relevant information about the first elements of the sequence. To deal with this limitation, a new concept was introduced: the attention mechanism.

Instead of paying attention only to the last state of the encoder, as is usually done with RNNs, at each step of the decoder we look at all the states of the encoder, gaining access to information about every element of the input sequence. This is what attention does: it extracts information from the whole sequence as a weighted sum of all the past encoder states. This allows the decoder to assign greater weight, or importance, to a certain element of the input for each element of the output, learning at every step to focus on the right element of the input to predict the next output element.

But this approach continues to have an important limitation: each sequence must be processed one element at a time. Both the encoder and the decoder have to wait until the completion of the first t-1 steps to process the t-th step. So when dealing with a huge corpus this is very time-consuming and computationally inefficient.

What is the Transformer?

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization … the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

“Attention is all you need” paper [1]

The Transformer model extracts features for each word using a self-attention mechanism that figures out how important all the other words in the sentence are with respect to that word. No recurrent units are needed to obtain these features, just weighted sums and activations, so the computation is highly parallelizable and efficient.

But we will dive deeper into its architecture (next figure) to understand what all these pieces do [1].

From “Attention is all you need” paper by Vaswani, et al., 2017 [1]

We can observe that there is an encoder model on the left side and the decoder on the right. Both contain a core block of “an attention and a feed-forward network” repeated N times. But first we need to explore a core concept in depth: the self-attention mechanism.

Self-Attention: the fundamental operation

Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let’s call the input vectors x1, x2, …, xt and the corresponding output vectors y1, y2, …, yt. The vectors all have dimension k. To produce the output vector yi, the self-attention operation simply takes a weighted average over all the input vectors; the simplest option for deriving the weights is the dot product.

Transformers from scratch by Peter Bloem [2]
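As a minimal sketch of this basic operation in TensorFlow (the function name and the (batch, seq_len, k) input layout are my own illustrative choices, not something prescribed by the paper):

import tensorflow as tf

def basic_self_attention(x):
    # x: (batch, seq_len, k) tensor holding the input vectors x1 ... xt.
    # Raw weights: dot product of every input vector with every other one.
    raw_weights = tf.matmul(x, x, transpose_b=True)   # (batch, t, t)
    # Row-wise softmax turns the raw weights into a proper weighted average.
    weights = tf.nn.softmax(raw_weights, axis=-1)
    # Every output vector yi is a weighted average of all the input vectors.
    return tf.matmul(weights, x)                      # (batch, t, k)

Note that there are no learned parameters yet; the weights come directly from the dot products between the input vectors themselves.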

In the self-attention mechanism of our model we need to introduce three elements: Queries, Values and Keys.

The Query, The Value and The Key

Every input vector is used in three different ways in the self-attention mechanism: as the Query, as the Key and as the Value. It is compared to the other vectors to compute its own output yi (Query), it is compared against the other vectors to help compute the j-th output yj (Key), and it is used to compute each output vector once the weights have been established (Value).

To obtain these roles, we need three weight matrices of dimensions k x k and compute three linear transformations for each xi:

“Transformers from scratch” by Peter Bloem [2]

These three matrices are usually known as K, Q and V, three learnable weight layers that are applied to the same encoded input. Consequently, since each of these three matrices comes from the same input, we can apply the attention mechanism of the input vector with itself: a “self-attention”.
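As a sketch of this step in TensorFlow (layer names are illustrative; Dense layers without bias play the role of the three k x k weight matrices applied at every position):

import tensorflow as tf

k = 64  # dimension of the input vectors (illustrative value)

# Three learnable projections, one per role.
to_queries = tf.keras.layers.Dense(k, use_bias=False)
to_keys = tf.keras.layers.Dense(k, use_bias=False)
to_values = tf.keras.layers.Dense(k, use_bias=False)

def qkv(x):
    # x: (batch, seq_len, k) encoded input. All three matrices come from
    # the same input, which is what makes this a "self"-attention.
    return to_queries(x), to_keys(x), to_values(x)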

The Scaled Dot-Product Attention

The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot product of the query with all keys, divide each by the square root of dk, and apply a softmax function to obtain the weights on the values.

“Attention is all you need” paper [1]

Then we use the Q, K and V matrices to calculate the attention scores. The scores measure how much focus to place on other positions or words of the input sequence with respect to a word at a certain position: each score is the dot product of the query vector with the key vector of the respective word we are scoring. So, for position 1 we calculate the dot product (·) of q1 and k1, then q1 · k2, q1 · k3, and so on.

Next we apply the “scaling” factor to obtain more stable gradients: the softmax function does not work properly with large values, which leads to vanishing gradients and slows down learning [2]. After “softmaxing” we multiply the result by the Value matrix, keeping the values of the words we want to focus on and minimizing or removing the values of the irrelevant words (their values in the V matrix should be very small).

The formula for these operations is:

From “Attention is all you need” paper by Vaswani, et al., 2017 [1]. Scaled dot-product attention formula.
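In other words, Attention(Q, K, V) = softmax(QKᵀ / √dk) V. A possible TensorFlow implementation of this formula could look like the sketch below (the optional mask argument is an addition for padding or look-ahead masking, not part of the formula itself):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., seq_len_q, dk), k: (..., seq_len_k, dk), v: (..., seq_len_k, dv)
    # 1. Dot product of each query with every key.
    scores = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)
    # 2. Scale by the square root of dk for more stable gradients.
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(dk)
    # 3. Optional mask (e.g. padding or look-ahead) applied before the softmax.
    if mask is not None:
        scores += (mask * -1e9)
    # 4. Softmax over the keys gives the attention weights.
    weights = tf.nn.softmax(scores, axis=-1)
    # 5. Weighted sum of the values.
    return tf.matmul(weights, v), weights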

Multi-head Attention

In the previous description the attention scores are focused on the whole sentence at a time, and this would produce the same results even if two sentences contained the same words in a different order. Instead, we would like to attend to different segments of the words. “We can give the self attention greater power of discrimination, by combining several self attention heads, dividing the words vectors into a fixed number (h, number of heads) of chunks, and then self-attention is applied on the corresponding chunks, using Q, K and V sub-matrices.” [2] Peter Bloem, “Transformers from scratch”. This produces h different output matrices of scores.

From “Attention is all you need” paper by Vaswani, et al., 2017 [1]

But the next layer (the Feed-Forward layer) is expecting just one matrix, a vector for each word, so “after calculating the dot product of every head, we concatenate the output matrices and multiply them by an additional weights matrix Wo,”[3]. This final matrix captures information from all the attention heads.
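A sketch of a multi-head attention layer in TensorFlow, reusing the scaled_dot_product_attention function from above (the class and attribute names are illustrative, not from the paper):

import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads           # size of each head's chunk
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.wo = tf.keras.layers.Dense(d_model)    # the final Wo projection

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # Every head attends independently over its own chunk.
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the h output matrices back into one matrix per word...
        out = tf.transpose(out, perm=[0, 2, 1, 3])
        out = tf.reshape(out, (batch_size, -1, self.num_heads * self.depth))
        # ...and mix them with the output weight matrix Wo.
        return self.wo(out)

Splitting d_model into num_heads chunks of size depth keeps the total amount of computation roughly the same as a single head operating on the full dimension.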
