In the past, we had different architectures for different data modalities: CNNs for images, RNNs for text and GNNs for graphs. Recently, Transformers have been adopted for processing all of these modalities; they can be thought of as a general-purpose trainable architecture. Since the adoption of Transformers has been growing so rapidly, it seems like a good time to revisit the original paper.


Introduction

In sequence modelling and transduction problems, recurrent neural networks, and in particular long short-term memory and gated recurrent neural networks, have become standard. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Due to this inherently sequential nature, computation cannot be parallelized within training examples. Recent work has improved computational efficiency significantly through factorization tricks, but the underlying barrier of sequential processing persists.


Attention mechanisms have become an indispensable component of compelling sequence modelling and transduction models in a variety of tasks, allowing dependencies to be modelled without regard to their distance in the input or output sequences. In virtually every instance, however, such attention mechanisms are used in conjunction with a recurrent neural network.

To model global dependencies between input and output, the paper’s authors propose the Transformer, a model architecture that does away with recurrence and relies solely on an attention mechanism. The Transformer achieves a new state of the art in translation quality and allows for substantially more parallelization.

Background

Reducing Sequential Computation in Deep Neural Networks

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions, which makes it more difficult to learn dependencies between distant positions.

Self-Attention: An Intra-Attention Approach

Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.


End-to-end memory networks are based on a recurrent attention mechanism rather than a sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language modelling tasks.

The Transformer is the first transduction model that relies solely on self-attention to compute representations of its input and output.

Model Architecture

Auto-Regressive Neural Sequence Transduction

Most competitive neural sequence transduction models have an encoder-decoder structure. The encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn). Given z, the decoder then generates an output sequence of symbols (y1,…,ym) one element at a time. At each step the model is auto-regressive: it consumes the previously generated symbols as additional input when generating the next one.

For both the encoder and the decoder, the Transformer uses stacked self-attention and position-wise, fully connected feed-forward layers.


import copy
import math

import torch
import torch.nn as nn
from torch.nn.functional import log_softmax


class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

Encoder and Decoder Stacks

Encoder

The encoder is composed of a stack of N = 6 identical layers.

Each layer has two sub-layers:

  1. Multi-head self-attention mechanism,
  2. Simple, position-wise fully connected feed-forward network.

A residual connection is employed around each of the two sub-layers, followed by layer normalisation. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce 512-dimensional outputs.

def clones(module, N):
    "Produce N identical (deep-copied) layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    "Core encoder: a stack of N identical layers followed by a final LayerNorm."

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
class LayerNorm(nn.Module):
    """
    Layer Norm module

    LayerNorm is a type of normalization that is applied to the output of each sub-layer in the encoder. 
    This normalization helps to improve the stability of the AI model and makes it easier for the model to learn. 
    Layer normalisation tries to diminish the impact of covariant shift. 
    In other words, it prevents the mean and standard deviation of embedding vector elements from shifting, 
    which renders training unstable and sluggish.
    """

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
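class SublayerConnection(nn.Module):
    """
    Residual connection around a normalised sub-layer, followed by dropout.

    Note: EncoderLayer and DecoderLayer below reference SublayerConnection, but the
    original listing never defines it. This is a minimal sketch following the cited
    Annotated Transformer, which for code simplicity applies the norm before the
    sub-layer, whereas the paper describes LayerNorm(x + Sublayer(x)).
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply a residual connection to any sub-layer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))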
class EncoderLayer(nn.Module):
    "Encoder layer: self-attention followed by a position-wise feed-forward network."

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)


Decoder

Like the encoder, the decoder is composed of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections are used around each of the sub-layers, followed by layer normalisation. The self-attention sub-layer in the decoder stack is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Masked multi-head attention means that the inputs to the multi-head attention are masked so that the attention mechanism cannot use any information from the hidden positions. The authors apply the mask inside the attention computation by setting the masked attention scores to negative infinity (in practice, a very large negative number); the softmax inside the attention mechanism then assigns these masked positions an effective probability of zero.

(1, 0, 0, 0, 0, …, 0) => (<SOS>)
(1, 1, 0, 0, 0, …, 0) => (<SOS>, 'Friday')
(1, 1, 1, 0, 0, …, 0) => (<SOS>, 'Friday', 'hai')
(1, 1, 1, 1, 0, …, 0) => (<SOS>, 'Friday', 'hai', 'pencho')
(1, 1, 1, 1, 1, …, 0) => (<SOS>, 'Friday', 'hai', 'pencho', '!')
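As a concrete sketch of how such a causal mask can be built (the helper name subsequent_mask follows the cited Annotated Transformer and is not part of the paper itself):

def subsequent_mask(size):
    "Return a (1, size, size) boolean mask that hides subsequent positions."
    attn_shape = (1, size, size)
    upper = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return upper == 0  # True where attention is allowed


# subsequent_mask(5)[0] reproduces the lower-triangular pattern shown above:
# row i allows attention to positions 0..i only.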

The other type of multi-head attention in the decoder, source-target attention, computes attention between the features (embeddings) of the input sentence and the features of the (still partial) output sentence: the queries come from the decoder, while the keys and values come from the encoder output.


class Decoder(nn.Module):
    "Generic N-layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
class DecoderLayer(nn.Module):
    "Decoder layer: masked self-attention, source attention over the encoder memory, and feed forward."

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)


Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.


$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across all positions, they use different parameters from layer to layer. Another way to describe this is as two convolutions with a kernel size of 1. The dimensionality of input and output is

$$d_{model} = 512$$

and the inner-layer has dimensionality

$$d_{ff} = 2048$$

class PositionwiseFeedForward(nn.Module):
    """
    Multiplying x by W1 doubles its size to 2048, then dividing it by W2 brings it back down to 512. 
    In FFN, the weights for all positions inside the same layer are the same.
    """
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

Embeddings and Softmax

As in other sequence transduction models, the authors use learned embeddings to convert the input tokens and output tokens into vectors of dimension

$$d_{model}$$

The decoder output is converted into predicted next-token probabilities using the usual learned linear transformation followed by a softmax. The model shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, those weights are multiplied by

$$\sqrt{d_{model}}$$

class Embeddings(nn.Module):
    "Learned token embeddings, scaled by sqrt(d_model)."

    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
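Note that the Embeddings and Generator classes above keep separate weight matrices; the weight sharing described in the paper is not wired up anywhere in this listing. Below is a minimal sketch of how the three matrices could be tied, assuming the classes defined above (the vocabulary size is just a placeholder):

# Hypothetical wiring of the weight sharing described above; sizes are placeholders.
d_model, vocab = 512, 32000

src_embed = Embeddings(d_model, vocab)
tgt_embed = Embeddings(d_model, vocab)
generator = Generator(d_model, vocab)

# nn.Embedding(vocab, d_model).weight and nn.Linear(d_model, vocab).weight both have
# shape (vocab, d_model), so a single Parameter can back all three modules.
tgt_embed.lut.weight = src_embed.lut.weight
generator.proj.weight = src_embed.lut.weight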

Positional Encoding

I’ve saved my favorite for last. When and why do we require positional encoding?

Because the Transformer is a non-recurrent, attention-based architecture, it needs positional encoding to provide ordering information. When a recurrent neural network consumes a sequence, the input itself defines the order (the time steps are processed one after another). The Transformer’s multi-head attention layer, in contrast, is a feed-forward computation that sees the entire sequence at once rather than step by step. Attention itself is order-agnostic: it is computed between positions without any notion of where they sit in the sequence, so on its own it carries no positional information.


Positional encoding solves this issue. Simply put, a tensor that encodes each position is added to the input sequence.

To do this, “positional encodings” are added to the input embeddings at the base of the encoder and decoder stacks. Positional encodings and embeddings are of the same dimension, allowing for a simple addition of the two.

They employ sine and cosine functions with varying frequencies:

$$ PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) $$ $$ PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. The authors chose this function because they hypothesised it would allow the model to easily learn to attend by relative positions, since for any fixed offset k,

$$ PE_{pos+k} = \textrm{linear function of } PE_{pos} $$
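To see why this holds: for each fixed frequency ω_i = 1/10000^{2i/d_model}, the sine/cosine pair at position pos + k is just a rotation of the pair at position pos, i.e. a linear transformation whose matrix depends only on the offset k:

$$ \begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix} $$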

Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. The base model uses a rate of

$$P_{drop} = 0.1$$

class PositionalEncoding(nn.Module):
    "Sinusoidal positional encoding added to the token embeddings."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
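As a quick sanity check, the modules above can be composed the way the model actually feeds its inputs. This is only an illustrative sketch, with made-up batch, length and vocabulary sizes:

# Illustrative usage of Embeddings + PositionalEncoding; all sizes are placeholders.
d_model, vocab = 512, 32000
embed = nn.Sequential(Embeddings(d_model, vocab), PositionalEncoding(d_model, dropout=0.1))

tokens = torch.randint(0, vocab, (2, 10))  # batch of 2 sequences, 10 tokens each
out = embed(tokens)
print(out.shape)  # torch.Size([2, 10, 512])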

That’s all for now 🤭

In the next part, we will dig into the attention mechanism in Transformers! I felt it would be better to cover it in a separate post instead of writing about it here.

References

  1. Attention Is All You Need https://arxiv.org/abs/1706.03762 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

  2. The Annotated Transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html Harvard NLP

  3. The Illustrated Transformer http://jalammar.github.io/illustrated-transformer/ Jay Alammar

  4. Transformer break-down: Positional Encoding https://medium.datadriveninvestor.com/transformer-break-down-positional-encoding-c8d1bbbf79a8