Understanding Transformers! (Part 1: Model Architecture)

In the past, we’ve had different architectures for different data modalities: CNNs for images, RNNs for text, and GNNs for graphs. Recently, transformers have been adopted for processing all of these modalities; they can be thought of as a general-purpose trainable architecture. Since the adoption of transformers has been growing so rapidly, it might be a good idea to revisit the original paper. Introduction: In sequence modelling and transduction problems, recurrent neural networks, and in particular long short-term memory (LSTM) and gated recurrent neural networks, have become standard....

October 31, 2022 · 10 min · 2033 words · Me