Transformer Architecture
Uses an Encoder-Decoder seq2seq architecture.
Transclude of Transformer-Model-Architecture.canvas
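A minimal sketch of this encoder-decoder layout using PyTorch's built-in nn.Transformer. The layer counts, heads, and model width follow the base configuration from the paper; the vocabulary sizes and sequence lengths are made up for illustration, and positional encodings are omitted for brevity.
```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
SRC_VOCAB, TGT_VOCAB = 10_000, 10_000
D_MODEL = 512  # model width of the base configuration in the paper

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)

# nn.Transformer bundles the encoder and decoder stacks:
# the encoder self-attends over the source, the decoder self-attends
# over the (shifted) target and cross-attends to the encoder output.
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.randint(0, SRC_VOCAB, (2, 20))   # (batch, src_len)
tgt = torch.randint(0, TGT_VOCAB, (2, 15))   # (batch, tgt_len)

# Causal mask so each target position only attends to earlier positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
generator = nn.Linear(D_MODEL, TGT_VOCAB)    # projection to target vocabulary
logits = generator(out)                      # (batch, tgt_len, TGT_VOCAB)
```
(Positional encoding and weight tying are left out here; the Annotated Transformer linked below walks through a complete implementation.)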
Pros compared to other architectures
- Self-attention layers are faster than RNN layers when the sequence length is smaller than the representation dimension, and for very long sequences self-attention can be restricted to a neighborhood around each position to keep the cost manageable.
- The number of sequential operations required by an RNN layer grows with the sequence length, whereas it remains constant for a self-attention layer (see the sketch after this list).
- In CNNs, the kernel width limits how far apart input and output positions can be directly connected; covering long-range dependencies requires large kernels or stacks of convolutional layers, which increases the computational cost.
- More parallelizable than recurrent models, requiring significantly less time to train.
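The sketch below illustrates the parallelism and the optional neighborhood restriction: scaled dot-product self-attention over a whole sequence is just a few matrix multiplications, with no per-token sequential loop as in an RNN. The learned Q/K/V projections are skipped for brevity, and the window radius is an arbitrary illustrative value.
```python
import torch
import torch.nn.functional as F

def self_attention(x, window=None):
    """Scaled dot-product self-attention over a (seq_len, d_model) input.

    All positions are processed by a handful of matrix multiplications,
    so the number of sequential operations does not grow with seq_len
    (unlike an RNN, which steps through the sequence one token at a time).
    """
    d_model = x.size(-1)
    # For simplicity queries, keys, and values are the input itself;
    # a real layer would first apply learned projections W_q, W_k, W_v.
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5   # (seq_len, seq_len)

    if window is not None:
        # Restricted self-attention: each position only attends to
        # neighbors within `window` positions (useful for very long inputs).
        idx = torch.arange(x.size(0))
        mask = (idx[None, :] - idx[:, None]).abs() > window
        scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(100, 64)              # hypothetical: 100 tokens, d_model = 64
full = self_attention(x)              # every position attends to every other
local = self_attention(x, window=5)   # neighborhood of radius 5
```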
Resources
The Annotated Transformer https://nlp.seas.harvard.edu/annotated-transformer/
Transformer from Scratch https://e2eml.school/transformers.html#attention
Transformers from Scratch (2) https://peterbloem.nl/blog/transformers
The Illustrated Transformer https://jalammar.github.io/illustrated-transformer/
Attention is all you need (Talk) https://www.youtube.com/watch?v=rBCqOTEfxvg
Lecture on YT https://www.youtube.com/playlist?list=PLIXJ-Sacf8u60G1TwcznBmK6rEL3gmZmV
Transformers (StatQuest) https://www.youtube.com/watch?v=zxQyTK8quyY
Transformers: Zero to Hero https://www.youtube.com/watch?v=rPFkX5fJdRY
Attention is all you need (Paper) https://arxiv.org/pdf/1706.03762.pdf
Deconstructing BERT (Visualize the Inner Workings of Attention) https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1