The Scaled Dot-Product Attention

A form of Attention that uses the scaled dot product as Similarity function. The inputs to the attention function are queries $Q$, keys $K$ and values $V$. Each query, key and value is just a vector of numbers (e.g. the Word Embedding coming from the input). The query and key have the same dimension $d_k$, while the value can have a different dimension $d_v$.

Intuition: We compare all keys with the current query via some kind of Similarity function (in this case the scaled dot product). This produces a weighting, a kind of soft mask, that we can apply to all values.
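
Written out for a single query $q$ and key/value pairs $(k_i, v_i)$, this per-query view looks as follows (a sketch; the weight symbols $\alpha_i$ are my own notation, not from this note):

$$\alpha_i = \frac{\exp\left(q \cdot k_i / \sqrt{d_k}\right)}{\sum_j \exp\left(q \cdot k_j / \sqrt{d_k}\right)}, \qquad \text{output} = \sum_i \alpha_i \, v_i$$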

In practice, one puts all queries, keys and values into matrices $Q$, $K$ and $V$ to use efficient matrix multiplication. The calculated mask will be scaled with $\frac{1}{\sqrt{d_k}}$ to make sure that not too much information is lost in the Softmax function (large dot products would otherwise push the softmax into regions with very small gradients). In the end we use the mask to attend to the values in $V$.
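
In matrix form this is the formula from Vaswani et al. (2017), "Attention Is All You Need":

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

A minimal, self-contained NumPy sketch of this computation (the function name and the example shapes are illustrative assumptions, not from this note):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by 1/sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k)
    # Softmax over the key axis turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each query's output is a weighted sum of all values
    return weights @ V                                  # (n_q, d_v)

# Example: 3 queries attending over 4 key/value pairs
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 5)
```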

Transclude of Scaled-Dot-Product-Attention.canvas