The Scaled Dot-Product Attention

A form of Attention that uses the scaled dot product as Similarity function. The inputs to the attention function are queries $Q$, keys $K$ and values $V$. Each query, key and value is just a vector of numbers (e.g. the Word Embedding coming from the input). The query and key have the same dimension $d_k$, while the value can have a different dimension $d_v$.

Intuition: We compare all keys with the current query via some kind of Similarity function (in this case the scaled dot product). This produces a weighting, a kind of soft mask, that we can apply to all values.
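
Written out for a single query $q$ and key/value pairs $(k_i, v_i)$, this per-query view looks as follows (a sketch; the weight symbols $\alpha_i$ are my own notation, not from this note):

$$\alpha_i = \frac{\exp\left(q \cdot k_i / \sqrt{d_k}\right)}{\sum_j \exp\left(q \cdot k_j / \sqrt{d_k}\right)}, \qquad \text{output} = \sum_i \alpha_i \, v_i$$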

In practice, one puts all queries, keys and values into matrices $Q$, $K$ and $V$ to use efficient matrix multiplication. The calculated mask will be scaled with $\frac{1}{\sqrt{d_k}}$ to make sure that not too much information is lost in the Softmax function (large dot products would otherwise push the softmax into regions with very small gradients). In the end we use the mask to attend to the values in $V$.
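
In matrix form this is the formula from Vaswani et al. (2017), "Attention Is All You Need":

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

A minimal, self-contained NumPy sketch of this computation (the function name and the example shapes are illustrative assumptions, not from this note):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by 1/sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k)
    # Softmax over the key axis turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each query's output is a weighted sum of all values
    return weights @ V                                  # (n_q, d_v)

# Example: 3 queries attending over 4 key/value pairs
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 5)
```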

Transclude of Scaled-Dot-Product-Attention.canvas