Input (X)
Wq
Wk
Wv
Query (Q = X·Wq)
Key (K = X·Wk)
Value (V = X·Wv)
Scores (Q·K^T / √d)
Attention (softmax of Scores)
Output (Attention · V)
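
The flow above can be sketched in a few lines of code. This is a minimal, dependency-free illustration rather than a reference implementation: the helper names (`matmul`, `transpose`, `softmax_rows`, `self_attention`) and the toy values are hypothetical, `d` is assumed to be the query/key dimension, and the softmax is assumed to run over each row of the score matrix. In practice an array library would usually replace the list-based helpers.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Apply softmax to each row independently."""
    out = []
    for row in M:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention following the labels in the diagram."""
    Q = matmul(X, Wq)  # Query  (Q = X·Wq)
    K = matmul(X, Wk)  # Key    (K = X·Wk)
    V = matmul(X, Wv)  # Value  (V = X·Wv)
    d = len(Q[0])      # assumption: d is the query/key dimension
    # Scores (Q·K^T / √d)
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    attention = softmax_rows(scores)  # Attention (softmax), one distribution per token
    output = matmul(attention, V)     # Output (Attention · V)
    return output, attention

# Toy example: 3 tokens, input width 2, d = 2 (all values are made up)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[1.0, 0.0], [0.0, 1.0]]
Wv = [[1.0, 2.0], [3.0, 4.0]]

output, attention = self_attention(X, Wq, Wk, Wv)
for row in attention:
    print([round(a, 3) for a in row])
```

Each row of `attention` is a set of non-negative weights that sum to 1, so each row of `output` is a weighted average of the rows of V.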