Query, Key, and Value in Self-Attention

Transformers in Deep Learning - Part 4
Sibling notes: Self Attention in Transformers - The Deep Dive · Linear Transformation in Self-Attention - The Deep Dive

Self-attention uses three names that sound unusual at first:

  • Query
  • Key
  • Value

These are not random labels. They come from a familiar idea in computer science: looking something up by a key and getting back a value.

The names describe the three roles a token plays when a Transformer decides how words in a sentence should influence each other.


1. The Dictionary Analogy

A simple Python dictionary maps keys to values:

animal_sounds = {
    "dog": "bark",
    "cat": "meow",
    "cow": "moo",
}

Here:

| Part | Meaning |
| --- | --- |
| "dog" | key |
| "bark" | value |
| animal_sounds["dog"] | lookup/query |

The lookup asks:

Given this key, what value should be returned?

So if the key is "dog", the value is "bark". If the key is "cat", the value is "meow".

flowchart LR
    Q["lookup / query"] --> K["key: dog"]
    K --> V["value: bark"]

This gives the basic intuition:

A query searches.
A key is matched against the search.
A value is the useful information returned after the match.


2. How This Becomes Self-Attention

In self-attention, every token is projected into three vectors by learned weight matrices:

$$q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V$$

where:

  • $x_i$ is the original embedding of token $i$
  • $W_Q$ creates the query vector
  • $W_K$ creates the key vector
  • $W_V$ creates the value vector
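
The three projections can be sketched in a few lines of plain Python. This is only an illustration: the 2-dimensional embedding and the three weight matrices are made-up numbers (in a real model they are learned), and `project` is a hypothetical helper written for this note.

```python
def project(x, W):
    """Multiply the row vector x by the matrix W (x has one entry per row of W)."""
    return [sum(x[r] * W[r][c] for r in range(len(W))) for c in range(len(W[0]))]

# Assumed toy embedding of one token (dimension 2).
x_i = [1.0, 2.0]

# Assumed toy weight matrices; a trained model would learn these.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[0.5, 0.0], [0.0, 3.0]]

q_i = project(x_i, W_Q)  # [1.0, 2.0]
k_i = project(x_i, W_K)  # [2.0, 1.0]
v_i = project(x_i, W_V)  # [0.5, 6.0]

print(q_i, k_i, v_i)
```

The same embedding yields three different vectors because each matrix reshapes it for a different job.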

The same word therefore has three different roles:

| Vector | Role | Intuition |
| --- | --- | --- |
| Query $q$ | asks what this token needs | "What am I looking for?" |
| Key $k$ | offers itself for matching | "Do I match that need?" |
| Value $v$ | carries useful content forward | "If selected, what information do I contribute?" |

The important point:

Queries and keys decide how much attention should be paid.
Values decide what information gets passed along.


3. Example: "love Apple phones"

Suppose the sentence is:

love Apple phones

Now imagine the model is building a new contextual representation for Apple.

The query of Apple is compared with the keys of every token:

$$q_{\text{Apple}} \cdot k_{\text{love}}, \qquad q_{\text{Apple}} \cdot k_{\text{Apple}}, \qquad q_{\text{Apple}} \cdot k_{\text{phones}}$$

Each dot product gives a similarity score.

flowchart TD
    QA["query of Apple"] --> KL["key of love"]
    QA --> KA["key of Apple"]
    QA --> KP["key of phones"]

    KL --> SL["score: Apple-love"]
    KA --> SA["score: Apple-Apple"]
    KP --> SP["score: Apple-phones"]

This is why the vector is called a query:

The Apple query is asking every key, "How relevant are you to me?"

The keys do not return the final information yet. They only help produce scores.
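
As a rough sketch of this scoring step, the snippet below compares one query with three keys. The vectors are invented for illustration and do not come from any trained model.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Assumed toy query for "Apple" and toy keys for each token.
q_apple = [1.0, 2.0]
keys = {
    "love":   [1.0, 0.0],
    "Apple":  [1.0, 1.0],
    "phones": [0.5, 1.0],
}

# One raw similarity score per token; no content is returned yet.
scores = {token: dot(q_apple, k) for token, k in keys.items()}
print(scores)  # {'love': 1.0, 'Apple': 3.0, 'phones': 2.5}
```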


4. From Scores to Attention Weights

The raw scores are passed through softmax:

$$\alpha_i = \text{softmax}(q_{\text{Apple}} \cdot k_i)$$

Softmax converts the scores into attention weights:

| Token | Meaning of the weight |
| --- | --- |
| love | how much love should influence Apple |
| Apple | how much Apple should keep itself |
| phones | how much phones should influence Apple |

The weights behave like percentages and sum to 1.
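
A minimal softmax can be written with the standard library alone. The input scores here are assumed toy values, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                           # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed raw scores for love, Apple, phones.
weights = softmax([1.0, 3.0, 2.5])
print(weights, sum(weights))
```

Larger scores receive larger weights, and every weight stays strictly between 0 and 1.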

For example:

$$[\alpha_{\text{love}}, \alpha_{\text{Apple}}, \alpha_{\text{phones}}] = [0.1, 0.5, 0.4]$$

This means:

$$\text{Apple}' = 0.1\,v_{\text{love}} + 0.5\,v_{\text{Apple}} + 0.4\,v_{\text{phones}}$$

So the new representation of Apple is built from a weighted mixture of value vectors.
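
The weighted mixture can be checked by hand with a small sketch. The weights are the ones from the example; the 2-dimensional value vectors are assumptions chosen only to make the arithmetic easy to follow.

```python
# Attention weights from the example: love, Apple, phones.
weights = [0.1, 0.5, 0.4]

# Assumed toy value vectors (not learned values).
values = [
    [1.0, 0.0],  # v_love
    [0.0, 2.0],  # v_Apple
    [3.0, 1.0],  # v_phones
]

# Weighted sum, computed one dimension at a time.
dim = len(values[0])
apple_new = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
print(apple_new)  # approximately [1.3, 1.4]
```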

flowchart LR
    W["attention weights"] --> M["weighted sum"]
    VL["value of love"] --> M
    VA["value of Apple"] --> M
    VP["value of phones"] --> M
    M --> OUT["new Apple representation"]

5. Why the Third Vector Is Called Value

The value vector contains the information that actually gets mixed into the output.

Keys help decide relevance, but values carry content.

For Apple in:

love Apple phones

the model may learn that phones is highly relevant. But it does not mix in the key of phones. It mixes in the value of phones, scaled by its attention weight:

$$\alpha_{\text{phones}} v_{\text{phones}}$$

That is why this vector is called the value:

The value is the information returned after attention has decided what matters.

The new contextual representation of a word is a blend of value vectors, with the attention weights setting each vector's share.


6. The Full Attention Formula

For all tokens at once, self-attention is written as:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

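This formula has three stages:

| Stage | Formula part | Meaning |
| --- | --- | --- |
| Matching | $QK^T$ | compare queries with keys |
| Normalizing | $\text{softmax}(\cdot)$ | turn scores into attention weights |
| Content mixing | $\text{weights} \cdot V$ | combine value vectors |

The division by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, rescales the scores before softmax is applied.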

So the formula says:

Compare what each token is looking for with what every token offers for matching. Convert those comparisons into weights. Use those weights to mix the actual information.
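
Putting the stages together, a plain-Python sketch of the formula might look like the following. It assumes a single attention head with no masking or batching, and the `Q`, `K`, `V` rows are toy numbers rather than real projections.

```python
import math

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Matching: score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Normalizing: convert the scores into attention weights.
        weights = softmax(scores)
        # Content mixing: weighted sum of the value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]
        output.append(mixed)
    return output

# Assumed toy inputs: three tokens, vectors of dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
for row in out:
    print(row)
```

Each output row is a weighted average of the rows of `V`, so its entries always lie between the smallest and largest value in the corresponding column.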


7. A Better Mental Model

A useful analogy is a search system:

| Search system | Self-attention |
| --- | --- |
| Search phrase | query |
| Searchable labels/indexes | keys |
| Documents/content returned | values |

The search phrase is not the document. The index is not the document. The index only helps decide which documents are relevant.

Self-attention works similarly:

  • The query asks for relevant context.
  • The keys help compute relevance.
  • The values provide the context that gets mixed into the output.

8. Common Confusions

Note: The key is not the value in Transformer attention. The key is used only for matching; the value is mixed in later, after softmax creates the attention weights.

Note: A token is not limited to a single role. Every token produces all three vectors: its own query, key, and value.

Note: The query does not ask a question in words. It asks mathematically through a dot product, and the question is closer to: "How compatible are these two learned vectors?"

Note: The value is not exactly the original embedding. It is usually a learned transformation of the embedding: $v = xW_V$.


9. One-Line Interview Answer

In self-attention, a query represents what a token is looking for, a key represents what each token offers for matching, and a value represents the content that gets passed forward. Queries and keys produce attention scores; softmax turns those scores into weights; the weights are used to mix the value vectors into a new contextual representation.


10. Final Intuition

The names query, key, and value come from lookup systems, but Transformers generalize the idea.

A dictionary lookup finds one exact key and returns one value.

Self-attention performs a soft lookup:

Each query compares itself with every key, receives a percentage of attention for each match, and blends the corresponding values.

That soft blend is what lets a word become context-aware.
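
The contrast between an exact lookup and a soft lookup can be sketched as follows. The keys, values, and query are toy assumptions, and `sharpness` is a hypothetical knob added only for this illustration (it is not part of the standard attention formula): raising it makes one match dominate, so the soft lookup behaves almost like a dictionary lookup.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    """Blend values by how well each key matches the query."""
    scores = [sharpness * sum(a * b for a, b in zip(query, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

# Assumed toy keys and one-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
query = [1.0, 0.0]  # matches the first key best

blended = soft_lookup(query, keys, values)                      # between 10 and 20, closer to 10
nearly_hard = soft_lookup(query, keys, values, sharpness=50.0)  # almost exactly 10
print(blended, nearly_hard)
```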