Query, Key, and Value in Self-Attention

Transformers in Deep Learning - Part 4

Self-attention uses three names that sound unusual at first:

Query
Key
Value

These are not random labels. They come from a familiar idea in computer science: looking something up by a key and getting back a value.

The names describe the three roles a token plays when a Transformer decides how words in a sentence should influence each other.

1. The Dictionary Analogy

A simple Python dictionary maps keys to values:

animal_sounds = {
    "dog": "bark",
    "cat": "meow",
    "cow": "moo",
}

Here:

Part	Meaning
`"dog"`	key
`"bark"`	value
`animal_sounds["dog"]`	lookup/query

The lookup asks:

Given this key, what value should be returned?

So if the key is "dog", the value is "bark". If the key is "cat", the value is "meow".
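The lookup can be tried directly. This short sketch repeats the dictionary from above so it runs on its own; the `"puppy"` query and the `"no match"` fallback string are illustrative choices, not part of the original example.

```python
# Same dictionary as above, repeated so this snippet is self-contained.
animal_sounds = {
    "dog": "bark",
    "cat": "meow",
    "cow": "moo",
}

# The lookup (query) must match a key exactly to get its value back.
dog_sound = animal_sounds["dog"]
cat_sound = animal_sounds["cat"]

# A query that matches no key gets nothing useful: the match is all-or-nothing.
# "puppy" and "no match" are arbitrary placeholders chosen for this example.
puppy_sound = animal_sounds.get("puppy", "no match")

print(dog_sound, cat_sound, puppy_sound)
```

An ordinary dictionary has no notion of "close enough": `"puppy"` is not partly `"dog"`. That all-or-nothing behavior is the part self-attention relaxes.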

flowchart LR
    Q["lookup / query"] --> K["key: dog"]
    K --> V["value: bark"]

This gives the basic intuition:

A query searches.
A key is matched against the search.
A value is the useful information returned after the match.

2. How This Becomes Self-Attention

In self-attention, every token's embedding is projected into three vectors by three learned weight matrices:

q_i = x_i W_Q

k_i = x_i W_K

v_i = x_i W_V

where:

$x_i$ is the original embedding of token $i$
$W_Q$ creates the query vector
$W_K$ creates the key vector
$W_V$ creates the value vector
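The three projections can be sketched in plain Python. The embedding and the weight matrices below are made-up numbers chosen only to keep the arithmetic easy to follow; in a real model the matrices are learned during training, and `project` is a hypothetical helper name.

```python
def project(x, W):
    """Multiply a row vector x (length d_model) by a matrix W (d_model x d_k)."""
    return [sum(x_j * W[j][c] for j, x_j in enumerate(x)) for c in range(len(W[0]))]

# Hypothetical 3-dimensional embedding for one token.
x_i = [1.0, 0.0, 2.0]

# Hypothetical 3x2 projection matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_V = [[2.0, 0.0], [0.0, 2.0], [0.0, 1.0]]

q_i = project(x_i, W_Q)  # query vector
k_i = project(x_i, W_K)  # key vector
v_i = project(x_i, W_V)  # value vector

print(q_i, k_i, v_i)  # [2.0, 1.0] [2.0, 3.0] [2.0, 2.0]
```

The same embedding `x_i` goes in three times, and three different vectors come out because the matrices differ.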

The same word therefore has three different roles:

Vector	Role	Intuition
Query $q$	asks what this token needs	"What am I looking for?"
Key $k$	offers itself for matching	"Do I match that need?"
Value $v$	carries useful content forward	"If selected, what information do I contribute?"

The important point:

Queries and keys decide how much attention should be paid.
Values decide what information gets passed along.

3. Example: "love Apple phones"

Suppose the sentence is:

love Apple phones

Now imagine the model is building a new contextual representation for Apple.

The query of Apple is compared with the keys of every token:

q_{\text{Apple}} \cdot k_{\text{love}}

q_{\text{Apple}} \cdot k_{\text{Apple}}

q_{\text{Apple}} \cdot k_{\text{phones}}

Each dot product gives a similarity score.

flowchart TD
    QA["query of Apple"] --> KL["key of love"]
    QA --> KA["key of Apple"]
    QA --> KP["key of phones"]

    KL --> SL["score: Apple-love"]
    KA --> SA["score: Apple-Apple"]
    KP --> SP["score: Apple-phones"]

This is why the vector is called a query:

The Apple query is asking every key, "How relevant are you to me?"

The keys do not return the final information yet. They only help produce scores.
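As a minimal sketch of this scoring step, the snippet below computes the three dot products for Apple. The 2-dimensional query and key vectors are invented for illustration; real vectors come from the learned projections and are much larger.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(a_j * b_j for a_j, b_j in zip(a, b))

# Hypothetical query vector for "Apple".
q_apple = [1.0, 2.0]

# Hypothetical key vectors for every token in the sentence.
keys = {
    "love": [0.5, 0.0],
    "Apple": [1.0, 1.0],
    "phones": [0.0, 1.25],
}

# One raw similarity score per token: the Apple query against each key.
scores = {token: dot(q_apple, k) for token, k in keys.items()}

print(scores)  # {'love': 0.5, 'Apple': 3.0, 'phones': 2.5}
```

With these made-up numbers, Apple scores highest against its own key and nearly as high against phones. No value vector has been used yet.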

4. From Scores to Attention Weights

The raw scores are passed through softmax:

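\alpha_i = \frac{\exp(q_{\text{Apple}} \cdot k_i)}{\sum_j \exp(q_{\text{Apple}} \cdot k_j)}

(The division by $\sqrt{d_k}$ that appears in the full attention formula is left out here for simplicity.) Softmax normalizes each score against the scores of all tokens $j$, converting them into attention weights: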

Token	Meaning of the weight
love	how much love should influence Apple
Apple	how much Apple should keep itself
phones	how much phones should influence Apple

The weights are all positive and sum to 1, so they behave like percentages.
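A small softmax sketch makes the normalization concrete. The raw scores are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max so exp() never overflows
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for [love, Apple, phones].
raw_scores = [0.5, 3.0, 2.5]
weights = softmax(raw_scores)

print([round(w, 3) for w in weights])
print(sum(weights))  # 1.0 up to floating-point rounding
```

Higher scores get larger weights, but every token keeps some share, however small.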

For example:

[\alpha_{\text{love}}, \alpha_{\text{Apple}}, \alpha_{\text{phones}}] = [0.1, 0.5, 0.4]

This means:

\text{Apple}' = 0.1v_{\text{love}} + 0.5v_{\text{Apple}} + 0.4v_{\text{phones}}

So the new representation of Apple is built from a weighted mixture of value vectors.
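The mixture can be computed by hand. This sketch reuses the example weights [0.1, 0.5, 0.4]; the 2-dimensional value vectors are invented purely for illustration.

```python
# Attention weights from the example above.
weights = {"love": 0.1, "Apple": 0.5, "phones": 0.4}

# Hypothetical 2-dimensional value vectors.
values = {
    "love": [1.0, 0.0],
    "Apple": [0.0, 1.0],
    "phones": [1.0, 1.0],
}

dim = 2
# Weighted sum, one output dimension at a time.
new_apple = [
    sum(weights[token] * values[token][d] for token in weights)
    for d in range(dim)
]

print(new_apple)  # approximately [0.5, 0.9]
```

The result is not any single token's value vector. It is a blend that leans toward Apple and phones because their weights are largest.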

flowchart LR
    W["attention weights"] --> M["weighted sum"]
    VL["value of love"] --> M
    VA["value of Apple"] --> M
    VP["value of phones"] --> M
    M --> OUT["new Apple representation"]

5. Why the Third Vector Is Called Value

The value vector contains the information that actually gets mixed into the output.

Keys help decide relevance, but values carry content.

For Apple in:

love Apple phones

the model may learn that phones is highly relevant. But it does not mix in the key of phones. It mixes in the value of phones, scaled by its attention weight:

\alpha_{\text{phones}} v_{\text{phones}}

That is why this vector is called the value:

The value is the information returned after attention has decided what matters.

The new contextual representation of a word is the sum of all value vectors, each scaled by its attention weight.

6. The Full Attention Formula

For all tokens at once, self-attention is written as:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

This formula has three stages:

Stage	Formula part	Meaning
Matching	$QK^T / \sqrt{d_k}$	compare queries with keys; dividing by $\sqrt{d_k}$ keeps the scores from growing with the key dimension
Normalizing	$\text{softmax}(\cdot)$	turn scores into attention weights
Content mixing	$\text{weights} \cdot V$	combine value vectors

So the formula says:

Compare what each token is looking for with what every token offers for matching. Convert those comparisons into weights. Use those weights to mix the actual information.
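The whole formula fits in a few lines of dependency-free Python. This is a minimal sketch rather than an efficient implementation: the helper names (`matmul`, `transpose`, `attention`) and the small `Q`, `K`, `V` matrices are assumptions made for the example, and real code would use a tensor library.

```python
import math

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                 # matching
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # scaling
    weights = [softmax(row) for row in scaled]                       # normalizing
    return matmul(weights, V), weights                               # content mixing

# Hypothetical queries, keys, and values for three tokens (one row per token).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in output:
    print([round(x, 3) for x in row])
```

Each row of `weights` sums to 1, and each row of `output` is the new representation of one token, built as a weighted mix of the rows of `V`.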

7. A Better Mental Model

A useful analogy is a search system:

Search system	Self-attention
Search phrase	query
Searchable labels/indexes	keys
Documents/content returned	values

The search phrase is not the document. The index is not the document. The index only helps decide which documents are relevant.

Self-attention works similarly:

The query asks for relevant context.
The keys help compute relevance.
The values provide the context that gets mixed into the output.

8. Common Confusions

Are the key and the value the same thing?

Not in Transformer attention. The key is used for matching. The value is mixed later, after softmax creates attention weights.

Does each token play only one of the three roles?

No. Every token produces all three vectors: its own query, key, and value.

Does the query literally ask a question?

No. It asks mathematically through a dot product. The question is closer to: "How compatible are these two learned vectors?"

Is the value just the original embedding?

Not exactly. The value is usually a learned transformation of the embedding: $v = xW_V$.

9. One-Line Interview Answer

In self-attention, a query represents what a token is looking for, a key represents what each token offers for matching, and a value represents the content that gets passed forward. Queries and keys produce attention scores; softmax turns those scores into weights; the weights are used to mix the value vectors into a new contextual representation.

10. Final Intuition

The names query, key, and value come from lookup systems, but Transformers generalize the idea.

A dictionary lookup finds one exact key and returns one value.

Self-attention performs a soft lookup:

Each query compares itself with every key, receives a percentage of attention for each match, and blends the corresponding values.

That soft blend is what lets a word become context-aware.
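The contrast between a hard and a soft lookup can be sketched with a toy example. Everything here is hypothetical: the keys, values, and query are made up, and `sharpness` is an illustrative multiplier on the scores, not a parameter of standard attention.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values, sharpness=1.0):
    """Blend all values, weighted by how well each key matches the query."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Two hypothetical entries: key 0 maps to value 10, key 1 maps to value 20.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
query = [1.0, 0.0]  # matches key 0 best

blended = soft_lookup(query, keys, values, sharpness=1.0)
nearly_hard = soft_lookup(query, keys, values, sharpness=50.0)

print(blended)      # between 10 and 20, closer to 10
print(nearly_hard)  # almost exactly 10, like a dictionary lookup
```

When one score dominates, the soft lookup behaves almost like a dictionary and returns a single value. When scores are close, it returns a mixture, which is the behavior attention relies on.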