Query, Key, and Value in Self-Attention
Transformers in Deep Learning - Part 4
Sibling notes: Self Attention in Transformers - The Deep Dive · Linear Transformation in Self-Attention - The Deep Dive
Self-attention uses three names that sound unusual at first:
- Query
- Key
- Value
These are not random labels. They come from a familiar idea in computer science: looking something up by a key and getting back a value.
The names describe the three roles a token plays when a Transformer decides how words in a sentence should influence each other.
1. The Dictionary Analogy
A simple Python dictionary maps keys to values:
    animal_sounds = {
        "dog": "bark",
        "cat": "meow",
        "cow": "moo",
    }
Here:
| Part | Meaning |
|---|---|
| `"dog"` | key |
| `"bark"` | value |
| `animal_sounds["dog"]` | lookup/query |
The lookup asks:
Given this key, what value should be returned?
So if the key is "dog", the value is "bark". If the key is "cat", the value is "meow".
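As a small sketch of the exact-match behaviour described above, the lookup can be run directly. The `"fox"` query is a hypothetical addition, used only to show what happens when no key matches.

```python
animal_sounds = {
    "dog": "bark",
    "cat": "meow",
    "cow": "moo",
}

# Exact-match lookup: the query must equal a stored key.
dog_sound = animal_sounds["dog"]

# A query with no matching key returns nothing useful.
# .get() returns None instead of raising KeyError.
fox_sound = animal_sounds.get("fox")

print(dog_sound)  # bark
print(fox_sound)  # None
```

A dictionary either matches a key fully or not at all. There is no notion of a partial match.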
flowchart LR
Q["lookup / query"] --> K["key: dog"]
K --> V["value: bark"]
This gives the basic intuition:
A query searches.
A key is matched against the search.
A value is the useful information returned after the match.
2. How This Becomes Self-Attention
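In self-attention, every word creates three learned vectors:

$$q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V$$

where:
- $x_i$ is the original embedding of token $i$
- $W_Q$ is the learned matrix that creates the query vector $q_i$
- $W_K$ is the learned matrix that creates the key vector $k_i$
- $W_V$ is the learned matrix that creates the value vector $v_i$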
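The three projections can be sketched in plain Python. Everything here is an assumption for illustration: the 2-dimensional embedding, the names `W_Q`, `W_K`, `W_V`, and the matrix entries are made up, and the embedding is treated as a row vector multiplied on the right by each matrix. In a real model the matrices are learned during training.

```python
def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 2-dimensional embedding for one token.
x_apple = [1.0, 2.0]

# Hypothetical weight matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[0.5, 0.0], [0.0, 0.5]]

# One embedding, three different roles.
q_apple = project(x_apple, W_Q)
k_apple = project(x_apple, W_K)
v_apple = project(x_apple, W_V)

print(q_apple)  # [1.0, 2.0]
print(k_apple)  # [2.0, 1.0]
print(v_apple)  # [0.5, 1.0]
```

The same input produces three different vectors because each matrix is separate.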
The same word therefore has three different roles:
| Vector | Role | Intuition |
|---|---|---|
| Query | asks what this token needs | "What am I looking for?" |
| Key | offers itself for matching | "Do I match that need?" |
| Value | carries useful content forward | "If selected, what information do I contribute?" |
The important point:
Queries and keys decide how much attention should be paid.
Values decide what information gets passed along.
3. Example: "love Apple phones"
Suppose the sentence is:
love Apple phones
Now imagine the model is building a new contextual representation for Apple.
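The query of Apple is compared with the keys of every token:

$$q_{\text{Apple}} \cdot k_{\text{love}}, \qquad q_{\text{Apple}} \cdot k_{\text{Apple}}, \qquad q_{\text{Apple}} \cdot k_{\text{phones}}$$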
Each dot product gives a similarity score.
flowchart TD
QA["query of Apple"] --> KL["key of love"]
QA --> KA["key of Apple"]
QA --> KP["key of phones"]
KL --> SL["score: Apple-love"]
KA --> SA["score: Apple-Apple"]
KP --> SP["score: Apple-phones"]
This is why the vector is called a query:
The Apple query is asking every key, "How relevant are you to me?"
The keys do not return the final information yet. They only help produce scores.
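A minimal sketch of the scoring step, assuming hypothetical 2-dimensional query and key vectors (the numbers are invented for the arithmetic, not taken from a trained model):

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query for Apple and keys for every token.
q_apple = [1.0, 2.0]
keys = {
    "love": [0.0, 1.0],
    "Apple": [1.0, 0.0],
    "phones": [1.0, 1.0],
}

# One raw similarity score per token in the sentence.
scores = {token: dot(q_apple, k) for token, k in keys.items()}

print(scores)  # {'love': 2.0, 'Apple': 1.0, 'phones': 3.0}
```

With these made-up vectors, phones gets the highest score. Only queries and keys are involved at this step. No value vector has been used yet.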
4. From Scores to Attention Weights
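The raw scores are passed through softmax:

$$\alpha_{\text{Apple},j} = \frac{\exp(s_{\text{Apple},j})}{\sum_{m} \exp(s_{\text{Apple},m})}, \qquad s_{\text{Apple},j} = q_{\text{Apple}} \cdot k_j$$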
Softmax converts the scores into attention weights:
| Token | Meaning of the weight |
|---|---|
| love | how much love should influence Apple |
| Apple | how much Apple should keep itself |
| phones | how much phones should influence Apple |
The weights are non-negative and sum to 1, so they behave like percentages.
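Softmax itself is short enough to write out. This is a sketch with hypothetical raw scores for love, Apple, and phones. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for: love, Apple, phones
raw_scores = [2.0, 1.0, 3.0]
weights = softmax(raw_scores)

print([round(w, 3) for w in weights])  # [0.245, 0.09, 0.665]
print(round(sum(weights), 6))          # 1.0
```

The ordering of the scores is preserved, but the gaps are stretched: the largest score takes most of the weight.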
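Writing the weight that Apple assigns to token $j$ as $\alpha_{\text{Apple},j}$:

$$\alpha_{\text{Apple},\text{love}} + \alpha_{\text{Apple},\text{Apple}} + \alpha_{\text{Apple},\text{phones}} = 1$$

This means:

$$z_{\text{Apple}} = \alpha_{\text{Apple},\text{love}}\, v_{\text{love}} + \alpha_{\text{Apple},\text{Apple}}\, v_{\text{Apple}} + \alpha_{\text{Apple},\text{phones}}\, v_{\text{phones}}$$

where $z_{\text{Apple}}$ is the new representation and each $v$ is a value vector.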
So the new representation of Apple is built from a weighted mixture of value vectors.
flowchart LR
W["attention weights"] --> M["weighted sum"]
VL["value of love"] --> M
VA["value of Apple"] --> M
VP["value of phones"] --> M
M --> OUT["new Apple representation"]
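The weighted sum can be sketched with made-up numbers. Both the attention weights and the 2-dimensional value vectors below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical attention weights for Apple (they sum to 1).
weights = {"love": 0.2, "Apple": 0.3, "phones": 0.5}

# Hypothetical value vectors, one per token.
values = {
    "love": [1.0, 0.0],
    "Apple": [0.0, 1.0],
    "phones": [1.0, 1.0],
}

dim = 2
# Each output dimension is a weighted sum over all tokens' values.
new_apple = [
    sum(weights[token] * values[token][d] for token in weights)
    for d in range(dim)
]

print([round(v, 3) for v in new_apple])  # [0.7, 0.8]
```

The result is a single vector of the same size as one value vector. It is closest to the value of phones because phones has the largest weight.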
5. Why the Third Vector Is Called Value
The value vector contains the information that actually gets mixed into the output.
Keys help decide relevance, but values carry content.
For Apple in:
love Apple phones
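the model may learn that phones is highly relevant. But it does not directly mix in the key of phones. It mixes in the value of phones, scaled by its attention weight $\alpha_{\text{Apple},\text{phones}}$:

$$\alpha_{\text{Apple},\text{phones}}\; v_{\text{phones}}$$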
That is why this vector is called the value:
The value is the information returned after attention has decided what matters.
Different percentages of value vectors form the new contextual word representation.
6. The Full Attention Formula
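For all tokens at once, self-attention is written as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ stack the query, key, and value vectors of all tokens as rows, and $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ is the standard scaling applied to the scores before softmax.

This formula has three stages:

| Stage | Formula part | Meaning |
|---|---|---|
| Matching | $QK^\top / \sqrt{d_k}$ | compare queries with keys |
| Normalizing | $\text{softmax}(\cdot)$ | turn scores into attention weights |
| Content mixing | $(\cdot)\,V$ | combine value vectors |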
So the formula says:
Compare what each token is looking for with what every token offers for matching. Convert those comparisons into weights. Use those weights to mix the actual information.
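The whole pipeline fits in a few lines of plain Python. This is a sketch, not a production implementation: it assumes the standard scaled dot-product form (scores divided by the square root of the key dimension), a single attention head, no masking, and hypothetical 2-dimensional vectors for the three tokens. The helper names `matmul`, `transpose`, and `attention` are illustrative.

```python
import math

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                        # matching
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]              # normalizing
    output = matmul(weights, V)                             # content mixing
    return output, weights

# Hypothetical rows for: love, Apple, phones
Q = [[1.0, 0.0], [1.0, 2.0], [0.0, 1.0]]
K = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

output, weights = attention(Q, K, V)
for token, row in zip(["love", "Apple", "phones"], output):
    print(token, [round(v, 3) for v in row])
```

Each row of `weights` sums to 1, and each row of `output` is the new contextual vector for one token.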
7. A Better Mental Model
A useful analogy is a search system:
| Search system | Self-attention |
|---|---|
| Search phrase | query |
| Searchable labels/indexes | keys |
| Documents/content returned | values |
The search phrase is not the document. The index is not the document. The index only helps decide which documents are relevant.
Self-attention works similarly:
- The query asks for relevant context.
- The keys help compute relevance.
- The values provide the context that gets mixed into the output.
8. Common Confusions
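Is the key the same as the value?
Not in Transformer attention. The key is used for matching. The value is mixed later, after softmax creates attention weights.
Does each token play only one of the three roles?
No. Every token produces all three vectors: its own query, key, and value.
Does the query literally ask a question?
No. It asks mathematically through a dot product. The question is closer to: "How compatible are these two learned vectors?"
Is the value just the original embedding?
Not exactly. The value is usually a learned transformation of the embedding: $v_i = x_i W_V$, where $x_i$ is the embedding of token $i$ and $W_V$ is a learned matrix.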
9. One-Line Interview Answer
In self-attention, a query represents what a token is looking for, a key represents what each token offers for matching, and a value represents the content that gets passed forward. Queries and keys produce attention scores; softmax turns those scores into weights; the weights are used to mix the value vectors into a new contextual representation.
10. Final Intuition
The names query, key, and value come from lookup systems, but Transformers generalize the idea.
A dictionary lookup finds one exact key and returns one value.
Self-attention performs a soft lookup:
Each query compares itself with every key, receives a percentage of attention for each match, and blends the corresponding values.
That soft blend is what lets a word become context-aware.
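The difference between a hard and a soft lookup can be seen in a small sketch. The `sharpness` parameter is a hypothetical knob added only for this illustration and is not part of standard attention. It multiplies the scores so that the softmax becomes more or less peaked.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    """Blend values by softmax over query-key dot products."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical keys and 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
query = [1.0, 0.0]  # matches the first key best

blended = soft_lookup(query, keys, values)
nearly_hard = soft_lookup(query, keys, values, sharpness=50.0)

print(round(blended[0], 2))      # between 10 and 20: a blend of both values
print(round(nearly_hard[0], 2))  # 10.0: almost all weight on the first value
```

When one score dominates, the soft lookup behaves almost like a dictionary lookup. When the scores are close, it returns a mixture.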