Matrix Multiplication - The Deep Dive
Q10: Why Does "Row Times Column" Work? - Multiplication as Composition
Manish: We know matrices can represent all kinds of things - a vector, a bundle of features or properties for the same object, captured in parallel. I can hold that intuition. But here's what I never questioned and want to stop just believing: when we multiply matrix $A$ by matrix $B$, the rule book says "take the first row, multiply it against the first column, that's your first entry; first row against second column, that's the next," and so on. Where does that come from? I don't want to memorize it. I want to feel why matrix multiplication has to work this way.
Response: Good. The goal is not to remember the row-column rule. The goal is to see why it is forced.
The key shift is this:
A matrix is not just a grid of numbers. A matrix is a machine that transforms a vector.
So multiplying two matrices means: chain two machines into one machine.
Once we ask for the product matrix $AB$ to mean "do $B$ first, then do $A$", the row-times-column formula falls out naturally.
The word matrix comes from Latin matrix, connected to mater, meaning "mother." Older meanings point toward a womb, source, or place where something is formed.
In general English, a matrix can mean an environment, substance, or system within which something else originates, develops, or takes shape.
That fits the mathematical feeling: a matrix is a structured environment of numbers in which a new vector takes form from an old one.
1. A Matrix Is a Machine
A vector is a list of values:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
You can think of those values as coordinates, features, amounts, signals, or measurements.
A matrix takes that vector and produces a new vector:
$$y = Ax$$
So the matrix is not mainly a static object. It is an instruction for how to remake the input vector.
flowchart LR
x["input vector<br/>old features"] -->|"matrix A"| y["output vector<br/>new features"]
In the functions chapter, a function was an input-output machine. A matrix is that same idea, but specialized for vectors.
A matrix is a feature mixer. It takes old values and blends them into new values.
Before we multiply two matrices, we need the smaller idea: why does one row meeting one vector become a dot product?
2. One Row Is One Recipe
Take one row:
$$\begin{bmatrix} 2 & 5 \end{bmatrix}$$
and one vector:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
The row says:
- take 2 times the first input
- take 5 times the second input
- add the contributions
So:
$$\begin{bmatrix} 2 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 2x_1 + 5x_2$$
That is a dot product.
A dot product answers: "How much total output do I get if each input contributes with its own weight?"
Example:
amounts: [5 USD, 3 EUR, 10 GBP]
rates: [83, 90, 105] rupees per unit
The total value in rupees is:
$$5 \cdot 83 + 3 \cdot 90 + 10 \cdot 105 = 415 + 270 + 1050 = 1735$$
You multiply each amount by its matching rate, then add the results.
That is all a dot product is: multiply matching pieces, then add the contributions into one total.
A row is a recipe for building one output number from the input vector.
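The recipe idea can be sketched in a few lines of plain Python. The helper name `dot` and the sample input `[10, 1]` are illustrative assumptions; the weights 2 and 5 and the currency figures come from the text above.

```python
def dot(weights, values):
    """Multiply matching pieces, then add the contributions into one total."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))


# The row recipe from the text: 2 times the first input plus 5 times the second.
# The input vector [10, 1] is a made-up example.
recipe_output = dot([2, 5], [10, 1])

# The currency example: amounts in USD, EUR, GBP and rupees per unit.
amounts = [5, 3, 10]
rates = [83, 90, 105]
total_rupees = dot(amounts, rates)

print(recipe_output)  # 25
print(total_rupees)   # 1735
```

The length check matters because `zip` would otherwise silently drop unmatched entries, which would hide a malformed recipe.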
3. Matrix Times Vector = Many Recipes at Once
If one row builds one output number, then several rows build several output numbers.
Each row asks a different question about the same input vector:
- Row 1 builds output 1.
- Row 2 builds output 2.
- Row 3 builds output 3.
So matrix-vector multiplication is simple:
A matrix applies many row-recipes to the same input vector. Each row produces one coordinate of the output.
In feature language:
old features -> matrix -> new features
Each new feature is a weighted blend of the old features.
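As a minimal sketch of "many recipes at once", the snippet below applies three rows to one input vector. The matrix entries, the input, and the helper name `matvec` are hypothetical, chosen only so each row's effect is easy to read.

```python
def matvec(matrix, vector):
    """Apply every row-recipe of `matrix` to the same input vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Hypothetical 3 x 2 matrix: three recipes, each mixing two old features.
A = [
    [2, 5],   # row 1 builds output 1: 2 * first + 5 * second
    [1, 0],   # row 2 builds output 2: copies the first input
    [0, -1],  # row 3 builds output 3: negates the second input
]
x = [3, 4]

y = matvec(A, x)
print(y)  # [26, 3, -4]
```

Two old features go in and three new features come out, one per row.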
4. Two Matrices = Two Machines in a Row
Now we can ask the real question.
Suppose $B$ transforms raw inputs $x$ into middle values $y$.
Then $A$ transforms those middle values into final outputs $z$.
flowchart LR
x["x<br/>raw input"] -->|"B"| y["y<br/>middle values"]
y -->|"A"| z["z<br/>final output"]
x -. "combined machine = A B" .-> z
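In symbols:
$$x \xrightarrow{\;B\;} y \xrightarrow{\;A\;} z$$
The product $AB$ means:
$AB$ is the single matrix that does what $B$ does first, then what $A$ does second.
So if:
$$y = Bx$$
and:
$$z = Ay$$
then:
$$z = A(Bx)$$
The product $AB$ is the shortcut machine:
$$z = (AB)x$$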
So matrix multiplication is not an arbitrary grid rule. It is the arithmetic needed to build the shortcut machine.
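A quick numerical check of the shortcut idea, as a sketch: run an input through two machines one after the other, then through the single product matrix, and compare. The matrices and input are hypothetical, and `matmul` uses the row-times-column rule whose origin this chapter goes on to derive.

```python
def matvec(M, v):
    """Apply matrix M (a list of rows) to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Build the shortcut machine AB with the row-times-column rule."""
    inner = len(B)  # size of the middle layer
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical machines and input, chosen only for illustration.
B = [[1, 2], [0, 1]]  # first machine: x -> y
A = [[3, 0], [1, 1]]  # second machine: y -> z
x = [5, 7]

y = matvec(B, x)                      # middle values
z_two_steps = matvec(A, y)            # run B, then A
z_shortcut = matvec(matmul(A, B), x)  # run the single machine AB

print(y)            # [19, 7]
print(z_two_steps)  # [57, 26]
print(z_shortcut)   # [57, 26]
```

Both routes land on the same output, which is exactly what the product matrix is required to guarantee.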
5. Deriving Row Times Column
Use small matrices so we can see every part.
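Let:
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \qquad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$
And let the input vector be:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
First $B$ acts on $x$ and produces the middle vector $y$:
$$y = Bx$$
So:
$$y_1 = b_{11}x_1 + b_{12}x_2, \qquad y_2 = b_{21}x_1 + b_{22}x_2$$
Then $A$ acts on $y$ and produces the final vector $z$:
$$z = Ay$$
So:
$$z_1 = a_{11}y_1 + a_{12}y_2, \qquad z_2 = a_{21}y_1 + a_{22}y_2$$
Now substitute the first machine into the second:
$$z_1 = a_{11}(b_{11}x_1 + b_{12}x_2) + a_{12}(b_{21}x_1 + b_{22}x_2)$$
Look at the coefficient of $x_1$ in $z_1$:
$$a_{11}b_{11} + a_{12}b_{21}$$
That is:
$$\begin{bmatrix} a_{11} & a_{12} \end{bmatrix} \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix}$$
which is:
$$(\text{row 1 of } A) \cdot (\text{column 1 of } B)$$
That gives the $(1,1)$ entry of $AB$.
The $(1,1)$ entry tells us: how much input $x_1$ contributes to output $z_1$ after both machines run.
To compute that, we must collect every route from $x_1$ to $z_1$ through the middle layer. That collection becomes row 1 of $A$ dotted with column 1 of $B$.
The same logic works for every entry:
$$(AB)_{ij} = \sum_{k} a_{ik}\,b_{kj}$$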
So the row-column rule is not the starting point. It is the result of asking:
What single matrix does the same job as these two matrices chained together?
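Carrying the same substitution through all four input-output pairs gives the full product in the 2 by 2 case. This is a worked expansion rather than a new rule, writing $a_{ij}$ for the entries of $A$ and $b_{ij}$ for the entries of $B$:

$$AB = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}$$

Reading across, the entry in row $i$ and column $j$ is the coefficient of input $x_j$ in output $z_i$: it uses only row $i$ of $A$ and column $j$ of $B$.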
6. The Path Picture
Here is the same idea as paths through a two-layer machine.
flowchart LR
x1(["x1"]) -->|"b11"| y1(["y1"])
x1 -->|"b21"| y2(["y2"])
x2(["x2"]) -->|"b12"| y1
x2 -->|"b22"| y2
y1 -->|"a11"| z1(["z1"])
y2 -->|"a12"| z1
y1 -->|"a21"| z2(["z2"])
y2 -->|"a22"| z2
To find how much $x_1$ affects $z_1$, trace all paths from $x_1$ to $z_1$:
path 1: x1 -> y1 -> z1 strength = b11 * a11
path 2: x1 -> y2 -> z1 strength = b21 * a12
total strength = a11*b11 + a12*b21
That is the $(1,1)$ entry of $AB$.
Multiply along a path. Add across different paths.
This is why there is a sum of products.
- The multiplication measures the strength of one complete route.
- The addition combines all routes that connect the same input to the same output.
So entry $(i,j)$ of $AB$ means:
the total influence of input $x_j$ on output $z_i$ after the two machines are chained.
7. A Concrete Example
Let:
The $(1,1)$ entry of $AB$ is:
$$(AB)_{11} = a_{11}b_{11} + a_{12}b_{21} = 3 + 4 = 7$$
Path reading:
x1 -> y1 -> z1: 3 * 1 = 3
x1 -> y2 -> z1: 2 * 2 = 4
total: 3 + 4 = 7
Same answer.
The row-column calculation is just a compact way to do the path calculation.
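The two readings can be compared directly in code. This is a sketch under stated assumptions: only the path products `3 * 1` and `2 * 2` are given in the text, so reading each as (B weight) times (A weight) fixes row 1 of `A` and column 1 of `B`, while the remaining entries below are hypothetical placeholders.

```python
# Assumed values: row 1 of A and column 1 of B are read off the path products
# 3 * 1 and 2 * 2; the second row of A and second column of B are made up.
A = [[1, 2],
     [0, 1]]
B = [[3, 0],
     [2, 1]]

# Path reading for input x1 and output z1: multiply along each path.
path_via_y1 = B[0][0] * A[0][0]  # x1 -> y1 -> z1
path_via_y2 = B[1][0] * A[0][1]  # x1 -> y2 -> z1
path_total = path_via_y1 + path_via_y2

# Row-column reading: row 1 of A dotted with column 1 of B.
row_1_of_A = A[0]
column_1_of_B = [row[0] for row in B]
row_column_total = sum(a * b for a, b in zip(row_1_of_A, column_1_of_B))

print(path_via_y1, path_via_y2)      # 3 4
print(path_total, row_column_total)  # 7 7
```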
8. Why the Inner Dimensions Must Match
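For $AB$ to exist:
$$\underbrace{A}_{m \times n}\ \underbrace{B}_{n \times p}$$
the inside dimensions must match.
Why?
Because $B$ produces the middle values, and $A$ consumes those middle values.
x --B--> y --A--> z
         ^
         this middle size must agree
If $B$ produces 3 middle values but $A$ expects 4, the machines cannot connect.
So:
$$(m \times n)\,(n \times p) \;\to\; (m \times p)$$
The outer dimensions remain:
- input directions from $B$
- output directions from $A$
The shared middle dimension gets summed over:
$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik}\,b_{kj}$$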
The middle dimension is the number of paths through the middle layer. That is why it must match, and why it disappears into the sum.
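The dimension rule can be made explicit with a shape check before multiplying. In this sketch the names `shape` and `matmul`, the sizes `m`, `n`, `p` (rows and columns of `A`, then columns of `B`), and the all-ones matrices are assumptions for illustration.

```python
def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return len(M), len(M[0])


def matmul(A, B):
    """Multiply A (m x n) by B (n x p), summing over the shared middle size n."""
    m, n = shape(A)
    n_from_B, p = shape(B)
    if n != n_from_B:
        raise ValueError(
            f"cannot connect: A expects {n} middle values, B produces {n_from_B}"
        )
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]


# Hypothetical matrices filled with ones, so only the shapes matter.
A = [[1] * 3 for _ in range(2)]  # 2 x 3: consumes 3 middle values
B = [[1] * 4 for _ in range(3)]  # 3 x 4: produces 3 middle values
print(shape(matmul(A, B)))  # (2, 4)

A_mismatched = [[1] * 4 for _ in range(2)]  # 2 x 4: expects 4 middle values
try:
    matmul(A_mismatched, B)
    connected = True
except ValueError as error:
    connected = False
    print(error)  # cannot connect: A expects 4 middle values, B produces 3
```

The middle size 3 appears only as the range of the inner sum, so it leaves no trace in the 2 by 4 result.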
9. Why Order Matters
$AB$ means:
do B first, then A
$BA$ means:
do A first, then B
Those are usually different machines.
For example:
- rotate, then stretch
- stretch, then rotate
These do not usually land in the same place.
So matrix multiplication is usually not commutative. In general:
$$AB \neq BA$$
Matrix multiplication is composition. Composition cares about order.
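The rotate-and-stretch example can be checked with small integer matrices. The specific choices here, a quarter-turn counterclockwise and a doubling of the first coordinate, are assumptions made so the arithmetic stays exact.

```python
def matvec(M, v):
    """Apply matrix M (a list of rows) to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Return the product AB: the machine that does B first, then A."""
    inner = len(B)
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Assumed example transformations.
rotate = [[0, -1], [1, 0]]   # rotate 90 degrees counterclockwise
stretch = [[2, 0], [0, 1]]   # double the first coordinate

rotate_then_stretch = matmul(stretch, rotate)  # rightmost matrix acts first
stretch_then_rotate = matmul(rotate, stretch)

v = [1, 0]
print(rotate_then_stretch)              # [[0, -2], [1, 0]]
print(stretch_then_rotate)              # [[0, -1], [2, 0]]
print(matvec(rotate_then_stretch, v))   # [0, 1]
print(matvec(stretch_then_rotate, v))   # [0, 2]
```

Rotating first moves the vector off the stretched axis, so the stretch no longer touches it; stretching first doubles it before the turn.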
10. Why Not Multiply Matching Cells?
It is natural to ask:
Why not just multiply matching entries?
Like this:
$$(A \odot B)_{ij} = a_{ij}\,b_{ij}$$
That operation exists. It is called the Hadamard product.
But it answers a different question.
Element-wise multiplication says:
combine matching cells independently
Matrix multiplication says:
chain two transformations into one transformation
To chain transformations, every output must collect influence from the whole middle layer. That is why we need row times column.
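To see that the two operations answer different questions, the sketch below compares both against actually running the two machines in sequence. The matrices, the input, and the helper names are hypothetical.

```python
def matvec(M, v):
    """Apply matrix M (a list of rows) to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Row-times-column product: chains B, then A."""
    inner = len(B)
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def hadamard(A, B):
    """Element-wise product: combines matching cells independently."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


# Hypothetical matrices and input.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
x = [1, 1]

chained = matvec(A, matvec(B, x))  # what "B first, then A" really produces

print(matmul(A, B))                # [[19, 22], [43, 50]]
print(hadamard(A, B))              # [[5, 12], [21, 32]]
print(chained)                     # [41, 93]
print(matvec(matmul(A, B), x))     # [41, 93]
print(matvec(hadamard(A, B), x))   # [17, 53]
```

Only the row-times-column product reproduces the chained result; the element-wise product is a valid operation but does not represent the composition.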
11. A Second Lens: Columns Show Where Basis Vectors Land
There is another useful way to see the same rule.
The columns of a matrix tell you where the basis vectors go.
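In 2D:
$$e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
If:
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$
then:
- column 1 of $A$ is where $e_1$ lands after $A$
- column 2 of $A$ is where $e_2$ lands after $A$
Now compose $AB$.
Column $j$ of $B$ tells us where input direction $e_j$ lands after the first machine.
Then $A$ acts on that landing vector.
So:
$$\text{column } j \text{ of } AB = A\,(\text{column } j \text{ of } B)$$
And applying $A$ to that column means:
$$(AB)_{ij} = (\text{row } i \text{ of } A) \cdot (\text{column } j \text{ of } B)$$
So again:
$$(AB)_{ij} = \sum_{k} a_{ik}\,b_{kj}$$
Columns of $B$ show where the first machine sends the basis directions. Rows of $A$ show how the second machine builds each final output. Their meeting point is row times column.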
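The column view can be sketched by building the product one column at a time, without ever writing the row-times-column formula. The matrices and the helper names `matvec` and `column` are illustrative assumptions.

```python
def matvec(M, v):
    """Apply matrix M (a list of rows) to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


# Hypothetical matrices.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
e1, e2 = [1, 0], [0, 1]

# Columns of A are where the basis vectors land.
print(matvec(A, e1), column(A, 0))  # [1, 3] [1, 3]
print(matvec(A, e2), column(A, 1))  # [2, 4] [2, 4]

# Column j of AB is A applied to column j of B.
product_columns = [matvec(A, column(B, j)) for j in range(len(B[0]))]
# Turn the list of columns back into a list of rows.
AB = [list(row) for row in zip(*product_columns)]

print(product_columns)  # [[19, 43], [22, 50]]
print(AB)               # [[19, 22], [43, 50]]
```

Each entry of `AB` was produced by a dot product inside `matvec`, so the row-times-column rule appears on its own once the columns are pushed through the second machine.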
12. Why This Matters in AI
This is the exact machinery behind neural networks.
An embedding is a vector of features.
A weight matrix remixes those features:
embedding -> weight matrix -> new feature vector
In self-attention, the $W_Q$, $W_K$, and $W_V$ matrices transform token embeddings into queries, keys, and values.
flowchart LR
E["token embedding<br/>features"] -->|"matrix Q"| q["query"]
E -->|"matrix K"| k["key"]
q --> S["score = q dot k"]
k --> S
When attention computes $QK^\top$, it is doing many dot products:
How strongly does this query match that key?
That is still the same idea:
- rows as recipes
- columns as inputs/destinations
- dot products as weighted totals
- matrix multiplication as many such totals arranged together
Deep networks are many transformations chained together. Matrix multiplication is the arithmetic that lets those chains be represented as layers.
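A toy sketch of the score computation, under loud assumptions: the embeddings and the weight matrices `W_Q` and `W_K` are made-up integers, tokens are stored as rows, and the scaling, softmax, and value steps of real attention are omitted. It only illustrates that the score table is a grid of query-key dot products.

```python
def matmul(A, B):
    """Row-times-column product of two matrices stored as lists of rows."""
    inner = len(B)
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]


# Hypothetical data: 3 tokens, 2 features each, one token per row.
E = [[1, 0], [0, 1], [1, 1]]
W_Q = [[1, 2], [0, 1]]  # assumed query weights
W_K = [[1, 0], [1, 1]]  # assumed key weights

Q = matmul(E, W_Q)  # one query per token
K = matmul(E, W_K)  # one key per token

# scores[i][j] is the dot product of query i with key j.
scores = matmul(Q, transpose(K))

print(Q)       # [[1, 2], [0, 1], [1, 3]]
print(K)       # [[1, 0], [1, 1], [2, 1]]
print(scores)  # [[1, 3, 4], [0, 1, 1], [1, 4, 5]]
```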
13. What Matrix Multiplication Really Is
Here is the same truth at several depths.
School level
Each entry of the product is:
$$(AB)_{ij} = \sum_{k} a_{ik}\,b_{kj}$$
Intuitive level
Each entry asks:
How much does this input direction influence this output direction after both machines run?
Path level
For each input-output pair:
Multiply along every path through the middle. Add the paths together.
Transformation level
$AB$ is the single transformation that does:
$$x \;\mapsto\; Bx \;\mapsto\; A(Bx)$$
So:
$$(AB)x = A(Bx)$$
First-principle level
Matrix multiplication is defined the way it is so that the product matrix represents the composition of two linear transformations.
14. The One Line to Keep
Row times column is not a trick. It is the bookkeeping of influence through a middle layer.
Multiply along each path. Add across paths.
That is matrix multiplication.
We came in unwilling to memorize a rule. We leave able to derive it, because we now know the question the rule answers:
What single machine does the same work as these two machines chained together?
15. The Bridge Forward
We have now seen how transformations combine. The next natural questions are:
- Is there a matrix that undoes another matrix? That is the idea of an inverse.
- Is there a matrix that changes nothing? That is the identity.
- When a matrix acts, what directions stay aligned, stretch, shrink, or collapse? That leads to eigenvectors and eigenvalues.
Geometry gives relationships shape. Coordinates give shape an address. Vectors turn position and meaning into something we can move. Linear algebra studies the machines that move them, and matrix multiplication is how those machines combine. Calculus then studies how all of it changes.