Matrix Multiplication - The Deep Dive

Q10: Why Does "Row Times Column" Work? - Multiplication as Composition

We know matrices can represent all kinds of things - a vector, a bundle of features or properties for the same object, captured in parallel. I can hold that intuition. But here's what I never questioned and want to stop just believing: when we multiply matrix $A$ by matrix $B$ , the rule book says "take the first row, multiply it against the first column, that's your first entry; first row against second column, that's the next," and so on. Where does that come from? I don't want to memorize it. I want to feel why matrix multiplication has to work this way.

Good. The goal is not to remember the row-column rule. The goal is to see why it is forced.

The key shift is this:

Note

A matrix is not just a grid of numbers. A matrix is a machine that transforms a vector.

So multiplying two matrices means: chain two machines into one machine.

Once we require the product matrix to mean "do $B$ first, then do $A$", the row-times-column formula falls out naturally.

Note

The word matrix comes from Latin matrix, connected to mater, meaning "mother." Older meanings point toward a womb, source, or place where something is formed.

In general English, a matrix can mean an environment, substance, or system within which something else originates, develops, or takes shape.

That fits the mathematical feeling: a matrix is a structured environment of numbers in which a new vector takes form from an old one.

1. A Matrix Is a Machine

A vector is a list of values:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

You can think of those values as coordinates, features, amounts, signals, or measurements.

A matrix takes that vector and produces a new vector:

A x = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \text{new first value} \\ \text{new second value} \end{bmatrix}

So the matrix is not mainly a static object. It is an instruction for how to remake the input vector.

flowchart LR
    x["input vector<br/>old features"] -->|"matrix A"| y["output vector<br/>new features"]

In the functions chapter, a function was an input-output machine. A matrix is that same idea, but specialized for vectors.

Note

A matrix is a feature mixer. It takes old values and blends them into new values.

Before we multiply two matrices, we need the smaller idea: why does one row meeting one vector become a dot product?

2. One Row Is One Recipe

Take one row:

\begin{bmatrix} 2 & 5 \end{bmatrix}

and one vector:

\begin{bmatrix} 3 \\ 4 \end{bmatrix}

The row says:

take 2 times the first input
take 5 times the second input
add the contributions

So:

\begin{bmatrix} 2 & 5 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = 2(3) + 5(4) = 26

That is a dot product.

Note

A dot product answers: "How much total output do I get if each input contributes with its own weight?"

Example:

amounts: [5 USD, 3 EUR, 10 GBP]
rates:   [83,    90,    105]   rupees per unit

The total value in rupees is:

5(83) + 3(90) + 10(105) = 415 + 270 + 1050 = 1735

You multiply each amount by its matching rate, then add the results.

That is all a dot product is: multiply matching pieces, then add the contributions into one total.
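Both totals above can be checked in a few lines of Python. This is a minimal sketch; the helper name `dot` is an illustrative choice, not something defined in the text.

```python
def dot(weights, values):
    """Multiply matching entries, then add the contributions."""
    if len(weights) != len(values):
        raise ValueError("both lists need the same length")
    return sum(w * v for w, v in zip(weights, values))

row_total = dot([2, 5], [3, 4])               # 2*3 + 5*4
rupee_total = dot([83, 90, 105], [5, 3, 10])  # each rate times its amount, summed

print(row_total)    # 26
print(rupee_total)  # 1735
```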

Note

A row is a recipe for building one output number from the input vector.

3. Matrix Times Vector = Many Recipes at Once

If one row builds one output number, then several rows build several output numbers.

\begin{bmatrix} 2 & 5 \\ 1 & 0 \\ -3 & 4 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 2(3) + 5(4) \\ 1(3) + 0(4) \\ -3(3) + 4(4) \end{bmatrix} = \begin{bmatrix} 26 \\ 3 \\ 7 \end{bmatrix}

Each row asks a different question about the same input vector:

Row 1 builds output 1.
Row 2 builds output 2.
Row 3 builds output 3.

So matrix-vector multiplication is simple:

Note

A matrix applies many row-recipes to the same input vector. Each row produces one coordinate of the output.

In feature language:

old features  ->  matrix  ->  new features

Each new feature is a weighted blend of the old features.
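As a small sketch of "many recipes at once", the three-row example above can be reproduced by applying a dot product once per row. The names `dot`, `matvec`, `M`, and `x` are assumptions made for illustration.

```python
def dot(row, vector):
    """One recipe: weight each input, then add."""
    return sum(r * v for r, v in zip(row, vector))

def matvec(matrix, vector):
    """Apply every row-recipe to the same input vector."""
    return [dot(row, vector) for row in matrix]

M = [[2, 5],
     [1, 0],
     [-3, 4]]
x = [3, 4]

new_features = matvec(M, x)
print(new_features)  # [26, 3, 7]
```

Each output coordinate comes from exactly one row, so a matrix with three rows always produces three new values, whatever the input.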

4. Two Matrices = Two Machines in a Row

Now we can ask the real question.

Suppose $B$ transforms raw inputs $x$ into middle values $y$ .

Then $A$ transforms those middle values $y$ into final outputs $z$ .

flowchart LR
    x["x<br/>raw input"] -->|"B"| y["y<br/>middle values"]
    y -->|"A"| z["z<br/>final output"]
    x -. "combined machine = A B" .-> z

In symbols:

x \xrightarrow{B} y \xrightarrow{A} z

The product $AB$ means:

Note

$AB$ is the single matrix that does what $B$ does first, then what $A$ does second.

So if:

y = Bx

and:

z = Ay

then:

z = A(Bx)

The product $AB$ is the shortcut machine:

z = (AB)x

So matrix multiplication is not an arbitrary grid rule. It is the arithmetic needed to build the shortcut machine.
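The claim $z = A(Bx) = (AB)x$ can be checked numerically. The sketch below takes the row-times-column rule on trust for the moment (the derivation follows in the next section) and uses example values for `A`, `B`, and `x` chosen purely for illustration.

```python
def matvec(M, v):
    """Apply matrix M to vector v, one row at a time."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Entry (i, j) is row i of A dotted with column j of B."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [0, 1]]
B = [[3, 1], [2, 4]]
x = [5, -2]

two_steps = matvec(A, matvec(B, x))  # run B first, then A
one_step = matvec(matmul(A, B), x)   # run the shortcut machine once

print(two_steps)  # [17, 2]
print(one_step)   # [17, 2]
```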

5. Deriving Row Times Column

Use small $2 \times 2$ matrices so we can see every part.

Let:

B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} ,\qquad A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}

And let the input vector be:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

First $B$ acts on $x$ and produces the middle vector $y$ :

y = Bx = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}

So:

\begin{aligned} y_1 &= b_{11}x_1 + b_{12}x_2 \\ y_2 &= b_{21}x_1 + b_{22}x_2 \end{aligned}

Then $A$ acts on $y$ and produces the final vector $z$ :

z = Ay = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}

So:

\begin{aligned} z_1 &= a_{11}y_1 + a_{12}y_2 \\ z_2 &= a_{21}y_1 + a_{22}y_2 \end{aligned}

Now substitute the first machine into the second:

\begin{aligned} z_1 &= a_{11}y_1 + a_{12}y_2 \\ &= a_{11}(b_{11}x_1 + b_{12}x_2) + a_{12}(b_{21}x_1 + b_{22}x_2) \\ &= (a_{11}b_{11} + a_{12}b_{21})x_1 + (a_{11}b_{12} + a_{12}b_{22})x_2 \end{aligned}

Look at the coefficient of $x_1$ in $z_1$ :

a_{11}b_{11} + a_{12}b_{21}

That is:

\begin{bmatrix} a_{11} & a_{12} \end{bmatrix} \cdot \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix}

which is:

\text{row 1 of } A \cdot \text{column 1 of } B

That gives the $(1,1)$ entry of $AB$ .

Note

The $(1,1)$ entry tells us: how much input $x_1$ contributes to output $z_1$ after both machines run.

To compute that, we must collect every route from $x_1$ to $z_1$ through the middle layer. That collection becomes row 1 of $A$ dotted with column 1 of $B$ .

The same logic works for every entry:

(AB)_{ij} = \sum_k a_{ik}b_{kj} = \text{row}_i(A)\cdot\text{column}_j(B)
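For the $2 \times 2$ case, the same substitution can be written out for $z_2$ as well. This is a worked continuation of the derivation above, using only the symbols already defined:

\begin{aligned} z_2 &= a_{21}y_1 + a_{22}y_2 \\ &= a_{21}(b_{11}x_1 + b_{12}x_2) + a_{22}(b_{21}x_1 + b_{22}x_2) \\ &= (a_{21}b_{11} + a_{22}b_{21})x_1 + (a_{21}b_{12} + a_{22}b_{22})x_2 \end{aligned}

Reading off the four coefficients of $x_1$ and $x_2$ in $z_1$ and $z_2$ gives the whole product:

AB = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}

Every entry is a row of $A$ dotted with a column of $B$.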

So the row-column rule is not the starting point. It is the result of asking:

What single matrix does the same job as these two matrices chained together?

6. The Path Picture

Here is the same idea as paths through a two-layer machine.

flowchart LR
    x1(["x1"]) -->|"b11"| y1(["y1"])
    x1 -->|"b21"| y2(["y2"])
    x2(["x2"]) -->|"b12"| y1
    x2 -->|"b22"| y2
    y1 -->|"a11"| z1(["z1"])
    y2 -->|"a12"| z1
    y1 -->|"a21"| z2(["z2"])
    y2 -->|"a22"| z2

To find how much $x_1$ affects $z_1$ , trace all paths from $x_1$ to $z_1$ :

path 1: x1 -> y1 -> z1    strength = b11 * a11
path 2: x1 -> y2 -> z1    strength = b21 * a12

total strength = a11*b11 + a12*b21

That is the $(1,1)$ entry of $AB$ .

Note

Multiply along a path. Add across different paths.

This is why there is a sum of products.

The multiplication measures the strength of one complete route.
The addition combines all routes that connect the same input to the same output.

So entry $(i,j)$ of $AB$ means:

the total influence of input $j$ on output $i$ after the two machines are chained.

7. A Concrete Example

Let:

A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} ,\qquad B = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}

The $(1,1)$ entry of $AB$ is:

\begin{bmatrix} 1 & 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 2 \end{bmatrix} = 1(3) + 2(2) = 7

Path reading:

x1 -> y1 -> z1: 3 * 1 = 3
x1 -> y2 -> z1: 2 * 2 = 4

total: 3 + 4 = 7

Same answer.

The row-column calculation is just a compact way to do the path calculation.
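The path reading can be extended to every entry of this example. The following sketch lists the strength of each route and then sums them; the function names `path_strengths` and `product_by_paths` are hypothetical helpers, and indices are zero-based, so `(0, 0)` is the $(1,1)$ entry.

```python
def path_strengths(A, B, i, j):
    """Strength of each route: input j -> middle k -> output i."""
    return [B[k][j] * A[i][k] for k in range(len(B))]

def product_by_paths(A, B):
    """Add the route strengths for every input-output pair."""
    return [[sum(path_strengths(A, B, i, j)) for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [0, 1]]
B = [[3, 1], [2, 4]]

routes = path_strengths(A, B, 0, 0)   # the two routes from x1 to z1
full_product = product_by_paths(A, B)

print(routes)        # [3, 4]
print(full_product)  # [[7, 9], [2, 4]]
```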

8. Why the Inner Dimensions Must Match

For $AB$ to exist:

A_{m \times n}B_{n \times p}

the inside dimensions must match.

Why?

Because $B$ produces the middle values, and $A$ consumes those middle values.

x  --B-->  y  --A-->  z
           ^
           this middle size must agree

If $B$ produces 3 middle values but $A$ expects 4, the machines cannot connect.

So:

A_{m \times n}B_{n \times p} = (AB)_{m \times p}

The outer dimensions remain:

$p$ input directions from $B$
$m$ output directions from $A$

The shared middle dimension $n$ gets summed over:

(AB)_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}

Note

The middle dimension is the number of paths through the middle layer for each input-output pair. That is why it must match, and why it disappears into the sum.
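A short sketch of the shape rule, assuming matrices stored as lists of rows. The helper names `shape` and `matmul` and the example matrices are illustrative assumptions.

```python
def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return len(M), len(M[0])

def matmul(A, B):
    m, n = shape(A)
    n_b, p = shape(B)
    if n != n_b:
        raise ValueError(
            f"cannot chain: left matrix expects {n} middle values, "
            f"right matrix produces {n_b}"
        )
    # Sum over the shared middle index k; only m and p survive.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 0, 2],
     [0, 1, 1]]        # 2 x 3
B = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [3, 0, 0, 2]]     # 3 x 4

print(shape(matmul(A, B)))  # (2, 4)

try:
    matmul(B, A)            # 3 x 4 times 2 x 3: middle sizes 4 and 2 disagree
except ValueError as err:
    print(err)
```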

9. Why Order Matters

$AB$ means:

do B first, then A

$BA$ means:

do A first, then B

Those are usually different machines.

For example:

rotate, then stretch
stretch, then rotate

These do not usually land in the same place.

So matrix multiplication is usually not commutative:

AB \neq BA \quad \text{(in general)}

Note

Matrix multiplication is composition. Composition cares about order.
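The rotate-and-stretch example can be made concrete. This sketch assumes a quarter-turn rotation and a stretch that doubles the first coordinate; both matrices are illustrative choices, not taken from the text.

```python
def matmul(A, B):
    """Row of A dotted with column of B, for every entry."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

rotate = [[0, -1],
          [1, 0]]    # quarter turn counterclockwise
stretch = [[2, 0],
           [0, 1]]   # double the first coordinate

stretch_then_rotate = matmul(rotate, stretch)  # the right-hand matrix acts first
rotate_then_stretch = matmul(stretch, rotate)

print(stretch_then_rotate)  # [[0, -1], [2, 0]]
print(rotate_then_stretch)  # [[0, -2], [1, 0]]
```

The two products differ, so the two orderings are different machines.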

10. Why Not Multiply Matching Cells?

It is natural to ask:

Why not just multiply matching entries?

Like this:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} \odot \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} ae & bf \\ cg & dh \end{bmatrix}

That operation exists. It is called the Hadamard product.

But it answers a different question.

Element-wise multiplication says:

combine matching cells independently

Matrix multiplication says:

chain two transformations into one transformation

To chain transformations, every output must collect influence from the whole middle layer. That is why we need row times column.
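To see that the two operations answer different questions, the sketch below computes both on the same pair of matrices and compares each against running the two machines in sequence. The matrices and the test vector `x` are example values chosen for illustration.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def hadamard(A, B):
    """Multiply matching cells independently."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]

def matmul(A, B):
    """Chain two transformations: row of A dotted with column of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [0, 1]]
B = [[3, 1], [2, 4]]
x = [5, -2]

chained = matvec(A, matvec(B, x))         # B first, then A
via_matmul = matvec(matmul(A, B), x)
via_hadamard = matvec(hadamard(A, B), x)

print(hadamard(A, B))  # [[3, 2], [0, 4]]
print(matmul(A, B))    # [[7, 9], [2, 4]]
print(chained)         # [17, 2]
print(via_matmul)      # [17, 2]
print(via_hadamard)    # [11, -8]
```

Only the row-times-column product reproduces the chained result for this input.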

11. A Second Lens: Columns Show Where Basis Vectors Land

There is another useful way to see the same rule.

The columns of a matrix tell you where the basis vectors go.

In 2D:

\hat{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} ,\qquad \hat{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}

If:

B = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}

then:

column 1 of $B$ is where $\hat{e}_1$ lands after $B$
column 2 of $B$ is where $\hat{e}_2$ lands after $B$

Now compose $AB$ .

Column $j$ of $B$ tells us where input direction $j$ lands after the first machine.

Then $A$ acts on that landing vector.

So:

\text{column }j\text{ of }AB = A(\text{column }j\text{ of }B)

And applying $A$ to that column means:

\text{each row of }A \cdot \text{that column of }B

So again:

(AB)_{ij} = \text{row}_i(A)\cdot\text{column}_j(B)

Note

Columns of $B$ show where the first machine sends the basis directions. Rows of $A$ show how the second machine builds each final output. Their meeting point is row times column.
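The column view can be sketched directly: build $AB$ one column at a time by sending each column of $B$ through $A$. The helper names `column` and `matmul_by_columns` are hypothetical, and the matrix `A` is an example value chosen for illustration.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def column(M, j):
    """Column j of M as a plain list."""
    return [row[j] for row in M]

def matmul_by_columns(A, B):
    # cols[j] is A applied to column j of B, which is column j of AB.
    cols = [matvec(A, column(B, j)) for j in range(len(B[0]))]
    # Transpose the list of columns back into a list of rows.
    return [list(row) for row in zip(*cols)]

A = [[1, 2], [0, 1]]
B = [[3, 1], [2, 4]]

e1_lands = matvec(B, [1, 0])     # where the first basis vector lands after B
product = matmul_by_columns(A, B)

print(e1_lands)  # [3, 2]
print(product)   # [[7, 9], [2, 4]]
```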

12. Why This Matters in AI

This is the same machinery used in the linear layers of neural networks.

An embedding is a vector of features.

A weight matrix remixes those features:

embedding -> weight matrix -> new feature vector

In self-attention, three learned weight matrices, commonly written $W_Q$, $W_K$, and $W_V$, transform token embeddings into queries, keys, and values.

flowchart LR
    E["token embedding<br/>features"] -->|"matrix W_Q"| q["query"]
    E -->|"matrix W_K"| k["key"]
    q --> S["score = q dot k"]
    k --> S

When attention computes $QK^\top$, where the rows of $Q$ and $K$ are the queries and keys, it is doing many dot products:

How strongly does this query match that key?

That is still the same idea:

rows as recipes
columns as inputs/destinations
dot products as weighted totals
matrix multiplication as many such totals arranged together

Deep networks chain many such transformations, typically with a nonlinear step between them. Matrix multiplication is the arithmetic of each linear step in that chain.
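The $QK^\top$ step can be sketched as a table of dot products. The numbers below are toy values invented for illustration, with one query row and one key row per token; real attention layers apply further steps (such as scaling and a softmax) that this sketch leaves out.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data: three tokens, each with a 2-dimensional query and key.
Q = [[1, 0], [0, 2], [1, 1]]
K = [[1, 1], [0, 1], [2, 0]]

# scores[i][j] = query i dotted with key j, which is entry (i, j) of Q K^T,
# because column j of K^T is row j of K.
scores = [[dot(q, k) for k in K] for q in Q]

for row in scores:
    print(row)
# [1, 0, 2]
# [2, 2, 0]
# [2, 1, 2]
```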

13. What Matrix Multiplication Really Is

Here is the same truth at several depths.

School level

Each entry of the product is:

\text{row of the first matrix} \cdot \text{column of the second matrix}

Intuitive level

Each entry asks:

How much does this input direction influence this output direction after both machines run?

Path level

For each input-output pair:

Multiply along every path through the middle. Add the paths together.

Transformation level

$AB$ is the single transformation that does:

x \xrightarrow{B} Bx \xrightarrow{A} A(Bx)

So:

(AB)x = A(Bx)

First-principle level

Matrix multiplication is defined the way it is so that the product matrix represents the composition of two linear transformations.

14. The One Line to Keep

Note

Row times column is not a trick. It is the bookkeeping of influence through a middle layer.

Multiply along each path. Add across paths.

That is matrix multiplication.

We came in unwilling to memorize a rule. We leave able to derive it, because we now know the question the rule answers:

What single machine does the same work as these two machines chained together?

15. The Bridge Forward

We have now seen how transformations combine. The next natural questions are:

Is there a matrix that undoes another matrix? That is the idea of an inverse.
Is there a matrix that changes nothing? That is the identity.
When a matrix acts, what directions stay aligned, stretch, shrink, or collapse? That leads to eigenvectors and eigenvalues.

Geometry gives relationships shape. Coordinates give shape an address. Vectors turn position and meaning into something we can move. Linear algebra studies the machines that move them, and matrix multiplication is how those machines combine. Calculus then studies how all of it changes.
