Matrix Multiplication - The Deep Dive

Q10: Why Does "Row Times Column" Work? - Multiplication as Composition

Manish: We know matrices can represent all kinds of things - a vector, a bundle of features or properties for the same object, captured in parallel. I can hold that intuition. But here's what I never questioned and want to stop just believing: when we multiply matrix A by matrix B, the rule book says "take the first row, multiply it against the first column, that's your first entry; first row against second column, that's the next," and so on. Where does that come from? I don't want to memorize it. I want to feel why matrix multiplication has to work this way.

Response: Good. The goal is not to remember the row-column rule. The goal is to see why it is forced.

The key shift is this:

Note

A matrix is not just a grid of numbers. A matrix is a machine that transforms a vector.

So multiplying two matrices means: chain two machines into one machine.

Once we ask for the product matrix to mean "do B first, then do A", the row-times-column formula falls out naturally.

Note

The word matrix comes from Latin matrix, connected to mater, meaning "mother." Older meanings point toward a womb, source, or place where something is formed.

In general English, a matrix can mean an environment, substance, or system within which something else originates, develops, or takes shape.

That fits the mathematical feeling: a matrix is a structured environment of numbers in which a new vector takes form from an old one.


1. A Matrix Is a Machine

A vector is a list of values:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

You can think of those values as coordinates, features, amounts, signals, or measurements.

A matrix takes that vector and produces a new vector:

A x = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \text{new first value} \\ \text{new second value} \end{bmatrix}

So the matrix is not mainly a static object. It is an instruction for how to remake the input vector.

flowchart LR
    x["input vector<br/>old features"] -->|"matrix A"| y["output vector<br/>new features"]

In the functions chapter, a function was an input-output machine. A matrix is that same idea, but specialized for vectors.

Note

A matrix is a feature mixer. It takes old values and blends them into new values.

Before we multiply two matrices, we need the smaller idea: why does one row meeting one vector become a dot product?


2. One Row Is One Recipe

Take one row:

\begin{bmatrix} 2 & 5 \end{bmatrix}

and one vector:

\begin{bmatrix} 3 \\ 4 \end{bmatrix}

The row says:

  • take 2 times the first input
  • take 5 times the second input
  • add the contributions

So:

\begin{bmatrix} 2 & 5 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = 2(3) + 5(4) = 26

That is a dot product.

Note

A dot product answers: "How much total output do I get if each input contributes with its own weight?"

Example:

amounts: [5 USD, 3 EUR, 10 GBP]
rates:   [83,    90,    105]   rupees per unit

The total value in rupees is:

5(83) + 3(90) + 10(105)

You multiply each amount by its matching rate, then add the results.

That is all a dot product is: multiply matching pieces, then add the contributions into one total.

Note

A row is a recipe for building one output number from the input vector.
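Here is a minimal Python sketch of that recipe idea. The helper name `dot` is our own choice, not something from a library, and the numbers are the ones used above.

```python
def dot(weights, inputs):
    """Multiply matching pieces, then add the contributions into one total."""
    if len(weights) != len(inputs):
        raise ValueError("a recipe needs exactly one weight per input")
    return sum(w * x for w, x in zip(weights, inputs))

# The row [2, 5] meeting the vector [3, 4].
row_total = dot([2, 5], [3, 4])        # 2*3 + 5*4

# The currency example: amounts held, and rupees per unit of each currency.
amounts = [5, 3, 10]                   # USD, EUR, GBP
rates = [83, 90, 105]
rupee_total = dot(rates, amounts)      # 5*83 + 3*90 + 10*105

print(row_total, rupee_total)          # 26 1735
```

The length check is the code version of "matching pieces": a recipe with two weights cannot be applied to three inputs.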


3. Matrix Times Vector = Many Recipes at Once

If one row builds one output number, then several rows build several output numbers.

\begin{bmatrix} 2 & 5 \\ 1 & 0 \\ -3 & 4 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 2(3) + 5(4) \\ 1(3) + 0(4) \\ -3(3) + 4(4) \end{bmatrix} = \begin{bmatrix} 26 \\ 3 \\ 7 \end{bmatrix}

Each row asks a different question about the same input vector:

  • Row 1 builds output 1.
  • Row 2 builds output 2.
  • Row 3 builds output 3.

So matrix-vector multiplication is simple:

Note

A matrix applies many row-recipes to the same input vector. Each row produces one coordinate of the output.

In feature language:

old features  ->  matrix  ->  new features

Each new feature is a weighted blend of the old features.
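The same thing can be sketched in a few lines of plain Python. The names `dot` and `mat_vec` are illustrative helpers of our own, and the matrix is stored as a list of rows.

```python
def dot(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def mat_vec(matrix, vector):
    """Apply every row-recipe of `matrix` to the same input vector."""
    return [dot(row, vector) for row in matrix]

M = [[2, 5],
     [1, 0],
     [-3, 4]]
x = [3, 4]

print(mat_vec(M, x))   # [26, 3, 7]
```

Three rows in, three output numbers out. The input vector is never changed; each row just reads it in its own way.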


4. Two Matrices = Two Machines in a Row

Now we can ask the real question.

Suppose B transforms raw inputs x into middle values y.

Then A transforms those middle values y into final outputs z.

flowchart LR
    x["x<br/>raw input"] -->|"B"| y["y<br/>middle values"]
    y -->|"A"| z["z<br/>final output"]
    x -. "combined machine = A B" .-> z

In symbols:

x \xrightarrow{B} y \xrightarrow{A} z

The product AB means:

Note

AB is the single matrix that does what B does first, then what A does second.

So if:

y = Bx

and:

z = Ay

then:

z = A(Bx)

The product AB is the shortcut machine:

z = (AB)x

So matrix multiplication is not an arbitrary grid rule. It is the arithmetic needed to build the shortcut machine.
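Before looking for the shortcut, it may help to see the long way written out. This is a small sketch with made-up numbers (A, B, and x here are hypothetical, chosen only so the arithmetic stays easy), and `mat_vec` is our own helper.

```python
def mat_vec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# Hypothetical machines and input, not taken from the text.
A = [[2, 0],
     [1, 3]]
B = [[1, 4],
     [5, 2]]

def chained(x):
    """Run machine B first, then feed its output into machine A."""
    y = mat_vec(B, x)      # middle values
    return mat_vec(A, y)   # final output

x = [3, -1]
print(mat_vec(B, x))   # [-1, 13]
print(chained(x))      # [-2, 38]
```

The function `chained` already behaves like one machine. The open question is which single grid of numbers does the same job.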


5. Deriving Row Times Column

Use small 2 × 2 matrices so we can see every part.

Let:

B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} ,\qquad A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}

And let the input vector be:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

First B acts on x and produces the middle vector y:

y = Bx = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}

So:

\begin{aligned} y_1 &= b_{11}x_1 + b_{12}x_2 \\ y_2 &= b_{21}x_1 + b_{22}x_2 \end{aligned}

Then A acts on y and produces the final vector z:

z = Ay = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}

So:

\begin{aligned} z_1 &= a_{11}y_1 + a_{12}y_2 \\ z_2 &= a_{21}y_1 + a_{22}y_2 \end{aligned}

Now substitute the first machine into the second:

\begin{aligned} z_1 &= a_{11}y_1 + a_{12}y_2 \\ &= a_{11}(b_{11}x_1 + b_{12}x_2) + a_{12}(b_{21}x_1 + b_{22}x_2) \\ &= (a_{11}b_{11} + a_{12}b_{21})x_1 + (a_{11}b_{12} + a_{12}b_{22})x_2 \end{aligned}

Look at the coefficient of x_1 in z_1:

a_{11}b_{11} + a_{12}b_{21}

That is:

\begin{bmatrix} a_{11} & a_{12} \end{bmatrix} \cdot \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix}

which is:

\text{row 1 of } A \cdot \text{column 1 of } B

That gives the (1,1) entry of AB.

Note

The (1,1) entry tells us: how much input x_1 contributes to output z_1 after both machines run.

To compute that, we must collect every route from x_1 to z_1 through the middle layer. That collection becomes row 1 of A dotted with column 1 of B.

The same logic works for every entry:

(AB)_{ij} = \sum_k a_{ik}b_{kj} = \text{row}_i(A)\cdot\text{column}_j(B)

So the row-column rule is not the starting point. It is the result of asking:

What single matrix does the same job as these two matrices chained together?
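As a sanity check, the sum formula can be typed in directly and compared against running the two machines one after the other. This is a sketch, not a tuned implementation: `mat_vec` and `mat_mul` are our own helper names, and the matrices and input are hypothetical numbers.

```python
def mat_vec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def mat_mul(A, B):
    """(AB)[i][j] = sum over k of A[i][k] * B[k][j]."""
    n = len(B)  # shared middle dimension
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

# Hypothetical numbers, not taken from the text.
A = [[2, 0],
     [1, 3]]
B = [[1, 4],
     [5, 2]]
x = [3, -1]

AB = mat_mul(A, B)
two_steps = mat_vec(A, mat_vec(B, x))   # B first, then A
one_step = mat_vec(AB, x)               # the shortcut machine

print(AB)                    # [[2, 8], [16, 10]]
print(two_steps, one_step)   # [-2, 38] [-2, 38]
```

The two results agree, which is exactly what the row-column rule was built to guarantee.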


6. The Path Picture

Here is the same idea as paths through a two-layer machine.

flowchart LR
    x1(["x1"]) -->|"b11"| y1(["y1"])
    x1 -->|"b21"| y2(["y2"])
    x2(["x2"]) -->|"b12"| y1
    x2 -->|"b22"| y2
    y1 -->|"a11"| z1(["z1"])
    y2 -->|"a12"| z1
    y1 -->|"a21"| z2(["z2"])
    y2 -->|"a22"| z2

To find how much x_1 affects z_1, trace all paths from x_1 to z_1:

path 1: x1 -> y1 -> z1    strength = b11 * a11
path 2: x1 -> y2 -> z1    strength = b21 * a12

total strength = a11*b11 + a12*b21

That is the (1,1) entry of AB.

Note

Multiply along a path. Add across different paths.

This is why there is a sum of products.

  • The multiplication measures the strength of one complete route.
  • The addition combines all routes that connect the same input to the same output.

So entry (i,j) of AB means:

the total influence of input j on output i after the two machines are chained.
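The path picture translates almost word for word into code. In this sketch the function name `paths` and the numbers are our own assumptions, and indices are 0-based, so `i=1, j=0` means "output z_2, input x_1".

```python
def paths(A, B, i, j):
    """List every route x_j -> y_k -> z_i with its strength (0-based indices)."""
    routes = []
    for k in range(len(B)):             # one route per middle value y_k
        strength = B[k][j] * A[i][k]    # multiply along the path
        routes.append((k, strength))
    return routes

# Hypothetical numbers, not taken from the text.
A = [[2, 0],
     [1, 3]]
B = [[1, 4],
     [5, 2]]

routes = paths(A, B, i=1, j=0)                    # how x_1 reaches z_2
total = sum(strength for _, strength in routes)   # add across paths

print(routes)   # [(0, 1), (1, 15)]
print(total)    # 16
```

Each tuple is one route through the middle layer. Their sum is one entry of the product matrix.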


7. A Concrete Example

Let:

A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} ,\qquad B = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}

The (1,1) entry of AB is:

\begin{bmatrix} 1 & 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 2 \end{bmatrix} = 1(3) + 2(2) = 7

Path reading:

x1 -> y1 -> z1: 3 * 1 = 3
x1 -> y2 -> z1: 2 * 2 = 4

total: 3 + 4 = 7

Same answer.

The row-column calculation is just a compact way to do the path calculation.
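The other three entries follow the same pattern. A short sketch, using the A and B above and two helper names of our own (`by_paths`, `by_row_column`), fills in the whole product both ways:

```python
A = [[1, 2],
     [0, 1]]
B = [[3, 1],
     [2, 4]]

def by_paths(i, j):
    """Sum the strengths of all routes x_j -> y_k -> z_i."""
    return sum(B[k][j] * A[i][k] for k in range(2))

def by_row_column(i, j):
    """Dot row i of A with column j of B."""
    row = A[i]
    column = [B[k][j] for k in range(2)]
    return sum(r * c for r, c in zip(row, column))

AB_paths = [[by_paths(i, j) for j in range(2)] for i in range(2)]
AB_rows = [[by_row_column(i, j) for j in range(2)] for i in range(2)]

print(AB_paths)   # [[7, 9], [2, 4]]
print(AB_rows)    # [[7, 9], [2, 4]]
```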


8. Why the Inner Dimensions Must Match

For AB to exist:

A_{m \times n}B_{n \times p}

the inside dimensions must match.

Why?

Because B produces the middle values, and A consumes those middle values.

x  --B-->  y  --A-->  z
           ^
           this middle size must agree

If B produces 3 middle values but A expects 4, the machines cannot connect.

So:

A_{m \times n}B_{n \times p} = (AB)_{m \times p}

The outer dimensions remain:

  • p input directions from B
  • m output directions from A

The shared middle dimension n gets summed over:

(AB)_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}

Note

The middle dimension is the number of paths through the middle layer. That is why it must match, and why it disappears into the sum.
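A shape check makes the plug-and-socket idea concrete. This is a sketch only: `shape` and `product_shape` are hypothetical helper names, and the matrices are arbitrary fillers chosen for their sizes.

```python
def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(M), len(M[0]))

def product_shape(A, B):
    """Shape of AB, or an error if B's outputs do not fit A's inputs."""
    m, n = shape(A)
    n_b, p = shape(B)
    if n != n_b:
        raise ValueError(
            f"cannot chain: B produces {n_b} middle values, A expects {n}")
    return (m, p)

A = [[1, 0, 2],
     [0, 1, 3]]            # 2 x 3
B = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]      # 3 x 4

print(product_shape(A, B))   # (2, 4)

try:
    product_shape(B, A)      # 3 x 4 times 2 x 3: the middle sizes disagree
except ValueError as err:
    print(err)               # cannot chain: B produces 2 middle values, A expects 4
```

Note that the order of the arguments decides which matrix plays the role of the second machine, so the same pair can connect one way and fail the other way.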


9. Why Order Matters

AB means:

do B first, then A

BA means:

do A first, then B

Those are usually different machines.

For example:

  • rotate, then stretch
  • stretch, then rotate

These do not usually land in the same place.

So matrix multiplication is usually not commutative:

AB \neq BA

Note

Matrix multiplication is composition. Composition cares about order.
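The rotate and stretch example can be checked with two small matrices. In this sketch R is a quarter-turn counterclockwise and S doubles the first coordinate; both are our own illustrative choices, and `mat_mul` is a helper of our own.

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

R = [[0, -1],
     [1, 0]]    # rotate 90 degrees counterclockwise
S = [[2, 0],
     [0, 1]]    # stretch the first coordinate by 2

rotate_then_stretch = mat_mul(S, R)   # R acts first, so it sits on the right
stretch_then_rotate = mat_mul(R, S)   # S acts first

print(rotate_then_stretch)   # [[0, -2], [1, 0]]
print(stretch_then_rotate)   # [[0, -1], [2, 0]]
```

Same two machines, different order, different combined machine.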


10. Why Not Multiply Matching Cells?

It is natural to ask:

Why not just multiply matching entries?

Like this:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} \odot \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} ae & bf \\ cg & dh \end{bmatrix}

That operation exists. It is called the Hadamard product.

But it answers a different question.

Element-wise multiplication says:

combine matching cells independently

Matrix multiplication says:

chain two transformations into one transformation

To chain transformations, every output must collect influence from the whole middle layer. That is why we need row times column.
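One way to see the difference is to test both candidates against the chained machines. This is a sketch with helper names of our own (`mat_vec`, `mat_mul`, `hadamard`); the input x is a hypothetical test vector, and A and B are the matrices from the concrete example.

```python
def mat_vec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def hadamard(A, B):
    """Multiply matching cells independently."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]

A = [[1, 2],
     [0, 1]]
B = [[3, 1],
     [2, 4]]
x = [1, 2]   # hypothetical test input

chained = mat_vec(A, mat_vec(B, x))         # B first, then A
via_product = mat_vec(mat_mul(A, B), x)     # row times column
via_hadamard = mat_vec(hadamard(A, B), x)   # matching cells

print(chained, via_product, via_hadamard)   # [25, 10] [25, 10] [7, 8]
```

Only the row-times-column product reproduces the chain. The Hadamard product is a perfectly good operation, just not the one that composes machines.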


11. A Second Lens: Columns Show Where Basis Vectors Land

There is another useful way to see the same rule.

The columns of a matrix tell you where the basis vectors go.

In 2D:

\hat{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} ,\qquad \hat{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}

If:

B = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}

then:

  • column 1 of B is where \hat{e}_1 lands after B
  • column 2 of B is where \hat{e}_2 lands after B

Now compose AB.

Column j of B tells us where input direction j lands after the first machine.

Then A acts on that landing vector.

So:

\text{column }j\text{ of }AB = A(\text{column }j\text{ of }B)

And applying A to that column means:

\text{each row of }A \cdot \text{that column of }B

So again:

(AB)_{ij} = \text{row}_i(A)\cdot\text{column}_j(B)

Note

Columns of B show where the first machine sends the basis directions. Rows of A show how the second machine builds each final output. Their meeting point is row times column.
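The column view can be sketched the same way: pull out each column of B, push it through A, and stand the results side by side. The helper `mat_vec` is our own, and A and B are the matrices from the concrete example.

```python
def mat_vec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]

A = [[1, 2],
     [0, 1]]
B = [[3, 1],
     [2, 4]]

# Where the basis directions land after B: the columns of B.
columns_of_B = [[row[j] for row in B] for j in range(len(B[0]))]

# Push each landing vector through A.
columns_of_AB = [mat_vec(A, col) for col in columns_of_B]

# Stand the resulting columns side by side to get AB as a list of rows.
AB = [[col[i] for col in columns_of_AB] for i in range(len(A))]

print(columns_of_B)    # [[3, 2], [1, 4]]
print(columns_of_AB)   # [[7, 2], [9, 4]]
print(AB)              # [[7, 9], [2, 4]]
```

This is the same product as before, built one column at a time instead of one entry at a time.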


12. Why This Matters in AI

This is the core machinery inside the layers of a neural network.

An embedding is a vector of features.

A weight matrix remixes those features:

embedding -> weight matrix -> new feature vector

In self-attention, learned weight matrices (commonly written W_Q, W_K, and W_V) transform token embeddings into queries, keys, and values. Stacking the per-token results row by row gives the matrices Q, K, and V.

flowchart LR
    E["token embedding<br/>features"] -->|"matrix W_Q"| q["query"]
    E -->|"matrix W_K"| k["key"]
    q --> S["score = q dot k"]
    k --> S

When attention computes QK^\top, it is doing many dot products:

How strongly does this query match that key?

That is still the same idea:

  • rows as recipes
  • columns as inputs/destinations
  • dot products as weighted totals
  • matrix multiplication as many such totals arranged together

Deep networks are many transformations chained together. Matrix multiplication is the arithmetic that lets those chains be represented as layers.
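To make the score table concrete, here is a toy sketch. The numbers in Q and K are invented for illustration (two queries, three keys, each stored as a row), and the scaling and softmax that real attention applies afterwards are left out on purpose.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy values, purely hypothetical: each row is one token's query or key.
Q = [[1, 2],
     [0, 1]]      # two queries
K = [[1, 0],
     [1, 1],
     [0, 2]]      # three keys

# Entry (i, j) of Q K^T: how strongly query i matches key j.
scores = [[dot(q, k) for k in K] for q in Q]

print(scores)   # [[1, 3, 4], [0, 1, 2]]
```

Because the keys are stored as rows, dotting a query with each row of K is the same as dotting it with each column of K^T, which is the row-times-column rule again.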


13. What Matrix Multiplication Really Is

Here is the same truth at several depths.

School level

Each entry of the product is:

\text{row of the first matrix} \cdot \text{column of the second matrix}

Intuitive level

Each entry asks:

How much does this input direction influence this output direction after both machines run?

Path level

For each input-output pair:

Multiply along every path through the middle. Add the paths together.

Transformation level

AB is the single transformation that does:

x \xrightarrow{B} Bx \xrightarrow{A} A(Bx)

So:

(AB)x = A(Bx)

First-principle level

Matrix multiplication is defined the way it is so that the product matrix represents the composition of two linear transformations.


14. The One Line to Keep

Note

Row times column is not a trick. It is the bookkeeping of influence through a middle layer.

Multiply along each path. Add across paths.

That is matrix multiplication.

We came in unwilling to memorize a rule. We leave able to derive it, because we now know the question the rule answers:

What single machine does the same work as these two machines chained together?


15. The Bridge Forward

We have now seen how transformations combine. The next natural questions are:

  • Is there a matrix that undoes another matrix? That is the idea of an inverse.
  • Is there a matrix that changes nothing? That is the identity.
  • When a matrix acts, what directions stay aligned, stretch, shrink, or collapse? That leads to eigenvectors and eigenvalues.

Geometry gives relationships shape. Coordinates give shape an address. Vectors turn position and meaning into something we can move. Linear algebra studies the machines that move them, and matrix multiplication is how those machines combine. Calculus then studies how all of it changes.

