Chapter 2: We Are All Just Numbers Here...

Welcome to Last Minute Lecture!

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

Picture this.

Dublin, 1843.

A crisp autumn day.

Sir William Rowan Hamilton.

Brilliant guy.

He's walking along the Royal Canal with his wife.

He's been stuck on this math problem, right?

Extending numbers to higher dimensions.

Then crossing Brougham Bridge.

Bam!

It hits him.

The aha moment.

Totally.

And he's so fired up, he apparently whips out a knife and scratches the formula right onto the bridge stone.

I squared equals J squared equals K squared equals IJK equals minus one.

Wild.

Yeah, that's the famous quaternion story.

A whole new number system.

But, you know, while working all that out, Hamilton formalized something absolutely crucial for us today.

The ideas of scalars and vectors.

And that's our focus for this deep dive.

We're cracking open a chapter from Why Machines Learn to really understand these, well, these building blocks: scalars, vectors, how they connect to this thing called a perceptron, and why it all matters for, you know, machines learning about the world.

We're using the source material you shared to guide us.

And it's a fascinating journey.

We'll touch on some history.

See how lists of numbers can capture ideas like force.

See the perceptron in action with this math.

And maybe even why AI research kind of stalled.

Exactly.

The so-called AI winter.

It's all connected.

Okay, let's start super simple.

Scalars.

What are we talking about?

Right.

A scalar fundamentally is just one number.

A single quantity.

Think magnitude.

Like temperature, speed, distance.

If I say a man walks five miles, that five miles, that's the scalar.

Got it.

Just the how much.

No direction involved.

So how's a vector different?

Well, the vector brings in direction.

So instead of just five miles, it's five miles northeasterly.

Now you've got both magnitude, the five miles, and direction northeasterly.

Oh, okay.

Magnitude and direction.

Both pieces.

And we can draw this, right?

Like an arrow.

Absolutely.

That's the classic way.

Imagine our walker starts at the origin.

Zero, zero on a graph.

He goes four miles east, three miles north.

The vector is an arrow to the point four, three.

So the end point four, three gives the components four east, three north.

Precisely.

And the arrow's length.

That's the magnitude.

Which is where Pythagoras comes in.

A squared plus B squared equals C squared.

You got it.

Square root of four squared plus three squared.

That's the square root of 16 plus nine.

The square root of 25, which is five.

There's your five mile magnitude.

And the arrow points northeast-ish, giving the direction.
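
The magnitude calculation just described can be sketched in a few lines of Python. This is a minimal illustration, assuming vectors are stored as plain tuples; the helper name `magnitude` is ours and does not come from the source.

```python
import math

def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

walk = (4, 3)            # four miles east, three miles north
print(magnitude(walk))   # 5.0
```

The same function works unchanged for vectors with more than two components.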

Okay.

Connects the numbers four, three to the idea.

Now was Hamilton like the very first person thinking about direction with quantities?

Not really, no.

The concept was around much earlier.

People worked with it without calling it a vector.

Think Isaac Newton, late 1600s.

Physics, right.

Forces, velocity.

Exactly.

All things with magnitude and direction.

In his Principia, he talks about how combining forces makes something move along the diagonal of a parallelogram.

Which is basically vector addition just drawn out geometrically.

Pretty much.

He understood the principle.

And even before Newton, you had Leibniz in 1679 hinting at representing shapes and movements with numbers.

Kind of ahead of his time.

Gauss, too.

Mapping numbers geometrically.

Hamilton really formalized it.

Gave us the system.

Okay, so deep roots, but Hamilton gave it the structure.

We've got scalars, just numbers, and vectors, numbers with direction.

Like four, three, or that arrow.

Right.

And for machine learning, the key thing is representing these vectors numerically as lists or arrays of numbers.

That's what computers work with.

Like features of something you want to classify.

Maybe a house's features: size, bedrooms, location score.

That's an input vector.

Makes sense.

Computers need the numbers.

So how do we actually, you know, do things with these number lists?

Add them.

Change them.

Yeah.

Vector operations.

Addition is the easiest.

You just add the matching parts, the corresponding components.

Vector (a1, a2) plus vector (b1, b2) is just (a1 + b1, a2 + b2).

So our walker, he walks four, three, then maybe walks another two, six.

His final spot is four plus two, three plus six, which is six, nine.

Simple.

Exactly.

That vector six, nine shows his final position relative to where he started.

His net displacement.

But here's an important distinction.

The total distance he walked is different.

The first walk had magnitude five.

The second, (2, 6), has magnitude square root of two squared plus six squared, which is about 6.32.

So total distance walked is five plus six point three two, roughly eleven point three two miles.

But his final distance from the start, the magnitude of (6, 9), is the square root of six squared plus nine squared, which is?

That's the square root of 117, about 10.82 miles, less than the total distance.

Because the vectors account for direction changes, not just the raw length covered.
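
To make the difference between net displacement and total distance concrete, here is a small Python sketch using the two walks from the example. The helper names `add` and `magnitude` are illustrative assumptions, not anything defined in the source.

```python
import math

def add(a, b):
    """Component-wise vector addition."""
    return tuple(x + y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

first, second = (4, 3), (2, 6)
net = add(first, second)                               # net displacement
total_walked = magnitude(first) + magnitude(second)    # path length
straight_line = magnitude(net)                         # distance from start

print(net, round(total_walked, 2), round(straight_line, 2))
# (6, 9) 11.32 10.82
```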

You can still visualize it like drawing arrows head to tail.

Exactly.

Put the start of the second arrow at the end of the first.

The sum is the arrow from the very start to the very end.

It matches Newton's force parallelogram thing, too.

And a key point, you can slide a vector anywhere in space.

As long as length and direction stay the same, it's the same vector.

OK.

Addition is component by component.

Subtraction, too.

Yep.

Same idea.

Component-wise, A minus B is (a1 − b1, a2 − b2).

If we take (4, 3) and subtract (2, 6), we get (4 − 2, 3 − 6), which is (2, −3).

And a negative component like minus three, what's that mean?

It means it's acting in the opposite direction on that axis.

If plus three north was our rule, then minus three means three south.

So (2, −3) is like two steps east, three steps south.

Got it.

Addition, subtraction.

Can you multiply vectors?

Well, you can multiply a vector by a scalar.

Scalar multiplication.

Take our vector four, three and multiply by say five.

You just hit each component.

Five times (4, 3) equals (5 × 4, 5 × 3), which equals (20, 15).

Geometrically, what's that doing?

Just stretching it.

Exactly.

Stretching it by a factor of five in the same direction.

The original four, three had magnitude of five.

The new (20, 15)?

Its magnitude is the square root of 20 squared plus 15 squared.

That's the square root of 400 plus 225, the square root of 625, which is 25.

Five times bigger.

Makes sense.

And if the scalar was negative?

It flips the direction, stretches and points the opposite way.
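
The subtraction and scaling examples from this stretch of the discussion can be checked with a short Python sketch. The function names `subtract`, `scale`, and `magnitude` are hypothetical helpers chosen for illustration.

```python
import math

def subtract(a, b):
    """Component-wise vector subtraction."""
    return tuple(x - y for x, y in zip(a, b))

def scale(k, v):
    """Multiply every component of v by the scalar k."""
    return tuple(k * c for c in v)

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

print(subtract((4, 3), (2, 6)))      # (2, -3)
print(scale(5, (4, 3)))              # (20, 15)
print(magnitude(scale(5, (4, 3))))   # 25.0
print(scale(-1, (4, 3)))             # (-4, -3)
```

A negative scalar keeps the vector on the same line through the origin but reverses which way it points.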

OK.

And you mentioned unit vectors before.

Right.

Unit vector.

Just a vector whose magnitude, its length, is exactly one.

They're super useful for representing pure direction.

Like signposts.

Sort of, yeah.

The standard ones are i for the x-axis, (1, 0), and j for the y-axis, (0, 1).

You can write any 2D vector using them.

Like four, three is the same as four i plus three j.

Four steps in the i direction, three in the j.

But they don't have to be on the axes.

Nope.

Any vector with length one is a unit vector.

It could point in any direction.
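
One common way to get a unit vector in any direction is to divide a vector by its own magnitude. The sketch below assumes that approach; `normalize` is an illustrative helper name, and the source does not spell out this step.

```python
import math

def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    """Return the unit vector pointing the same way as v."""
    m = magnitude(v)
    if m == 0:
        raise ValueError("the zero vector has no direction")
    return tuple(c / m for c in v)

u = normalize((4, 3))
print(u)   # (0.8, 0.6)

# Rebuilding (4, 3) from the standard unit vectors i and j.
i, j = (1, 0), (0, 1)
rebuilt = tuple(4 * a + 3 * b for a, b in zip(i, j))
print(rebuilt)   # (4, 3)
```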

OK.

So we have scalars, vectors, adding, subtracting, scaling.

What's next?

You mentioned something critical for ML.

The dot product.

This was huge.

The dot product of two vectors A and B, written A · B.

It gives you back just a single scalar number.

OK.

Combines two vectors into one number.

How?

Geometrically?

Geometrically, one way to think of it is magnitude times projection.

Take the magnitude of vector A, multiply it by the length of the shadow that vector B casts onto A.

Projection.

Like a shadow.

Yeah.

The formula is A · B equals |A| |B| cos theta.

Magnitudes multiplied times the cosine of the angle between them.

OK.

Magnitude times magnitude times cosine theta.

Let's try that example.

Vector A is (4, 0), just on the x-axis.

Vector B is (5, 5).

Angle between them is 45 degrees.

Right.

Magnitude of A is four.

Magnitude of B is the square root of five squared plus five squared, which is five times root two.

Five times root two.

Cosine of 45 degrees is one over root two.

So multiply it all.

Four times five root two times one over root two.

The root twos cancel out.

Leaving four times five, which equals 20.

The dot product is 20.
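
The same arithmetic can be run numerically. This sketch plugs the example's magnitudes and the 45 degree angle into the magnitude-times-magnitude-times-cosine formula; the variable names are ours.

```python
import math

a_mag = 4.0                      # |A| for A = (4, 0)
b_mag = math.hypot(5, 5)         # |B| for B = (5, 5), i.e. 5 * sqrt(2)
theta = math.radians(45)         # angle between A and B

dot_from_angle = a_mag * b_mag * math.cos(theta)
print(round(dot_from_angle, 6))  # 20.0
```

The rounding only hides floating-point noise from the square roots; the exact value is 20.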

OK, 20.

How does that connect back to the shadow idea, the projection?

This is where the unit vector idea clicks again.

If one of the vectors, say A, is a unit vector, its magnitude, |A|, is one.

Right.

So the formula becomes A · B equals one times |B| cos theta.

Just |B| cos theta.

Yeah.

And that quantity, |B| cos theta, is exactly the mathematical definition of the length of the projection of B onto the direction of A.

Whoa, hang on.

If A is a unit vector, the dot product A · B is the length of B's shadow on A directly.

Yep.

Pretty neat, huh?

Like if A is the x-axis unit vector, (1, 0), and B is (3, 3), the shadow of (3, 3) on the x-axis is just three units long.

Its x component.

And the dot product, (1, 0) · (3, 3), is 1 × 3 plus 0 × 3, which equals three.

Wow.

It works.

Even if the unit vector isn't on an axis.

Take a unit vector pointing at 45 degrees.

That's (one over root two, one over root two).

Dot that with, say, (1, 3).

The result tells you how much of (1, 3) points along that 45 degree line.
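
Both projection examples can be reproduced with a component-wise dot product. This is a sketch under the assumption that the first vector passed in is already a unit vector; `dot` is an illustrative helper name.

```python
import math

def dot(a, b):
    """Multiply matching components and sum the results."""
    return sum(x * y for x, y in zip(a, b))

x_axis = (1, 0)
print(dot(x_axis, (3, 3)))       # 3, the x component of (3, 3)

diag = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit vector at 45 degrees
shadow = dot(diag, (1, 3))       # length of (1, 3)'s projection on that line
print(round(shadow, 3))          # 2.828
```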

Okay.

Projection is a really core meaning here, especially with unit vectors.

What else is important about the dot product?

Orthogonality?

Super critical.

If the dot product of two non-zero vectors is zero, A · B equals zero, it means they're orthogonal.

Yeah.

Perpendicular.

At right angles.

Because of the formula.

|A| |B| cos theta is zero.

If the magnitudes aren't zero, the only way to get zero is if cos theta is zero.

And cos theta is zero when theta is 90 degrees.

Ah, right.

So zero dot product means perpendicular vectors.
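
A quick Python check of that orthogonality test. The vectors and the small tolerance `tol` are assumptions for illustration; a tolerance is a common precaution when components are floating-point numbers.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_orthogonal(a, b, tol=1e-12):
    """True when the dot product is zero, within a small tolerance."""
    return abs(dot(a, b)) <= tol

print(is_orthogonal((1, 0), (0, 1)))    # True: the x and y unit vectors
print(is_orthogonal((4, 3), (-3, 4)))   # True: -12 + 12 = 0
print(is_orthogonal((4, 3), (2, 6)))    # False: 8 + 18 = 26
```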

That sounds useful for boundaries, definitions.

Exactly.

It's fundamental for defining things like separating planes in machine learning.

But how do we actually calculate it without knowing the angle?

You mentioned doing it with the components before.

Yes.

And that's the computational workhorse.

If A is (a1, a2, ..., an) and B is (b1, b2, ..., bn), then A · B is just a1 b1 + a2 b2 + ... + an bn.

Multiply corresponding components, then sum them all up.

Okay, let's check.

(4, 0) · (5, 5) is 4 × 5 plus 0 × 5, which equals 20.

Matches.

Yeah.

(1, 0) · (3, 3) is 1 × 3 plus 0 × 3, which equals three.

Matches.

(One over root two, one over root two) · (1, 3).

One over root two plus three over root two, which is four over root two.

Matches, too.

See, this numerical way is equivalent.

Much easier for computers.

And it's basically the calculation inside models like the perceptron.
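
To see the two forms agree, the sketch below computes the dot product once from components and once from magnitudes and the angle. It is limited to 2D vectors, and measuring the angle with `math.atan2` is our choice for the illustration rather than anything the source prescribes.

```python
import math

def dot(a, b):
    """Component form: multiply matching components, then sum."""
    return sum(x * y for x, y in zip(a, b))

def dot_geometric(a, b):
    """Geometric form for 2D vectors: |a| |b| cos(theta)."""
    theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
    return math.hypot(*a) * math.hypot(*b) * math.cos(theta)

a, b = (4, 0), (5, 5)
print(dot(a, b))                       # 20
print(round(dot_geometric(a, b), 6))   # 20.0
```

The component form needs no trigonometry and extends to any number of dimensions, which is why it is the version computers actually run.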

Okay, now we're connecting it.

In Why Machines Learn, how do vectors and this dot product relate to the perceptron model?

They are the perceptron's language, basically.

Remember its core calculation.

Inputs times weights, sum them up, add a bias, check if positive or negative.

Yeah, w1 x1 + w2 x2 + ... + wn xn + b.

The weighted sum.

Right.

Now think of the inputs x1, x2, ..., xn as the input vector X and the weights w1, w2, ..., wn as the weight vector W.

That weighted sum part.

w1 x1 + ... + wn xn.

That's exactly the dot product W · X.

Wait, really?

The main number crunching in the perceptron is just a dot product, W · X.

Fundamentally, yeah. The decision rule is essentially W · X + b > 0 for one class and W · X + b ≤ 0 for the other, often written W^T X + b in matrix notation, assuming column vectors.
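
That decision rule fits in a few lines of Python. The weights, bias, and input points below are made-up values for illustration, and the labels 1 and −1 follow the convention used later in the discussion.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict(w, b, x):
    """Return 1 if w . x + b is positive, otherwise -1."""
    return 1 if dot(w, x) + b > 0 else -1

w, b = (1.0, 1.0), -3.0             # hypothetical weights and bias
print(predict(w, b, (2.0, 2.0)))    # 1:  2 + 2 - 3 = 1 is positive
print(predict(w, b, (1.0, 1.0)))    # -1: 1 + 1 - 3 = -1 is not
```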

Okay.

And the geometry, the projections, the perpendicular thing, how does that fit?

That's the really cool insight.

The weight vector W isn't just arbitrary numbers.

Geometrically, it defines the orientation or tilt of the decision boundary, the line or plane that separates the classes.

Exactly.

And here's the key.

The weight vector W is orthogonal, perpendicular, to that separating hyperplane.

Orthogonal again.

So W points straight out from the separating plane.

If you change W, you change the tilt of the plane.

Precisely.

The whole goal of training the perceptron is to find a weight vector W whose perpendicular hyperplane correctly splits the data points.

And the dot product W · X, that tells you which side of the plane a data point X is on.

Yes.

The value of W · X is related to the distance of the point X from the plane defined by W.

More importantly, its sign, plus or minus, tells you which side it's on.

Positive means one side.

Negative means the other.

Zero means it's right on the plane and the perceptron uses that sign to make its prediction.
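
Both geometric claims, that W is perpendicular to the boundary and that the sign of W · X picks the side, can be checked on a small 2D case. The weight vector and points here are hypothetical, and the boundary passes through the origin because no bias is included.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = (2.0, 1.0)                       # hypothetical weights; boundary is w . x = 0
p, q = (1.0, -2.0), (-1.0, 2.0)      # two points that lie on the boundary
along = tuple(a - b for a, b in zip(p, q))   # a direction along the boundary
print(dot(w, along))                 # 0.0, so w is perpendicular to it

def side(w, x):
    """1 or -1 for the two sides of the boundary, 0 for a point on it."""
    s = dot(w, x)
    return 0 if s == 0 else (1 if s > 0 else -1)

print(side(w, (3.0, 1.0)))     # 1
print(side(w, (-3.0, -1.0)))   # -1
print(side(w, p))              # 0
```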

Wow.

Okay.

What about the b, the bias term? That just shifts the plane, right?

Doesn't tilt it.

Correct.

The bias shifts the plane parallel to itself away from the origin.

And there's a handy math trick to handle it.

You add an extra input component X zero.

It's always one.

A constant input of one.

Yeah.

And you add a corresponding weight W zero, which is just equal to the bias B.

So your input vector becomes like (1, x1, x2, ...) and your weight vector becomes (b, w1, w2, ...).

Augmented vectors.

Got it.

Why do that?

Because now the original calculation w1 x1 + ... + wn xn + b becomes just the dot product of these new augmented vectors: b × 1 + w1 x1 + w2 x2 + ... + wn xn.

So the whole condition w1 x1 + ... + wn xn + b > 0 becomes just W-augmented · X-augmented > 0.

That is elegant.

Wraps the bias right into the main dot product calculation.
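
A short sketch of the bias trick, using made-up numbers: prepend a constant 1 to the input and the bias to the weights, and a single dot product reproduces the weighted sum plus bias.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w, b = (2.0, -1.0), 0.5      # hypothetical weights and bias
x = (3.0, 4.0)               # hypothetical input

w_aug = (b,) + w             # (b, w1, w2)
x_aug = (1.0,) + x           # (1, x1, x2)

print(dot(w, x) + b)         # 2.5
print(dot(w_aug, x_aug))     # 2.5
```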

So the perceptron learns this vector W, including the bias now, that defines the separating hyperplane.

How does it actually learn it?

Through the perceptron learning algorithm.

Let's use that pandemic patient example from the source.

We have patient data, age, BMI, fever, et cetera.

Maybe six features.

That's our 6D input vector X.

We want to classify them.

Needs ventilation, y = 1, or not, y = −1.

The goal is to find a 6D weight vector W.

Well, 7D with the bias trick, that defines a 5D hyperplane in the 6D space to separate the y = 1 patients from the y = −1 patients.

A 5D plane in 6D space.

Hard to picture, but OK.

What's the algorithm step by step?

It's surprisingly simple.

One, initialize your weight vector W to all zeros or maybe small random numbers.

Two, then you loop through your training data, taking one patient input X, correct label Y at a time.

Three, for that patient, you check if your current perceptron, using its current W, gets the classification right.

The perceptron predicts 1 if W^T X > 0, using augmented vectors now, and −1 otherwise.

A quick check for a mistake is: is y W^T X ≤ 0? Because if y is 1, we need W^T X positive, so y W^T X > 0.

If y is −1, we need W^T X negative, and −1 times a negative number is again positive, so y W^T X > 0.

If the product y W^T X is zero or negative, it's wrong.

OK, check if y W^T X ≤ 0.

If it is, it's a mistake.

What then?

If it is a mistake, you update the weights.

The rule is W_new = W_old + y X.

You add the input vector X scaled by the correct label Y to the old weights.

W_new equals W_old plus y X.

Hmm.

You do this check and update for every single patient in your training set.

That's one pass or epoch.

If you go through the entire data set and make no updates, congratulations, your W works for all points.

You stop.

But if you did make updates?

You just start over from the beginning of the data set and do another pass.

Keep repeating passes until you have one full pass with zero mistakes.
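
The steps above can be sketched as a short Python function. This is an illustration rather than the source's own code: the toy data set is invented, inputs are assumed to be already augmented with a leading 1, and the `max_epochs` cap is our addition so the loop always halts.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_perceptron(samples, max_epochs=1000):
    """samples: list of (x_aug, y), with x_aug[0] == 1 and y in {1, -1}.

    Returns (w, converged). converged is True once a full pass
    over the data produces no updates.
    """
    w = [0.0] * len(samples[0][0])            # step 1: start from zeros
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:                  # step 2: one example at a time
            if y * dot(w, x) <= 0:            # step 3: mistake check
                w = [wi + y * xi for wi, xi in zip(w, x)]   # update rule
                mistakes += 1
        if mistakes == 0:                     # a clean pass: stop
            return w, True
    return w, False

# Hypothetical, linearly separable toy data: (1, x1, x2) with label y.
data = [((1, 2.0, 3.0), 1), ((1, 3.0, 3.0), 1),
        ((1, -1.0, -2.0), -1), ((1, -2.0, -1.0), -1)]
w, converged = train_perceptron(data)
print(w, converged)   # [1.0, 2.0, 3.0] True
```

On this toy set a single update is enough: the first example is misclassified by the all-zero weights, and the resulting weights classify every point correctly on the next pass.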

OK, it cycles through nudging the weights whenever it messes up.

Why does that specific update, adding YX, actually help?

Let's think about it.

Suppose you misclassified point X.

The condition was y W^T X ≤ 0.

You want to make W^T X have the same sign as y.

Right.

Now look at the dot product with the new weight vector.

W_new · X equals (W_old + y X) · X.

Using dot product rules, that expands to W_old · X plus y times X · X.

X · X.

That's the magnitude of X squared, right?

Always positive unless X is zero.

Exactly.

So the new dot product is the old one plus Y times a positive number.

If y was 1, meaning W_old · X ≤ 0, you've added a positive value, pushing W^T X upwards towards being positive.

If y was −1, meaning W_old · X ≥ 0, you've added minus one times a positive value.

So you've subtracted a positive value, pushing W^T X downwards towards being negative.

So in either case, the update nudges the dot product W^T X in the direction it should have gone for that specific misclassified point X.

Precisely.

It adjusts W to better classify that point.
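
The effect of one update can be checked numerically. The weights and the point below are invented for illustration; in both cases the dot product moves by exactly y times the squared length of X.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def update(w, x, y):
    """Perceptron update: add the input scaled by its correct label."""
    return tuple(wi + y * xi for wi, xi in zip(w, x))

x = (1.0, 2.0)                        # x . x = 5
cases = [((1.0, -2.0), 1),            # y = 1 but w . x = -3: a mistake
         ((3.0, 1.0), -1)]            # y = -1 but w . x = 5: a mistake
for w_old, y in cases:
    before = dot(w_old, x)
    after = dot(update(w_old, x, y), x)
    print(y, before, after)
# 1 -3.0 2.0
# -1 5.0 0.0
```

In the second case the result lands on zero, so the point would still count as a mistake on the next check. One update moves the dot product in the right direction but does not always finish the job, which is part of why the algorithm keeps cycling through the data.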

Now that adjustment might mess up another point it got right before, which is why you need to iterate through the data multiple times.

OK, that makes intuitive sense.

But the huge question then must have been, is this guaranteed to actually stop?

If a perfect separating line or plane exists, will this simple nudging process definitely find it, or could it just bounce around forever?

That was the crucial question.

Is convergence guaranteed?

And the amazing answer, proven back in the 60s by Block and later rigorously analyzed by Minsky and Papert, is yes.

If the data is linearly separable, the perceptron learning algorithm will find a separating hyperplane in a finite number of steps.

OK, how do they prove that?

What's the gist?

The core idea of the proof is pretty clever.

It tracks how the current weight vector W evolves.

Specifically, it compares W to a hypothetical correct weight vector, W*.

Just imagine one possible perfect solution exists, call it W*.

The proof watches two things as updates happen.

First, how aligned W gets with W*.

This is measured by their dot product, W · W*.

Second, the squared length of W itself, which is W · W.

So the alignment with the correct answer and the length of our current guess.

Right.

The proof demonstrates that every time an update happens, because of a mistake, the alignment term W · W* increases by at least a certain fixed positive amount.

This minimum increase amount, often called gamma, is related to how clearly separated the data is, the margin.

OK, so W · W* gets strictly better with each mistake.

Yes.

Meanwhile, the squared length W · W can also grow with each update, but the proof shows it grows by at most a certain amount, related to the size of the input vectors.

Crucially, the guaranteed increase in alignment, W · W*, is shown to sort of outpace the increase in length, W · W, over the updates.

So the alignment improves more significantly relative to the growth in length.

And because it makes guaranteed progress towards alignment with every single mistake, it must reach a state where it correctly classifies everything, assuming a solution exists.

The proof actually gives an upper bound on the number of updates needed.
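
For readers who want to see where that bound comes from, here is the argument in its usual textbook form. The normalization choices are assumptions that the discussion does not state explicitly: W* is taken to be a unit vector that classifies every point with margin at least gamma, every input has length at most R, the weights start at zero, and k counts the updates made so far.

```latex
% Assumptions (standard form, not taken verbatim from the source):
%   \|w^*\| = 1, \quad y_i\,(w^* \cdot x_i) \ge \gamma > 0, \quad
%   \|x_i\| \le R \ \text{for all } i, \quad w_0 = 0.
% Each update is w_k = w_{k-1} + y_i x_i for a point with
% y_i\,(w_{k-1} \cdot x_i) \le 0.
\begin{align*}
w_k \cdot w^* &= w_{k-1} \cdot w^* + y_i\,(x_i \cdot w^*)
  \;\ge\; w_{k-1} \cdot w^* + \gamma
  &&\Rightarrow\quad w_k \cdot w^* \ge k\gamma, \\
\|w_k\|^2 &= \|w_{k-1}\|^2 + 2\,y_i\,(w_{k-1} \cdot x_i) + \|x_i\|^2
  \;\le\; \|w_{k-1}\|^2 + R^2
  &&\Rightarrow\quad \|w_k\|^2 \le k R^2, \\
k\gamma &\le w_k \cdot w^* \;\le\; \|w_k\|\,\|w^*\| \;\le\; \sqrt{k}\,R
  &&\Rightarrow\quad k \le \frac{R^2}{\gamma^2}.
\end{align*}
```

The alignment grows at least linearly in the number of updates while the length grows at most like its square root, so the two bounds can only coexist for a finite number of updates.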

That's really powerful.

Not just that it works, but that it's guaranteed to finish, and relatively efficiently, a solid foundation.

Absolutely.

It was a huge theoretical win for early AI.

This simple algorithm, grounded in vector math, could provably learn to separate data.

If the data was linearly separable.

Which brings us to the catch.

Exactly.

The AI winter.

Minsky and Papert's 1969 book Perceptrons laid out the convergence proof, which was great, but it also famously highlighted the model's Achilles heel.

The fact that a single layer perceptron, the kind we've been discussing, can only solve problems where that linear separation is possible.

If the data categories can't be divided by a single flat hyperplane, the algorithm we described won't converge.

It literally cannot find a solution because none exists in that simple linear form.

And many real problems aren't that clean cut.

Right.

The classic killer example they used was the XOR problem.

Exclusive OR.

Okay.

Think of four points on a 2D graph.

(0, 0) and (1, 1) are in class A; (0, 1) and (1, 0) are in class B.

Like the XOR logic gate: output is one, class B, if the inputs are different; zero, class A, if they're the same.

Got it.

Corners (0, 0) and (1, 1) are one class, (0, 1) and (1, 0) the other.

Now try drawing one single straight line that perfectly separates the class A points from the class B points.

You can't.

Any line you draw will cut through both classes.

It's impossible with one line.
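
Running the perceptron learning rule on XOR shows the failure in practice. In this sketch the epoch cap, the label encoding (class A as −1, class B as 1), and the AND data used for contrast are all our assumptions. Hitting the cap is consistent with XOR not being linearly separable, but it is an illustration, not a proof.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_perceptron(samples, max_epochs=100):
    """Same update rule as described earlier; returns (w, converged)."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False

# Inputs are augmented with a leading 1 for the bias.
xor_data = [((1, 0, 0), -1), ((1, 1, 1), -1),    # class A: inputs the same
            ((1, 0, 1), 1), ((1, 1, 0), 1)]      # class B: inputs differ
and_data = [((1, 0, 0), -1), ((1, 0, 1), -1),
            ((1, 1, 0), -1), ((1, 1, 1), 1)]     # AND is linearly separable

_, xor_ok = train_perceptron(xor_data)
_, and_ok = train_perceptron(and_data)
print(xor_ok, and_ok)   # False True
```

Raising `max_epochs` does not help with XOR: every pass contains at least one mistake, so the weights keep changing without ever settling.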

Exactly.

Minsky and Papert proved mathematically that because XOR isn't linearly separable, a single layer perceptron is fundamentally incapable of learning it.

And that reveal had a big impact.

Huge.

It significantly dampened the initial optimism.

People realized this simple, elegant model had very real limits.

Now they knew even then that you could theoretically solve problems like XOR by stacking perceptrons, making multi-layer networks.

Okay, more layers, more power.

Right.

But the problem was, nobody had a good algorithm to actually train the weights in those hidden middle layers.

The perceptron update rule only worked for the output layer where you knew the target label directly.

How do you adjust weights in layers that aren't directly connected to the final output?

Ah, the credit assignment problem.

How do you know which weights in the middle layers were responsible for the final error?

Precisely.

That lack of a training method for multi-layer networks, combined with the clear limitations of single-layer perceptrons shown by XOR, led to a major pullback in AI funding and research.

People got disillusioned.

That period, roughly mid-70s to early 80s, became known as the first AI winter.

Sir James Lighthill's report in the UK around that time specifically mentioned the past disappointments in AI.

Wow, quite a pendulum swing from the initial excitement.

It really was.

But crucially, the research didn't die out completely.

People kept working on the multi-layer problem.

And eventually cracked it, leading to backpropagation.

Exactly.

The development and popularization of the backpropagation algorithm, especially in the mid-80s by folks like Rumelhart, Hinton, and Williams, was the breakthrough.

It provided a way, using calculus essentially, to figure out how errors at the output should influence the weights in all the preceding layers, even deep ones.

So backprop finally unlocked the potential of those multi-layer networks that could handle non-linear problems like XOR.

Yes.

It built directly on the vector math foundations of the perceptron, but added the mechanism needed for deeper learning.

That, combined with later increases in computing power and data availability, paved the way for the deep learning revolution we see today.

What a journey.

So we went from Hamilton's bridge inspiration to understanding scalars and vectors, how they're manipulated, the power and meaning of the dot product.

Seeing how that dot product is the calculation engine of the perceptron, defining that separating hyperplane.

Appreciating the elegance of the learning rule.

And the proof that it works for linearly separable data.

Only to hit the wall with non-linear problems like XOR, leading to the AI winter.

Which then set the stage for the next big leap with backpropagation and multi-layer networks.

It all builds.

Absolutely.

These foundational concepts: representing things as vectors, using dot products for comparison or projection, thinking geometrically about separation.

They might seem simple with the basic perceptron, but they're absolutely the core ideas, scaled up, in today's incredibly complex AI models.

It really makes you appreciate how progress happens.

The limitations of one model pushed the field to find new math, new algorithms, building on what came before.

The vector math wasn't wrong.

The architecture needed to evolve.

Precisely.

And it leaves you wondering, doesn't it?

If problems like image recognition or language translation can ultimately be framed as finding complex, separating surfaces in these vast high-dimensional vector spaces, what does that imply about the nature of information itself?

Is the world at some fundamental level reducible to these kinds of geometric relationships?

And, you know, our own brains may be doing something analogous, some form of massively complex vector processing we don't even understand yet.

It's quite a thought.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary: What this audio overview covers

Vector mathematics forms the computational backbone of modern machine learning, enabling algorithms to process and classify data through elegant geometric principles rooted in nineteenth-century mathematics. The chapter traces this intellectual journey from William Rowan Hamilton's quaternion discovery through contemporary neural network architectures, demonstrating how classical mathematical structures became indispensable to artificial intelligence. Foundational concepts like scalars, vectors, and unit vectors provide the vocabulary for encoding data as points in high-dimensional spaces, while fundamental operations including addition, subtraction, and scalar multiplication allow meaningful transformations of this data. The dot product emerges as a particularly powerful operation, offering both computational efficiency and profound geometric insight into how different data vectors relate to one another across multiple dimensions. Building on this foundation, the perceptron algorithm translates geometric intuition into a practical learning mechanism by using the dot product to identify which side of a hyperplane any given data point occupies, thereby solving binary classification problems through iterative weight updates. Matrix notation compresses vectors into organized structures that streamline both theoretical analysis and computational implementation, making learning algorithms tractable at scale. The perceptron convergence proof provides a theoretical guarantee that when data admits a linear solution, the algorithm will find it within a bounded number of iterations, lending mathematical rigor to the concept of machine learning itself. Yet the chapter critically examines the XOR problem, revealing that single-layer perceptrons cannot solve all classification tasks because they are fundamentally limited to finding linear decision boundaries. This limitation prompted development of multi-layer architectures capable of learning nonlinear boundaries through backpropagation, fundamentally reshaping how researchers approached the problem of teaching machines to recognize complex patterns. Together, these mathematical insights and algorithmic innovations illustrate how geometry and algebra combine to enable intelligent systems.
