Chapter 10: The Algorithm That Put Paid to a Persistent Myth
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome to the Deep Dive.
Today, we're really getting into the weeds of modern AI.
We're exploring the core math that actually makes machines learn.
That's right.
And we're basing this whole discussion on just one source, chapter 10 of why machines learn, the elegant math behind modern AI.
It's, well, it's quite a chapter.
It really is.
It covers the history, the math, the whole story behind a really foundational algorithm.
Absolutely.
So our mission for you, the learner, is to break this down.
We want to give you a clear, you know, engaging understanding of this key algorithm that drives so much of the AI we see around us.
We're going to look at the history, the math ideas, those crucial aha moments.
Exactly.
But without drowning you in super technical jargon, we want you to grasp the core logic, why it was such a big deal.
Right.
How it actually works, fundamentally.
So let's start at the beginning.
The chapter kicks off by addressing this
bit of AI folklore, really.
The idea that Minsky and Papert basically killed off neural network research back in the late 60s.
Yeah, by supposedly proving that simple networks, single -layer ones, couldn't even handle a basic problem like XOR.
And this is super important context because it shaped, well, everything that came after.
But Jeffrey Hinton, who's, you know, a giant in deep learning, he had a very different view on what Minsky and Papert actually showed, especially about XOR.
He did.
Apparently he was interested in neural nets way back in high school, even before their book came out.
Yeah.
And he felt their conclusions about XOR were being, let's say, over -interpreted.
He didn't think it meant neural nets were a dead end at all.
So what was his journey?
How did he get to that point?
It wasn't exactly straightforward, was it?
Not at all.
It's fascinating.
Early on, he was interested in how the brain stores memory, sparked by a friend talking about holograms.
Holograms.
Like the idea that memory isn't in just one spot.
Exactly.
That idea stuck with him.
Then he looked into physics, physiology,
hoping for answers about the brain.
But they didn't quite cut it.
No, he felt they focused too much on the biological nuts and bolts, not the actual process of learning or intelligence.
So he tried philosophy.
Right.
Then psychology.
Right.
Philosophy, then experimental psychology.
But still something was missing.
He wanted to build something, you know, to actually see if the ideas worked.
And then he read Hebb, Donald Hebb's book.
Ah, yes.
The Organization of Behavior.
That was hugely influential.
Hebb's ideas about connection strengthening.
That really clicked with Hinton's thinking.
Okay.
So that led him to AI.
Specifically, Edinburgh.
Yes.
He joined the AI school there under Christopher Longat Higgins.
And here's a twist.
Longat Higgins had actually been interested in connectionism and holograms himself earlier.
Oh, really?
But by the time Hinton got there, Longat Higgins had shifted towards symbolic AI, you know, logic, rules, manipulating symbols.
So he tried to steer Hinton away from neural nets.
He did.
But Hinton was skeptical.
He felt intelligence was more than just logic.
He was determined to stick with neural networks, even had to negotiate to keep working on them for his PhD.
Wow.
That took some conviction.
It really highlights how sometimes the big ideas come from pushing against the grain, doesn't it?
Definitely.
Now back to Minsky and Papert for a sec.
The chapter is clear.
Their mathematical proof about single -layer perceptrons.
It was significant, right, within its scope?
Oh, absolutely.
It wasn't trivial.
They mathematically showed the limits of those simple networks, particularly for problems like XOR.
That part was solid.
But Hinton's point, the con job aspect, as he put it.
Right.
His argument was their proof only applied to those single -layer networks.
It said nothing definitive against more complex, multi -layer ones.
Yet the field largely just stopped.
Abandoned the whole approach based on that limited proof.
Pretty much.
It's a kind of cautionary tale, isn't it?
How interpretations can sometimes stifle promising research.
And it turns out Hinton wasn't the only one thinking beyond single layers.
What about Rosenblatt, the perceptron guy?
Exactly.
The chapter brings up Frank Rosenblatt.
Apparently his student George Nagy remembered Rosenblatt being very aware of the difficulty of training multi -layer networks.
And he actually wrote about it back in 61.
He did.
In his book, Principles of Neurodynamics, there's literally a section called back -propagating error correction procedures.
No way.
So he had the basic idea.
He had the concept, yeah.
He described a three -layer network input, hidden output, and the idea of sending corrections backward if the first attempt wasn't right.
So feeding the error back through.
But the crucial piece missing was how exactly to train those connections going into the hidden layer.
The math wasn't quite nailed down for that part.
Okay, so the intuition was there, but not the precise mechanism.
Right.
The existing perceptron training only really worked for the final layers weights.
Rosenblatt knew the earlier layers were key, but optimizing them was the challenge.
And there was another problem he spotted too, right?
The symmetry thing.
Yes, the symmetry problem.
He realized if you start all the connection weights at the same value, like zero.
All the hidden neurons just learn the same thing.
Exactly.
They all get the same input, do the same calculation, produce the same output, makes the whole layer kind of useless for learning complex patterns.
So how did he suggest fixing that?
Well, his analysis showed this was a problem for deterministic updates.
So he suggested introducing some randomness, a stochastic element, into how the weights were updated.
Break the symmetry that way.
Okay.
And Hinton initially thought that meant the neurons themselves should be random, stochastic.
Yeah, that was his first interpretation, which, you know, seemed logical based on Rosenblatt's suggestion, but it wasn't actually the most effective way forward.
It kind of sidetracked him for a bit.
But the chapter hints that the real solution, the one Rosenblatt maybe vaguely touched on, came later with Rumelhart.
Precisely.
That collaboration with David Rumelhart and Ronald Williams too, that's where the modern backpropagation algorithm really took shape.
But getting there wasn't easy for Hinton.
After his PhD in 77,
finding support was tough.
Incredibly tough.
In the UK at that time, there was virtually no support for neural network research.
He actually left academia for a while, taught maths at a free school.
It's hard to imagine now, given how central his work became.
It really shows the climate then.
He struggled to get interviews, got rejected by Sussex.
It wasn't until someone suggested sending his thesis to the U .S.
that things changed.
And that led him to Rumelhart at UC San Diego.
Yes, and the atmosphere there was completely different.
More open, perhaps.
Rumelhart was already deeply interested in connectionism.
It was a perfect match.
So early 80s, they start really focusing on this problem.
How do you actually train these multi -layer networks?
That becomes the central question.
Conceptually, you know what you want.
Input goes in, output comes out, you calculate the error.
And then somehow, somehow adjust all the connections, including the hidden ones, to reduce that error.
And the tool for that adjustment is gradient descent, which we've talked about before.
Right, gradient descent.
Finding the lowest point in the error landscape by taking steps downhill.
But the landscape for these multi -layer networks isn't a simple bowl shape anymore, is it?
Not at all.
It's non -convex, full of hills, valleys, plateaus,
lots of places, local minima, where you can get stuck thinking you've found the bottom when there's a much deeper valley elsewhere.
Like finding a parking spot that looks okay, but there's a better one just around the corner you can't see.
Good analogy.
And interestingly, Minsky himself, with Oliver Selfridge, had written about similar problems way back, this idea of hill climbing and getting stuck on plateaus, the Mesa Phenomenon.
So maybe Minsky and paper were just being realistic about how hard optimization would be for complex networks.
That's the charitable view.
That's one interpretation, yes.
They recognize the difficulty.
But the chapter also mentions the less charitable view.
Right.
Perhaps they were intentionally trying to push their own symbolic AI agenda by downplaying connectionism and Rosenblatt's work on more complex machines.
The Dreyfus brothers certainly suggested that.
Hmm.
Regardless of the motives, the core ideas needed for a backprop didn't just disappear.
There were earlier hints in other fields.
Absolutely.
Control theory, aeronautics.
People like Kelly and Bryson in the early 60s developed methods for optimal control that used similar principles.
And Stuart Dreyfus built on that, using the chain rule from calculus.
Yes.
And Schmidhuber's historical work points to others like Amari and Linon -Muna in the late 60s, early 70s.
Lin and Yuma even developed code for efficient reverse differentiation.
And then there was Paul Werbos' PhD thesis in 74.
Right.
Werbos came incredibly close to the modern algorithm in its general form.
But for whatever reason, it didn't really catch on in the ML community back then.
So despite all these earlier glimmers, it really took Rumelhart, Hinton, and Williams in the early 80s to put it all together in a way that stuck.
Exactly.
Their formulation and, crucially, their successful implementation and demonstration, that's what really launched the deep learning era.
OK.
So to really get what they did, we need to dig into the math a bit, starting with that delta rule again.
Right.
Let's go back to a single neuron first.
It helps build the foundation.
Remember the delta rule or Widerhoff.
Yeah, for adjusting the weights of one neuron.
So simple neuron, input x, weight w, bias b, output y equal wx plus b.
If we wanted to learn a line fitting some data points, x, y, that's linear regression.
The delta rule finds the best w and b.
You start with guesses, calculate the neuron's prediction high hat, find the error y hat, and usually quantify it with square loss.
And minimizing that loss is like finding the bottom of a bowl shape where the axis of w and b.
Exactly.
Gradient descent is how we walk down that bowl.
We calculate the slope, the gradient, which tells us the steepest direction.
And the gradient involves calculus, right?
Partial derivatives.
Correct.
We need the partial derivative of the loss with respect to w and with respect to b.
This tells us how the loss changes if we nudge w or b slightly.
And the chain rule is important here.
Crucial.
Especially as things get more complex.
Remember, if y depends on z and z depends on x, the chain rule links their rates of change.
Deadly x, diddy x.
Okay, descent by two example.
Got it.
So we apply the chain rule to our loss function l, y, w, x, b2 to find angle error and l the arrow.
That gives us the gradient vector.
And the delta rule uses that gradient to update w and b.
Yep.
You adjust w by subtracting a small amount proportional to l e o and adjust b by subtracting an amount proportional to l o.
That proportion is set by the learning rate alpha.
You just keep doing that iteratively over and over with the data.
Exactly.
Until the loss is minimized and the neuron has learned the best line it can.
Okay, that's for one input.
What about multiple inputs?
Like for classification spam or not spam?
Good question.
If you have inputs by one by two, the neuron calculates y hat equals w1 by one plus w2 by two plus b.
The loss and gradient are similar just with more terms l e o w1 l w2 plus b.
And the goal there is to find a line or a plane in higher dimensions that separates the different classes of data.
Precisely, a linearly separating hyperplane.
But, and here we are back at XOR.
What if the data can't be separated by a single straight line?
Then a single neuron, no matter how many inputs, just can't do it.
It hits its limit.
Which brings us neatly to the need for a touch of non -linearity.
Exactly.
Problems like XOR demand more than straight lines.
If you plot XOR, you see you need a curvy boundary, not a straight one.
So one neuron gives one line.
How do we get curves?
You need multiple neurons, arranged in layers.
Maybe two neurons in the first layer find two different lines.
Okay, but how do you combine those lines?
Just adding them up is still linear.
Yeah, you've hit the key point.
The next layer needs to combine the outputs from the first layer non -linearly.
And that's where activation functions come in.
That's it.
Remember, a neuron calculates z equals w dot x plus b, and then the output is y equals az, where a is the activation function.
If az is z, it's linear, and the whole network stays linear.
Right.
Early ideas used a threshold function output zero or one.
That's non -linear, but...
It's not differentiable.
The slope is zero or suddenly infinite.
Exactly.
And that messes up gradient descent.
You can't smoothly calculate how changes affect the output at that threshold point.
So we need something smooth but still non -linear.
Enter the sigmoid function.
It's continuous, differentiable, squashes the output between zero and one.
Like a soft, smoothed -out threshold.
And crucially, its derivative is easy to calculate.
Very conveniently, yes.
Its derivative can be written in terms of the sigmoid function itself, which simplifies the math for backpropagation later.
Okay.
So using sigmoid, we can build a network for XOR, like input layer, hidden layer, output layer.
Yes.
A simple one might have two input neurons, maybe two hidden neurons, using sigmoid, and one output neuron, also using sigmoid.
The math gets more complex now with weights and biases for each layer.
Right.
You have weights connecting input to hidden, let's call them W1, biases for hidden B1, weights connecting hidden to output W2, bias for output B2.
You calculate the hidden activations, A1, then the final output based on those.
And to train that, you need the derivative of the overall loss with respect to every single W and B in the whole network.
Exactly.
And doing that manually by deriving all the equations, it just explodes in complexity as networks get bigger.
It's not sustainable.
You need a systematic way to calculate all those gradients.
Precisely.
And that systematic way, that clever application of the chain rule, is backpropagation, brought to the forefront by Werbos, and then Rumelhart, Hinton, and Williams.
Okay.
Let's unpack the backpropagation algorithm itself.
The chapter uses a simple example first.
One hidden neuron.
Right.
Simplest case.
One input, one hidden neuron, one output neuron.
Let's trace it.
Input X goes in.
Calculate Z1, input to hidden.
Input to hidden equals W1, X plus B1.
Calculate A1, hidden output.
Sigmoid Z1.
Then that A1 becomes the input for the next layer.
Yes.
Calculate Z2, input to output, W2, A1 plus B2.
Calculate Y hat, final output, sigmoid Z2.
Now we compare Y hat to the target Y, get the error E, U or Y, Y hat, and calculate the loss, maybe L equals E2.
Perfect.
Now the goal is to find L Euro W2, L Euro 1 Euro 1, and L Euro 1 so we can update the weights and biases.
And this is where the chain rule becomes the hero.
Absolutely.
We apply it step by step, working backwards.
For L Euro 2, it breaks down into L Euro 1 Euro 2.
And the clever bit is we already calculated most of those parts during the forward pass.
Exactly.
We know Y hat, Z2, and A1 from the forward pass.
Calculating those individual derivatives is straightforward.
Same for L O B2.
So we calculate the gradients for the output layer first.
Right.
Then we continue propagating the error information backwards to find the gradients for the hidden layer, is L O W1 and it LB.
And that involves using the chain rule again, but also using the gradient information we just calculated for the output layer and the weight W2.
Yes, precisely.
You need L O 1 to find A W1 and L O B1.
And L O 1 depends on how A1 affected Z2, which involves W2, and how Z2 affected the final loss.
It's a chain reaction flowing backwards.
And needing that forward pass weight W2 for the backward calculation is what makes it maybe not perfectly biologically plausible.
That's one of the arguments, yes.
How would a real neuron easily access that specific weight from a downstream connection during the learning update?
It's an open question.
OK, but computationally, once we have all those gradients.
We just apply the delta rule update again.
Update W1 B1 W2 B2 by taking a small step proportional to the negative of their respective gradients.
And repeat this whole forward pass, backward pass, update cycle many times.
Many, many times over all the training data.
And the beauty is this whole process works as long as everything in the network is differentiable.
That's the key constraint.
Sigmoid activation helps there.
But yes, if it's differentiable, backprop gives you the gradient.
And gradient descent can minimize the loss.
And you can build much bigger networks this way, more layers, more neurons.
Absolutely.
Different architectures, different connections, different loss functions, all trainable with the same core backpropagation principle tailored to the task.
Like the handwritten digit example.
784 inputs for the image pixels.
Right.
Multiple hidden layers may be fully connected.
Then 10 output neurons, one for each digit.
And backprop handles training even that complex beast.
It does.
It efficiently computes potentially millions of gradients needed to adjust all those connections.
It's incredibly powerful.
So what exactly is the network learning when we do this?
It's more than just fitting a function, right?
That's a crucial point the chapter makes, building on Werbos's original insight.
Backprop isn't just about calculating derivatives.
It's about learning representations.
Yes.
Remember the differentiability requirement.
That's why moving from threshold neurons to sigmoid was so important for Rumelhart, Hinton, and Williams.
And solving that symmetry problem was also key.
Not stochastic neurons, but...
Random initial weights.
That was Rumelhart's insight, according to Hinton.
Just start the weights and biases randomly near zero.
That breaks the symmetry.
And different hidden neurons start learning different things.
Hinton credits Rumelhart a lot for the algorithm itself.
He does.
He emphasizes Rumelhart's role in the reinvention, with Hinton refining and testing, and Williams helping with the math.
So learning representations by backpropagating errors.
The title of their FANAS paper says it all.
It really does.
Before this, algorithms often needed humans to figure out the important features in the data first.
Like, for that nonlinear 2D data, you couldn't just use x and y.
You needed to maybe engineer a feature like xi or by 2.
Exactly.
But neural networks trained with backprop if they have enough capacity in the hidden layers.
They learn the useful features themselves automatically.
Yes.
The hidden units learn to represent the input data in a new way, transforming it so that the final layer can solve the problem, often linearly in that transformed space.
They discover the necessary nonlinear combinations internally.
That's the real power.
Not just curve fitting, but feature discovery.
It's what set it apart from things like the older perceptron learning rule.
The chapter mentions getting it published in Nature wasn't trivial either and needed some careful explanation.
And interestingly, Jan Lacoon was working on similar ideas independently in Paris.
Yes.
Almost simultaneously.
And when he and Hinton eventually met, they instantly clicked.
Lacoon immediately saw how important backprop was going to be.
They both went on to build major labs, really pushing deep learning forward.
Absolutely.
Setting the stage for everything we see today.
The chapter also touches on Rumelhart's later illness, Pick's disease, and Hinton's acknowledgement of his absolutely crucial, perhaps sometimes overlooked role.
Before we wrap up, there's that quick mathematical coda.
Ah, yes.
Deriving the sigmoid derivative using quotient and chain rules.
A nice little demonstration.
And then the final section generalizes backprop to bigger networks using matrices.
Right.
It shows how the same principles apply with multiple hidden layers.
You have weight matrices, w1, w2, w3, and bias vectors, b1, b2, b3.
The forward pass computes layer activations, a1 equals sigmoid, w1x plus b1, a2 equals sigmoid, w2, a1 plus b2, and so on.
Until you get the final output yi hat.
Right.
Then error loss.
And then backpropagation calculates the gradient of the loss with respect to each weight matrix and bias vector using the chain rule in matrix form.
Again, all the terms needed were computed on the forward pass or are the current weights.
And you update the weights and biases using those matrix gradients and the delta rule.
That's the essence of the generalized algorithm.
It scales.
OK.
So that really covers the whole journey presented in the chapter.
It does.
From the historical context, the Minsky paper myth, Rosenblatt's foresight.
Through the core problem of training multi -layer nets, the need for non -linearity.
The elegance of the chain rule, enabling backpropagation.
And the crucial idea of learning representations, not just functions.
It's quite a story.
An algorithm born from trying to understand the brain, now powering so much technology.
So the final thought for you, the learner.
As we've seen this evolution from basic concepts to a powerful learning algorithm,
what might the next fundamental breakthroughs look like?
What principles might future AI build upon?
That's the exciting question, isn't it?
How will we continue to build on these ideas or perhaps find entirely new ones as we keep exploring how intelligence, biological, or artificial actually works?
Something to definitely keep thinking about.
And with that, we've covered the key insights from chapter 10 of Why Machines Learn.
Thanks for joining us on the Deep Dive.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥Related Chapters
- All-Pairs Shortest Paths: Floyd-Warshall and Johnson's AlgorithmIntroduction to Algorithms
- Getting Started: Insertion Sort and Algorithm DesignIntroduction to Algorithms
- Heapsort: Heaps, Priority Queues, and the Heapsort AlgorithmIntroduction to Algorithms
- Matchings in Bipartite Graphs: Stable Marriage and the Hungarian AlgorithmIntroduction to Algorithms
- Single-Source Shortest Paths: Bellman-Ford and Dijkstra's AlgorithmIntroduction to Algorithms