Chapter 9: The Man Who Set Back Deep Learning (Not Really)


Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

We take fascinating topics, find the best sources,

and, well, we explore them in depth for you.

Today we're looking at a really pivotal moment in AI history.

It focuses on a researcher, George Cybenko, who years later was apparently treated like a rock star at a deep learning summer school.

Yeah, quite the reception.

But here's the twist.

There's also this suggestion from another AI pioneer that Cybenko's work might have actually inadvertently held back deep learning's progress.

Yeah.

Maybe by almost two decades.

It sounds like a paradox.

It really does.

And that's what we're diving into today, George Cybenko and his, let's say, complex legacy in AI.

Our main guide here is a chapter from Why Machines Learn: The Elegant Math Behind Modern AI.

And it really digs into Cybenko's work, its context, and these differing views on its impact.

So our mission is to unpack what Cybenko actually did, understand why the reactions are so different, and figure out what this tells us about the foundations of deep learning.

Right.

So to get our bearings,

the book reminds us about that early buzz around single-layer perceptrons back in the late 50s, early 60s.

Rosenblatt, Widrow.

Yeah.

Those names.

That's right.

Those early models, they showed promise, generated a lot of excitement.

But then came the cooldown, Minsky and Papert's Perceptrons in 1969.

Yeah.

The book details that, well, their work was rigorous, mathematically sound, showing the limits of single-layer networks.

But the unfortunate side effect was it sort of cast doubt on more complex multi-layer ideas too.

And that led to less funding, less research enthusiasm.

Exactly.

It contributed significantly.

But as the book points out, research didn't just stop completely.

No, you had things like Hopfield networks popping up in the early 80s, a different kind of model, more like a one-shot learner.

And kind of behind the scenes, some folks were still chipping away at training multi-layer networks through the late 70s and early 80s.

Chipping away until the big...

Right.

1986, backpropagation.

Rumelhart, Hinton, Williams.

That paper was absolutely game-changing.

It gave researchers a practical method to actually train those multi-layer networks.

It really laid the groundwork for the deep learning we see today.

And the book mentions there might be even earlier roots to backpropagation.

Maybe Rosenblatt himself.

Yeah.

It hints at that.

Fascinating historical thread.

Maybe for another time.

Definitely.

But okay, so backpropagation is starting to gain traction.

And it's into this environment, 1989,

that Cybenko publishes his paper.

Precisely.

And this is where it gets really interesting because Cybenko's work seemed to offer this fundamental theoretical answer about what neural nets could do.

Now, the book suggests tackling backpropagation before Cybenko.

But we're flipping that.

We are.

We want to understand Cybenko's theorem first, see how it provides this theoretical backdrop for why deeper networks might eventually become necessary and why backpropagation was so vital.

Plus, honestly, it lets us explore some cool mathematical ideas about functions first.

Okay, I like it.

So Cybenko's 1989 paper, the Universal Approximation Theorem, what's the core idea?

The core idea is actually surprisingly straightforward, but profound.

It states that a neural network with just one hidden layer.

Only one.

Only one.

As long as it has enough neurons in that layer and uses a specific type of nonlinear activation function,

it can approximate any continuous function to basically any level of accuracy you want.

Any function.

Wow.

Okay, hold on.

Approximate any function.

That sounds incredibly powerful.

What does that actually mean in practice if a single hidden layer can do that?

The implications are huge, theoretically.

Think about it.

If you can approximate any continuous function, you can model almost any relationship between inputs and outputs, like speech recognition, mapping audio waves to text, or image classification mapping pixel values to object labels, predicting trends, generating new images, new text.

It all boils down to learning complex functions.

So this theorem is like a foundational guarantee for AI's potential, even with a relatively simple structure.

Exactly.

For you listening, trying to grasp the core capabilities, this is key.

It says even simple networks have immense representational power, at least in theory.

What drove Cybenko to ask this specific question,

then, given Minsky and Papert had already shown the limits of single-layer nets?

Well, the chapter suggests his background was in signal processing, functional analysis, very mathematical.

He wanted to understand the fundamental capabilities of these networks, especially going just one step beyond the perceptron.

What happens when you add just that one hidden layer?

What are its theoretical boundaries?

That was his question.

Okay, so to really get why one hidden layer makes such a difference, maybe we should quickly recap the perceptron again.

No hidden layer at all.

Correct.

The input layer just holds the data.

Then you go straight to the output layer neurons.

Each output neuron gets all the inputs, applies weights, adds a bias, runs it through a simple threshold function, usually just a step function, on or off.

That's trained with the perceptron learning rule.

We've talked about that before.

Finding a line or a plane to separate the data.

Exactly.

Finding that linearly separating hyperplane.

But that's the catch.

It only works if the data is linearly separable.
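As a rough illustration of the perceptron and its linear-separability limit, here is a minimal Python sketch. The function names (`step`, `predict`, `train_perceptron`), the learning rate, the epoch count, and the AND/XOR toy datasets are all assumptions chosen for illustration; none of them come from the book.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def predict(weights, bias, x):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)

def train_perceptron(data, epochs=20, lr=1.0):
    """Perceptron learning rule: adjust weights only on misclassified points."""
    n_inputs = len(data[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy datasets (illustrative): AND is linearly separable, XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w, b = train_perceptron(AND)
and_correct = all(predict(w, b, x) == t for x, t in AND)

w, b = train_perceptron(XOR)
xor_correct = all(predict(w, b, x) == t for x, t in XOR)

print("AND learned:", and_correct)
print("XOR learned:", xor_correct)
```

On AND the rule settles on a separating line within a few passes. On XOR no single line can separate the classes, so no amount of training gets every point right.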

Add a hidden layer, and the game changes.

How so?

Now, inputs go to the hidden layer first.

Those hidden neurons do their own weighted sum, bias, activation function, and their outputs become the inputs for the output layer.

So you've got weights between input and hidden, and weights between hidden and output.

Right.

At least two sets of weights plus bias terms for each layer.
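The two sets of weights can be made concrete with a small forward pass. This is only a sketch: the layer sizes, the parameter values, the names (`forward`, `W_hidden`, `W_out`), and the choice of a sigmoid hidden layer with a linear output are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """One hidden layer: input -> hidden (sigmoid) -> output (linear)."""
    # First set of weights: input to hidden.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    # Second set of weights: hidden to output.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W_out, b_out)]

# Hypothetical sizes: 2 inputs, 3 hidden neurons, 1 output.
W_hidden = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
b_hidden = [0.0, -1.0, 0.5]
W_out = [[1.0, -2.0, 0.5]]
b_out = [0.1]

y = forward([1.0, 2.0], W_hidden, b_hidden, W_out, b_out)
print(y)
```

Each row of `W_hidden` belongs to one hidden neuron, and each row of `W_out` belongs to one output neuron.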

That's the key structural difference.

And the book calls anything with more than one weight matrix, so one or more hidden layers, a deep neural network.

That's the definition it uses.

Yeah.

And this complexity, these multiple layers of weights, is exactly why the simple perceptron training algorithm just doesn't work anymore.

You need something more sophisticated.

Which is where backpropagation comes in, which we'll cover next time.

Exactly.

Backprop was the breakthrough for training these deeper nets.

But Cybenko's work came alongside that, asking, okay, we might be able to train them now, but what are they theoretically capable of?

Even just one hidden layer.

What's the limit?

Okay, so it's all about approximating functions.

The book frames a neural network as basically a function itself.

Right.

$y = f(x)$.

It takes input x, does its magic with weights and activations, and produces output y.

Precisely.

And training is just finding the best weights and biases so the network's $f(x)$ gets as close as possible to the true underlying function that relates inputs to outputs in your data.

And that lets you do things like classification, finding the function that defines the decision boundary.

Or regression, finding the function that fits the trend in your data.

Or even modeling the incredibly complex probability functions needed for generative AI like ChatGPT.

Exactly.

All fundamentally about function approximation.

And Cybenko wanted to know, is any continuous function fair game?

Can a single hidden layer network, if we give it enough neurons, approximate anything?

What are the theoretical limits?

Right.

So how does it work?

How can one hidden layer approximate potentially very complex functions?

The book uses this "stack them up" idea.

Yeah, it's a great analogy.

Think about approximating the area under a curve in calculus.

You start with wide rectangles, it's rough.

Make the rectangles narrower, you get closer, infinitely thin rectangles.

That's integration, basically perfect approximation.
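The rectangle analogy can be checked numerically. The sketch below assumes a midpoint rule and the curve $x^2$ on $[0, 1]$, both picked purely for illustration; `rectangle_area` is a hypothetical helper name.

```python
def rectangle_area(f, a, b, n):
    """Approximate the area under f on [a, b] with n equal-width rectangles.

    Each rectangle's height is f evaluated at the middle of its base.
    """
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))

def square(x):
    return x * x

# The exact area under x^2 on [0, 1] is 1/3.
errors = [abs(rectangle_area(square, 0.0, 1.0, n) - 1.0 / 3.0)
          for n in (4, 16, 64)]
print(errors)
```

The error shrinks as the rectangles get narrower, which is the same intuition the neuron "bumps" rely on.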

Okay, so how does that relate to neurons?

Imagine small groups of neurons, neural units, each designed to output something like one of those thin rectangles.

It produces a certain value over a small input range and zero everywhere else.

Like a little bump function.

Sort of, yeah.

And if you can create lots of these little bumps or rectangles of different heights and at different positions along the input axis.

You can add them all up, stack them up to build a much more complicated shape.

Exactly.

You sum their outputs.

The book mentions Michael Nielsen has a great visual proof using step functions for this intuition.

But Cybenko's proof used a different activation function, the sigmoid, and that's important.

The sigmoid.

Okay, let's break that down.

We know a basic neuron calculates $z = w_1 x_1 + w_2 x_2 + b$, the weighted sum plus bias.

Then the output $y$ is $a(z)$, where $a$ is the activation function.

Right.

A linear neuron just has $a(z) = z$.

A perceptron often uses a step function, zero below a threshold, one above.

But the sigmoid is different.

It's $\sigma(z) = 1 / (1 + e^{-z})$.

What does that look like?

It's a smooth S shape.

It starts near zero for very negative inputs, smoothly rises, passes through point five when the input Z is zero, and then levels off near one for very positive inputs.
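A few sample values make the S shape concrete. This is a minimal sketch; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

values = {z: sigmoid(z) for z in (-10, -1, 0, 1, 10)}
for z, s in values.items():
    print(f"sigmoid({z:>3}) = {s:.5f}")
```

The output stays close to 0 for very negative inputs, equals 0.5 at zero, and approaches 1 for very positive inputs.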

Smooth.

Why is smooth important?

Ah, that smoothness, that nonlinearity, is absolutely crucial for training networks with hidden layers using methods like backpropagation.

We'll definitely come back to that.

But for now, think about how you can change this S shape.

How?

The weights, $w$, in that $z = w^T x + b$ calculation control the steepness of the S curve.

Bigger weights make it transition faster.

Okay.

And the bias B controls the position.

It shifts the whole curve left or right along the input axis.

So by tuning the weights and bias for each hidden neuron, you can get lots of different S shape curves, some steep, some gradual, some shifted left, some right.

Precisely.
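The effect of the weight and the bias on a single sigmoid neuron can be sketched as follows, assuming one scalar input. The helper names `neuron` and `midpoint` and the specific values are hypothetical. Note that in this parameterization the curve's midpoint sits at $x = -b/w$, so the bias shifts the curve for a given weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Single sigmoid neuron with a scalar input."""
    return sigmoid(w * x + b)

def midpoint(w, b):
    """Input where the neuron outputs exactly 0.5, i.e. where w*x + b = 0."""
    return -b / w

# Steepness: the rise in output between x = -0.5 and x = 0.5 grows with w.
rise_gentle = neuron(0.5, 1.0, 0.0) - neuron(-0.5, 1.0, 0.0)
rise_steep = neuron(0.5, 10.0, 0.0) - neuron(-0.5, 10.0, 0.0)

# Position: with w = 10, a bias of -30 moves the midpoint to x = 3.
shifted_mid = midpoint(10.0, -30.0)

print(rise_gentle, rise_steep, shifted_mid)
```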

Now picture Cybenko's network.

One input, one output, but potentially lots of these sigmoid neurons in the hidden layer.

Each hidden neuron creates its own specific sigmoid curve based on its weights and bias.

And then the output neuron.

The output neuron takes all those sigmoid outputs from the hidden layer and does a linear combination.

It multiplies each hidden output by another weight, let's call it alpha, and sums them all up.

These alpha weights can be positive or negative.

So you're basically adding and subtracting scaled versions of all those different S curves.

Exactly.

The math looks like $y(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i x + b_i)$, where $x$ is the input, $N$ is the number of hidden neurons, $w_i$ and $b_i$ are the weight and bias for the $i$-th hidden neuron, $\sigma$ is the sigmoid, and $\alpha_i$ is the weight connecting that hidden neuron to the final output.
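The weighted sum of sigmoids translates almost directly into code. The sketch below assumes a scalar input and uses made-up parameter values for three hidden neurons; `network`, `alphas`, `ws`, and `bs` are illustrative names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def network(x, alphas, ws, bs):
    """y(x) = sum_i alpha_i * sigmoid(w_i * x + b_i) for a scalar input x."""
    return sum(a * sigmoid(w * x + b) for a, w, b in zip(alphas, ws, bs))

# Hypothetical parameters for N = 3 hidden neurons.
alphas = [1.0, -2.0, 0.5]   # output weights, positive or negative
ws = [1.0, 2.0, -1.0]       # hidden weights
bs = [0.0, 0.0, 0.0]        # hidden biases

# At x = 0 every sigmoid equals 0.5, so the output is 0.5 * sum(alphas).
y0 = network(0.0, alphas, ws, bs)
print(y0)
```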

Adding sigmoids.

How does that approximate any function?

The book has examples, right?

Yes.

And they're really helpful.

It shows how you can take just two sigmoid neurons, set their weights and biases carefully, and then use the output weights, the alphas, to essentially subtract one sigmoid from a slightly shifted version of itself.

What does that achieve?

It creates something that looks remarkably like a smooth bump, or a sort of soft rectangular pulse.
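Here is a minimal sketch of that two-neuron bump. The edge positions, the steepness of 20, and the function name `bump` are assumptions for illustration, not the book's values.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bump(x, left, right, height=1.0, steepness=20.0):
    """Soft rectangular pulse: one sigmoid minus a shifted copy of itself."""
    rise = sigmoid(steepness * (x - left))    # w = steepness, b = -steepness * left
    fall = sigmoid(steepness * (x - right))   # same w, shifted to the right edge
    return height * rise - height * fall      # output weights +height and -height

inside = bump(1.5, left=1.0, right=2.0)
outside_left = bump(0.0, left=1.0, right=2.0)
outside_right = bump(3.0, left=1.0, right=2.0)
print(inside, outside_left, outside_right)
```

Between the two edges the output is close to `height`; away from them the two sigmoids cancel and the output is close to zero.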

Ah, okay.

So you can make those rectangle-like building blocks using sigmoids.

Exactly.

And then if you can make one bump, you can make lots of bumps of different heights and widths and positions by using more pairs of hidden neurons with different parameters.

Add all those bumps together.

And you can start building up approximations of more complex functions.

Yeah.

Like steps or waves.

Precisely.

The book shows approximating $y = x^2$.

With just 10 hidden neurons, you see the individual sigmoids, light gray lines, and their sum, black line, making a rough approximation.

But then it shows 20 neurons and 100 neurons.

And with 100 neurons, the approximation is visually almost perfect.

It's hugging the true $x^2$ curve incredibly closely.

Wow.

And then it shows an even more complex wiggly function being approximated really well with 300 neurons.

It demonstrates the principle.

Add enough sigmoid building blocks, weighted correctly, and you can build almost any shape.
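The "add enough building blocks" idea can be tried with hand-chosen parameters. The construction below (one soft bump per cell, height set to the target's value at the cell midpoint, two sigmoids per bump) is an assumption made for this sketch and is not the parameter set used in the book's figures, so the neuron counts differ from the ones quoted above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hand_designed_params(f, lo, hi, n_bumps):
    """Pick (alpha, w, b) triples by hand: two sigmoids per bump, one bump per cell.

    Cells extend one width past each end so the ends of [lo, hi] are covered too.
    """
    width = (hi - lo) / n_bumps
    w = 40.0 / width                       # steep relative to the cell (arbitrary choice)
    params = []
    for k in range(-1, n_bumps + 1):
        left = lo + k * width
        height = f(left + width / 2)       # target value at the cell midpoint
        params.append((+height, w, -w * left))             # rising edge
        params.append((-height, w, -w * (left + width)))   # falling edge
    return params

def network(x, params):
    """Single hidden layer with a linear output: sum of alpha * sigmoid(w*x + b)."""
    return sum(alpha * sigmoid(w * x + b) for alpha, w, b in params)

def max_error(f, lo, hi, n_bumps, n_test=501):
    """Largest absolute gap between the network and f on an even test grid."""
    params = hand_designed_params(f, lo, hi, n_bumps)
    xs = [lo + (hi - lo) * i / (n_test - 1) for i in range(n_test)]
    return max(abs(network(x, params) - f(x)) for x in xs)

def square(x):
    return x * x

errors = {n: max_error(square, -1.0, 1.0, n) for n in (5, 20, 80)}
for n, err in errors.items():
    print(f"{n:>3} bumps ({2 * (n + 2)} sigmoids): max error {err:.4f}")
```

More bumps give a tighter fit, but nothing here is learned: every weight and bias was set by hand from knowledge of the target function.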

But, and this is a big but, in those examples, someone figured out the right weights and biases beforehand, right?

I get it.

Hand-designed for the illustration.

Absolutely crucial point, yes.

In practice, you don't know the right weights and biases.

That's what the training algorithm, like backpropagation, has to learn from the data.

So training is the process of finding the best alphas, w's, and b's to make the network's output match the target function or data distribution.

Exactly.

The goal is to find the parameters that make the network approximate that unknown underlying function connecting inputs to outputs.

And while these examples use 1D inputs and outputs for clarity, the principle holds for high-dimensional data too: images, text, sound.

Okay, so the intuition is strong, the examples are compelling,

but intuition isn't proof.

Correct.

The chapter makes that clear.

These examples build understanding, but they don't rigorously prove it works for any continuous function.

For that, we need Cybenko's math.

Which relies on functional analysis and this idea of functions as vectors.

This sounds a bit abstract.

It can be, but it's a really powerful way to think.

Take $y = \sin(x)$ for $x$ between, say, 0 and 10.

Plot it.

Now pick a bunch of points along the x-axis.

0, 1, 2, up to 10.

Find the sine value at each point.

Okay, so you get a list of numbers, like sin 0, sin 1, sin 2, up to sin 10.

Right.

That list of 11 numbers, that's a vector in 11-dimensional space.

Do the same for $y = \cos(x)$ at the same points.

You get a different list of numbers, another vector in the same 11D space.

So any function, if you sample it enough, can be represented as a vector.

Essentially, yes.

The more points you sample, the higher the dimension of the vector and the better it represents the function.
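The sampling idea can be sketched in a few lines. The helper name `sample` and the particular linear combination at the end are illustrative assumptions.

```python
import math

def sample(f, lo, hi, n_points):
    """Represent f on [lo, hi] as a vector of n_points evenly spaced samples."""
    step = (hi - lo) / (n_points - 1)
    return [f(lo + i * step) for i in range(n_points)]

sin_vec = sample(math.sin, 0.0, 10.0, 11)   # samples at x = 0, 1, ..., 10
cos_vec = sample(math.cos, 0.0, 10.0, 11)

# Vectors can be added and scaled, just like the functions they stand for.
combo = [2.0 * s + 0.5 * c for s, c in zip(sin_vec, cos_vec)]

# Sampling more densely gives a higher-dimensional, more faithful vector.
fine_vec = sample(math.sin, 0.0, 10.0, 1001)
print(len(sin_vec), len(fine_vec))
```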

Now imagine sampling at an infinite number of points in the interval.

An infinite dimensional vector.

You got it.

And functions defined over the entire real line can also be thought of as points, or vectors, in an infinite dimensional function space.

And networks often deal with vector inputs and outputs anyway.

Right.

So a neural network is just transforming one vector into another.

Input vector, multiply by weights, apply sigmoid element-wise, get hidden vector, multiply by output weights, get output vector.

Okay.

So how does this vector view help prove the theorem?

Well, think of each possible sigmoid function, $\sigma(w^T x + b)$, for different $w$ and $b$, as a specific vector in this infinite dimensional function space.

The network's output is a linear combination, weighted sum, of these sigmoid vectors using the alpha weights.

So Cybenko's question becomes, can we reach any vector representing a continuous function by taking linear combinations of enough of these sigmoid vectors?

Is the set of all possible weighted sums of sigmoids dense in the space of all continuous functions?

Meaning, can we get arbitrarily close to any target function?
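Written out, the density claim looks roughly like this. It is a paraphrase rather than a quotation from the paper or the book: the unit cube $[0,1]^n$ as the domain and the symbols $\varepsilon$ and $N$ are notation assumed here, with $\alpha_i$, $w_i$, $b_i$ playing the same roles as in the discussion above.

```latex
\forall f \in C([0,1]^n),\ \forall \varepsilon > 0,\
\exists N \in \mathbb{N},\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n :
\quad
\sup_{x \in [0,1]^n}
\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{T} x + b_i\right) \right|
< \varepsilon
```

The statement only asserts that suitable parameters exist for each tolerance $\varepsilon$; it says nothing about how large $N$ must be or how to find the parameters.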

Dense in the space.

Okay.

Okay, that's the functional analysis language.

The book mentions vector spaces briefly.

Yeah, just the idea that functions live in a space where you can add them and scale them, like vectors.

So Cybenko used a proof by contradiction.

How did that work?

He started by assuming the opposite.

Assume there's some continuous function that cannot be approximated arbitrarily well by a single hidden layer network, no matter how many sigmoid neurons you use.

Okay.

Assume the theorem is false.

Right.

Then, using tools from functional analysis (properties of continuous functions, linear functionals, things like the Hahn-Banach theorem), he showed that this assumption leads to a mathematical impossibility, a contradiction.

So if the opposite leads to a contradiction.

Then the original statement must be true.

A single hidden layer network can approximate any continuous function given enough neurons and the right weights and biases.

But it didn't tell you how to find those weights and biases.

Exactly.

It was an existence proof, not a constructive proof.

It proved it's possible, but didn't give the recipe.

That's where learning algorithms like backpropagation come in.

Okay.

Which brings us back to that unintended consequence.

The theorem proved one layer was sufficient in theory.

And the chapter suggests this might have actually steered people away from exploring deeper networks in the nineties.

It seems plausible.

The focus might have shifted to "how wide does the single layer need to be?" rather than "what if we stack multiple layers?"

The quote from Cybenko is telling: "I didn't say you should use one layer.

People concluded that you only need one layer."

So the real deep learning revolution, starting around 2010,

happened when people did start seriously exploring and training networks with many layers.

The deep part.

Yes.

But we have to remember the other factors the chapter mentions.

That revolution needed the massive data sets and the huge increases in computing power that just weren't available in the nineties.

Theory alone wasn't enough.

True.

Practicality matters.

Yeah.

Did Cybenko himself have caveats?

Even in the 89 paper, he noted that while approximation was possible, getting high accuracy for complex functions might require an astronomical number of hidden neurons.

And he worried about the curse of dimensionality, the known problem that approximation gets exponentially harder as the number of input dimensions increases.

High dimensional spaces are mostly empty and you need vast amounts of data to cover them.
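A back-of-the-envelope count shows why dimensionality hurts. The sketch assumes a simple evenly spaced grid with 10 sample points per axis, which is just one way to illustrate the growth; `grid_points` is a hypothetical helper.

```python
def grid_points(points_per_axis, dims):
    """Number of samples in an evenly spaced grid over `dims` dimensions."""
    return points_per_axis ** dims

counts = {d: grid_points(10, d) for d in (1, 2, 3, 10, 100)}
for d, n in counts.items():
    print(f"{d:>3} dimensions: {n:.3e} grid points")
```

Covering even 100 input dimensions at this modest resolution would take far more samples than any dataset could supply.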

Which is weird because modern deep learning seems to handle high dimensions surprisingly well.

Images have millions of pixels.

Language models handle huge vocabularies.

Exactly.

That's the modern paradox.

Today's deep networks, with billions of parameters over many layers, often work much better than that classical theory might predict.

They don't seem quite as cursed by dimensionality and they often generalize well, instead of just memorizing the training data despite their huge capacity.

So there's still mystery there.

Why does deep work so well in practice?

That's the big question.

And as the chapter concludes, to start untangling that, we need to understand how these deep networks are actually trained.

Which means next time, we dive into backpropagation.

Perfect.

So let's quickly recap this deep dive.

We looked at George Cybenko's universal approximation theorem from 1989, set against the history of perceptrons and the rise of backpropagation.

The core idea.

One hidden layer, enough neurons, nonlinear activation like sigmoid, can theoretically approximate any continuous function.

We saw the intuition: stacking up basic shapes, like sigmoids combined to make bumps.

The more formal view: functions as vectors in infinite-dimensional space.

And Cybenko's proof by contradiction, showing these sigmoid combinations are dense in that space.

Right.

An existence proof.

We also touched on that potential irony that proving one layer was enough might have slowed down research into multiple layers for a while.

A fascinating historical twist.

But ultimately, the theorem established the fundamental power of neural networks.

Absolutely.

A theoretical cornerstone.

Which leaves us with a final thought for you, the listener.

If one layer can do anything in theory,

why has depth, using many layers, been so critical for the AI breakthroughs we see today?

What extra magic does stacking layers provide beyond just adding more neurons overall?

It's something to chew on as we get ready to unpack backpropagation next time.

How does information transform layer by layer?

Excellent.

And with that, we can confirm we've covered the key insights from the chapter in Why Machines Learn dealing with George Cybenko and the Universal Approximation Theorem.

Thanks for joining us for this deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary
What this audio overview covers
George Cybenko's universal approximation theorem, published in 1989, stands as a pivotal yet frequently misunderstood milestone in artificial intelligence history. The theorem establishes that a neural network containing only a single hidden layer, when equipped with sufficient neurons and a non-linear activation function like the sigmoid, possesses the theoretical capacity to approximate any continuous function to arbitrary levels of precision. Despite its mathematical elegance and rigor, the initial interpretation of this result created an unintended obstacle to progress in deep learning research. Many practitioners and theorists concluded from Cybenko's work that shallow networks were theoretically adequate, making deeper architectural designs unnecessary—a misconception that likely impeded the practical development of more advanced neural systems for years. The proof itself relies on sophisticated mathematical concepts, particularly the treatment of functions as elements within infinite-dimensional vector spaces and the strategic assembly of sigmoid neurons to construct increasingly refined function approximations. Cybenko's logical approach employs proof by contradiction, grounded in functional analysis frameworks that provide the formal scaffolding for this landmark result. Beyond the technical details, this chapter illuminates a fundamental paradox that continues to shape machine learning research: the persistent gap between what mathematical theory predicts and what practitioners observe empirically. Classical theoretical analysis suggests that networks with millions of parameters should severely overfit when applied to high-dimensional data and suffer catastrophically from the curse of dimensionality. 
Yet contemporary deep neural networks routinely defy these theoretical warnings, achieving exceptional generalization performance across countless practical applications despite their enormous parameter counts and operation in extremely high-dimensional spaces. By examining how a single influential theorem simultaneously advanced understanding while sowing confusion across generations of researchers, the narrative demonstrates how mathematical abstraction and hands-on innovation operate in tension throughout artificial intelligence development.
