Chapter 11: The Eyes of a Machine

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today, we're plunging into a truly incredible story,

how machines learn to see.

It's quite a journey weaving together biology, some really elegant mathematics and key engineering breakthroughs.

That's right.

We're digging into the source material you shared, specifically that great chapter, "The Eyes of a Machine," from Why Machines Learn: The Elegant Math Behind Modern AI.

Yeah.

And our mission today is really to unpack those key moments, the history, the biological sparks, the math concepts, the algorithms, everything that led to the kind of computer vision we see all around us now, you know, especially with neural networks.

We'll try to break down the tricky ideas just for you.

It's a story that, you know, genuinely starts with looking inside cat brains and ends up with these massive digital networks recognizing images like never before, with some surprising turns along the way.

Okay, let's get into it.

When we talk about machine vision and neural nets, where does that story really kick off?

Well, interestingly, it starts not in a computer lab, but with neurophysiologists,

David Hubel and Torsten Wiesel back at Harvard in the early 1960s.

Their work on how cats see was foundational, Nobel Prize stuff, eventually, in '81.

Right.

And their experiments, reading the source, they sounded unbelievably detailed, recording from single neurons in anesthetized cats.

I mean, that sounds incredibly challenging for the time.

Always painstaking.

Hubel had actually invented this special tungsten electrode back in '57, just for this kind of thing, recording from one neuron without it breaking if the animal shifted.

Crucial for stable recordings.

And the setup itself.

Wow.

Anesthesia, eyelid clips, atropine for the pupils, muscle paralysis, needing artificial breathing,

contact lenses.

It really paints a picture of the technical difficulty back then.

It absolutely does.

And the source does touch on the ethical side too.

The debates about animal experiments that were starting to bubble up even by the 80s.

Yeah.

It mentions criticism from people like Steven Zach about later kitten experiments, but then it also includes reader responses, defending the work, pointing to medical benefits like preventing blindness in kids and arguing the animals were treated humanely.

It kind of presents both sides of that contemporary debate.

Exactly.

The source lays out that discussion pretty neutrally, acknowledging the controversy as it was expressed then.

What's undeniable though, is that their findings, you know, despite those historical complexities,

hugely influenced how people thought about building artificial systems.

And wasn't there a key moment of like accidental discovery?

Absolutely.

A total science classic.

They were struggling initially to get these cortical neurons to fire consistently.

Then suddenly they hear one neuron going crazy, the distinctive popping sound, just as they were swapping projector slides.

And they realized it wasn't the image on the slide the neuron cared about.

It was the faint edge of the slide itself moving across the screen, but only when it was at a specific angle.

Boom.

They'd found an edge detector.

Wow.

That feels like the seed for the whole idea of feature detection.

And it led to their big hypothesis, right?

About the brain processing vision in layers, a hierarchy.

Yes.

And to get that hierarchy, you need a couple of terms.

Visual field, just what your eyes see when you're looking somewhere.

And a neuron's receptive field, that's the specific little patch within the visual field that the neuron responds to. They vary a lot in size.

Okay.

And the first layer, getting input from the retinal ganglion cells or RGCs, they have the smallest receptive fields.

Exactly.

Hubel and Wiesel proposed the visual cortex builds from there.

They talked about simple cells.

Imagine a simple cell getting signals from say, four RGCs all lined up.

It's wired so it only fires if all those specific RGCs fire together.

Which would happen only if like a vertical line crosses exactly over those four RGCs.

Precisely.

It's like a little pattern detector.

You can sort of see the connection to the threshold idea in artificial neurons.

Individual inputs aren't enough, but together they cross the line.
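The simple-cell idea described above can be sketched as a threshold unit. This is a minimal illustration, not a model from the source: the function name `simple_cell`, the equal weights, and the threshold of 4 are all assumptions chosen so that the cell fires only when all four lined-up RGCs fire together.

```python
def simple_cell(rgc_inputs, weights, threshold):
    """Return 1 (fire) only if the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, rgc_inputs))
    return 1 if total >= threshold else 0

# Four RGCs lined up vertically. Equal weights and a threshold of 4 are
# illustrative assumptions: no subset of three inputs can reach the threshold.
weights = [1.0, 1.0, 1.0, 1.0]
threshold = 4.0

all_four = simple_cell([1, 1, 1, 1], weights, threshold)    # line covers all four RGCs
only_three = simple_cell([1, 1, 1, 0], weights, threshold)  # line misses one RGC
print(all_four, only_three)
```

Lowering the threshold would make the cell less selective, which is one way to see why the threshold matters as much as the wiring.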

And these simple cells were specialists.

One for vertical, one for horizontal, one for 45 degrees, that kind of thing.

Exactly right.

Orientation specific.

Then building on that, they described complex cells.

Now, these guys listen to lots of simple cells.

But crucially, simple cells that all detect the same feature like a vertical edge, just in slightly different positions within a larger area.

Ah, okay.

So the complex cell fires if any of those vertical edge simple cells fire.

It doesn't care precisely where the vertical edge is, just that it's somewhere in its patch.

You got it.

That gives it spatial invariance or translational invariance.

It finds the feature even if it shifts around a bit in its receptive field.
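A rough sketch of how a complex cell could get that invariance: it takes a logical OR over simple cells that all look for the same feature at different positions. The helper names and the tiny 3-by-4 binary "images" are hypothetical, picked only to make the behavior visible.

```python
def simple_cell_vertical(strip):
    """Fires only if all three pixels in a vertical strip are on."""
    return 1 if sum(strip) >= 3 else 0

def complex_cell(image):
    """Fires if ANY vertical-edge simple cell fires, at any column position."""
    n_cols = len(image[0])
    responses = []
    for col in range(n_cols):
        strip = [image[row][col] for row in range(3)]
        responses.append(simple_cell_vertical(strip))
    return 1 if any(responses) else 0

# The same vertical edge at two different positions, plus a diagonal line.
edge_left = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
edge_shifted = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]]
diagonal = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

print(complex_cell(edge_left), complex_cell(edge_shifted), complex_cell(diagonal))
```

The complex cell responds to the vertical edge in either position but stays silent for the diagonal, which no vertical-edge simple cell detects.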

And you see the receptive fields getting bigger: RGCs, then simple cells.

Then complex cells cover more ground.

They even mentioned hypercomplex cells, right?

Sensitive to the length of an edge?

Yes.

Firing best for edges of a specific length.

You can start to imagine how you'd combine these things: hypercomplex cells for corners, maybe combine those to detect squares or triangles.

And building up like this could also handle other variations, like rotation or different lighting.

Which, kind of jokingly, leads to that idea of a grandmother cell that only fires when you see your grandma.

Yeah, a bit of fun lore, as the source notes.

And look, the real brain is way more tangled than this neat hierarchy.

But the story, the concept, building complex detectors from simple ones, gaining invariance, that was hugely influential for the first artificial vision networks.

Okay, so that's the biological spark.

How did these ideas actually jump over into AI?

Well, before this biological thinking took hold, computer vision was pretty different.

Researchers would have to sit down and try to manually define the features a computer should look for.

Pixel patterns for edges, corners, things like that.

Like building a giant dictionary of shapes.

Kind of, yeah.

And then you try to manually figure out rules for handling variations in position or lighting or slight distortions.

It was incredibly tedious, computationally heavy, and honestly, it really struggled with that invariance problem.

The brain seemed so much more effortless.

So someone tried to build a system mimicking that biological hierarchy.

Exactly.

Kunihiko Fukushima, working in Tokyo at NHK Labs.

He was really inspired by Hubel and Wiesel.

In 1975, he came up with the Cognitron.

It was one of the first complex neural networks really designed for image recognition.

It even used a Hebbian learning rule, sort of like "neurons that fire together, wire together."

So it learned patterns by strengthening connections.

That was the principle.
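The "fire together, wire together" principle can be illustrated with the plain textbook Hebb rule. This is only a sketch of the general idea; the Cognitron's actual learning rule is not spelled out here, and the function name and learning rate below are assumptions.

```python
def hebbian_update(weights, inputs, output, learning_rate=0.1):
    """Strengthen each connection in proportion to (input activity * output activity)."""
    return [w + learning_rate * x * output for w, x in zip(weights, inputs)]

weights = [0.0, 0.0, 0.0]
inputs = [1, 0, 1]   # inputs 0 and 2 are active, input 1 is silent
output = 1           # the receiving neuron fires at the same time

# Present the same pattern three times.
for _ in range(3):
    weights = hebbian_update(weights, inputs, output)

print(weights)
```

Only the connections from the active inputs grow; the silent input's weight stays at zero.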

But Fukushima himself pointed out a big problem.

The Cognitron was position dependent.

Show it the same pattern shifted slightly, and it wouldn't recognize it.

It lacked that crucial translational invariance.

Which must have led him to the Neocognitron in 1980.

Precisely.

The Neocognitron was his direct shot at building Hubel and Wiesel's hierarchy explicitly into a network.

He designed it with layers of S-cells and C-cells.

S for simple, C for complex.

You got it.

S-cells acted like simple cells, detecting features like edges in local patches.

Then multiple S-cells, all looking for the same feature but in adjacent patches, would feed into a C-cell.

And the C-cell fires if any of its S-cells fire, acting like a complex cell.

Exactly.

So the C-cell knows the feature is somewhere in its bigger receptive field.

That's how it achieved translational invariance.

Then C-cells fed S-cells in the next layer, building up the hierarchy.

The source calls it a drastic solution for its time.

And it really worked, recognizing handwritten digits even when shifted or distorted.

That sounds like a major step.

It absolutely was.

But the Neocognitron had its own issue.

The training algorithm was pretty clunky and custom built.

It only adjusted the S-cell weights, for example.

And this is where the story kind of jumps forward a decade or so to Yann LeCun.

Right.

LeCun tackled the same challenge but ended up with a different network and, crucially, a better way to train it, one that scaled better.

Yes.

Yann LeCun's journey is fascinating.

Early on, he was just captivated by intelligence and had the strong conviction that machines had to learn.

You couldn't just program intelligence top down.

And he was influenced by reading about things like the Perceptron and debates about learning versus innate ability.

The source mentions Seymour Papert.

Indeed.

And digging into the early AI literature, he zeroed in on a key problem.

Single layer things like the Perceptron were limited.

To do more complex stuff, you needed multi-layer networks.

People knew this back in the 60s, but the problem was how to train them.

And his studies, like reading Duda and Hart's classic pattern recognition book, led him to a core idea.

Learning means minimizing an objective function.

Okay, this is super important.

The objective function is basically the thing you're trying to make as small as possible when you train the network.

It's usually the loss function, which measures how wrong the network's prediction is, plus something called a regularizer.

And the regularizer is there to prevent overfitting, right?

Yeah.

Where the network just memorizes the training examples.

Overfitting means it does great on the data it trained on, but falls apart on new stuff.

The regularizer adds a penalty for complexity, encouraging the network to find simpler, more generalizable patterns.

Minimizing this whole objective function is key to getting a model that actually works in the real world.
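To make "loss plus regularizer" concrete, here is a minimal sketch. The choice of mean squared error for the loss, an L2 penalty for the regularizer, and the weighting factor `lam` are all assumptions for illustration; the discussion above does not commit to specific forms.

```python
def mse_loss(predictions, targets):
    """Mean squared error: how wrong the predictions are on average."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def l2_penalty(weights):
    """Sum of squared weights: larger weights count as a more complex model."""
    return sum(w * w for w in weights)

def objective(predictions, targets, weights, lam=0.01):
    """Objective = loss + lam * regularizer (lam is a hypothetical trade-off knob)."""
    return mse_loss(predictions, targets) + lam * l2_penalty(weights)

predictions = [0.9, 0.2, 0.8]
targets = [1.0, 0.0, 1.0]

# Same predictions, but one model reaches them with much larger weights.
obj_small = objective(predictions, targets, weights=[0.5, -0.5])
obj_large = objective(predictions, targets, weights=[5.0, -5.0])
print(obj_small, obj_large)
```

With identical prediction error, the model with larger weights gets a worse (higher) objective, which is the pressure toward simpler solutions that the regularizer is meant to provide.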

So his PhD work was about figuring out an algorithm to minimize that objective function for these multi-layer networks.

Yes.

He developed an algorithm that was very closely related to backpropagation.

His view was about sending virtual target values backward, through the network, to figure out the errors and update the weights.

The source says it effectively was backpropagation under certain conditions.

And even then, he was thinking about how to build networks that could handle image invariance.

He presented this, apparently in French, and caught Geoff Hinton's attention.

That's the story.

Hinton invited him to Toronto for a postdoc, and that's where LeCun really started hammering out the ideas for what we now call convolutional neural networks, or ConvNets.

But back then, in the late 80s, you couldn't just download TensorFlow or PyTorch.

Absolutely not.

It was a massive practical hurdle.

LeCun and his collaborator, Léon Bottou, had to write their own simulator, SN, an ancestor to PyTorch, the source notes.

LeCun apparently said having that tool felt like having superpowers, because they could iterate so much faster than others.

And this work eventually landed him at Bell Labs with access to that big USPS digit data set.

Right.

Bell Labs recruited him.

Armed with the USPS data, he built a neural net for recognizing handwritten digits.

But computers were still too slow, so he went further: he wrote a compiler to translate his Lisp network definition into optimized C code to run on dedicated hardware, digital signal processors.

And the demo story is great.

Yeah, rigging up a camera so you could write a digit, and the DSP hardware would recognize it instantly.

LeCun said, even though he was confident, seeing it actually work live was absolutely elating.

And that system became LeNet, the blueprint for pretty much all modern CNNs.

Okay, so the heart of LeNet and all these CNNs is the convolution operation.

Let's break that down.

Right.

So think of your image.

Convolution involves taking a small matrix, maybe three by three pixels, maybe five by five, called a kernel or a filter, and sliding it over the image.

Like the source example with the five by five image and two by two kernel, you put the kernel on the top left patch.

Exactly.

Then you do an element-wise multiplication: multiply the kernel's top left value by the image's top left pixel value, the kernel's top right by the image's top right, and so on for the whole patch.

Then you sum up all those products.

And that sum gives you one single pixel value in a new output image.

Precisely.

Then you slide the kernel over, usually by one pixel, it's called a stride of one.

And repeat the multiply and sum process.

Yep.

Slide, multiply, sum.

Slide, multiply, sum.

Across the whole row, then down to the next row, covering every possible position the kernel can sit on the input image.

The source gives that formula for the output size: floor((input size - kernel size) / stride) + 1.

Useful if you need to calculate dimensions.
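The slide, multiply, sum procedure and the output-size formula can be sketched in a few lines of plain Python. Some assumptions to flag: the function names are made up for this example, the kernel is assumed square, there is no padding, and, as is usual in CNNs, the kernel is not flipped (strictly speaking this is cross-correlation). The pixel values are arbitrary; only the 5-by-5 image and 2-by-2 kernel sizes echo the example mentioned above.

```python
def output_size(input_size, kernel_size, stride=1):
    """floor((input size - kernel size) / stride) + 1"""
    return (input_size - kernel_size) // stride + 1

def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image; multiply element-wise and sum."""
    k = len(kernel)
    out_rows = output_size(len(image), k, stride)
    out_cols = output_size(len(image[0]), k, stride)
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * image[i * stride + a][j * stride + b]
            row.append(total)
        output.append(row)
    return output

# 5x5 image whose pixel at (row, col) is row * 5 + col, and a 2x2 kernel.
image = [[row * 5 + col for col in range(5)] for row in range(5)]
kernel = [[1, 0], [0, 1]]

result = convolve2d(image, kernel)
print(len(result), len(result[0]))  # 4 4, matching output_size(5, 2, 1)
print(result[0][0])                 # 0 + 6 = 6
```

With a stride of one, a 5-by-5 input and 2-by-2 kernel give a 4-by-4 output, since the kernel can sit in four positions along each axis.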

And those Prewitt kernels in the example really show what different kernels can do, right, using a 28 by 28 digit image?

Yeah, it's a great illustration.

One specific three by three kernel, when you convolve it with the image,

produces an output image that highlights all the horizontal lines.

A different kernel highlights the vertical lines.

These were hand-designed filters.
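Here is a small sketch of what hand-designed edge kernels do. The two 3-by-3 kernels below are the standard Prewitt pair (sign conventions vary between references), and the 5-by-5 test image with a single horizontal bar is an assumption standing in for the 28-by-28 digit image.

```python
def convolve2d(image, kernel):
    """Stride-1 sliding window with no padding: multiply element-wise, then sum."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]

# Standard Prewitt kernels: one responds to horizontal edges, one to vertical edges.
PREWITT_HORIZONTAL = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
PREWITT_VERTICAL = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

# A 5x5 image containing one horizontal bar across the middle row.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

horizontal_response = convolve2d(image, PREWITT_HORIZONTAL)
vertical_response = convolve2d(image, PREWITT_VERTICAL)
print(horizontal_response)  # strong responses above and below the bar
print(vertical_response)    # all zeros: there are no vertical edges here
```

The horizontal kernel lights up along the bar's top and bottom edges while the vertical kernel stays silent, which is the selectivity the example is illustrating.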

Ah, but LeCun's big insight was that the network could learn the kernel values.

Exactly.

That's the magic.

Instead of us trying to figure out the perfect edge detector kernel, the network learns the best kernel values, which are just the neuron's weights, during training through backpropagation.

Each position the kernel lands on corresponds to one neuron in that first hidden convolution layer.

The kernel values are its weights.

The image patch is its input.

The weighted sum is its output.

So that first convolution layer is like a bank of simple cells, each with its learned kernel, looking for a specific little pattern in its receptive field, the patch under the kernel.

Spot on.

And you don't just learn one kernel per layer.

You learn multiple kernels simultaneously, each specializing in detecting a different low-level feature.

The output for each kernel across the whole image is called a feature map.

And then you stack these convolution layers?

Yes.

The feature maps output by one layer become the input images for the next convolution layer.

And because the neuron in, say, layer two is looking at a patch of a feature map from layer one, its effective receptive field, in terms of the original input image pixels, gets bigger.

Which helps build up that translational invariance, like the Hubel and Wiesel hierarchy again, detecting features regardless of exact position.

It does.

And later layers combine the simpler features detected earlier to find more complex patterns.

It's compositional.
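How fast the effective receptive field grows can be worked out with a standard recurrence: each layer adds (kernel size - 1) times the current spacing between neighboring neurons, measured in input pixels. The recurrence and the function name are not taken from the source; treat this as a sketch under those assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first.

    Returns the receptive-field width, in input pixels, of a neuron in each layer.
    """
    rf = 1     # a single input pixel sees only itself
    jump = 1   # spacing, in input pixels, between neighboring neurons
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Three stacked 3x3 convolution layers, all with a stride of one.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

So a layer-two neuron with a 3-by-3 kernel effectively sees a 5-by-5 patch of the original image, and a layer-three neuron sees 7-by-7.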

Okay.

Besides convolution, there's also pooling, right?

Yes.

Pooling is another common step, often done right after a convolution layer.

Its main job is to shrink the feature maps down spatially.

Why shrink them?

Two main reasons.

It reduces the number of parameters and computations needed in the next layers, making things more efficient.

And, by summarizing a region, it makes the network even more robust to small shifts and distortions in the feature's position, boosting that translational invariance again.

And the most common type is max pooling.

Right.

You take a small window or filter, say two by two, place it on the feature map, and just take the single maximum pixel value from that two by two region.

That maximum value becomes one pixel in the new smaller output map.

And you usually slide this pooling window so it doesn't overlap, like a stride of two for a two by two filter?

That's very common, yes.

The source example shows a four by four input becoming a two by two output with a two by two filter and stride two.

Same size formula applies.

Importantly, pooling layers usually don't have weights to learn themselves, but their structure affects how gradients flow back during training.
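Max pooling as just described can be sketched like this. The function name and the pixel values in the 4-by-4 feature map are made up for illustration; only the sizes (4-by-4 in, 2-by-2 window, stride two, 2-by-2 out) follow the example above.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum value inside each pooling window."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

Note that there is nothing to learn here: the function has no weights, only the fixed window size and stride.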

Okay, convolution, pooling.

Let's put it together.

How does a simple CNN architecture for digit recognition actually look?

Alright, so you start with your input image, the handwritten digit.

First, you hit it with the convolution layer.

Let's say you have five different kernels in this layer.

Each kernel slides across the image, doing its convolution thing.

Producing five different feature maps, each highlighting where its specific learned feature was found.

Exactly.

Maybe one finds diagonal lines, another finds curves, etc.

These maps might be slightly smaller than the input, say 24 by 24 if the input was 28 by 28 and the kernel was five by five.

Then you typically apply max pooling to each of those five feature maps.

Shrinking them down.

Right.

Maybe a two by two max pool takes those 24 by 24 maps down to 12 by 12.

So now you have five 12 by 12 pooled feature maps.

You might even repeat this: another convolution layer, maybe with more kernels, say 10, applied to those five maps, followed by another pooling layer.
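The shape bookkeeping for this architecture can be traced with the same output-size formula. The first three stages use the numbers mentioned above; the five-by-five kernel for the second convolution layer and the second two-by-two pool are assumptions, since the discussion does not give those sizes. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height for an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

def trace_shapes(input_size, stages):
    """Return (stage name, number of maps, map size) after each stage."""
    size, maps, trace = input_size, 1, []
    for name, n_kernels, kernel, stride in stages:
        size = conv_out(size, kernel, stride)
        if n_kernels is not None:  # convolution layers set the number of maps
            maps = n_kernels
        trace.append((name, maps, size))
    return trace

stages = [
    ("conv1", 5, 5, 1),      # five 5x5 kernels, stride 1
    ("pool1", None, 2, 2),   # 2x2 max pool, stride 2
    ("conv2", 10, 5, 1),     # ten kernels; the 5x5 size is an assumption
    ("pool2", None, 2, 2),   # second pool, sizes assumed
]

trace = trace_shapes(28, stages)
for name, maps, size in trace:
    print(f"{name}: {maps} maps of {size}x{size}")

# Number of values left once the last stage's maps are lined up end to end.
flattened_length = trace[-1][1] * trace[-1][2] ** 2
print(flattened_length)  # 10 * 4 * 4 = 160
```

Under these assumptions the network goes from one 28-by-28 image to ten 4-by-4 maps, or 160 values in total.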

Okay, so you've extracted and downsampled features.

How do you get to the final prediction?

Is it a four?

Good question.

After the last pooling layer, you take all the pixel values from all the resulting feature maps and you flatten them basically.

Just line them all up end to end into one single long vector.

This long vector becomes the input to one or more standard fully connected layers.

These are the classic neural network layers where every neuron in the layer is connected to every neuron in the previous layer.

Like we've discussed in previous deep dives.

Exactly.

So maybe you have one fully connected layer, FC1, taking that flattened vector.

Then its output goes to a second fully connected layer, FC2.

And the final layer is the output layer.

For digits zero through nine, this would have 10 neurons.

One neuron for each possible digit.

Precisely.

The neuron that outputs the strongest signal, the highest activation, that's the network's guess.

This looks most like a four.

If it were cats versus dogs, you might just have one output neuron indicating the probability of cat.
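The last steps, flatten, fully connected layer, pick the strongest output, can be sketched at toy scale. Everything numeric here is hypothetical: two 2-by-2 feature maps instead of real pooled maps, three output neurons instead of ten, and hand-picked weights rather than learned ones.

```python
def flatten(feature_maps):
    """Line up every value from every feature map into one long vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def fully_connected(inputs, weights, biases):
    """Each output neuron takes a weighted sum over ALL inputs, plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def predict(scores):
    """The index of the neuron with the highest activation is the network's guess."""
    return max(range(len(scores)), key=lambda i: scores[i])

feature_maps = [
    [[1, 0], [0, 1]],
    [[0, 2], [2, 0]],
]
flat = flatten(feature_maps)  # 8 values

# One row of weights per output neuron (three classes in this toy example).
weights = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
biases = [0.0, 0.0, 0.5]

scores = fully_connected(flat, weights, biases)
predicted_class = predict(scores)
print(flat, scores, predicted_class)
```

For digit recognition the same pattern applies with ten output neurons, and the predicted digit is simply the index of the largest score.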

And the whole thing learns through supervised training.

Yes, because you have labeled data.

You know this image is a four.

So you show it an image, it makes a prediction via those output neurons.

Yeah.

You calculate the error, how far off was its prediction from the true label four?

Then backpropagation kicks in.

Back prop calculates the gradient, essentially, how much each weight everywhere in the network, in the kernels and the fully connected layers, contributed to that error.

Then you nudge all those weights slightly in a direction that would reduce the error for that specific image.

And you just do that again and again for thousands, millions of images.

That's the training loop.

Doing it for the whole data set at once is gradient descent.

More practically, you use small batches of images for each update; that's stochastic gradient descent, or SGD.

The source calls it a drunken walk towards the best set of weights.

You repeat this over many passes through the data called epochs until the network gets good.
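The training loop can be sketched on a deliberately tiny problem. Instead of a CNN, this assumes a one-weight model y = w * x with a squared-error loss, so the gradient can be written by hand rather than computed by backpropagation; the data, learning rate, batch size, and function name are all illustrative choices.

```python
import random

def sgd_fit(data, epochs=50, batch_size=2, learning_rate=0.02, seed=0):
    """Mini-batch stochastic gradient descent for the one-weight model y = w * x."""
    rng = random.Random(seed)
    samples = list(data)
    w = 0.0
    for _ in range(epochs):                  # one epoch = one pass through the data
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient of the mean squared error over this batch with respect to w.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * gradient    # nudge w in the direction that reduces error
    return w

# Labeled data generated from the "true" rule y = 2x.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
learned_w = sgd_fit(data)
print(learned_w)  # very close to 2.0
```

Each update only sees a small batch, so individual steps are noisy, but over many epochs the weight settles near the value that minimizes the error on the whole data set.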

But the person building the network still has to make some choices up front.

Yes, those are the hyperparameters.

Things like which activation function to use in the neurons (it needs to be differentiable for backprop), how many convolution layers, how many kernels in each, the size of the kernels (three by three, five by five), the pooling size and stride, how many fully connected layers, and how many neurons in them. All that stuff isn't learned.

The source calls tuning these an art.

So LeCun's LeNet, using these principles, was deep because it had hidden conv and pool layers, and it actually worked commercially for NCR, reading checks, back in the early 90s.

Proof of concept.

Absolutely.

A real world success story for deep learning and back prop.

Okay.

So if LeNet worked then, why didn't deep learning just take over computer vision immediately in the 90s?

What held it back?

Yeah, that's a good question.

A few things were going on.

One was the rise of other machine learning methods, especially support vector machines, SVMs.

SVMs were mathematically elegant, maybe easier to grasp theoretically for many.

And crucially, good software libraries became available for them.

They worked well on the smaller data sets people mostly had back then.

While CNNs still felt a bit like black boxes and harder to implement.

Exactly.

The lack of standardized, easy to use software frameworks for CNNs was a big deal.

Like LeCun found, you pretty much had to roll your own.

And because Bell Labs couldn't just open source LeCun's tools, it hindered reproducibility and wider adoption.

People had to reinvent the wheel.

But LeCun kept working on CNNs, advocating for them.

He did.

And his work showed they were better on low-res images.

But the source mentions this issue of scale.

On higher resolution images, performance wasn't keeping pace.

You needed much, much bigger networks.

And bigger networks meant vastly more computations.

Tons of matrix math needing parallel processing.

Which leads us to the hardware solution.

GPUs.

Graphics processing units.

Designed for pushing pixels in 3D games.

But it turned out their architecture, loads of simple cores working in parallel, was perfect for the matrix multiplications needed to train huge neural nets.

The source mentions Jürgen Schmidhuber's group using GPUs for training big MLPs on MNIST around 2010 as an early sign.

Yeah, showing the potential speed-up.

But the massive CNN breakthrough, the one that really changed everything for large-scale vision, came out of Geoff Hinton's lab in Toronto.

With those key students.

Alex Krizhevsky, the GPU whiz, and Ilya Sutskever, the visionary.

Right.

Hinton's lab had already been playing with GPUs for a non-CNN project before that, actually.

Detecting roads in aerial photos.

He mentions that ironically might have slightly delayed their full pivot to CNNs on GPUs.

He even had to convince Microsoft to buy GPUs for a project back then.

And Sutskever.

The source paints him as having this deep belief in neural nets from early on.

Yeah, he saw them as obviously correct.

A powerful model.

He even seemed surprised the ideas were relatively simple compared to, say, physics.

He just had this strong intuition that they had a much higher ceiling than methods like SVMs, if they could be trained properly.

And the key ingredients finally came together.

Two things.

Data and compute.

First, the data.

Fei-Fei Li and her team released ImageNet in 2009.

Millions of labeled images.

Thousands of categories.

Just a massive data set.

Perfect for training big models.

Which led to the ImageNet Challenge, the annual competition starting in 2010.

Using 1.2 million images to train.

Testing on 100,000.

Yep.

And for the first couple of years, 2010, 2011, the winners were using the standard techniques, extracting handcrafted features like SIFT, then feeding them into SVMs.

But Sutskever felt those methods had a low ceiling.

He was convinced neural nets were the way if you had enough data and enough compute.

ImageNet was the data.

GPUs were the compute.

LeCun's group apparently saw the same opportunity, but didn't have someone ready to tackle the huge implementation challenge.

And that's where Krizhevsky was key in Hinton's lab.

He had the CUDA programming skills.

Honed on smaller data sets to actually build and train a massive CNN on GPUs efficiently.

Sutskever reportedly convinced him to scale up his skills for ImageNet.

And the result was AlexNet.

Named after Alex Krizhevsky.

Built by him, Sutskever, and Hinton.

A deep CNN.

Way bigger than anything before: half a million neurons.

60 million parameters trained on ImageNet using GPUs.

Multiple conv layers.

Pooling, fully connected layers.

1000 outputs for the categories.

Used ReLU activations too.

And the 2012 ImageNet results were dramatic.

Totally decisive.

AlexNet won easily.

Its top-5 error rate was about 17%.

The next best, using traditional methods, was around 26%.

Previous winners were even higher, like 28%.

It wasn't just a bit better.

It was a huge jump.
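For reference, a top-5 error rate counts a prediction as wrong only when the true label is missing from the model's five highest-scoring classes. The sketch below uses a hypothetical helper and made-up scores over six classes, not anything from the actual competition.

```python
def top_k_error(score_lists, true_labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for scores, label in zip(score_lists, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)

# Two images, six classes each (scores are invented for illustration).
score_lists = [
    [0.10, 0.40, 0.05, 0.30, 0.10, 0.05],  # true class 3 is the second-best guess
    [0.30, 0.20, 0.20, 0.15, 0.10, 0.05],  # true class 5 is the worst guess
]
true_labels = [3, 5]

top5 = top_k_error(score_lists, true_labels, k=5)
top1 = top_k_error(score_lists, true_labels, k=1)
print(top5, top1)  # 0.5 1.0
```

Top-5 is more forgiving than top-1, which is useful when there are a thousand categories and several of them may be plausible for the same image.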

The source basically calls AlexNet the moment deep learning finally lived up to its promise.

It really was.

It validated Sutskever's conviction.

Deep learning powered by big data and GPUs was going to change everything.

And the impact since then.

It's been everywhere.

Oh, massive.

CNNs and deep learning didn't just conquer computer vision, object recognition, face detection, all that.

They spread like wildfire.

Natural language processing, machine translation, medical imaging, finance, generative models.

The list is endless now.

But interestingly, the source ends by highlighting a bit of a mystery.

Even with all this success, we don't fully understand mathematically why these huge deep networks work so incredibly well.

It's a genuine puzzle that challenges a lot of existing ML theory.

The source quotes Mikhail Belkin, comparing it to the paradigm shift quantum mechanics brought to physics.

Maybe we need new theoretical tools to really grasp deep learning.

So the engineers building these things are like cartographers, mapping the terrain.

And the theorists are racing to draw the proper maps and figure out the underlying principles.

It's this really exciting interplay between practice and theory right now.

Empirical results are pushing the boundaries of understanding.

Wow.

What a story.

From Hubel and Wiesel watching cat neurons fire.

To Fukushima, building the Neocognitron based on that hierarchy.

To LeCun, creating LeNet with convolutions and backprop.

And finally, AlexNet, unleashing the power with GPUs and ImageNet.

It really shows how biology, math, engineering, and just plain curiosity come together.

Absolutely.

So you listening now have a much better sense of the elegant math and the fascinating history behind the eyes of a machine.

Think about it next time your phone recognizes your face.

Or you see object detection in a self-driving car demo.

All those layers we talked about are under the hood.

And maybe chew on this thought.

As these networks get bigger and better, the theory explaining why is still catching up.

What new math, what new insights might we need to fully unlock that mystery?

And what could that tell us about intelligence itself?

Both artificial and maybe even our own.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary
What this audio overview covers
Biological vision systems inspired the foundational principles underlying modern machine vision, beginning with Hubel and Wiesel's discovery that the mammalian visual cortex contains neurons responding selectively to oriented edges and localized features arranged in hierarchical layers. This neurobiological observation suggested that artificial systems could replicate visual perception by stacking layers of feature-detection mechanisms, each learning to recognize progressively more complex patterns. Fukushima's neocognitron translated these insights into a concrete computational model, proposing that networks organized in this hierarchical manner could perform object recognition without explicit programming of feature definitions. The approach gained practical traction through LeNet, developed by LeCun and colleagues, which applied backpropagation training to convolutional architectures for handwritten digit recognition and established that the biologically-inspired framework could achieve reliable performance on real tasks. The mathematical substrate underlying convolutional systems relies on learnable kernels that function as feature detectors, scanning input feature maps through convolution operations to identify relevant patterns. Spatial parameters like stride control the step size of this scanning process, padding modifies boundary conditions to preserve spatial dimensions, and max pooling reduces dimensionality while selecting the most activated features across local neighborhoods. These operations collectively produce receptive fields that grow larger in deeper layers and generate robustness to spatial transformations, allowing networks to recognize visual content regardless of position, size, or rotation within the image. 
The field underwent explosive expansion following AlexNet's dominant performance at the 2012 ImageNet competition, where Hinton, Krizhevsky, and Sutskever demonstrated that deep convolutional networks trained on graphics processing units could vastly surpass traditional computer vision approaches. This milestone established deep convolutional architectures as the dominant methodology for visual recognition and initiated the contemporary era of deep learning applications across imaging domains.
