Chapter 33: Machine-Learning Algorithms: Clustering, Gradient Descent, and Weights
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement not replaced the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
All right, welcome to the deep dive.
You know why you're here, right?
You want to get up to speed on some seriously important stuff and you want to do it fast.
Well, today we're going to deep dive into machine learning.
It's basically all about getting computers to learn from data to improve themselves.
Imagine a supercharged way of gaining knowledge and it brings together some pretty cool stuff from AI, you get statistics, even optimization techniques.
Yeah, exactly.
And what I think is really remarkable about machine learning is that
it can actually handle problems where the rules aren't really that clear, where there's a lot of uncertainty involved.
And that's actually why you see it popping up everywhere in all sorts of different fields like medical diagnosis, helping doctors make better decisions.
And even when you see those targeted ads online, yeah, that's machine learning working behind the scenes, trying to figure out what you might be interested in.
Basically, it's all about finding patterns and making predictions in these really complex situations.
Okay, so for today's deep dive, we're going to really focus on giving you a good understanding, a solid understanding of three really, really fundamental machine learning algorithms.
We've got this awesome resource here and it breaks down some pretty cool stuff.
Keras means clustering, multiplicative weights and gradient descent.
Think of this as your cheat sheet to get a handle on some of the core concepts behind machine learning.
And just to give you a little bit of context along the way, we'll be touching on these main types of machine learning.
We've got supervised learning first.
This is where the computer learns from data that's already been labeled.
When you think about identifying spam emails, each email has already been marked as spam or not spam.
And then we've got unsupervised learning.
Here, the goal is to find hidden structures and data that hasn't been labeled.
And guess what?
Our first algorithm, clustering, falls right into this category.
Oh, and then there's reinforcement learning.
This is where an agent learns by interacting with its environment and getting feedback.
But we won't be going into too much detail about that one today.
Awesome.
So let's jump right in.
Let's kick things off with clustering.
Imagine you have just this
mountain of data points and you want to organize them into meaningful groups based on how similar they are.
That's essentially what clustering is all about.
So you've got end data points and you want to divide them into, at most, many groups.
And the idea is the items within each group, they should be more similar to each other than items that are in different groups.
Right, exactly.
And to really make this idea of similarity measurable, we often represent each data point as a feature vector.
Think of it kind of like a digital profile with d characteristics, or you could call them attributes by one, by two, and so on, up to xd.
So now, how do we figure out how similar two data points are?
Let's say we have x and y.
Well, one way to do that is to look at how dissimilar they are.
And a very common way to measure this dissimilarity is by using something called the squared Euclidean distance.
Squared Euclidean distance.
Can you give us the quick and dirty on that one?
Oh, absolutely.
It's basically like a way to calculate the total square difference between the corresponding attributes of these two data points.
So for example, let's say you're comparing two different products and you're basing this comparison on price and weight.
You'd find the square difference in their prices, the square difference in their weights, and then just add those two squared values together.
The formula itself looks something like this.
Delta of x, y equals the norm of x minus y squared, which also equals the sum from a equals one to d of a of a minus y squared.
So basically, the bigger this number is, the more different the two points are.
Gotcha.
So it's going to like giving us a numerical way to see how far apart two data points are in our future space.
Okay, so before we even start grouping these data points, the chapter emphasizes something called data pre -processing.
Why is that so important?
That is a really, critical step because, you know, when you deal with real world data, it can be pretty messy, to be honest.
You might have some information missing for certain attributes or, you know, the different attributes might be measured on completely different scales.
For example, you know, imagine you're clustering people based on age, which might range from like one to 100 years and income, which could range from maybe $10 ,000 to a million dollars.
You can see how the values in the distance calculation, even if age is like way more important for the kind of clusters you're trying to create.
Right.
So like a huge difference in income could just completely mask a tiny little difference in age, even though age is actually more relevant for our grouping.
Exactly.
That's precisely why we need data pre -processing.
It helps make sure that each attribute gets to contribute fairly to the distance calculations.
So if you have missing values, what you could do is, you know, just ignore those data points altogether.
Or you could try filling in that missing value with something like the median value for that attribute across all the other data points.
Now for scaling and normalization, you could transform all the values for a specific attribute so that they fit within a particular range, maybe between zero and one.
Or you could standardize them, you know, make them have a mean of zero and a standard deviation of one.
This way, no single attribute can just dominate those distance calculations, just because it has a large numerical range.
That's a lot of sense.
So, okay, we've cleaned our data, we've pre -processed it, and now we're ready to divide our set S of end points into those distinct subsets.
Those are our clusters, S1, S2, up to SX.
And this whole process, this is called K -clustering.
Now how do we actually go about finding a good K -clustering?
Well, that's where the K means problem comes in.
The whole point is to find K representative points, and we call these centers, which we can denote as C equals C1, C2, all the way to care, such that the total distance of each data point in our set to its closest center is as small as possible.
So the function we're trying to minimize here is F of S, C equals the sum over all X and S of the minimum over J from one to K of the distance between X and CJ.
And this can also be written as the sum from L equals one to K of the sum over all X and SL of the distance between X and CL.
Basically, you can think of it as we're trying to find those K average points that best represent, you know, the clusters in our data.
All right.
So we're basically looking for these K central points, and they should minimize how spread out the data is within each cluster.
And now the chapter also mentions that this K means problem, it's NP -hard.
What does that mean in practice?
So NP -hard, what it essentially means is that finding the absolute best set of K centers, it's a real computational challenge, especially when you're dealing with massive amounts of data and a large number of clusters.
There isn't a known algorithm that can guarantee finding the single best solution in a reasonable amount of time for these really big problems.
So instead of aiming for perfection, we often turn to algorithms that try to find a good enough solution, you know, something called a local minimum.
So that brings us to Lloyd's procedure, right?
This is like the go -to algorithm for actually trying to solve this K means problem.
Can you walk us through how it works?
Absolutely.
Lloyd's procedure, it's an iterative process.
So you start off by picking an initial set of K center points.
And a pretty common way to do this is to randomly select K data points from your data set and just use those as your initial centers.
Then you repeat two main steps, and you keep doing this until the clusters start changing.
Okay, so what's step number one?
Step one is all about taking each data point and assigning it to the cluster whose current center is closest to it based on that squared Euclidean distance we talked about earlier.
This is what we call the nearest center rule.
Now, if a data point happens to be the exact same distance from two or more centers, well, you can just pick one at random, it doesn't really matter.
And importantly, a data point will only switch clusters if the new center it's moving to is strictly closer than its current center.
So it's like each data point is looking around deciding which of those current centers it feels most connected to.
Okay, what about step two?
What happens then?
Step two is where we recalculate the centers for each cluster.
So for each of those K clusters, you take all the points assigned to it, and you calculate their centroid, it's basically just the average of all those points.
And you do this separately for each attribute.
So let's say we're looking at the eighth attribute, the new center for cluster L, which we can call C sub L sub A, it's simply the average of the eighth attribute values of all the data points in cluster SL.
And if we write this in vector form, it looks like this.
C superscript L equals one over the size of SL times the sum of all X in SL.
But hold on, there's a little catch here.
What if a cluster ends up with absolutely no data points assigned to it?
What do we do then?
Well, in that case, the chapter suggests setting its center to the zero vector.
Yeah, that makes sense.
So basically, we're shuffling the points around, putting them into the closest clusters, and then we're updating those center points based on the new clusters.
How do we know when we're done?
When do we stop this whole process?
So you stop when the data points stop switching clusters between iterations.
That means our clustering has stabilized.
And what's really cool is the chapter highlights that with every iteration of Lloyd's procedure, except maybe the very last one, the value of our objective function, which is that total squared distance to the nearest center, it actually goes down.
And this tells us that the algorithm is making progress.
It's getting closer to a local minimum.
A local minimum again, meaning it's not necessarily the absolute best clustering, but it's a good solution that the algorithm settled on.
Yeah, exactly.
And because Lloyd's procedure can sometimes get stuck in a local minimum, a common strategy is to just run it multiple times.
But each time, you start with a different initial center points.
Then you can compare the final results, looking at the value of that objective function for each run and just pick the clustering that gives you the lowest total squared distance.
Now, the computational cost for each iteration of Lloyd's procedure, it's O of DKN, where D is the number of attributes, K is the number of clusters, and N is the number of data points.
And if the algorithm takes T iterations to converge, then the overall time complexity becomes O of TDKN.
Gotcha.
So we've got a pretty solid grasp of how K means clustering works.
Now, the chapter switches gears and focuses on a different kind of problem involving multiplicative weights algorithms.
It sounds like something completely different.
Oh, it definitely is.
Here, we're moving away from, you know, just finding patterns in static data.
And we're stepping into scenarios where we need make a series of decisions over time, and we get feedback after each decision.
And these multiplicative weights algorithms, they're super useful in areas like game theory, online learning, and even for finding approximate solutions to some pretty tough optimization problems.
The chapter really focuses on their application and online learning.
Online learning.
So that implies we're learning as we go, right?
We're not starting with all the information up front.
Exactly.
Imagine a scenario where you need to predict a binary outcome.
Let's say whether an event will happen a one or not happen a zero.
And you need to do this at each step in time.
And let's say you have access to predictions from n different experts, and each expert has their own track record, you know, how accurate they've been in the past.
So this learning from experts problem, it's really about how you the learner can combine these expert predictions to make your own prediction, which we'll call P at time t.
And your goal is to minimize the total number of times you get it wrong, the total number of mistakes.
Okay, so it's like we're trying to be as good as the best expert, even though at the beginning, we have no idea who that best expert even is.
Exactly.
That's the challenge.
And to measure how well we're doing, we use this concept called regret.
Regret is basically the total number of mistakes we've made up to a specific point in time,
minus the total number of mistakes made by the single best expert in hindsight, the one who made the fewest errors over that entire sequence of predictions, we'll call that mum star.
So the whole idea is to come up with an algorithm that keeps this regret as small as possible.
So how do these multiplicative weights algorithms help us achieve that low regret in this whole learning from experts setup?
Well, the chapter introduces this really cool algorithm called the weighted majority algorithm.
The key idea here is to keep track of a weight for each expert.
At the very beginning, you know, we don't know who's best.
So we assign every expert a weight of one.
That means we trust them all equally at the start.
Then at each time step, we look at the predictions made by all the experts, right?
And we calculate the sum of the weights for all those experts who are predicting a one, we'll call this up weight at time t.
And we also calculate the sum of the weights for those predicting a zero, which we'll call down weight at time t.
So then we make our prediction based on these weighted sums, right?
Exactly.
If up weight at time t is bigger than down weight at time t, we predict a one.
Otherwise, if down weight at time t is bigger than or equal to up weight at time t, and we can have a rule to break ties consistently, we predict a zero.
Basically, we're going with the prediction that has more weight backing it up, you know, more support from the experts we trust.
That makes sense intuitively, right?
The experts who have been more accurate in the past, they'll have higher weights.
So they'll have more influence on our current prediction.
What happens after we've made our prediction and we see what the actual outcome was?
Well, that's where the multiplicative part of the algorithm comes in.
Once we observe the true outcome o at time t, which is either a zero or a one, we update those expert weights.
So if an expert made a wrong prediction q sub i at time t is not equal to o at time t, we take their current weight and we multiply it by a factor of one minus epsilon, where epsilon is a parameter we choose beforehand, and it's somewhere between zero and one half.
Now, if an expert got it right, their weight stays the same.
Ah, so we're punishing the experts who made mistakes, reducing their influence on our future predictions.
What kind of performance can we expect from this algorithm?
Does it actually work well?
Well, there's a theorem in the chapter, theorem 33 .4, and it actually gives us a performance guarantee.
It states that for any expert EI and for any time period t zero, the total number of mistakes made by the weighted majority algorithm M of t zero is at most 2 .1 plus epsilon times i of t zero plus 2 times the natural log of n divided by epsilon.
Now, what this really tells us is that our algorithm's performance in terms of the total number of wrong predictions, it won't be much worse than the performance of any single expert, even the best one in hindsight.
And that term with the natural log, that basically accounts for the fact that we start off not knowing which expert is best, and we need to learn the reliability as we go along.
Wow, that's a pretty powerful result.
So we don't need to know from the start who's the best.
The algorithm will learn to trust more accurate experts over time.
Now, the chapter also mentioned something called a randomized weighted majority algorithm.
How is that different?
Oh, that's an interesting variation.
In the randomized version, instead of always following the weighted majority, you know, the prediction with the most weight behind it, we actually treat those weights as a probability distribution over the experts.
So at each time step, we randomly pick an expert to follow based on their current weight.
And obviously, experts with higher weights have a higher chance of being chosen.
And the cool thing is, the expected number of mistakes for this randomized approach, it actually has a slightly better bound.
It's at most one plus epsilon times m star, plus the natural log of n divided by epsilon, where epsilon is between zero and one half.
And m star is, you know, the number of mistakes made by the best expert in hindsight.
That's interesting.
So adding a little bit of randomness to our decision making can actually improve performance and expectation.
All right, so we've covered clustering, we've talked about learning from experts.
And now let's move on to our final topic.
Gradient descent.
This seems like a really fundamental technique for optimization.
Oh, it absolutely is gradient descent.
It's like a cornerstone for so many different machine learning algorithms.
It's a very general,
very widely used iterative method for finding a local minimum of a function.
Let's call this function f.
It takes multiple input variables and spits out a single output.
So we can write this mathematically as f r n to r.
And for those special functions that have this property called convexity, gradient descent can even find a point that's really close to the global minimum, which is, you know, the absolute best possible value for that function.
The basic idea is pretty intuitive, actually.
It's all about taking small steps in the direction that leads to the steepest decrease in the function's value, like going downhill.
In that direction, that's given by the gradient of the function, right?
You got it.
The gradient of a function f at a particular point x, which we note as nabla f of x, it's basically a vector that contains all the partial derivatives of f with respect to each of its input variables.
So let's say our input is x equals by one by two up to x n.
Then the gradient nabla f of x is the partial derivative of f with respect to by one, the partial derivative of f with respect to by two, all the way to the partial derivative of f with respect to x n.
And the important thing to remember is that the gradient vector, it points in the direction of the greatest rate of increase of the function at that point.
So if we want to decrease the function's value, we need to move in the opposite direction of the gradient.
Precisely.
And that's the main idea behind unconstrained gradient descent.
We start off with an initial guess for where the minimum might be.
We'll call it x zero.
Then we iteratively update our guess by taking a small step in the negative direction of the gradient at that point.
So we get x at t plus one equals x at t minus eta times nabla f of x at t.
Now, eta here, it's a small positive value and we call it the step size or the learning rate.
And choosing the right eta is super important.
If it's too big, we might overshoot that minimum and just bounce around it.
But if it's too small, it might take forever for the algorithm to converge.
So we're basically taking these little downhill steps until we hopefully hit a minimum point of the function.
How do we know when to stop?
When are we done?
Typically, we run the algorithm for certain number of iterations that we set beforehand, maybe t iterations or until some condition is met, a stopping condition.
And common stopping conditions are when the magnitude of the gradient gets really small, meaning we're in a pretty flat area of the function and not making much progress, or when the change in the function's value between iterations is super tiny.
The chapter also mentions that sometimes the final output of the algorithm is just the average of all the points we visited during those iterations.
So x average equals 1 over t times the sum from t equals 0 to t minus 1 of x at t.
Now, you mentioned earlier that convex functions are especially well -suited for gradient descent.
What's so special about them in this context?
A convex function, imagine you draw its graph, right?
If you pick any two points on that graph and connect them with a straight line, that line segment will always lie above or right on the graph itself.
Mathematically, we can express this with the inequality.
f of lambda times x plus 1 minus lambda times y is less than or equal to lambda times f of x plus 1 minus lambda times f of y.
And this holds for any input points x and y in the function's domain and for any lambda between 0 and 1.
Now, the really great thing about convex functions and optimization is that any local minimum you find, it's also a global minimum.
So if gradient descent hits a point where the function's value stops decreasing, you know for sure that you found the absolute best solution for that function.
Wow, that's a huge advantage.
So for these convex functions, gradient descent is much more likely to lead us to the true best solution.
Now, are there any guarantees about how well it will perform?
Can we say anything about its convergence?
Yes, actually.
There's a theorem in the chapter, Theorem 33 .8, and it provides a convergence result for convex functions.
So let's say x star is the point that minimizes our
R is the initial distance between our starting point x0 and that minimizer x star.
And L is a value that puts a limit on how big the gradient can get during the iterations.
If we carefully choose our step size, eta, to be R divided by L times the square root of t, then the average of those first t points we visit, x average, will satisfy.
f of x average minus f of x star is less than or equal to RL divided by the square root of t.
What this really means is that if we want to have a certain level of accuracy, epsilon, meaning we want f of x average to be within epsilon of the optimal value, f of x star, then the number of iterations t we need to run is roughly proportional to R squared times L squared divided by epsilon squared.
Okay, so to get a more accurate solution, we might need to run the algorithm for a longer time for more iterations.
Now, the chapter also talks about something called constrained gradient descent.
How is that different from the unconstrained version?
Right.
So constrained gradient descent comes in when we want to minimize a function, f of x, but not over the entire space of possible input values.
We want to restrict our search to a specific region, which we'll call k.
In this region k, we assume it's a closed convex body.
You can think of it as like a well -behaved enclosed area, no holes, no sharp edges, that kind of thing.
Now, the update rule in constrained gradient descent is a bit different.
First, we take a regular gradient descent step, just like before, x prime at t plus 1 equals x at t minus eta times f of x at t.
But here's the catch.
The resulting point, x prime at t plus 1, it might fall outside of our allowed region k.
So then we have to project this point back onto the closest point within k to get our next iterate.
So we get x at t plus 1 equals pi sub k of x prime at t plus 1.
Pi sub k represents that projection operation, kind of like nudging the point back into the allowed region.
And there's a theorem in the chapter, theorem 33 .11, and it tells us that constrained version of gradient descent has a similar asymptotic computational complexity to the unconstrained version.
So it's like taking our downhill step, and then if we've accidentally stepped outside the boundaries, we just got to push ourselves back in.
Now, what are some of the really important applications of gradient descent in machine learning?
Where do we actually see it being used?
Gradient descent, it's such a versatile tool, and it has tons of applications.
One really important area is in solving linear systems of equations, you know, those equations look like A equals B, where A is a positive semi -definite matrix.
It turns out that finding the solution to this system is the same as minimizing a particular convex function.
F of x equals 1 half x transpose A x minus B transpose x.
Another very common application is in linear regression.
Here, we have a bunch of data points, and we want to find a linear function f of x equals w0 plus the sum from j equals 1 to n of wjxj that best fits those points.
And how do we do this?
Well, we define a loss function that measures how well our linear function predicts the output values given the input data.
Then we use gradient descent to find the best values for those weights, w0, w1, up to o n, that minimize this loss function.
A really popular loss function is the sum of the squared errors between the predicted values and the actual values.
And if we want to prevent our linear regression model from overfitting the training data, we can actually put constraints on how big those weights can get.
And this would then fall under the umbrella of constrained gradient descent.
Wow, that was a really thorough overview of three incredibly important machine learning algorithms.
We talked about k -means clustering for finding those hidden groups in data,
multiplicative weights for smartly combining expert advice over time, and gradient descent, which is this incredibly powerful and widely used optimization technique.
Exactly.
And you can really see how even though these algorithms are designed for different types of problems, they all represent essential strategies for enabling computers to learn from data and tackle those really complex tasks.
And for you, our listener who's always eager to understand complex topics efficiently, grasping these fundamental algorithms provides a really strong foundation for delving deeper into the fascinating and ever -evolving world of machine learning.
And this leads us to a final thought for you to ponder.
With the amount of data in our world growing exponentially, what unexpected problems or challenges do you think these machine learning approaches might be applied to in the future?
What surprising innovations could they help to create?
It's definitely something to think about.
Absolutely.
Well, this brings us to the end of our deep dive into k -means clustering, multiplicative weights, and gradient descent.
We've explored all the key concepts, theories, findings, and examples related to these three fundamental machine learning algorithms as presented in the chapter.
Thanks for joining us on the deep dive.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥