Chapter 9: Medians and Order Statistics

Search this chapter

Audio Overview

0:00 / 0:00

Autoplay next chapter

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

All right, welcome back everyone to the deep dive.

You know, we often take a look at these kind of big abstract ideas, but today we're gonna zoom in on something that feels super basic at first glance.

Hmm, I'm intrigued.

What's got you so excited?

Well, today's all about finding a specific number in a set.

You know, like what's the smallest number or what's the biggest or even what's right in the middle of the median.

Okay, I see where you're going with this.

Sounds pretty straightforward.

You just sort the numbers and boom, you can pick out whatever you want.

Yeah, exactly.

Sorting is the obvious answer, but what if I told you there are ways to find that special number, like way faster, like without even bothering to sort the whole set?

Really, I mean, sorting feels pretty fundamental.

Like how much faster can you even get?

Well, that's what we're gonna dive into today.

We're talking about finding what are called order statistics, basically.

Finding an element based on its rank in the sorted order, but without the sorting overhead and get this, we can even do this in linear time.

Whoa, linear time.

That means the time it takes only grows proportionate to the number of elements.

It's like the holy grail of efficiency for algorithms, especially when you're dealing with huge data sets.

Exactly.

Like imagine you've got mountains of data, like millions or even billions of entries.

Do you really wanna waste time sorting the whole thing if you just need to know, say, the top 10 values or the exact median?

Yeah, that makes a lot of sense.

So where do we even begin with this kind of magic?

What are the techniques that let us skip the sorting step and still find exactly what we're looking for?

Okay, so to kick things off, we'll start with the basics, finding the minimum and maximum values.

It'll set the stage for understanding how we can get even more clever.

Sounds good to me.

Back to the fundamentals.

So imagine you've got a bunch of numbers, right?

How would you like naturally figure out the smallest one?

Well, I'd probably start by looking at the first number and then just go through the rest, keeping track of the smallest one I've seen so far.

Exactly.

Our sources tell us that this basic approach actually uses N1 comparisons.

You have N elements and you need to compare each one to your current smallest, except for the very first one.

And what's cool is that you literally can't do any better than that.

There's a proven lower bound.

You need at least N1 comparisons to guarantee you've found the absolute minimum.

That's like a tournament, right?

If you want one winner, everyone else has to lose at least one match.

Yeah, perfect analogy.

And finding the maximum works the same way, just flipped.

You keep track of the biggest number you encounter and again, you're stuck with those N1 comparisons.

So if we want both the minimum and the maximum, we just do both of those things.

That's what you might think, right?

And if we did them separately, that would cost us like two N2 comparisons, which isn't terrible, it's still linear time, so it scales well, but we can actually be even more efficient.

Wait, really?

How do you improve on something so basic?

The trick is to work with the numbers in pairs.

So instead of looking at them one by one, we grab two at a time.

Okay, I'm listening.

What do we do with these pairs?

Well, first, you compare the two numbers in the pair.

The smaller one is your potential new minimum and the bigger one is your potential new maximum, right?

Makes sense so far.

Then you compare the smaller one from the pair to your current minimum.

And you compare the bigger one from the pair to your current maximum.

Ah, I see.

So we're kind of doing two checks at once by leveraging the comparison we already did within the pair.

Exactly, for every two elements, we're only doing three comparisons total, one within the pair and then two more against our running minimum and maximum.

So we cut down the work significantly, especially as our set of numbers gets larger.

Okay, that's pretty clever.

But how do you get started?

Like, what are your initial minimum and maximum when you haven't looked at any numbers yet?

Good question.

If you have an odd number of elements, you can just take the very first one and say, okay, this is both my current minimum and my current maximum for now.

Then you proceed with the pairs.

But if you have an even number of elements, you start by comparing the first two.

Whichever is smaller becomes your initial minimum and the larger one becomes your initial maximum.

Then you carry on with the rest of the pairs.

Okay, so you handle that initial setup a little differently depending on whether you've got an odd or even number of elements.

Makes sense.

Right, and this whole strategy of processing in pairs means we only need at most three times the floor of N2 comparisons.

Much better than that two and two we had before.

It's awesome how even these seemingly simple tasks have room for optimization when you think about them a bit differently.

Totally.

All right, so we've got the extremes figured out, the smallest and the largest.

But what if we want something like in between, say the 10th smallest element or the median, that sweet spot right in the middle.

That feels like a whole different beast.

Yeah, it definitely seems more complex, but here's the mind blowing part.

We can still solve this, what's called the selection problem in linear time.

At least on average.

Oh really?

Like even though we're not sorting the entire set.

Yep, and the algorithm that does this is called randomized select.

As the name hints, there's a bit of randomness involved and it takes inspiration from the quick sort algorithm.

Okay, I'm starting to see a pattern here, optimization, randomness, cleverness, I like it.

Exactly, but randomized select is like quick sort's more focused cousin.

It's all about efficiency.

So quick sort recursively sorts both sides of a partition, right?

But randomized select just focuses on the side that has the element we actually want.

Ah, so it's like it's not getting sidetracked by sorting parts of the data that don't matter for our specific goal, makes sense.

Exactly, and the way it works is, first, there's this super simple base case.

Like if you're only looking at one element, well, that's gotta be your answer, no matter what rank you're looking for.

But if you have more than one element, you randomly pick one to be your pivot.

This is super important that random choice helps avoid those worst case scenarios where quick sort can get bogged down.

Okay, so randomness for the win again.

But once we've got our pivot, what do we do with it?

So just like in quick sort, you use a partitioning step, you rearrange the elements so that everything less than or equal to the pivot is on one side, and everything greater than or equal to the pivot is on the other side.

And after this partitioning, the pivot is sitting right where it belongs in the sorted order.

Got it, that's the same partitioning idea from quick sort.

But what happens next?

We've got our pivot nicely in place, but how do we actually find that smallest element?

Well, now comes the fun part.

After partitioning, you know the rank of your pivot.

Let's say there are k elements smaller than or equal to the pivot.

If i, the rank we're looking for happens to be equal to k, then boom, our pivot is the answer.

We're done with this part.

Okay, that makes sense.

But what if i isn't equal to k?

What if it's smaller or larger?

Right, so if i is smaller than k, then our target element must be somewhere in that group of elements smaller than the pivot.

So you make a recursive call to randomize select, but this time you only focus on that smaller group.

And if i is larger than k, then the target must be in the group of elements larger than the pivot.

Do you make a recursive call on that larger group?

But you gotta adjust the rank you're looking for because you've already accounted for the pivot and all the elements smaller than it.

I am following.

So we're basically using the pivot to narrow down our search and zero in on the right spot.

But you said this has a linear running time even though we could pick a bad pivot.

How does that work out?

The key is that on average, the pivot is gonna be decent.

Like it's not always gonna be the absolute best splitting the data perfectly in half, but there's at least a 50 -50 chance that it'll divide things up well enough to make significant progress.

And even if we get unlucky a few times and pick bad pivots, the good ones statistically make up for it.

If you dig into the math a bit, it turns out the expected number of comparisons randomized select makes is proportional to N, the number of elements.

That's why it's got an expected running time at DIN, which is awesome.

I love how these algorithms combine randomness and cleverness to achieve such efficiency.

It's really cool.

But you also mentioned there's another algorithm, Select, that guarantees linear time even in the worst -case scenario.

That sounds like a real triumph of computer science.

Oh, absolutely.

Select is a remarkable algorithm.

It's a bit more intricate than randomized Select and might not be the go -to choice for everyday use because it has a slightly higher constant factor in its complexity.

But theoretically, it's a game -changer.

It proves that you can always find the ill -cellus -mellus element in linear time, no matter what, no randomness, no luck involved.

Wow, no randomness.

So how does it work?

What's the secret sauce that makes it so reliable?

The key is in how it picks the pivot.

It goes through this multi -step process to find a pivot that guarantees a good enough split every single time.

Ready for it.

Lay it on me.

I'm all ears.

All right, so first things first, if you don't have a number of elements that's divisible by five, you repeatedly find and remove the smallest element until you do.

That's step one.

Okay, so we're setting things up nicely.

What's next?

Then you divide your elements into groups of five.

Each group has exactly five elements in it.

Got it.

Little groups of five.

Then what?

Now you sort each of these groups.

Since they're so small, sorting is super fast.

You can just use a simple sorting algorithm like insertion sorting.

Right, makes sense.

We're not talking about sorting millions of numbers here, just five at a time.

Exactly.

And once each group is sorted, you find the median of each group, which is just the middle element.

Okay, so now we've got all these medians, one from each of our groups of five.

What do we do with them?

Here's where it gets really cool.

You take all those medians and recursively call the selected algorithm on them.

You're basically finding the median of the medians.

Whoa, hold on.

So we're using the algorithm itself to pick the pivot.

That sounds like it could take a while, right?

It might seem that way, but it's totally worth it.

This median of medians is the key to guaranteeing that good split we were talking about.

Because of how it's chosen, it's gonna have a substantial number of elements, both smaller than it and larger than it.

It's like a really well -balanced pivot.

Okay, I'm starting to see the magic.

So once we've got this median of medians pivot, what do we do with it?

Same as before you partition the original array around it.

Everything smaller goes to one side, everything larger goes to the other.

Then you figure out the rank of the pivot,

and based on that and the rank you're looking for, you make a recursive call to select on either the smaller or larger portion, adjusting your target rank accordingly.

Okay, I get the steps, but how does this special pivot guarantee that the algorithm runs in linear time?

The key is that because of how the median of medians is chosen, you can mathematically prove that the subproblem you're recursively calling select on will always be significantly smaller than the original problem.

In fact, it'll be at most 70 % of the original size.

Think about it, at least half of the groups of five have a median less than or equal to the median of medians, right?

And within each of those groups, there are at least two other elements that are also smaller than or equal to their median, and therefore smaller than or equal to the median of medians.

This means at least roughly half the groups contribute at least three elements that are smaller than or equal to the median of medians.

The same logic applies for finding elements larger than the median of medians.

Ah, okay, so it's like this median of medians creates a kind of natural dividing line, ensuring that you're always making significant progress with each recursive step.

Exactly, because of this property, the worst case running time can be described by a recurrence relation, which if you solve it, turns out to be variant.

That's why select guarantees linear time, even in the worst possible case.

Wow, that's so clever.

So both randomize select and select figure out the order of elements just by comparing them, right?

They don't need to know anything specific about the data itself.

You got it.

They both work within what's called the comparison model.

This means they only use comparisons to determine the order, nothing else.

And that's interesting because it contrasts with sorting algorithms, right?

Like the best sorting algorithms have a lower bound of n log n comparisons.

Exactly, that's one of the reasons why these selection algorithms can achieve linear time.

They're not trying to sort the entire set, just find one specific element.

And that more focused goal lets them be way more efficient.

Okay, so just to recap, we've got this super efficient method for finding the minimum and maximum, a randomized algorithm that's usually really fast for finding any i -th element, and a more complex but guaranteed linear time algorithm for the same problem.

And all of this without ever needing to fully sort the data.

That's right.

We've explored a whole bunch of clever techniques for finding those specific elements, those order statistics that we're looking for.

And it all comes down to thinking about the problem a bit differently, using randomness strategically, and sometimes even using recursion in unexpected ways.

Yeah, it's amazing how much depth there is to a problem that seems so simple on the surface.

It shows that even the most fundamental tasks can have surprising solutions when you dig a little deeper.

Absolutely, and our source material probably goes into even more detail about all sorts of interesting variations and related problems.

You could explore things like finding the second smallest element with the fewest possible comparisons, or tackling puzzles like the classic racehorse problem, which can actually be solved using selection algorithms.

There's also a lot more to uncover about the probabilities at play in randomized select.

And you could explore different tweaks to the select algorithm, like using groups of different sizes instead of five.

And there are all sorts of specialized selection problems, like finding weighted medians, or efficiently identifying very small or very large order statistics, or even figuring out the median of two arrays that are already sorted.

It's amazing how this one concept, finding the fewest smallest element, branches out into so many fascinating directions.

It really highlights the beauty and depth of algorithms.

Couldn't agree more.

It's a testament to the power of creative problem solving and the elegance of mathematical thinking.

Well, that brings us to the end of our deep dive into selection algorithms.

Thanks for being our guide through this fascinating world of efficient solutions.

It was my pleasure.

Always happy to geek out about algorithms.

And to our listeners, as you navigate the ever -growing sea of data in your own lives, remember this sometimes, you don't need to know everything.

Sometimes just pinpointing that key value, finding that middle ground, or understanding where a particular data point falls within the grand scheme of things can be more valuable than having the entire data set perfectly ordered.

So the next time you're faced with a mountain of information, think about the power of selection algorithms.

And as always, stay curious.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers

Finding the ith smallest element in an unordered collection, known as the selection problem, represents a fundamentally different computational challenge from sorting, one that permits linear time solutions without exhaustively ordering all data. The minimal cases establish the baseline: identifying a single extremum requires precisely n–1 comparisons, while locating both minimum and maximum simultaneously demands only 3⌊n/2⌋ comparisons through pairwise processing where smaller elements are tested against the running minimum and larger elements against the running maximum. The randomized approach to general selection adopts the partitioning framework familiar from quicksort by selecting a random pivot, partitioning the array around it, and recursing into only the side containing the target position, producing an expected time complexity of Θ(n) because a sufficiently balanced pivot emerges with probability at least one half. For scenarios requiring guaranteed performance bounds regardless of input distribution, a deterministic method achieves Θ(n) worst-case time by constructing a reliable pivot through a multi-step process: grouping elements into sets of five, computing the median within each group, recursively identifying the median of all these medians, and using that element as the partition pivot, this strategy ensures that every recursion eliminates a constant proportion of remaining elements and generates the recurrence T(n) ≤ T(n/5) + T(7n/10) + Θ(n), which resolves to linear time. The randomized variant excels in practical settings through its simplicity and lower constants, while the deterministic guarantee remains theoretically elegant and input-independent. A crucial distinction separates selection from sorting: no Ω(n log n) lower bound constrains the selection problem in the comparison model, establishing that identifying an element at a specific rank is inherently easier than arranging all elements in sorted order.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥

Chapter 9: Medians and Order Statistics

Related Chapters