Chapter 11: Hash Tables: Hash Functions, Open Addressing, and Collision Handling
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Alright, so today we are going deep on a topic that is, well, it's kind of hidden, but it's absolutely crucial to how a lot of our digital world actually works.
We're talking about hash tables.
Yeah, hash tables, they're kind of a workhorse data structure.
They're behind the scenes in a ton of different systems, you know, anything where you need to store and retrieve information quickly and efficiently.
What really struck me, you know, going through the research for this, is how often we're actually interacting with them without even realizing it.
Like, just think about using a Python dictionary.
Oh yeah, absolutely.
Python dictionaries, they're like the poster child for hash tables.
When you use a dictionary in Python, under the hood, it's a hash table doing all the heavy lifting.
That's why they feel so fast most of the time.
So let's get into it, right?
What makes these hash tables so good at those core operations, those dictionary operations?
Inserting new data, searching for existing data, and deleting data.
Well, it all boils down to this concept of direct addressing.
You can kind of think of it like this.
Imagine if the key to your information, like a name or an ID number, was literally the address where that information is stored in memory.
So it's like, if I need to find Bob's phone number, and Bob is the key,
the computer could just jump directly to the spot where that phone number is stored.
Right, exactly.
And in that scenario, finding what you need would be incredibly fast, almost instantaneous.
We call that constant time, or O1, in the language of algorithms.
No matter how much data you have, the time to find something stays the same.
That sounds amazing.
But I guess there's got to be a catch, right?
The universe of possible keys, like all the names in the world or all the product codes, that's got to be huge, way bigger than the amount of data we actually need to store most of the time.
Yeah, you hit the nail on the head.
That's the big challenge with direct addressing.
It's just not practical when the set of possible keys is absolutely massive compared to the actual number of things we need to store.
Think about trying to allocate memory for every single possible social security number just to store a few thousand employee records.
Yeah, that would be like building a warehouse the size of a city just to store a few boxes of stuff.
It just wouldn't make sense.
Exactly.
And that's where hash tables come in.
They give us a way to use a much smaller, more reasonable amount of storage space while still getting that near instant access most of the time.
So they're like a space -saving magic trick.
How do they actually pull that off?
The key ingredient is something called a hash function, usually denoted as HK, where K represents the key.
This function takes your key, whatever it might be, and it transforms it into an index.
That index corresponds to a slot in a table, our hash table.
So it's like this function takes something that could be anything, like a name or a product code, and it crunches it down into a number that tells us where to look for the corresponding data in our table.
Exactly.
That's the beauty of it.
And the result of the hash function, that index or slot number, we call that the hash value.
It's like the assigned address for that piece of data in our more compact storage space.
All right.
So we've got this function that tells us where to put things in our table.
But what happens when two different keys end up getting mapped to the same slot?
I mean, we can't just shove two different pieces of data into the same spot, right?
That's the million -dollar question.
And that's where we get into the world of collisions.
When two or more different keys produce the same hash value, we've got a collision.
It's like two different people showing up at the same apartment number.
Very awkward.
And I'm guessing this is where things start to get a little tricky, right?
Yeah.
Dealing with collisions is a key part of making hash tables work effectively.
Luckily, there are well -established strategies for managing this same apartment number problem.
And our sources dive into two main approaches, chaining and open addressing.
They each have their own strengths and weaknesses, and we'll go through both so you can really get a feel for how they work.
OK.
So let's start with chaining.
How does that handle these collisions?
So with chaining, you don't just store one piece of data at each slot in your table.
Instead, you have a linked list at each slot.
So if multiple keys map to the same slot, they just get added to that linked list.
So it's like each slot in our table becomes the head of a conga line, and all the keys that hash to that slot just join the line.
Yeah.
That's a good way to think about it.
Now, when you want to insert something new, you hash its key to find the right slot, and then you just add it to the beginning of that linked list.
That's really fast, takes constant time, 01 in the worst case.
So inserting new data is a breeze.
But what about when you need to find something?
You hash the key to get the slot, but then you've got to go through that whole conga line, right?
Right.
Exactly.
You have to search through the linked list at that slot, checking each item until you find the key you're looking for.
So if the list gets really long, the search time can become proportional to the length of that list.
That's the worst case scenario.
And what about deleting something?
Is that a conga line disruption?
Actually, deleting can be pretty efficient with chaining, especially if you use a doubly linked list.
That means each item in the list knows the items before and after it.
So if you have a pointer directly to the item you want to delete, you can just update a couple of pointers and remove it from the list.
It's like a quick surgical removal from the conga line.
So we've got these lists at each slot, and it seems like the length of those lists is going to be pretty important for how quickly we can find things.
That brings us to the concept of the load factor, right?
Exactly.
The load factor, often represented by the Greek letter alpha, basically tells us how full our hash table is on average.
It's the ratio of the number of elements stored in the table and divided by the total number of slots, m.
So it's alpha equals n over m.
And this load factor is a key indicator of the average case performance we can expect with chaining, isn't it?
It is.
There's this ideal scenario called independent uniform hashing.
That's where we assume each key has an equal chance of ending up in any slot, and those chances are independent for all the keys.
In this scenario, the average time to search for something in a hash table using chaining is what we call theta of 1 plus alpha.
Okay, so I'm hearing a couple of Greek letters here.
What's the big takeaway?
The bottom line is, if we can keep that load factor relatively small, then the average search time stays close to that ideal O1 performance.
And even if the load factor does get a little bigger, the performance degrades gracefully.
So chaining can still give us that fast average access, even when we have collisions, which is why it's used in so many different applications.
So chaining sounds pretty neat.
It's like a way to organize things efficiently, even if we have these little collisions along the way.
But let's talk about the other strategy our sources dive into.
Open addressing.
This one sounds a little more, well, direct.
Yeah, open addressing is a different beast altogether.
The key difference is that we don't use any external structures like linked lists.
All the data is stored directly within the hash table itself.
Okay, so that sounds simple enough at first.
But what happens when there's a collision?
If a new key wants to go into a slot that's already taken, what do we do?
That's where the idea of probing comes into play.
When we try to insert a key and its hashed slot is already taken, we don't just give up.
Instead, we systematically check other slots in the table until we find one that's empty.
So it's like if our preferred parking spot is taken, we keep driving around the lot until we find an open space.
But there's got to be some method to this madness, right?
We can't just be driving around randomly.
Absolutely.
The sequence of slots we check is called the probe sequence, and it's determined by the key itself and something called a probe number.
For any given key, K, the probe sequence denoted as HK0, HK1, HKM1, must be a permutation of all the slots in the table from 0 to M minus 1.
Okay, so this probe sequence ensures that if there's an empty slot anywhere in the table, we'll eventually find it.
It's like a systematic search of the entire parking lot.
What about when we need to find something that's already in the table?
We follow the same probe sequence that was used when that key was originally inserted.
We start at the initial hashed slot and then follow the sequence.
If we find the key, great.
Search successful.
If we hit an empty slot, it means the key was never inserted, so the search is unsuccessful.
Okay, so inserting and searching make sense.
But our sources mention that deleting an element in open addressing can be a little tricky.
What's the catch there?
This is a really important point.
If we just mark a deleted slot as empty, let's say by setting its value to nil, it can actually mess up the probe sequences for other keys that were inserted after the deleted element.
Imagine someone parked their car in that empty spot after we just drove through.
So if we later try to find one of those keys, we might hit that empty slot and think, oh, it's not here when it actually is somewhere else in the table.
It's like a phantom parking spot.
So to avoid this, we often use a special marker, often called deleted, to indicate that a slot used to hold something but is now empty.
When we're searching, we treat these deleted slots as occupied and keep following the probe sequence.
But when we're inserting, we can treat them as empty slots and place a new key there.
That's clever.
But having too many of these deleted markers lying around could slow things down, right?
Yeah, it's a trade -off.
Too many deleted slots mean more probing during searches.
Interestingly, our sources pointed out that one specific type of open addressing, called linear probing, has a way to delete elements without using this deleted marker, which is a nice little optimization.
All right.
So we've talked about this probe sequence quite a bit.
What are some of the ways we can actually generate these sequences in practice?
There are a couple of common techniques.
The simplest one is called linear probing.
The probe sequence is calculated as HKI equals H1K plus I modulo M.
Basically, you start with an initial hash value, H1K, and then for each subsequent probe, you just add one and take the result modulo of the table size.
So you're essentially just checking the next slot over and over again, wrapping around if you hit the end of the table.
So it's like, nope, this spot's taken.
Let's try the next one and the next one and the next one.
Simple enough.
But our sources also mentioned something called primary clustering.
What's that all about?
Primary clustering is kind of the Achilles heel of linear probing.
If you get a bunch of consecutive slots that are already occupied, any new key that hashes to one of those slots will end up being placed at the very end of that run, making the run even longer.
It's like everyone's trying to squeeze into the same row of parking spaces.
So you end up with these long clusters of filled slots.
And that can slow down searches because you might have to skip over a whole bunch of occupied spots just to find an empty one.
It's like rush hour in the parking lot.
Yeah, that's a good analogy.
However, linear probing does have a potential advantage.
Because the probes are sequential in memory, it can sometimes be good for cache performance.
Modern computers have these different levels of memory, and keeping the data you're accessing close together in those faster memory levels can speed things up.
So there's a trade -off there.
Simplicity and potentially better cache behavior versus this clustering issue.
What's another probing technique that tries to address that clustering problem?
That would be double hashing.
With double hashing, the probe sequence is defined as h -k -i equals h -1 -k plus i times h -2 -k, where h -1 -k and h -2 -k are two different hash functions.
Wait, two hash functions?
Why do we need two?
The second hash function, h -2 -k, determines the step size between probes.
So instead of just checking the next slot over each time, we take jumps of different sizes depending on the key.
That helps to break up those long clusters that can form with linear probing.
Okay, so it's like instead of driving to the next spot over, we consult the second map that tells us to jump ahead a few spaces, then maybe a few more, and so on.
That makes sense.
Are there any special rules for this second hash function?
Yeah, for double hashing to work properly, h -2 -k needs to be relatively prime to the table size.
That means their greatest common divisor should be 1.
This ensures that we have the potential to check every slot in the table if we need to.
One common way to do this is to make them a power of 2 and design h -2 -k to always produce an odd number that's less than 1.
Another way is to make them prime number and have h -2 -k return a positive integer less than 1.
Gotcha.
So those are some of the ways to actually generate these probe sequences.
But how do open addressing techniques perform on average?
I mean, we're still dealing with collisions, right?
Right.
Under a somewhat idealized scenario called independent uniform permutation hashing, which basically means we assume all possible probe sequences are equally likely, the average number of probes for an unsuccessful search is bounded by 1 divided by 1 minus alpha.
And for a successful search, it's bounded by 1 alpha times the natural logarithm of 1, 1 alpha.
Again, alpha is the load factor.
So the takeaway is, as long as we keep the table from getting too full,
open addressing can be very efficient on average, even with collisions.
Exactly.
But, you know, something we haven't talked about yet is how important the hash function itself is.
The way we map those keys to slots in the first place plays a huge role in how well the whole system performs.
Right.
I was just thinking that.
I mean, if we have a bad hash function that clumps a bunch of keys together in the same few slots, it doesn't matter how fancy our collision resolution strategy is, we're going to have a bad time.
Absolutely.
The goal of a good hash function is to approximate that ideal scenario we mentioned earlier, independent uniform hashing.
We want each key to have an equal chance of ending up in any slot, and we want those chances to be independent for all the keys.
A well -designed hash function will minimize the number of collisions we get in the first place.
So how do we go about choosing or designing these magical hash functions?
Well, there are a few different approaches.
One category is called static hashing.
Static hashing means we pick one hash function and stick with it for the entire lifetime of the hash table.
Simple enough.
What are some common static hashing methods?
One is the division method, which is super straightforward.
The hash function is simply HK equals K modulo M, where M is the table size.
For this method, it's generally a good idea to choose me to be a prime number that's not too close to a power of two.
Why prime and not a power of two?
What's the thinking there?
Choosing a prime number as the modulus, M, tends to spread the keys out more evenly across the table, especially if the keys themselves have some underlying patterns or regularities in their distribution.
And staying away from powers of two helps to avoid potential biases that could arise from the binary representation of the keys.
Okay, that makes sense.
What are some other common static hashing methods?
Another popular one is the multiplication method.
The hash function for this one is HK equals the floor of M times K times A modulo 1.
Here A is a constant value between 0 and 1, and K times A modulo 1 represents the fractional part of the product of the key and the constant A.
Interestingly, with the multiplication method, the choice of M is often less critical than with the division method.
So we have division and multiplication.
Are there any other notable static hashing techniques?
Yeah, there's also the multiply -shift method.
This one is particularly efficient when the table size M is a power of 2.
Let's say M equals 2 to the power of ill.
You multiply the key K by a carefully chosen Wubit integer A, and then the hash value is obtained by taking the most significant all bits of the lower Wubit half of this product.
This method is often very fast because it uses basic multiplication and bitwise shift operations which are very efficient at the hardware level.
So we've got these static methods where we pick a single hash function and stick with it.
But our sources also talk about random hashing and universal hashing.
How are those different?
With random hashing, instead of using just one fixed hash function, we randomly choose a function from a whole family of hash functions when we create our hash table.
This choice is made completely independently of the keys we're going to store.
The advantage is that it can give us provably good average case performance because the odds of consistently picking a bad hash function that leads to lots of collisions are pretty low.
So it's like, instead of putting all our eggs in one basket, we're spreading them out across a bunch of different baskets, hoping that at least one of them won't break.
Exactly.
And then there's universal hashing, which takes this idea even further.
A family of hash functions is considered universal if, for any two distinct keys, the probability of those keys colliding under a randomly chosen function from the family is very small.
At most, one meal, whereas a table size.
So it's like, no matter how someone tried to pick keys to mess with our hash table, if we're choosing our hash function randomly from a universal family,
we're pretty much guaranteed to get good performance on average.
It's like having a magic shield against those pesky collisions.
Yeah, that's a good way to think about it.
Universal hashing is especially powerful when we're using chaining because it guarantees good average performance, regardless of how the keys are distributed.
Okay, so that's the what of universal hashing.
What about the how?
How do we actually design these universal families of hash functions?
One approach uses number theory.
We start by picking a prime number, p, that's bigger than any possible key we might encounter.
Then for any two integers, a in the range from one to p minus one and b in the range from zero to p minus one, we can define a hash function, a, g, b, k, as a times k plus b, modulo p, modulo m.
If we take all the possible choices for a and b, that whole set of functions forms a universal family.
That's pretty elegant using prime numbers and modular arithmetic to create these families of hash functions.
Are there other ways to do it?
Yes.
There's also a more computationally focused method that uses the multiply shift technique we talked about earlier.
If you take a family of multiply shift hash functions that use odd integer constants for a, that family can be shown to be two -memory universal, which is also a very strong guarantee on the probability of collisions.
So we've got these sophisticated ways to handle collisions and these clever techniques for designing hash functions.
But what about when we have really long inputs, like entire documents or large data structures, how do we hash those effectively?
For long inputs, we can extend the ideas of universal hashing by breaking the input down into smaller chunks and applying our chosen hash function iteratively across those chunks.
Another common and very powerful approach is to use what are called cryptographic hash functions.
Cryptographic hash functions, those sound serious.
They are.
Their primary purpose is for security, things like ensuring data integrity and creating digital signatures.
But these functions have the nice property that they can take an input of any length and produce a fixed length output, a hash value, and they're designed to behave very much like a random oracle, which means they provide a really good approximation of that ideal independent uniform hashing we've been talking about.
So even though they're designed for security, they can also be very useful for hash tables because they tend to spread those hash values out very evenly across the table.
Exactly.
They're a great tool to have in your arsenal when you need to hash long, complex inputs.
This deep dive into the theory of hash tables has been fascinating.
It's amazing how these fundamental concepts underpin so much of our digital world.
But what about the practical side of things?
What are some of the real -world considerations we need to keep in mind when we're actually using hash tables?
That's a great question.
Our sources do touch on some important practical aspects.
One key consideration is the impact of memory hierarchies in modern computers.
As we mentioned earlier, modern computers have different levels of memory.
Registers, multiple levels of cache, and then main memory.
Each level has different speeds and access times.
So where our data is stored in this hierarchy can really affect performance.
Exactly.
And this is one area where linear probing, even with its clustering issue, can sometimes have an advantage.
Because linear probing accesses memory sequentially, it can lead to better cache utilization.
If the data we're accessing is stored close together in those faster cache levels, it can speed things up considerably.
So it's like, even though we might have to skip over a few occupied slots in the table, if all those slots are located right next to each other in memory, it might still be faster overall than jumping around to different parts of the table.
Right.
It's a trade -off between the number of probes and the speed of each individual memory access.
What about the hash functions themselves?
Can modern hardware help us out there?
Absolutely.
Modern CPUs often come with advanced instruction sets that can significantly speed up complex computations, including the kinds of computations used in some of the more sophisticated hash functions, even cryptographic ones.
This means that using these more complex but potentially more robust hash functions can be more practical in performance -sensitive applications.
So the hardware is catching up with the algorithms, making those really strong hash functions more viable.
Exactly.
And our sources even highlighted a specific example of a hash function designed with memory hierarchies in mind, the WeHash function.
WeHash function.
What's special about that one?
Well, the WeHash function is designed to be very efficient to compute using the CPUs registers, which are the fastest type of memory available.
It uses simple operations like addition, multiplication,
and swapping halves of a word of data.
And there are extensions of the WeHash function that can handle inputs of different links by processing the data in fixed -sized chunks.
So it's like they've really squeezed every ounce of performance out of that function by taking advantage of how modern hardware works.
Exactly.
And it shows how the design of algorithms and data structures is often intertwined with the capabilities of the underlying hardware.
To wrap things up, our sources included a really helpful glossary of key terms.
Maybe we could run through a few of them just to solidify our understanding.
Yeah, let's do a quick review.
All right.
So first up, we have chaining.
Right, chaining.
That's the collision resolution method where we use linked lists at each hash table slot to store multiple keys that map to the same location.
Next we have collision.
A collision is simply the event that happens when two or more different keys get mapped to the same slot in the hash table by our hash function.
And then there's the hash function itself.
The heart of the operation.
The hash function is what takes a key as input and produces an index or hash value that corresponds to a slot in the table.
And the hash table is the data structure that uses this hash function to store and retrieve data efficiently.
Exactly.
And the load factor is a key measure of how full the hash table is.
It's the ratio of the number of elements currently stored in the table to the total number of slots.
We also talk about open addressing.
Open addressing is a family of collision resolution techniques where all the elements are stored directly within the hash table itself.
And we resolve collisions by probing for other empty slots.
And probing is that process of systematically checking a sequence of slots when a collision happens.
Either to find an empty slot for insertion or to locate a key during a search.
And last but not least, we have universal hashing.
Universal hashing is a powerful technique where we use a family of hash functions that have a very low probability of any two keys colliding when we choose a function randomly from that family.
It's a way to guarantee good average case performance regardless of how the keys are distributed.
Well, that was quite a journey through the world of hash tables.
It's incredible how this seemingly simple concept plays such a huge role in so many of the systems we use every day.
Yeah, it's a testament to the power of clever algorithms and data structures to manage information efficiently.
We've seen how hash tables can give us that near constant time performance for those core dictionary operations, inserting, searching, and deleting, which is crucial for so many applications.
And we've seen how the choice of hash function and collision resolution method can have a big impact on performance, both in theory and in practice.
Absolutely.
So for our listeners out there, I want to leave you with this thought.
We've seen how hash tables take a potentially massive universe of keys and map it down to a much smaller, more manageable space, allowing for that incredibly fast access.
Think about how this principle of mapping an efficient lookup might apply in other areas of your life or work.
Could you use a similar approach to organize information, streamline processes,
or even think about problem solving in new ways?
That's a great question to ponder.
Thanks for joining us on this deep dive into hash tables.
It's been a pleasure.
And until next time, keep exploring the fascinating world of algorithms and data structures.
There's always something new to discover.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.
Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
Support LML ♥Related Chapters
- Amortized Analysis: Accounting, Potential, and Dynamic TablesIntroduction to Algorithms
- Binary Search Trees: Querying, Insertion, and DeletionIntroduction to Algorithms
- Dynamic Programming: Rod Cutting, LCS, and Optimal BSTsIntroduction to Algorithms
- Exploring Data with Tables and GraphsElementary Statistics
- Goodness-of-Fit and Contingency TablesElementary Statistics
- Greedy Algorithms: Activity Selection, Huffman Codes, and CachingIntroduction to Algorithms