Chapter 9: Performance – Design Patterns & Optimization
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Welcome back to the Deep Dive.
We are really tearing into the core design attributes of successful software these days, and today we're tackling the one quality that, well, everyone notices immediately when it's missing: speed.
You know, as the great Mae West once said, an ounce of performance is worth pounds of promises.
That quote just hits the nail right on the head, doesn't it?
When we, as architects, talk about performance, we're really talking about time, specifically the system's ability to meet its designated timing requirements.
Right.
And that could be anything, you know, computations done in nanoseconds, maybe disk access in milliseconds, or even making sure a request crossing a continent gets there in under, say, 100 milliseconds.
Time is the resource we have to manage.
Yeah, it's not just about raw power, though, is it?
I mean, our sources make it really clear that while hardware keeps getting cheaper, we still hit these fundamental limits, you know, situations where we just cannot solve a problem fast enough to be useful.
So the mission for this Deep Dive, really, is to guide you through that architectural journey.
First step, defining what speed actually means for your specific system, and then applying the right tactics, the right patterns, to actually achieve it.
And that definition, it really does depend entirely on the context.
Like, are you building a system for high-volume transactions, maybe an e-commerce site?
Okay.
Or are you controlling machinery, something like an engine?
Totally different needs.
Right, right.
So for that high-volume system, the metric that usually matters most is throughput, isn't it?
How many requests can we handle per second or per minute?
Exactly, throughput.
But if you're dealing with a real-time system, like that engine controller example, well, precision is king there.
So we focus instead on latency, how long the response takes, and maybe even jitter, which is the acceptable variation in that response.
Jitter, okay.
Yeah.
And performance scenarios, they're basically the language we use to capture these specific, measurable requirements.
It's how we talk about performance precisely.
Got it.
So let's lay that foundation then, scenarios and concurrency.
Right.
So to really get performance, you have to understand the basic flow.
An event arrives.
It consumes resources, CPU, network, memory, whatever, and it uses up time.
And all this is happening while the system is probably trying to handle lots of other events at the very same time.
And that same time part brings us straight to concurrency, doesn't it?
Immediately, yeah.
Concurrency is just such a foundational concept for any architect, often under-taught, I think.
But it's how we exploit delays, basically.
We run operations in parallel, maybe using multi-core processors or spinning up new threads.
Okay.
So we can use the time one operation is maybe blocked, waiting for I/O, perhaps, to make progress on another operation.
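To make that concrete, here is a minimal Python sketch of putting blocked time to use. The `fetch` function and its 50-millisecond sleep are hypothetical stand-ins for a real disk or network wait, not anything taken from the sources.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(request_id, wait_s=0.05):
    """Hypothetical I/O-bound call: the sleep stands in for a blocked wait."""
    time.sleep(wait_s)
    return request_id


def run_sequentially(ids):
    """Handle requests one after another; every wait is paid in full."""
    start = time.perf_counter()
    results = [fetch(i) for i in ids]
    return results, time.perf_counter() - start


def run_concurrently(ids):
    """Handle requests on separate threads so the waits overlap."""
    start = time.perf_counter()
    # While one thread is blocked inside fetch, the others make progress.
    with ThreadPoolExecutor(max_workers=len(ids)) as pool:
        results = list(pool.map(fetch, ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    ids = [1, 2, 3, 4]
    _, sequential_s = run_sequentially(ids)
    _, concurrent_s = run_concurrently(ids)
    print(f"sequential: {sequential_s:.3f}s, concurrent: {concurrent_s:.3f}s")
```

The exact timings depend on the machine, but the concurrent run should finish in roughly the time of one wait rather than four.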
So concurrency boosts efficiency, makes sense, but it also introduces the scary part, the race condition.
Ah, yes, the race condition, probably the most notorious sporadic bug in software development.
Imagine this, right?
Two concurrent threads.
They both try to run the statement X = 1, followed immediately by X++.
And they're doing this against the same shared variable X.
Okay.
So intuitively, you'd think the final value of X should be two, right?
Most people do.
But because of how the system interleaves the tiny atomic assembly instructions underneath.
Well, depending on that exact timing, that interleaving, the final result could actually be three or even stay at one sometimes.
Wait, hang on.
How could it possibly end up as three or one?
Okay.
So imagine both threads read the initial value of X, which let's say is zero for simplicity here.
They both read zero.
Right.
Thread A calculates zero plus one equals one.
Thread B calculates zero plus one equals one.
Now, if thread A writes its one back to X,
and then thread B writes its one back to X, the final value is just one.
The first increment got overwritten.
Oh, wow.
Okay.
That's how you got one.
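How about three in the original X = 1, X++ example?
Is that a typo?
Ah, good catch.
I blurred two examples together there.
Apologies.
With X = 1 followed by X++, three happens when both threads finish their assignments before either one increments, so the two increments then stack on top of each other.
The lost update I just walked through, where both threads start from zero, is the case that ends at one.
Either way, the classic issue is one thread's work being overwritten or landing in an order you didn't expect.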
The key point remains the interleaving makes the result unpredictable.
It depends entirely on timing.
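One way to see the lost update without relying on luck is to enumerate every possible interleaving. The sketch below is a toy model, not real threading: it assumes an increment breaks down into three steps (load, add one, store), and the step names are made up for illustration.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def final_values(program, initial=0):
    """Run the same program on threads A and B under every interleaving."""
    results = set()
    steps_a = [("A", op) for op in program]
    steps_b = [("B", op) for op in program]
    for schedule in interleavings(steps_a, steps_b):
        shared_x = initial
        register = {"A": 0, "B": 0}  # each thread's private working copy
        for thread, op in schedule:
            if op == "load":
                register[thread] = shared_x
            elif op == "add1":
                register[thread] += 1
            elif op == "store":
                shared_x = register[thread]
            elif op == "set1":
                shared_x = 1
        results.add(shared_x)
    return results


INCREMENT = ["load", "add1", "store"]  # assumed breakdown of x++
print(sorted(final_values(INCREMENT)))  # [1, 2]
```

Starting from zero, two increments land on either one or two depending only on the schedule, which is exactly why the bug is so hard to reproduce on demand.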
Okay.
Got it.
The unpredictability is the killer.
Exactly.
Which is why these bugs are absolute nightmares to reproduce and fix.
I mean, the sources mention one operating system bug that took over a year to happen again, so they could finally track it down.
Whoa.
Okay.
That story alone really drives home why managing shared state is so critical.
So how do we architecturally deal with this?
What are the solutions?
There are fundamentally two ways.
You can impose some structure using locks, semaphores, mutexes, things like that.
These basically enforce sequential access.
Only one thread can hold the lock and touch that shared state at any given time.
Okay.
So locks force them to take turns.
Right.
The other approach, which is often preferable if you can manage it, is to partition the state.
Design your system so that threads simply don't need to share that X in the first place.
Eliminate the contention entirely.
Ah, okay.
Avoid the sharing altogether.
That sounds cleaner if you can do it.
Definitely less prone to deadlock and performance issues that locks themselves can introduce.
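Both approaches can be sketched side by side in Python. This is only an illustration under simple assumptions: the shared state is a single counter, and the function names are invented for the example.

```python
import threading


def count_with_lock(n_threads=4, n_increments=10_000):
    """Shared counter: a lock forces threads to take turns on each update."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(n_increments):
            with lock:  # only one thread at a time may touch total
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_partitioned(n_threads=4, n_increments=10_000):
    """Partitioned state: each thread owns one slot, so nothing is contended."""
    partials = [0] * n_threads

    def worker(slot):
        local = 0
        for _ in range(n_increments):
            local += 1
        partials[slot] = local  # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)  # combine once every thread has finished


print(count_with_lock(), count_partitioned())  # 40000 40000
```

The partitioned version never takes a lock inside the loop, which is where the savings would come from in a contended system.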
All right.
So mastering concurrency is key.
Then we need to precisely define the performance problem itself using that scenario definition we mentioned.
You said there are six parts.
Maybe we can group them.
How about input, context, and output?
What defines the input side of a scenario?
Good way to think about it.
So the input is defined by the source and the stimulus.
The source is just where the event comes from.
Is it an external user clicking a button, another system sending a message, or maybe an internal thing like a timer firing?
Okay.
And the stimulus?
The stimulus defines how those events arrive.
Are they periodic?
Like exactly every hundred milliseconds?
Predictable.
Or maybe stochastic, following some kind of probability distribution, like requests arriving randomly but averaging 10 per second.
Or they could be completely random, unpredictable arrival.
Got it.
Source and stimulus define the input.
What about the context?
Context includes the artifact.
What part of the system are we actually talking about?
Is it the whole system?
Or just one specific component like the user interface or a particular microservice?
And context also includes the environment.
Is the system running in its normal mode?
Is it under some kind of temporary overload or maybe even an emergency mode with different performance expectations?
Normal, overload, emergency.
Okay.
That leaves the output definition.
Right.
The output side is the response and the response measure.
The response is simply what the system does.
Does it return a calculated value?
Does it log an error?
Or maybe under overload, the defined response is to just ignore the request.
That's a valid response too.
Ignoring it.
Okay.
Interesting.
Yeah.
Sometimes necessary.
And then most importantly, the response measure.
This is the actual metric we care about.
Is it latency, the response time?
Throughput, requests per second?
Jitter, the variation in latency?
Or maybe even resource utilization, like keeping CPU below 80%.
Okay.
So putting that all together, we get away from vague stuff like the system should be fast.
Exactly.
And instead we define something concrete and measurable, like the example given.
Under normal operations (environment), 500 users (source) initiate 2,000 requests (stimulus, stochastic) in a 30-second interval, hitting the whole system (artifact), and all requests must be processed (response) with an average latency of two seconds (response measure).
Perfect.
Now that gives the development team a clear testable target to hit.
The job is defined.
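The six parts map naturally onto a small record type. The sketch below is one possible encoding, with field names and the `is_satisfied_by` helper chosen for illustration; the latency lists in the usage lines are invented numbers, not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PerformanceScenario:
    source: str
    stimulus: str
    artifact: str
    environment: str
    response: str
    max_average_latency_s: float  # the response measure

    def is_satisfied_by(self, latencies_s):
        """True if the measured latencies meet the response measure."""
        return len(latencies_s) > 0 and mean(latencies_s) <= self.max_average_latency_s


scenario = PerformanceScenario(
    source="500 users",
    stimulus="2,000 requests, stochastic, in a 30-second interval",
    artifact="whole system",
    environment="normal operations",
    response="all requests processed",
    max_average_latency_s=2.0,
)

print(scenario.is_satisfied_by([1.2, 1.9, 2.4]))  # True: the average is about 1.83 s
print(scenario.is_satisfied_by([2.5, 3.0]))       # False: the average is 2.75 s
```

Writing the scenario down this way turns "the system should be fast" into a check a test suite could run.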
Okay.
So with that vocabulary for defining the problem locked down, the challenge shifts, right?
It becomes about controlling time.
The sources say architects focus on two fundamental things that contribute to latency.
What are those again?
Yeah.
The two big contributors are processing time and blocked time.
Processing time is when the system is actually doing work, consuming resources: CPU cycles are firing, memory's being accessed, data is moving across the network, active work.
Okay.
Active work.
And blocked time?
Blocked time is when the processing for a particular request has essentially stalled.
It's paused.
It might be waiting for, say, disk I/O to complete, or waiting for a network response from another service, or maybe waiting for a lock to be released by another thread.
It's waiting time, not working time.
Got it.
Processing versus waiting.
And that distinction matters because our architectural tactics, the tools we use, have to target one or the other, or sometimes both.
You need to know where the time is actually going.
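A rough way to see where the time goes is to compare wall-clock time with CPU time. The sketch below assumes single-threaded code, where the gap between the two approximates blocked time (it also includes any time the operating system ran something else). `handle_request` is a hypothetical request handler.

```python
import time


def measure(work):
    """Return (wall_s, cpu_s, blocked_s) for one call of work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    # Whatever wall-clock time was not spent on the CPU was spent waiting.
    return wall_s, cpu_s, max(wall_s - cpu_s, 0.0)


def handle_request():
    sum(i * i for i in range(200_000))  # processing: the CPU is busy
    time.sleep(0.05)                    # blocked: stand-in for an I/O or lock wait


if __name__ == "__main__":
    wall_s, cpu_s, blocked_s = measure(handle_request)
    print(f"wall {wall_s:.3f}s = processing {cpu_s:.3f}s + blocked {blocked_s:.3f}s")
```

If most of the wall time is blocked, tactics aimed at faster computation will not help much, and the reverse holds too.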
Another really key insight here is about how different resources behave when they get saturated, when they're overloaded.
Right.
You mentioned CPU versus memory.
Exactly.
CPU load.
Well, as it increases, performance tends to degrade somewhat gracefully.
Things get slower, maybe noticeably slower, but often steadily slower.
Okay.
Steady decline.
But memory is different.
You said it's much more dangerous.
Why?
Oh, memory saturation is a completely different beast.
When you run out of physical RAM, the operating system starts doing something called page swapping, or just swapping.
It desperately tries to make space by moving chunks of data, pages, from fast RAM to the much, much slower hard disk or SSD.
Ah, the dreaded disk swap.
Exactly.
And the overhead of constantly moving data back and forth between RAM and disk is just enormous.
So system performance doesn't just degrade slowly.
It can absolutely crash.
It can fall off a cliff, going from acceptable one moment to basically unusable the next.
Catastrophic failure.
Wow.
Okay.
So managing memory usage is incredibly critical then.
Paramount.
You have to avoid hitting that swap threshold, if at all possible, for performance sensitive systems.
All right.
So knowing about processing time, blocked time, and resource saturation,
let's get into the tactics.
You said two main categories.
Two main categories.
Yeah.
The first one is control resource demand.
This is all about reducing or managing the amount of work the system is asked to do in the first place.
Limiting the work.
Makes sense.
How do we do that?
Well, one way is to explicitly manage event arrival.
This often gets formalized in things like service level agreements or SLAs.
The SLA might say something like, look, Mr. Client, we guarantee you a response time of Y only if you send us fewer than X events per second.
Ah, okay.
So it sets expectations and boundaries for the client too.
Exactly.
It prevents the client from unintentionally DDoSing your system, basically.
Protects the system's resources.
Okay.
What else falls under controlling demand?
Another technique is managing the sampling rate.
This applies if you're dealing with streams of data, like from sensors or maybe video frames.
You can sometimes maintain a predictable latency by simply reducing the frequency.
Sample the sensor every second instead of every 100 milliseconds.
But there's a trade-off there, obviously.
Oh, absolutely.
You lose fidelity, you lose detail in the data, but it might be necessary to meet the timing requirements.
It's an architectural trade-off.
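In code, managing the sampling rate can be as simple as keeping one reading out of every N. This is a minimal sketch with made-up sensor data, assuming the readings arrive as a list at a fixed 100-millisecond interval.

```python
def downsample(samples, keep_every):
    """Keep one sample out of every keep_every, starting with the first."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return samples[::keep_every]


# Pretend data: one reading every 100 ms for 2 seconds.
readings = list(range(20))

# Keeping every 10th reading turns a 100 ms interval into a 1 s interval.
per_second = downsample(readings, 10)
print(per_second)  # [0, 10]
```

Eighteen of the twenty readings are gone, which is the loss of fidelity being traded for less work.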
Right.
Or what if you can't reduce the amount of work, but some work is more important than other work?
Then you prioritize events.
You rank the incoming requests based on importance.
So, you know, a critical alert, like a fire alarm message, absolutely must get processed immediately.
To ensure that, the system might be designed to temporarily ignore or queue up lower priority requests like routine status updates when the load gets high.
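Here is a small sketch of event prioritization using Python's `heapq`. The class name, the priority numbers, and the event strings are assumptions made for the example; lower numbers mean more important.

```python
import heapq
from itertools import count


class PriorityEventQueue:
    """Lower number means more important. Ties are served in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def put(self, priority, event):
        heapq.heappush(self._heap, (priority, next(self._arrival), event))

    def get(self):
        _priority, _seq, event = heapq.heappop(self._heap)
        return event

    def __len__(self):
        return len(self._heap)


queue = PriorityEventQueue()
queue.put(5, "status update from sensor 12")
queue.put(0, "fire alarm")
queue.put(5, "status update from sensor 7")

while queue:
    print(queue.get())
# fire alarm
# status update from sensor 12
# status update from sensor 7
```

The alarm arrives second but is handled first, while the routine updates wait their turn.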
Okay.
Prioritizing.
Seems essential for many systems.
Then there's reducing computational overhead.
This sounds more like traditional optimization.
It includes that, but it's broader from an architectural view.
One key tactic here is reducing indirection.
And this hits a classic trade-off.
Ah, the modifiability versus performance trade-off?
That's the one.
Indirection, adding layers like wrapper functions, abstract interfaces, message buses, makes the system easier to change, easier to maintain. It increases modifiability.
But every single layer adds a little bit of processing time.
Maybe just microseconds, but it adds up.
So when performance is critical?
When performance is absolutely critical, an architect might make the tough call to remove some of those layers.
Cut out the middleman basically for raw speed.
But they do so knowing that they're likely making the system harder to modify later.
It's one of the really painful decisions architects have to make sometimes.
I can imagine.
Okay.
Any other ways to reduce overhead under this category?
Yeah, a couple more straightforward ones.
Co-locating communicating resources.
If two components talk to each other a lot, put them on the same server, or at least in the same data center rack.
Avoid those slow, unpredictable network hops, if you can.
And periodic cleaning.
Sometimes systems just get crufty over time.
Data structures get fragmented, caches get stale.
So regular cleanup, maybe reinitializing hash tables, clearing temporary files.
Even just a scheduled reboot can restore peak efficiency.
Okay.
Co-location and cleaning.
Got it.
That covers controlling demand.
What's the second big category of tactics?
The second category is manage resources.
This is about optimizing the availability and use of the resources you have.
Okay.
Managing what you've got.
Well, the simplest tactic often is just increase resources.
Get a faster processor.
Add more RAM.
Get a faster network card.
Honestly, given how relatively cheap hardware is compared to, say, senior developer time spent optimizing code, this is often the fastest and cheapest way to get an immediate performance boost.
Throwing hardware at the problem.
The classic solution.
But sometimes that's not enough, or not possible.
Right.
When hardware alone can't solve it, we look at things like introducing concurrency.
We talked about that: processing activities in parallel to reduce that blocked time and keep the CPU busy.
We can also maintain multiple copies of computations.
Think of microservices running behind a load balancer.
If one server is busy, the request goes to another one.
This reduces contention.
Okay.
Multiple copies of the code running.
What about data?
Similar idea.
We maintain multiple copies of data.
This can be done through data replication, where you have identical copies of the database on multiple servers, reducing read contention.
Or through caching, which is about storing frequently accessed data in a faster, closer storage medium, often in memory instead of on disk.
Caching gives you faster access speed.
Caching seems like a huge win.
It often is.
But there's a massive challenge that comes with any kind of data duplication, whether replication or caching.
And that challenge is consistency.
Ah, right.
If you have three copies of the data and one copy changes.
How do you make sure the other two copies get updated?
And when?
Instantly.
Eventually.
Ensuring consistency across multiple copies is a really complex architectural problem.
The mechanisms for synchronization can themselves introduce performance penalties and complexity.
It's a major trade-off.
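The consistency catch is easy to reproduce with Python's built-in `functools.lru_cache`. In this sketch the `database` dictionary is a hypothetical stand-in for a slow backing store, and clearing the whole cache is just the simplest possible invalidation strategy.

```python
from functools import lru_cache

database = {"price:widget": 100}  # hypothetical slow backing store


@lru_cache(maxsize=128)
def read_price(key):
    """Cached read: only the first call for a key touches the backing store."""
    return database[key]


print(read_price("price:widget"))  # 100, fetched from the store
database["price:widget"] = 120     # the source of truth changes
print(read_price("price:widget"))  # 100, the cached copy is now stale
read_price.cache_clear()           # explicit invalidation restores consistency
print(read_price("price:widget"))  # 120
```

The second read is fast and wrong, which is the trade in miniature: deciding when and how to invalidate is where the complexity lives.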
Okay.
Consistency is the catch with multiple data copies.
What's the final piece of resource management?
The final piece is schedule resources.
This comes into play whenever you have contention.
Multiple requests wanting the same single resource at the same time.
Think one CPU core, one disk controller, one critical data structure protected by a lock.
When contention happens, someone has to decide who goes next.
That's scheduling.
A policy is needed.
Right.
Someone needs to be the traffic cop.
What kind of policies are there?
Well, the simplest is FIFO.
First in, first out.
Just process requests in the order they arrive.
Simple to implement.
But the big problem with FIFO is that one really long, slow request can get to the front of the line and block everyone else behind it, even if they have tiny urgent requests.
Yeah, that doesn't sound ideal for responsiveness.
Often isn't.
So you move to priority-based scheduling.
With fixed priority scheduling, you assign a static priority level to each type of task or request.
Then the scheduler always picks the highest priority waiting task to run next.
Okay, but how do you assign those priorities?
That depends on the system's goals.
Priorities might be based on, say, semantic importance, like we said.
Fire alarm is higher priority than a status update.
Or, especially in real-time systems, it might be based on timing constraints.
Deadline monotonic assigns higher priority to tasks with shorter deadlines.
Makes intuitive sense, right?
The one due sooner goes first.
Or for periodic tasks, rate monotonic assigns higher priority to tasks that run more frequently, have shorter periods.
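Both fixed-priority rules boil down to a one-time sort. The sketch below shows only the priority assignment, not a scheduler; the task names, periods, and deadlines are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    period_ms: int    # how often the task is released
    deadline_ms: int  # how long after release it must finish


def rate_monotonic(tasks):
    """Highest priority first: a shorter period means a higher priority."""
    return sorted(tasks, key=lambda t: t.period_ms)


def deadline_monotonic(tasks):
    """Highest priority first: a shorter relative deadline means a higher priority."""
    return sorted(tasks, key=lambda t: t.deadline_ms)


tasks = [
    Task("log_flush", period_ms=1000, deadline_ms=20),
    Task("engine_control", period_ms=10, deadline_ms=5),
    Task("display_refresh", period_ms=50, deadline_ms=40),
]

print([t.name for t in rate_monotonic(tasks)])
# ['engine_control', 'display_refresh', 'log_flush']
print([t.name for t in deadline_monotonic(tasks)])
# ['engine_control', 'log_flush', 'display_refresh']
```

The two rules disagree about `log_flush`: it runs rarely, so rate monotonic ranks it last, but its tight deadline moves it up under deadline monotonic.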
Deadline monotonic, rate monotonic, okay.
Those are fixed priorities.
Are there dynamic ones too?
Yep.
Dynamic priority scheduling adjusts priorities on the fly based on the current system state.
Common examples include round robin, where everyone gets a small time slice in turn, good for fairness.
And more complex ones like earliest deadline first, EDF, which constantly reevaluates which waiting task has the absolute nearest deadline and runs that one.
Or least slack first, which prioritizes the task with the least slack, meaning the least spare time left, after accounting for its remaining work, before it misses its deadline.
Wow, EDF and least slack sound complicated, but maybe more efficient.
They are.
In fact, EDF and least slack are theoretically optimal on a single preemptible processor in terms of meeting deadlines.
They maximize the chances of hitting all timing requirements,
but they have higher overhead because the scheduler is constantly recalculating priorities.
So again, it's a trade off.
Architects have to weigh predictability and simplicity (fixed priority) against potentially higher efficiency but more overhead (dynamic priority).
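A toy comparison shows why ordering matters. The sketch below is deliberately simplified: all jobs are released at time zero and run without preemption, so earliest deadline first reduces to a single sort, whereas a real EDF scheduler re-evaluates as jobs arrive. The job names and numbers are made up.

```python
def run_in_order(jobs):
    """Run jobs back to back; return the names that miss their deadline.

    Each job is (name, duration, deadline), all released at time 0.
    """
    clock = 0
    missed = []
    for name, duration, deadline in jobs:
        clock += duration  # the job finishes at this time
        if clock > deadline:
            missed.append(name)
    return missed


def fifo(jobs):
    """First in, first out: keep the arrival order as given."""
    return run_in_order(jobs)


def earliest_deadline_first(jobs):
    """Run the job with the nearest deadline first."""
    return run_in_order(sorted(jobs, key=lambda job: job[2]))


jobs = [("report", 5, 20), ("alarm", 1, 2), ("status", 2, 8)]
print(fifo(jobs))                     # ['alarm']
print(earliest_deadline_first(jobs))  # []
```

Under FIFO the long report sits at the head of the line and the alarm misses its deadline; reordering by deadline lets all three finish in time with the same total work.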
It really is fascinating how these abstract concepts, controlling demand, managing resources, scheduling apply so broadly.
You mentioned the traffic analogy earlier.
Absolutely.
It makes it concrete.
Think about traffic management on a highway.
Those ramp meters, the lights that control how quickly cars can merge onto the freeway.
That's a physical example of managing the event arrival rate.
HOV lanes, high occupancy vehicle lanes.
That's prioritizing certain events, cars with multiple people, over others.
And simply adding more lanes to the freeway?
That's directly analogous to maintaining multiple copies of computations, more paths for cars, to reduce contention and increase throughput.
The underlying principles are universal.
That's a great way to visualize it.
Okay.
So those are the general tactics.
Now let's look at some specific baked-in solutions, the performance patterns.
Right.
So tactics are the building blocks, the ingredients.
Patterns are more like established recipes, recurring architectural structures, specifically designed to solve common performance problems.
Some patterns, by the way, also help with other qualities like availability.
The circuit breaker pattern is a good example of that overlap.
Good point.
Okay.
Let's look at four key performance patterns mentioned in the sources.
First up, the service mesh.
Sounds very modern, very microservices.
It absolutely is.
Primarily used in microservice architectures.
The defining feature of a service mesh is the sidecar.
Imagine a little helper process, a proxy, that gets deployed right alongside every single one of your microservices.
Often they're packaged together in what's called a pod.
A sidecar proxy for every service.
Okay.
What's the performance benefit of that?
Well, the sidecar handles a lot of the cross-cutting concerns, things like monitoring, security checks, service discovery, and critically, inter-service communication routing.
Because that sidecar proxy lives right there on the same machine, often in the same container pod as the service itself, the communication between the actual service code and its proxy is extremely fast.
Local communication, not network calls.
Ah, so it cuts down on network latency for all that operational stuff the service needs.
Exactly.
It keeps the service code cleaner too, focused on business logic.
The trade-off, of course, is that you're now running all these extra sidecar processes, which consume additional CPU and memory.
It adds system overhead.
Okay.
Service mesh for microservices via sidecars.
Next pattern.
The good old load balancer.
Pretty ubiquitous.
Extremely common.
Yeah.
A load balancer acts as an intermediary.
It sits in front of a pool of identical servers or services that can all handle the same type of request.
It provides a single point of contact to the outside world, one IP address, usually.
Clients talk to the load balancer.
Then it implements the schedule resources tactic we talked about.
It distributes the incoming requests across the pool of servers using some policy, maybe simple round robin, maybe sending it to the server with the fewest active connections, the least busy.
Right.
And the benefits are pretty clear, aren't they?
Lower latency, usually, because the load is shared.
More predictable response times.
Yep.
And it also helps with availability.
If one server in the pool fails, the load balancer just stops sending requests to it.
The failure is invisible to the clients.
Plus, it makes scaling easy.
Just add more servers to the pool.
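The selection policies themselves are small. Here is a toy sketch with no networking at all, only the bookkeeping for round robin, least connections, and taking a failed server out of rotation. The class, its method names, and the server names are hypothetical.

```python
from itertools import count


class LoadBalancer:
    """Toy balancer over named servers; it models only the selection policy."""

    def __init__(self, servers):
        self.active = {name: 0 for name in servers}  # name -> open connections
        self._turn = count()

    def pick_round_robin(self):
        """Cycle through the servers in a fixed order."""
        names = sorted(self.active)
        return names[next(self._turn) % len(names)]

    def pick_least_connections(self):
        """Choose the server with the fewest open connections."""
        return min(sorted(self.active), key=lambda name: self.active[name])

    def connect(self, name):
        self.active[name] += 1

    def disconnect(self, name):
        self.active[name] -= 1

    def mark_down(self, name):
        """Stop routing to a failed server; clients keep the same entry point."""
        self.active.pop(name, None)


lb = LoadBalancer(["app-1", "app-2", "app-3"])
print([lb.pick_round_robin() for _ in range(4)])
# ['app-1', 'app-2', 'app-3', 'app-1']

lb.connect("app-1")
lb.connect("app-1")
lb.connect("app-2")
print(lb.pick_least_connections())  # app-3

lb.mark_down("app-3")
print(lb.pick_least_connections())  # app-2
```

Scaling out in this model is just adding another name to the pool, which mirrors why the pattern makes growth easy.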
Seems like a win-win.
What's the catch?
The main catch is that the load balancer itself has to be really, really fast and reliable.
If it's slow, it just becomes a new bottleneck for the entire system.
Ah, okay.
So the load balancer itself often needs to be highly performant.
Precisely.
Which is why in high-demand environments, you'll often see the load balancers themselves replicated for both speed and fault tolerance.
Got it.
Okay, third pattern.
Throttling.
Throttling is essentially a specific packaging of the manage work requests tactic, or maybe manage event arrival.
A throttler component sits in front of a resource or service and monitors the rate of incoming requests.
If the rate exceeds some predefined threshold, the throttler starts limiting access.
It might delay requests or even reject them outright.
So it's like a bouncer at a club, keeping the place from getting too crowded.
That's a perfect analogy.
Exactly.
It keeps the downstream service operating within its performance sweet spot, preventing it from getting overloaded and potentially collapsing.
It helps the service handle variations in demand gracefully.
Makes sense.
The trade-off seems obvious though.
Yeah.
If the incoming demand consistently exceeds the throttler's limit, you're going to be actively losing or rejecting requests.
That might not be acceptable.
And, just like the load balancer, the throttling logic itself has to be incredibly fast.
If checking the limit takes too long, the throttler itself becomes part of the performance problem it's trying to solve.
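One common way to implement a throttler is a token bucket. The sources do not name a specific algorithm, so treat this as one possible choice; the class name and the injectable `clock` parameter are assumptions made so the example runs deterministically.

```python
import time


class TokenBucketThrottler:
    """Admit about `rate` requests per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Constant-time check, so the throttler stays cheap under load."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller may reject or delay the request


fake_now = [0.0]  # fake clock so the demo does not depend on real time
throttler = TokenBucketThrottler(rate=2, burst=2, clock=lambda: fake_now[0])

print([throttler.allow() for _ in range(3)])  # [True, True, False]
fake_now[0] = 0.5                             # half a second later, one token is back
print(throttler.allow())                      # True
```

Each check is a handful of arithmetic operations, which matters given that the throttler sits on the path of every request.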
Right.
Fast logic is key.
Okay.
Last pattern.
MapReduce.
This one sounds more specialized.
It is.
MapReduce was specifically designed to tackle a massive scale performance challenge.
How do you efficiently sort, process, and analyze extremely large, often unsorted data sets?
Think terabytes or petabytes of data.
The kind of stuff Google, Netflix, big data companies deal with.
Okay.
Processing huge amounts of data.
How does it work?
The name implies two steps.
Sort of.
It relies on a specialized infrastructure.
And then the programmer provides two key pieces of code.
A map function and a reduce function.
The map phase is where the parallelism really shines.
The infrastructure takes the enormous input data set and divides it into chunks.
Many instances of your map function run in parallel, each processing one chunk.
The map function typically does some initial processing, maybe filtering data, extracting key pieces, and hashing results into different buckets.
Crucially, these map instances are usually stateless.
They don't need to talk to each other.
They just process their own input chunk independently.
Stateless and parallel sounds highly scalable.
What happens after the map phase?
Then the infrastructure does a shuffle.
It gathers all the intermediate results from the map tasks grouped by those buckets and distributes these buckets to different nodes where the reduce phase happens.
Many instances of your reduce function then run in parallel, each taking one bucket of related data and performing the final aggregation or analysis.
The output of the reduce phase is typically much smaller than the original input data set.
Massive parallelism in both phases, managed by the infrastructure.
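The three stages can be sketched in plain Python with the classic word-count example. This runs sequentially in a single process and only illustrates the data flow; in a real deployment the infrastructure would distribute the map and reduce calls across nodes. The function names and input strings are made up.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map: emit (word, 1) for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all intermediate values by key into buckets."""
    buckets = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            buckets[key].append(value)
    return buckets


def reduce_bucket(key, values):
    """Reduce: collapse one bucket into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each map call sees only its own chunk, so the calls could run in parallel.
    mapped = [map_chunk(chunk) for chunk in chunks]
    buckets = shuffle(mapped)
    # Each reduce call sees only one bucket, so these could run in parallel too.
    return dict(reduce_bucket(key, values) for key, values in buckets.items())


chunks = ["the quick brown fox", "the lazy dog", "The fox"]
print(word_count(chunks))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

A skewed input would show up here as one very long bucket, such as a word that dominates the text, and that bucket's reduce call would become the bottleneck.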
The benefit is clear for huge data sets.
Huge benefit.
Exploitation of parallelism, plus resilience.
If one map or reduce task fails, the infrastructure can usually just restart it on another node with only a small impact on the overall job time.
What are the limitations or tradeoffs?
Well, the infrastructure overhead is significant.
It's just not worth it for small or even medium size data sets.
You need that massive scale to justify it.
And the effectiveness really depends on being able to divide the work relatively evenly.
If your data is skewed such that one map task or one reduce bucket gets vastly more data than others, you lose a lot of the parallelism advantage.
That becomes your bottleneck.
Okay, so MapReduce.
Powerful for huge divisible data sets but overkill otherwise.
So that really takes us through the core concepts of performance architecture, doesn't it?
It does.
We started by defining performance.
It's all about meeting those specific timing requirements.
And we saw that success really hinges on understanding and mastering concurrency, especially how to manage shared state safely.
Then we dug into the systematic approach using those two main categories of tactics.
First, control resource demand, limiting the work through things like managing arrival rates, prioritizing, reducing overhead.
Right.
And second, manage resources, optimizing what you have through increasing resources using concurrency, replicating computations and data carefully, and applying smart scheduling policies when contention occurs.
And we wrapped up with those specific patterns.
Service mesh, load balancer, throttling, and MapReduce, which are like pre-built solutions for common performance bottlenecks.
Yeah.
And for those of you out there actually analyzing or designing systems, remember the practical value here.
The sources mentioned things like a tactics-based questionnaire.
Using a systematic approach like that can really help you identify potential performance risks based on your architectural choices before you get too far down the road.
That's a crucial point.
Thinking about this early.
And maybe as a final provocative thought for you to chew on, something that ties a lot of this together.
Think about the fundamental architectural choice you often face using synchronous communication versus asynchronous communication between components.
Ah, like blocking calls versus sending messages and not waiting.
Exactly.
How does that fundamental choice, to block the caller or not, radically impact things like latency, jitter, and especially which of these performance tactics become most critical, particularly around resource contention, blocked time, and the types of scheduling you might need?
How does synchronous versus asynchronous change the game for achieving predictable performance?
That's a deep one.
That choice really does ripple through the whole design, doesn't it?
It dictates so much about how you need to apply these tactics.
Something definitely worth pondering.
Well, thank you for joining us for this deep dive into software performance.
We hope it was useful and we'll catch you next time.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.