Chapter 13: Stock Exchange System Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today, we are putting on our architect hats and designing one of the most demanding systems on the planet.

A cutting -edge electronic stock exchange.

Exactly.

We are leaving the days of, you know, shouted floor trading behind and focusing instead on these silent supercomputers that process billions of transactions daily.

And it's not just about being fast.

No, it's about achieving near -instant speed predictably.

It's a design problem.

Well, it's defined by paradox.

On one hand, you have this massive scale, billions of events.

Right.

But on the other, the absolute requirement for ultra -low latency means that, traditional distributed systems wisdom, it often just fails.

So we have to understand why.

We have to understand why the fastest solution often involves consolidating the core functions onto a single gigantic highly optimized server.

It really defies modern scaling conventions.

Okay.

So that counterintuitive design choice is what's going to drive this entire Deep Dive.

It is.

All right.

Let's unpack this by setting some ground rules because the requirements are, they're just impossibly stringent.

They really are.

For our Deep Dive, we're simplifying the exchange scope.

So we're only trading stocks.

Okay.

And we are limiting ourselves to handling new limit orders and cancellations.

Which simplifies things a lot by ignoring complex market orders.

It does.

But even with those constraints, the functional requirements are intense.

Right.

Our to place and cancel orders.

And they need to get matched trade notifications in real time.

And view the constantly changing live order book.

The scale we're talking about, even for a small to medium exchange, is maybe a hundred trading symbols.

And supporting up to a billion orders every single day.

A billion.

A billion orders a day is a massive administrative headache, especially when you think about compliance and risk.

Absolutely.

We need simple hard risk checks, like limiting a user to trading a maximum volume of a million shares of Apple stock in one day.

Sure.

Something like that.

But critically,

our system has to verify funds and withhold the necessary capital before an order is even placed.

That pre -order check is completely non -negotiable.

And that leads directly into the non -functional requirements, which are the true design constraints here.

Okay.

Like what?

We need at least 99 .99 % availability.

That's less than six seconds of unplanned downtime per year.

Wow.

Fault tolerance has to be instant.

But the killer requirement is latency.

We are aiming for millisecond level round trip latency.

And more importantly, we need stable performance.

Exactly.

We need the 99th percentile latency to be just as low as the average, which means almost no trades get delayed.

Let's ground that volume requirement with some quick math.

If you take that billion orders per day and spread it over the standard 6 .5 trading hours, the baseline is already huge.

But we know trading isn't steady.

You get these massive spikes at the open and the close.

So if we factor in those peak multipliers,

that frenzy of activity right when the market opens.

Our system has to be engineered to handle approximately 215 ,000 orders per second.

That one number.

Yeah.

That dictates everything.

It's why we have to abandon typical distributed systems wisdom.

Okay.

So moving into the architectural context,

we need to know who's using this system because not all traders are created equal.

That's right.

You have the broker, which is your friendly retail trading interface versus the institutional client, the massive hedge funds and high frequency traders.

Yep.

They handle enormous volumes and they demand specialized features and most of all, absolute low latency.

And their instructions are different too.

We're focused on the limit order, which specifies a fixed price.

It's patient.

It waits for a match.

And it's important to contrast that with a market order, which we excluded for simplicity, that just executes immediately at whatever the best prevailing price is.

So it sacrifices cost certainty for a guarantee of execution.

Exactly.

And all this complex data needs a common language, right?

Yeah.

Which is where the fake kicks protocol comes in.

The financial information exchange.

It's the vendor neutral standard for all this securities transaction info.

It allows all the different brokers and institutions to speak the same technical language.

Right.

So let's talk about the structure.

When you visualize this exchange, you really see three distinct communication flows.

And that separation is vital because not all data matters equally.

Not at all.

So the absolute most crucial path is the trading flow.

We call it the critical path.

This is the entire loop from a client placing an order to getting that execution confirmation.

Okay.

So where does that order hit first?

It hits the client gateway.

Our lightweight gatekeeper.

Right.

Authentication, rate limiting, making sure the message is formatted properly.

It passes validation and then goes to the order manager.

And that's where those mandatory risk checks and fund verifications happen against the user's portfolio.

The validated order then goes to the sequencer.

Now this is an indispensable component, right?

Absolutely.

It doesn't just put a time stamp on it.

It stamps the order with a deterministic sequential ID.

Before passing it to the matching engine.

The literal core of the system.

The matching engine is where the magic happens.

It is.

It maintains the order book and it finds a match between a buyer and a seller resulting in two execution reports or fills.

And those executions go back to the client.

Right.

And the sequencer guarantees that even if 10 orders arrive at almost the exact same microsecond, the matching engine processes them in a guaranteed repeatable order.

That functional determinism,

that must be crucial for audits and recovery.

It's everything.

Okay.

So the second flow is the market data flow.

Right.

Once the matching engine produces those executions, they feed into the market data publisher.

Which uses that information to reconstruct the real time order books.

Yeah.

Showing price depth and all that.

Yep.

L1, L2.

And it generates candlestick charts.

This data is then consumed by the data service, which brokers subscribe to.

And finally the reporting flow.

This one is non -critical.

The reporter just collects a logs of all orders and executions and writes them to a standard durable database.

For compliance, settlement and tax reporting.

Exactly.

We put this flow completely off the critical path because if reporting slows down by a second, no one cares.

But if trading slows down.

We lose business.

Instantly.

Instantly.

Let's drill down on the complexity of that order manager.

You mentioned risk checks,

but just managing the life cycle of an order.

Yeah.

From received to confirmed, partially filled, canceled.

That sounds like an incredibly intricate state machine.

It is.

And that's often handled using the event sourcing paradigm.

Okay.

Explain that.

So instead of constantly updating a database entry that stores the current state of an order, we store an immutable, sequential log of every single event that affects that order.

So the order's current state is just a projection of that log.

Precisely.

This gives you perfect auditability.

You can replay the entire history if you need to, which is a massive requirement in finance.

That makes perfect sense for compliance.

Okay.

Now let's look at the API in the data itself.

Brokers interact with our client gateway using what?

Standard restful endpoints?

Yeah.

For basic actions.

Think PSV1 order or GeekDV1 execution.

But you mentioned the institutional clients.

They use something different.

They do.

They often use custom ultra low latency protocols directly.

But regardless of the protocol, the underlying data models are the same.

You have product data.

Like the stock name, which is cacheable.

Then the order, which is the instruction, and the execution, which is the matched result.

And crucially, on that critical path, this data isn't being written to disk, no way.

It's all stored in shared memory for speed.

The true optimization genius, though, seems to lie in the order book data structure.

If we're processing hundreds of thousands of orders per second,

every operation, adding an order, canceling one, executing a match, has to be lightning fast.

We need O1 time complexity.

We do.

And a standard data structure often fails on cancellation.

How so?

Well, a standard list of orders waiting at a specific price would be too slow if you needed to cancel an order that's stuck in the middle.

So what's the solution?

The system uses a map to quickly locate the price level, but the list of orders within that price level is implemented as a doubly linked list.

I see.

Okay, explain why that doubly linked list is the trick for O1 cancellation.

So when a client requests a cancellation, we first use a helper map, let's just call it an order map, to look up the exact memory location of that specific order in O1 time.

Okay, so you have a direct pointer to it.

Exactly.

If it were a singly linked list, to remove that node, you'd have to walk the entire list from the beginning just to find the node before it so you could update its pointer.

That's slow.

That's O -N.

But because it's a doubly linked list.

The node itself holds a pointer back to the previous node, so you can immediately reroute the previous and next pointers and just delete the node.

In perfect O1 time complexity.

That one tiny data structure optimization underpins the entire performance guarantee of the system.

That is where it gets really interesting.

It absolutely is.

And speaking of optimization, the Market Data Publisher generates those candlestick charts.

Open, close, high, low prices over time intervals.

Right.

To handle that massive volume of recent ticks for hundreds of symbols without drowning in memory or creating constant garbage collection.

Let me guess another clever data structure.

Pre -allocated ring buffers.

You use them to hold the newest ticks and then you push the older consolidated sticks to disk.

Smart.

That brings us to the deep dive on performance, resilience, and fairness.

We set the stage earlier with the paradox.

Traditional distributed systems are just too slow.

Network latency alone, 500 microseconds per hop plus slow disk access.

It just kills your performance goals.

So the designers decided the only way to meet those microsecond level latency goals was to what?

Eliminate the network entirely for the critical pass.

Pretty much.

The solution is the single server architecture.

You take the most crucial components, gateway, order manager, sequencer, and matching engine, and you put them all on one gigantic, highly optimized server.

Once they're on one machine, how do they talk to each other without slowing down?

You still have different processes running.

This is where you get into some operating system wizardry.

First, the processes are often single threaded and their main processing loops are pinned to fixed CPU cores.

CPU pinning.

And the benefit of that is twofold.

Right.

Exactly.

You eliminate the cost of the OS constantly switching tasks, so no context switching.

And since only one thread is modifying the state on that core, you eliminate the need for costly locks.

No lock contention.

Which guarantees predictable, low latency.

Highly predictable.

And their internal messaging system, the thing that replaces the network.

They leverage the map to system call over the memory back file system, Devshim.

So you're creating a segment of shared RAM that multiple processes can instantly read and write to.

Yes.

You completely bypass the slow kernel network stack and traditional disk I .O.

This creates an event store that operates with sub -microsecond latency.

The data itself is encoded using SBE, simple binary encoding, for maximum speed.

It's truly incredible how much effort goes into shaving off single microseconds.

Yeah.

So let's talk resilience within this single server design given our 99 .99 % availability goal.

The baseline for resilience is the hot -warm design.

Okay.

You run a primary or hot matching engine that sends executions, but you simultaneously feed all the incoming events to a secondary warm instance.

And the warm instance processes everything and builds up the exact same state.

An exact replica of the order book state, but it stays silent.

If the hot server fails, the warm server instantly takes over and becomes the new hot.

Guaranteeing continuous operation because its state is already perfect.

Correct.

And what if the entire data center burns down?

You need resilience across multiple locations.

Then you introduce a consensus algorithm, usually Raft, deployed across data centers.

And Raft maintains state consistency.

The leader replicates every event to the followers.

This ensures that if a catastrophic failure happens, your RPO, your tolerance for data loss, is near zero.

And your RTO, the recovery time, is within seconds because of the automatic failover.

Exactly.

Let's quickly confirm the matching logic itself.

We've talked about determinism, but what's the default algorithm?

The vast majority of stock exchanges use a straightforward FIFO, first in, first out, algorithm within a price level.

So if two buy orders hit at the same price, the one the sequencers amped first gets matched first.

That's it.

Determinism is insured because you're relying on that sequence ID, not something unreliable like server clock drift.

Now let's touch on the delicate topic of fairness.

When latency is measured in microseconds, clients will pay enormous sums just to get their order to the front of the queue.

Or to receive market data a microsecond faster than their competition.

How do exchanges ensure fairness?

Right.

Well, the primary mitigation for market data is multicast, often using reliable UDP.

So instead of sending updates to tens of thousands of subscribers one by one, the market data publisher just broadcasts the update simultaneously to a multicast group.

Every participant in that group theoretically gets the update at the exact same moment.

And we have to talk about collocation.

Yeah, yes.

The practice where institutional clients pay a premium to physically place their servers inside the

You're paying for proximity.

Literally reducing latency to the time it takes for light to travel between their machine and yours.

It's the ultimate measure of the latency race.

It is.

It's seen as a paid VIP service.

And it doesn't really violate the spirit of fairness, but it highlights the extreme lengths people go to for speed.

And finally, network security.

You have to address the threat of DDoS attacks.

What are the key strategies?

You isolate public and private services, you rely heavily on caching,

and you have to harden your URLs.

Meaning you structure API paths simply like data recent, so they're highly cacheable.

And resistant to complex query string manipulation that can bypass your cache.

And then of course, aggressive rate limiting and safe listing or block listing known bad IPs are essential.

This has been an incredible exploration of a highly specialized architecture.

I think the crucial takeaway for me is understanding counterintuitive design philosophy, right?

That absolute performance and predictability often leads you to reject distributed systems in favor of consolidating that critical path onto one single gigantic, highly optimized server.

It's the ultimate engineering trade off.

You're sacrificing potential horizontal scale for predictable ultra low 99th percentile latency.

But here is a provocative thought for you to explore next.

While traditional exchanges rely on this single server supremacy, the rise of decentralized finance, especially those using automated market making or AMM.

They're challenging that model.

They are.

They're lowering the entry barrier and injecting this innovative energy into the industry by rejecting the centralized exchange structure entirely.

A fantastic point.

We've covered the critical architecture, the reliance on the sequencer for determinism, the use of UMMAP and CPU pinning for that sub microsecond communication and the absolute necessity of oh one data structures like the doubly linked list for the order book.

We did.

Thank you so much for guiding us through this ultra low latency world of modern finance system design.

And from the last minute lecture team, thanks for tuning in.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Building a high-performance electronic stock exchange requires architecting a system capable of matching buyers and sellers across millions of securities transactions while maintaining extreme reliability and speed constraints. The design must handle billions of daily orders from tens of thousands of concurrent users while guaranteeing 99.993% availability and ensuring response times measured in microseconds rather than milliseconds. Understanding the distinction between limit orders, which specify a maximum or minimum price, and market orders, which execute immediately at current market prices, forms the foundation for comprehending how brokers and institutional clients interact with the platform. Market data visibility is stratified into multiple levels, with L1 providing basic bid-ask information, L2 offering order depth, and L3 delivering complete order book transparency, all communicated through standardized protocols like FIX to ensure interoperability across trading firms. The architecture separates into three distinct flows: the latency-critical trading path where orders traverse from the Client Gateway through the Order Manager for risk assessment and fund verification, then through the Sequencer for ordering, and finally into the Matching Engine; the Market Data flow distributing price updates; and the Reporting flow handling post-trade analytics. Achieving sub-millisecond performance requires radical optimization techniques, including co-locating all trading-path components on a single physical server to eliminate network delays. Within this server, components communicate through shared memory mechanisms using memory-mapped files, reducing end-to-end latency to tens of microseconds by avoiding disk and network bottlenecks entirely. Single-threaded application loops pinned to dedicated CPU cores eliminate context switching overhead and contention. The system employs event sourcing, where the Sequencer assigns unique identifiers to every order and execution, creating an immutable event log that enables deterministic state recovery and fault tolerance through redundant instances and Raft-based consensus protocols. Order book operations—placement, matching, and cancellation—achieve constant time complexity through specialized data structures combining doubly-linked lists and maps, typically using FIFO matching semantics. Operational resilience addresses denial-of-service threats through layered security defenses and ensures fair market data distribution via reliable multicast, maintaining system integrity under adversarial conditions.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥