Chapter 12: Digital Wallet System Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we're tackling a design problem that sits right at the intersection of massive scale and absolute financial integrity.

We really are.

We're looking at how you build the backend for a high -volume, high -stakes digital wallet service.

I mean, a system that has to be perfect while handling just incredible traffic.

Yeah, we're diving into the blueprint for a system that's designed to handle cross -wallet balance transfers.

So think of it like moving money directly from wallet A to wallet C all inside the same platform.

Simple enough on the surface.

The devil is, as always, in the details, specifically the Okay, let's unpack those because they really set the stage for why this is such a monster of a design challenge.

Right.

So functionally, it's simple.

Support the transfer.

Performance -wise, it's astronomical.

We have to handle 1 million transactions per second.

A million TPS.

And then you have the non -functionals.

Absolute transactional guarantees,

reliability of at least 99 .99%, and the big one, reproducibility.

That's the auditor's dream, isn't it?

It is.

It means you have to be able to fully reconstruct the historical balance of any account at any point in time just by replaying the source data.

So no questions are left unanswered.

Exactly.

It's what turns a simple ledger into a verifiable, auditable source of truth.

If there's a problem, the system itself can prove what happened.

The scale is what just jumps out at me first, though.

1 million transfers a

And every single transfer needs two database operations, right?

A deduct and a deposit.

So we're actually looking at 2 million operations per second hitting the system.

That's the back of the envelope math.

And if you think about a standard database node that handles maybe 1 ,000 TPS, you'd need 2 ,000 database nodes just to handle the load.

That's not just an engineering problem.

That's a massive budget problem.

Optimization is not optional here.

It's mandatory.

So before we even get into architectures, the sources point out,

really crucial API detail.

Okay.

For that single POSD endpoint balance transfer, the amount has to be passed as a string.

A string, not a number.

Why is that?

It's standard procedure in finance.

You completely avoid any potential for those weird floating point arithmetic errors.

The precision loss.

Exactly.

When you're dealing with real money, absolute precision is completely non -negotiable.

Okay, got it.

Precision over convenience.

So let's get into the first real attempt at this scale.

The in -memory sharding solution.

Right, this is all about speed.

You store all the account balances in a sharded key value store, something like Redis.

You partition it by hashing the account ID.

Zookeeper keeps track of where everything is, and the wallet service itself is stateless, so it's easy to scale out.

That definitely sounds fast.

Everything's in memory.

But my first worry is atomicity.

The sender and receiver are almost certainly on two different totally independent Redis shards.

And that is the fatal flaw.

It fails the correctness requirement immediately.

How so?

A single transfer has to update two different Redis nodes.

If your wallet service process crashes after it deducts money from Redis A, but before it credits Redis C, money is just gone.

It vanishes from the system.

Total failure.

So we need transactional guarantees.

The obvious next step is to replace those speedy Redis nodes with sharded relational databases, something with ACID properties.

Now we're in the world of distributed transactions.

We have to make sure that the debit on database A and the credit on database C either both happen or neither of them do.

And the sources get into two main ways to do that, starting with the classic low level one, two -phase commit,

2PC.

Right.

With 2PC, the wallet service is the coordinator.

In phase one, it sends a prepare command.

Prepare.

It tells the databases to lock the resources and get ready to commit.

Then in phase two, it sends the final commit or abort command.

Okay.

So that sounds like it guarantees atomicity.

What's the catch?

The catch is performance and reliability.

It's incredibly slow because those databases hold locks for a really long time, just waiting.

It kills throughput.

Kills it.

And the coordinator itself is a single point of failure.

If it dies after the prepare phase, the databases stay locked forever.

So 2PC is too slow and way too brittle for this scale.

Which brings us to the more modern compensation -based approach.

Try, Confirm, Cancel, TCC.

Okay.

How does TCC change things?

It completely changes when locks get released.

In the first phase, the try phase, you execute a complete local transaction.

A full transaction.

Yes.

So for a transfer from A to C, A's database just does the deduction locally and immediately releases its lock.

And C's database.

C's database gets what's called a NOP, a no operation.

It does nothing.

Well, it acts as a placeholder.

It logs the incoming request, but it does not credit the balance yet.

And this is the key insight because A has been debited, but C hasn't been credited.

The total amount of money in the system is wrong.

For a moment, yes.

And we call this the unbalanced state.

I see the trade -off.

You get much higher throughput because the locks are short -lived, but you have to manage this temporary application level imbalance.

Exactly.

If all the try operations succeed, phase two is confirmed, and C's database finally executes the credit.

If anything fails, phase two is cancel.

And cancel is just another transaction to put the money back into A's account.

Precisely.

A compensating transaction.

What about resilience?

What if the coordinator crashes mid -flight?

You need a phase status table.

It tracks a transaction's progress.

So if the coordinator restarts, it knows exactly where it left off and what to do next.

Okay, but what about really tricky edge cases?

What if a cancel command shows up before the try command it's supposed to be cancelling because of network delays?

That's a great question, and it's a huge complication.

If the cancel arrives first, the database has to be smart enough to handle it.

It records an out -of -order flag.

So it sees the cancel for a transaction it's never seen before and just makes a note of it.

Right.

And then when the delayed try command finally shows up, the database checks that flag, sees the marker, and just immediately fails the request.

Wow.

So you prevent the debit from ever happening?

That's some complex logic to build into the application layer.

It is.

So now we have atomicity, but let's shift to that other critical requirement.

Auditability.

Reproducibility.

This is where event sourcing comes in.

Okay.

So we're moving away from just updating the current state of a balance.

We are.

With event sourcing, instead of saving the final balance, we say the immutable sequence of every single change that ever happened to that balance.

Let's break that down.

There are four key terms here.

First up, the command.

The command is the intent.

It's the request from the outside world.

Transfer $50 from X to Y.

These go into a queue like Kafka.

But a command is just a request.

Why do we also need an event?

Because the event is a validated immutable fact.

The command can be rejected, but an event is the result of a command that passed all the business rules.

It's always in the past tense.

$50 was debited from X.

Got it.

And these events are deterministic, no randomness.

None.

And applying these events is what generates the state.

The state being the current account balance.

Right.

The state is just the current view.

It's transient.

You can always regenerate it from the events.

And the thing doing that work is the final piece, the state machine.

The state machine is the core logic.

It does two things.

It validates commands to produce events, and it applies those events to update the state.

And the huge advantage here is that the event list is now the absolute source of truth.

The legal ledger.

That's it.

If an auditor asks what the balance was at 2 p .m.

last Tuesday, you just replay the events up to that timestamp.

You can prove it.

That's powerful.

But if the system is just writing this long stream of events, how does a client actually check their balance quickly?

We use a pattern called command query responsibility segregation.

The idea is you separate your write path from your read path.

The event sourcing system handles the complex writes.

For reads, we publish the events to external read -only state machines.

And those external machines can create different views of the data, optimized for specific queries.

Exactly.

One view for super fast balance lookups, another for a full audit trail.

But that decoupling creates a lag, right?

This means it's an eventually consistent architecture.

It is.

The read -only views will always be a tiny bit behind the write path.

It's a necessary trade -off for this kind of scale.

Okay, a necessary choice.

So let's talk about performance because that 1 million TPS requirement is still hanging over us.

How do we make this event sourcing model faster than a remote Kafka and database setup?

We bring everything local.

High performance event sourcing.

We store the commands and events on local disk in an append -only file.

Sequential writes are incredibly fast.

And to make that even faster, the source is mentioned using memory mapped I .O.

or a map?

Yes, maps the disk file directly into the application's memory space.

The OS kernel handles caching everything, which bypasses a lot of system calls and really speeds things up.

And for the local state, the balances themselves?

We use a file -based database like RocksDB or Squilite.

RocksDB uses an LSM tree structure.

Why is that a good fit here?

An LSM tree or log -structured merge tree is designed specifically to take tons of small random write operations and turn them into big sequential writes on disk.

Which is much, much faster.

Exponentially faster.

It's perfect for handling a million tiny updates a second on one machine.

Okay, so we're fast now.

And to make restarts quicker, we also add snapshots.

Right.

Periodically, you save the entire current state as a snapshot file.

When you restart, you don't replay events from the very beginning.

You just load the last snapshot and replay from there.

Much faster recovery.

But we've just created another huge problem.

We have.

By making everything local, that one server is now stateful.

It's a massive single point of failure.

If it dies, we lose everything.

So we have to secure that state.

And the most critical data to replicate is the event list.

If you have that, you can rebuild everything else.

How do we replicate it reliably?

Through consensus -based replication.

We use an algorithm like Raft.

Raft.

We use Raft to synchronize the event list across a small group of nodes.

Usually three or five.

It guarantees the list is identical in the same order on all of them.

As long as the majority of nodes are healthy, the system is available.

Okay, so we're reliable.

We're fast.

But what about responsiveness?

We still have that eventual consistency lag from CQRS.

How do we make the user feel like the transfer was instant?

We get clever.

We use a reverse proxy.

The client talks to the proxy.

We then modify the read -only state machine, which is a follower in the Raft group, to push the latest status back to the proxy as soon it processes an event.

Oh, that's smart.

So it masks the lag.

The system is still asynchronous underneath.

But from the user's perspective, they get what feels like a real -time confirmation.

Exactly.

Now, for the final leap to get to a million TPS, we can't do it with just one Raft group.

We need distributed event sourcing.

Meaning we shard the data, we create multiple independent Raft node groups or partitions, each handling its own slice of the accounts.

And because a transfer from A to C will now likely cross two different partitions, we have to bring back a distributed transaction manager, like TCC or, more commonly, the SAGA model.

Okay, let's walk through that final workflow.

A transfer from A to C using the SAGA model across two partitions.

Right.

The SAGA coordinator gets the command.

First thing it does is create a status record for the transaction in its own table.

Step one.

Then what?

It sends the A $1 command to the leader of partition one.

Partition one does its whole local process.

Command validation, event creation, Raft consensus, state update.

And then its read path pushes the success status back to the SAGA coordinator.

The coordinator gets that, updates its status table, and only then does it start the second part.

Correct.

It then sends the C plus $1 command to the leader of partition two.

And partition two does the same dance.

The exact same dance.

Command, event, Raft, update, and a status pushback.

Once the coordinator gets that final confirmation for partition two, the whole distributed transaction is done, and it can finally send a success response to the client.

And that sequential nature is really the core difference between SAGA and TCC.

It is.

SAGA executes things one after the other.

TCC can run its try phase in parallel, hitting both partitions at the same time.

So TCC might have lower latency.

It can.

But SAGA is often seen as simpler to manage for rollbacks, especially when you have many, many services involved.

You just execute the compensating actions in reverse sequential order.

What an incredible journey.

I mean, we started with a fast but totally flawed in -memory solution.

Then we added distributed transactions for atomicity, but realized we were missing that crucial historical trace.

Which pushed us to event sourcing for that auditability and reproducibility.

We then optimized it for insane performance with local files, map, and LSM trees.

And finally, we made it reliable with raft consensus, made it feel responsive with the reverse proxy push model, and orchestrated the whole distributed system with SAGA.

It's just a masterclass in layering complexity to solve competing needs.

You need the speed of a distributed system, but the absolute certainty of a financial one.

And that actually raises a really interesting final question for you to think about.

We deliberately chose this eventually consistent architecture with a complex coordinator like SAGA.

So given that a single transfer has to go through a coordinator, then a raft group, then back, then to another raft group,

how much latency is actually acceptable before a user stops perceiving that confirmation is instant?

And where do all these design choices force us to compromise on true real -time feedback?

That is a great thought to chew on as you review these concepts.

Thank you for sharing your sources and letting us take this deep dive with you today.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Building a digital wallet system that processes one million transactions per second while guaranteeing atomicity and maintaining 99.99 percent uptime presents fundamental architectural challenges that require moving beyond naive in-memory approaches. Initial solutions using Redis sharding with centralized coordination through ZooKeeper fail because transferring money between accounts requires two independent state updates, either of which could fail partway through, leaving the system in an inconsistent state. Distributed transaction protocols solve this problem by ensuring all-or-nothing semantics across multiple database nodes. Two-Phase Commit represents the foundational approach, using locks during a prepare phase before committing, but introduces severe performance bottlenecks and creates a single point of failure in the coordinator process. Application-level protocols like Try-Confirm/Cancel and Saga provide better scalability by treating each phase as an independent transaction and explicitly handling compensating transactions when failures occur, though both require business logic to manage undo operations and state tracking. Event sourcing transcends traditional transaction mechanisms by treating the system as an immutable append-only log of all state changes, enabling perfect reproducibility and auditability since any historical state can be reconstructed by replaying events through a deterministic state machine. To optimize this architecture for both latency and throughput, Command-Query Responsibility Segregation separates write operations that generate events from read operations that query pre-computed views, allowing independent scaling of each path. Persistence efficiency comes through memory-mapped files and RocksDB, technologies that maximize I/O throughput, while periodic snapshots reduce reconstruction time by eliminating the need to replay the entire event history. Reliability and fault tolerance are achieved by replicating the event log across a cluster using the Raft consensus algorithm, which guarantees correctness as long as a majority of nodes remain available. Horizontal scaling is accomplished by partitioning the event log and account data across multiple node groups, with distributed transaction protocols like Try-Confirm/Cancel coordinating transfers that span partitions. Asynchronous processing patterns and reverse proxies further reduce latency, allowing the system to acknowledge requests before events are fully persisted while maintaining eventual consistency guarantees.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥