Chapter 11: Payment System Architecture

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

If you've ever bought anything online, you know that moment of truth, right?

Clicking place order.

Today, we are pulling back the curtain on the complex engineering that makes that single moment possible.

Yeah.

We are tackling the design of a highly reliable, scalable, and flexible payment system, basically the financial engine for a global e-commerce giant like Amazon.

It's one of those systems that is completely invisible when it works, but when it fails, it's just catastrophic.

Payment systems are, I think, universally intimidating because the stakes just couldn't be higher.

A simple logic error, a dropped message, it's not just a bug.

It means massive immediate revenue loss, and it completely destroys trust.

So let's define the perimeter for you.

Based on the requirements we've pulled from our source material, we're designing the payment back end only.

Right.

Not the front end.

Exactly.

We're not touching the UI, and critically, we aren't directly processing credit cards.

We're offloading that core complexity to third-party payment service providers, or PSPs, the likes of Stripe or PayPal.

And that's a really strategic choice.

It means our system follows a vital security rule.

We never touch, never see, and never ever store sensitive credit card data.

Which frees us up.

It frees us up completely to focus on what actually matters for us, financial correctness and reliability.

Okay.

So what's our target volume?

It's about one million transactions per day.

Yeah, which sounds like a lot, but it shakes out to only about 10 transactions per second.

So not a huge throughput problem.

No, it's a gift, really.

It means we don't need some exotic high throughput solution.

We need bulletproof, reliable components that just get the data integrity right every single time.
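As a quick check on that arithmetic, here is a minimal Python sketch. It assumes traffic is spread perfectly evenly across the day, which real traffic is not, so peak load would sit above this average. The name `average_tps` is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_tps(transactions_per_day: int) -> float:
    """Average transactions per second, assuming perfectly even traffic."""
    return transactions_per_day / SECONDS_PER_DAY


tps = average_tps(1_000_000)
print(f"{tps:.1f} transactions per second")  # 11.6 transactions per second
```

So "about 10" is a fair round number: the average is a little under 12 per second.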

Functionally, the system has to support two major movements of money.

Right.

The pay-in flow, which is money from the customer to the seller, held by the site.

And the payout flow.

And the payout flow, where the e-commerce site finally pays those sellers on its platform.

And then on the non-functional side, we're looking at fault tolerance, reliability, and what I like to call the financial detective work.

The reconciliation process.

Exactly.

It's mandatory for handling all the inevitable inconsistencies between our system and those external PSPs.

So once we define the perimeter, the next step is mapping out the core architectural components that handle these high level flows.

Let's start with the pay-in flow.

Who are the key players here?

So internally, the most critical component is the payment service.

You can think of it as the brain.

The brain.

Okay.

It accepts that initial payment event, does some initial risk checks, like for anti-money laundering, and then it just orchestrates the entire process.

Right.

And then sitting right next to it is the payment executor.

Why can't the payment service just execute the payment itself?

Why split them up?

It's all about decoupling.

The payment service handles the big picture, the whole checkout event.

But the payment executor, that handles a single payment order.

And this is so important because one checkout event from a user might have orders for ten different sellers.

Oh, I see.

And each of those needs to be a separate, trackable order, even if it's all part of one card transaction.

So outside our own system, we've got the external players: the PSP, like Stripe, which talks to the banks, and the card schemes.

Yeah, the card schemes, the big ones, Visa, MasterCard, they're the organizations that process the whole credit card operation end to end.

Got it.

And then internally, we need our sources of truth.

Our financial truth tellers, yeah.

You have the wallet, which is pretty simple, it just tracks the current account balance for every merchant.

And then the big one, the ledger.

And the indispensable ledger.

This is the immutable record keeper.

It enforces that foundational double entry principle.

Where every debit has an equal credit.

Exactly.

So, the sum of all transaction entries always has to equal zero.

It guarantees financial accountability.
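To make the double-entry rule concrete, here is a minimal in-memory sketch in Python. The sign convention (negative for debit, positive for credit) and the names `LedgerEntry`, `record_transfer`, and `is_balanced` are assumptions for illustration, not something the source specifies. A real ledger would be an append-only database table.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class LedgerEntry:
    account: str
    amount: Decimal  # assumed convention: negative = debit, positive = credit


def record_transfer(ledger, debit_account, credit_account, amount):
    """Append one debit and one matching credit; entries are never edited."""
    value = Decimal(amount)
    if value <= 0:
        raise ValueError("transfer amount must be positive")
    ledger.append(LedgerEntry(debit_account, -value))
    ledger.append(LedgerEntry(credit_account, value))


def is_balanced(ledger) -> bool:
    """The double-entry invariant: all entries sum to exactly zero."""
    return sum((entry.amount for entry in ledger), Decimal("0")) == 0


ledger = []
record_transfer(ledger, debit_account="buyer", credit_account="seller", amount="25.00")
print(is_balanced(ledger))  # True
```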

Okay, let's trace the data flow here, because this sequence feels complex.

The user clicks buy, that generates the payment event.

The payment service stores it, splits it up into payment orders, and sends those to the payment executor.

Right.

The executor stores that individual order, and then, and this is the key step, it calls the external PSP to actually execute the charge.

And once the PSP says success.

The internal synchronization kicks off. The payment service immediately updates the wallet, so the seller's balance goes up.

And at the same time, it's updating the ledger.

Yes, exactly.

It calls the ledger system to append that new, unchangeable transaction record.

And only when both the wallet is updated and the ledger is committed can the whole event be marked as complete.

So quickly then, the payout flow, us paying the seller, is it really just the reverse of all that?

Architecturally, pretty much, yeah.

It mirrors the flow.

But it usually involves different specialized external services.

Instead of Stripe for money in, the payout flow might use a provider like Tipalti to move funds out of our bank account into the seller's.

Okay, so we get the high level movement of money.

Now how does the rest of our system actually kick off that workflow?

This brings us to the API.

Right.

The main entry point is a standard RESTful API, POST /v1/payments.

It takes a unique checkout ID and a list of all those payment orders.

And this is where we hit a really interesting finance-driven technical requirement.

In the data model, the amount field must be a string, not a double.

Absolutely.

Or an integer of the smallest currency unit, like cents.

Never ever a floating point number.

I know this is a huge deal, but can you break down for everyone why it's so non-negotiable?

In finance, if it's money, you treat it like text.

Floats kill accuracy.

You just have to avoid those unintended rounding errors.

Different systems, different languages, even different CPUs can handle floating point precision slightly differently.

A tiny error, a fraction of a cent, times millions of transactions, that's a catastrophe legally and financially.

A string guarantees the exact value is kept end to end.
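A short Python sketch shows the problem and both safe representations mentioned here. The helper `to_minor_units` and its default of two decimal places are illustrative assumptions; some currencies use zero or three decimal places.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Parsing the amount from a string keeps the exact decimal value.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a string amount such as '19.99' to integer cents.

    Rejects amounts with more precision than the currency allows
    rather than rounding them silently.
    """
    scaled = Decimal(amount).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"too many decimal places: {amount}")
    return int(scaled)


cents = to_minor_units("19.99")
print(cents)  # 1999
```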

That makes perfect sense.

So looking at our data model, we have the payment event table and the payment order table.

The payment order ID is globally unique, and you said this has a critical second job.

It does.

That unique payment order ID gets reused when we talk to the external provider.

The PSP then uses that ID as its deduplication ID.

Or its idempotency key.

Or its idempotency key, exactly.

This is our very first defense against double charging.

It makes sure that even if we send the same request twice by mistake, the PSP only processes it once.
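Here is a rough sketch of that behavior from the provider's side. `FakePSP` is a hypothetical in-memory stand-in, not any real provider's API; actual PSPs each have their own way of accepting an idempotency key.

```python
class FakePSP:
    """Hypothetical stand-in for a payment provider (real APIs differ)."""

    def __init__(self):
        self._results = {}      # idempotency key -> result of the first attempt
        self.charges_made = 0   # how many times money actually moved

    def charge(self, idempotency_key: str, amount: str, currency: str) -> dict:
        # A key we have seen before: return the stored result, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"status": "success", "amount": amount, "currency": currency}
        self._results[idempotency_key] = result
        return result


psp = FakePSP()
# The payment order ID doubles as the idempotency key.
first = psp.charge("payment-order-123", "19.99", "USD")
retry = psp.charge("payment-order-123", "19.99", "USD")  # accidental resend
print(psp.charges_made)  # 1
```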

Okay, so we established that 10 TPS is pretty modest, yet the source material is very firm on this.

Use a traditional relational database with ACID support, not NoSQL, not NewSQL.

Why stick with the old tech?

Because integrity beats scale here.

For financial records, you need guaranteed transactional consistency.

You need ACID: atomicity, consistency, isolation, durability.

A classic.

They're proven over decades.

The headache that you would get from eventual consistency, which is common in NoSQL, is just not worth any small gain in throughput.

We are all in on correctness and auditability.

A relational database gives us that, and it has decades of production use behind it.

Let's move into integrating with those PSPs.

We said companies do everything to avoid storing credit card info to dodge that PCI DSS compliance nightmare.

How do they do that technically?

Through the hosted payment page model.

So we either redirect the customer, or we embed an iframe on our site.

The key thing is, the sensitive card data is entered directly onto the PSP's page.

So it never even touches our servers?

Never.

The data never passes through our system.

We just don't handle the liability.

Okay, let's trace that asynchronous flow.

It sounds like a bit of a dance.

It is.

It starts with our payment service sending a registration request to the PSP.

That request carries a UUID we call a nonce.

To make sure the registration itself only happens once?

Exactly.

The PSP then sends back a temporary token, which our payment service stores.

And then the customer's browser uses that token to load the PSP's secure page.

They enter their details, the payment happens on the PSP's end, and they get sent back to our site.

Right.

But we can't trust that browser redirect for the final status.

Of course not.

The definitive authoritative status comes later, asynchronously, via a webhook.

It's a URL we pre -registered, and the PSP calls it to tell our back end the final result.

This is key because it stops our system from just stalling and waiting.
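The dance described above can be sketched as a small state holder. Everything here is a simplified assumption: `HostedPageFlow`, the status names, and the `psp_register` callback are invented for illustration, and a real webhook handler would also verify that the call genuinely came from the PSP.

```python
import uuid


class HostedPageFlow:
    """Toy model of registration, browser redirect, and webhook."""

    def __init__(self):
        self.payments = {}   # nonce -> {"token": ..., "status": ...}
        self.by_token = {}   # token -> nonce

    def register(self, psp_register):
        """Send a registration request carrying a fresh nonce; store the token."""
        nonce = str(uuid.uuid4())
        token = psp_register(nonce)
        self.payments[nonce] = {"token": token, "status": "PENDING"}
        self.by_token[token] = nonce
        return nonce, token

    def on_browser_redirect(self, token):
        """The redirect is not trusted: report what we know, change nothing."""
        return self.payments[self.by_token[token]]["status"]

    def on_webhook(self, token, final_status):
        """The webhook is authoritative; only a pending payment is updated."""
        record = self.payments[self.by_token[token]]
        if record["status"] == "PENDING":
            record["status"] = final_status
        return record["status"]


flow = HostedPageFlow()
nonce, token = flow.register(lambda n: "tok-" + n)  # stand-in for the PSP call
after_redirect = flow.on_browser_redirect(token)
after_webhook = flow.on_webhook(token, "SUCCESS")
after_duplicate = flow.on_webhook(token, "FAILED")  # repeated delivery is ignored
print(after_redirect, after_webhook, after_duplicate)  # PENDING SUCCESS SUCCESS
```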

So what happens when this nine-step process with all these external parties goes wrong?

What's the final safety net?

That is where reconciliation comes in.

It's that indispensable mandatory financial detective work.

It's a periodic comparison.

Usually the PSP or our bank sends us a settlement file every night.

This file has the complete record of every single money movement.

And we compare it to our own records.

Our reconciliation service parses that file and compares it line by line against our internal ledger.

A finance colleague once told me that the reconciliation file is where technical elegance goes to die.

Sounds tedious.

It is, because you never assume the external system is right, but you also don't assume your own system is perfect.

When mismatches are found, and they will be found, they get categorized.

Can we fix it automatically?

Does it need a human to look at it?

Or is it something totally weird that needs a deep investigation?
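A minimal sketch of that comparison, in Python. The CSV layout, the column names `payment_order_id` and `amount`, and the mismatch labels are assumptions; real settlement files vary by provider. The labels below describe the kind of mismatch found, and deciding whether each one is auto-fixable or needs a human would be a later step.

```python
import csv
import io
from decimal import Decimal


def reconcile(ledger: dict, settlement_csv: str) -> list:
    """Compare our ledger amounts against a settlement file, ID by ID."""
    settled = {
        row["payment_order_id"]: Decimal(row["amount"])
        for row in csv.DictReader(io.StringIO(settlement_csv))
    }
    mismatches = []
    # Walk the union of IDs so neither side is assumed to be complete.
    for order_id in sorted(ledger.keys() | settled.keys()):
        ours, theirs = ledger.get(order_id), settled.get(order_id)
        if ours is None:
            mismatches.append((order_id, "missing_in_ledger"))
        elif theirs is None:
            mismatches.append((order_id, "missing_in_settlement"))
        elif ours != theirs:
            mismatches.append((order_id, "amount_mismatch"))
    return mismatches


our_ledger = {
    "po-1": Decimal("10.00"),
    "po-2": Decimal("5.50"),
    "po-3": Decimal("7.00"),
}
settlement = "payment_order_id,amount\npo-1,10.00\npo-2,5.05\npo-4,3.00\n"
mismatches = reconcile(our_ledger, settlement)
for order_id, kind in mismatches:
    print(order_id, kind)
```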

OK, so if we have a payment that gets stuck, maybe in a risk review for a whole day, how does the system handle that without blocking everything else?

Well, because we offload that user interaction to the PSP's hosted page, the PSP is the one that manages that pending status.

Our system just has the initial token.

We don't resolve the payment until that asynchronous webhook finally arrives.

The whole design is meant to handle that external latency.

OK, let's talk internal communication.

We have all these services.

Payment service, ledger, wallet.

Should we just use fast, synchronous HTTP calls between them?

You should really try to minimize synchronous calls.

HTTP is simple, sure, but it leads to poor failure isolation and tight coupling.

If the ledger service is slow, the whole payment service just stalls.

So async is the way to go?

Asynchronous communication, yeah, through message queues.

It's essential for scale and for resilience.

We're talking specifically about a pattern called multiple receivers.

Multiple receivers.

Exactly.

We use a distributed message bus like Kafka.

The payment service publishes one single message saying a payment was successful.

And that one message gets consumed independently by lots of other services, analytics, billing, the wallet service.

If the analytics service is down, who cares?

The wallet still gets updated.
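The failure isolation described here can be illustrated with a toy in-memory bus. This is not how Kafka works internally (Kafka consumers pull from a durable log at their own pace); `InMemoryBus` and the handler names are hypothetical and only show that one failing receiver does not block the others.

```python
class InMemoryBus:
    """Toy publish/subscribe bus: every subscriber gets every message."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> handler function

    def subscribe(self, name, handler):
        self.subscribers[name] = handler

    def publish(self, message):
        """Deliver to each subscriber independently; return who failed."""
        failed = []
        for name, handler in self.subscribers.items():
            try:
                handler(message)
            except Exception:
                failed.append(name)  # one receiver failing does not stop the rest
        return failed


wallet_balances = {}


def update_wallet(event):
    seller = event["seller"]
    wallet_balances[seller] = wallet_balances.get(seller, 0) + event["amount_cents"]


def broken_analytics(event):
    raise RuntimeError("analytics is down")


bus = InMemoryBus()
bus.subscribe("analytics", broken_analytics)
bus.subscribe("wallet", update_wallet)
failed = bus.publish({"seller": "seller-1", "amount_cents": 1999})
print(failed, wallet_balances)  # ['analytics'] {'seller-1': 1999}
```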

And to handle the inevitable failures, we use special queues.

There's the retry queue.

The retry queue is for errors we think are temporary, you know, like a network timeout.

And if it keeps failing.

Then it lands in the dead letter queue, the DL queue.

Messages in the DL queue are a problem.

They need a human to look at them.

They represent some kind of permanent failure or a bug that needs to be fixed.
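One way to picture the two queues is the sketch below. The limit of three attempts, the `TransientError` class, and the rule that any other exception goes straight to the dead letter queue are all assumptions made for the example.

```python
from collections import deque


class TransientError(Exception):
    """An error we expect to clear up on its own, such as a network timeout."""


def process_with_retries(messages, handler, max_attempts=3):
    retry_queue = deque((message, 1) for message in messages)
    dead_letter_queue = []
    done = []
    while retry_queue:
        message, attempt = retry_queue.popleft()
        try:
            handler(message)
            done.append(message)
        except TransientError:
            if attempt < max_attempts:
                retry_queue.append((message, attempt + 1))  # try again later
            else:
                dead_letter_queue.append(message)  # retries exhausted
        except Exception:
            dead_letter_queue.append(message)  # non-retryable, e.g. a bug
    return done, dead_letter_queue


attempts = {}


def demo_handler(message):
    attempts[message] = attempts.get(message, 0) + 1
    if message == "flaky" and attempts[message] < 3:
        raise TransientError("timeout")
    if message == "down":
        raise TransientError("still timing out")
    if message == "bug":
        raise ValueError("malformed message")


done, dlq = process_with_retries(["flaky", "down", "bug"], demo_handler)
print(done, dlq)  # ['flaky'] ['bug', 'down']
```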

OK, let's get to the twin pillars of reliability.

How do we enforce exactly once delivery?

Because we absolutely cannot double charge a customer.

All right.

So we achieve exactly once by solving two separate problems.

You need at least once execution, which we solve with retries.

And you need at most once execution, which we solve with idempotency.

And for retries, we can't just slam a struggling service over and over.

No, you have to use exponential backoff.

It's critical for system health.

Instead of retrying instantly, you double the wait time after each failure.

Wait one second, then two, then four, and so on.

It prevents a cascading failure where our own retries crash a service that's already in trouble.
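The doubling schedule is simple to sketch. The 60-second cap, the choice of `ConnectionError` as the retryable error, and the function names are assumptions for illustration; many production systems also add random jitter so that clients do not all retry in lockstep.

```python
import time


def backoff_delays(max_retries, base_seconds=1.0, cap_seconds=60.0):
    """Waits of 1s, 2s, 4s, ... doubling after each failure, up to a cap."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(max_retries)]


def call_with_backoff(operation, max_retries=4, sleep=time.sleep):
    """Call operation, waiting longer after each ConnectionError."""
    for delay in backoff_delays(max_retries):
        try:
            return operation()
        except ConnectionError:
            sleep(delay)
    return operation()  # final attempt; any error now propagates to the caller


waits = []          # record the waits instead of really sleeping
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "ok"


result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)       # ok [1.0, 2.0]
print(backoff_delays(5))   # [1.0, 2.0, 4.0, 8.0, 16.0]
```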

OK, that covers at least once.

Now for the final lock, idempotency.

Idempotency is the at most once guarantee.

It just means that doing the same operation multiple times gives you the same result as doing it the first time.

How does that work in practice?

Our client generates that unique idempotency key, a UUID, and sticks it in the request header.

So if I'm a customer and I get impatient and double-click the pay button, two identical requests hit the payment service, both with the same key.

What happens?

The system sees the key.

It looks it up in the database, often just using a unique key constraint, which is super fast, and it realizes this request is either in progress or already done.

So instead of charging you again, it just short circuits and returns the status of the first attempt.
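The unique-key lookup can be sketched with Python's built-in `sqlite3` module. The table name, column names, and status values are assumptions, and the actual call to the PSP is left as a comment; the point is only that the database constraint, not application logic, rejects the duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_request ("
    " idempotency_key TEXT PRIMARY KEY,"  # the unique key constraint
    " status TEXT NOT NULL)"
)


def handle_payment(conn, idempotency_key):
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_request (idempotency_key, status) VALUES (?, ?)",
                (idempotency_key, "IN_PROGRESS"),
            )
    except sqlite3.IntegrityError:
        # Key already exists: short-circuit and return the first attempt's status.
        row = conn.execute(
            "SELECT status FROM payment_request WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return {"duplicate": True, "status": row[0]}

    # ... the charge via the PSP would happen here ...
    with conn:
        conn.execute(
            "UPDATE payment_request SET status = ? WHERE idempotency_key = ?",
            ("SUCCESS", idempotency_key),
        )
    return {"duplicate": False, "status": "SUCCESS"}


first_click = handle_payment(conn, "3f1c-example-key")
second_click = handle_payment(conn, "3f1c-example-key")
row_count = conn.execute("SELECT COUNT(*) FROM payment_request").fetchone()[0]
print(first_click, second_click, row_count)
```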

Our final topic, then, is maintaining system consistency across all these different services, both internal and external.

Internally, consistency comes from that exactly once processing we just talked about.

Externally, with the PSP, it relies on idempotency and, critically, on that mandatory reconciliation process.

You can never, ever just trust the external service implicitly.

And what if we replicate our database for more read capacity?

How do we handle that lag between the primary and the replicas?

It's a hard trade-off.

The simple way is you sacrifice some scale and you serve all reads and writes only from the primary database.

And the complex way?

The complex but more robust solution is to use distributed databases like YugabyteDB or CockroachDB.

They use consensus algorithms like Paxos or Raft to keep all the replicas tightly in sync.

It's more complex to deploy, but you get high availability and consistency.

Okay, finally, let's just briefly touch on payment security.

Besides not storing card data, what are the other essential layers?

It's mostly standard best practices.

Use HTTPS, obviously, to prevent eavesdropping.

Implement robust rate limiting to fight DDoS attacks.

And even though we use tokens from the PSP, we still need our own systems for tokenization to mask any card numbers we use for reference and, of course, strict adherence to all the PCI DSS standards.

Wow.

That was a truly comprehensive journey.

We mapped the flows, pay in and pay out.

We identified the critical components like the ledger, navigated that asynchronous maze of PSP integration with webhooks, and finally, secured the whole thing with those twin pillars.

Retries and idempotency.

Yeah, and the essential takeaway is still this.

You're dealing with money in a distributed system with unreliable external third parties.

Because of that, technical elegance is only half the story.

Reconciliation isn't just a compliance task.

It is the non -negotiable process that verifies the absolute financial truth of your system.

Absolutely.

Today, we focused heavily on credit cards.

But here's a provocative thought for you to take on the road.

Internationally, in markets like India or Brazil, cash payment systems that use local agents are super common.

How fundamentally would our entire architecture have to change to incorporate cash payments, which have zero credit risk but introduce massive logistical and latency challenges?

That's something for you to mull over until next time.

Thank you for joining us on this deep dive into payment system design.

And a warm thank you from the entire Last Minute Lecture team.

We hope you feel much better informed.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary
What this audio overview covers
Designing a payment system for high-volume e-commerce platforms means prioritizing correctness and financial integrity over raw performance metrics. With anticipated transaction volumes reaching one million daily operations, the primary architectural concern shifts from throughput optimization to ensuring that every monetary movement is accurately recorded and recoverable.
The system separates two distinct flows: the pay-in mechanism that aggregates customer funds into a platform bank account, and the pay-out flow that distributes earnings to merchants through external providers. Core internal services orchestrate these flows: the Payment Service manages workflow coordination, the Payment Executor processes individual transactions, the Wallet maintains current merchant balances, and the Ledger functions as an immutable audit trail using double-entry accounting principles. External Payment Service Providers act as intermediaries to Card Schemes, handling the actual fund transfers while the platform keeps its distance from sensitive cardholder data through tokenization and PSP-hosted collection interfaces, which greatly reduces its PCI DSS compliance burden.
Asynchronous communication patterns using message queues enable loose coupling between services and allow payment events to trigger multiple downstream effects simultaneously without creating bottlenecks or failure cascades. The fundamental challenge of preventing duplicate charges necessitates exactly-once delivery semantics, accomplished by combining at-least-once retry guarantees with idempotency mechanisms that render repeated requests harmless through unique identifiers passed to both external systems and internal databases. Retry strategies employing exponential backoff manage transient failures gracefully, while dead letter queues isolate persistently failing messages for manual investigation. ACID-compliant relational databases provide transactional consistency for critical financial records.
Reconciliation processes periodically compare internal ledger states against settlement files from banks and payment processors, enabling detection and resolution of discrepancies between system records and actual fund movements. Additional security hardening includes HTTPS encryption, rate limiting to prevent abuse, and comprehensive state tracking that maintains visibility into every transaction's journey through the system.
