Chapter 8: Distributed Email Service Design

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement not replaced the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today, we are tackling a system so massive, so ubiquitous,

and so complex that most of us just take it for granted.

We really do.

The distributed email service.

We are talking Gmail, Outlook, you know, the kind of system that has to handle a billion users globally.

There's billion, yeah.

This isn't just about sending a message.

It's about designing a civilization -level utility.

That scale is the fundamental constraint here.

I mean, when you start talking about one billion active users, handling their data requires you to immediately abandon any monolithic or, you know, traditional server concepts.

Right.

Our mission today is to break down exactly how you build a modern high -scale email platform following the exact structure an expert system designer would, focusing on the core non -negotiable features.

So sending, receiving, filtering.

Exactly.

Sending, receiving, filtering, near real -time searching, and of course robust anti -spam protection.

And as every good system designer knows, we have to start with the nightmare math because the sheer volume of data is what dictates the architecture.

Let's hit you with the envelope calculation right away.

Okay, let's do it.

So if we assume a large engaged user base,

say a user sends just 10 emails a day.

Which feels low even.

It feels very low, but even with that, it translates instantly to a hundred thousand queries per second, QPS,

just for sending email.

One hundred thousand a second.

That's massive traffic.

It is, but that's not even the big problem.

No, the real killer, the one that forces our hand on distributed architecture, is the storage requirement.

Right.

We are dealing with mail metadata,

the subject, the body, the user configuration, which we estimate averages what, 50 kilobytes per email?

Around there, yeah.

So if a billion users receive 40 emails a day for one single year, you are staring down the barrel of 730 petabytes.

730 petabytes, just for the text and metadata.

And that number is conservative.

That's before we even count the binary file.

The attachments.

The attachments.

So if we assume a

20 % of those emails carry, say,

500 kilobyte attachment, you are stacking on an additional 1 ,460 petabytes.

Wow.

In a single year.

You are very quickly into the realm of multiple exabytes of cold storage.

So if you walk away with nothing else, remember those numbers.

730 PB for metadata and 1 ,460 PB for attachments.

This staggering volume means that from the very first line of our design, we absolutely must use a highly reliable, massively distributed database solution.

The single server running a mailed year system just, well, it simply explodes under this pressure.

Okay.

So to understand the modern system, we have to go back.

We have to appreciate the legacy systems they replaced.

We do.

Email still fundamentally operates on historic protocols.

We all know SMTP, the Simple Mail Transfer Protocol.

This is the classic standard for sending emails between mail servers, right?

Right.

It's the lingua franca of email delivery.

It's reliable, but fundamentally simple in its transactional role.

Okay.

So that's for sending.

What about getting the mail?

For retrieval, we have two older systems, and the difference between them is crucial to understanding scale.

First, there's POP, the Post Office Protocol.

The OP.

I remember that.

It's essentially download and delete.

It's designed for accessing email on one single device because once the client fetches it, it's generally gone from the server.

It's stateless and simple.

And then there's IMAP, the Internet Mail Access Protocol, which completely changed the game and introduced the complexity we deal with now.

Oh, absolutely.

IMAP is the dominant protocol today because it allows you to synchronize your data across multiple devices.

So your phone, your laptop, they all have the same view.

Exactly.

The server must maintain the state of every single mailbox, which emails are read, unread, flagged, or moved to a folder and sync those changes instantly.

And crucially, IMPA is efficient on slow connections, right?

That's a huge benefit.

It only downloads the header information until you actually click and open the message.

So IMPA introduces statefulness and the requirement for complex persistent synchronization.

And that's what broke the old systems.

That's what broke them.

And of course, for modern webmail and high -performance mobile apps, we often use HTTP or HTTPS for the retrieval calls.

Bypassing the old protocols entirely.

Right.

Sometimes utilizing more custom specifications like ActiveSync or JMFP for that really efficient real -time sync.

So how does the email know where to go?

It's not magic.

It's DNS routing.

It's all DNS.

When a server sends an email, it queries the domain name system for the recipient's domain to look up the mail exchanger or MX record.

And that record tells it which server to talk to.

Precisely.

You can imagine the DNS returning a list of servers with priority numbers.

A lower number, like five, means that server is highly preferred for delivery.

If that connection fails, the sending server automatically attempts the next lowest priority server.

So it's got built -in redundancy.

Built -in redundancy, ensuring robust delivery attempts.

We mentioned attachments earlier.

They're not just sent as is, right?

They have to be translated.

Right.

Attachments are sent using Base64 encoding and the MIE specification.

And we enforce size limits, typically 20 MLB or 25 MLB, because trying to move files larger than that through the mail transfer agents becomes really inefficient.

Okay, so this brings us to the breaking point.

Why did the traditional architecture fail at the billion -user scale?

Well, historically, a traditional mail server used local file directories, often structures like mailedier, storing mailboxes in separate files on local disk storage.

That worked fine for a small enterprise, but - At scale, that's where disk .io becomes a nightmare.

Precisely.

Disk .io became the primary bottleneck, locking up the entire system.

Furthermore, relying on local directories couldn't provide high availability or reliability.

If one disk fails - The user loses their data.

Gone.

And most importantly, it simply couldn't support modern features like advanced full -text searching, shared labels, or conversation threading for billions of users.

The solution is mandatory.

We must go distributed and decoupled.

So let's jump into the blueprint for that.

The modern, distributed design.

Visually, what are we looking at?

We're looking at a highly modular system where specialized servers feed into a comprehensive specialized storage layer.

Users don't connect to one big box anymore.

So they connect via standard HTTPS to what?

To the web servers for public -facing tasks, loading a folder, checking settings, or submitting an initial send request via a RESTful API.

But for that IMAP -era instant update experience, the system needs to be stateful.

That's where the real -time servers come in.

They are crucial.

They push new email updates instantly when they arrive.

And they have to maintain a connection to every user.

A persistent, stateful connection with every active user, ideally using WebSocket, with long polling as a potential fallback.

They are the backbone of the instant inbox experience.

Now we need to talk about storing those petabytes.

We can't use one database.

We need four specialized storage components.

Absolutely.

First, the metadata database.

This stores the structure, subject, body text, user configuration.

The bulk of the structured email text goes here.

And second, the attachment store.

Why can't we just stick the attachments in our regular NoSQL database?

Because of the IO implications at scale.

Object storage, like Amazon S3, is designed to efficiently handle very large, static blobs up to 25 MB.

Traditional distributed NoSQL databases like Cassandra have practical blob -size limits, often less than a megabyte.

And more critically, caching massive attachments would consume vast amounts of memory and just choke the IO path.

So you separate the big clunky files from the small, fast metadata.

You decouple the high -throughput metadata from the high -capacity, static binary files.

Third, the distributing cache.

I assume this is a standard Redis or Memcached cluster.

Yes.

It caches the most recent emails and frequently accessed data.

Remember that 82 % recency statistic?

Right.

Users mostly look at recent mail.

This cache absorbs the majority of read traffic, dramatically speeding up client loading times.

And finally, the search store.

Why can't we just query the metadata database for text?

Because searching petabytes of text requires a dedicated engine.

The search store is a distributed document store optimized specifically for full text search using a complex data structure called the inverted index.

It has to be separate.

It has to be.

Its optimization strategy indexing every word is fundamentally different from the structure and transaction needs of the primary metadata DB.

Okay.

Let's follow the life of an email.

Alice clicks send.

This is where the decoupling begins.

The request hits a load balancer, which serves two purposes.

Distributing traffic and applying essential services like basic rate limiting.

To stop denial of service attacks or malicious bulk sends.

Exactly.

Then it goes to the web servers, which validate the message against rules like size limits, attachment types, and recipient formatting.

And here's the first big optimization, what we call the intra -domain shortcut.

If the sender and recipient are on the same platform, Gmail to Gmail, for instance, the web server performs an initial spam virus check and immediately inserts the email directly into the storage layer for both the sender's sent folder and the recipient's inbox.

Bypassing the whole queuing system for internal mail must save incredible resources.

It dramatically reduces latency and avoids unnecessary processing.

However, for external emails, the process goes asynchronous.

If validation succeeds, the email metadata is passed to the outgoing queue.

If there is a large attachment, it's stored first in the attachment store, and only the lightweight reference goes into the queue message.

And invalid emails go to an error queue.

Correct.

The distributed message queue is the secret weapon here.

It provides that crucial resiliency, right?

Absolutely.

The queue decouples the action of clicking send from the heavy work of delivery.

SMTP outgoing workers pull from that queue, perform a final spam virus check, store the final version in the user's sent folder, and finally use SMTP to send the email out.

And if the recipient server is down?

The queue workers handle retries using sophisticated algorithms like exponential backoff, preventing resource overload during temporary network issues.

Now the receiving flow and external email arrives at our perimeter.

It hits the SMTP load balancer, which is focused on connection level acceptance policies, immediately rejecting connections from known bad actors.

Those are the first line of defense.

The very first.

It goes to the SMTP servers, and large attachments are immediately offloaded to S3.

And the processing bottleneck is buffered by the queue again.

Exactly.

The email lands in the incoming email queue.

This ensures that even during a global email surge, the system remains stable.

Then what?

Next, the mail processing workers handle the heavy lifting, running complex algorithms to filter out spam, checking for viruses, and applying user -defined rules.

Once validated, the email is stored in the storage layer.

And if the user is online?

If the real -time surveys detect the user is online, the email is pushed instantly via web socket.

If they are offline, the email simply waits in storage and the client pulls it via a standard RESTful API request when they eventually log back in.

Okay, so we established that the metadata database requires extreme performance.

What are the engineering characteristics we designed around?

Well, high reliability is non -negotiable.

Data operations are mostly isolated to a single user's mailbox.

And remember the 82 % recency rule?

Users overwhelmingly focus on the last two weeks of mail.

This usage pattern is key.

Given that a single column can be several megabytes, the email body traditional relational databases are just instantly discarded.

They are.

They're not optimized for these large data chunks, and suffer from prohibitively high IOPS, input -output operations per second.

So we need a custom system.

We need a custom system optimized for low IOPS.

Our database must support strong data consistency, high availability, and easy incremental backups.

Let's look at the NoSQL data model for this.

How do we spread the data out efficiently?

We use the user ID as the partition key.

This is vital.

It ensures all of one user's data is stored on a single node or shard.

That's how you get horizontal scaling.

That's it.

It distributes the load evenly.

Within that partition, we use the email's unique ID, often a timeee you would add, as the clustering key for efficient chronological sorting.

Now for the most fascinating scaling trick.

Denormalization.

I find this concept hard to get my head around.

Why can't we just use a simple Boolean field called isRed?

Because at petabyte scale, querying a non -key attribute, like fetch everything where isRed equals false, turns into a disastrous full table scan across millions or billions of records.

A query would just time out.

It would never return.

It would be completely unacceptable.

So the performance nightmare forces us to redesign the storage.

Estrously.

To optimize these massive read queries, we denormalize the data.

We maintain two separate tables, one for readmails and one for unreadmails.

Wait, two separate tables.

So to mark an email as read, you have to delete it from one table and insert it into another.

Yes.

The application has to perform a multi -step write process, delete the record from the unread table, and insert a new record into the read table.

So we sacrifice application simplicity and introduce these complex multi -step writes just to make the common read queries incredibly fast.

Because the reads are now simple key lookups in a smaller dedicated table, it is a brutal but necessary trade -off.

Wow.

Okay.

And finally, we must decide on the consistency trade -off.

For email, correctness is paramount.

So we favor strong consistency over availability.

What nightmare scenario does that strong consistency prevent for the user?

It prevents the unacceptable scenario where a user might delete an email on their phone, open their laptop immediately afterward, and see the email reappear.

That's a result of weak or eventual consistency.

Ah, I've seen that happen.

We enforce a single primary server for any given mailbox.

If that server fails, client synchronization and updates are explicitly paused until the failover is complete.

We temporarily sacrifice availability to guarantee data integrity.

Okay.

So we've built the system, but the job is only half done.

The entire system fails if emails don't reach the inbox.

Exactly.

We must tackle deliverability, especially since over 50 % of all global email traffic is classified as spam.

That means we have to become experts and send a reputation.

Yes.

We need dedicated IP addresses for sending.

We must classify email types using different IPs for high -volume promotional mail versus low -latency transactional receipts.

And critically, when we introduce a new IP address, we have to warm it up.

Yes.

Slowly.

Over two to six weeks.

Why is warming up necessary?

Major providers like Gmail won't trust an IP that suddenly starts sending millions of emails.

They just assume it's a spam bot.

I see.

Warming up involves gradually increasing volume to build a reputation score, proving to recipient systems that we are a legitimate sender.

We also need feedback loops to manage failures and complaints.

We use keys for this too.

Hard bounces for a permanent failure like an invalid address, soft bounces for a temporary ISP issue, and the most damaging complaints.

When a user clicks report spam?

Yes.

Handling complaints immediately by ceasing to send to that user is essential to maintaining reputation.

And to combat phishing, we rely on standards like SPF, DCOM, and DMRC for authentication.

We have to.

Now let's return to the search store.

We noted that email search is fundamentally write -heavy.

Every send, receive, or delete requires read indexing.

Which is the opposite of a read -heavy web search.

Completely opposite.

Option one is using an external service like Elastic Search, decoupled from the main workflow, using an asynchronous queue like Kafka.

The challenge there, as you mentioned, is the complexity of keeping the primary store and the search index in perfect sync.

Right.

For massive scale like Gmail, where index -disk -io is the final bottleneck, building a custom search engine becomes necessary.

This is where we introduce a revolutionary data structure.

The Log Structured Merge Tree, or LSM.

Okay.

Help us visualize the LSM tree.

Why does it solve the IO problem?

Think of it this way.

Traditional databases perform random writes, which means shuffling all over the disk to find the exact spot to update.

That's slow.

Very slow.

The LSM structure optimizes the write path by almost exclusively performing sequential writes.

Imagine level zero is a sorted in -memory cache.

You just drop new data into it sequentially.

Once that cache is full, you sequentially write it down to a higher level on disk.

So instead of wasting time looking for the right spot, we just pile up data sequentially in memory and on disk.

And then the system periodically merges those levels together when it's less busy.

And that merge is also sequential.

Exactly.

It's incredibly efficient.

By minimizing the random access nature of writing, the LSM structure dramatically increases the write throughput needed for near real -time email indexing.

And finally, ensuring massive scalability and availability.

The whole system is designed for horizontal scaling.

For maximum fault tolerance, data is always replicated across a multi -data center setup.

So if a whole region goes down?

Even if an entire region suffers a network failure, users can access messages from a geographically diverse replica, maintaining high availability globally.

What a journey.

We started with a staggering 1 billion users and petabyte storage, moved through the synchronization necessity of IMAP, and built this custom, highly modular architecture.

Using specialized storage, the metadata DB, S3, cache, and search store.

The essential design decisions were decoupling the flows using asynchronous queues,

optimizing storage for the file type.

And crucially, the consistency trade -off sacrificing short -term availability during failover to guarantee strong data integrity for every single mailbox.

We also cover the non -technical essentials like reputation management and IP warm -up that really determine if the system is successful at all.

Yeah.

And beyond the functional requirements, every global system must incorporate non -functional mandates like compliance GDPR, PII, legal intercept requirements, alongside robust security, encryption, all of it.

Security and compliance are built in from the start.

They have to be.

They are not afterthoughts.

Okay.

Here's a final provocative thought for you to take away.

We noted that when the same email attachment is sent to multiple recipients, the same data is often stored multiple times in the S3 object store.

How could the system be optimized, say, right before the expensive save operation, to check for the attachment's existence in storage first, using techniques like content -based addressing or hashing, saving both time and colossal storage space?

That's the challenge of true efficiency at scale.

An optimization that could save petabytes.

A great challenge to end on.

Thank you for joining us on this deep dive into the massive world of distributed email service design.

β“˜ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Designing a distributed email service at planetary scale demands careful attention to storage requirements, throughput constraints, and architectural tradeoffs that differ fundamentally from traditional single-server mail systems. Back-of-the-envelope calculations reveal that supporting billions of users necessitates petabytes of annual storage capacity for both metadata and attachments, motivating the shift from legacy approaches like Maildir that relied on local file directories and suffered from severe disk I/O bottlenecks. The system must support core functionality spanning email transmission and retrieval, message filtering and organization, full-text searching across mailbox contents, and protective mechanisms against spam and malware. Modern email infrastructure depends on multiple protocols operating at different layers: SMTP facilitates inter-server communication, while POP and IMAP enable client-side retrieval; contemporary implementations often augment these with HTTPS for webmail access or specialized protocols like JMAP over WebSocket to support real-time synchronization and instantaneous updates. The high-level architecture decouples responsibilities across Web Servers handling user interactions, Real-time Servers managing live connections, and a composite Storage Layer integrating a metadata database, dedicated attachment storage via cloud services, distributed in-memory caching, and a search engine powered by inverted indexes for rapid content discovery. Both inbound and outbound email flows leverage asynchronous message queues to decouple processing stages and permit independent scaling of specialized mail components. The database layer presents unique challenges: the extreme isolation of per-user data and the substantial size of individual messages make customized NoSQL solutions with Bigtable-like characteristics preferable to traditional relational models. Efficiently querying read or unread email status in such environments requires strategic denormalization that fragments email records across distinct table structures. Email deliverability emerges as a critical quality metric achieved through sender reputation mechanisms using dedicated IP allocations, close monitoring of bounce categories, and multi-layered authentication standards including SPF, DKIM, and DMARC to prevent spoofing and phishing. Search indexing, characterized as write-intensive, benefits from specialized engines like Elasticsearch or highly tuned proprietary systems leveraging Log-Structured Merge-Trees to minimize random disk access patterns. The overall design employs multi-datacenter replication to ensure service continuity, deliberately favoring strong consistency guarantees for mailbox state even when such choices limit availability during failover scenarios.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML β™₯