Chapter 2: Data Models & Query Languages

0:00 / 0:00
Report an issue

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

So, for our mission today, we're doing a deep dive into, really, the bedrock of data -intensive systems.

The absolute foundation.

Exactly.

We are cracking open that critical chapter on data models and query languages from designing data -intensive applications.

And this isn't just about databases.

Oh, not at all.

It's about the conceptual tools we use to frame reality itself.

I want to start with that quote, the source material uses from Wittgenstein.

The limits of my language mean the limits of my world.

And when you apply that to computing, it just hits so hard.

Data models are our language.

They define the problems we can solve easily.

And the ones we make, you know, needlessly difficult for ourselves, it's the perfect lens to look through.

It really is.

Because these models, they shape every single architectural choice you make.

And what's fascinating is that no application really relies on just one model.

It's always a stack.

A stack of layers.

A stack of layers.

Think of it like this.

At the very top, you, the application developer, you model the real world.

Customers, products, that sort of thing.

Right.

Using application -specific structures.

Maybe objects in your programming language.

That's layer one.

Okay.

Then you hit that abstraction barrier.

Exactly.

Layer two is where you map those structures to a general purpose model, something the database understands.

Like relational tables or JSON or graphs?

Precisely.

And then below that, the engine has its own internal way of doing things, storing data as bytes on disk.

All the way down to the hardware, the electrical currents.

Right.

But we're focusing on that crucial general purpose layer today, because that's where your application development becomes either easy or incredibly hard.

So the central question for our deep dive then is, how do these three major models, relational, document, and graph, actually compare?

And when do their query languages deliver elegance versus, well, headaches?

A lot of headaches.

Let's start with the historical giant.

The relational model.

I mean, this has been dominant since Edgar Codd proposed it back in 1970.

SQL.

Right.

Data organized into tables, which he called relations, made of rows or tuples.

And for, what, 25, 30 years?

It was the standard.

And it was all driven by these seemingly boring use cases, like business transaction processing.

That long dominance, though, it started to show some cracks.

And by the 2010s, you see this explosion of the NoSQL movement.

Right.

Which, it's important to say, eventually came to mean not only SQL.

Yes, exactly.

It wasn't about totally rejecting the relational model.

It was just a recognition that it's not a one size fits all solution.

So what was really driving that?

Was it just the big web scale companies needing to handle massive amounts of data?

That was a huge part of it, for sure.

The need for greater scalability, high write throughput.

But there was also an ideological side to it.

Oh.

A big preference for open source software and just, frankly, frustration with the rigid schemas of relational systems.

This all led to this idea of polyglot persistence.

Which is the idea that you should use the best tool for each specific job.

Exactly.

Even if that means you end up running three different kinds of databases in a single application.

I think the most common pain point I hear developers talk about here is the object relational impedance mismatch.

Oh, yes.

We've all been there, right?

Cursing an object relational mapping framework, an ORM, because it's generating some truly awful inefficient join query.

You're just constantly fighting the tool.

It's a conceptual battle.

Your programming language thinks in objects rich hierarchies, inheritance.

The database thinks in flat tables.

And that translation layer, the ORM, it's just, it's always awkward.

It never quite feels right.

Never.

The friction is always there.

So let's make that really tangible.

If we're modeling a resume, like a LinkedIn profile, how do the models handle that differently?

Okay, great example.

In the relational model, you have to shred it.

You shred the data into pieces.

Person's main info is in one table, their job history.

That goes into a positions table linked by a foreign key.

Education goes into an education table.

So to get the whole profile, you have to do a bunch of joins.

Multiple multi -way joins.

The data is all spread out.

Now contrast that with the document model, usually JSON.

Here you store the entire resume, all the personal details, an array of jobs, an array of schools in one single self -contained document.

So when the app needs the profile?

It's one query, one result.

This gives you excellent locality.

All the related information is stored together, often in the same spot on disk.

It's fast.

But that immediately brings up the whole normalization debate.

Why did we go through all that pain of shredding the data in the first place?

Why not just duplicate the text strings everywhere?

Normalization is all about integrity and managing complexity when you write data.

When you use an ID, like a reginid, the human read -all name, like philanthropy, is stored in only one place.

The canonical source.

Exactly.

So if that name changes, you update one row and one table.

Done.

Imagine if you had duplicated philanthropy across a million user documents.

Oh, what a nightmare.

You'd have a massive update job and you'd probably miss some.

And you'd get inconsistencies.

That's what normalization prevents.

It's all about consistency and efficiency on writes.

So document models struggle when data needs to be shared and interconnected then?

Absolutely.

That's their main limitation.

It shows up with many -to -one and many -to -many relationships.

Like linking a bunch of users to the same company or products to categories.

Right.

And since document databases usually have weak support for joins,

you end up having to emulate those joins in your application code.

So you're making multiple round trips to the database.

Multiple round trips.

Your app has to fetch the user, pull out the industry, then make a brand new query to the industry's collection to get the name.

It's slow and the database can't optimize it.

You've just moved the complexity from the database into your code.

It's fascinating how history just repeats itself here.

The document model's tree structure sounds so much like the old hierarchical models from the 70s.

Like IBM's IMS.

Yeah.

It's the same pattern.

Great for one -to -many relationships, but they just fell apart with more complex connections.

And that led to that great debate in the 70s.

The other competitor to the relational model was the network model, CODISL.

Right.

CODISL tried to generalize the hierarchy.

It let records have multiple owners so it could handle many -to -many relationships.

But it was famously difficult to work with, wasn't it?

Something about developers having to manage access paths manually.

That's the absolute key.

The links were basically just pointers that the application had to manage.

You, the developer, had to write this imperative code saying, go to this record, now follow this specific link.

Sounds awful.

It was like managing memory with pointers in C versus letting a garbage collector handle it.

The relational model won because it abstracted all that away.

It let the database's optimizer figure out the efficient access path automatically.

It was a huge cognitive load off the developer.

Which brings us perfectly to schema flexibility, another huge dividing line.

A massive one.

Relational databases use what's called schema on write.

Meaning the structure is defined upfront and strictly enforced when you write the data.

Exactly.

Columns, data types, constraints, all of it.

Document databases, on the other hand, use schema on read.

People call them schema lists, but that's not quite right, is it?

No, it's misleading.

There's definitely a schema, but it's implicit.

The structure is only interpreted by the application code when it reads the document.

And this is where that developer flexibility really comes in.

Let's say you need to change a user's single -name field to separate first name and last name.

Okay.

In a SQL database, that's a big deal.

You have to run an alter table, then maybe a really slow UPDATE migration to rewrite every single row.

It's a heavy blocking operation.

But in the document world, you just start writing new documents with the new fields.

And the application code just handles both formats.

Your code just has a little if statement.

It checks if it's reading an old document and handles it.

The database stays online, and you can iterate and deploy so much faster.

That feels like a huge win.

So where's the catch?

What's the performance trade -off?

It's in the data locality, specifically with updates.

If your documents are really large, say a 5 -megabyte JSON object, loading that whole thing just to change one tiny field can be super wasteful.

And you often have to rewrite the entire document, right?

Frequently, yes.

Any small update can trigger a complete rewrite of a large document to a new spot on disk.

That can cause a lot of write amplification.

Of course the market hasn't stood still.

We're seeing this massive convergence now.

Absolutely.

Systems like Postgresq will now have amazing support for JSON.

You can stick a document right inside a relational table.

But then that raises the question.

If Postgresq can handle JSON and still give me reliable joins,

why do I need a pure document store anymore?

I think it comes down to philosophy and implementation.

Native document systems were often built from the ground up for massive horizontal scaling, something relational systems have historically struggled with.

And that schema and read mindset is still just better for certain use cases.

Rapidly changing data where agility is more important than consistency.

Exactly.

But you're right.

Convergence is the future.

Everyone wants the best of both worlds.

Okay, let's switch gears to query languages.

Let's talk declarative versus imperative because that was maybe the single greatest innovation of the relational model.

It really was.

Imperative languages, like the old codicile systems, are where you tell the computer how to do something, step by step.

Whereas declarative querying is fundamentally different.

With SQL, you only specify what data you want.

You describe the result you're looking for.

And the database's query optimizer figures out the most efficient how.

Which is this massive advantage.

The abstraction is key.

A database engineer can add a new index tomorrow and your query code doesn't change at all.

It just gets faster automatically.

I love the web analogy here.

Is declarative querying basically the CSS of data?

That's a perfect analogy, yes.

Imagine trying to style a web page using only imperative JavaScript DOM manipulation.

It would be a brittle nightmare.

Whereas with declarative CSS, you just use a selector and say, I want it to look like this.

And the browser figures out how to render it efficiently.

SQL does the exact same thing for data.

But we should also talk about MapReduce, which is kind of an interesting middle ground.

Right, for bulk data processing.

It's not fully declarative, but it's not fully imperative either.

It uses those two pure functions.

The map and the reduce.

If you want to count, say, shark sightings by month from a huge data set, you write a map function to pull out the month and a reduce function to add up the counts.

It's powerful, but writing those two separate coordinated functions can be pretty clunky compared to a simple SQL query.

It can be.

And that's exactly why we saw NoSQL systems, after first rejecting SQL, eventually reinvent declarative querying.

Like MongoDB's aggregation pipeline.

Perfect example.

It's a much more declarative JSON -based way to define a sequence of operations match group project that achieves the same goal as a SQL query without the functional programming headache.

So this all leads to a big question.

What happens when the connections between data are the most important thing?

When many -to -many relationships are everywhere?

When connectivity is king, the graph model is just the most natural fit.

You're no longer thinking in tables or documents.

You're thinking in vertices and edges.

Nords and relationships.

Exactly.

Social graphs, fraud detection, logistics,

anywhere.

The connections matter more than the individual data points.

So what are the main approaches here?

There are two big ones.

First is the property graph model, which you see in systems like Neo4j.

In that model, both the vertices and the edges can have properties, right?

Key value pairs.

Right.

And that structure makes traversal following paths through the graph incredibly efficient.

It's great for modeling complex evolving data.

And the other one is the triple store model.

The triple store.

It stores everything as a simple three -part statement.

Subject, predicate, object.

The elegance is its simplicity.

The object can be a value, which makes the triple a property.

Or it can be another vertex, which makes it an edge.

The query languages for graphs are where you really see the paradigm shift.

Like cipher for property graphs.

It's so declarative with that arrow notation.

It's beautiful.

If you want to find people who emigrated from the U .S.

to Europe, you can express that variable off -path with something like w -i -a -u -b -c -a -n -o.

It's incredibly concise.

And trying to do that same query in SQL is just agony.

It's the acid test.

You have to use these incredibly verbose and complex recursive common table expressions.

Your query can be dozens of lines long and so hard to debug.

All for something that cipher expresses in a few symbols.

It just proves that if your data is a graph, you should use a graph model.

Using the wrong tool makes your application code drastically more complex.

And for triple stores, there's spy -rql, which is conceptually similar in its pattern matching.

Right.

And we should give a nod to datalog, which is older but foundational.

It uses a rules -based approach.

You build up complex queries step by step by defining what the relationships mean.

It's powerful, but it's a very different way of thinking.

So let's bring it all back.

The big picture here is that there's no single best model.

None at all.

The three dominant models, they just serve different purposes.

Document is for self -contained tree -like data.

Relational is for structured data with complex many -to -many links.

And graph is the clear winner when your data is all about the interconnections.

Exactly.

And the choice between that explicit schema on write versus the implicit schema on read really just comes down to a trade -off.

Consistency versus flexibility.

Right.

And the key takeaway from all of this is that your choice of data model, it profoundly dictates everything about your application.

It sets the limits of your development world.

So with that, we have to leave you with one final thought.

The logical data model, everything we've talked about today, it's only one half of the story.

It's just the top layer of the cake.

If you really want to understand performance, you have to go deeper.

You have to understand how these elegant logical models are actually implemented by the database engine.

How they become bytes and pages and indexes on a physical disk.

That's the next layer of abstraction.

That is where the rubber meets the road.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter SummaryWhat this audio overview covers
Data models serve as fundamental abstractions that shape how developers conceptualize problems and design solutions, forming a bridge between application-level objects and lower-level storage mechanisms. Understanding the distinctions among the three dominant general-purpose data models is essential for making informed architectural decisions. The relational model, formalized by Codd in 1970 and implemented through SQL, structures data into tables with rows and columns, leveraging normalization techniques to minimize redundancy and efficiently manage complex relationships through joins and foreign key constraints. This approach has proven exceptionally durable for applications requiring structured data and complex many-to-one and many-to-many associations. The document model emerged as part of the NoSQL movement, offering an alternative that appeals to developers seeking greater scalability and flexibility in data structure representation. Documents, typically encoded in JSON format, are self-contained units that reduce the friction between application objects and database storage—the object-relational impedance mismatch—while providing superior data locality that accelerates retrieval of entire records. However, document systems struggle with scenarios involving deeply interconnected data requiring extensive joins across multiple entities. A key distinction lies in their schema enforcement philosophy: relational databases impose a rigid schema-on-write model, whereas document systems permit a more flexible schema-on-read approach. Historically, the document model shares conceptual similarities with the hierarchical model popularized by systems like IMS, which ultimately lost favor due to inherent limitations in representing many-to-many relationships. The graph data model represents a third paradigm, explicitly optimized for highly interconnected data where traditional models become unwieldy. Graph systems, including property graph and triple-store variants, support query languages like Cypher and SPARQL that enable elegant pattern matching and variable-length traversals—following relationships an arbitrary number of times—operations that require cumbersome recursive common table expressions in standard SQL. Foundational languages like Datalog demonstrate how complex queries can be constructed systematically through composable rules. Query languages themselves divide into two philosophical camps: declarative approaches allow developers to specify desired results while the database query optimizer determines the most efficient execution strategy, whereas imperative languages like CODASYL require developers to specify exact retrieval procedures. These data model categories continue to converge, with relational systems increasingly incorporating document capabilities through native JSON support.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.

Support LML ♥