Chapter 20: Genomics and Proteomics

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Imagine trying to solve a 3-billion-piece jigsaw puzzle, but every single piece is shaped exactly the same.

Oh wow.

Sounds like a nightmare.

And they're all made of just four colors, and to top it off, someone threw away the box cover.

So you have absolutely no idea what the final picture is even supposed to look like.

Yeah, I mean the sheer scale of that problem is, well, it's exactly what humanity was up against when we first decided to read our own genetic code.

It's one thing to know that DNA is a double helix, you know?

Sure.

It is entirely another thing to try and read those billions of letters in the correct order, just to figure out what actually makes us human.

And the scope of this goes, like, way beyond humans now.

There's this massive initiative called the Earth BioGenome Project.

The goal is to sequence all 1.5 million known eukaryotic species on the planet by 2028, and it comes with a price tag of around $4.7 billion.

Which is staggering.

But, you know, the potential payoff for medicine, for agriculture, and just our basic understanding of evolution is just as massive as the cost.

But to even attempt something on that scale, geneticists need a roadmap.

I mean, throwing DNA into a sequencer and just hoping for the best simply does not work.

Exactly.

So welcome to your custom deep dive.

We know you are prepping for a massive genetics exam, so think of this as an exclusive mentorship session just for you.

Our mission today is to reverse-engineer the instruction manual of life.

That's a great way to put it.

We're going to explore how geneticists draw the map, how they actually read the sequence, how they figure out what the software commands the biological hardware to do, and finally what we learn when we compare the genomes of wildly different species.

Right, because you have to map the territory before you can actually read the genetic sequence.

Let's focus on that mapping phase first.

The concept of a genetic map, which is often called a linkage map in the text, it's always seemed a bit blurry to me.

Well, think of a genetic map like a hand-drawn sketch of a highway system.

It shows you the relative order of major cities, but it's an approximation.

Genetic maps are based entirely on the natural process of recombination, or crossing over, during meiosis.

Okay, crossing over.

Yeah.

So when organisms are crossed, geneticists look at the frequency of recombination between loci in the offspring.

If the recombination frequency is less than 50%, we know those genes belong to the same linkage group.

They sit close together on the same chromosome.

Right, and the rate of recombination is roughly proportional to the physical distance between them, and we measure those distances in map units, or centimorgans.
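
To make that arithmetic concrete, here is a minimal Python sketch of turning offspring counts into a map distance. The cross counts are invented for illustration, and the function names are our own rather than anything from the textbook.

```python
def map_distance_cm(recombinant_offspring, total_offspring):
    """Recombination frequency as a percentage, which equals map units (cM)."""
    if total_offspring <= 0 or not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("need total > 0 and 0 <= recombinant <= total")
    return 100 * recombinant_offspring / total_offspring


def same_linkage_group(distance_cm):
    """Below 50% recombination, the two loci are treated as linked."""
    return distance_cm < 50


# Hypothetical testcross: 120 recombinant offspring out of 1,000 scored.
distance = map_distance_cm(120, 1000)
print(distance, same_linkage_group(distance))  # prints: 12.0 True
```

Keep in mind that this is only the rough sketch of the highway system: the number reflects how often crossing over happens, not a count of base pairs.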

Exactly.

Centimorgans.

I always get tripped up on that.

I mean, crossing over isn't a perfect ruler, right?

It doesn't happen at exactly the same rate everywhere on a chromosome.

It definitely doesn't, and that is a critical limitation you need to remember.

There are recombination hotspots, and then there are cold spots where crossing over almost never happens.

Because of that variation, genetic maps have very low resolution.

To get the actual GPS coordinates, we need physical maps, which measure the direct absolute distance in base pairs.

And historically, building those physical maps relied on a technique called restriction mapping.

I really want to walk through the logic of this, because I see a classic puzzle set up in the source material that looks like a total nightmare math word problem.

Oh, I know the exact type of problem you mean.

Yeah, but there is a brilliant underlying logic to it.

The logic of restriction mapping is pure deduction.

Walk me through the setup.

Okay, so let's say you have a 30 kilobase piece of DNA.

You cut a sample of it with a restriction enzyme called BamHI.

Okay.

You analyze the results, and you find three fragments.

One is 20 kilobases, one is 6 kilobases, and one is 4 kilobases.

Right, so BamHI acts like a pair of chemical scissors.

To get three pieces, it must have cut the DNA fragment in two specific places.

Exactly.

Now you take a fresh sample of that exact same 30 kilobase DNA, and you cut it with a different enzyme, HpaII.

Okay, fresh sample.

Right.

This time you get two fragments, 21 kilobases and 9 kilobases.

So HpaII only has one cut site on this DNA fragment?

Yeah.

We have the pieces, but we still don't know the physical order they go in.

And here is where I get confused.

The next step is a double digest, cutting a third sample with both enzymes at the same time.

Right.

Wouldn't cutting it more just give us a blender full of tiny, useless pieces?

Well, not if you look closely at what survives.

What were the results of the double digest?

You get four fragments, a 20 kilobase, a 5 kilobase, a 4 kilobase, and a 1 kilobase piece.

Okay, the key here is to look at what stayed exactly the same.

The 20 kilobase and 4 kilobase fragments from the first BamHI cut are still there in the double digest.

They were completely untouched by the HpaII enzyme.

Oh, wait.

Yeah.

But that 6 kilobase BamHI fragment vanished.

In its place we have a 5 kilobase and a 1 kilobase fragment.

Oh, I see it now.

The HpaII enzyme must have chopped that specific 6 kilobase piece into a 5 and a 1.

Exactly.

Because the 5 kilobase and 1 kilobase pieces used to be joined together, they must be adjacent on the physical map.

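And here is where the second single digest comes in. The 21 kilobase fragment can only be the 20 plus the 1, and the 9 kilobase fragment can only be the 5 plus the 4, so with a total of 30 kilobases you can overlap these puzzle tabs to pinpoint the exact cut sites.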

So it forces the 20 kilobase piece to be on one end, placed right next to the 1 kilobase piece, followed by the 5 kilobase piece, and then ending with the 4 kilobase piece.

The puzzle literally solves itself.
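
If you want to check that deduction mechanically, here is a small brute-force sketch in Python. It assumes a linear 30 kilobase molecule and complete digests, and it simply tries every possible order of the single-digest fragments. The function names are ours, not the textbook's.

```python
from itertools import accumulate, permutations


def digest(length, sites):
    """Sorted fragment sizes from cutting a linear molecule at the given sites."""
    cuts = [0] + sorted(sites) + [length]
    return sorted(right - left for left, right in zip(cuts, cuts[1:]))


def sites_from_order(fragments):
    """Cut positions implied by laying fragments end to end in this order."""
    return tuple(accumulate(fragments))[:-1]


def solve_map(length, first_digest, second_digest, double_digest):
    """Try every fragment order for each single digest and keep the cut-site
    placements whose combined cuts reproduce the double digest."""
    target = sorted(double_digest)
    solutions = set()
    for order_a in set(permutations(first_digest)):
        sites_a = sites_from_order(order_a)
        for order_b in set(permutations(second_digest)):
            sites_b = sites_from_order(order_b)
            if digest(length, sites_a + sites_b) == target:
                solutions.add((sites_a, sites_b))
    return sorted(solutions)


# Fragment sizes in kilobases, taken from the worked example above.
maps = solve_map(30, [20, 6, 4], [21, 9], [20, 5, 4, 1])
for first_sites, second_sites in maps:
    print("first enzyme cuts at", first_sites, "| second enzyme cuts at", second_sites)
```

The search returns two placements that are mirror images of each other: first-enzyme cuts at 20 and 26 with the second enzyme at 21, or the same map read from the other end. Read left to right, the first one gives the 20, 1, 5, 4 order described above.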

That is so satisfying.

But once we have that detailed physical map drawn, we face the next hurdle.

I mean, we can't just feed an entire chromosome into a machine and read 3 billion base pairs from start to finish.

No, we definitely can't.

The sequencing technology we have typically only reads small fragments.

It's usually about 500 to 700 nucleotides at a time.

Which means we have to literally shred the genome into tiny pieces, sequence every single one of those shreds, and then somehow stitch them back together in the right order.

Which is an insane task.

The first time geneticists pulled this off for a free-living organism was in 1995 with a bacterium called Haemophilus influenzae.

It had a tiny genome of 1.8 million base pairs.

Still pretty big, though.

True.

But that monumental success was a proof of concept for the Human Genome Project, which had kicked off in 1990 to tackle our 3.2 billion base pairs.

And there were two heavily debated approaches to assembling that massive puzzle, right?

The public consortium used what is called map-based sequencing.

Yeah, map-based sequencing relied heavily on those detailed genetic and physical maps we just talked about.

They partially digested the genome into large, manageable fragments.

Then they cloned those fragments into bacterial or yeast artificial chromosomes.

We call them BACs and YACs.

Wait, wait.

We spliced human DNA into bacteria?

Why?

Basically, we tricked the bacteria and yeast into acting like biological copy machines.

By inserting human DNA into these organisms, they stored and replicated our DNA in very stable, organized chunks.

Oh, that's clever.

Yeah.

And then researchers used genetic markers to align these overlapping fragments into continuous stretches called contigs.

It was highly organized, methodical, and very, very slow.

And on the other side, Craig Venter and his company Celera pushed a completely different method, whole-genome shotgun sequencing.

They proposed skipping the intense mapping phase entirely, shredding the whole genome at once, sequencing the small inserts, and just relying on incredibly powerful computers to find the overlaps.

Which was incredibly controversial at the time.

I mean, the human genome is absolutely full of repetitive sequences.

Imagine a jigsaw puzzle where half the pieces are just identical patches of blue sky.

Right.

If you just shred the genome blindly and you sequence a fragment that is just the same repeating letters, the computer has no idea which part of the blue sky it belongs to.

Assembly becomes an absolute nightmare.

But the computers got faster and the algorithms got smarter.

They did.

Venter's shotgun approach proved remarkably efficient.

Today, virtually all genomes are sequenced using some variation of the whole-genome shotgun method.
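
To get a feel for what "finding the overlaps" means, here is a toy Python sketch of greedy overlap assembly. It is a deliberately simplified illustration, assuming error-free reads from a single strand and no repeats. Real assemblers are far more sophisticated, and the reads and function names below are invented.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that matches a prefix of right."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    k = overlap(left, right, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads covering a 14-letter stretch of sequence.
contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])
print(contigs)  # prints: ['ATGCGTACGTTAGC']
```

The blue-sky problem shows up as soon as the same stretch of letters occurs in more than one place: several reads then overlap equally well, and a greedy merge like this one can join pieces that were never neighbors.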

So the Human Genome Project eventually gave us a complete reference sequence.

Yeah.

But the term reference is doing a lot of heavy lifting there.

There actually is no single human genome.

No, not at all.

The reference sequence is a patchwork.

It was stitched together from DNA donated by several different anonymous individuals.

It isn't the ancestral sequence, and it isn't even the most common sequence.

It serves purely as a baseline for comparison.

Like if you step into an elevator with a total stranger, your genomes are 99.9% identical.

But across 3.2 billion base pairs, that tiny 0.1% difference equates to over 3 million differences in your DNA.

Which is wild.

Yeah.

These single base differences, for instance, where you have a T and the stranger has a G at the exact same location, those are single nucleotide polymorphisms, or SNPs.

SNPs are the foundation of genetic variation.

They arise from a single mutation in a specific ancestor, and then they spread through the population over time.

Because chromosomes are inherited in chunks, SNPs that are physically close together on a chromosome tend to be inherited together.

This specific set of linked variants is called a haplotype.

I sometimes struggle to visualize haplotypes.

Honestly, how are they different from just a random collection of mutations?

Think of it like a dedicated sports fan's outfit.

If someone is wearing a specific team's jersey, they are almost certainly wearing the matching hat and the matching jacket.

Those items are linked by their allegiance.

You don't need to see the whole outfit to know who they're rooting for.

You only need to see the hat, what geneticists would call a tag SNP, to confidently identify the entire haplotype.

Ah, I like that.

So the non-random association between those genetic variants is linkage disequilibrium, and that concept is the entire basis for genome-wide association studies, or GWAS.

How do researchers actually use those tag SNPs to hunt down diseases?

Take age-related macular degeneration, which is a leading cause of blindness.

Researchers took 96 people with the disease and 50 healthy people.

Because of linkage disequilibrium, they didn't need to sequence every single letter of everyone's genome.

Right, that would take forever.

Exactly.

They just scanned over 100,000 specific tag SNPs across the genomes.

They were looking for which SNPs were consistently different in the sick patients versus the healthy ones.

Because if a tag SNP is showing up constantly in the sick group, the disease-causing gene is probably sitting physically close to it in that haplotype block.

Yes.

And in that specific study, they found a strong association with a gene on chromosome 1 that encodes complement factor H.
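
For a sense of how one tag SNP gets flagged, here is a minimal Python sketch of a case-control allele test. The allele counts are invented (only the group sizes, 96 cases and 50 controls, come from the study described above), and the Bonferroni-style cutoff is our own assumption about how to handle 100,000 tests, not something stated in the chapter.

```python
import math


def allelic_chi_square(case_risk, case_other, control_risk, control_other):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table of allele counts."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Invented counts for one tag SNP: 96 cases carry 192 alleles, 50 controls carry 100.
chi2 = allelic_chi_square(140, 52, 35, 65)
p = p_value_1df(chi2)

# Assumed cutoff: 0.05 divided by the 100,000 SNPs tested.
threshold = 0.05 / 100_000
print(f"chi-square = {chi2:.2f}, p = {p:.2e}, passes cutoff: {p < threshold}")
```

The strict cutoff matters because scanning 100,000 SNPs means 100,000 chances for a coincidence, so only a very lopsided allele count between the sick and healthy groups should count as a hit.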

See, GWAS sounds like a silver bullet, but the book mentions it has actually revealed a massive mystery.

For complex traits like height or cardiovascular disease, the genes identified by GWAS often only explain a tiny fraction of the variation.

For blood lipids, it's only about 25 to 30%.

The rest of the genetic influence is hidden in what geneticists call the dark matter of the genome.

Because we have the physical hardware mapped out, but just having the sequence doesn't explain the missing variation.

Yeah, knowing the sequence is like reverse-engineering the hardware of a computer.

You can identify the processor and the memory, but you still don't know what the software code actually commands the hardware to do.

And bridging that gap is the field of functional genomics.

We move from the static DNA sequence to the dynamic transcriptome, which is all the RNA transcribed from the genome, and the proteome, which is all the proteins actually synthesized.

The absolute first step to figuring out what a mystery gene does is simply looking to see if we've seen something similar before, right?

Yeah, we use computer programs like BLAST to conduct homology searches.

We search vast databases for homologous genes, which are genes that are evolutionarily related.

The principle is straightforward.

If two genes look similar in their sequence, they probably perform similar functions.

The source material makes a really important distinction here regarding homology, between orthologs and paralogs.

Oh yes, orthologs are homologous genes found in different species that evolved from a common ancestor.

Like humans and mice both have a gene for the alpha subunit of hemoglobin.

Those are orthologs.

Okay, different species, orthologs.

Paralogs are homologous genes within the same species that arose from a gene duplication event.

So in humans, the alpha hemoglobin and beta hemoglobin genes are paralogs.

They are related, but they diverged within our own lineage.

Got it.

But just because a gene is sitting in the DNA doesn't mean the cell is actually using it.

How do we know when a gene is turned on?

The textbook chapter has a fascinating figure detailing a breast cancer microarray.

Microarrays were a revolutionary way to measure gene expression.

Picture a tiny glass chip divided into thousands of microscopic spots.

Each spot contains single-stranded DNA probes corresponding to one specific gene.

So to see what's happening in cancer, researchers extract mRNA from breast cancer cells, convert it to cDNA, and tag it with a red fluorescent dye.

Then they extract mRNA from normal, healthy breast cells, convert that to cDNA, and tag it with a green fluorescent dye.

Then they mix the red and green samples together and wash them over the microarray chip.

If the cancer cell is aggressively overexpressing a particular gene, lots of red cDNA will bind to that specific spot.

Under a microscope, that spot glows bright red.

If the normal cell is expressing it more, the spot glows green.

What happens if both cells are expressing the gene equally?

Well, red and green light mix to create yellow.

And if neither cell is using that gene, the spot simply stays dark.
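
Those four outcomes can be written down as a simple rule. Here is a hedged Python sketch that classifies one spot from its red and green signal. The background level and the two-fold cutoff are illustrative values we picked, not numbers from the chapter.

```python
import math


def spot_color(red, green, background=100.0, fold=2.0):
    """Classify one microarray spot from its red (cancer) and green (normal) signal.

    background and fold are illustrative cutoffs, not values from the chapter.
    """
    if red < background and green < background:
        return "dark"      # neither sample expresses the gene
    log_ratio = math.log2((red + 1) / (green + 1))  # +1 avoids division by zero
    if log_ratio >= math.log2(fold):
        return "red"       # higher expression in the cancer cells
    if log_ratio <= -math.log2(fold):
        return "green"     # higher expression in the normal cells
    return "yellow"        # roughly equal expression in both


# Hypothetical intensity readings for four spots.
for red, green in [(5000, 400), (300, 4000), (2000, 2100), (20, 30)]:
    print(red, green, spot_color(red, green))
```

Working with the log of the ratio keeps the rule symmetric: a gene that is twice as active in the cancer cells and one that is half as active sit the same distance from yellow.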

The result is this comprehensive heat map showing exactly which genes have been hijacked in the cancer cells.

Microarrays are incredible, but they are increasingly being replaced by RNA-seq, right?

Next-generation sequencing made it possible to skip the pre-made chips entirely.

Researchers can isolate all the cellular RNA, convert it to cDNA, and literally sequence every single transcript directly.

RNA-seq gives us unprecedented resolution.

An early RNA-seq analysis of yeast revealed that 75% of the nonrepetitive sequences in its genome are actually transcribed into RNA.

75%.

That completely upended our understanding of how active the genome is.

So we can sequence the RNA, but the chapter also talks about visually watching genes turn on inside a living, breathing animal.

Yeah, by using reporter sequences.

You can take the coding region of a gene you want to study and replace it with a sequence that produces a visible signal.

A common choice is the gene for green fluorescent protein, or GFP, which was originally found in jellyfish.

Oh, I've seen pictures of this.

You insert that modified gene into a transgenic mouse.

Now whenever the mouse's body naturally tries to turn on that specific gene, it produces the jellyfish protein instead.

You can literally look under a microscope and watch a specific neural pathway in the mouse's brain glowing green in real time.

It's an incredibly powerful visual tool.

But visualizing a gene doesn't tell you its fundamental job.

If you really want to know what a piece of genetic software does, you run a mutagenesis screen.

You purposefully break the gene and observe what goes wrong in the organism.

Let's walk through the zebrafish experiment detailed in the material, because the logic here is brilliant.

First off, why zebrafish?

I mean, we are studying human genetics.

Zebrafish are ideal models.

They reproduce very quickly, and crucially, their embryos are completely transparent.

Oh, that makes sense.

Yeah, if a mutation causes a defect in heart development, you can see it happen with a simple microscope.

No dissection needed.

So researchers expose male zebrafish to a chemical mutagen called EMS to induce random mutations in their sperm, or they use CRISPR techniques when they want to target specific genes.

Right, and then they mate those treated males with healthy wild-type females.

The resulting F1 generation of fish will carry some of those mutations.

If a mutation is dominant, the deformity will be immediately obvious in the F1 fish.

But the vast majority of these mutations are recessive, aren't they?

They hide in the genome without causing any visible issues.

Exactly.

So to reveal them, you take those F1 fish that look completely normal but are heterozygous, meaning they secretly carry the hidden recessive mutation, and you mate them with wild-type fish.

Then you take their offspring and carefully backcross them.

This calculated breeding eventually produces fish that are homozygous for the recessive mutation.

And suddenly, 25% of the embryos have profound heart defects.
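
Where does that one-in-four figure come from? Here is a short Python sketch of the Punnett-square logic, assuming simple Mendelian inheritance at a single locus. The allele symbols M (wild type) and m (recessive mutation) and the function name are our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected offspring genotype fractions for a one-locus cross.

    Each parent is a two-letter genotype such as "Mm".
    """
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Carrier crossed with wild type: no homozygous mutants, so no visible defect.
print(cross("Mm", "MM"))

# Carrier crossed with carrier: one quarter are homozygous for the mutation.
print(cross("Mm", "Mm"))
```

That is why the extra rounds of breeding are needed: a recessive mutation only shows itself once two carriers are mated, and even then only in a quarter of their embryos.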

Because you tracked exactly which fish you bred, you can map the exact gene that caused the defect.

You broke the gene, the heart failed to develop, therefore that gene holds the software code for heart development.

It is a painstaking, systematic method for mapping the genetic instructions required to build a living organism from scratch.

So we know how to map the hardware, sequence the code, and figure out what the software does.

The final piece of the puzzle zooms all the way out to comparative genomics.

We look at what happens when we compare the genomes of wildly different species.

And the first major comparison to draw is between prokaryotes, like bacteria, and eukaryotes, like us.

Prokaryotic genomes are fiercely efficient.

Their genome size directly correlates with the number of genes they possess.

It averages out to about one gene for every 1,000 base pairs.

And bacteria don't just wait for mutations to evolve.

They swap DNA directly with each other, right?

Horizontal gene transfer is rampant in prokaryotes.

They take up loose DNA from the environment, exchange circular plasmids, or even get infected by viral vectors that carry DNA from another bacterium entirely.

They swap beneficial traits so frequently that attempting to draw a neat branching evolutionary family tree for bacteria is nearly impossible.

It looks more like a tangled web.

Eukaryotes are a completely different story, though.

There is absolutely no clear correlation between how complex an organism is and how large its genome is or how many genes it has.

We are packed full of non-coding DNA.

Only about 1.5% of the human genome actually codes for proteins.

The rest has historically been dismissed as junk DNA.

But replicating 3 billion base pairs takes a massive amount of cellular energy.

Why does the cell bother keeping all this extra DNA if it doesn't code for anything?

Some of that non-coding DNA undeniably regulates gene expression, acting as biological switches.

But a massive portion of it might truly be superfluous.

The material highlights an astonishing experiment by researcher Marcelo Nobrega.

His team created genetically engineered mice where they systematically deleted massive gene deserts.

In one specific case, they completely excised 1.5 million base pairs of non-coding DNA from chromosome 3.

Deleting a million and a half letters of code sounds catastrophic.

That's what you'd think, but the mice were perfectly healthy.

They were functionally and behaviorally indistinguishable from normal mice.

It forcefully suggests that vast regions of the mammalian genome could be deleted with absolutely zero major phenotypic effects.

Which is mind-blowing.

And a huge portion of all this extra DNA actually comes from transposable elements or jumping genes.

Transposable elements are parasitic sequences that can copy and paste themselves throughout the genome.

Over millions of years, they just accumulate.

They make up roughly 45% of the human genome.

In corn, an incredible 85% of the entire genome is derived from jumping genes.

But despite all these jumping genes shuffling around and massive gene deserts sitting idle, there is still this beautiful underlying architectural order when we look at evolutionary history.

That underlying order is called collinearity.

If you look at different grass species like rice, wheat, and sorghum, their total genome sizes vary wildly due to those jumping genes.

Wheat has a bloated 17 billion base pairs.

Rice has a lean 460 million.

That's a huge difference.

It is.

But if you look past the junk and examine the genes themselves, the physical order of many of those genes along the chromosomes has remained remarkably the same across all three species.

So they all inherited that basic genetic scaffolding from a common ancestor millions of years ago.

And despite all the evolutionary chaos, the core structure held firm.

Exactly.

The intergenic regions, the spaces between the genes, have expanded and contracted wildly.

But the essential order is preserved.
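
One way to picture that preserved order is to strip out everything except the shared genes and compare what is left. Here is a toy Python sketch. The gene labels and the two "genomes" are entirely made up, and real collinearity analyses work on aligned sequence rather than tidy lists of names.

```python
def shared_order(gene_order, shared):
    """Keep only the shared genes, preserving their order along the chromosome."""
    return [gene for gene in gene_order if gene in shared]


def is_collinear(order_a, order_b):
    """True if the genes common to both lists appear in the same relative order
    (or the exact reverse, which is the same map read from the other end)."""
    shared = set(order_a) & set(order_b)
    a = shared_order(order_a, shared)
    b = shared_order(order_b, shared)
    return a == b or a == b[::-1]


# Hypothetical compact genome and a bloated relative padded with inserted elements.
compact = ["g1", "g2", "g3", "g4", "g5"]
bloated = ["g1", "te_a", "te_b", "g2", "te_c", "g3", "g4", "te_d", "te_e", "g5"]
shuffled = ["g1", "g3", "g2", "g4", "g5"]

print(is_collinear(compact, bloated))   # prints: True
print(is_collinear(compact, shuffled))  # prints: False
```

The bloated list is twice as long as the compact one, yet the check still passes, because only the spacing between the genes has changed and not their order.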

It's a testament to our shared evolutionary history.

It is a stunning way to bring all these concepts together.

We went from sketching out a rough highway map using recombination to shredding and reading the exact DNA code, to breaking genes in zebrafish to find their function, all the way to seeing the shared architecture of life on Earth.

It really is.

And you know, as you continue studying, I want to leave you with a final thought about that missing variation we discussed earlier.

Oh, the dark matter.

Yeah.

We know that massive gene deserts can be entirely deleted from a mouse with zero observable effect.

We also know that our most advanced GWAS studies can still only account for 25 to 30 percent of the variation in complex human traits, leaving the vast majority hidden as dark matter.

The math really doesn't add up there.

Exactly where is that missing variation hiding if millions of non-coding base pairs genuinely do nothing?

And yet we are still missing the genetic causes for roughly 70 percent of the variation in those traits.

Are we fundamentally misunderstanding how non-coding sequences interact in three-dimensional space to influence our biology?

It is the absolute frontier of genomics.

That's a lot to think about.

To you, the student listening, good luck on your genetics exam.

You have the tools, you understand the logic, and you've got this.

And from everyone here at the Last Minute Lecture team, thank you so much for tuning in to this custom deep dive.

We'll see you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Genomics and proteomics represent complementary approaches to understanding the complete molecular blueprint of organisms and how that blueprint is executed at the protein level. Structural genomics begins with the fundamental challenge of mapping and sequencing entire genomes, employing two distinct mapping strategies: genetic maps that estimate gene positions based on recombination frequencies measured in centiMorgans, and physical maps that provide base-pair resolution of actual DNA sequences. Modern genome sequencing relies predominantly on whole-genome shotgun sequencing, which fragments DNA into small overlapping pieces and uses computational assembly to reconstruct the complete sequence, largely replacing earlier map-based approaches. Genomic variation, particularly in the form of single-nucleotide polymorphisms inherited as haplotypes, provides crucial insights into population diversity and disease susceptibility. Functional genomics extends beyond mere sequence information to characterize what genes actually do, examining the transcriptome through microarray analysis and RNA sequencing to determine which genes are expressed under specific conditions, and employing reporter sequences and mutagenesis screens to experimentally link genes to their biological functions. Comparative genomics illuminates evolutionary relationships by contrasting prokaryotic genomes, which are compact with consistent gene density and experience significant horizontal gene transfer, against eukaryotic genomes, which vary dramatically in size and repetitive content; the human genome exemplifies this complexity with only 1.5 percent protein-coding sequence despite containing 3.2 billion base pairs. Proteomics addresses the fact that the genome alone cannot predict cellular function, as proteins undergo posttranslational modifications and exist in dynamic interaction networks. 
Identifying and quantifying the complete proteome requires specialized techniques including two-dimensional gel electrophoresis for protein separation, mass spectrometry for precise identification, and affinity capture methods for mapping protein-protein interactions and establishing the cellular interactome. Structural proteomics further pursues determination of three-dimensional protein architectures through crystallography and nuclear magnetic resonance spectroscopy, completing the path from genetic information to functional molecular understanding.
