Chapter 15: Genomics – Mapping & Sequencing the Genome

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today, we're jumping into one of the biggest revolutions in biology: genomics.

It really is huge.

And maybe the best way to grasp just how disruptive it is is to start our story somewhere unexpected, like a really cold cave in Siberia.

Ah, the Denisova cave, right.

That's where researchers found these amazing remains, a previously unknown group of archaic humans, the Denisovans.

Just from a finger bone, wasn't it?

A tiny one.

From a juvenile female's finger bone, yeah.

And later, they found a Neanderthal toe bone there, too, and extracted the DNA from those.

Well, genomics confirmed something massive about our past.

Okay, I'm hooked.

What did it confirm?

That human history is, let's say, messier and way more interesting than we thought.

We learned through analyzing their entire genomes.

Their entire genomes.

From ancient bones.

Exactly.

That the ancestors of modern humans didn't just, you know, wave hello to Neanderthals and Denisovans.

They actually interbred.

Wow.

And you, the listener, think about this.

If your ancestry is non-African, you almost certainly carry short bits of Neanderthal DNA.

And the Denisovans?

Well, if your family tree traces back to places like New Guinea or Melanesia, you might carry Denisovan genetic features.

It's incredible stuff.

It really puts the power of genomics into perspective right away.

It's such a leap from, say, Mendel studying pea plants.

Oh, absolutely.

That was classic genetics.

Maybe looking at one or two genes at a time.

Genomics is the science of the whole system.

Mapping, sequencing, figuring out the function of the entire genome.

All the genetic info an organism has.

Okay, so that's a monumental task.

Our mission today, then, is to kind of guide you through the basics of how scientists even approach this.

We're thinking of it in terms of like three main pillars.

That's a great way to put it.

You've got structural genomics, functional genomics, and comparative genomics.

Right.

So structural is the blueprint.

What does it look like?

Exactly.

What's the physical structure?

Then functional asks, okay, what does all this DNA actually do?

That gets into things like RNA, the transcriptome, and all the proteins, the proteome.

And the third pillar, comparative?

Comparative genomics puts it all in context.

How does this genome compare to others?

How did it evolve?

What can we learn from the similarities and differences across species?

Got it.

So structure, function, comparison.

But none of this would even be conceivable without, I guess, a massive leap in technology, right?

Astronomical is probably the right word.

Think about Robert Holley back in the '60s.

Nobel Prize winner.

Brilliant work.

Yeah.

It took him years just to sequence 77 nucleotides, 77 letters, of yeast tRNA.

77.

Okay.

Years for 77 letters.

What can we do today?

Today.

High-throughput sequencing machines.

They can blast through over 25 billion nucleotides in a single day.

25 billion per day.

Per day.

What took Holley years for a tiny molecule, we can now do in, well, less than a second.
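As a back-of-the-envelope check on that comparison, here is a minimal sketch that uses only the two figures quoted in the conversation (25 billion nucleotides per day and a 77-nucleotide tRNA). It treats the machine's daily output as an average rate, which is an assumption for illustration; a real instrument works in parallel batches, so no single read actually finishes this quickly.

```python
# Figures quoted in the conversation above.
NUCLEOTIDES_PER_DAY = 25_000_000_000  # modern high-throughput sequencer
TRNA_LENGTH = 77                       # Holley's yeast tRNA
SECONDS_PER_DAY = 24 * 60 * 60

# Average throughput, assuming output is spread evenly over the day.
nucleotides_per_second = NUCLEOTIDES_PER_DAY / SECONDS_PER_DAY

# Time "budget" that 77 nucleotides take up at that average rate.
seconds_for_trna = TRNA_LENGTH / nucleotides_per_second

print(f"{nucleotides_per_second:,.0f} nucleotides per second on average")
print(f"{seconds_for_trna * 1000:.2f} milliseconds for 77 nucleotides")
```

At that average rate, 77 nucleotides correspond to a fraction of a millisecond, so "less than a second" is, if anything, an understatement.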

That kind of speed.

It creates its own problems, doesn't it?

Where do you put all that information?

You can't just jot it down.

You absolutely cannot.

It's a data tsunami.

And that necessity really drove the birth of bioinformatics.

Ah, bioinformatics.

So this is the computational side of things.

Precisely.

You need serious computing power to organize, label, store, and analyze this just massive flood of sequence data.

It all gets archived in huge public databases.

The big one is GenBank run by the NCBI.

So GenBank is like the giant library of all known sequences.

But if you find a new little piece of DNA in your lab, how do you figure out what it is or where it belongs in that library?

You need a search engine.

And the workhorse for that is often a tool called BLAST.

BLAST.

Like blasting through data.

It stands for Basic Local Alignment Search Tool.

You feed it your unknown sequence and bam.

It compares it against everything in GenBank.

And tells you.

It can tell you almost instantly, hey, this looks like part of a known gene, or this sequence is really similar to the human beta-globin gene.

Maybe it carries oxygen.

Or even this sequence maps to chromosome 4.

It gives you immediate context.
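To make "local alignment" concrete, here is a minimal sketch of the idea behind that kind of search. BLAST itself uses a much faster seed-and-extend heuristic; the exhaustive Smith-Waterman scoring below is swapped in only because it is short and shows what a local alignment score is. The three "database" entries, the query, and the scoring values are made-up toy choices, not real GenBank records or BLAST defaults.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences."""
    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a fresh local alignment
                score[i - 1][j - 1] + step,  # extend with match/mismatch
                score[i - 1][j] + gap,    # gap in the subject
                score[i][j - 1] + gap,    # gap in the query
            )
            best = max(best, score[i][j])
    return best


# Hypothetical toy "database"; names and sequences are illustrative only.
database = {
    "toy_sequence_A": "TTGACCGTAAGCTTAGGCTA",
    "toy_sequence_B": "CCCATGGTGCACCTGACTCC",
    "toy_sequence_C": "AGAGAGAGTTTTCCCCAAAA",
}
query = "GGTGCACCTG"

# Rank every database entry by how well the query aligns locally to it.
hits = sorted(
    ((local_alignment_score(query, seq), name) for name, seq in database.items()),
    reverse=True,
)
best_score, best_name = hits[0]
print(best_name, best_score)
```

The query is an exact substring of `toy_sequence_B`, so that entry gets the maximum possible score (10 matches at 2 points each) and comes out as the top hit.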

Super powerful.

Okay.

That makes sense.

You need the tech to get the data and the bioinformatics to manage and search it.

Let's circle back to that first pillar, structural genomics.

Building the blueprint.

Before full sequencing, we had to make maps.

Right.

But not just one map.

Correct.

The challenge was linking together three different kinds of maps of the chromosomes.

Three.

Okay.

What were they?

First, you had genetic maps.

These are based on how often genes or traits are inherited together.

Which tells you about their relative linkage.

The distances are measured in centiMorgans.

It's more about probability than physical distance.
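A minimal sketch of how a genetic map distance comes out of inheritance data, assuming the standard shortcut that 1% recombinant offspring corresponds to about 1 centiMorgan. The offspring counts are invented for illustration, and the shortcut only holds for markers that are fairly close together, because double crossovers hide some recombination over longer distances.

```python
def map_distance_cm(recombinant_offspring, total_offspring):
    """Approximate genetic distance: percent recombinants ~ centiMorgans."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100 * recombinant_offspring / total_offspring


# Hypothetical cross: 12 recombinant offspring out of 200 scored.
distance = map_distance_cm(12, 200)
print(f"{distance} cM")
```

Note that the result says nothing directly about base pairs, which is exactly why genetic maps have to be tied to physical maps.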

Okay.

Linkage maps.

What else?

Then there are cytological maps.

These are the ones you might remember seeing: the visual banding patterns you get when you stain chromosomes with dyes like Giemsa.

It gives you landmarks.

Like light and dark bands on the chromosome arms.

Right.

The ultimate goal is the physical map.

This is the map based on actual molecular distance, measured in base pairs (bp), kilobases (kb), or even megabases (Mb).

It's the ground truth of the DNA sequence itself.

So the goal was to connect the inheritance patterns from the genetic map and the visual landmarks from the cytological map to the actual DNA sequence distances on the physical map.

How did they create those physical maps?

Several ways.

You could make restriction maps showing where specific enzymes cut the DNA.

Or contig maps, which show how large cloned DNA fragments, often in vectors like BACs or YACs, overlap each other.

And STS maps, based on unique short sequences called sequence-tagged sites.

And bridging these different maps required markers, right?

Like signposts along the chromosome.

Exactly.

You need high-density molecular markers.

Two types were crucial.

Okay.

What were they?

First, RFLPs.

Restriction Fragment Length Polymorphisms.

RFLPs.

Okay.

What are those?

These come about when a mutation, maybe just a single base change, creates or destroys a cutting site for a restriction enzyme.

So when you cut the DNA from different individuals, you get fragments of different lengths.

You can track these length differences through families.
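Here is a minimal sketch of that RFLP idea. It assumes the enzyme EcoRI (recognition site GAATTC, cutting after the first G) and two invented alleles that differ by a single base inside the second site; the sequences are toy examples, not real loci.

```python
ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; it cuts between G and AATTC


def digest(sequence, site=ECORI_SITE, cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site."""
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Hypothetical alleles: allele_2 has one base change (A -> G) that destroys
# the second EcoRI site.
allele_1 = "ATGCGT" + "GAATTC" + "TTAGGCCA" + "GAATTC" + "CCGTA"
allele_2 = "ATGCGT" + "GAATTC" + "TTAGGCCA" + "GAGTTC" + "CCGTA"

print(digest(allele_1))  # three fragments
print(digest(allele_2))  # two fragments, one of them longer
```

The single base change removes one cut, so the two alleles give different fragment length patterns, and that difference is what gets tracked through a family.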

Ah, okay.

So variations in where the DNA gets cut.

What was the other type?

The other type, which turned out to be even more informative, are the tandem repeats.

Things like VNTRs, variable number tandem repeats, and STRs, short tandem repeats.

Tandem Repeats.

So sequences repeated over and over.

Exactly.

Like catchy, catchy, catchy.

The key here is that the variation isn't usually in the sequence itself, but in the number of times it's repeated.

One person might have 10 repeats.

Another might have 15.

And these vary a lot between people.

Hugely.

They're highly polymorphic.

That makes them incredibly valuable, not just for mapping genes, but also, as you probably know, they're the basis for DNA fingerprinting in forensics.
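A minimal sketch of what "the variation is in the number of repeats" means in practice. The repeat unit GATA and the flanking sequences are hypothetical; the 10 and 15 repeat counts are the ones used in the conversation.

```python
import re


def count_tandem_repeats(sequence, unit):
    """Number of units in the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical locus: same flanking sequence, different repeat counts.
flank_left, flank_right = "TTGACC", "GGATCA"
allele_a = flank_left + "GATA" * 10 + flank_right
allele_b = flank_left + "GATA" * 15 + flank_right

print(count_tandem_repeats(allele_a, "GATA"))
print(count_tandem_repeats(allele_b, "GATA"))
```

A DNA fingerprint is essentially a list of such repeat counts across several loci, which is why highly polymorphic repeats are so useful for telling individuals apart.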

Right.

Okay.

So you have these different maps and these markers.

How does that help you find a specific gene, say, for a disease?

That leads to a really powerful technique called positional cloning.

Positional cloning sounds like finding something based on its address.

That's basically it.

If you know from family studies that a disease gene is closely linked to a specific RFLP or STR marker, that's your starting point on the genetic map.

Then you use the physical map, maybe those overlapping contigs, to literally walk along the DNA from your known marker towards the disease gene.

You physically isolate DNA pieces step by step.

Kind of, yeah.

You find overlapping clones that take you closer and closer.

Eventually you narrow it down to a small region that should contain your gene.

Then you sequence that region in affected people to find the mutation.

That's how they found the Huntington's disease gene, for example.

Wow.

Molecular navigation.

Okay.

That sets the stage perfectly for the biggest structural genomics project ever, the Human Genome Project, launched in 1990.

The HGP.

Yeah.

A massive international effort, but also quite a bit of drama.

Drama.

How so?

Well, you had two main camps with very different strategies, and egos were definitely involved.

There was the public consortium funded by governments like the NIH and Wellcome Trust.

They were using a methodical hierarchical approach, basically map first, then sequence selected large chunks piece by piece.

Okay.

The slow and steady public team, who were they racing against?

They were racing against a private company, Celera Genomics, led by the very ambitious Craig Venter.

Ah, Venter.

I remember the name.

What was his strategy?

Celera championed the whole-genome shotgun approach.

Yeah.

Basically, forget the detailed map, just shred the entire genome into millions of tiny random fragments, sequence them all as fast as possible, and then use massive supercomputers to find the overlaps and piece the whole thing back together.
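The "find the overlaps and piece it back together" step can be sketched with a toy greedy assembler. This is an illustration of the idea only, not the algorithm Celera actually used: the genome string and the three reads are made up, the reads are error-free, and real assemblers have to cope with sequencing errors, repeats, and both DNA strands.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_length)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical 21-base "genome" and three overlapping reads, in random order.
toy_genome = "ATGGCGTACGTTAGCCTAGGA"
shotgun_reads = ["TTAGCCTAGGA", "ATGGCGTACG", "GTACGTTAGC"]

contigs = greedy_assemble(shotgun_reads)
print(contigs)
```

Here the overlaps are unambiguous, so the reads collapse into a single contig identical to the toy genome. Long repeated sequences are what make the same task hard on a real genome.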

That sounds chaotic.

Did it work?

It was controversial, seen as riskier, but yeah, largely it did work.

In the end, both groups published their draft sequences around the same time in 2001.

It was declared a tie, more or less.

So the race was over.

What did we learn from that first draft?

What were the headline numbers?

Well, the size was confirmed.

About 3.2 billion base pairs, roughly 2.9 billion in the euchromatin, the more gene-rich parts.

But the real shocker was the gene count.

Right.

You mentioned earlier that initial estimates were way higher.

Way higher.

People were throwing around numbers like 80,000, even 100,000 protein-coding genes.

The actual number that came out of the HGP and has been refined since is only around 20,500.

20,500.

That's it.

That seems surprisingly low, less than some plants, right?

Way less than some plants.

Yeah.

It was a big dose of humility.

It told us that human complexity doesn't just come from having more genes than everything else.

There's clearly a lot more going on in terms of regulation and how those genes are used.

So fewer genes than expected.

What about the structure of those genes?

Also interesting.

The average human gene is pretty big, maybe 14,000 base pairs long.

But most of that is introns, noncoding regions within the gene.

The actual coding bits, the exons that specify the protein sequence, make up only about 1 to 2% of the entire sequenced genome.

1 to 2%.

So what's the other 98%?

That's the big question that functional genomics tries to answer.

But structurally, we found that a huge chunk of it, maybe half the genome, is repetitive DNA.

Repeats again, like the STRs.

Similar concept, but on a much grander scale.

About 45% of the genome is derived from transposable elements, or transposons.

The so-called jumping genes.

Genetic hitchhikers, basically.

Remnants of old viruses or parasitic DNA.

That's a good way to think of it.

They are sequences that could, or at least once could, copy themselves and insert into new locations in the genome.

The most common types in humans are retrotransposons.

Retrotransposons.

Meaning they use RNA.

Exactly.

They get transcribed into RNA, and then an enzyme called reverse transcriptase copies the RNA back into DNA, which then gets inserted somewhere else.

And there are different kinds of these.

Two main classes dominate our genome.

LINEs, which are long interspersed elements, and SINEs, short interspersed elements.

LINEs and SINEs.

LINEs are the more autonomous ones.

They're quite long, and critically, they encode their own reverse transcriptase.

So they have the tools to copy and paste themselves.

And SINEs?

SINEs are much shorter, and they're basically genetic parasites of the LINEs.

They don't make their own reverse transcriptase.

They rely on borrowing the enzyme made by LINE elements to get themselves copied and spread around.

The most common SINE in humans is called Alu.

You have over a million copies of Alu sequences scattered through your genome.

Wow.

So our genome is like this ancient ecosystem full of these active or remnant mobile elements.

Absolutely.

It's dynamic, not static at all.

That 98% non-protein-coding figure is just mind-boggling.

It leads perfectly into functional genomics.

What is all that other stuff doing?

Well, we know some of it makes functional RNA molecules that aren't translated into protein.

Things like ribosomal RNA, rRNA, and transfer RNA, tRNA.

Essential for making proteins.

Right.

The cellular machinery.

And those genes often need to be present in many copies because the cell needs vast amounts of their products.

Like, humans have 150 to 200 copies of the rRNA genes.

Okay.

So some non-coding DNA makes essential RNAs, but that still doesn't account for most of it.

What about the rest?

The really exciting frontier in the last decade or two has been the discovery of thousands and thousands of long non-coding RNAs, or lncRNAs.

Long non-coding RNAs.

So RNA transcripts that don't seem to make proteins.

Exactly.

It turns out that even though only 1 to 2% of the genome codes for protein, something like 90% or more of the genome actually gets transcribed into RNA at some point in some cell type.

A lot of that is lncRNAs.

90% gets transcribed, but doesn't make protein.

What are these lncRNAs doing then?

That's the million dollar question or maybe billion dollar question.

We're still figuring it out, but many seem to be involved in gene regulation.

They can act like scaffolds, bringing proteins together or guiding enzymes to specific spots on the DNA.

A classic example is the Xist lncRNA.

Xist.

What does that do?

In female mammals, which have two X chromosomes, one X needs to be shut down, inactivated.

Xist RNA is produced by the X chromosome that will become inactive, and it literally coats that entire chromosome, recruiting proteins that silence it.

It's a master regulator.

Okay.

So non-coding RNAs can be really powerful regulators.

That adds another layer of complexity beyond just the protein-coding genes.

A huge layer.

And speaking of complexity and variation, let's talk about the differences between individuals.

The most common type of genetic variation is the SNP.

SNP.

Single nucleotide polymorphism.

Right.

Just a single letter change in the DNA sequence.

A C instead of a T, for example.

And these are incredibly common.

On average, you'll find one SNP about every 1,200 base pairs between any two people.

Every 1,200 bases.

So there are millions of these differences scattered throughout our genomes.
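A minimal sketch that checks the "millions" figure from the two numbers quoted in this episode (a 3.2 billion base pair genome and one SNP roughly every 1,200 base pairs), then shows what spotting a SNP between two aligned sequences looks like. The two short sequences are invented, and the comparison assumes they are already aligned and of equal length.

```python
GENOME_SIZE_BP = 3_200_000_000  # figure quoted for the human genome
BP_PER_SNP = 1_200              # average spacing quoted in the conversation

expected_snp_differences = GENOME_SIZE_BP / BP_PER_SNP
print(f"about {expected_snp_differences / 1e6:.2f} million differences")


def snp_positions(seq_a, seq_b):
    """List (position, base_a, base_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical aligned stretch from two people: a C in one, a T in the other.
person_1 = "ATGCCGTTAGCA"
person_2 = "ATGTCGTTAGCA"
print(snp_positions(person_1, person_2))
```

The arithmetic gives roughly 2.7 million single-letter differences between any two people, consistent with "millions".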

Millions.

Most don't do anything noticeable.

They might fall in non-critical regions, but they are fantastic markers.

And importantly, SNPs that are physically close together on a chromosome tend to be inherited together as a block.

A block of linked SNPs.

Yeah.

That block is called a haplotype.

Think of it like a specific combination of SNP variants along a stretch of chromosome that usually travels together through generations.

And mapping these haplotypes across populations, that was the goal of the HapMap project, right?

Exactly.

The International HapMap project aimed to chart these common haplotype blocks in different human populations.

Why?

What's the use of knowing these common patterns?

It's incredibly useful for finding genes associated with complex diseases.

If you find that people with a certain disease disproportionately carry a specific haplotype,

it tells you that a gene influencing that disease likely lies within or very close to that haplotype block on the chromosome.

It helps narrow down the search dramatically.

These are called genome-wide association studies, or GWAS.

Ah, GWAS.

Okay.

So SNPs and haplotypes are key tools for linking variation to traits and diseases.

Absolutely indispensable.
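The association logic described above, where people with a disease "disproportionately carry" a haplotype, can be sketched with a simple odds ratio. Everything here is invented for illustration: the three-SNP haplotypes, the group sizes, and the counts. A real genome-wide study tests a very large number of markers and needs proper statistics and multiple-testing correction, which this sketch leaves out.

```python
def carrier_counts(haplotypes, target):
    """Return (number carrying `target`, number not carrying it)."""
    with_target = sum(1 for h in haplotypes if h == target)
    return with_target, len(haplotypes) - with_target


def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds of carrying the haplotype in cases relative to controls."""
    return (cases_with * controls_without) / (cases_without * controls_with)


# Hypothetical data: each string is one person's haplotype at three linked SNPs.
cases = ["ATG"] * 60 + ["GCA"] * 40      # people with the disease
controls = ["ATG"] * 30 + ["GCA"] * 70   # people without it

cases_with, cases_without = carrier_counts(cases, "ATG")
controls_with, controls_without = carrier_counts(controls, "ATG")

ratio = odds_ratio(cases_with, cases_without, controls_with, controls_without)
print(f"odds ratio for haplotype ATG: {ratio}")
```

An odds ratio well above 1, as in this made-up example, is the kind of signal that points researchers to the chromosome region containing that haplotype block.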

Now to really see function in action, to see what genes are doing in a cell, scientists needed assays, right?

Ways to measure gene activity.

Right.

Two technologies really revolutionized this.

The first gives you a big picture snapshot.

Microarrays, or gene chips.

Gene chips.

Sounds like something from science fiction.

Kind of looks like it too.

It's usually a small glass slide and attached to it are thousands, even tens of thousands of tiny spots.

Each spot contains a known DNA sequence, a probe for a specific gene.

Okay, so a chip covered in gene probes, how do you use it?

You take cells you want to compare, say, normal liver cells versus cancerous liver cells.

You extract all the messenger RNA, mRNA from both samples.

Remember, mRNA levels reflect which genes are active.

Got it.

Then you use an enzyme, reverse transcriptase again, to make DNA copies of that RNA, called cDNA.

And you label the cDNA from the normal cells with, say, a green fluorescent dye and the cDNA from the cancer cells with a red fluorescent dye.

Okay, green for normal, red for cancer, then what?

You mix them together and wash them over the microarray chip.

The cDNA molecules will stick, or hybridize, to the spots on the chip that contain their matching gene sequence.

And the colors tell you what's happening.

Exactly.

If a spot glows bright green, it means that gene was highly active in the normal cells but not the cancer cells.

If it's bright red, the gene was on in the cancer cells but off in normal cells.

Yellow means it was active in both.

Black means it was off in both.

You get a snapshot of the expression levels of thousands of genes simultaneously.
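The color rules just described can be written down as a small lookup. This is a simplified sketch: the gene names, intensity values, and the on/off threshold of 100 are all hypothetical, and real microarray analysis works with normalized intensity ratios rather than a hard cutoff.

```python
def spot_color(green_normal, red_cancer, threshold=100):
    """Classify one spot from its green (normal) and red (cancer) signal."""
    normal_on = green_normal >= threshold
    cancer_on = red_cancer >= threshold
    if normal_on and cancer_on:
        return "yellow"  # active in both samples
    if normal_on:
        return "green"   # active in normal cells only
    if cancer_on:
        return "red"     # active in cancer cells only
    return "black"       # inactive in both


# Hypothetical spots: gene name -> (green intensity, red intensity).
spots = {
    "gene_A": (850, 20),
    "gene_B": (15, 920),
    "gene_C": (600, 640),
    "gene_D": (10, 12),
}

calls = {gene: spot_color(green, red) for gene, (green, red) in spots.items()}
print(calls)
```

Running the same classification over tens of thousands of spots is what turns one chip into an expression profile for the whole sample.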

That's incredibly powerful for comparing different states, like healthy versus diseased.

Hugely powerful.

It led to massive projects like The Cancer Genome Atlas, TCGA, building databases of gene expression patterns in different cancers.

Okay, so microarrays give you the RNA levels.

What if you want to see the actual protein, where it is, and what it's doing inside a living cell?

For that, the hero is GFP, green fluorescent protein.

GFP?

Didn't that come from a jellyfish?

It did.

A naturally fluorescent protein from the jellyfish Aequorea victoria.

The beauty of GFP is that the gene encoding it is relatively small, and when the protein is made, it just glows green.

It doesn't need any special cofactors or anything.

So how do scientists use it to watch other proteins?

It's elegantly simple really.

Using genetic engineering, you fuse the DNA sequence that codes for GFP right onto the end of the gene sequence for your protein of interest.

So you make a combined gene?

You make a combined gene.

When the cell expresses this gene, it makes a fusion protein, your protein of interest with GFP basically tacked onto it.

And because GFP glows?

Your protein now glows green.

So you can put this engineered gene into cells or even whole organisms and then just watch under a fluorescence microscope.

You can see exactly when your protein is made, where it goes in the cell, does it move around?

Yeah.

It's like putting a tiny fluorescent tracking beacon on your protein in real time inside a living system.

That's amazing.

A live movie instead of a static snapshot.

Truly revolutionary for cell biology.

Okay, we've covered structure and function and the tools.

Let's touch on the third pillar, comparative genomics.

Looking across species, there are big differences, say, between prokaryotes and eukaryotes, right?

Oh, absolutely.

Prokaryotic genomes, like those of bacteria and archaea, are typically small, compact, and incredibly gene-dense.

Very little wasted space.

Something like Mycoplasma genitalium has one of the smallest known genomes for a free-living organism.

Maybe only 265 to 350 essential genes.

It's operating near the bare minimum required for life.

Compared to our 20,500 genes spread across 3 billion base pairs with lots of non-coding stuff.

Exactly.

Eukaryotic genomes are generally much larger, much less gene-dense, full of those introns and repetitive elements we talked about.

Different evolutionary strategies.

And eukaryotes also have those other little genomes in mitochondria and chloroplasts.

Right, the organelle genomes.

Remnants of ancient symbiotic bacteria.

They have their own DNA, usually circular like bacterial DNA.

And they're inherited differently, aren't they?

Typically, yes.

In most animals, including humans, mitochondrial DNA, mtDNA, is inherited almost exclusively from the mother through the egg cell.

Chloroplast DNA, cpDNA, in plants is also often maternally inherited.

This makes them really useful for tracing maternal lineages.

And these organelle genomes vary in size, too.

Wildly.

Human mtDNA is tiny, just 16,500 base pairs, encoding only 37 genes, mostly involved in energy production.

But plant mtDNA can be enormous, sometimes hundreds of thousands of base pairs, though still encoding relatively few genes.

There's been a lot of gene transfer to the nucleus over evolution.

But they still have to work together, the nuclear genome and the organelle genomes.

Oh, completely.

Take cellular respiration.

Some protein subunits of the electron transport chain are encoded by the mtDNA.

Others are encoded by nuclear genes, synthesized in the cytoplasm and then imported into the mitochondria.

It requires tight coordination.

So comparing genomes, even within a cell, is crucial.

What about comparing across very different species?

Can we see echoes of shared ancestry?

Absolutely.

One key concept is shared synteny.

Synteny.

What's that?

Synteny refers to blocks of genes whose order is conserved across different species.

Even if the species diverged hundreds of millions of years ago, you can often find large stretches of chromosomes where the same genes are lined up in the same relative order.

Give me an example.

Cereal grasses like rice, wheat, corn, sorghum.

Their genomes vary a lot in size, but large blocks of genes show conserved synteny, pointing to their common ancestor.

You see the same thing comparing mammals like humans and mice.

Lots of blocks of conserved gene order, even though the chromosomes themselves have been rearranged.
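A minimal sketch of how conserved gene order could be detected between two species. The gene names and both gene orders are hypothetical, each gene is assumed to appear once per genome, and only blocks in the same orientation are found; real synteny tools also handle inversions, duplications, and missing genes inside a block.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Find runs of genes that are adjacent and in the same order in both lists."""
    position_in_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends_block = (
            gene in position_in_b
            and current
            and position_in_b[gene] == position_in_b[current[-1]] + 1
        )
        if extends_block:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start over.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in position_in_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical gene orders along one chromosome in two species.
species_1 = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
species_2 = ["g5", "g6", "g7", "g9", "g1", "g2", "g3"]

blocks = syntenic_blocks(species_1, species_2)
print(blocks)
```

In this toy case the chromosome has been rearranged, so the two blocks sit in a different order in species 2, but the gene order inside each block is preserved.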

It's powerful evidence for

Which brings us right back full circle to where we started with ancient DNA, paleogenomics.

Exactly.

Comparative genomics.

But comparing us to extinct relatives, it's incredibly challenging because ancient DNA is usually shattered into tiny fragments and chemically damaged.

But they managed it for Denisovans and Neanderthals.

They did.

Using those high-throughput sequencing methods and sophisticated bioinformatics to piece together the fragments and account for the damage.

And that's how we got definitive proof of that interbreeding in our past.

It really ties everything together.

The technology, the bioinformatics, the structural mapping, the comparison.

Okay, let's try to wrap this up.

What are the big takeaways from this deep dive into genomics?

Well, I think the main thing is the scale shift.

We went from studying single genes, maybe painstakingly, to being able to analyze entire genomes: structure, function, variation, evolution, all at once.

Driven by that incredible leap in sequencing speed and the bioinformatics needed to handle the data deluge.

Right.

And structurally, we learned the importance of correlating different kinds of maps, genetic, physical, to actually find genes through things like positional cloning.

And the surprising result from the HGP: relatively few protein-coding genes, but a vast amount of non-coding and repetitive DNA, including those active transposons like LINEs and SINEs.

And functionally, the story got even more complex.

It's not just proteins.

We have essential non-coding RNAs and this huge universe of lncRNAs that seem to be master regulators, plus all the individual variation driven by SNPs and haplotypes.

And we developed the tools to study function on a grand scale: microarrays for gene expression snapshots, GFP for watching proteins in living cells.

These tools are crucial for everything from basic biology to personalized medicine, like cancer diagnostics.

So genomics gives us these incredibly powerful tools and a much deeper, sometimes surprising view of our own biology and evolutionary history.

But it feels like we're still just scratching the surface, especially with that 98% non-protein-coding DNA.

What's the big challenge looking forward?

You nailed it.

The Human Genome Project, sequencing the genome.

That was really just step one, the structural phase.

The monumental task for geneticists now and probably for decades to come is the functional phase.

What does the vast majority of the genome actually do?

Figuring out the roles of all that non-coding DNA, understanding the complex regulatory networks involving lncRNAs, and truly connecting the dots between individual genetic variation, those SNPs and haplotypes, and complex traits, health, and disease.

That's the next frontier, deciphering the function of the rest of the genome.

That's the real challenge ahead.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary: What this audio overview covers

Genomics encompasses the systematic investigation of complete genomes across their structure, function, and evolutionary relationships, extending far beyond the foundational work of early geneticists. Structural genomics constructs detailed physical maps by measuring distances in base pairs, kilobases, and megabases, integrating this information with classical genetic maps recorded in centiMorgans and cytological maps derived from chromosome banding patterns through the use of anchor markers. Mapping projects depend on molecular polymorphisms including Restriction Fragment-Length Polymorphisms, Variable Number Tandem Repeats, and Short Tandem Repeats to generate contig maps and enable positional cloning strategies. The sequencing revolution was catalyzed by the Human Genome Project, which deciphered the 3.2 billion base pair human genome through both hierarchical BAC clone mapping and whole-genome shotgun sequencing approaches. Analysis revealed that only 1 to 2 percent of the genome encodes proteins, while approximately 50 percent comprises repetitive sequences originating from retrotransposons such as LINEs and SINEs. The human genome contains roughly 20,500 protein-coding genes alongside functional noncoding RNAs including long noncoding RNAs and microRNAs, as well as non-functional pseudogenes. Genetic variation within human populations is dominated by Single-Nucleotide Polymorphisms organized into inherited blocks called haplotypes, which the HapMap Project mapped to identify disease-associated loci. Massive sequencing datasets are stored in repositories like GenBank and analyzed through bioinformatics platforms such as BLAST. Functional genomics employs microarray technology to simultaneously monitor thousands of gene expression patterns and utilizes Green Fluorescent Protein fusions to track protein synthesis and subcellular localization in living cells. 
Comparative genomics examines genome organization across diverse species, revealing conserved gene clusters known as shared synteny and defining minimal gene sets through study of streamlined organisms. Paleogenomics reconstructs severely degraded ancient DNA, illuminating human evolutionary history and revealing evidence of genetic exchange between modern humans and archaic populations including Neanderthals and Denisovans.
