Chapter 21: Genomic Analysis

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today we are tackling what you might call the essential shortcut for navigating the modern era of genetics,

genomic analysis.

If you've ever felt a bit lost in that whole alphabet soup of omics,

sequencing, proteomics, all that, this is the deep dive you need to really get up to speed.

Absolutely.

We're talking about a revolution that really kicked off back in 1977.

That's when Fred Sanger and his team managed to sequence the entire DNA genome of the ΦX174 virus.

Which sounds impressive, but how big was that?

Tiny, actually.

Only about 5,400 nucleotides.

That achievement, it basically launched the field of genomic analysis.

The comprehensive study of genomes using recombinant DNA, high throughput sequencing, and crucially bioinformatics.

And fast forward a few decades, and we've leapfrogged from that small virus to mapping out the entire human reference blueprint.

So our mission today is to walk you through that incredible journey.

We'll look at the shift from old school gene mapping to automated whole genome sequencing,

cover the big surprises from the Human Genome Project, and then see how we analyze the functional end products, the transcriptome and the proteome.

Right.

And to really appreciate how fast modern sequencing is, you have to remember what came before.

The classical genetics era.

Before we could just read the DNA sequence, finding a gene was this, well, incredibly laborious two-step process.

First, you need a mutant, either spontaneous or one you induced.

Okay, so you needed something visibly different.

Exactly.

Then you had to generate these painstaking linkage maps using genetic markers, such as RFLPs, to sort of estimate where the gene physically was on the chromosome.

That sounds incredibly time consuming.

What were the big limitations when you tried to scale that up, say, to humans?

Oh, the limitations were huge.

First, like you said, you needed that clear phenotype, that visible change.

If a mutation killed the organism or didn't change anything obvious, you couldn't map it.

Second, it gave you absolutely zero information about all the non-coding parts of the DNA, which we now know are incredibly important functionally.

And maybe most critically, it never gave you the actual DNA sequence itself.

Just a rough location.

So when researchers started thinking about mapping the human genome, which back then they guessed had maybe a hundred thousand genes, trying to use those old methods, like positional cloning, just seemed completely insurmountable.

Totally insurmountable.

We needed something fundamentally different, technology that could just chew through billions of base pairs all at once.

And that became whole genome sequencing or WGS, the shotgun approach.

That's the one, the famous shotgun cloning approach.

I do like the analogy for WGS.

Can you walk us through how that process kind of bypasses the slow map based methods?

Sure.

So imagine that huge genetics textbook we mentioned is the entire chromosome.

With WGS, you basically shred it mechanically into millions of short, overlapping strips of text.

Okay.

Shred the book.

Got it.

We call these short overlapping DNA fragments contigs.

Now the shredding itself isn't the magic trick.

It's putting it all back together.

Right.

Because if you just have millions of tiny, almost identical strips, how on earth do you reassemble the book?

It sounds like a nightmare jigsaw puzzle.

That's exactly where bioinformatics comes in.

Powerful computer algorithms take all those contigs and align them based purely on where their sequences overlap.

Ah, so the overlap is the key.

It's the key.

It allows the software to reconstruct the full, continuous sequence of the original chromosome.
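
As a rough illustration of that overlap idea, here is a minimal Python sketch of greedy assembly: keep merging the two fragments whose end and start overlap the most. The function names and the four toy fragments are made up for this example, and real assemblers use far more sophisticated graph-based methods and have to cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; the pieces cannot be joined
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return max(frags, key=len)


# Toy fragments (hypothetical): each one overlaps the next by four bases.
fragments = ["ATGGCGT", "GCGTACG", "TACGTTA", "GTTAGGC"]
print(greedy_assemble(fragments))  # ATGGCGTACGTTAGGC
```

The point of the sketch is only that overlap alone is enough to recover the original order of the pieces, with no map needed.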

Craig Venter and his team at TIGR really proved this works back in '95.

They sequenced the entire Haemophilus influenzae bacterium genome using the shotgun method and showed it was not just possible, but way more efficient.

And the impact on efficiency and cost was just massive, wasn't it?

Staggering.

Automated sequencers boosted productivity something like 500 fold.

And the cost per base pair just plummeted from around a dollar down to less than a tenth of a cent.

That cost drop is really what made huge projects like the Human Genome Project possible, both financially and just logistically.

Now when we get these sequences, quality is obviously crucial.

You hear terms like draft sequence versus reference genome.

What's the difference there?

It's all about accuracy.

A draft sequence might have gaps or areas of lower certainty.

A reference genome, though, requires much more work compiling data, often from multiple sequencing runs, to ensure high accuracy.

And that's measured by coverage.

Exactly.

Coverage tells you how many times, on average, each specific nucleotide in the genome was read during sequencing.

To get a really high-quality reference, you need high coverage.

For instance, the Pseudomonas aeruginosa bacterial genome.

It was sequenced seven separate times to make sure the final compiled sequence was accurate enough.
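
Average coverage is simple arithmetic: total bases read divided by genome length. The sketch below assumes every read has the same length, and the genome size, read length, and read count are hypothetical round numbers chosen to land on sevenfold coverage; they are not the actual figures from that bacterial project.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each base is read: total bases read / genome size."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of equal-length reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical numbers: a 6,000,000 bp genome and 500 bp reads.
print(average_coverage(84_000, 500, 6_000_000))  # 7.0
print(reads_needed(7, 500, 6_000_000))           # 84000
```

Keep in mind this is an average; individual positions can still end up with more or fewer reads than that.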

We've done the sequencing, it's accurate, and we're staring at this enormous digital file.

Billions of A's, T's, G's, and C's.

How do we even begin to make sense of it?

Turn that raw data into biology.

This is where bioinformatics truly becomes essential.

It's this merger of IT, biology, and math.

Honestly, without it, managing this flood of data would be impossible.

Just look at GenBank.

That's the big public DNA database run by NCBI.

It holds over 220 billion bases of sequence data.

And get this, it doubles in size roughly every 18 months.
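
To get a feel for what doubling every 18 months means, here is a small sketch that extrapolates from the figures just quoted. It assumes the doubling time stays constant, which is a simplification, and the function name is made up for this example.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Exponential growth: size doubles once every `doubling_months`."""
    return current_bases * 2.0 ** (months_ahead / doubling_months)


start = 220e9  # roughly 220 billion bases, as quoted above
for months in (18, 36, 72):
    print(months, "months:", f"{projected_bases(start, months):.3g}", "bases")
```

Under that assumption, three years is two doublings, so the database would be about four times its current size.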

Wow.

So computation is the only way.

The only way.

Once the sequence is stored, the next big step is annotation.

That's the process of interpreting the raw sequence to find all the important bits, the regulatory elements, the control switches, and of course, the actual genes themselves.

And how do you start finding those?

Do you just scan for known patterns?

One of the first things you almost always do is a BLAST search.

That stands for Basic Local Alignment Search Tool.

It lets you take a sequence fragment you just got, maybe one with unknown function, and compare it against that massive GenBank database of all known sequences.

And if you get a hit, that gives you a clue about function, like comparing across species.

Exactly.

Let's say you sequence a bit of rat DNA.

You BLAST it.

And it comes back showing, say, 93% identity to the known mouse gene for the insulin receptor.

That's really strong evidence that your rat sequence does the same job.
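
Percent identity is just the share of aligned positions where the two sequences carry the same letter. This minimal sketch assumes the sequences have already been aligned to equal length, with "-" marking gaps; the two 15-base sequences are invented for illustration and are not real rat or mouse data.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Hypothetical aligned fragments that differ at a single position.
query = "ATGGCCTTAGACGTA"
subject = "ATGGCCTTAGATGTA"
print(round(percent_identity(query, subject), 1))  # 93.3
```

BLAST itself also has to find the best local alignment first; this sketch only covers the scoring step after that.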

And BLAST gives you some kind of confidence score.

It does.

It gives you an E value, or expect value.

That number estimates how many matches that good you'd expect to find just by random chance.

So the closer your E value is to zero, the more statistically significant your match is, the less likely it's just a coincidence.
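
For the curious, the expect value can be approximated from a normalized "bit score" as E = m × n × 2^(−S), where m is the query length, n is the total database length, and S is the bit score. The sketch below uses that textbook approximation; real BLAST applies corrections such as effective lengths, and the query length and bit scores here are invented for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: float) -> float:
    """Approximate E value: expected number of chance hits scoring at least this well."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical 300-base query against a database of about 220 billion bases.
for score in (40, 60, 100):
    print(score, f"{expect_value(score, 300, 220e9):.2e}")
```

Notice how quickly the value falls toward zero as the score rises: every extra bit halves the number of matches you would expect by chance.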

Beyond just comparing sequences, though, the annotation software has to be smart enough to look for the specific language of genes within the sequence, right?

The signals that say, hey, a gene starts here.

Precisely.

It looks for known hallmarks.

In eukaryotes, that includes things like promoter sequences upstream of a gene.

You might look for characteristic patterns like TATA boxes, CAAT boxes, or GC boxes where regulatory proteins bind.

It also looks for signals related to RNA processing, like splice sites.

Most introns, the bits that get cut out of eukaryotic RNA, consistently start with the nucleotides GT and end with AG.

And then downstream, you look for termination and polyadenylation signals.
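
Here is a small sketch of the GT...AG rule in action: one helper checks whether a candidate intron has the canonical boundaries, and another removes introns at known coordinates to join the exons. The sequences and coordinates are invented, and real gene-prediction software weighs many more signals than these two dinucleotides.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG (DNA alphabet)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice_out(transcript: str, introns: list[tuple[int, int]]) -> str:
    """Remove half-open (start, end) intron intervals and join what is left."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(transcript[pos:start])
        pos = end
    pieces.append(transcript[pos:])
    return "".join(pieces)


# Hypothetical mini-gene: exon 1 + intron + exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTCTTTCAG", "TTCTAA"
gene = exon1 + intron + exon2
print(has_canonical_splice_sites(intron))           # True
print(splice_out(gene, [(6, 6 + len(intron))]))     # ATGGCCTTCTAA
```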

And if it's a gene that codes for a protein, the software is hunting for open reading frames, or ORFs.

That's right.

An ORF is a stretch of nucleotides that reads like a translatable message.

It has to start with an initiation codon, usually ATG in DNA, which becomes AUG in RNA.

And it has to end with one of the three stop codons, TAA, TAG, or TGA.
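
That definition translates almost directly into code. The sketch below scans the three forward reading frames for an ATG followed, in the same frame, by a stop codon. It is a simplified illustration: it ignores the reverse strand and introns, and the test sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[str]:
    """Return ORFs (start codon through stop codon) in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]
                if len(orf) // 3 >= min_codons:
                    orfs.append(orf)
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGGTGACC"))  # ['ATGAAATGGTGA']
```

In that toy sequence there is a second ATG in a different frame, but it never reaches an in-frame stop codon, so it is not reported.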

Simple enough in bacteria, I suppose.

But in eukaryotes, those ORFs are broken up by introns, aren't they?

Doesn't that make prediction harder?

It does make it significantly more complex, because the coding parts, the exons, are interrupted by these non-coding introns.

So prediction software uses other clues.

One important one is codon bias.

Codon bias?

Yeah.

Organisms don't use all the possible codons for a given amino acid with equal frequency.

There's a bias.

So the software analyzes the patterns of codon usage, which can help distinguish the real coding exons from the surrounding non-coding DNA, including introns.
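
Measuring codon usage comes down to counting. This minimal sketch tallies how often each codon appears in an in-frame coding sequence; the sequence is invented and consists only of leucine codons, so the skew toward one of them is easy to see. Actual gene finders compare such frequencies against organism-specific tables, which is not shown here.

```python
from collections import Counter


def codon_usage(coding_seq: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical sequence: five leucine codons, three of them CTG.
print(codon_usage("CTGCTGCTGCTATTA"))  # {'CTG': 0.6, 'CTA': 0.2, 'TTA': 0.2}
```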

Okay.

So annotation helps us find the genes and predict their function.

That leads us into functional genomics, right?

Actually figuring out what these predicted genes do.

Exactly.

Functional genomics aims to establish the biological role of the RNAs or proteins predicted from the sequence data.

And very often, again, this relies on sequence similarity, comparing our unknown gene to genes whose functions are known.

Which brings up those important terms.

Paralogs and orthologs.

Can you clarify the difference?

Yeah.

These are really fundamental concepts for comparing genes.

Paralogs are related genes that are found within the same species.

They usually arose from some ancient gene duplication event.

Think of the alpha globin and beta globin genes in humans.

They're paralogs, both involved in hemoglobin, but slightly different.

Okay.

Same species, different, but related genes.

Right.

Orthologs, on the other hand, are genes found in different species that evolved from a common ancestral gene.

So the gene for the hormone leptin in mice, called lep, and the gene for leptin in humans, LEP, they are orthologs.

They trace back to a single gene in the common ancestor of mice and humans, and they usually retain the same function.

They share significant sequence identity, often over 85%.

So we can identify the genes, predict function through homologs.

Can we also map out the control systems, figure out which switches are being flipped?

We can, yeah.

That's where techniques like ChIP-seq come in.

That stands for chromatin immunoprecipitation sequencing.

Sounds complicated.

What does it tell us?

It basically maps out across the entire genome exactly where specific DNA binding proteins are attached.

Think transcription factors, the proteins that turn genes on or off.

ChIP-seq can show you precisely which DNA sequences they're binding to in a particular cell type or condition.

It gives you the regulatory wiring diagram.

Speaking of diagrams, let's talk about the big one, the Human Genome Project.

It mapped roughly 3.1 billion nucleotides.

What was the biggest surprise when the dust settled?

Oh, easily the gene count.

The biggest shock was how few protein-coding genes we found.

Only about 20,000.

Wait, only 20,000?

The early estimates were way higher, like 80,000, even 100,000, right?

That's barely more than some simple worms.

How did scientists square that circle?

How do 20,000 genes make something as complex as a human?

It was a huge moment of rethinking, a massive intellectual shift.

The main answer lies in alternative splicing.

Okay, explain that.

It turns out that for most human genes, the initial RNA transcript, the pre-mRNA, can be spliced in different ways.

Different combinations of exons can be stitched together.

Ah, so one gene doesn't just make one protein.

Exactly.

Studies show something like 94 to 95% of human genes undergo alternative splicing.

This allows a single gene locus to produce multiple different messenger RNAs and therefore multiple distinct proteins with potentially different functions.

That's how you get complexity from a relatively small gene set.
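
To see how quickly the combinations add up, here is a toy sketch that enumerates the transcripts a gene could produce if some of its exons can be skipped. The exon names are hypothetical, and the model is deliberately naive: in real cells splicing is regulated, so not every combination is actually made.

```python
from itertools import combinations


def cassette_isoforms(exons: list[str], optional: set[str]) -> list[tuple[str, ...]]:
    """All exon combinations obtained by skipping any subset of the optional exons."""
    skippable = [e for e in exons if e in optional]
    isoforms = []
    for r in range(len(skippable) + 1):
        for skipped in combinations(skippable, r):
            isoforms.append(tuple(e for e in exons if e not in skipped))
    return isoforms


# Hypothetical four-exon gene where the two middle exons are optional.
for isoform in cassette_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
    print("-".join(isoform))
```

Two optional exons already give four possible messages from one gene, and each additional optional exon doubles that upper bound.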

So fewer genes, but each one works harder, basically.

What else did the HGP tell us about ourselves?

Some other really key findings.

Like we said, less than 2% of our genome actually codes for protein.

A huge chunk, maybe 50% or even more, is made up of repetitive DNA elements, things like LINEs and Alu sequences, which are sort of genetic fossils or mobile elements.

And human diversity.

Despite all our visible differences, at the DNA sequence level, any two humans are about 99.9% identical.

The variation that does exist, the 0.1%, is mostly due to SNPs (single nucleotide polymorphisms, just single-letter changes) and CNVs (copy number variations, where people have different numbers of copies of certain genes or DNA segments).
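
A tenth of a percent sounds tiny, but a quick back-of-the-envelope calculation shows what it means across a whole genome. The sketch uses the round figures quoted in this overview and simply treats the difference as a count of positions, which glosses over how copy number variations are actually measured.

```python
GENOME_LENGTH = 3_100_000_000   # nucleotides, the figure quoted in this overview
IDENTICAL_PER_THOUSAND = 999    # 99.9% identity, kept as an integer to avoid float rounding


def differing_positions(genome_length: int, identical_per_thousand: int) -> int:
    """Approximate number of positions at which two genomes differ."""
    return genome_length * (1000 - identical_per_thousand) // 1000


print(differing_positions(GENOME_LENGTH, IDENTICAL_PER_THOUSAND))  # 3100000
```

So on these numbers, two people would differ at roughly three million positions.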

And the story didn't end with the HGP reference sequence, did it?

We learned that even within one person, the genome isn't uniform.

This idea of somatic genome mosaicism.

That's right.

Subsequent projects like personal genome projects started revealing that the cells within a single individual aren't all genetically identical.

Mutations can accumulate in different cell lineages as we develop and age.

So your genome isn't one static thing.

Which leads to the idea that "the genome" is a useful simplification, but it doesn't capture the full diversity.

The pangenome idea tries to encompass the total set of genes and major variations found across an entire species, acknowledging that no single individual represents all of that diversity.

Okay, so understanding our own genome led naturally to comparing it with others.

Comparative genomics.

And this seems to have triggered this whole omics revolution.

It really did.

Comparative genomics, looking at similarities and differences across species, became hugely powerful.

And yeah, it coincided with this explosion of related fields,

proteomics, studying proteins, metabolomics, metabolites,

toxicogenomics, how toxins affect gene expression,

and metagenomics, studying communities of organisms.

When you compare genomes, say bacteria versus eukaryotes, the basic structures look really different, don't they?

Very different.

Bacterial genomes are typically small, compact, and packed with genes, with a high gene density of roughly one gene every thousand base pairs or so.

And they often have genes arranged in operons, where multiple genes involved in one pathway are controlled together as a single unit.

Eukaryotes are the opposite.

Pretty much.

Eukaryotic genomes vary wildly in size, often much larger, they have much lower gene density, genes are spread far apart, they're full of those introns we mentioned earlier, interrupting the coding sequences, and they have vast amounts of that repetitive DNA.

Very different organizational principles.

These comparisons aren't just academic exercises, though.

Comparing the human genome to, say, the dog genome has real medical relevance.

Absolutely.

We share about 75% of our genes with dogs,

and because of selective breeding, purebred dogs have high rates of certain genetic disorders that are very similar to human diseases.

This makes them incredibly valuable models for studying the genetics of over 400 human conditions.

And our closest relatives, chimpanzees, we share something like 98% sequence identity.

Around 98%, yeah.

The fascinating thing there isn't just the similarity, but understanding the differences.

Many key differences seem to lie not in the protein-coding genes themselves, but in how those genes are regulated, especially during development.

There's particular interest in human accelerated regions, or HARs.

These are segments of the genome that are highly conserved across most mammals, but show evidence of rapid evolutionary changes, specifically in the human lineage, since we diverged from chimps.

Many HARs appear to function as regulatory enhancers, potentially driving uniquely human traits, especially related to brain development.

And genomic analysis has even rewritten our understanding of recent human history, hasn't it?

By sequencing ancient DNA.

Oh, completely.

Sequencing DNA from Neanderthal fossils was a game changer.

We found we share about 99% sequence identity with them.

But the really cool part.

The interbreeding.

Exactly.

Finding that modern humans whose ancestors migrated out of Africa carry between 1 and 4% Neanderthal DNA.

It's direct evidence that our ancestors met and interbred with Neanderthals, probably somewhere in the Middle East, maybe 45,000 to 80,000 years ago.

Genomics gave us a window into those ancient encounters.

And then there's metagenomics, which sounds like genomics on a massive scale.

It is, in a way.

Metagenomics, or environmental genomics, is about sequencing DNA directly from entire communities of microbes in an environmental sample.

Think soil, water, air, even the human gut.

You don't try to isolate and culture each individual species, which is often impossible anyway.

You just sequence everything that's there.

You sequence the mix.

The famous Sorcerer II Global Ocean Sampling expedition did this.

Sailing around the world, scooping up seawater and sequencing the DNA within.

It revealed thousands upon thousands of previously unknown microbial species and genes.

Like that study of the New York City subway.

Yeah, that was another great example.

They swabbed surfaces all over the subway system and sequenced the DNA.

Nearly half of the DNA they found didn't match any known organism in the databases.

It just highlights this vast hidden microbial world that metagenomics is starting to uncover.

Okay, so we've got the DNA blueprint mapped, annotated, compared.

But the blueprint isn't the same as the working factory.

To see what's actually happening, we need to look at RNA, right?

That's transcriptomics.

Transcriptomics studies the transcriptome. That's the complete set of all RNA molecules being expressed in a cell or tissue at a specific moment.

It tells you which genes are actually turned on and how strongly.

It gives you the gene expression profile.

And for years, the main tool for this was the DNA microarray or gene chip.

That was the workhorse, yeah.

Gene chips allowed you to measure the expression levels of thousands of known genes simultaneously.

How did they work, basically?

You'd isolate the messenger RNA, mRNA, from your cells.

That's the template for making proteins.

You'd convert that mRNA into complementary DNA, cDNA, and label it with a fluorescent tag.

Then you wash this labeled cDNA over a glass slide, the microarray chip.

And the chip has probes on it.

Tiny spots, yeah.

Each spot contains a known DNA sequence corresponding to a specific gene.

If that gene was expressed in your sample, the labeled cDNA would bind or hybridize to its corresponding spot on the chip.

The brighter the fluorescence at a spot, the more that gene was expressed.

Powerful for getting a snapshot of thousands of genes at once.

But technology moves on.

Today, the gold standard is RNA sequencing, or RNA-seq.

Yes, RNA-seq has largely replaced microarrays for most applications.

It's a high-throughput sequencing approach applied directly to the RNA, usually after converting it to cDNA.

What makes it better than microarrays?

Several things.

First, you don't need pre-existing probes for known genes.

RNA-seq sequences all the RNA molecules present, so you can discover novel transcripts or splice variants that wouldn't be on a standard chip.

It also gives you the actual sequence data, not just a fluorescence intensity level.

This provides much higher resolution, better sensitivity for low-expressed genes, and a wider dynamic range.

Plus, you can even do it in situ within the cell itself to see spatially where transcription is happening.

Okay, from DNA blueprint to RNA activity.

Now for the final players.

The proteins themselves.

The proteome.

That's where proteomics comes in.

Right.

The proteome is the complete set of proteins expressed by a genome, or more specifically, by a cell or tissue at a certain time.

And studying it is crucial because, as we discussed with alternative splicing, the link between gene number and protein number isn't one-to-one.

Because of alternative splicing and also modifications after the protein is made.

Exactly.

Post-translational modifications, or PTMs, things like adding phosphate groups or sugars, vastly increase the diversity.

So those 20,000 human genes, they are estimated to produce maybe closer to 290,000 different protein forms when you account for splicing and PTMs.

The Human Proteome Map project aims to catalog these.

So how do you analyze this huge complex mix of proteins?

Two main techniques dominate proteomics.

The first is a separation method called two-dimensional gel electrophoresis, or 2DGE.

Two dimensions.

How does that work?

It separates proteins based on two different properties.

First, you separate them along one dimension based on their isoelectric point, that's the pH at which they have no net electrical charge.

Then you take that separated strip and run it in a second dimension, usually through a standard gel that separates them based on their molecular weight or size.

So you end up with a gel showing potentially thousands of protein spots separated by charge and size.

Exactly.

A complex map of the proteome.

But then you just have spots on a gel.

How do you know what protein each spot is?

That's where the second key technique comes in.

Mass spectrometry, or MS.

Mass spec is incredibly powerful for identifying molecules.

How does it identify proteins?

You typically cut the protein spot out of the gel, digest the protein into smaller peptide fragments, and then introduce these peptides into the mass spectrometer.

The instrument ionizes the peptides, gives them an electrical charge, and then measures their mass-to-charge ratio with extreme precision.

And that mass-to-charge ratio is like a fingerprint.

It essentially creates a unique mass fingerprint for the peptides derived from that protein.

You get a spectrum showing the masses of the different fragments.

This spectrum can then be compared computationally against databases of known protein sequences to identify which protein was in your original spot.
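
The two ideas here, the mass-to-charge ratio and the database comparison, can be sketched in a few lines. The example assumes peptides are ionized by adding protons, and the protein names, peptide masses, and tolerance are all invented for illustration; real search engines use much richer scoring.

```python
PROTON_MASS = 1.007276  # daltons


def mass_to_charge(neutral_mass: float, charge: int) -> float:
    """m/z of a peptide that has picked up `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def match_fingerprint(observed: list[float], database: dict[str, list[float]],
                      tolerance: float = 0.01) -> dict[str, int]:
    """For each protein, count observed peptide masses that match within tolerance."""
    return {
        name: sum(any(abs(obs - ref) <= tolerance for ref in masses) for obs in observed)
        for name, masses in database.items()
    }


# Hypothetical reference peptide masses (Da) for two made-up proteins.
database = {
    "protein_A": [842.51, 1045.56, 1479.80],
    "protein_B": [905.47, 1296.68, 2211.10],
}
observed = [842.509, 1479.795, 2211.30]
print(mass_to_charge(1000.0, 2))               # about 501.007
print(match_fingerprint(observed, database))   # {'protein_A': 2, 'protein_B': 0}
```

The protein with the most matching peptide masses is the best candidate for the spot.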

MS can even detect those post-translational modifications.

And this technology, mass spectrometry, it's not just limited to fresh samples from the lab, is it?

This is where things get really wild.

This is where it gets mind-blowing, yeah.

Because mass spectrometry is so sensitive, it can detect minute traces of molecules, even ancient ones.

The truly stunning example is the T. rex.

The 68-million-year-old dinosaur.

That exact one.

Researchers were able to extract tiny fragments of protein from a 68-million-year-old Tyrannosaurus rex fossil.

They analyzed these fragments using mass spectrometry.

And what did they find?

They found sequences consistent with collagen protein.

And crucially, when they compared these ancient peptide sequences to modern organisms, the closest matches were to birds, like chickens and ostriches.

Wow.

So hard biochemical evidence from a fossil supporting the dinosaur-bird evolutionary link.

Precisely.

Using proteomics, specifically mass spec, to read a protein fingerprint across a gulf of 68 million years, it's just an incredible demonstration of how powerful these analytical techniques have become.

So wrapping up our deep dive today, we've journeyed from the painstaking methods of classical gene mapping to the incredible speed and power of whole genome sequencing.

We saw how essential bioinformatics is for making sense of billions of bases, leading to annotation and gene identification.

We covered the revolutionary findings of the Human Genome Project, especially that surprisingly low gene count of around 20,000, and how alternative splicing helps explain the complexity.

And finally, we looked at how we analyze the actual functional molecules, the RNA through transcriptomics with microarrays and now RNA-seq, and the proteins through proteomics using 2DGE and mass spectrometry.

It's quite a journey from sequence to function.

So here's something to think about as we finish.

We now understand that our own genomes aren't static monoliths.

They show somatic mosaicism.

We know most of our DNA isn't protein-coding genes, but likely functional regulatory regions, thanks to projects like ENCODE that built on the HGP.

Now consider that final mind-bending example.

Using mass spectrometry to sequence protein fragments from a T. rex fossil, confirming an evolutionary link across 68 million years.

How does that ability to chemically analyze biological molecules from deep time fundamentally shift your perspective on time itself and our connection to the entire history of life?

Thank you for joining us on the Deep Dive.

We hope you feel thoroughly well informed after that exploration of genomic analysis.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Genomic analysis examines entire genomes through the integration of DNA sequencing technologies, recombinant DNA approaches, and computational biology tools that enable researchers to decode, interpret, and compare genetic information at an unprecedented scale. Bioinformatics serves as the foundational bridge connecting biology, information technology, and mathematics, providing the essential infrastructure for storing, retrieving, and analyzing vast quantities of nucleic acid and protein sequence data. Whole-Genome Sequencing, commonly executed through shotgun sequencing methodology, fragments entire chromosomes into overlapping segments called contigs that are then reconstructed computationally by aligning regions of sequence similarity, a process made practical and economical through High-Throughput Sequencing technologies. Following genome assembly, annotation uses computational tools to systematically identify functional regions, particularly Open Reading Frames representing protein-coding sequences and regulatory elements such as promoters and enhancers. BLAST applications enable researchers to search reference databases like GenBank to identify sequence similarities and evolutionary relationships, distinguishing between orthologs found across different species and paralogs arising from duplication events within a single species. The Human Genome Project revealed surprising findings, demonstrating that only approximately 20,000 protein-coding genes exist in humans despite the genome containing 3.1 billion nucleotides, with less than 2 percent dedicated to protein synthesis. Subsequent efforts including Personal Genome Projects uncovered that genetic variation within populations stems primarily from Single-Nucleotide Polymorphisms and Copy Number Variations, insights leading to the pangenome concept that captures all genetic diversity within a species rather than relying on a single reference sequence. 
Specialized subdisciplines address distinct aspects of genome function and evolution: Functional Genomics establishes gene function through experimental approaches; Comparative Genomics analyzes evolutionary divergence by examining sequence differences between organisms, including the 98 percent sequence identity shared between humans and chimpanzees; Metagenomics investigates genetic material from environmental samples including the Human Microbiome Project; Transcriptome Analysis quantifies expression patterns of all transcribed RNAs within cells; and Proteomics characterizes the complete cellular protein complement using separation techniques like Two-Dimensional Gel Electrophoresis combined with identification methods such as Mass Spectrometry. Alternative splicing substantially expands the proteome diversity, enabling the relatively modest human gene count to generate potentially 290,000 distinct proteins. The ENCODE project further demonstrated that approximately 80 percent of the genome exhibits biochemical functionality, often producing various noncoding RNA molecules with regulatory roles.
