Chapter 18: Microbial Genomics & Bioinformatics Applications


Audio Overview


Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive, where we unpack the core knowledge from your sources, making you an instant expert.

Today, we're tackling something huge: microbial genomics.

It's a field that's just exploded, hasn't it?

Moving from, you know, carefully reading DNA letter by letter to actually writing new life forms.

It's really a revolution driven by data.

Yeah.

And our goal today is to sort of decode the technologies behind it.

And maybe the best place to start is the most mind-blowing application: synthetic life.

Right.

You have to start with the work from the J. Craig Venter Institute.

That was a huge deal.

Back in 2010, they announced Mycoplasma mycoides JCVI-syn1.0.

This thing was essentially built from a sequence stored on a computer.

Exactly.

They chemically synthesized the DNA, stitched the whole genome together, and then transplanted that artificial chromosome into a living cell.

The result, a microbe completely controlled by code they designed.

Wow.

And that wasn't just for fun, right?

There was a bigger goal.

Oh, definitely.

It was a proof of concept, a big one, cost something like 30 million dollars.

The idea was, can we design organisms to do specific jobs?

Things like making biofuels or maybe new vaccines, pharmaceuticals, that sort of thing.

And then they took it even further, didn't they?

They made it smaller.

Yeah.

They realized syn1.0 was maybe a bit bloated. It had over 900 genes.

So they went minimalist.

They systematically cut down the genome, got it down to just 473 genes.

That's syn3.0.

It's like the smallest known self-replicating cell, basically defining the minimum gene set for life.

Incredible.

But with that kind of power comes responsibility, right?

Ethical questions.

Absolutely.

And they thought about that.

syn1.0 had these built-in watermarks.

It's fascinating.

The synthetic DNA contains encrypted messages.

If you decode them, you find quotes from James Joyce, Richard Feynman.

It acts like a traceable signature.

So if there's ever a bio error, you could trace it back to the source.

That's pretty clever, actually.

It's sort of a digital fingerprint in the DNA.

Okay, so to really get how we can write life like this, we need to rewind and look at how we first learned to read the code, right, before all the fancy computers.

Exactly.

We need to go back to the workhorse method for decades, Sanger sequencing.

From 1977, Fred Sanger's idea was ingenious.

Turn sequencing into a measurement problem.

Okay, how did that work?

It involves special DNA building blocks?

Right.

Dideoxynucleotides, or ddNTPs. These are modified bases.

The key thing is they lack a specific chemical group, the 3' hydroxyl group.

So if the DNA polymerase enzyme adds one of these ddNTPs to a growing DNA strand, boom, synthesis stops. It can't add any more bases.

Ah, like a dead end.

Precisely.

So you'd set up four separate reactions. Each reaction tube has normal DNA building blocks, the polymerase, the template DNA you want to sequence, and a small amount of one specific ddNTP, say ddATP in the first tube, ddTTP in the second, and so on.

And these ddNTPs were labeled, usually radioactively back then.

Okay, so each tube creates fragments that stop at a specific letter.

You got it.

You end up with a collection of DNA fragments of all different lengths, each ending with a specific labeled base corresponding to that tube.

Then you separate these fragments by size using gel electrophoresis.

Imagine four lanes on a gel, one for each reaction.

The fragments migrate based on size, shortest moves fastest.

By reading the bands on the gel from bottom to top across the four lanes, you could literally read the DNA sequence.

And it gave pretty decent read lengths for the time, maybe 500 to 800 bases.
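
To make the gel-reading logic concrete, here is a minimal Python sketch of the four-tube idea described above. It is a toy model under simplifying assumptions: the input is written as the newly synthesized strand, every possible termination product is present, and the function names `sanger_fragments` and `read_gel` are made up for illustration.

```python
def sanger_fragments(new_strand: str) -> dict[str, list[int]]:
    """For each ddNTP tube, list the lengths of fragments that terminate there.

    Assumption: `new_strand` is the strand being synthesized, 5' to 3',
    and termination happens at every position at least once.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)  # a fragment of this length ends in `base`
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bands from shortest (bottom of the gel) to longest (top)."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


lanes = sanger_fragments("GATTACA")
print(lanes["A"])       # fragment lengths found in the ddATP lane
print(read_gel(lanes))  # sequence recovered from the band order
```

Sorting the bands by length stands in for electrophoresis: the shortest fragment runs farthest, so reading upward from the bottom of the gel recovers the sequence one base at a time.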

Okay, that makes sense.

But you mentioned the drawback, it sounds slow.

Painfully slow and expensive.

That first human genome draft using Sanger methods took about a decade, cost around 300 million dollars.

That kind of cost and time just wasn't feasible for sequencing lots of microbes or, you know, comparing many genomes.

Right, so something had to change.

And it did, massively.

That's the leap to next generation sequencing or NGS.

The core idea is massively parallel sequencing.

Instead of one sequencing reaction in a tube, you do millions or even billions simultaneously.

Millions?

How do you even manage that?

Well, it's a totally different setup.

You take the DNA, fragment it into short pieces.

These short pieces are then attached to a solid surface, like a glass slide called a flow cell, usually via adapters ligated onto the ends.

Then on the flow cell, these fragments are amplified in place.

There's a technique called bridge amplification that creates little clusters, dense spots, each containing thousands of identical copies of the original fragment.

Why the clusters?

Why not just sequence the single molecule?

Signal strength, basically.

The signal from adding just one fluorescent base to one molecule would be too weak to detect reliably.

The cluster amplifies the signal.

Okay, so you have clusters of identical DNA.

How do you read the sequence then?

It's not Sanger's ddNTPs anymore.

No, it's different.

A common method is reversible chain termination sequencing, sometimes called sequencing by synthesis.

Here, the special nucleotides have two modifications: a fluorescent tag so the machine can see which base was added, and a temporary blocking group on that 3' end.

So the polymerase adds one base.

The machine takes a picture to see the color identifying the base.

Then crucially, enzymes come in and cleave off both the fluorescent tag and the blocking group.

Ah, so it's reversible, unlike the permanent block in Sanger.

Exactly.

Once the block is removed, the polymerase can add the next base in the sequence and the cycle repeats.

Picture, cleave, next base, picture, cleave, over and over.

Okay, I see the trade -off though.

You said the fragments are short.

Yeah, that's the catch.

Read lengths are much shorter with most NGS platforms, maybe 150 to 300 bases.

This is because the signal degrades over many cycles. But the sheer volume is staggering.

An old Sanger machine might get a million bases a day. A modern NGS machine?

It can generate like 120 gigabases, billions of bases, in just a couple of days.

That's an astronomical difference.

And that volume must make sequencing whole genomes much easier.

Absolutely.

It powers whole genome shotgun sequencing, WGSS, much more efficiently.

The basic idea of WGSS hasn't changed much since Venter and Smith used it in '95 for the first bacterial genomes, H. influenzae and M. genitalium.

You break the genome into random fragments, sequence them all, now using NGS, and then use computers to find the overlaps and piece them back together.

Like assembling a massive jigsaw puzzle.

A very complex one, yes.

You align the short reads into longer, continuous stretches called contigs.

Then you figure out the order and orientation of these contigs, linking them into scaffolds, often using information from paired-end reads that span gaps.

Then there's editing and proofreading.
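
As a rough illustration of the overlap-and-merge step, the sketch below greedily joins reads that share a suffix-prefix overlap. It is a teaching toy, not how production assemblers work: it ignores sequencing errors, repeats, and reverse-complement reads, and the names `overlap`, `greedy_assemble`, and the `min_len` cutoff are assumptions introduced here.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap into a contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Whatever cannot be merged is returned as separate contigs, which mirrors the situation where paired-end information is then needed to order them into scaffolds.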

You mentioned NGS made this more efficient.

How?

The biggest hurdle in the old WGSS was creating the genomic library.

That involved cloning all those DNA fragments into bacteria, which was slow and labor intensive.

NGS pretty much eliminates that whole library construction step.

You just fragment the DNA, add adapters, and put it straight onto the sequencer. Much faster.

And does that massive amount of data help with accuracy too?

Immensely.

We talk about depth of coverage.

That's how many times, on average, each single nucleotide in the genome has been sequenced.

With Sanger, you might sequence a region once or twice.

With NGS, you aim for maybe 30-fold, 50-fold, even 100-fold coverage or more.

This high depth means you can statistically distinguish real variations from random errors made by the DNA polymerase during sequencing.

It gives you very high confidence in the final sequence.
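
The arithmetic behind depth of coverage is simple enough to sketch. The snippet below assumes the usual back-of-the-envelope estimate (number of reads times read length, divided by genome size) and uses a plain majority vote to show why depth helps with errors. The genome size is an assumed round figure for a typical bacterium, and the function names are illustrative.

```python
import math
from collections import Counter


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def consensus_base(pileup: str) -> str:
    """Majority vote over all reads covering one position."""
    return Counter(pileup).most_common(1)[0][0]


genome_size = 4_600_000  # assumed: roughly the size of a typical bacterial genome
read_length = 150

print(mean_coverage(1_000_000, read_length, genome_size))
print(reads_needed(30, read_length, genome_size))
print(consensus_base("AAAAAAGAAAAA"))  # one stray G among many A reads
```

With a dozen reads stacked over one position, a single miscalled base is simply outvoted, which is the intuition behind aiming for 30-fold coverage or more.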

Okay, so NGS is great for assembling genomes, if you can get enough DNA from a pure culture.

But didn't you say earlier that most microbes, maybe 98%, just won't grow in the lab?

How do we sequence those?

Ah, that's where single cell genomics comes in.

It's about tackling the microbial dark matter.

All those bacteria and archaea we know exist from environmental samples but can't cultivate.

The challenge is you're starting with maybe just a few femtograms of DNA from one single cell.

That's an incredibly tiny amount.

Way too little for standard sequencing prep, I assume.

So how do you amplify it? Not PCR?

Standard PCR isn't ideal for amplifying a whole genome from such a tiny starting amount.

It tends to be biased.

So they use a technique called multiple displacement amplification, or MDA.

MDA relies on a special DNA polymerase from a bacteriophage, phi29 polymerase.

What's so special about it?

Two things mainly.

First, it's highly processive, meaning it stays attached to the DNA template for a very long time, copying long stretches.

Second, it has strand displacement activity.

As it synthesizes a new strand, it peels away any DNA strand it encounters in its path.

You use random short primers, and this phi29 enzyme just goes nuts, copying and displacing, creating a tangled mess of amplified DNA representing the entire genome.

And it does this with high fidelity, fewer errors.

So MDA lets you get enough DNA from one uncultured cell to then sequence using NGS.

Exactly.

It's been revolutionary for exploring microbial diversity, discovering entirely new phyla of bacteria and archaea that we literally couldn't study before.

Okay, from sequencing one cell, the next logical step seems to be sequencing everything in an environment.

You're right.

That takes us to metagenomics.

Metagenomics is about studying the collective genomes of all the microbes in a particular environment, soil, seawater, your gut, whatever, by extracting DNA directly from that sample.

Skipping the whole culturing step entirely.

Completely.

Early metagenomics often focused on sequencing just one specific gene, like the 16S ribosomal RNA gene, which is great for identifying who is there, the taxonomy, but shotgun metagenomics using NGS is much more powerful.

How so?

Think about it like a census.

Culturing is like randomly calling two phone numbers in a city: you miss almost everyone.

Cloning specific genes is maybe like mailing out a survey: better, but still biased.

Shotgun metagenomics with NGS, it's like sending out a whole team of census takers to knock on every door.

You sequence all the DNA fragments you can get.

You get huge breadth of coverage, sampling lots of different organisms and depth of coverage, sequencing their genes multiple times.

So it gives you not just a list of species, but also a catalog of genes.

It tells you about the potential of that microbial community, what metabolic pathways are present, what enzymes might they be making.

It provides clues about what the microbes might actually be doing in that environment.

And it's constantly revealing new enzymes, new pathways, even new phyla.

Okay.

We've generated mountains of sequence data, either from single cells, whole genomes, or entire environments, but it's just letters A, T, C, G.

How do we make sense of it?

That's the domain of bioinformatics.

It's this crucial intersection of biology, computer science, math, statistics.

Its job is to turn that raw sequence data into meaningful biological information, a process called annotation.

And a key part of annotation is finding the genes, right?

How does a computer spot a gene in a string of letters?

It primarily looks for open reading frames, or ORFs.

An ORF is basically a stretch of DNA sequence that potentially codes for a protein.

The computer scans the DNA, typically looking for sequences that start with a start codon, like ATG, continue for a reasonable length without hitting a stop codon, say, at least 100 codons, which is 300 base pairs, and are preceded by a ribosome binding site.

And it has to check both DNA strands and different reading frames.

Yes, exactly.

DNA is double -stranded, and each strand can be read in three different frames, depending on where you start grouping the letters into threes, codons.

So the software has to check all six possibilities, three frames on each strand, to find all potential ORFs.

You can see a visual of this in figure 18.9 in the source material. It shows how shifting the reading frame reveals different potential protein sequences.
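
The six-frame scan can be sketched in a few lines of Python. This is a simplified illustration, not a real gene finder: it only looks for ATG start codons and the three standard stop codons, it does not check for a ribosome binding site, and `find_orfs` and its `min_codons` parameter are names chosen here for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> list[tuple[str, int, str]]:
    """Return (strand, frame, orf_sequence) for ATG...stop ORFs in all six frames."""
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:  # codons before the stop
                        orfs.append((strand, frame, dna[start:i + 3]))
                    start = None
    return orfs


# Tiny demo sequence, so the length cutoff is lowered from the 100 codons mentioned above.
demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(demo, min_codons=5))
```

Each hit would then be a candidate for the comparison step, where the predicted protein is searched against databases of known sequences.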

Once it finds a potential ORF, how does it guess what the gene does?

By comparison, it uses tools like BLAST, Basic Local Alignment Search Tool, to compare the predicted protein sequence of the ORF against massive databases of known gene sequences from other organisms.

If your ORF sequence strongly matches a gene with a known function, like, say, an enzyme involved in sugar metabolism, you can infer that your gene probably does something similar.

And this leads to terms like orthologs and paralogs.

Right.

If you find highly similar genes in different organisms, they're likely orthologs, meaning they probably evolved from a common ancestral gene and retain a similar function.

If you find multiple similar genes within the same genome, they're likely paralogs.

These usually arise from gene duplication events, and often one copy might evolve a new related function over time.

Okay, so annotation gives us a parts list.

But how do we figure out how those parts actually work together in the living cell?

That moves us into functional genomics, trying to link the genotype, the genes, to the phenotype, what the cell actually does.

A classic example is Treponema pallidum, the bacterium that causes syphilis, another one that's extremely difficult to grow in the lab.

So its whole lifestyle had to be figured out from the genome sequence?

Pretty much.

The annotation revealed it was missing genes for really fundamental metabolic pathways.

No TCA cycle, no oxidative phosphorylation.

It literally can't make its own amino acids or fatty acids.

This immediately told scientists it must be heavily reliant on its host for nutrients.

And consistent with that, about 5% of its entire genome consists of genes coding for transport proteins, channels, and pumps to scavenge molecules from its surroundings.

Figure 18.11 illustrates this metabolic dependency.

That's powerful, predicting physiology just from sequence.

Can we also see which genes are active at any given time?

Yes, that's transcriptomics, the study of the transcriptome, which is all the RNA molecules in a cell, particularly messenger RNA, mRNA, since mRNA levels generally correlate with gene activity.

We used to use techniques like microarrays, but they had limitations: you could only detect genes you already knew about and put on the chip, and comparing results between experiments could be tricky.

So what's the standard now?

RNA-seq.

It's become the method of choice.

You isolate all the mRNA from your cells, convert it into more stable complementary DNA, cDNA, using an enzyme called reverse transcriptase, and then sequence that cDNA using NGS.

The beauty is you sequence everything.

The number of sequence reads that match a particular gene is a direct measure of how expressed that gene was.

It's quantitative and doesn't rely on knowing the genes beforehand.
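
To show how read counts become expression estimates, here is a small sketch. Raw counts are usually normalized for gene length and sequencing depth before comparison; transcripts per million (TPM) is one common choice and is an assumption here, not something the overview specifies. The gene names and all the counts below are invented for illustration.

```python
import math


def tpm(read_counts: dict[str, int], gene_lengths: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: length-normalized counts scaled to sum to 1e6."""
    rates = {gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts}
    total = sum(rates.values())
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


def log2_fold_change(before: float, after: float, pseudocount: float = 1.0) -> float:
    """Log2 ratio of expression after vs. before; the pseudocount avoids division by zero."""
    return math.log2((after + pseudocount) / (before + pseudocount))


# Hypothetical numbers: a long and a short gene with the same number of mapped reads.
counts = {"geneA": 300, "geneB": 300}
lengths = {"geneA": 3000, "geneB": 1000}
print(tpm(counts, lengths))          # the shorter gene is more highly expressed per base
print(log2_fold_change(99, 399))     # a fourfold increase is a log2 change of 2
```

The length correction matters because a long transcript yields more fragments, and therefore more reads, than a short one expressed at the same level.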

Is there an example where RNA-seq revealed something unexpected?

A great one is Deinococcus radiodurans.

This bacterium is incredibly resistant to radiation, can survive doses that would kill almost anything else.

People initially thought its genome must be packed with extra DNA repair genes, but sequencing showed, nope, it has fewer repair genes than E. coli.

So what's its secret?

RNA-seq provided the answer.

When you expose D. radiodurans to gamma radiation and measure its transcriptome, you see a massive, rapid upregulation, a huge increase in expression of the few key DNA repair genes it does have, like recA.

There are visualizations like figure 18.13, using hierarchical cluster analysis, that show this dramatically.

Genes that are strongly upregulated often appear bright red, while downregulated ones are green.

You see these specific repair genes just light up after radiation exposure.

It's about response, not just inventory.

Amazing.

Okay, so we've looked at DNA and RNA.

What about the actual workhorses, the proteins?

That brings us to proteomics, the study of the proteome, the entire collection of proteins being produced by a cell under specific conditions.

A common technique to visualize the proteome is two-dimensional, or 2D, gel electrophoresis.

Two dimensions.

How does that work?

It separates proteins based on two different properties.

First, you load the protein mixture onto a strip gel for isoelectric focusing.

Proteins migrate along the strip until they reach the point where their net charge is zero, their isoelectric point, so separation is based on charge.

Then you take that strip gel and lay it across the top of a standard slab gel, usually an SDS-PAGE gel.

Now you apply an electric current, and the proteins migrate out of the strip and down into the slab gel, separating based on their size.

Small proteins move faster.

You can see this in figure 18.14.

So you end up with a gel showing spots, where each spot is hopefully a single protein, separated by both charge and size.

Ideally, yes.

You can potentially resolve thousands of protein spots on a single 2D gel.

Then the challenge is figuring out what each spot is.

How do you identify the protein in a spot?

You cut the spot out of the gel, digest the protein into smaller peptide fragments using an enzyme like trypsin, and then analyze those fragments using mass spectrometry, MS.

The mass spec measures the precise mass-to-charge ratio of the peptides.

Often they use tandem MS, MS/MS, where they select a peptide ion, fragment it further inside the machine, and measure the masses of those smaller fragments.

This generates a unique peptide fingerprint, or even partial amino acid sequence information.

You can then search this data against genome databases to identify the protein and the gene that encodes it.
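
The database search can be pictured as an in silico version of the same experiment: digest every predicted protein with trypsin on the computer, compute peptide masses, and see which protein best explains the observed masses. The sketch below assumes the common rule that trypsin cuts after lysine (K) or arginine (R) unless proline (P) follows, uses standard monoisotopic residue masses, and ignores charge states and modifications. The protein names and sequences are made up.

```python
import re

# Monoisotopic residue masses in daltons; a peptide also carries one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def trypsin_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_fingerprint(observed: list[float], proteins: dict[str, str],
                      tolerance: float = 0.01) -> dict[str, int]:
    """Count how many observed masses each candidate protein can explain."""
    scores = {}
    for name, sequence in proteins.items():
        predicted = [peptide_mass(p) for p in trypsin_digest(sequence)]
        scores[name] = sum(
            any(abs(mass - p) <= tolerance for p in predicted) for mass in observed
        )
    return scores


# Hypothetical two-protein "database" and a spot whose peptides come from protA.
database = {"protA": "GASKPLRTWK", "protB": "MMMKAAAR"}
observed = [peptide_mass(p) for p in trypsin_digest(database["protA"])]
print(trypsin_digest(database["protA"]))
print(match_fingerprint(observed, database))
```

Real search engines score matches statistically and use the MS/MS fragment masses as well, but the principle of comparing observed masses to masses predicted from the genome is the same.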

We're building quite a picture.

Genome, transcriptome, proteome.

Is there a way to see how proteins interact with DNA, like transcription factors binding to regulatory regions?

Yes, there are techniques for that.

An older method is EMSA, the electrophoretic mobility shift assay, but that's done in vitro with purified components.

A more powerful genome-wide in vivo approach is ChIP-seq, which stands for chromatin immunoprecipitation followed by sequencing.

ChIP-seq.

How does that work?

You treat living cells with a chemical, often formaldehyde, that cross-links proteins to the DNA they are currently bound to.

Then you break open the cells and shear the DNA into fragments.

You use an antibody that specifically recognizes your protein of interest, say a particular transcription factor.

This antibody pulls down the protein, along with any DNA fragments it's cross-linked to.

You reverse the cross-links, purify the DNA, and then sequence it using NGS.

The resulting sequence reads tell you all the locations in the entire genome where that specific protein was bound in the living cell.

It's incredibly powerful for mapping regulatory networks.

Okay, so we have all these omics: genomics, transcriptomics, proteomics, even mapping DNA-protein interactions.

How do we put it all together?

That's the goal of systems biology.

It represents a shift away from traditional reductionist biology, where you study one gene or one pathway in isolation.

Systems biology aims to integrate all these different data types, the parts list from the genome, the activity levels from the transcriptome and proteome, information about interactions, metabolites, and use computational modeling to understand how all these components work together as a whole dynamic system, trying to predict emergent properties of the cell.

And this holistic view must be useful for engineering biology too.

Absolutely.

Systems biology directly informs synthetic biology.

By understanding the existing cellular networks, engineers can better design and build new biological parts, devices, and systems.

For example, understanding yeast's natural metabolic pathways was crucial for engineers to rationally redesign them to produce complex molecules like artemisinin, the antimalarial drug.

They essentially rerouted the cell's metabolism.

It all connects.

Now, looking across all the genomes we've sequenced, what big patterns emerge?

That's comparative genomics, right?

Right.

Comparative genomics looks for similarities and differences between genomes to understand evolution and function.

Some broad generalizations have emerged.

For instance, the smallest genomes tend to belong to obligate parasites or endosymbionts, like Candidatus Carsonella ruddii, an insect symbiont with a genome of only about 160,000 base pairs.

It's lost most genes because it relies completely on its host.

Also, bacterial and archaeal genomes generally have much higher gene density than eukaryotes, meaning more of the DNA actually codes for proteins, with fewer introns and non-coding regions.

Does comparing genomes also tell us about how microbes swap genes?

Definitely.

Genome analysis has revealed that horizontal gene transfer, HGT, the transfer of genes between unrelated organisms, is incredibly common in microbes.

Often this happens via mobile genetic elements like bacteriophages or plasmids.

You can sometimes spot regions of DNA that were clearly acquired through HGT because they have a different G plus C nucleotide content compared to the rest of the genome.

These chunks are often called genomic islands.

If these islands carry virulence genes, they're called pathogenicity islands.
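
Spotting a region with unusual G+C content is easy to sketch with a sliding window. The code below flags windows whose G+C fraction differs from the genome-wide value by more than a cutoff. The window size, step, and 0.10 threshold are arbitrary illustrative choices, and real island-detection tools use additional evidence such as codon usage and flanking mobile elements.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(genome: str, window: int, step: int) -> list[tuple[int, float]]:
    """G+C fraction of each window, as (start_position, gc_fraction)."""
    return [
        (start, gc_content(genome[start:start + window]))
        for start in range(0, len(genome) - window + 1, step)
    ]


def flag_islands(genome: str, window: int, step: int, threshold: float = 0.10) -> list[int]:
    """Start positions of windows whose G+C differs from the genome average by > threshold."""
    overall = gc_content(genome)
    return [
        start for start, gc in gc_windows(genome, window, step)
        if abs(gc - overall) > threshold
    ]


# Toy genome: AT-rich background with a short GC-rich insert in the middle.
toy_genome = "AT" * 50 + "GC" * 10 + "AT" * 50
print(flag_islands(toy_genome, window=20, step=20))
```

In the toy genome only the inserted GC-rich block stands out from the background, which is the same signal that points to horizontally acquired DNA in a real genome.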

Can we also see evolutionary relationships by comparing gene order?

Yes, that's the concept of synteny.

Synteny refers to the conserved order of orthologous genes along chromosomes in different species.

If you compare closely related bacteria like Sinorhizobium meliloti and Agrobacterium tumefaciens, shown in figure 18.19, you see large blocks where the gene order is almost identical: high synteny.

But compare more distantly related organisms like E. coli and S. meliloti, and the synteny breaks down almost completely.

The genes have been shuffled around extensively over evolutionary time.

Is there a really striking example where comparative genomics revealed an organism's lifestyle?

I think Mycobacterium leprae, the bacterium causing leprosy, is a powerful case study.

Comparing its genome to its close relative, Mycobacterium tuberculosis, which causes TB, showed that the M. leprae genome is about a third smaller.

It shed genes because it became an obligate parasite.

Exactly.

But it's not just gene loss.

The truly remarkable finding was that the M. leprae genome is littered with pseudogenes.

These are genes that have accumulated mutations that make them non-functional.

They're like molecular fossils.

M. leprae has over 1,000 of these degraded genes.

It's a clear evolutionary signature of massive gene decay, a consequence of its adaptation to a strictly parasitic lifestyle where many functions handled by the host are no longer needed by the bacterium.

It's shedding the baggage.

Wow.

That really paints a picture of evolution in action, written in the genome.

So we've covered a massive amount of ground today.

We went from painstakingly reading the code with Sanger to the incredible speedup of NGS, which enabled things like building synthetic life like syn3.0, reading entire environments with metagenomics, and then digging into function with transcriptomics and proteomics.

Yeah.

And the unifying theme, the real engine behind almost all of these advances, has been next-generation sequencing.

Its speed, its cost-effectiveness, and its ability to handle tiny amounts of DNA.

That's what unlocked our ability to study the vast majority of microbial life, that 98% dark matter that we couldn't culture.

It transformed everything.

Okay.

So to wrap up, here's the thought we want to leave you, the listener, with.

We just talked about M. leprae shedding genes, creating pseudogenes.

As it adapted to parasitism, it got rid of useless baggage.

Now think about syn3.0, that minimal synthetic cell with only 473 essential genes.

If you were to release syn3.0 into a complex, competitive, natural environment, what evolutionary pressures would act on it?

Would it start gaining genes to cope?

Or would it find a niche where it could shed even more?

That's a really interesting question, isn't it?

It touches on the future of synthetic biology and evolution itself.

Well, that's all the time we have for this deep dive.

Thanks so much for joining us.

We'll catch you on the next deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Microbial genomics encompasses the comprehensive study of genetic material within microorganisms, beginning with foundational DNA sequencing methodologies and progressing through modern computational approaches to interpret the resulting data. The field emerged from classical Sanger chain termination sequencing and has been transformed by Next Generation Sequencing technologies that employ reversible chain termination and sequencing by synthesis, eliminating the need for cloning and substantially improving coverage depth and accuracy. Whole-genome shotgun sequencing remains central to the discipline, relying on genomic libraries and computational assembly of overlapping sequences into contigs and scaffolds that reconstruct complete genomes. For organisms that resist laboratory cultivation, Multiple Displacement Amplification enables single-cell genomics, permitting researchers to investigate members of the microbial dark matter previously invisible to traditional cultivation methods. When genetic material is extracted directly from environmental samples, the resulting metagenomics approach reveals the collective genetic potential of entire microbial communities and their metabolic capabilities. The enormous datasets produced by these sequencing efforts require bioinformatics infrastructure, employing algorithms for genome annotation that identify functional genes, locate Open Reading Frames, and classify genes as orthologues across species or paralogues within the same genome. Functional genomics extends beyond sequence cataloging to measure how genomes actually operate, utilizing transcriptomics and RNA-seq to quantify expression levels, while proteomics employs two-dimensional gel electrophoresis and mass spectrometry alongside ChIP-seq to characterize proteins and DNA-protein interactions.
Systems biology integrates these diverse datasets to construct predictive models treating cells as unified systems, a framework increasingly applied in synthetic biology where genetic networks are deliberately engineered for novel functions. Comparative genomics reveals how genome structure varies across different microbial taxa, demonstrating that intracellular parasites frequently exhibit genome reduction through extensive gene loss, while also highlighting horizontal gene transfer as a primary driver of genetic innovation and the appearance of specialized structures such as genomic islands and pathogenicity islands that contribute to microbial diversity and virulence.
