Chapter 23: Molecular Evolution and Phylogenetics

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

Today we aren't studying history written on parchment or carved into stone monuments.

We are taking a deep dive into the history of life itself, written in the most granular, persistent code we know,

the molecules of DNA and protein.

That's exactly right.

Our focus is molecular evolution.

And the mission today is, well, it's ambitious.

We're trying to understand how the building blocks of life, these molecules, change over vast geological time scales.

Millions of years in some cases.

Millions, even billions.

And crucially, we're learning how to read that accumulated molecular change to map the

genealogical relationships between all organisms.

We're going from, you know, single chemical substitutions all the way up to the structure of the entire tree of life.

And before we get into the nuts and bolts, we really have to make a clear distinction here, because this field often gets confused with its fast-paced cousin, population genetics.

Yes, that's a key point.

We've talked about population genetics before.

That's all about changes in gene frequencies happening very quickly, literally from one generation to the next.

It's the short term view.

The here and now of selection.

Molecular evolution, by contrast, operates on the scale of speciation, of extinction.

We're talking about hundreds, thousands, or even millions of generations.

And over those long, long timeframes, the rules of the game change.

How so?

Well, effects that are basically negligible in the short term.

Things like random sampling error, or differences in fitness so tiny you'd never notice them.

They become these massive cumulative forces that drive genomic change.

Molecular evolution is really the deep time perspective.

It's an intellectually stunning field, isn't it?

It requires a truly multidisciplinary approach.

I mean, to decipher this molecular history book, you need traditional genetics, ecology, evolutionary theory.

And then you layer on top of that some heavy duty statistics and computer science just to handle the sheer volume of data we're dealing with now.

And this field really came into its own, what, after the 1970s and 80s?

Oh, absolutely.

Once molecular tools like cloning, rapid sequencing, and hybridization became routine, these techniques were revolutionary because they, in effect, removed the species barrier for researchers.

Right.

Before that, you were stuck comparing things like bone structure or feather patterns.

Exactly.

And suddenly, we could treat any organism's genome, whether it was a bacterium or a blue whale, as a concrete, measurable historical record.

A record that's constantly being edited.

Yeah.

And not just through simple point mutations, but through these big structural changes, too.

Precisely.

We track the standard point mutations, of course, but also transposition, where segments of DNA literally jump around in the genome.

And large scale gene duplication.

Which creates redundant genetic material, the raw stuff of innovation.

And then there's gene conversion, which is a specific nonreciprocal change during meiosis, where one allele kind of forces the sequence of its partner allele to match its own.

So it's all of these processes, accumulating over eons and getting filtered by selection, that we're trying to quantify.

That's the goal.

So as we delve into the core data today, we are addressing some really fundamental evolutionary mysteries.

Like, what exactly is the role of natural selection at the molecular level?

Why do some genes evolve so rapidly they seem to be racing, while others are basically frozen in time?

And maybe the most fascinating question of all, how do proteins actually acquire totally new biological functions?

Where does novelty come from?

It's time to start the analysis.

Let's see how the data is generated.

Let's jump right in.

The foundational analysis really starts with substitutions in what we call homologous proteins and DNA.

And homology is the key concept here.

It just means that the proteins or DNA sequences we're comparing in different species share a common ancestor.

They're related by descent.

Right.

And early studies, I mean, even before we had widespread DNA sequencing, they focused on protein sequences.

And they immediately noticed a pattern in amino acid replacement that told us a huge amount about how this molecular filter works.

What was that pattern?

They found a strong replacement bias.

When an amino acid was substituted, it was overwhelmingly replaced by one with very similar chemical characteristics.

So think leucine, which is hydrophobic, being replaced by isoleucine, which is also hydrophobic.

A change that's chemically subtle.

And it often only requires a single base pair change in the DNA codon.

That's the other piece of it.

And this observation, it reinforces two absolutely critical principles that really define molecular evolution.

OK.

What's the first one?

First, the raw material, the mutation itself, is a rare event.

But second, and this is more important, natural selection is an incredibly harsh and efficient filter.

Meaning most dramatic changes just get thrown out immediately.

Instantly.

If a change resulted in swapping a large hydrophobic amino acid for a small charged one, a change that would probably require two or three base substitutions and would totally alter the protein's folding,

that change would be detrimental.

Selection eliminates it.

The cost of failure is just too high.

And to even begin making these comparisons, you have to align the sequences.

You can't compare two genomic texts unless you have some hypothesis about which words correspond to which.

Sequence alignment is absolutely essential.

It lets us hypothesize the evolutionary pathway.

When we align two homologous sequences, we can identify the conserved sites where there's been no change, the substitution sites where a single change has occurred.

And then the sites where you have to propose an insertion or a deletion, what's called an indel.

Right.

An indel must have occurred in one lineage but not the other to make the alignment work.

And I imagine this gets computationally tricky, especially when sequences have diverged a lot over time.

It does.

We use pretty complex computer algorithms to create what we call optimal alignments.

These tools are designed to maximize the similarity, so the matched sites, while minimizing the number of gaps or indels that you have to insert.

So it's a statistical best guess.

It's a statistical best guess at reconstructing that true ancestral sequence.

And it's the required first step before we can even start to calculate the amount of divergence.
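To make the idea of an optimal alignment concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The discussion does not name a specific algorithm, so this choice is an assumption, and the scoring values (match +1, mismatch -1, gap -1) and the function name are purely illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score).

    Scoring values are illustrative assumptions, not values from the chapter.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two residues
                score[i - 1][j] + gap,      # gap in b (indel)
                score[i][j - 1] + gap,      # gap in a (indel)
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

Matched columns are conserved sites, mismatched columns are substitution sites, and any column containing "-" is a proposed indel. Real alignment tools use more elaborate scoring, so treat this only as an illustration of the trade-off between matches and gaps.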

OK.

So now we run into the first major analytical challenge of molecular evolution,

the problem of counting.

If we just count the observed differences between two really diverged sequences, we are almost guaranteed to be wrong.

And not just wrong, but you'll severely underestimate the total number of changes that truly occurred.

This is the multiple hits problem, right?

Exactly.

Just imagine a single nucleotide site.

Let's say it starts as an A.

Over millions of years, it changes to a G, then to a T, and then by chance, back to an A again.

OK.

So if we only compare the start and end sequences, they look identical.

We'd observe zero differences.

Even though three changes actually happened. Or maybe a starting C changes to an A, and then later that A changes to a T.

We'd count one difference, but two changes actually happened since the common ancestor.

So just relying on the raw observed fraction of differences, what we call P, is just not good enough.

Yeah.

We need a way to mathematically correct for this hidden history, for the changes we can't see.

And that's where the Jukes-Cantor model, proposed by Thomas Jukes and Charles Cantor way back in 1969, provided this breakthrough conceptual framework.

What was their big idea?

Well, they made a simplifying assumption, which is that each nucleotide was equally likely to change into any other of the three possible nucleotides at some rate.

Let's just call it alpha.

So the total rate of substitution for any nucleotide is three alpha.

And the power of their model is that it lets you calculate K, the true number of substitutions per site, based on P, the observed fraction of differences.

It's like a statistical correction factor.

It's exactly that.

A correction factor that accounts for those hidden multiple hits, and also for reversions or back mutations.

And the power of this correction, it must grow as the sequences get more and more different.

Exponentially.

If the observed difference P is tiny, say, 2%, then K, the corrected number, is also very close to 2%.

A simple count is fine for closely related species.

But if they're really far apart...

Imagine we compare two sequences where half the sites are different, so P is 0.50.

The Jukes-Cantor calculation dramatically increases K to 0.82.

That correction is accounting for the very high probability that those sites have changed multiple times over history.

So that calculation of K becomes indispensable for looking at highly diverged organisms.
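Those two numbers can be checked with a short sketch. It assumes the standard form of the Jukes-Cantor correction, K = -(3/4) ln(1 - (4/3) P), which the discussion describes but does not spell out; the function name is illustrative.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site (K) from the observed fraction of differing sites (P)."""
    if not 0 <= p < 0.75:
        # At P = 0.75 the sequences look random and the correction is undefined.
        raise ValueError("P must satisfy 0 <= P < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


for p in (0.02, 0.50):
    print(f"P = {p:.2f} -> K = {jukes_cantor_distance(p):.4f}")
```

For small P the correction barely matters, while at P = 0.50 it lifts the estimate to roughly 0.82, matching the figures quoted above.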

But we should probably acknowledge that the Jukes-Cantor model, while it's foundational, is a bit of an oversimplification.

Oh, it is.

We know now from later data that transitions, that's a purine changing to another purine, like A to G, or a pyrimidine changing to another pyrimidine,

accumulate faster than transversions, which is a purine changing to a pyrimidine or vice versa.

And why is that?

Transitions are just chemically simpler, and they're often easier for the DNA repair mechanisms in the cell to miss.

So now we have more sophisticated models that account for these rate differences, but they all build on that same Jukes-Cantor logic of correcting what you see.

So once we have K, the actual number of substitutions, we can bring in the element of time and calculate the substitution rate?

Exactly.

The substitution rate, R, is derived by dividing K, the substitutions per site, by 2t, where t is the divergence time that we get from the fossil record.

And why 2t?

Because the substitutions are accumulating independently and simultaneously in both lineages ever since they split from their common ancestor.

And this calculation absolutely requires calibration.

You have to have data from at least two species whose divergence time is already known from fossils.

And this is where the implications for evolutionary dating become so profound.

If you can show that substitution rates are relatively constant for a certain gene across many species, then you can use that rate, R, to date evolutionary events for which there is no fossil evidence at all.
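As a rough sketch of that arithmetic: the rate is R = K / (2t), and once R is calibrated the same relation rearranges to t = K / (2R) for a pair with no fossil date. The function names and the 80-million-year calibration below are hypothetical values chosen only for illustration.

```python
def substitution_rate(k, t_years):
    """R = K / (2t): substitutions per site per year, counting both lineages."""
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t_years)


def divergence_time(k, rate):
    """t = K / (2R): estimated years since two lineages split, given a calibrated rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return k / (2.0 * rate)


# Hypothetical calibration: K = 0.82 for a pair whose split is dated to 80 million years.
r = substitution_rate(0.82, 80e6)
print(f"R = {r:.3e} substitutions per site per year")

# Hypothetical undated pair with K = 0.10, assuming the same gene ticks at the same rate.
print(f"t = {divergence_time(0.10, r):.3e} years")
```

The second call is only as trustworthy as the assumption that the rate is constant across the lineages being compared.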

Which introduces the idea of the molecular clock.

Which we'll come back to.

But now that we know how to measure the changes, let's ask why these changes don't happen uniformly.

And that brings us into the incredibly powerful filter of natural selection, which causes these evolutionary rates to vary wildly, not just between genes, but even within a single gene.

Right.

And this is where the data tells the most compelling story.

When you look at the raw statistics across mammalian genes,

different regions evolve at wildly different paces.

And it seems to be dictated entirely by their functional constraints.

It's like having a genome where some sequences are traveling at the speed limit and others are just frozen stiff.

So how do we see this?

We can demonstrate this using the distinction between synonymous and non-synonymous substitutions.

This gives us the clearest possible evidence of that selective filter.

Okay, so a synonymous substitution is a change in the DNA that does not alter the amino acid sequence.

Correct.

And these show the fastest rate of change within a functional gene.

And why is that rate considered the speed limit?

Because they're typically tolerated by selection.

The protein's function is unchanged.

They often happen at that third wobble position of a codon.

And the observed rate in mammals is high, around 4.65 times 10 to the minus 9 substitutions per site per year.

Okay, now contrast that with non-synonymous substitutions.

Those are changes that do alter the amino acid sequence.

And these are often detrimental.

They get quickly eliminated by selection, which results in the slowest rate, only about 0.88 times 10 to the minus 9.

Wow, that's roughly a five-fold difference.

And that really illustrates the core concept here, which is the distinction between a mutation and a substitution.

This is the critical point.

A mutation is just an error.

It's a replication or repair mistake.

And they're probably generated pretty randomly, so synonymous and non-synonymous changes probably happen at the same raw frequency.

But a substitution is different.

A substitution is a mutation that has successfully navigated the filter of natural selection and actually became fixed in the population.

Since most non-synonymous changes harm fitness, selection rapidly purges them.

That's why the substitution rate is so low.

So the synonymous rate, because it's mostly neutral to fitness, much more closely reflects the underlying true mutation rate of the organism.

So if synonymous sites reflect the speed limit for a functional gene,

where in the genome would we expect to see the absolute fastest rate of evolution, totally free of any constraint?

For that, we look at pseudogenes.

These are molecular fossils.

They're inactivated, non-functional versions of genes that no longer code for a protein.

So they're completely invisible to that selective filter.

Completely.

And their observed rate is the highest we see, about 4.85 times 10 to the minus 9.

Since selection doesn't act to remove any changes, the rate of substitution in pseudogenes is our best proxy for the organism's raw, unfiltered mutation rate.

And even when we look at other non-coding regions, we still find this whole spectrum of selective constraint.

It's not just a simple on-off switch of functional versus non-functional.

Not at all.

Take the flanking regions, the sequences upstream and downstream of a gene.

The three-prime flanking region, the part downstream, often has a very high rate of change, around 4.46 times 10 to the minus 9, because it typically has very little effect on the protein or its regulation.

But introns, which are the sequences that get spliced out of the RNA transcript, they evolve a little bit slower, around 3.70 times 10 to the minus 9.

So why is there a constraint on the junk that gets tossed out?

Because introns aren't totally inert.

They have to maintain certain short critical sequences that are required for proper splicing.

The five-prime and three-prime splice junctions, the branch point: if a substitution hits one of those vital spots, the messenger RNA can't be processed correctly.

And that's a disaster for the protein.

A complete disaster, often fatal.

So there's a subtle selective pressure that keeps that rate just below the truly neutral rate you see in a pseudogene.

And the five-prime flanking region, which houses all the regulatory elements, is maybe the most fascinating example of extreme constraint on non-coding DNA.

It is remarkably constrained.

This region has the promoter, the transcription factor binding sites, things like the TATA box.

A minor change here doesn't alter the protein structure, but it can dramatically change the level of gene expression.

So how much protein gets made?

When it gets made?

Precisely.

And since gene dosage and timing are often critically balanced for fitness,

natural selection severely eliminates changes in these promoter sequences, keeping their evolutionary rate very, very low.

The overall takeaway is just undeniable.

The stronger the functional constraint on any segment of a macromolecule, the slower its rate of evolution.

This inverse relationship is, in itself, a giant discovery tool for modern genomics.

You mentioned that sequencing a genome gives you an address list of nucleotides.

In a complex organism, 95% of that list might seem functionally opaque.

So how do you find the 5% that really matters?

Comparative genomics uses this principle.

It acts as a molecular conservation detector.

Let's take humans and mice.

We've been separated by 80 to 100 million years of independent evolution.

That's a massive amount of time.

So given the fast neutral rate we see in pseudogenes, any sequence that is truly non-functional should have diverged completely between us and mice by now.

It should be totally scrambled.

So if you're scanning the mouse genome and you find a non-coding region that is remarkably conserved, showing high sequence similarity to a human non-coding region...

That conserved sequence is almost certainly functionally important.

It has to be.

Selection has been actively working for 100 million years to prevent that sequence from drifting away in both lineages.

This insight lets researchers target functional elements without ever stepping into a wet lab.

Evolutionary history essentially performs these massive time-consuming experiments for us.

But the relentless power of selection doesn't just stop at preserving gene function.

It can act on these incredibly subtle nuances that seem almost too small to matter.

Which brings us to this profound idea of codon usage bias.

Indeed.

The slight difference we already saw in the evolutionary rate between synonymous sites and pseudogenes, that was a hint that synonymous changes were not completely neutral.

And this is confirmed by the observation that organisms do not use their synonymous codons equally.

Even though six different codons might all code for the amino acid leucine, a bacterium like E. coli might use one specific codon, CUG, 80% of the time.

Why does the cell care what synonym it uses?

The mechanism is all about optimization for efficiency and accuracy.

Different synonymous codons pair with different transfer RNAs or tRNAs, and the preferred codons are the ones that pair with the most abundant tRNAs in the cell.

So if you use the most common tRNA, translation goes faster and has fewer errors.

You're optimizing the molecular factory line down to the millisecond.

And the difference in translational speed and accuracy in a single bacterium must be infinitesimally small.

It is minuscule.

But here is the profound testament to the power of cumulative evolution.

Over thousands and millions of generations, that tiny, tiny advantage in efficiency, a fraction of a percentage point in fitness, is enough to determine which lineage survives and which one dies out.

Selection optimizes the cellular economy down to the pairing of a codon and a tRNA.

And that economy extends beyond just translation to the raw materials themselves.

The amino acids are chosen based on how much energy it costs to make them.

I remember seeing this table.

And the difference is huge.

It's enormous.

Glycine, one of the simplest amino acids, requires about 11.7 ATP equivalents to synthesize.

Tryptophan, which is structurally complex, demands an average of 78.3 ATP equivalents.

So in highly expressed genes, the ones being translated thousands of times a minute, that metabolic cost just gets magnified over and over.

Exactly.

So if a bacterium is trying to be maximally efficient, natural selection will favor using glycine over tryptophan whenever possible, especially in those highly expressed genes.

And we see that in the data.

We do.

Analysis of prokaryotic genomes confirms this.

Highly expressed genes show a strong bias toward using energetically inexpensive amino acids much more frequently than lowly expressed genes.

Selection isn't just pruning bad proteins, it's optimizing the cell's entire budget.

Moving now from these internal gene differences to differences between entire genes, we see the results of functional constraint magnified even further.

Non-synonymous substitution rates can differ by a factor of a thousand between different mammalian genes.

A thousand-fold.

And this difference is driven almost entirely by functional constraint, not by differences in the raw mutation frequency across the genome.

Let's just look at the extremes.

Okay, at the extreme of constraint, we have histone H4.

Right.

Histones are the essential proteins that compact DNA into chromosomes.

They're in all eukaryotes.

Histone H4 is incredibly ancient, and its function is so critical that almost every single one of its amino acids interacts directly with the DNA helix.

So basically no change is tolerated.

Almost none.

Histone H4 is one of the slowest-evolving proteins known.

The version that you find in yeast can functionally replace the human version, despite hundreds of millions of years of divergence.

That's effectively zero tolerance for molecular tinkering.

So what's on the opposite end of that spectrum?

Genes like apolipoproteins.

These carry lipids in the blood.

The functional requirement of their lipid-binding domains is simply that the amino acid residues be hydrophobic.

So it doesn't really matter which specific hydrophobic amino acid it is.

Not really.

Since numerous hydrophobic amino acids — leucine, isoleucine, valine — all perform this function equally well, the protein sequence has huge evolutionary leeway.

It's highly variable and accumulates substitutions rapidly because selection just doesn't care which one is in place.

Now we usually talk about selection as a force that removes change to preserve function.

But there's a famous exception where natural selection actually favors variability, leading to a substitution rate that turns this whole pattern on its head.

The major histocompatibility complex, or MHC, is the perfect example of this.

The MHC is a multi-gene family that's essential for immune recognition.

It presents foreign antigens to your immune system.

So for a population to fight off a whole array of ever-changing pathogens, you'd want high diversity in these MHC genes.

Profoundly advantageous.

So instead of being removed, changes are actively preserved or even promoted by selection.

And you see this in the rates.

You do.

In the MHC genes, the rate of non-synonymous substitution, the changes that alter the amino acid sequence, is actually greater than the synonymous rate.

This is the clearest possible molecular signature of adaptive, diversifying selection, actively driving genetic variability.
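The comparison being made here is often summarized as a ratio of the non-synonymous rate to the synonymous rate. The sketch below assumes that convention (commonly written dN/dS, a label not used in the discussion itself); the function name and the three category labels are illustrative.

```python
def selection_signal(nonsyn_rate, syn_rate):
    """Return (ratio, label) comparing non-synonymous and synonymous substitution rates."""
    if syn_rate <= 0:
        raise ValueError("synonymous rate must be positive")
    ratio = nonsyn_rate / syn_rate
    if ratio < 1:
        label = "purifying"      # amino acid changes are mostly removed
    elif ratio > 1:
        label = "diversifying"   # amino acid changes are favored, as described for MHC
    else:
        label = "neutral"
    return ratio, label


# Average mammalian rates quoted earlier in the discussion.
ratio, label = selection_signal(0.88e-9, 4.65e-9)
print(f"ratio = {ratio:.2f} ({label})")
```

A ratio well below one is the typical pattern for functional genes; a ratio above one is the MHC-style signature described above.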

Before we move on, we have to touch on mitochondrial DNA, mtDNA, because its rate is so unique.

mtDNA is a real outlier.

It's inherited clonally, only from the mother, and it doesn't go through the shuffling of meiosis.

And crucially, its rate of evolution is exceptionally high.

The average synonymous substitution rate in mammalian mtDNA is about 5.7 times 10 to the minus 8 substitutions per site per year.

Which is about 10 times faster than the average for nuclear genes.

Roughly, yes.

And that rapid, yet still relatively regular, accumulation rate makes mtDNA an extremely useful tool for researchers.

Why is that?

Because of its speed and its clonal inheritance, it's the genomic yardstick of choice for comparing very closely related lineages, or for tracing specific matrilineal ancestries.

It's how researchers were able to map the migration patterns of early modern humans across the continents.

Excellent.

So we've established how to measure molecular change,

and we've seen how selection acts as a filter on those changes.

Now let's leverage that understanding to reconstruct the past.

Molecular clocks and phylogenetic trees.

So the initial idea came from Zuckerkandl and Pauling in the 1960s.

They observed that amino acid changes in homologous proteins seemed to accumulate at a relatively constant rate over millions of years.

This was the molecular clock hypothesis.

They saw this consistency and proposed that if the clock ticked steadily, you could use molecular divergence data to calculate divergence times, much like geologists use radioactive decay for dating rocks.

It was an immensely promising hypothesis.

It suggested we might not need fossils to date events.

However, as more molecular data came in, it revealed significant departures from perfect constancy.

The clock, it turns out, is not perfectly regular.

What are some of the key deviations that we see?

We see that the pace varies significantly between different lineages.

For example, the rate of molecular evolution in humans and apes is only about half as fast as the rate you see in old world monkeys.

And even more strikingly, rodents accumulate substitutions at roughly twice the rate of primates since the big mammalian radiation about 80 to 100 million years ago.

So the clock is running slower in us than it is in a mouse.

Why is that?

Why is the primate clock running at half speed?

It can't be simple divergence time if you're comparing species that split at the same point.

The primary factor seems to be generation time.

Substitution rates actually correlate much better with the number of germline DNA replications than with simple chronological time.

Ah, so a mouse has a short generation time.

That means its germline goes through many more cell divisions, and thus more chances for replication errors per century than a long-lived primate.

Exactly.

Other factors like differences in DNA repair efficiency also contribute, but generation time is really key.

But regardless of the clock's imperfections, molecular data still holds a vast advantage over traditional phylogeny that's just based on phenotype alone.

Oh, absolutely.

Traditional phylogenies based on morphology, body shape, organs, and so on, are constantly plagued by convergent evolution.

The wings of a bird, a bat, and an insect look similar because they all evolved for flight, but they are not closely related.

And molecular data, the actual sequence of DNA or protein, is just less prone to those misleading similarities.

Much less prone.

It offers a far more reliable basis for building these trees.

So the end product is the phylogenetic tree, which graphically describes these ancestral relationships.

A tree is composed of branches, which portray ancestry, and nodes.

The terminal nodes are the species we're analyzing today.

The internal nodes represent the common ancestors.

And importantly, the branch lengths are often scaled to reflect the amount of divergence.

And we categorize trees based on whether we can anchor the timeline.

A rooted tree specifies one internal node as the definitive common ancestor of all the other taxa on the tree.

To do this, you have to include an outgroup, a species that you know separated the earliest, to establish that root.

And an unrooted tree.

An unrooted tree just shows the relationships between the nodes, who was related to whom, without specifying the ancestral timeline.

And here we hit that staggering computational challenge.

With even a relatively small data set of, say, 10 species, the number of possible rooted trees you have to test is over 34 million.

And over 2 million unrooted trees.

The number grows exponentially.

It makes the task of finding the single best tree immense.
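Those counts can be reproduced with a few lines. The sketch assumes the standard closed-form counts for strictly bifurcating trees on n labeled taxa, (2n-3)! / (2^(n-2) (n-2)!) rooted and (2n-5)! / (2^(n-3) (n-3)!) unrooted; the function names are illustrative.

```python
from math import factorial


def num_rooted_trees(n):
    """Number of distinct rooted bifurcating trees for n labeled taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees for n labeled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 10, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For 10 taxa this gives 34,459,425 rooted and 2,027,025 unrooted trees, which is where the "over 34 million" and "over 2 million" figures come from.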

And adding to that complexity is the distinction you have to make between a gene tree and a species tree.

Right.

They aren't always the same thing.

Why would a single gene's evolutionary history differ from the history of the entire species?

It's due to something called shared polymorphism.

This is where genetic divergence within a single locus can actually predate the speciation event.

So imagine an ancestral population splits into species A and species B.

If that original population was already polymorphic for a certain gene, say alleles 1 and 2, that polymorphism might be maintained through the speciation event.

So if you only build your tree based on that one single gene, you might get a confusing result.

Right.

Like your MHC example from earlier.

You could have a human with one allele appear closer to a gorilla with the same allele than to another human.

Exactly.

Because the variation in that gene is actually older than the human-gorilla split itself.

To truly reconstruct a species tree, you have to analyze multiple independent genes.

And even multiple gene phylogenies can be complicated by this phenomenon of genomic blurring, horizontal gene transfer, HGT.

Right.

HGT is the transfer of genes across species lines without traditional reproduction.

It's incredibly common in bacteria.

It's a massive confounding factor in prokaryotic phylogeny because it can make two unrelated species look far more closely related than they actually are.

I think I read that something like 18% of one E. coli strain's genome might have been acquired horizontally.

In the last 100 million years, yes, it's a huge factor.

And HGT has always been viewed as extremely rare in animals, but you mentioned a truly striking exception.

The bdelloid rotifer Adineta vaga.

It's an evolutionary oddity.

It reproduces strictly asexually, and it has this remarkable ability to survive extreme desiccation, drying out completely.

And what did researchers find in its genome?

They found that up to 6% of its genome, especially in the telomere-rich regions at the ends of chromosomes,

appears to be foreign DNA.

It matches genes from bacteria, from fungi, from other distant sources.

6% foreign DNA is huge.

How is this DNA getting integrated into the rotifer's genome?

The working hypothesis is that it's tied to its desiccation survival.

When the rotifer rehydrates, its cells become highly permeable.

Foreign DNA from the environment, bacterial or fungal DNA that survived the drying, is taken up and integrated into the rotifer's germ line.

That is just incredible.

It's astonishing.

Some of these foreign genes have even gained typical eukaryotic introns since they were incorporated.

This rotifer really stands as a unique challenge to the assumption that animal evolution is strictly vertical.

So given the computational nightmare of millions of possible trees and all these complexities, how do computer programs actually zero in on the single best inferred tree?

Well, since the true history is almost always unknowable, we rely on inferred trees, the most likely relationships given our data and specific assumptions.

And we use three fundamental approaches that operate on completely different principles.

Okay, let's start with the conceptually simplest one.

Distance matrix methods.

Right.

These methods, like UPGMA, they group taxa based purely on their overall genetic distance, on K.

The process is iterative.

You start by finding the two closest species, let's say A and B, and you cluster them together.

Then A and B become a new group, AB.

So how do you calculate the distance from that new group to a third species, C?

You calculate the distance from AB to C by taking the simple arithmetic average of the distance from A to C and B to C.

And you just repeat this process, clustering the next two closest groups until all the species are grouped into a single tree.

It sounds computationally efficient.

What's the weakness?

Its fundamental weakness is that UPGMA assumes the molecular clock is perfectly constant across all lineages.

Since we know that's not true, remember the rodent versus primate data, this assumption often leads to inaccurate trees.
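The clustering loop just described can be sketched as follows. The input format (a dictionary keyed by unordered pairs), the function name, and the toy distances are all assumptions made for illustration. One detail goes slightly beyond the description above: when a merged group already contains several species, the sketch weights the average by group size, which is the usual UPGMA convention and gives the same result as the simple average for the first merge.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by average distance; return the tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) -> pairwise distance (for example, K).
    """
    clusters = {name: 1 for name in names}  # cluster -> number of species in it
    d = dict(dist)
    while len(clusters) > 1:
        # Find the two closest current clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for c in clusters:
            # Distance from the new group to every remaining cluster:
            # average of the two old distances, weighted by group size.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Toy distances, invented for illustration only.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
print(upgma(["A", "B", "C", "D"], toy))
```

Because every merge assumes both members are equally far from their shared ancestor, lineages that evolve at unequal rates can be grouped incorrectly, which is the weakness noted above.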

So the second approach, maximum parsimony, relies on a core biological principle.

Evolutionary events are rare.

Parsimony assumes mutations are rare.

Therefore, the best tree, the most parsimonious one, is the tree that requires the minimum number of evolutionary changes to explain the observed sequence differences.

Evolution is assumed to take the laziest path.

And this method famously doesn't use all the sites in the sequence alignment?

It only uses what are called informative sites.

For a site to be informative in a parsimony analysis, it has to favor one tree arrangement over an alternative.

The mathematical requirement is that an informative site must have at least two different nucleotides, and each of those must be present at least twice across the species you're comparing.
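
The informative-site rule above is easy to check mechanically. The following is a small sketch, assuming the alignment is given as a list of equal-length strings with no gaps or ambiguity codes; the function names and the example sequences are made up for illustration.

```python
from collections import Counter

def is_informative(column):
    """True if at least two different bases each appear at least twice."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(alignment):
    """Return the 0-based column indices that are parsimony-informative."""
    # zip(*alignment) walks the alignment column by column.
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]

# Hypothetical four-species alignment.
alignment = [
    "AAGT",
    "AAGA",
    "AGCA",
    "AGCC",
]
print(informative_sites(alignment))
```

In this example the first column is constant and the last column has two bases that appear only once each, so only the two middle columns qualify.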

And one of the huge benefits of parsimony is that it can help us solve historical puzzles.

Absolutely.

By minimizing the number of required steps, maximum parsimony inherently generates inferred ancestral sequences at the internal nodes of the tree.

This gives us a theoretical glimpse of the genomes of organisms that existed millions of years ago.
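
One common way to count the minimum number of changes on a given tree, and to get candidate ancestral states along the way, is Fitch's algorithm. The transcript does not name a specific algorithm, so treat this as one illustrative choice. The sketch below scores a single alignment column on a rooted binary tree written as nested tuples; the tree format and the example data are assumptions.

```python
def fitch(tree, states):
    """Return (candidate ancestral states at this node, minimum changes below it).

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its base at one alignment column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: one substitution must have occurred here.
    return left_set | right_set, left_cost + right_cost + 1

# One hypothetical column: A and B carry G, C and D carry T.
column = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch((("A", "B"), ("C", "D")), column)[1])  # tree grouping A with B
print(fitch((("A", "C"), ("B", "D")), column)[1])  # tree grouping A with C
```

The first tree explains the column with one change and the second needs two, so parsimony prefers the first. Summing these counts over every informative site gives the score for a whole tree.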

Finally, we have the most computationally intensive, but often most rigorous, method.

Maximum likelihood.

Maximum likelihood is purely statistical.

Instead of counting steps or measuring distance, it calculates probabilities.

It takes into account everything we know about how nucleotides change.

For instance, it knows that transitions are more likely than transversions.

And then it calculates the probability of every change for every single possible tree structure.

Yes.

For every possible tree, it calculates the aggregate probability that that tree would have generated the observed sequence data.

The tree with the single highest probability is deemed the most likely.
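
To make the idea concrete, here is a minimal sketch of how the likelihood of one candidate tree can be computed. It assumes the Jukes-Cantor model named in the chapter summary, which treats every substitution as equally likely, so it does not capture the transition versus transversion bias mentioned above. Branch lengths are fixed by hand rather than optimized, and the tree format, function names, and data are all hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i becomes base j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, site):
    """For each base x, P(observed leaves below node | node has base x)."""
    if isinstance(node, str):  # leaf: only the observed base is possible
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    out = {x: 1.0 for x in BASES}
    for child, t in node:  # internal node: tuple of (child, branch_length)
        below = partial(child, site)
        for x in BASES:
            out[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return out

def site_likelihood(tree, site):
    """Likelihood of one column, with each base equally likely at the root."""
    return sum(0.25 * v for v in partial(tree, site).values())

def log_likelihood(tree, alignment):
    """Sum of per-column log likelihoods; alignment maps name -> sequence."""
    length = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {n: s[i] for n, s in alignment.items()}))
        for i in range(length)
    )

aln = {"A": "GGAT", "B": "GGAT", "C": "TTAC", "D": "TTAC"}
tree1 = (((("A", 0.1), ("B", 0.1)), 0.1), ((("C", 0.1), ("D", 0.1)), 0.1))
tree2 = (((("A", 0.1), ("C", 0.1)), 0.1), ((("B", 0.1), ("D", 0.1)), 0.1))
print(log_likelihood(tree1, aln) > log_likelihood(tree2, aln))
```

A real analysis would also search over branch lengths and over many more tree shapes, which is where the heavy computation comes from.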

This intensive calculation is why it only became feasible in the era of modern supercomputing.

But the big caveat here is that the result is entirely dependent on the underlying model you choose.

That's critical.

Different statistical models can yield different most likely trees, so researchers have to be really meticulous in choosing and describing the model they use.

However, when multiple methods (distance, parsimony, and likelihood) all converge on the same tree structure, you can have very high confidence in that relationship.

Once we have a preferred tree, how do we express our confidence in its specific branching patterns?

We use a statistical method called bootstrapping.

Conceptually, you take your original sequence alignment and you randomly sample the columns with replacement thousands of times.

This creates thousands of slightly perturbed, resampled data sets.

So you're essentially creating thousands of slightly varied molecular histories, all based on your original data.

Exactly.

Then you generate a tree for each of those thousands of resampled data sets, and any grouping that appears in, say, 95% of those resulting trees is considered extremely well supported by the data.

We then place that confidence number or bootstrap value right next to the node on the final tree.
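
The resampling procedure can be sketched as follows. This is an illustration under stated assumptions: the alignment is a dict of equal-length strings, trees are nested tuples of leaf names, and `build_tree` is a placeholder for whatever tree-building method is being tested (distance, parsimony, or likelihood). None of these names come from the chapter.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to make one pseudo-replicate."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def clades(tree):
    """Return (leaf set of this subtree, list of every multi-leaf group in it)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    return leaves, found + [leaves]

def bootstrap_support(alignment, build_tree, replicates=1000, seed=0):
    """Percentage of replicate trees in which each grouping appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        replicate = resample_columns(alignment, rng)
        _, found = clades(build_tree(replicate))
        counts.update(set(found))
    return {group: 100.0 * n / replicates for group, n in counts.items()}
```

A call such as `bootstrap_support(aln, my_tree_builder)` returns a dictionary keyed by groups of taxa; the value for a group is the bootstrap percentage that would be written next to the corresponding node. The group containing every taxon always scores 100 and is usually ignored.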

Molecular analysis hasn't just clarified relationships between dog breeds.

It has fundamentally rewritten the most ancient chapter in the history of life.

The comparison of 16S rRNA sequences, a molecule common to all life, allowed Carl Woese and Norm Pace to build a tree of life that completely broke the traditional five-kingdom system.

And what did that molecular evidence reveal about the domains of life?

It showed three major domains: bacteria, archaea, and eukarya.

What was genuinely shocking was that the archaea and bacteria, both classified as simple prokaryotes, were found to be as genetically different from each other as bacteria are from eukarya.

Phenotype alone was misleading.

And molecular evidence also settled the crucial question of how complex cells arose, the endosymbiont theory.

By analyzing the 16S rRNA sequences found inside mitochondria and chloroplasts, researchers confirmed they had independent evolutionary origins from the nucleus.

In fact, mitochondrial RNA sequences showed their closest relationship to the bacterium Rickettsia prowazekii, the agent of epidemic typhus.

That is profound.

Our energy-producing organelles are literally descended from an ancient, free-living bacterial ancestor that was engulfed by another cell.

And molecular analysis has also shone a powerful light on our own story.

When analyzing genetic differences among human populations, the variation is surprisingly small.

Human mtDNA, for example, differs by only about 0.33% on average.

Which is tiny.

Tiny.

Subspecies of orangutans differ by 5%.

This high degree of genetic similarity means all modern humans are incredibly recently related.

And the key finding for tracing human origins is about where the greatest diversity lies.

The maximum genetic diversity is found within Africa.

This is strong support for the out-of-Africa theory, which posits that modern humans originated and experienced their earliest divergence there.

When small groups migrated out later, they carried only a subset of that initial genetic diversity.

And this is where we get mitochondrial Eve and Y-chromosome Adam.

Both tracing back to Africa about 200,000 years ago.

And moving from human origins to domesticated species, the story of the dog is also beautifully illustrated by molecular phylogeny.

Absolutely.

By analyzing microsatellite markers across dozens of dog breeds, researchers clustered them into four distinct groups.

The oldest cluster, containing breeds like the Siberian Husky and the Chow Chow, showed the greatest molecular similarity to the gray wolf and traces its origin back to Asia and Africa.

And the other clusters represent the more recent, intense European breeding efforts.

Yes.

The remaining three clusters represent later specializations.

Dogs for guarding, like bulldogs.

Dogs for hunting, like retrievers.

And dogs for herding, like collies.

The molecular data just beautifully validates those functional and geographical classifications.

All of this analysis leads us to the biggest question.

Where does new biological function come from?

How does evolution innovate and build proteins that have never existed before?

Innovation is usually not creation from scratch.

It's molecular tinkering, a concept hypothesized by J.B.S. Haldane way back in 1932.

He proposed that new genes arise by acquiring function through mutating redundant copies of existing genes.

This is the immense utility of gene duplication.

The moment a gene is duplicated, that extra copy is released from the immediate strong constraint of natural selection because the original copy is still there doing its job.

And that release turns the duplicate into an evolutionary sandbox.

It's free to accumulate changes.

Most of those will be bad and will eventually inactivate the copy.

But very rarely, one of those changes results in a new, advantageous function.

This is the foundational mechanism for creating multigene families.

The globin gene family is the classic example, right?

With a history of duplication spanning 1.5 billion years, with different copies specializing for oxygen transport at different stages of our development.

And this duplication and deletion process is dynamic.

It's happening even in humans today through unequal crossing over during meiosis.

If homologous chromosomes misalign, the resultant gametes can have too many or too few copies of a gene.

You also brought up gene conversion as a mechanism of change.

How does that interact with duplication?

Gene conversion is a non-reciprocal process, a molecular editing mechanism where one allele sequence physically replaces a homologous allele.

It can sometimes correct an inactivating change in a duplicated gene, restoring its identity.

But sometimes this molecular correction is actually harmful.

It can override a useful change.

You see this very clearly in the red-green color vision genes on the X chromosome.

Gene conversion sometimes corrects a substitution that differentiates the red and green opsin genes, forcing them to become identical again.

This is a primary reason why about 8% of human males suffer from color blindness.

So gene conversion is actively erasing beneficial molecular changes.

It highlights that genetic change is a continuous, dynamic, and sometimes even detrimental process.

And the extent to which organisms rely on duplication as raw material is just astonishing.

Even in organisms that were specifically selected for having supposedly non-redundant genomes.

Arabidopsis thaliana, the thale cress, was chosen for sequencing partly because its genome was believed to be streamlined and non-redundant.

But once it was sequenced, over half of its 25,900 genes were found to be duplicates.

It reinforces the principle.

Gene duplication is the primary engine of evolutionary innovation.

And that innovation isn't limited to just duplicating entire genes.

It also happens at a segmental level, which brings us to the concept of domain shuffling.

Yes, internal duplication of functional protein segments is very common.

Human serum albumin is three perfect copies of a 195 amino acid domain.

This matters because exon boundaries in a gene often correspond precisely to the boundaries of functional domains within a protein, like a receptor binding site or a catalytic site.

So Walter Gilbert proposed in 1978 that evolution creates new complex proteins not by building them from scratch, but by duplicating and rearranging these functional domains, like mixing and matching Lego bricks.

That's domain shuffling.

It proposes that the movement and combination of these modular, pre-tested functional units encoded by individual exons can rapidly generate novel protein structures.

This mechanism allows evolution to quickly assemble complex proteins with multiple functions, accelerating innovation far beyond what simple point mutations could ever achieve.

What an incredible journey, tracing life's history from the fidelity of DNA repair all the way to the vast computational task of building the tree of life.

If we were to summarize the highest yield takeaways from this whole deep dive, four core principles really define molecular evolution.

First, evolutionary rates are meticulously governed by functional constraint.

Pseudogenes evolve fastest, giving us the true mutation rate, while critical sequences like histone H4 evolve slowest.

Second, natural selection is an absurdly powerful force.

It's capable of optimizing the cellular economy down to the most subtle molecular nuance, like favoring efficiency in codon usage and using inexpensive amino acids.

Third, molecular phylogeny completely revolutionized our view of deep history.

It established the three domains of life, bacteria, archaea, and eukarya, and molecularly confirmed the endosymbiont theory, tracing our organelles back to ancient bacteria.

And finally, innovation is driven by redundancy.

Gene duplication, whether of a whole gene or just a functional domain through exon shuffling, provides the necessary raw material, that sandbox, for natural selection to generate proteins with entirely new biological functions.

And now, for one final thought for you to carry forward, we established that molecular substitution rates vary significantly between species.

Primates evolve much slower genetically than, say, rodents, mostly due to our long generation times.

Right.

So where are you going with this?

Well, if the modern environment is changing faster than ever (climate, pathogens, human impact), what implications might this difference in molecular pace have for the ability of long-lived, slow-reproducing taxa like primates to adapt genetically, compared to organisms that accumulate substitutions at twice the rate?

It raises a profound question about the future of slow-evolving life in a fast-changing world.

A compelling challenge to end on.

Thank you for joining us on this deep dive into the history written in our molecules.

We hope you feel thoroughly well informed.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Molecular evolution integrates genetic and evolutionary principles to examine how DNA and protein sequences transform across extended periods associated with organismal divergence and speciation. Unlike population genetics, which tracks short-term changes in allele frequencies within populations, this field concentrates on long-term patterns of substitution accumulation—permanent mutations that persist through natural selection—across deep evolutionary timescales. Substitution rates vary considerably throughout the genome depending on the functional role of sequences; pseudogenes and synonymous nucleotide positions experience rapid evolution because they escape selective pressure, whereas nonsynonymous changes in protein-coding regions accumulate slowly to preserve essential protein structure and function. The Jukes-Cantor model provides a quantitative framework for calculating true substitution frequencies by accounting for the statistical likelihood that multiple replacements have occurred at individual sites, obscuring the actual number of evolutionary changes. Central to this field is the molecular clock hypothesis, proposing that substitutions accumulate at relatively constant rates, enabling researchers to estimate the timing of divergence events between lineages. However, substitution rates are not uniform across all lineages; generation time, metabolic efficiency, and DNA repair capacity influence evolutionary velocity. Phylogenetic tree reconstruction relies on three computational approaches: distance matrix methods such as UPGMA, which group organisms by genetic similarity; maximum parsimony analysis, which identifies the evolutionary scenario requiring the fewest mutational steps; and maximum likelihood methods, which calculate probabilities of sequence change under explicit evolutionary models. Bootstrapping provides statistical validation of inferred tree topologies. 
A crucial distinction exists between gene trees, representing evolutionary histories of specific sequences, and species trees, depicting true organismal relationships; ancestral polymorphisms can generate incongruence between these frameworks. Horizontal gene transfer, particularly prevalent in bacteria and rare in animals except in cases like bdelloid rotifers, further complicates phylogenetic inference. At macroevolutionary scales, ribosomal RNA sequencing established the three-domain classification of life—Bacteria, Archaea, and Eukarya. The endosymbiont theory explains how mitochondria and chloroplasts arose from free-living prokaryotic ancestors through permanent incorporation into eukaryotic cells. Novel genetic functions emerge through gene duplication, exemplified by the divergence of alpha-globin and beta-globin clusters, gene conversion, and domain shuffling. Molecular evidence from mitochondrial DNA and Y-chromosome variation supports the Out-of-Africa model of human dispersal and modern human origins.
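
As a small worked illustration of the Jukes-Cantor correction mentioned in the summary, the sketch below converts the observed fraction of differing sites, p, into an estimated number of substitutions per site, K = -(3/4) ln(1 - 4p/3). The function name and the example sequences are assumptions, and the sequences are taken to be aligned, equal in length, and free of gaps.

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """Estimate substitutions per site between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # p is the observed proportion of sites that differ.
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        # At 75% difference the sequences look random and K is undefined.
        raise ValueError("sequences are saturated; distance cannot be estimated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical sequences differing at 1 of 10 sites (p = 0.1).
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The estimate comes out slightly larger than the raw 10% difference, because the correction allows for sites that changed more than once.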
