Chapter 6: Exploring Evolution & Bioinformatics

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today we're embarking on a journey that might seem, well, purely computational on the surface, but it is in fact one of the most compelling fields of modern history, molecular history.

That's a great way to put it.

We are looking deep into how the fundamental blueprints of life, our proteins and our genes, tell this sweeping story of evolution.

And we're not doing it with fragile fossils, but with sequences and structures.

That's absolutely right.

Our sources today center around chapter six, which really gives us the conceptual and technical toolkit, bioinformatics, to read those molecular narratives.

Right.

And the central thesis, the core idea we have to hold on to is that evolutionary descent is revealed through what you could call molecular family resemblance.

I love that framing.

It makes so much sense.

We all recognize family resemblance in people, a certain nose shape, a shared mannerism, but in the world of biochemistry, that resemblance is most strongly preserved in the three-dimensional structure of a molecule.

Yes.

So why structure specifically?

Well, because structure dictates function.

It's that simple.

If the function is conserved over millions of years, the shape must be conserved.

Okay.

So let's start with a clear example.

Let's do it.

A really straightforward one is the enzyme ribonuclease, which digests RNA.

If you compare the 3D structure of bovine ribonuclease from a cow with the structure of human ribonuclease, they are just remarkably similar.

Because they do the same job.

They execute nearly identical functions in our respective bodies.

So the molecular architecture has to remain essentially the same.

That makes perfect sense.

But here's the classic aha moment, the part that really proves the power of this idea.

We look at a protein called angiogenin.

Right.

And angiogenin's function is dramatically different.

It stimulates the growth of new blood vessels.

That's tissue development, not digestion.

Totally different worlds, functionally speaking.

And yet its three-dimensional structure is so closely related to ribonuclease that they are clearly, without a doubt, classified as being in the same protein family.

That is the smoking gun.

It is.

That structural conservation strongly suggests that despite their completely diverged modern functions, they both sprang from a common ancestral protein.

So evolution kept the blueprint.

It retained the core folding pattern.

The structure.

Even as the specific chemical role adapted to two totally different physiological needs, the structure becomes the enduring trait that ties the entire family together.

So structure is the ultimate proof of kinship.

But the problem, as you pointed out, is that solving 3D structures is incredibly laborious.

It's slow.

They are, you know, scarce resources.

But what we have in truly vast overwhelming abundance are the gene and amino acid sequences, thanks to decades of sequencing tech.

Sequence data is the treasure trove.

It really is.

And our key insight here is that evolutionary relationships are also profoundly manifest in these sequences.

Just think about the human myoglobin sequence.

That's the protein that holds oxygen in your muscle.

It's 153 residues long, and it differs from the chimpanzee myoglobin sequence in only one single amino acid.

Just one.

One.

That minuscule difference just crystallizes the profound kinship between our two species.

So our mission today is a bit complex.

We need to examine the precise computational methods that are used to compare these sprawling amino acid sequences, deduce that evolutionary kinship based on statistical probabilities, and ultimately unlock the deep history of life that's encoded right within our proteins.

We are bridging structure, sequence, function, and the massive power of computation.

That's the plan.

And to start our deep dive into the methods, we have to establish the absolute foundation of molecular kinship, which is the concept of homology.

Right, because the entire exploration of biochemical evolution is really an attempt to map how proteins, molecules, and even entire metabolic pathways have changed, adapted, and been preserved over eons.

And homology is, I guess, the fundamental building block of all of that.

It is.

The formal definition is simple, but it's incredibly powerful.

Two molecules are homologous if and only if they were derived from a common ancestor.

It's not just that they're similar.

No.

It tells you not just that they are similar, but why they are similar, because they share a history.

And the practical payoff for a modern biochemist is just enormous.

If you sequence a new protein from, say, a newly discovered bacterium from a volcano or something, and you find it's homologous to a well-characterized human enzyme, you immediately have a strong informed hypothesis about the new protein's biochemical function.

You can jump start years of research just by establishing that kinship.

It's like a cheat code.

It is.

It's the ultimate functional prediction tool.

And we categorize these homologous molecules, or homologues, into two critical classes.

It's based on where that common ancestry played out relative to speciation.

Okay, so let's start with the first one.

Paralogs.

Paralogs are homologues that exist within one single species.

Our introductory example is the perfect illustration here.

Human ribonuclease and human angiogenin.

They're both in us.

They're in the same human body, but they perform different detailed functions.

Their common ancestor was a single gene that duplicated within the genome of an ancestral organism.

One copy kept the original function, and the other copy was free to mutate and evolve to take on a new related role.

In this case, stimulating blood vessel growth.

So, paralogs are the result of a gene duplication event, and that leads to functional diversification within an organism.

Correct.

Then we have orthologs.

Yes.

They're also descended from a common ancestor, but the separation happened because of a speciation event.

The key feature of orthologs is that they usually maintain very similar or even identical functions across that evolutionary split.

Like the cow and human ribonuclease.

Perfect orthologs.

They both digest RNA.

The function was conserved even after the evolutionary path of cows and humans diverged millions and millions of years ago.

That distinction is so crucial because if I find a new protein and determine it's an ortholog of a known human protein, I can pretty confidently predict its exact function.

You can.

But if I determine it's a paralog, I know its function will be related to the known protein, but I have to do more research to find out how it's diversified.

It tells you everything about the trajectory of that specific molecule.

But now, okay, we move to the heavy lifting.

How do we actually detect this in the vast, noisy sea of sequence data?

That requires some pretty robust statistical analysis.

That brings us directly to statistical analysis and the essential method of sequence alignment.

The premise is pretty intuitive.

Significant sequence similarity implies a shared evolutionary origin.

I mean, if they look the same, they probably share history, which strongly suggests they'll have similar structure and function.

And the sources make a really compelling argument here.

Why is comparing protein sequences so much more robust and effective than comparing, say, nucleic acid sequences like DNA?

It's purely a matter of probability.

It gives us a much higher signal-to-noise ratio with proteins.

Proteins are constructed from 20 unique amino acid building blocks.

DNA and RNA each use only four bases, you know, A, T, C, and G for DNA, with U in place of T for RNA.

So if you randomly align two sequences, the chance of two amino acids matching just by accident is one in 20.

Right.

The chance of two nucleic acids matching accidentally is one in four.

Much, much higher.

Way higher.

Because proteins have so many more possible building blocks, any similarity we detect is statistically far less likely to be due to just random chance.

So similarity in a protein sequence is a much stronger indicator of genuine evolutionary kinship.
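
The one-in-20 versus one-in-four argument can be made concrete with a little arithmetic. The following is a minimal sketch that assumes every letter is equally likely and that positions are independent, which real sequences only approximate; the function names are illustrative and not from the chapter.

```python
from fractions import Fraction


def chance_match_probability(alphabet_size: int) -> Fraction:
    """Chance that two letters drawn uniformly and independently are identical."""
    return Fraction(1, alphabet_size)


def chance_run_probability(alphabet_size: int, run_length: int) -> Fraction:
    """Chance that `run_length` aligned positions all match by accident."""
    return chance_match_probability(alphabet_size) ** run_length


# Five consecutive accidental matches (uniform-frequency assumption).
print(chance_run_probability(20, 5))  # 1/3200000 for amino acids
print(chance_run_probability(4, 5))   # 1/1024 for nucleotides
```

Under these simplifying assumptions, the same five-position match is thousands of times less likely to be accidental in a protein than in a nucleic acid.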

Right.

Let's look at the classic case study of comparing the globins, for example, human hemoglobin, the alpha chain, which is 141 residues, and human myoglobin at 153 residues.

Right.

The initial very simplistic approach would be a kind of brute force slide and match.

You're just sliding one sequence past the other one amino acid at a time and counting the exact sequence identities.

And with that method, what do you get?

You find a maximum of 23 identities using this simple method.

But the moment we move away from that simplest approach, we run into the limitations.

If we shift the sequence just slightly, we might find 22 matches.

And that simplistic view forces us to choose only one best alignment.

And we're potentially losing the real deeper connection that's hidden in the other shifts.
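
The slide-and-match idea described above can be sketched in a few lines. This is an illustrative toy, not the textbook's procedure: the sequences are made up (the second is the first with one residue deleted), and the function names are assumptions.

```python
def count_identities(a: str, b: str, offset: int) -> int:
    """Count identical residues when b is shifted `offset` positions along a."""
    matches = 0
    for i, residue in enumerate(b):
        j = i + offset
        if 0 <= j < len(a) and a[j] == residue:
            matches += 1
    return matches


def best_ungapped_offset(a: str, b: str) -> tuple[int, int]:
    """Try every shift with at least one overlapping position.

    Returns (identities, offset) for the best-scoring shift.
    """
    offsets = range(-(len(b) - 1), len(a))
    return max((count_identities(a, b, off), off) for off in offsets)


# Toy sequences: toy_b is toy_a with the fifth residue (F) deleted.
toy_a = "ACDEFGHIKL"
toy_b = "ACDEGHIKL"
print(count_identities(toy_a, toy_b, 0))   # 4 (matches before the deletion)
print(count_identities(toy_a, toy_b, 1))   # 5 (matches after the deletion)
print(best_ungapped_offset(toy_a, toy_b))  # (5, 1)
```

No single shift captures both sets of matches, which is the same limitation the 23-identity and 22-identity globin alignments illustrate.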

That's because evolution isn't static.

Yeah.

It doesn't just involve single point mutations or substitutions.

It involves the insertion or deletion of entire stretches of genetic material.

A whole block of amino acids might be added or removed.

Right.

To correctly model that history, we have to allow for gaps in the alignment.

So introducing a gap is the computational equivalent of saying, okay, at this point in history, one protein experienced a deletion event while the other one did not.

Exactly.

And by allowing a calculated gap, which is a necessary computational artifact to account for a real evolutionary event, we can synthesize the information found in both the 23 and 22 match simple alignments into one more comprehensive view.

And the improvement in the signal is pretty dramatic.

It is.

That gap alignment yields 38 identities spread over an average length of 147 residues.

That translates to about 25.9% identical.

So we've increased the meaningful identity significantly.
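
The percentage quoted here follows directly from the numbers given in the discussion: 38 identities over the average of the two chain lengths (141 and 153 residues). A small sketch of that arithmetic, with illustrative variable names:

```python
alpha_length = 141      # human hemoglobin alpha chain, residues
myoglobin_length = 153  # human myoglobin, residues
identities = 38         # identities in the gapped alignment

mean_length = (alpha_length + myoglobin_length) / 2
percent_identity = 100 * identities / mean_length

print(mean_length)                 # 147.0
print(round(percent_identity, 1))  # 25.9
```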

But how do we prevent the system from cheating?

If I can just add unlimited gaps, I could force any two sequences to look similar, couldn't I?

That's the challenge.

Yeah, that's exactly right.

We have to apply aggressive penalties for introducing those gaps.

The key idea isn't really the specific numerical result of the scoring calculation, but the conceptual trade-off we are making.

We score matches highly, say plus 10 points for every identity, but we punish gaps severely.

So you're saying that we have to treat the introduction of a gap, which represents a major evolutionary event like a deletion or an insertion, as having a much higher cost than just a single amino acid substitution.

Precisely.

If we apply a simple penalty system, like minus 25 points for each gap introduced, regardless of its length, we make sure that the system only allows a gap when the resulting sequence identities overwhelmingly justify that penalty.

It prevents artificial meaningless alignments.
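
The scoring trade-off just described (plus 10 per identity, minus 25 per gap regardless of length) can be sketched as a function that scores an already-aligned pair of rows. The function name, the use of "-" as the gap character, and the toy sequences are assumptions for illustration.

```python
def score_alignment(row_a: str, row_b: str,
                    identity_score: int = 10, gap_penalty: int = 25) -> int:
    """Score two aligned rows: +identity_score per identity,
    -gap_penalty per gap, where a gap is any run of '-' in one row."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identities = 0
    gaps = 0
    open_gap_in = None  # which row currently has an open gap, if any
    for x, y in zip(row_a, row_b):
        current = "a" if x == "-" else "b" if y == "-" else None
        if current is None:
            if x == y:
                identities += 1
        elif current != open_gap_in:
            gaps += 1  # a new gap starts here
        open_gap_in = current
    return identities * identity_score - gaps * gap_penalty


# One well-placed gap recovers nine identities: 9 * 10 - 25.
print(score_alignment("ACDEFGHIKL", "ACDE-GHIKL"))  # 65
# Two gaps that buy nothing extra are punished: 3 * 10 - 2 * 25.
print(score_alignment("AC-DE", "A-CDE"))            # -20
```

With these weights, a gap only pays for itself when it brings at least three additional identities into register.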

Okay, so once we have that score, we still need to figure out its statistical significance.

How do we know that 25.9% identity isn't just the result of random chance, since all proteins draw from the same 20 amino acids?

We use a really powerful statistical technique called shuffling.

We take the sequence of one protein, let's use myoglobin, and we just randomly rearrange its amino acids.

We scramble it completely.

Then we repeat the alignment process with that now randomized sequence, and we record the score.

And you repeat this hundreds or even thousands of times.

Yes, and this generates a massive histogram, a bell curve, showing the distribution of scores you would get purely by chance.

If your authentic real alignment score falls somewhere inside that random distribution, well, you can't rule out chance, and the relationship is weak.
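
The shuffling test can be sketched as follows. This is a simplified illustration, not the chapter's exact procedure: it scores each comparison by the best ungapped identity count rather than a gapped alignment score, and the function names, default trial count, and toy sequence are all assumptions.

```python
import random


def best_ungapped_identities(a: str, b: str) -> int:
    """Best identity count over all shifts of b along a (no gaps)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1 for i, residue in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffled_scores(a: str, b: str, trials: int = 1000, seed: int = 0) -> list[int]:
    """Scores of a against randomly shuffled copies of b (same composition)."""
    rng = random.Random(seed)
    residues = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(best_ungapped_identities(a, "".join(residues)))
    return scores


def fraction_at_least(real_score: int, scores: list[int]) -> float:
    """Fraction of shuffled scores that reach or exceed the real score."""
    return sum(s >= real_score for s in scores) / len(scores)


# Toy example: a sequence compared with an identical copy of itself.
toy = "ACDEFGHIKLMNPQRSTVWY"
real = best_ungapped_identities(toy, toy)
background = shuffled_scores(toy, toy, trials=200)
print(real, max(background), fraction_at_least(real, background))
```

If the real score sits far above everything in the shuffled background, chance is an implausible explanation; if it falls inside the background, the relationship cannot be distinguished from noise by this test.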

But when this was applied to the authentic myoglobin and alpha-hemoglobin alignment, the results were not weak.

Far from it.

The actual alignment score stood substantially above the entire shuffle distribution.

The sources emphasize this stunning probability.

The odds of this deviation occurring by chance alone are approximately 1 in 10 to the power of 70.

That's an incomprehensible number.

It's effectively zero.

When you have statistical certainty that high, you can comfortably conclude that the sequences are genuinely similar, and they must be homologous.

That level of certainty is just astonishing.

But even with 1 in 10 to the 70th odds, we still run into trouble when the evolutionary relationship is more distant, right?

If the proteins diverged billions of years ago, simple identity scoring starts to fail, because too many of those original matched residues have mutated.

That's the limitation of looking only for exact identities.

It's too strict.

We need a more sophisticated tool that acknowledges that, well, not all mutations are equal.

So we introduce the concept of the substitution matrix.

Uh, this is where we stop treating every mismatch as a score of zero.

We start classifying them based on chemical properties, conservative versus non-conservative substitutions.

Exactly.

Think about the functional requirement.

A conservative substitution means one amino acid is replaced by another that is very similar in size and chemical property.

For example, swapping the positively charged lysine, or K, for the positively charged arginine.

Okay.

Evolution essentially approved this switch because it likely had only a minor effect on the protein's overall structure and function.

Therefore, it gets a high positive score.

And conversely, a non-conservative substitution is a radical swap, like replacing that positively charged lysine with a big, bulky, non-polar tryptophan.

It's again.

That massive change would likely disrupt the protein's folding or its activity, and therefore gets a low or even a negative score.

So the real aha of the substitution matrix is that it lets us look past these simple mismatch failures and see that sometimes a swap isn't a mistake at all.

It's a highly intelligent, conservative replacement that evolution approved of.

And that can reveal these deeply ancient relationships that identity-only scoring would just miss completely.

It's a much more sensitive tool.

And the most widely used one for this is the BLOSUM-62 matrix.

And how was this created?

It wasn't just theoretical, right?

No, not at all.

BLOSUM stands for Blocks Substitution Matrix, and this one is specifically BLOSUM-62.

It was empirically derived by looking at actual observed substitutions in thousands of evolutionarily related proteins.

It's a map of what substitutions evolution actually allowed to persist over time.

And the scoring becomes very nuanced here.

Very.

Even the identity scores vary.

For example, cysteine C and tryptophan W are rare amino acids.

If you find one aligned in two sequences, it's a bigger deal than finding two very common alanines.

Therefore, cysteine and tryptophan identities get higher scores because they align by chance much less often.
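
To see how matrix-based scoring differs from identity counting, here is a minimal sketch using a handful of BLOSUM-62 entries (A/A 4, K/K 5, R/R 5, C/C 9, W/W 11, K/R 2, K/W -3). Only these few pairs are included, so this is a toy lookup rather than the full matrix, and the function names are illustrative.

```python
# A small subset of BLOSUM-62 scores; the full matrix covers all 20 amino acids.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,    # common residue: modest identity score
    ("K", "K"): 5,
    ("R", "R"): 5,
    ("C", "C"): 9,    # rare residue: high identity score
    ("W", "W"): 11,   # rare residue: highest identity score
    ("K", "R"): 2,    # conservative swap (both positively charged)
    ("K", "W"): -3,   # non-conservative swap
}


def pair_score(x: str, y: str) -> int:
    """Look up a symmetric pair score; raises KeyError outside the subset."""
    key = (x, y) if (x, y) in BLOSUM62_SUBSET else (y, x)
    return BLOSUM62_SUBSET[key]


def score_ungapped(a: str, b: str) -> int:
    """Sum pair scores over two equal-length, already-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(score_ungapped("KAW", "RAW"))  # 17: the K/R swap still adds to the score
print(score_ungapped("KAC", "WAC"))  # 10: the K/W swap subtracts from it
```

Under identity-only scoring both toy pairs would count two matches and one mismatch; the matrix separates the conservative swap from the radical one.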

And the penalty system also matures with BLOSUM-62, moving away from just a single flat cost.

Yes.

The system becomes two-tiered, which reflects the evolutionary reality.

It might cost, say, minus 12 points to introduce an initial single residue gap, but only minus two points per residue to extend that existing gap.

Why the difference?

It acknowledges that a single long deletion event is biologically far more probable than many scattered single residue deletions.
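
The two-tiered penalty can be worked through with the example figures mentioned above (12 to open, 2 per additional residue). This sketch assumes the opening cost covers the first gapped residue and each further residue adds the extension cost; other conventions exist, and the function name is illustrative.

```python
def affine_gap_cost(length: int, open_cost: int = 12, extend_cost: int = 2) -> int:
    """Cost of one contiguous gap of `length` residues.

    Assumed convention: the first residue costs `open_cost`,
    every additional residue costs `extend_cost`.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return open_cost + extend_cost * (length - 1)


one_long_gap = affine_gap_cost(5)            # a single five-residue deletion
five_single_gaps = 5 * affine_gap_cost(1)    # five scattered one-residue deletions
print(one_long_gap)      # 20
print(five_single_gaps)  # 60
```

The same five missing residues cost three times as much when scattered, which is how the scoring favors a single long insertion or deletion event.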

I see.

We can see the true power of BLOSUM-62 when we compare really distant relatives like human myoglobin and lupine leghemoglobin, an oxygen-binding protein from a plant.

Right.

A huge evolutionary gap.

Using identity-only scoring, the similarity was so low it only suggested a 1 in 20 chance alignment, a 5% probability of it being random.

That's a very weak basis for concluding homology.

You wouldn't publish that.

You would not.

But when BLOSUM-62 was applied, incorporating all those beneficial conservative substitutions, the certainty just skyrocketed.

The odds of a chance alignment plummeted to 1 in 300.

That firmer conclusion allowed researchers to establish the homology of these two highly divergent oxygen carriers, bridging that vast evolutionary gap between plants and mammals.

So we have all this statistical muscle, but for the learner, there must be a simple way to conceptualize the strength of a sequence relationship.

Do the sources provide any practical rules of thumb?

They do, especially for sequences that are longer than about 100 amino acids.

You could think of it like a traffic light system for determining kinship.

Okay, let's start with the green light.

If you have greater than 25% identity, that's the green light.

The sequences are almost certainly homologous.

The signal is strong enough to confirm kinship immediately.

And the red light.

When should you stop looking for sequence evidence alone?

That's less than 15% identity.

The similarity is statistically insignificant based on sequence comparison alone.

You cannot conclude homology from this data.

And the tricky yellow or amber light, the gray zone.

That's the zone between 15% and 25% identity.

This requires further analysis.

The relationship is weak, it's possibly missed, or it's highly divergent.

And this is where you really need structure or functional data to confirm the connection.

And we must always remember the crucial caveat here.

A lack of statistical significance based on sequence alone does not rule out homology.

No, absolutely not.

It just means the sequence evidence has been washed out by evolutionary time.
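
The traffic-light rule of thumb can be written down as a small classifier. This is only a sketch of the thresholds as stated in the discussion (sequences longer than about 100 residues; above 25%, below 15%, and the zone in between); the function name and return labels are illustrative.

```python
def classify_identity(percent_identity: float, length: int) -> str:
    """Apply the rule-of-thumb thresholds to a percent identity."""
    if length <= 100:
        return "rule of thumb not applicable (sequence too short)"
    if percent_identity > 25:
        return "green: almost certainly homologous"
    if percent_identity < 15:
        return "red: not significant from sequence alone"
    return "yellow: needs structural or functional evidence"


print(classify_identity(25.9, 147))  # green: almost certainly homologous
print(classify_identity(16.0, 300))  # yellow: needs structural or functional evidence
print(classify_identity(10.0, 300))  # red: not significant from sequence alone
```

A "red" result here only says the sequence evidence is insufficient; as noted above, it does not rule out homology.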

Which brings us to the ubiquitous tool of modern bioinformatics, the BLAST search.

The Basic Local Alignment Search Tool.

This is the global standard, right?

It is the first step any researcher takes when a new sequence is elucidated.

You query your sequence against massive global databases, which, according to 2013 data in our sources, already held over 35 million sequences.

It's mind-boggling now.

And BLAST is designed for speed.

Speed and finding local regions of high similarity.

And when BLAST returns a list of potential relatives, how do we interpret the results?

You look at the E-value.

It's the expectation value, an estimate of chance likelihood.

A vanishingly small E-value, say 2 times 10 to the minus 27 for a highly conserved ortholog, indicates extremely high significance.

A high E-value means the similarity could plausibly be due to random chance.

We saw the immediate game-changing impact of this technology back in the mid-1990s, when scientists first sequenced the entire genome of the bacterium Haemophilus influenzae.

That was a huge milestone.

The researchers identified 1743 protein-coding regions, which they call open reading frames, or ORFs.

And for an audience maybe new to sequencing, an ORF is essentially a potential gene, a stretch of sequence that could code for a protein.

That's right.

So of those 1743 potential genes, how many could they immediately identify?

What was the number?

A staggering 1,007.

That's 58% of the ORFs.

They could be linked to a known protein function in another organism solely using sequence comparison via BLAST.

The ability to functionally predict over half of an organism's proteome just by looking at its raw sequence.

It's a profound testament to the power of homology and bioinformatics.

So if sequence alignment and BLAST are so incredibly powerful, often providing 1 in 10 to the 70th odds of kinship,

why do we need to talk about structure at all?

Why bother with the incredibly hard work of solving 3D structures?

That is the perfect question to bridge these two concepts.

We need structure because tertiary structure is the ultimate arbiter of function, and thus it is far more evolutionarily conserved than the primary sequence.

The shape is more important than the letters.

The sequence is mutable.

The function, and therefore the architecture, is sacred.

Evolution will allow the sequence to diverge wildly, but if the protein has to perform the same job like binding oxygen or cleaving a peptide bond, the overall shape has to stay remarkably similar.

So structure confirms kinship even when the sequence signal is weak or statistically insignificant.

Exactly.

Let's go back to the globin family.

Human hemoglobin, human myoglobin, and lupine leghemoglobin.

We know they are relatives.

But the sequence identity between human alpha hemoglobin and lupine leghemoglobin registers at only about 15%, statistically insignificant by our own rules of thumb.

It's in the red zone.

It is.

Yet when you compare their tertiary structures, they show virtually identical folding patterns.

This structural conservation confirms their descent from a common ancestor over billions of years, even when the sequence memory has almost completely faded away.

This means structure can also reveal completely unexpected kinship, where sequence comparison utterly failed to raise a flag.

Yes.

The textbook example here is the structural relationship between actin and heat shock protein 70, or Hsp70.

Okay, so what do they do?

Actin is a massive component of the cell's cytoskeleton, vital for movement and structure.

Hsp70 is a chaperone.

It helps other proteins fold correctly.

Vastly different cellular roles.

And what about their sequence?

Their sequence identity is only about 16%.

That is hovering right around that statistical insignificance threshold.

If you only ran a sequence alignment, you'd likely dismiss the connection entirely.

But the structure told a different story.

It did.

Their 3D structures are noticeably similar.

That shared architecture forces us to classify them as paralogs.

This structural discovery reveals a common ancestor that existed long ago, whose single gene duplicated, leading one descendant to become a structural protein and the other a folding assistant.

As more and more structures are solved, we find these hidden family ties more and more frequently.

So structure not only verifies history, but it reveals entirely new chapters of it.

Can structure help inform our sequence analysis directly instead of just retrospectively?

Absolutely.

It can, by identifying the residues that are absolutely critical for function.

These active site residues are the ones that are most strongly conserved throughout evolution.

For example, in all globins, from the plant to the human, there is a specific histidine residue that must be positioned to interact with the iron atom in the heme group.

A key player.

A non -negotiable player.

In human myoglobin, that's residue 64, and it is universally conserved across the entire family.

And knowing that helps us generate a sequence template.

Right.

A sequence template is a map of those structurally and functionally crucial conserved residues.

If a sequence comparison alone is weak, we can use this template, checking if a new, weak sequence has those specific key residues in the correct spatial relationship, to recognize distant family members that are otherwise undetectable.

It confirms homology that was statistically borderline.

It's like finding a distant relative with your great grandfather's nose and realizing the whole family's story is true, even though their hair color and height are completely different.

Exactly.

Now let's talk about using sequence analysis to look inward at a protein's own structure to detect internal repeats.

Right.

Over 10% of proteins contain these internal domains that are essentially repetitions of each other.

And if we align a protein sequence with itself, we can identify if the protein evolved by duplicating a gene segment.

This shows molecular recycling in action.

The case study provided is the TATA box-binding protein, TBP, which is critical for initiating gene transcription.

When researchers aligned the N-terminal and C-terminal halves of TBP, they found 30% identity over about 90 residues.

Given that 25% is our threshold for certainty between two different proteins, 30% identity within one protein is just overwhelmingly significant.

And the structural confirmation supported this internal story.

Completely.

The solved structure of TBP shows two domains that are nearly identical in shape.

This is just convincing evidence that the TBP gene evolved by duplicating an ancestral single-domain gene segment.

And this is a classic example of divergent evolution.

Yes, a single ancestor, the early gene diverging to create the two halves of the modern TBP.

OK, now we have to flip the script and discuss the counterpoint to divergent evolution, which is convergent evolution.

This is evolution finding the same structural answer twice, completely independently.

Convergent evolution occurs when two completely unrelated proteins, starting from different origins, evolve independently but arrive at remarkably similar structural or chemical features.

It happens because they are solving the exact same biochemical challenge.

It's the efficiency of nature at work.

The textbook example for this distinction is the serine protease family.

These are enzymes specialized in cleaving peptide bonds.

We compare mammalian chymotrypsin and bacterial subtilisin.

They both use the exact same chemical strategy.

A highly reactive active site centered on a triad of three key residues, a serine, a histidine, and an aspartic acid.

And these are positioned identically?

They are positioned in a nearly identical spatial arrangement.

It's a conserved solution for peptide hydrolysis.

Wait, if they have the same three residues positioned identically solving the same problem, how do we definitively say they are not homologous?

We zoom out.

The overall tertiary structures are incredibly dissimilar.

Chymotrypsin is predominantly composed of beta sheets forming a large complex domain structure.

Subtilisin, conversely, relies on extensive alpha helices.

So the scaffolding is different.

Totally different.

And furthermore, those key serine, histidine, and aspartic acid residues do not appear in the same order in the primary sequence of the two proteins.

Ah, so despite achieving the exact same highly specific chemical mechanism, the overall scaffold used to support that mechanism is totally different.

The conclusion is inescapable.

They did not inherit that active site from a common ancestor.

They evolved it independently.

Structure comparison is absolutely essential for making that definitive distinction between divergent evolution, which is shared history, and convergent evolution, which is a shared solution.

Finally, before we build our trees, sequence alignment isn't just for proteins.

It's also crucial for understanding RNA sequence alignment and its secondary structure.

RNA folds into these highly complex 3D shapes, often involving extensive base pairing, and it's notoriously difficult to crystallize.

So we use homology among a family of related RNAs to predict that secondary structure.

And the principle here relies on the idea that the function, the folding pattern, must be conserved, even if the building blocks themselves change.

Exactly.

If you look at homologous ribosomal RNA sequences across different species, the specific bases might vary, but the ability to form a stable base pair is what must be conserved.

Let's visualize that.

So if E. coli ribosomal RNA has a guanine, a G, paired with a cytosine, a C, at a certain position.

Then the human ribosomal RNA might have uracil, a U, paired with adenine, an A, at the corresponding position.

Both GC and UA are standard Watson-Crick base pairs.

They both form a double helix.

When we observe these paired mutations, where the identity of the bases changes, but the pairing ability is maintained, we can deduce with high confidence that those segments form a double helix.
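
The paired-mutation logic can be sketched as a simple check on two aligned RNA sequences. This toy considers only standard Watson-Crick pairs (G-U wobble pairs are deliberately ignored), the sequences are invented, and the function name and zero-based position indices are assumptions.

```python
# Standard Watson-Crick RNA pairs, in both orientations.
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def is_compensatory(seq1: str, seq2: str, i: int, j: int) -> bool:
    """True if positions i and j can pair in both sequences
    even though the bases themselves differ between the sequences."""
    pair1 = (seq1[i], seq1[j])
    pair2 = (seq2[i], seq2[j])
    return pair1 in WATSON_CRICK and pair2 in WATSON_CRICK and pair1 != pair2


# Toy aligned RNAs: the first and last positions covary (G-C versus U-A).
toy_rna_1 = "GAAAAC"
toy_rna_2 = "UAAAAA"
print(is_compensatory(toy_rna_1, toy_rna_2, 0, 5))  # True
print(is_compensatory(toy_rna_1, toy_rna_2, 1, 4))  # False (A and A cannot pair)
```

Positions that repeatedly pass this kind of check across many homologous sequences are good candidates for the two strands of a helical stem.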

So you can map out the whole structure that way.

By applying this logic across an entire family of homologous RNA sequences, we can predict the complete secondary structure, which has been powerfully confirmed by structural determinations of the ribosome machinery.

Okay, now that we have these sophisticated tools for comparing sequences, we can take all that information and use it to construct the grand map of life.

The evolutionary tree or phylogenetic tree.

Right, and the underlying assumption linking our data to time is simple.

The number of sequence differences between any two proteins should be proportional to the amount of time that has passed since their genes diverged.

So the construction process involves taking those deeply aligned sequences, myoglobin, hemoglobin alpha, hemoglobin beta, leghemoglobin, and calculating the statistical distance between them.

Then we create a branching diagram where the length of the branch reflects the number of amino acid differences.

And this process immediately reveals relative divergence times.

For example, by comparing the distances, we can conclude that the myoglobin gene diverged from the main hemoglobin line roughly twice as long ago as the alpha chain separated from the beta chain.
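
The first step of tree building, turning an alignment into pairwise distances, can be sketched like this. The three sequences are invented toy data with hypothetical names (not real globins), and counting raw differences is the simplest possible distance measure, used here only to illustrate the idea.

```python
from itertools import combinations


def count_differences(a: str, b: str) -> int:
    """Count aligned positions where both rows have a residue and they differ."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise difference counts for every pair of aligned sequences."""
    return {
        (m, n): count_differences(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Hypothetical aligned toy sequences.
toy_alignment = {
    "seq_A": "ACDEFGHIKL",
    "seq_B": "ACDEFGHIRV",
    "seq_C": "SCNEFGHIQM",
}
for pair, distance in distance_matrix(toy_alignment).items():
    print(pair, distance)
# ('seq_A', 'seq_B') 2
# ('seq_A', 'seq_C') 4
# ('seq_B', 'seq_C') 4
```

If differences accumulate roughly in proportion to time, seq_C in this toy would have split off about twice as long ago as seq_A and seq_B split from each other, which mirrors the myoglobin versus alpha/beta reasoning above.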

Right, but relative time isn't enough.

We want absolute dates.

How do you get those?

To assign real tangible dates, we have to calibrate the molecular clock by linking branch points to well-established divergence times derived from the fossil record.

The sources cite a fantastic example of this calibration using the hemoglobin alpha and beta chains.

They do.

The molecular analysis estimates that the gene duplication event leading to the creation of separate alpha and beta subunits occurred approximately 350 million years ago.

So how do we confirm that date?

We look at the fossil record.

Jawless fish, like the lamprey, diverged from the line that led to bony fish and mammals around 400 million years ago.

Lamprey hemoglobin is built from only a single type of subunit.

It hasn't undergone the alpha-beta duplication yet.

So the existence of single subunit hemoglobin in the lamprey, which diverged 400 million years ago, brackets the duplication event for the alpha-beta chains to some time between 400 and 350 million years ago.

That makes the molecular estimate highly robust.

It does.

That correlation between the molecular analysis and the physical fossil evidence makes the evolutionary tree far more reliable.

However, sometimes these evolutionary trees throw up branches that simply don't fit the expected model of vertical descent.

And that's often the signature of horizontal gene transfer, or HGT.

HGT is the exchange of DNA between species, not the typical parent to offspring or vertical transmission, but across species boundaries.

This transfer is typically rapid and often confers a significant selective advantage to the recipient.

The case study provided, Galdieria sulphuraria, is profoundly fascinating because it really challenges the classical ladder-like view of the tree of life.

Galdieria sulphuraria is a unicellular red alga, which places it firmly in the eukaryotic domain.

But it lives in incredibly stressful, extreme environments.

High heat, up to 56 degrees C, extremely low pH, and high concentrations of toxic heavy metals.

So it clearly needed some unique genetic tools to survive those conditions.

When its complete genome was sequenced, the researchers found a startling result.

Nearly 5% of its open reading frames encoded proteins that were far more closely related to bacterial or archaeal orthologs than to other eukaryotic ones.

Genes had been swapped from prokaryotes into this eukaryote. And the functions of those acquired genes?

They often directly conferred survival advantages in that harsh environment: genes for things like arsenate ion transport, which helps remove toxic metals.

The conclusion is strong.

HGT from prokaryotes likely facilitated this alga's spectacular adaptation to its extreme niche.
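One simple way to picture how such genes get spotted is to compare each open reading frame's best match among prokaryotic proteins with its best match among eukaryotic ones. The Python sketch below does that with invented scores. The `BestHits` record, the `min_ratio` threshold, and the ORF names are all assumptions made for illustration. This is not the method the researchers used, and real studies confirm candidates by building phylogenetic trees.

```python
from dataclasses import dataclass


@dataclass
class BestHits:
    orf_id: str
    prokaryote_score: float  # best alignment score vs. bacterial/archaeal proteins
    eukaryote_score: float   # best alignment score vs. other eukaryotic proteins


def flag_hgt_candidates(hits, min_ratio=1.5):
    """Return ORF ids whose best prokaryotic hit clearly beats the best eukaryotic hit."""
    flagged = []
    for h in hits:
        # An ORF with a prokaryotic hit but no eukaryotic hit (score 0) also qualifies.
        if h.prokaryote_score > 0 and h.prokaryote_score >= min_ratio * h.eukaryote_score:
            flagged.append(h.orf_id)
    return flagged


# Made-up scores for four hypothetical ORFs.
hits = [
    BestHits("orf_001", prokaryote_score=310.0, eukaryote_score=95.0),
    BestHits("orf_002", prokaryote_score=80.0, eukaryote_score=420.0),
    BestHits("orf_003", prokaryote_score=150.0, eukaryote_score=0.0),
    BestHits("orf_004", prokaryote_score=200.0, eukaryote_score=190.0),
]

candidates = flag_hgt_candidates(hits)
fraction = len(candidates) / len(hits)
print(candidates, f"{fraction:.0%} of ORFs flagged")
```

A ratio threshold like this only produces a shortlist. A gene counts as a transfer when its own tree places it inside a prokaryotic group, against the organism's overall history.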

This radically changes how we view evolution.

We often think of evolution as strictly linear, you know, moving up a ladder, but HGT shows us it's more of a network.

The overall evolutionary history of the organism might be one thing, but the history of an individual gene can be completely different.

Right, having taken a sudden horizontal leap across domains of life.

It's a powerful adaptation cheat code, allowing organisms to acquire fully functional optimized genes for survival without waiting millions of years for gradual mutation.

Everything we've discussed so far relies on inferring history from modern sequences.

But modern biochemical techniques now allow us to directly examine evolution, both by recovering the blueprints of extinct organisms and by, well, running evolution in real time in the lab.

Let's start with sequencing ancient DNA.

DNA is fragile, but it is chemically stable enough to survive for thousands of years under the right conditions, such as cold and dry environments.

And the combination of PCR and modern high-throughput sequencing means we can take these minute fragments of ancient DNA and amplify them to read the complete genetic blueprint of an extinct organism.

The first major breakthrough was the sequencing of mitochondrial DNA from a 38,000-year-old Neanderthal fossil.

This gave us a direct molecular comparison to modern humans.

And what did that initial mitochondrial DNA study reveal about our relationship with Neanderthals?

It showed between 201 and 234 substitutions compared to modern humans.

This is a significant difference, but it's far fewer than the approximately 1,500 differences you'd see between modern humans and chimpanzees.

So this positioned Neanderthals as a close but distinct cousin.
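Counting substitutions like this comes down to lining two sequences up and tallying the positions where they differ. Here is a small Python sketch of that tally. The three sequences are invented toy fragments, not real mitochondrial DNA, and the function name and gap convention are assumptions made for the example.

```python
def count_substitutions(seq_a, seq_b, gap="-"):
    """Count mismatched positions between two aligned, equal-length sequences.

    Positions where either sequence has a gap are skipped, so only
    base-for-base substitutions are counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != gap and b != gap
    )


# Toy aligned fragments, invented for illustration only.
modern_human = "ACGTTAGC-TAGGCTA"
neanderthal = "ACGTCAGCATAGGTTA"
chimpanzee = "ATGTCAGGATCGGTTC"

human_vs_neanderthal = count_substitutions(modern_human, neanderthal)
human_vs_chimpanzee = count_substitutions(modern_human, chimpanzee)
print(human_vs_neanderthal, human_vs_chimpanzee)
```

The toy data is set up so the human-to-chimpanzee count is the larger of the two, which mirrors the pattern described above without reproducing the real numbers.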

And more recently, researchers moved beyond mitochondrial DNA and sequenced the complete nuclear genomes of Neanderthal and the related Denisovan hominins, using even older fossils, some up to 50,000 years old.

This gave us the definitive timeline.

The genomic comparison established that the common ancestor of modern human beings and these hominins lived approximately 570,000 years ago.

And crucially.

Crucially.

The detailed evolutionary tree resulting from this data showed that the Neanderthal was not an intermediate form on the path to Homo sapiens.

Instead, they were an evolutionary side branch.

A dead end that became extinct.

It's just incredible that we can settle massive anthropological debates about which hominin was an ancestor and which was a cousin.

Not by digging up a new skull, but by running a complex sequencing machine in a lab.

It completely redefines what fossil evidence means.

It absolutely does.

The sources do add a vital note of caution though.

Earlier claims of sequencing far older DNA, like that from insects trapped in amber, were later found to be flawed due to the pervasive presence of contaminating modern DNA.

Successful ancient DNA work requires not just the sequencing tech, but extremely rigorous exclusion protocols to ensure we are only reading the extinct genome.

Now let's pivot from looking into the past to actively creating the future.

Using molecular evolution in the laboratory.

This relies on combinatorial chemistry.

This is literally running evolution in a test tube.

Evolution requires three key components.

One, you need to generate a diverse population.

Two, you need selection based on fitness.

And three, you need reproduction to enrich the fit members.

We can force nucleic acids to undergo all three in vitro.

And combinatorial chemistry is the process used to produce that vast, diverse starting population of molecules en masse.

And then we apply a selective pressure like the ability to bind a specific target.

Let's look at the classic experiment where researchers wanted to model the early RNA world before complex proteins existed.

The goal was to create an RNA molecule capable of binding the crucial energy molecule, ATP.

Creating complex specific function from random starting blocks.

This seems like an almost impossible task for a test tube.

The researchers began with an initial pool of 10 to the 14th RNA molecules.

To put that scale into perspective, that's 100 trillion molecules, each 169 nucleotides long, with 120 positions that were completely randomized.

That's massive diversity.
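It helps to put that diversity next to the size of the full sequence space. The short Python calculation below uses the two figures from the discussion, 120 randomized positions and a pool of 10 to the 14th molecules. It also assumes, generously, that every molecule in the pool is unique.

```python
RANDOM_POSITIONS = 120     # randomized positions per molecule (from the text)
POOL_SIZE = 10**14         # molecules in the starting pool (from the text)

# Four possible bases at each randomized position.
possible_sequences = 4**RANDOM_POSITIONS

# Upper bound on the share of sequence space the pool can cover,
# ASSUMING every molecule in the pool is a distinct sequence.
fraction_sampled = POOL_SIZE / possible_sequences

print(f"possible sequences ~ {possible_sequences:.2e}")
print(f"fraction sampled  <= {fraction_sampled:.1e}")
```

So even 100 trillion molecules sample only a vanishingly small corner of what is possible, and ATP binders still turned up in that corner.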

Okay, that's the population generation step.

How did they apply selection for ATP binding?

They passed the massive population through an ATP affinity column.

Only the few rare RNA molecules that, purely by chance, had a structure capable of binding ATP would stick to the column.

All the non -binders were washed away.

And then you get the winners off.

The bound molecules were then eluted or washed off using an excess of ATP.

So that isolates the fit members.

And the third step, reproduction and mutation.

The selected molecules were subjected to reverse transcription, PCR amplification, and then transcribed back into RNA.

Crucially, they used error-prone reverse transcriptase, an enzyme that naturally makes mistakes, fulfilling the mutation requirement with each cycle of amplification.
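The three steps, diversity, selection, and error-prone reproduction, can be mimicked in a toy simulation. The Python sketch below is only a cartoon of the experiment. The `TARGET` motif, the match-counting `fitness` function, the pool size of 500, and the mutation rate are all invented stand-ins, and nothing here models real RNA folding or ATP binding.

```python
import random

BASES = "ACGU"
TARGET = "ACGUACGUACGU"   # hypothetical "ideal binder", invented for this toy model


def fitness(seq):
    """Toy stand-in for binding strength: number of positions matching TARGET."""
    return sum(1 for a, b in zip(seq, TARGET) if a == b)


def mutate(seq, rate, rng):
    """Copy a sequence, redrawing each base at random with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else base for base in seq)


def selex_round(pool, keep_fraction, mutation_rate, rng):
    """One cycle: keep the best binders, then refill the pool with error-prone copies."""
    n_keep = max(1, int(len(pool) * keep_fraction))
    survivors = sorted(pool, key=fitness, reverse=True)[:n_keep]   # selection step
    new_pool = list(survivors)                                      # exact copies carry over
    while len(new_pool) < len(pool):                                # amplification with errors
        new_pool.append(mutate(rng.choice(survivors), mutation_rate, rng))
    return new_pool


rng = random.Random(0)   # fixed seed so repeated runs give the same result
pool = ["".join(rng.choice(BASES) for _ in TARGET) for _ in range(500)]

best_per_round = [max(map(fitness, pool))]
for _ in range(8):       # eight generations, as in the experiment described
    pool = selex_round(pool, keep_fraction=0.1, mutation_rate=0.05, rng=rng)
    best_per_round.append(max(map(fitness, pool)))

print(best_per_round)
```

Because the survivors are carried over unchanged, the best score in the pool can never fall from one round to the next. That is a design choice of this sketch, not a claim about the chemistry.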

How long did it take for this artificial selection pressure to yield results?

After just eight generations, eight cycles of selection and replication, the researchers analyzed the population.

They found 17 different sequences, 16 of which formed a conserved secondary structure and bound ATP with high affinity, with dissociation constants less than 50 micromolar.

The evolved molecules were highly specific and highly effective.
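A dissociation constant is easiest to read as the ligand concentration at which half of the binding sites are occupied. The Python sketch below shows that relationship for simple one-to-one binding. It assumes the ligand is in large excess over the RNA, and the `KD` value is an arbitrary placeholder rather than the measured constant.

```python
def fraction_bound(ligand_conc, kd):
    """Fraction of aptamer occupied by ligand for simple 1:1 binding.

    Both arguments must be in the same concentration units.
    """
    if ligand_conc < 0 or kd <= 0:
        raise ValueError("ligand_conc must be non-negative and kd must be positive")
    return ligand_conc / (kd + ligand_conc)


KD = 1.0  # arbitrary units; a placeholder, NOT the measured value
for ligand in (0.1, 1.0, 10.0):
    print(f"[ATP] = {ligand:>4} x Kd -> {fraction_bound(ligand, KD):.2f} bound")
```

The smaller the dissociation constant, the less ligand it takes to reach half occupancy, which is why a low value signals tight binding.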

And the resulting structure confirmed the optimization process.

Hmm.

NMR structural determination of the binding region showed that the 40-nucleotide segment folded into two helical regions separated by an 11-nucleotide loop.

And this loop folded back to create a deep custom pocket where the flat adenine ring of the ATP molecule fit perfectly.

So in the space of weeks, a complex specific functional structure evolved from pure randomness.

It did.

And these synthetic ligand-binding oligonucleotides are called aptamers.

They aren't just evolutionary curiosities.

They're immensely useful in the real world.

Right.

They show great promise in diagnostics, acting as sensors for everything, from small drug molecules like cocaine to clinically relevant proteins like thrombin.

They can be engineered to bind virtually anything.

They can.

And they've moved into therapeutics, too.

The source mentions Macugen.

What's that?

Macugen, or pegaptanib sodium, is a successful example of directed molecular evolution resulting in an approved medical treatment.

It's an aptamer designed to inhibit vascular endothelial growth factor, or VEGF, and it is used to treat age-related macular degeneration by preventing unwanted blood vessel growth in the eye.

To quickly recap this deep dive into molecular evolution and bioinformatics, we've covered immense ground, moving from single-point mutations all the way to mapping the history of life.

We began by defining homology, establishing the key difference between orthologs, which are same function, different species, and paralogs, which are different functions, same species, arising from duplication.

We detailed the mechanics of sequence alignment, noting the critical role of gaps in scoring systems.

We saw how the statistical rigor provided by methods like shuffling and the nuanced scoring of the Blosum-62 substitution matrix allows us to detect increasingly distant relationships by giving evolutionary credit to conservative substitutions.

We learned that tertiary structure is the most conserved feature, often confirming deep kinship, like with the globins, even when sequence similarity registers as statistically insignificant.

Right.

And structural comparison is also key to distinguishing divergent evolution, like the TBP gene duplication, from convergent evolution, where unrelated proteins arrive at the same solution, as we saw in chymotrypsin and subtilisin.

And finally, we highlighted the modern techniques that make this field so dynamic: calibrating evolutionary trees with physical fossil records to assign concrete dates, identifying unexpected genetic mixing via horizontal gene transfer in organisms like Galdieria sulphuraria, and the remarkable feat of performing evolution in a test tube through combinatorial chemistry to rapidly create highly functional aptamers.

All of this analysis, powered by biochemical comparison and computational muscle, gives us an unprecedented ability to decode the history of life one sequence at a time.

So we leave you with one final thought, something to mull over that ties into those spectacular lab evolution experiments we discussed, like the RNA aptamer that developed a perfect binding pocket for ATP in just eight generations of lab selection.

If complex functional structures can evolve and optimize so rapidly in vitro, what does this tell us about the speed and inherent efficiency with which life's fundamental molecules must have first emerged billions of years ago?

It suggests that the molecular engine of evolution is not merely powerful, but perhaps inherently predisposed to finding optimized solutions with startling speed.

A truly fascinating perspective on life's rapid deep history.

Thank you for joining us for this deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Evolutionary biochemistry and bioinformatics together provide powerful tools for reconstructing biological history and inferring protein function from molecular data. The foundation rests on understanding homology relationships, specifically distinguishing between paralogs that arise from gene duplication within a single organism and often acquire divergent functions, and orthologs that exist across different species and typically maintain equivalent roles. Detecting these relationships requires sophisticated sequence alignment methods grounded in statistical rigor, where techniques like sequence shuffling validate whether observed similarities reflect genuine evolutionary relationships or emerge from random chance. Substitution matrices such as Blosum-62 refine alignment scoring by recognizing that replacing an amino acid with a chemically similar residue represents a more plausible evolutionary step than a nonconservative replacement. Database search algorithms like BLAST accelerate the identification of homologous sequences genome-wide and quantify the statistical probability of spurious matches. A central insight is that three-dimensional structure proves far more evolutionarily stable than amino acid sequence, a principle demonstrated through structural conservation among functionally diverse globins and the surprising paralogy linking actin to heat shock protein 70. Convergent evolution illustrates how organisms independently solve identical catalytic problems through distinct structural architectures, exemplified by chymotrypsin and subtilisin both employing the same catalytic triad despite fundamentally different protein folds. Secondary structure analysis of RNA exploits conserved Watson-Crick base pairing patterns preserved across homologous sequences to predict folding. Evolutionary trees constructed from sequence divergence gain chronological anchoring through fossil calibration points.
Horizontal gene transfer emerges as a major evolutionary mechanism, with the extremophile red alga Galdieria sulphuraria demonstrating how acquiring bacterial genes enables survival in harsh geothermal environments. Modern experimental techniques span ancient DNA recovery from Neanderthal remains through polymerase chain reaction amplification, enabling direct mapping of human evolutionary ancestry, to laboratory evolution methods like SELEX that generate aptamers through iterative rounds of selection and amplification, effectively simulating molecular evolution in vitro.


Related Chapters