Chapter 23: Genomics II: Functional Genomics, Proteomics, and Bioinformatics

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

Imagine we've been given the complete blueprint of an entire city, every road, every building, every single structure meticulously drawn out.

Right.

That's the challenge.

The most challenging genome sequencing projects have gifted us the foundational map, the entire book of life for a species.

It's an incredible amount of data.

It really is.

But what if we want to know how that city actually functions?

Who lives where?

What vital roles do they play?

How do all those structures interact, you know, to create a living, breathing metropolis?

That's the real challenge.

And it's where today's deep dive begins.

We're moving beyond just the map to understand the dynamic life within.

What's fascinating here is that we're truly stepping into the next big frontier of genetics.

Yeah.

It's about moving beyond simply reading the book of life to genuinely understanding the stories and instructions within it.

Right.

This deep dive, and we're drawing from the brilliant insights of Genetics: Analysis and Principles by Robert J. Brooker here, will explore the innovative experimental and computational approaches researchers use to decode the functions of DNA, RNA, and proteins.

So three main areas, then.

Exactly.

We'll focus on functional genomics, basically understanding what genes do.

Okay.

Proteomics, unraveling the vast world of proteins.

The workhorses.

And bioinformatics, the powerful computational tools that, well, knit it all together.

And this deep dive, it's really your shortcut to being genuinely well informed about the cutting edge of genetic research, whether you're catching up on a field or just, you know, insanely curious about how we make sense of our vast genetic blueprint.

Which many people are.

Exactly.

You'll walk away with essential insights and some truly surprising facts.

So if genomics mapped the city for us, today's deep dive is about deploying thousands of tiny sensors to see which businesses are open, which lights are on, where the crucial traffic is flowing.

It's all about uncovering function and purpose.

Indeed.

We're answering fundamental questions about how biological systems operate at an unprecedented scale, thanks to these ingenious methods.

So let's start by peeling back the layers with functional genomics.

Deciphering the roles of genetic sequences.

The core goal here, as you said, is to understand the specific roles of DNA and RNA sequences, particularly gene function, right?

That's it.

And it's not just about one gene at a time anymore.

It's about looking at genes in groups to understand complex metabolic pathways and how their products interact within a cell.

How did we even start to tackle that scale?

Well, one of the foundational ways we started to approach this on a large scale was with the DNA microarray, often called a gene chip.

The gene chip.

What's the sort of revolutionary insight this technology offered?

Imagine being able to instantly see which of your, say, 20,000 genes are on or off in a specific cell type all at once.

Wow, okay.

That's power.

DNA microarrays offered a high-throughput snapshot of cellular activity.

This allowed scientists to spot subtle shifts, say, in cancer cells that would have taken years, literally years, to find gene by gene.

And how do they manage to capture that snapshot?

It sounds almost like magic.

Ha, well, at its core, it's quite clever.

You prepare all the active RNA messages from your sample cells, the mRNA, and you tag them with a fluorescent dye.

Okay, tag them.

Then you wash these labeled molecules over a tiny slide, a chip, that's preloaded with tens of thousands of known DNA sequences.

Each spot on the chip acts like a specific beacon for one gene.

Ah, so the RNA message finds its matching DNA beacon.

Exactly.

Where your labeled RNA messages stick or hybridize and glow, you know those genes were active, were being transcribed in your original cell sample.

And the brightness matters.

Definitely.

The brighter the spot, the more RNA stuck there, meaning the more active that gene was.

That sounds incredibly powerful for understanding gene expression patterns.

What's a real-world example of its impact?

Like, where has this made a difference?

Oh, absolutely.

One of its most impactful applications is in tumor profiling.

For instance, researchers compared healthy liver cells to liver cancer cells using microarrays.

Okay.

The results showed 77 specific spots were much, much brighter in the cancer cells.

So 77 genes, overactive?

Significantly overexpressed, yeah.

Yeah.

And this wasn't just a random finding.

It directly pointed to genes likely contributing to cancerous growth.

That helps clinicians categorize tumor types, predict outcomes, maybe even guide treatment.
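
As a rough illustration of how spot brightness becomes a list of overexpressed genes, here is a minimal Python sketch. The gene names, the intensity values, and the twofold cutoff are all invented for illustration; none of them come from the liver study described above, and real analyses also normalize the arrays and test for statistical significance.

```python
import math

# Hypothetical fluorescence intensities (arbitrary units) for the same spots
# on two arrays. Gene names and numbers are invented for illustration.
healthy = {"geneA": 200.0, "geneB": 150.0, "geneC": 800.0}
cancer = {"geneA": 1600.0, "geneB": 160.0, "geneC": 100.0}


def log2_fold_changes(test, reference):
    """Return log2(test / reference) for every gene present in both samples."""
    return {g: math.log2(test[g] / reference[g]) for g in test if g in reference}


def overexpressed(test, reference, min_log2_fc=1.0):
    """Genes whose spot is at least 2**min_log2_fc times brighter in `test`."""
    changes = log2_fold_changes(test, reference)
    return sorted(g for g, fc in changes.items() if fc >= min_log2_fc)


print(overexpressed(cancer, healthy))
```

Working on a log2 scale is a common convention because a doubling and a halving then sit symmetrically at +1 and -1.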

That's huge.

It is.

And microarrays also help identify genetic variations like mutations and even specific microbial strains.

Lots of uses.

Fascinating how a, like you said, postage stamp-sized chip can tell us so much about disease.

Now, we know that genes are regulated by proteins binding to DNA.

That's fundamental.

How do scientists actually pinpoint where those proteins attach themselves to our genetic code, but like in a living cell?

Right.

Because doing it in a test tube is one thing.

Yeah.

It's not the real environment.

Exactly.

That brings us to chromatin immunoprecipitation, or ChIP.

Proteins binding to specific DNA sites are absolutely crucial for processes like gene transcription, replication, so many things.

What's truly innovative about ChIP is that it lets us analyze these interactions in vivo, in living cells.

That's a huge leap beyond older in vitro methods that couldn't capture this dynamic real-time activity.

So it's like catching them in the act, inside the bustling city.

Precisely.

You essentially use a chemical, often formaldehyde, to sort of freeze or cross-link the proteins onto the DNA they're touching right at that moment in the living cells.

Okay, freeze frame.

Then you break open the cells and chop up the DNA into smaller, manageable pieces.

The key step is using an antibody.

Like a guided missile?

Sort of.

More like a highly specific fishing hook that recognizes and binds only to your protein of interest.

This antibody is usually attached to heavy beads, so you can pull it, and whatever it's attached to, namely our protein-DNA complex, out of the cellular soup.

Immunoprecipitation.

Got it.

Then you reverse the cross-links, get rid of the protein, and you're left with just the DNA fragments that protein was bound to.

You amplify these DNA pieces using PCR.

Make lots of copies.

And identify them.

Now, if you want to map where this protein binds across the entire genome, not just one or two spots, you'd use a ChIP-chip assay.

You take that amplified DNA, label it fluorescently, and wash it over a DNA microarray.

The spots that light up tell you the genomic locations where your protein was bound.

Amazing.

So you can map the protein's parking spots across the whole genome.

That's a good way to put it.

Now, if we think about the bustling activity of the city, all those messages being sent around, what about getting a full comprehensive picture of all the RNA messages in a cell?

Not just on or off, but like everything.

That's where RNA sequencing, or RNA-seq, comes in.

It's a newer, more powerful technique.

The transcriptome is the term for the complete set of all RNA molecules present in a cell at a given time, both the messenger RNAs that code for proteins and the non-coding RNAs, which have their own important roles.

So RNA-seq maps the transcriptome.

Exactly.

It was developed by Michael Snyder and his colleagues back in 2008, and it gives us an incredibly detailed and dynamic picture of that whole RNA landscape.

So it's more than just an upgrade from microarrays.

It's like a completely different level of insight.

Absolutely.

RNA-seq wasn't just an incremental improvement.

It was more like moving from a blurry photograph to a high-definition video of cellular activity.

Okay.

I like that analogy.

The key insight is that we can now capture the true dynamic complexity of gene expression.

We can quantify RNA levels much more accurately than with microarrays, detect rare transcripts, find new splice variants of genes, precisely map where transcription starts and stops, things microarrays often struggled with.

Wow.

This gives us an unprecedented, precise look at a cell's operating system, whether we're comparing different cell types, healthy versus diseased cells, or how cells respond to environmental factors like hormones or toxins.

And how does it achieve that level of detail?

What's the basic process?

Well, you start by isolating the RNA from your cells.

You can even select for certain types, like mRNA.

Then you usually fragment the RNA into smaller pieces.

Okay.

These RNA fragments are converted into more stable DNA copies called cDNA using an enzyme called reverse transcriptase.

Right.

RNA to DNA.

Crucially, these cDNA copies are then directly sequenced using next-generation DNA sequencing technologies, the same kind of tech used for sequencing whole genomes, but applied to the RNA messages.

So you get millions of short sequence reads.

Exactly.

And finally, powerful computer algorithms align these sequence reads back to the known genomic sequence of the organism.

This builds a comprehensive map showing which genes were transcribed, at what levels, and even how they were spliced.
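
Once reads have been aligned, "at what levels" comes down to counting reads per gene and scaling for sequencing depth. The sketch below shows one simple scaling, counts per million, under stated assumptions: the gene names and read counts are made up, and real pipelines usually also correct for gene length and other biases.

```python
def counts_per_million(read_counts):
    """Scale raw per-gene read counts so that the sample sums to one million."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no aligned reads")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical numbers of reads aligned to each gene in one sample.
aligned = {"geneA": 500, "geneB": 1500, "geneC": 0, "geneD": 8000}
cpm = counts_per_million(aligned)
print(cpm["geneB"])
```

Scaling this way lets two samples sequenced to different depths be compared gene by gene.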

That's a huge leap in understanding the dynamics.

Now, if we want to truly understand what a gene does, sometimes the best way, maybe the most direct way, is to see what happens when it doesn't do anything.

Tell us about gene knockouts.

Right.

The ultimate goal here in many large-scale projects is to determine the functions of all the genes in a species' genome.

It's incredibly ambitious.

How do you do that?

A gene knockout is simply altering a gene, usually by inserting something disruptive or deleting part of it, to inactivate its function.

You break it.

Break it to understand it.

Precisely.

By observing the resulting changes, or the phenotype in the organism, for example, a mouse becoming deaf after a specific gene is inactivated, scientists can infer the normal function of that gene.

In this case, it's likely involved in hearing.

So it's about seeing the critical role a gene plays by removing it from the equation.

Exactly.

Researchers can create knockouts of multiple genes to study complex cellular pathways, or even model inherited human diseases like sickle cell disease in mice to study potential therapies.

How are these knockouts actually made?

It sounds tricky.

It used to be very difficult, but we have much better tools now.

Older methods often used things called transposable elements, or jumping genes, to disrupt function.

But more recently, the revolutionary CRISPR-Cas system has made gene editing, including creating knockouts, much more precise and efficient.

CRISPR, right.

I know there are massive collaborative efforts to do this, aren't there?

Making knockouts for thousands of genes.

Yes.

A prime example is the National Institutes of Health's Knockout Mouse Project, which started back in 2006.

In collaboration with international efforts, its ambitious goal is to create at least one loss-of-function mutation, a knockout, in each of the roughly 22,000 protein-encoding genes in the mouse genome.

Every single one.

Wow.

It's a huge undertaking.

And similar knockout collections are also available for other vital model organisms like E. coli, the yeast S. cerevisiae, and the worm C. elegans.

These resources are invaluable for researchers worldwide.

That's monumental, but the insights gained must be incredible.

Now, we've talked about the city's blueprint, the genes, and the messages buzzing around, the RNA, but the real workhorses, the dynamic structures that carry out most of the daily operations in our cellular city, are the proteins.

Let's dive into proteomics, the dynamic world of proteins.

Right.

This is critical because while understanding the genes gives us the blueprint, proteins are the primary functional molecules.

They do the work.

The proteome is defined as the entire collection of proteins a given cell or organism makes.

All the proteins.

Yeah.

And proteomics is the study of their functions, structures, interactions, everything about them.

Okay.

Here's where it gets really interesting for me.

If the genome is all the genes, why is the proteome, all the proteins, so much larger and more complex than the genome?

You'd think they'd be like a one-to-one match or close to it.

That's a key aha moment for many.

And it's down to several really remarkable biological phenomena.

First, and probably most important, especially in complex organisms like us, is alternative splicing.

Okay.

What's that?

Think of it this way.

One original genetic recipe, the pre-mRNA, but it comes with different sets of instructions for cutting and pasting it together.

The cell can splice that initial RNA message in multiple ways.

Cutting out different bits.

Exactly.

Cutting out different introns, maybe even skipping some exons.

This creates different mature mRNA molecules from the same gene.

Each of these different mRNAs can then be translated into a distinct protein version, maybe with slightly different functions.
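
To see how quickly one gene can yield several messages, here is a toy Python sketch that lists the mature mRNAs obtained by keeping or skipping each internal exon. It rests on simplifying assumptions that are mine, not the chapter's: the first and last exons are always kept, exon order never changes, and the exon sequences are invented. Real cells produce only the regulated subset of these combinations.

```python
from itertools import combinations


def splice_isoforms(exons):
    """Yield mature mRNAs that keep the first and last exon plus any subset
    of the internal exons, always in their original order."""
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            yield "".join((first, *kept, last))


# Invented exon sequences for a hypothetical four-exon gene.
exons = ["AUGGCU", "GAAUUC", "CCGGGA", "UAA"]
isoforms = list(splice_isoforms(exons))
print(len(isoforms))
```

With n skippable internal exons this simple model gives 2**n possible messages, which is why splicing adds so much diversity.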

So one gene can produce many different proteins.

That's a huge source of diversity, isn't it?

It's massive.

It's like having a single architectural plan that can be adapted to build, I don't know, a house, an office, or a factory, each with a distinct function, just by changing how you assemble the pieces.

Okay, that makes sense.

What else adds complexity?

Well, less common, but still contributing, is RNA editing.

This is where subtle changes are made to the actual nucleotide sequence of an RNA molecule after it's been transcribed.

That can also alter the final protein.

But maybe the biggest contributor to proteome complexity beyond splicing is post-translational covalent modification.

Post-translation, so changes after the protein is made.

Exactly.

After the ribosome synthesizes the polypeptide chain.

Some modifications are permanent and necessary for function, like cutting a protein into smaller active pieces, or forming disulfide bonds to stabilize structure, or attaching things like sugars or lipids.

Essential finishing touches.

Right.

But many other modifications are incredibly important reversible changes.

Think of things like adding or removing phosphate groups.

Phosphorylation.

Ah, yes, phosphorylation.

Or adding acetyl groups, acetylation, or methyl groups, methylation.

These act like on-off switches, or maybe dimmer switches, transiently affecting a protein's function, its location in the cell, or who it interacts with.

So the same protein can be switched on, or off, or tuned.

Precisely.

A single protein can exist in many different modified states within a cell at any given moment, each potentially performing a slightly different role, or being active under different conditions.

This massively increases the functional diversity of the proteome.

That truly explains why the proteome is so much more intricate than just the number of genes.

Okay, so with literally thousands of different proteins in a cell, how do scientists actually separate and identify them to study this complexity?

It sounds like finding needles in a haystack.

It is a formidable challenge, yeah.

And one of the classic but still powerful separation techniques is two-dimensional, or 2D, gel electrophoresis.

Two dimensions implies two steps, right?

Exactly.

It's a clever two-step separation process designed to handle really complex mixtures.

First, the proteins are loaded onto a narrow tube gel and separated based on their net electrical charge.

They migrate through the gel until they reach the point where their net charge is zero.

That's called their isoelectric point.

Okay, separation by charge first.

Then that tube gel, with the proteins now spread out by charge, is laid horizontally onto a flat, rectangular slab gel.

And the proteins are separated a second time, perpendicular to the first direction, but this time based on their molecular mass or size.

Smaller proteins move faster through the gel matrix.

So charge first, then size.

Right.

And the result is this gel with a unique map, potentially hundreds or even thousands of distinct spots scattered across it.

Each spot, ideally, represents a unique cellular protein.

Its resolving power is extraordinary.

It can even distinguish proteins that differ by just a single charged amino acid.

That's impressive.

So once you have these thousands of spots on your 2D gel map, how do you actually identify what protein is in each spot?

Ah, that's where mass spectrometry comes in.

Specifically, a technique called tandem mass spectrometry, or MS/MS.

The goal is to take one of those spots from the 2D gel, which contains a purified protein, hopefully, and figure out what it is.

Okay, zoom in on one spot.

You carefully excise that spot, extract the tiny amount of protein, and then use enzymes, like trypsin, to digest it.

Chop it up into smaller peptide fragments.

So you're essentially breaking it down into smaller, more manageable, identifiable pieces.

Exactly.

Then these peptides go into the mass spectrometer.

In the first mass spec step, MS1, the instrument measures the precise mass-to-charge ratio of these intact peptides very accurately.

Okay, get the mass of the pieces.

Then the instrument operator, or an automated system, selects a specific peptide ion of interest, isolates it, and fragments it further inside the mass spectrometer, usually by colliding it with gas molecules.

Breaking the pieces into even smaller pieces.

Yes.

And the second mass spec step, MS2, measures the masses of these subfragments.

Now here's the clever part.

Proteins are chains of amino acids, and we know the precise masses of all 20 standard amino acids.

So the differences in mass between the subfragments reveal the sequence of amino acids in that original peptide.
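
The mass-difference idea can be sketched in a few lines of Python. This is an idealized illustration, not how real MS/MS software works: it assumes a clean, complete ladder of singly charged fragments that each differ by exactly one residue, and the example ladder (starting at an arbitrary 100.0) is invented. The residue masses are standard monoisotopic values in daltons; leucine and isoleucine have the same mass, so the sketch reports both as L.

```python
# Monoisotopic residue masses in daltons (L also stands in for isoleucine).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(fragment_masses, tolerance=0.01):
    """Infer residues from the mass gaps between successive fragment ions.

    Gaps that match no residue within `tolerance` are reported as '?'.
    """
    masses = sorted(fragment_masses)
    residues = []
    for lighter, heavier in zip(masses, masses[1:]):
        gap = heavier - lighter
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(gap - m) <= tolerance]
        residues.append(matches[0] if matches else "?")
    return "".join(residues)


# Invented ladder: each step adds one residue (G, then A, then S).
ladder = [100.0, 157.02146, 228.05857, 315.09060]
print(read_ladder(ladder))
```

Note how tight the tolerance has to be: glutamine and lysine differ by only about 0.04 Da, which is why the instrument's mass accuracy matters so much.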

Like solving a puzzle based on the weights of the pieces.

Exactly.

You get these short stretches of amino acid sequence from several different peptides originating from your protein spot.

These sequences are then fed into computer software that searches vast protein sequence databases.

Like the ones we talked about earlier.

GenBank, Swiss-Prot.

Well, protein databases, like Swiss-Prot or NCBI's protein database.

The software looks for matches and identifies the full complete protein that these peptides came from.

And can it detect those modifications you mentioned, like phosphorylation?

Oh, absolutely.

Mass spectrometry is fantastic for that.

A modification, like adding a phosphate group, changes the peptide's mass by a specific known amount.

The mass spectrometer detects this mass shift, telling you not only the peptide sequence, but also that it was modified, and often where on the peptide the modification occurred.
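
Here is a small Python sketch of that mass-shift reasoning for phosphorylation. One phosphate group adds about 79.966 Da (HPO3) to a peptide. The function name, the tolerance, the cap of three sites, and the example masses are all assumptions made for illustration.

```python
PHOSPHO_SHIFT = 79.96633  # Da added by one phosphate group (HPO3), monoisotopic


def count_phosphates(observed_mass, unmodified_mass, tolerance=0.01, max_sites=3):
    """Return how many phosphate groups explain the observed mass, or None.

    Tries 0..max_sites phosphates and accepts the first count whose predicted
    mass lies within `tolerance` daltons of the observed mass.
    """
    for n in range(max_sites + 1):
        predicted = unmodified_mass + n * PHOSPHO_SHIFT
        if abs(observed_mass - predicted) <= tolerance:
            return n
    return None


# Invented peptide masses: the second is one phosphate heavier than the first.
print(count_phosphates(1079.96633, 1000.0))
```

Locating the modified residue works the same way, except the shift shows up in the gaps between MS2 subfragments rather than in the whole-peptide mass.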

That's truly precise work.

It makes me wonder, if we can map DNA activity with microarrays, can we do something similar for proteins, a protein chip?

Yes, protein microarrays exist, following a similar principle.

But they are significantly more challenging to develop and use reliably than DNA microarrays.

Why is that?

Proteins are just fussier.

They're far more fragile than DNA.

They need to maintain their complex 3D structures to be functional, and purifying thousands of different proteins in a way that keeps them active is a huge hurdle.

Okay, so more technically demanding, what are the main types, and what can they reveal, assuming you can get them to work?

There are two common types.

First, you have antibody microarrays.

Here, instead of spotting the proteins themselves, you spot highly specific antibodies onto the array.

You then take your protein sample, label the proteins fluorescently, and wash it over the array.

The antibodies capture their specific target proteins.

So it measures protein amount.

Exactly.

The level of fluorescence at each spot tells you how much of that specific protein was present in your sample.

It's used for quantifying protein expression levels.

Okay, and the second type?

The second type is functional protein microarrays.

In this case, purified proteins themselves are spotted onto the array, and you try to keep them active.

These arrays allow researchers to analyze specific protein functions directly on the chip.

Like what kind of functions?

For instance, Heng Zhu and Michael Snyder, pioneers in this area, created arrays with yeast protein kinases, enzymes that add phosphate groups.

They used these arrays to determine which kinases phosphorylate which other proteins, essentially mapping out a critical part of the cell's signaling network.

In another study, they spotted thousands of different yeast proteins onto an array and exposed it to labeled calmodulin, an important calcium-binding signaling protein.

This allowed them to identify many new proteins that calmodulin interacts with.

So protein microarrays can reveal not just how much protein is there, but potentially what it's doing and what it interacts with.

That's a powerful way to study the cell's active machinery.

Absolutely.

They have diverse applications, measuring protein expression, analyzing enzymatic activities, identifying protein interactions, even screening for drug binding in pharmaceutical research, although the challenges remain significant.

This has been an incredible journey through the experimental techniques, the wet lab stuff, but you mentioned earlier that none of this would be possible without the massive data crunching behind the scenes.

Let's move on to bioinformatics, the computational engine of discovery.

Yes, bioinformatics is absolutely essential.

It's the crucial interdisciplinary field that uses computers, mathematical tools, and statistical techniques to record, store, analyze, and interpret these vast amounts of biological information.

So it's not just sequences.

Oh, no.

We're talking DNA, RNA, and protein sequences, of course, but also data from microarrays, mass spectrometry, protein structures, patient clinical data, even scientific literature.

It truly incorporates principles from math, stats, computer science, chemistry, physics, all focused on making sense of biological data.

So it's the brain that interprets all this raw data from our cellular city sensors.

How do computers actually analyze these massive sequence files?

It seems overwhelming.

Well, genetic data, DNA, RNA, protein sequences is perfectly suited for computer analysis because it's inherently digital, you know, a sequence of letters.

Data is stored in computer data files, often entered directly by laboratory instruments like DNA sequencers.

And the speed must be incredible.

Phenomenal.

Computers can investigate this data at speeds of millions or even billions of operations per second.

This makes analyses that would be literally impossible to do by hand or would take lifetimes feasible in minutes or hours.

What kind of questions specifically can these programs answer for us from a raw sequence?

A huge range.

Basic things like, does this stretch of DNA contain a gene?

If so, where does it start and end?

Where are the important functional sequences located, like promoters or regulatory sites?

If it encodes a polypeptide, what's its amino acid sequence?

And critically, is this sequence homologous, meaning evolutionarily related, to other known sequences?

What can that tell us about its function or history?

Can you give an example, like translating DNA?

Sure.

Translating DNA into protein sequence manually is tedious and error-prone.

You have to consider the three possible reading frames.

A computer program can do this almost instantly.

It translates the DNA sequence in all three forward reading frames, identifies potential start and stop codons, and finds the longest open reading frame, or ORF, a stretch of codons without any stop signals.
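
A minimal Python sketch of that ORF search follows. It makes some assumptions of its own: an ORF here runs from an ATG to the first in-frame stop codon, only the three forward frames are scanned (as described above), and the example sequence is invented. Real tools also scan the complementary strand and apply minimum-length cutoffs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return (frame, start_index, orf_sequence) for the longest ATG-to-stop
    stretch in the three forward reading frames, or None if there is none."""
    dna = dna.upper()
    best = None
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]  # include the stop codon
                if best is None or len(orf) > len(best[2]):
                    best = (frame, start, orf)
                start = None
    return best


# Invented sequence with a short and a longer ORF in the same frame.
seq = "CCATGGCTTGATAAATGAAACCCGGGTTTTAGCC"
print(longest_orf(seq))
```

The returned frame and start index say where the candidate coding region sits, which is the information a person would otherwise have to work out codon by codon.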

And that ORF often corresponds to the protein coding part.

Very often, yes, especially in simpler organisms like bacteria.

It offers massive speed and accuracy advantages over doing it by eye.

Yeah, are these tools like locked away in research labs?

Not at all.

Remarkably, many incredibly powerful bioinformatics programs and databases are freely available online.

The National Center for Biotechnology Information, or NCBI, in the US is a fantastic resource for this.

It really democratizes access to cutting-edge research tools.

That's a real game changer for science everywhere.

Now, how do these computational strategies actually identify those functional genetic sequences within the vast stretches of, say, a chromosome?

What are they looking for?

There are a few main strategies computers use.

One is sequence recognition.

Here, the program is essentially given a dictionary of known words or phrases, predefined sequence elements or motifs that have known biological meanings.

Like what?

Things like the TATA box in a promoter, specific transcription factor binding sites, start and stop codons, splice site consensus sequences, polyadenylation signals.

The program scans the long DNA sequence and flags everywhere it finds these known elements.

So it's like searching for specific predefined keywords in a very long document.

Exactly.

It identifies these known, specialized sequences.
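
The keyword-search idea can be sketched with regular expressions in Python. The motif "dictionary" below is deliberately tiny and simplified, and it is my own illustration: real consensus sequences are more variable, and real tools usually score matches rather than require exact hits. The test sequence is invented.

```python
import re

# Simplified, illustrative motifs written as regular expressions.
MOTIFS = {
    "TATA box": r"TATA[AT]A",
    "start codon": r"ATG",
    "poly(A) signal": r"AATAAA",
}


def find_motifs(dna, motifs=MOTIFS):
    """Return a sorted list of (position, name, matched_text) for every hit."""
    dna = dna.upper()
    hits = []
    for name, pattern in motifs.items():
        # The lookahead wrapper lets overlapping occurrences be reported too.
        for m in re.finditer(f"(?=({pattern}))", dna):
            hits.append((m.start(), name, m.group(1)))
    return sorted(hits)


for hit in find_motifs("GGTATAAAGGCATGCCAATAAAGG"):
    print(hit)
```

Adding a new element to look for only means adding one more entry to the dictionary, which is the appeal of this strategy.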

Another strategy is pattern recognition.

In this case, the program isn't looking for specific known sequences, but rather for statistically unusual patterns of symbols.

Like a repeated sequence.

Could be.

Or maybe a region with an unusual composition of Gs and Cs.

Or perhaps a palindromic sequence that reads the same forwards and backwards.

The program identifies these patterns without necessarily knowing their function beforehand.
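
Two of those patterns are easy to sketch in Python: unusual G+C composition and palindromes. The window size and threshold are arbitrary choices for illustration, and "palindromic" is taken in the usual DNA sense, meaning the site reads the same 5' to 3' on both complementary strands (it equals its own reverse complement).

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_rich_windows(dna, window=8, threshold=0.75):
    """Start positions of windows whose G+C fraction is at least `threshold`."""
    return [i for i in range(len(dna) - window + 1)
            if gc_content(dna[i:i + window]) >= threshold]


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindromic(site):
    """True if the site equals its own reverse complement, e.g. GAATTC."""
    site = site.upper()
    return site == site.translate(COMPLEMENT)[::-1]


print(gc_rich_windows("ATATGCGCGCGCATAT", window=8, threshold=1.0))
print(is_palindromic("GAATTC"))
```

Neither function knows what the flagged region does; it only reports that the region looks statistically or structurally unusual.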

And finally, programs can look for the organization of sequences.

Meaning how the pieces fit together.

Precisely.

Because the arrangement matters.

A promoter element followed by a start codon, followed by an ORF, followed by a stop codon.

That specific organization strongly suggests a gene.

So the program looks for characteristic arrangements of elements or patterns.

So it's not just about what individual elements are there, but how they're structured and ordered.

Exactly.

These computational approaches help us find all sorts of critical signals buried within the genome.

Beyond just finding these elements, can computers actually predict entire genes within a long, uncharacterized DNA sequence?

Yes.

Gene prediction is a major goal of bioinformatics.

The idea is to identify regions of genomic DNA that encode genes.

Now, RNA-seq, as we discussed, is great for finding genes that are expressed under certain conditions.

Right.

The active ones.

But computer programs aim to predict all potential genes in a genome.

Even those that might be silent or expressed only rarely.

How do they accomplish that prediction?

Again, several strategies.

Search by signal involves looking for those characteristic gene signals we just talked about.

Promoters, start and stop codons, splice sites in eukaryotes, terminators.

Finding these in the right order and spacing is a strong indicator.

Then there's search by content.

This looks for statistical properties of protein-coding regions that differ from non-coding DNA.

One key property is codon bias.

Most amino acids can be coded by more than one codon, right?

Organisms often show a preference, using certain codons for a given amino acid much more frequently than others.

Protein-coding regions tend to have this non-random codon usage pattern, which computers can detect.

Interesting, like a dialect.

Sort of, yeah.
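
Codon bias is straightforward to measure once a sequence is read in frame. The Python sketch below counts codons and reports how often each of the six leucine codons is used. The coding sequence is invented and far too short to be meaningful; real gene finders compare such frequencies against usage tables built from many known genes of the organism.

```python
from collections import Counter


def codon_usage(coding_dna):
    """Count each codon of an in-frame coding sequence (a trailing partial
    codon is ignored)."""
    coding_dna = coding_dna.upper()
    return Counter(coding_dna[i:i + 3] for i in range(0, len(coding_dna) - 2, 3))


def relative_usage(coding_dna, synonymous_codons):
    """Fraction of the time each listed codon is chosen for its amino acid."""
    counts = codon_usage(coding_dna)
    total = sum(counts[c] for c in synonymous_codons)
    return {c: counts[c] / total for c in synonymous_codons} if total else {}


LEUCINE = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]
seq = "ATGCTGCTGTTACTGCTGCTCTAA"  # invented: ATG, then six leucine codons, then TAA
usage = relative_usage(seq, LEUCINE)
print(usage)
```

A region whose codon frequencies match the organism's known preferences is more likely to be protein-coding than one whose triplets look random.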

And finally, especially in bacteria where genes usually lack introns, simply identifying long open reading frames, ORFs, those stretches of codons uninterrupted by stop codons, is a very powerful way to find potential protein-encoding genes.

In eukaryotes, introns complicate this because they interrupt the ORF.

It sounds incredibly powerful, these prediction tools, but I imagine they're not perfect.

They are predictions, not certainties.

That's a really important caveat, absolutely.

While these computational methods are powerful and constantly improving, they are not always 100% accurate.

They might misidentify the true start codon, or get the exact boundaries of introns and exons wrong, and sometimes they might even flag a region as a gene when it isn't, a false positive, or miss one that's there, a false negative.

So lab work is still needed.

Definitely.

Experimental validation, maybe using techniques like RNA-seq, or other molecular methods, is always needed to confirm computationally predicted genes.

They provide an excellent starting point, strong hypotheses, but not the final answer on their own.

Okay, that makes sense.

Now, one of the most powerful applications of bioinformatics, it seems, involves comparing newly discovered sequences to those that are already known, maybe from other species.

Tell us about homologous genes and large computer databases.

Yes, this is absolutely fundamental to modern biology, the concept of homology.

Homologous genes are genes that are derived from a common ancestral gene.

Shared ancestry.

Exactly.

Because of this shared ancestry, homologous genes usually have similar DNA sequences, and very often they carry out similar or even identical functions in different organisms, or sometimes within the same organism.

For example, the lacY gene, which encodes the protein that transports lactose into the cell, is found in E. coli and the related bacterium Klebsiella pneumoniae.

These two genes are homologous, sharing about 78% identical DNA bases, and they perform the same function.
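
A figure like "78% identical" comes from lining up two sequences and counting matching positions. Here is a minimal Python sketch that assumes the sequences have already been aligned (with "-" marking gaps) and that gap-containing columns are left out of the calculation; other definitions of percent identity exist. The two short sequences are invented and are not the real lacY genes.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of gap-free aligned columns in which the two bases match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Invented, pre-aligned sequences that differ at two of ten positions.
print(percent_identity("ATGACCGTAA", "ATGTCCGAAA"))
```

Producing the alignment itself is the hard part, and that is the job of dedicated alignment programs.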

So that sequence similarity points to a shared evolutionary past, and likely a similar job.

Now, I often hear the terms orthologs and paralogs.

What's the difference there?

It sounds a bit confusing.

It's a crucial distinction, but yeah, the terms can be tricky at first.

Orthologs are homologous genes found in different species that evolved from a single ancestral gene in their last common ancestor.

Typically, orthologs retain the same or very similar function.

Think of the human beta-globin gene and the mouse beta-globin gene.

They are orthologs, both ultimately derived from an ancestral globin gene, and both involved in carrying oxygen.

Okay, orthologs.

Same gene, different species, same function usually.

What about paralogs?

Paralogs are homologous genes found within a single species that arose from a gene duplication event.

So an ancestral gene gets accidentally duplicated, and now you have two copies in the same genome.

Over evolutionary time, these copies can diverge slightly in sequence and function.

Backup copies that can specialize?

Exactly.

These duplicated genes form a gene family.

A classic example is the human globin gene family.

We have genes for alpha-globin, beta-globin, gamma-globin, delta-globin, etc.

These are all paralogs, derived from duplications of an ancient ancestral globin gene.

They have related but distinct functions. For example, different globins are expressed during embryonic, fetal, and adult development, optimized for oxygen transport under different conditions.

Got it.

Orthologs across species, paralogs within a species due to duplication.

And how do researchers actually use this concept of homology when faced with, say, a newly sequenced gene?

This is where those large collaborative databases are absolutely essential.

Scientists worldwide contribute genetic sequence information to immense public repositories.

GenBank at NCBI is a primary one for DNA sequences.

UniProt and Swiss-Prot are major ones for protein sequences.

PDB stores the 3D structures of proteins.

So global libraries of genetic information.

Precisely.

And these databases aren't just raw sequences.

They also include crucial annotations.

Additional information linked to each sequence, like the organism it came from, its known or predicted function, links to scientific publications, and so on.

These databases are critical for sharing data and comparing new findings against the collective knowledge of the field.

It makes biology a truly global cumulative science.

And the go-to tool for searching these massive libraries for homologous sequences is BLAST, right?

The Basic Local Alignment Search Tool.

I've heard of that one.

Absolutely.

BLAST is arguably the most fundamental and widely used bioinformatics tool.

It was developed back in 1990 by Stephen Altschul, David Lipman, and their colleagues at NCBI.

You basically take a genetic sequence you're interested in, maybe a gene you just discovered or a protein sequence, and you input it into the BLAST program.

On a website, usually?

Yes.

Typically through a web interface.

And BLAST then rapidly compares your query sequence against millions, even billions, of sequences stored in the chosen database, looking for statistically significant matches, potential homologs.
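The "local alignment" in BLAST's name can be sketched in a few lines. The example below is not BLAST's algorithm: BLAST uses a faster seed-and-extend heuristic, while this sketch uses exact Smith-Waterman dynamic programming, the classic method for scoring the best local match between two sequences. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(subject) + 1)  # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 lets an alignment restart anywhere
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy "database", invented for illustration only.
toy_database = {"seq1": "TTACGTTT", "seq2": "GGGGCCCC", "seq3": "TACGA"}
ranked = sorted(toy_database,
                key=lambda name: local_alignment_score("ACGT", toy_database[name]),
                reverse=True)
print(ranked)
```

Ranking every database entry by its best local score is the basic idea of a homology search. Doing this exactly against billions of sequences would be far too slow, which is why a heuristic such as BLAST is used in practice.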

How do you interpret the results?

It must give you a huge list of matches.

Is it just the highest percentage match that matters?

Well, the percentage identity, how similar the sequences are, is important.

But the most crucial metric BLAST provides is the e-value, or expect value.

E-value?

What's that?

The e-value is a statistical measure.

It represents the number of matches with a similar score that you would expect to find purely by random chance when searching a database of that particular size.

So lower is better?

Much, much better.

A very small e-value, for instance, less than 1 × 10^-50, that's 1 divided by 1 followed by 50 zeros, indicates a highly significant match.

It means the similarity is extremely unlikely to be due to random chance alone, and strongly suggests true homology, a shared evolutionary origin.
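For anyone who wants to see where such a number comes from, here is a rough sketch based on the standard relationship between a normalized bit score S' and the expected number of chance hits: E = m × n × 2^(-S'), where m is the query length and n is the total length of the database. The function name and the example numbers are assumptions for illustration. Real BLAST output uses corrected "effective" lengths, so its values will not match this simplified calculation exactly.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate e-value from a bit score: E = m * n * 2**(-S').

    Simplified sketch: uses raw lengths rather than the effective
    lengths that BLAST itself applies.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative numbers only, not taken from a real search.
print(expect_value(10, 4, 256))   # 1.0: about one such hit expected by chance
print(expect_value(10, 4, 512))   # 2.0: doubling the database doubles E
print(expect_value(20, 4, 256) < expect_value(10, 4, 256))  # True: higher score, smaller E
```

This also shows why the e-value depends on database size: the same alignment score is less surprising when there are more sequences to match against by chance.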

Can you give us a real-world example of how e-values work?

Certainly.

Let's say you take the sequence of the human enzyme phenylalanine hydroxylase, that's the enzyme that's deficient in the genetic disorder, phenylketonuria.

PKU.

If you BLAST this protein sequence against a large protein database like Swiss-Prot, the top hits will be the phenylalanine hydroxylase protein from closely related species like chimpanzees and orangutans, with incredibly tiny e-values, essentially zero.

Then you'll see matches from mice, rats, chickens, maybe zebrafish, and even fruit flies, Drosophila.

The e-values will gradually get a bit larger as the evolutionary distance increases, but they'll still be extremely small, confirming that these are all indeed homologous enzymes derived from a common ancestor.

The order of the matches beautifully reflects the evolutionary relationships.

That's amazing.

So finding strong homology, indicated by a tiny e-value, to a gene or protein with a known function, can provide a really powerful clue about the function of your newly discovered gene.

Absolutely.

This is one of the most common and fruitful uses of BLAST and homology searching.

There's a very strong correlation.

If two sequences are significantly homologous, they are very likely to share a similar or related function.

For example, when the CFTR gene, the gene mutated in cystic fibrosis, was first identified in humans, researchers immediately ran a homology search.

The results showed it was homologous to several known proteins in other species that were involved in transporting ions and small molecules across cell membranes.

This was a huge breakthrough.

It provided the crucial first clue that cystic fibrosis likely involves a defect in ion transport, which guided subsequent research and ultimately proved correct.

Homology gave them the functional hypothesis.

Wow, that's a perfect illustration of its power.

Okay, finally, once you've used BLAST to identify a set of homologous genes or proteins, how do you zoom in on the most important parts, the critical sites within those sequences that are likely essential for their function?

That's done through a technique called multiple sequence alignment.

After finding homologous sequences using BLAST, researchers use computer programs to align two or more of these sequences together, lining them up column by column.

Lining them up to see the similarities and differences.

Exactly.

The goal is to identify conserved sites.

These are positions in the alignment, specific nucleotides in DNA or RNA or specific amino acids in proteins, that remain identical or very similar across most or all of the different homologous sequences you're comparing.

And why are these conserved sites so important?

They are highly likely to be functionally critical.

Think about it from an evolutionary perspective.

If a particular amino acid is essential for a protein's structure or its ability to bind to another molecule or catalyze a reaction, then mutations changing that amino acid are likely to be harmful.

Natural selection will tend to weed out those harmful mutations.

So the important bits are preserved over time.

Exactly.

The sequences can change in less critical regions, but the functionally essential sites tend to be conserved, meaning they resist change over long evolutionary periods.

For instance, if you align the protein sequences of the different human globin chains, alpha, beta, delta, gamma, epsilon, zeta, all those paralogs we mentioned, you'll see that certain amino acid positions are identical across all or most of them.

These often include specific histidine residues that are absolutely crucial for binding the heme molecule, which is the part that actually carries oxygen.

Seeing their conservation across all these different globin proteins powerfully underscores their fundamental importance to the protein's function.
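Spotting conserved sites in an alignment is simple enough to sketch in code. The example below assumes the sequences have already been aligned to equal length by an alignment program, and it reports only columns that are strictly identical in every sequence. Real tools also score "very similar" residues, which this sketch leaves out. The function name and the toy alignment are invented for illustration and are not real globin sequences.

```python
def conserved_columns(alignment: list[str]) -> list[int]:
    """Return 0-based indices of columns where every sequence has the
    same residue and none of them has a gap ('-')."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


# Toy protein alignment, invented for illustration only.
toy_alignment = [
    "MKHLA",
    "MRHLG",
    "MKHIA",
]
print(conserved_columns(toy_alignment))  # [0, 2]: the M and H columns
```

In a real globin alignment, the columns holding the heme-binding histidines would be among those that show up as conserved.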

And can alignments show other things too, like missing pieces?

Yes.

Alignments often introduce gaps in one sequence relative to another.

These gaps represent insertions or deletions, called indels, that have occurred in one lineage compared to another during evolution.

So multiple sequence alignments are incredibly rich sources of information about both function and evolutionary history.
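Gaps are easy to pull out of an aligned sequence as well. This small sketch assumes gaps are written as '-' characters, which is the common convention, and lists where each run of gaps starts and how long it is. The function name and the example sequence are invented for illustration.

```python
import re


def gap_runs(aligned_seq: str) -> list[tuple[int, int]]:
    """Return (start, length) for each run of '-' in an aligned sequence."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned_seq)]


# Toy aligned sequence, invented for illustration only.
print(gap_runs("ATG---CCA-T"))  # [(3, 3), (9, 1)]
```

A gap alone does not say whether this sequence lost bases or the other sequences gained them. Deciding between an insertion and a deletion takes comparison with additional lineages.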

This is incredible.

So what does this all mean?

We've journeyed from the basic map, the genome sequence, through understanding gene activity with functional genomics, exploring the incredibly complex world of proteins and proteomics, and seeing how bioinformatics acts as the essential computational brain making sense of it all.

We've really gone from mapping the city to understanding its bustling economic and social life, haven't we?

We really have.

It's truly about moving from merely knowing the sequence letters to understanding their profound significance, their function, their interactions, their evolution.

It's about understanding the living system.

Which, as you said, raises some deep questions.

It certainly does.

As we continue to uncover these intricate biological mechanisms with ever-increasing detail and scale, you have to ask, how might this deepening knowledge fundamentally change our approach to health, disease, agriculture, maybe even the very definition of life itself?

The more we understand the blueprint and how it operates, the more profound the questions become about its design, its potential, and our ability to perhaps modify it.

It's a journey of endless discovery, isn't it?

Where every new piece of information seems to unlock more complex and fascinating questions.

We hope this deep dive has given you, our listener, a powerful new lens through which to view the living world, from the single gene all the way up to the entire organism.

I hope so too.

So as you go about your day, perhaps consider what other complex biological systems, maybe even beyond genetics, might be hiding secrets that are just waiting for the right blend of experimental technique and computational insight to be uncovered.

The possibilities seem almost endless.

Thank you so much for joining us on this incredibly deep dive into functional genomics, proteomics, and bioinformatics.

We truly appreciate you being part of the Last Minute Lecture family.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Understanding how genes function within their genomic context and how those functions translate into cellular outcomes requires integrating molecular observation with computational analysis across multiple biological levels. Functional genomics investigates the dynamic roles genes play in living systems by combining large-scale experimental techniques with sophisticated analytical methods, moving beyond simple gene cataloging to reveal how individual genes and coordinated gene networks orchestrate biological processes. Proteomics recognizes a critical reality in molecular biology: proteins, not genes, perform most cellular work, and the abundance of a transcript often fails to predict protein quantity or activity within cells. Advanced mass spectrometry platforms and protein detection technologies enable researchers to measure thousands of proteins simultaneously, identify chemical modifications that alter protein function, and map physical and functional relationships among proteins within complex cellular environments. Bioinformatics provides the computational infrastructure that transforms voluminous raw data into actionable biological knowledge, offering tools for comparing sequences across species, identifying functionally important conserved regions, predicting three-dimensional protein structures from amino acid sequences, and mapping regulatory networks that show how genes and proteins control one another. Integration of diverse data sources creates systems-level models that capture how organisms maintain function: genomic sequences establish the molecular blueprint, gene expression measurements reveal which genes activate under specific conditions, protein quantification data show actual molecular abundance, and metabolic information demonstrates how these molecules interact in biochemical pathways. 
Public repositories such as GenBank and specialized interaction databases provide accessible platforms for researchers to query relationships among biological molecules and visualize how changes in one component cascade through interconnected regulatory systems. This integrated perspective illuminates disease mechanisms by showing how genetic mutations or altered gene expression propagate through biological networks, ultimately causing pathological outcomes. Applications to human health demonstrate the practical value of these approaches: functional genomics techniques identify genes responsible for inherited diseases, reveal potential drug targets, enable stratification of patients into treatment groups based on genetic profiles, and support the development of precision medicine strategies tailored to individual genomic variation.
