Chapter 4: Protein Primary Structure & Sequencing

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

If you are trying to understand how life works, I mean, how cells manage basically everything.

Structure, movement, defense, communication.

You pretty quickly land on proteins.

These are the physically and functionally complex macromolecules.

They're the absolute workhorses of life.

They really are the ultimate molecular multitaskers.

And the sheer scope of their responsibilities is staggering.

Your sources highlight that this isn't just theory.

We're talking about the cytoskeleton providing that structural integrity.

You have actin and myosin handling movement.

Hemoglobin carrying oxygen, antibodies on defense.

And of course, enzymes catalyzing every single essential reaction.

And what's so crucial to understand is that proteins aren't static.

They have a whole life cycle, almost like a career path.

That's a great way to put it.

It starts with synthesis, so translation from the genetic code.

Then they mature through these critical steps like post-translational processing.

We're talking about things like selective proteolysis.

Right, cutting off little pieces to activate them.

Exactly.

And then they go through working and resting states, regulated by the cell.

But eventually, like any molecule, they age.

Through processes like oxidation and deamidation.

And finally, they get tagged, usually by ubiquitination, which marks them for degradation.

Then they get turned back into amino acids to be recycled.

What's fascinating here is that the whole goal of modern molecular medicine is tied directly to understanding this dynamic life cycle.

We want to identify protein biomarkers or specific modifications that are associated with a disease, say, a cancer or a neurodegenerative state.

But to get to that clinical goal, you have to know the protein's exact primary structure, residue by residue.

And that's our mission today.

Giving you a shortcut to understanding the foundational techniques used to figure out that sequence.

We have to start with isolation because the challenge is, well, it's formidable.

A single cell contains thousands of proteins.

And in wildly different amounts, right?

Exactly.

Some are in tiny trace amounts, while others are incredibly abundant.

You need a pure sample to even begin analyzing its properties.

So if we have a crude cell extract, this soup of thousands of macromolecules, what's the most basic physical lever we can pull to start separating them?

We start with solubility, often through a method called selective precipitation.

You basically exploit differences in how well proteins dissolve.

For instance, you can adjust the pH to cause what's called isoelectric precipitation.

Or you can use organic solvents to change polarity.

But what's the most common way?

The most widely used technique is called salting out.

You gradually increase the concentration of a salt, like ammonium sulfate.

And different proteins will crash out of the solution at different salt concentrations.

Precisely.

It's a surprisingly effective first cut to get rid of major contaminants.

That gets us closer, but the real workhorse, the one that delivers the high-resolution purification we need for sequencing, has to be column chromatography.

Correct.

Chromatography is always about two phases.

The stationary phase, that's the functionalized beads packed in the column.

And the mobile phase, the liquid flowing through it.

Right.

And for modern work, we really rely on HPLC, high pressure liquid chromatography.

Why the high pressure?

Well, to get better resolution, you need to use incredibly fine particles in the column.

But that small particle size generates a lot of resistance.

Ah, so you need high pressure just to push the liquid through.

HPLC uses these highly rigid silica particles and robust stainless steel columns to withstand pressures up to several thousand psi.

It makes that high resolution possible.

And chromatography is so versatile because it can separate based on different physical properties.

So let's start with size.

How do you separate proteins that are chemically similar, but just different sizes?

That would be size exclusion chromatography or gel filtration.

It separates proteins based on their Stokes radii.

Which is not just mass, right?

It's essentially a measure of the effective volume the protein occupies in solution.

It accounts for both mass and shape.

Okay, can you give us an analogy for how that works?

Um, sure.

Think of the beads in the column as sponges filled with tiny pores, like little indentations along a fast moving riverbank.

The large proteins are excluded from those pores.

They just can't get in.

So they stay in the fast moving water, travel the fastest, and they emerge first.

And the smaller ones get stuck in the slow lanes.

Exactly.

The smaller, included proteins spend time diffusing in and out of those pores, getting retarded along the way.

So they emerge later.

That's a great mechanical separation.

Okay, next up, charge.

We can separate by charge using ion exchange chromatography.

This relies entirely on charge-charge attraction.

If your bead matrix has negative functional groups like carboxylates, it's a cation exchanger.

And it'll bind to the positively charged proteins.

You got it.

And if the beads are positively charged, they bind negative proteins.

So how do you get them off the column once they're stuck?

You don't just wash them off.

You elute by gradually raising the ionic strength of the mobile phase.

You're essentially forcing salt ions to compete with your protein for those binding sites, which releases it.

So high salt releases the protein there.

What if we want to separate based on how hydrophobic or nonpolar a protein is?

That takes us to hydrophobic interaction chromatography.

Here, proteins stick to a matrix that's coated with hydrophobic groups.

What's a little counterintuitive here is that these hydrophobic interactions are actually enhanced by a mobile phase of high ionic strength.

Wait, so the same high salt concentration that releases proteins in ion exchange makes them stick here?

Exactly.

To get the proteins off, you do the opposite.

You gradually lower the salt concentration or add a mild, less polar agent like glycerol.

That's four powerful methods, but the one that truly sets itself apart for specificity is affinity chromatography.

This is the superstar.

It's the most selective because it exploits biology, not just physics.

So it's using the protein's own function against it, in a way.

In a way, yes.

It harnesses a protein's natural high selectivity for a specific molecule, or ligand, which you immobilize on the beads.

Like an antibody's antigen.

Perfect example.

Only the target protein should adhere.

Elution is then achieved by competing with a free soluble ligand.

You just flood the column with it, and your protein lets go of the column to bind the free version and flows right off.

Okay, so we've run several of these columns, and we think we have a pure sample.

How do we prove that purity and further analyze its components?

For that, we turn to polyacrylamide gel electrophoresis, or PAGE.

Specifically, SDS-PAGE.

Right.

Electrophoresis separates molecules using an electric field in a porous acrylamide matrix.

But SDS-PAGE is the standard because of the crucial role played by the anionic detergent, SDS, sodium dodecyl sulfate.

Why is SDS so critical here?

What's it doing?

It does two main things.

First, it completely denatures the protein, unwinding it into a linear chain.

Okay, so no more 3D structure.

None.

And second, it coats that protein in a uniform negative charge.

It binds at a fixed ratio, about one SDS molecule for every two peptide bonds.

So this massive negative charge from the SDS just overwhelms the protein's natural charge.

It makes the charge-to-mass ratio approximately equal for every single protein-SDS complex.

So the protein's native charge is effectively masked.

And since the charge-to-mass ratio is equalized, the only variable left affecting how fast it moves through the gel is size.

Precisely.

We also typically add a reducing agent like 2-mercaptoethanol to break any disulfide bonds and make sure everything is fully linear.

And the result is that the polypeptides separate primarily based on their relative molecular mass.

Exactly.
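In practice, that mass separation is usually turned into a number by running marker proteins of known mass in a neighboring lane and fitting log10(mass) against relative mobility. The sketch below assumes that log-linear calibration holds over the range of interest; the marker mobilities and masses are made-up illustration values, not data from the chapter, and the function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_mass_kda(relative_mobility, markers):
    """Estimate mass from mobility using a log10(mass) vs. mobility calibration."""
    mobilities = [rf for rf, _ in markers]
    log_masses = [math.log10(kda) for _, kda in markers]
    slope, intercept = fit_line(mobilities, log_masses)
    return 10 ** (slope * relative_mobility + intercept)

# Hypothetical marker lane: (relative mobility, mass in kDa).
MARKERS = [(0.2, 100.0), (0.5, 31.6), (0.8, 10.0)]

if __name__ == "__main__":
    print(round(estimate_mass_kda(0.35, MARKERS), 1), "kDa (estimate)")
```

Smaller polypeptides travel farther, so the fitted slope is negative: a band with higher relative mobility gets a lower estimated mass.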

And if you stain the gel with something like Coomassie blue, you can literally see your purification success.

The sources show how successive lanes of an extract will show a target band becoming more prominent as all the other contaminating bands just disappear.

SDS-PAGE is fantastic for mass, but proteins aren't just defined by size.

They're defined by their unique chemical composition, especially their net charge.

And to separate based on that unique chemistry, you need isoelectric focusing, or IEF.

How does that work?

This technique uses special ionic buffers called ampholytes and an electric field to create a stable pH gradient across the gel.

Proteins migrate through this gradient until they reach their isoelectric point, or pI.

That specific pH where their net charge is zero.

Right.

And at that point, they just stop moving.

The electric field has no more pull on them.
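To make the idea of a zero-net-charge pH concrete, here is a minimal sketch that estimates a pI from sequence alone. It sums the Henderson-Hasselbalch charge of each ionizable group and bisects for the pH where the total crosses zero. The pKa values are rounded, generic assumptions (real values shift with the local environment of each residue), and the function names are hypothetical.

```python
# Approximate pKa values; assumptions for illustration only.
PKA_POSITIVE = {"N_TERM": 8.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"C_TERM": 3.1, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}

def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (one-letter amino acid codes)."""
    positive = ["N_TERM"] + [aa for aa in sequence if aa in PKA_POSITIVE]
    negative = ["C_TERM"] + [aa for aa in sequence if aa in PKA_NEGATIVE]
    charge = 0.0
    for group in positive:
        # Fraction of the group that is protonated (+1).
        charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    for group in negative:
        # Fraction of the group that is deprotonated (-1).
        charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge

def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH where net charge is zero (charge falls as pH rises)."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

if __name__ == "__main__":
    for peptide in ("DDDD", "ACDKR", "KKKK"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

An acidic peptide lands at a low pI and a basic one at a high pI, which is why the two would come to rest at opposite ends of the pH gradient.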

And to achieve the highest possible resolution for a really complex mixture like, say, every protein in a cell, we can combine these two.

Yes.

And that gives us two-dimensional electrophoresis.

You separate the components first based on their pI in one dimension.

Their chemical property.

And then you take that gel and run it sideways to separate them based on their mass in the second dimension.

Their physical property.

And this lets you resolve hundreds, even thousands of distinct polypeptides into these discrete spots on a map.

It's still often used as the gold standard for mapping the complexity of a proteome.

Separation is powerful, but it still doesn't tell us the primary structure, the exact sequence of amino acids.

For that, we have to look back at the historical trailblazers who first tackled this problem chemically.

That story really begins with Frederick Sanger.

He sequenced the first polypeptide, insulin.

Which is pretty small, right?

Relatively.

It's a 21-residue A chain and a 30-residue B chain linked by disulfide bonds.

He had to chemically break those bonds, then chop up the chains into smaller peptides using various enzymes like trypsin and chymotrypsin.

And his method for identification involved the Sanger reagent.

That's right.

1-fluoro-2,4-dinitrobenzene.

It selectively labels the amino-terminal residue of a peptide.

By painstakingly analyzing all these different fragments and identifying the N-terminus of each one, he pieced together the sequence.

It sounds like a monumental manual effort.

It was.

It won him his first Nobel Prize.

Wait, he won one for sequencing insulin and then another one for sequencing DNA later?

That's just incredible.

But chemical sequencing was revolutionized by Pehr Edman.

What did Edman change?

Edman introduced phenyl isothiocyanate, the Edman reagent.

And the true genius was in the chemistry of it.

The derivative it creates, a phenylthiohydantoin, can be removed under mild conditions.

Okay.

What's the significance of mild conditions?

Critically, it removes the labeled amino acid while leaving the rest of the peptide intact, just one residue shorter, and it exposes a new N-terminus.

Ah.

So, instead of destroying the peptide to find the first amino acid, you can just peel them off one at a time and repeat the process sequentially.

Exactly.

This allowed for successive rounds of sequencing on a single sample, which led to automation.

But even with automation, it had serious limitations.

Oh, absolutely.

The efficiency is never 100%.

So over many cycles, you accumulate these fragments with different starting points, what scientists call out-of-phase contaminants.

Which messes up your signal.

It does.

It limits the practical read length to a short segment, maybe five to 30 amino acids.
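A quick back-of-the-envelope model shows why the read length tops out. If each cycle succeeds on a fixed fraction of the molecules, the share still in phase after n cycles is that fraction raised to the power n. The 95% per-cycle yield used below is an assumed illustration value, not a figure from the chapter, and the function names are hypothetical.

```python
import math

def in_phase_fraction(per_cycle_yield, cycles):
    """Fraction of peptide molecules still in step after the given cycles."""
    return per_cycle_yield ** cycles

def max_useful_cycles(per_cycle_yield, min_fraction):
    """Largest cycle count that keeps at least min_fraction in phase."""
    return math.floor(math.log(min_fraction) / math.log(per_cycle_yield))

if __name__ == "__main__":
    assumed_yield = 0.95  # assumption for illustration
    for n in (5, 10, 20, 30):
        print(n, round(in_phase_fraction(assumed_yield, n), 3))
    print("cycles before half the sample is out of phase:",
          max_useful_cycles(assumed_yield, 0.5))
```

Even a seemingly small per-cycle loss compounds quickly, so the clean signal fades while the out-of-phase background grows.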

To get a full sequence, you still needed a lot of pure protein and multiple cleavage strategies to generate overlapping fragments.

It was still incredibly slow.

And here's where it gets really interesting.

The molecular biology revolution completely changed the question.

Why painstakingly sequence the protein if you can just sequence the gene?

DNA sequencing became far more rapid.

And today, genomics allows us to sequence a gene and simply translate the nucleotide triplets into the encoded polypeptide sequence.
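Translating triplets is simple enough to sketch in a few lines. The example below builds the standard genetic code table and reads a coding DNA strand codon by codon until a stop codon. The input sequence is a made-up toy example, and the sketch assumes the reading frame starts at the first base and ignores introns and other real-world complications.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(coding_dna):
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    dna = coding_dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCCAAATAA"))  # toy sequence
```

The output is only the predicted polypeptide; it says nothing about later processing or modification of the actual protein.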

So for most model organisms, including us, Homo sapiens, the primary protein sequence is probably already sitting in a database like GenBank.

In most cases, yes.

That means the task shifted.

We no longer need to chemically sequence the whole protein from start to finish.

So what do we need?

We often just need a small sequence segment, maybe five or six residues, to perform an unambiguous identification using bioinformatics searches against all that genomic data.
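One way to see why such a short tag can be enough: with 20 amino acids there are 20^6, or 64 million, possible six-residue strings, so a given hexapeptide rarely appears by chance. The sketch below counts those possibilities and runs a plain substring search. The database entries are invented sequences with hypothetical names, and real searches use dedicated tools rather than this toy lookup.

```python
def possible_tags(length, alphabet_size=20):
    """Number of distinct peptide tags of the given length."""
    return alphabet_size ** length

def find_matches(tag, database):
    """Return the names of database entries whose sequence contains the tag."""
    return [name for name, sequence in database.items() if tag in sequence]

# Invented sequences for illustration; not real proteins.
TOY_DATABASE = {
    "hypothetical_protein_1": "MAGWTRLKEDVHSPQNFYCI",
    "hypothetical_protein_2": "MSTNPKPQRKTWAGLDEHVF",
    "hypothetical_protein_3": "MLLKEDVHAAGRTWPSYNCQ",
}

if __name__ == "__main__":
    print(possible_tags(6))
    print(find_matches("KEDVH", TOY_DATABASE))   # five residues: ambiguous here
    print(find_matches("KEDVHS", TOY_DATABASE))  # six residues: unique here
```

In this toy set, adding one more residue to the tag is what narrows two candidates down to one.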

So the problem isn't knowing the blueprint anymore.

The modern problem is determining the existence, abundance and functional state of the actual product.

Which brings us to the gold standard for identification today, mass spectrometry.

Or MS.

Right.

MS is vastly superior to Edman sequencing in both sensitivity and speed.

And since mass and charge are universal properties, MS is incredibly versatile.

It analyzes everything, metabolites, lipids, proteins.

And crucially, MS can detect modifications that the DNA sequence completely misses.

This is where we get the real insight.

That is its major transformative advantage.

Detecting post-translational modifications or PTMs.

These are often the switches that determine function.

Absolutely.

If a protein is phosphorylated, that means an 80-dalton phosphate group has been added.

That mass increment is crystal clear to the MS, whereas the DNA sequence gives you no clue.

And that 80-dalton tag can be the difference between a cell behaving normally and one that's signaling for cancer growth.

Exactly.

We can also detect acetylation, which adds 42 daltons, or glycosylation, which can add 162 daltons.

MS is mapping the working code, which the genome completely ignores.
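The logic of spotting a modification by mass can be sketched as a simple comparison: subtract the mass predicted from the gene sequence from the mass observed in the instrument, and see whether the difference matches a known increment. The shifts below are the nominal values mentioned above; treating the 162-dalton case as a single hexose sugar, the tolerance, and the function name are assumptions for illustration.

```python
# Nominal mass increments in daltons.
PTM_SHIFTS_DA = {
    "phosphorylation": 80.0,
    "acetylation": 42.0,
    "glycosylation (one hexose)": 162.0,  # assumption: a single hexose unit
}

def explain_mass_shift(observed_da, predicted_da, tolerance_da=0.5):
    """List the modifications whose mass shift matches observed minus predicted."""
    delta = observed_da - predicted_da
    if abs(delta) <= tolerance_da:
        return ["unmodified"]
    return [
        name for name, shift in PTM_SHIFTS_DA.items()
        if abs(delta - shift) <= tolerance_da
    ]

if __name__ == "__main__":
    predicted = 1000.0  # toy peptide mass predicted from the gene sequence
    for observed in (1000.1, 1042.0, 1080.0, 1162.0, 1500.0):
        print(observed, explain_mass_shift(observed, predicted))
```

An empty list means the shift matches nothing in the table, which in real work could point to a different modification, several at once, or a wrong identification.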

The big challenge must have been getting these massive, fragile biomolecules into the vacuum of the mass spectrometer without them just falling apart.

You can't just heat them.

You can't.

That required the development of soft ionization techniques.

One is electrospray ionization, where you dissolve molecules in a volatile solvent and spray them through a tiny capillary.

The solvent evaporates, leaving your charged macromolecule suspended in the gas phase totally intact.

And that can be connected right to an HPLC column.

It often is.

The other major method is MALDI.

You mix a sample with a dye matrix, and a laser rapidly excites the matrix, which basically flings the embedded peptide into the vapor phase without thermally heating the peptide itself.

A true technological breakthrough.

So once they're ionized and in the gas phase, how does the machine actually measure the mass?

Well, in modern labs, we use two main types.

For smaller molecules, there's quadrupole MS.

You can think of it like a highly specialized filter that only lets particles of a very precise mass-to-charge ratio wiggle their way through an oscillating electric field to hit the detector.

And for bigger things, like whole proteins.

For large proteins, or complex peptide mixtures, we often use time-of-flight MS, or TOF.

Here, all the ions are accelerated down a long tube, and the time it takes them to reach the detector increases with their mass-to-charge ratio.

So heavier ions are slower.

Heavier ions take longer to cover the distance.

Simple as that.
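The physics behind that is just energy conservation. Every ion with charge z picks up the same kinetic energy, z times e times V, from the accelerating voltage, so its speed is sqrt(2zeV/m) and its flight time over a tube of length L is L divided by that speed. Flight time therefore scales with the square root of mass-to-charge. The tube length and voltage below are assumed illustration values, not instrument specifications, and the function name is hypothetical.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs
DALTON_KG = 1.66053906660e-27          # kilograms per dalton

def flight_time_s(mass_da, charge=1, tube_length_m=1.0, accel_voltage_v=20_000.0):
    """Idealized time of flight for an ion (field-free drift, no initial velocity)."""
    mass_kg = mass_da * DALTON_KG
    kinetic_energy_j = charge * ELEMENTARY_CHARGE_C * accel_voltage_v
    velocity_m_per_s = math.sqrt(2.0 * kinetic_energy_j / mass_kg)
    return tube_length_m / velocity_m_per_s

if __name__ == "__main__":
    for mass in (1_000.0, 4_000.0, 16_000.0):
        microseconds = flight_time_s(mass) * 1e6
        print(f"{mass:>8.0f} Da -> {microseconds:.1f} microseconds")
```

Quadrupling the mass doubles the flight time, and adding a second charge to the same ion shortens it by a factor of the square root of two.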

And to sequence a complex mixture, without purifying every single component first, which is often impossible.

That requires tandem MS, or MS/MS.

This is the real sequence engine.

So two mass specs in a row.

Exactly.

The first one separates the incoming peptide mixture by mass, and selects just one peptide.

That single peptide is then sent to a second chamber where it gets fragmented, usually by colliding it with gas atoms.

And the second MS measures the mass of the pieces.

Right.

And since those fragments typically differ by the mass of just one or two amino acids, computer algorithms can reconstruct the original peptide sequence from those mass differences.
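Here is a minimal sketch of that reconstruction step, under generous assumptions: the fragments form a clean ladder in which each one is exactly one residue longer than the last, there is no noise, and no rungs are missing. Real spectra need far more robust algorithms. The residue masses are standard monoisotopic values; isoleucine is omitted because it has the same mass as leucine and cannot be told apart this way. The function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (isoleucine omitted: same mass as leucine).
RESIDUE_MASS_DA = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def make_ladder(sequence, start_mass=0.0):
    """Build an idealized fragment ladder: one rung per added residue."""
    ladder = [start_mass]
    for amino_acid in sequence:
        ladder.append(ladder[-1] + RESIDUE_MASS_DA[amino_acid])
    return ladder

def read_ladder(fragment_masses, tolerance_da=0.01):
    """Read a sequence from the mass differences between neighboring fragments."""
    masses = sorted(fragment_masses)
    residues = []
    for lighter, heavier in zip(masses, masses[1:]):
        delta = heavier - lighter
        matches = [
            amino_acid for amino_acid, mass in RESIDUE_MASS_DA.items()
            if abs(mass - delta) <= tolerance_da
        ]
        # "?" marks a gap that no single residue explains.
        residues.append(matches[0] if len(matches) == 1 else "?")
    return "".join(residues)

if __name__ == "__main__":
    ladder = make_ladder("PEPTLDE")
    print(read_ladder(ladder))
```

A missing rung shows up as a difference equal to two residues, which this sketch does not try to resolve and which can even mimic a single residue of the same combined mass.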

This sounds powerful enough to tackle clinical problems.

It's already routine.

Tandem MS is used every day to screen newborn blood samples for genetic disorders like phenylketonuria by analyzing abnormal concentrations of amino acids and other metabolites.

It's accurate, fast, and incredibly sensitive.

So what does this all mean?

All of these methods, purification, high-res separation, sequencing, they all lead us to the grand scale of proteomics.

The effort to identify the proteome, the identity, the abundance, and the state of modification of all proteins expressed by a cell at a specific time.

And that's the essential distinction, isn't it?

The genome is static.

It's the unchanging blueprint.

But the proteome is dynamic.

It changes based on cell type.

A muscle cell has a different proteome than a neural cell.

It changes over time, like the shift in hemoglobin from fetal to adult life.

And it changes in response to the environment.

And that dynamic nature is precisely why proteomics is so critical for identifying specific actionable disease biomarkers.

To tackle that massive complexity, methods like MudPIT were developed.

Multidimensional Protein Identification Technology.

It's a fantastic example of using chromatography to simplify the problem before you even get to the MS.

It uses successive rounds of nanoscale chromatography, like reversed phase and ion exchange, to resolve a vast, complex peptide mixture into many simpler fractions.

The mass spectrometer then analyzes those simpler fractions, and the computer has a much easier job.

And tying all this together is bioinformatics.

Absolutely essential.

The computer algorithms are the unsung heroes here.

They compare the new sequences against known databases, allowing scientists to infer function.

They can detect conserved structural themes, giving us critical clues about what a new protein might do without having to test it from scratch.

So if we were to boil it all down.

If you take away three things today, remember this.

You need high-resolution purification using selective chromatography.

You need precise separation, usually 2D electrophoresis for the ultimate complexity map.

And you need high-powered identification with mass spectrometry being the method of choice.

Especially for finding those critical dynamic post-translational modifications.

Genomics gives us the blueprint, but proteomics gives us the working dynamic picture necessary for identifying disease biomarkers.

Since the genome is static, but the proteome is unique for every cell component and constantly changes with time and circumstances, what is the sheer magnitude of the task of defining the ultimate comprehensive human proteome?

What challenge does this dynamic complexity present for truly personalized medical interventions when the target we are chasing is essentially a moving one?

Something to think about as you encounter the world's most versatile molecules.

Thank you for joining us for this deep dive, and keep that curiosity fired up.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers
Proteins are the molecular machines that sustain cellular function, serving as structural components, catalysts, and signaling mediators across all biological systems. Understanding protein architecture begins with isolating pure samples from complex cellular mixtures, a process accomplished through separation techniques that exploit fundamental physicochemical differences. Size exclusion chromatography leverages molecular dimensions through the Stokes radius to achieve separation, while ion exchange chromatography exploits electrostatic properties based on net surface charge, and affinity chromatography exploits specific ligand-binding interactions. High pressure liquid chromatography advances these classical methods by employing small, densely packed matrix particles that achieve superior resolution and reproducibility. Once isolated, proteins are characterized using polyacrylamide gel electrophoresis, which employs sodium dodecyl sulfate as a denaturing anionic detergent to unwind polypeptide chains and fractionate them according to molecular weight. Isoelectric focusing provides an orthogonal separation strategy based on the isoelectric point, the unique pH value where net charge approaches zero, frequently combined with polyacrylamide gel electrophoresis in two dimensional electrophoresis for comprehensive characterization. The determination of amino acid sequences was historically pioneered through insulin sequencing and the Edman degradation method, which sequentially labels and removes amino acids from the N-terminus. Contemporary protein sequencing has been revolutionized by mass spectrometry, an analytical technique offering exceptional sensitivity and the critical advantage of detecting covalent modifications such as phosphorylation and glycosylation that remain invisible to gene-derived predictions. 
Mass spectrometry instruments exist in multiple architectures, including quadrupole analyzers optimized for smaller peptides and time of flight instruments suited for large protein complexes, with ionization methods like electrospray ionization and matrix assisted laser desorption enabling transfer of intact molecules into the gas phase. The broader field of proteomics extends beyond individual protein identification to mapping the complete proteome of cells under defined physiological conditions, synthesizing genomic blueprints with experimental mass spectrometry data and bioinformatics algorithms to recognize conserved structural motifs like the Rossmann fold and predict biological function. Protein maturation involves critical posttranslational modifications including selective proteolysis to remove signal sequences and disulfide bond formation to stabilize tertiary structures, while protein biomarkers associated with disease states serve as diagnostic indicators and therapeutic targets in modern medicine.
