Chapter 7: Evaluating Clinical Evidence

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive.

Today we are tackling something that I think every single one of us has wondered about at some point.

Usually while you're sitting on that crinkly paper-covered exam table, you know, waiting for a doctor to come in, we're asking, what is actually going on in that doctor's head?

I mean, is it just memorization?

Is it like pure intuition?

Or is there an engine running under the surface?

Yes.

Or is there actually a rigorous, almost mathematical engine running beneath the surface of what looks like just a simple physical exam?

It's a fascinating question.

And I think the answer, as we're going to see today, is that it's surprisingly mathematical.

We are diving into chapter seven of Bates' Guide to Physical Examination and History Taking.

Right.

And this chapter is titled Evaluating Clinical Evidence.

And for anyone listening who thinks, oh, evaluating evidence, that sounds like a dry stats class, I promise you, hold on, because this is actually about the high-stakes game of decision-making under uncertainty.

That's it.

It's about how we know what we know in medicine.

Exactly.

Medicine isn't just knowing facts.

It's about navigating probability.

It's about taking a huge amount of noise and finding the signal.

So let's set the stage here.

The chapter opens with this really, really useful model, a Venn diagram, to explain how a doctor makes a decision.

It's called evidence-based clinical practice.

I feel like evidence-based is a buzzword we hear all the time, but Bates visualizes it perfectly.

Imagine three overlapping circles.

That's right.

And it's a concept that's been around for a while, but seeing it visually just helps you understand the tension in the room, really.

Three circles all overlapping in the center.

Okay.

Let's walk through them one by one, circle number one.

So circle number one is clinical expertise.

This is the doctor's internal library.

It represents their growing skills, their ability to diagnose, their past experiences.

So their gut feeling is in there too.

Their gut feeling, but also the proficiency of their actual hands.

Can they feel the spleen?

Can they hear the murmur?

It's the craft.

Okay.

So that's the doctor part.

That's the human standing in the room, circle number two.

Patient preferences.

This is huge.

And the text really emphasizes this.

This is the patient's individual needs, their expectations, their values, because a treatment that works on paper might not work for the specific person's life.

Maybe they can't afford the medication.

Maybe they are terrified of surgery.

Or maybe they just value, you know, quality of life over duration of life, or vice versa.

Exactly.

You can't just treat the disease.

You have to treat the person sitting in front of you.

Makes sense.

And the third circle.

Research evidence.

This is the best available science.

The studies, the data, the hard numbers, the randomized controlled trials that tell us, you know, generally speaking, does treatment A work better than treatment B?

And the sweet spot, the bullseye, is where all three of those circles overlap.

Precisely.

That is the goal of this deep dive.

We want to unpack that third circle, the research evidence, and see how it informs the clinical expertise to help the patient.

We're decoding the logic of the diagnosis.

We are.

The chapter really wants students to understand that you cannot be a good clinician without being able to, you know, read and understand the literature.

You just can't.

I love that.

The logic of the diagnosis.

Yeah.

So let's get into the nitty gritty.

The chapter starts with a bit of a refrain that I found really interesting.

It says we shouldn't just think of labs or x-rays as tests.

Right.

It says asking a question is a test.

It is.

And this is a fundamental shift in thinking for a student or really for anyone new to health assessment.

We usually separate history and physical from diagnostic testing, but Bates argues they are all the same thing.

How so?

When you ask a patient, does it hurt when you breathe or when you listen to their heart, you are performing a diagnostic test, just like drawing blood.

Because it changes the probability of what they might have.

Exactly.

The whole process starts with what we call a differential diagnosis.

That's just a list of possible causes.

Okay.

When the patient walks in, that list is huge.

It could be anything.

Right.

A stomach ache could be a pulled muscle, a heart attack or indigestion.

A huge range.

But as you learn more, as you ask questions and you touch the patient, you are assigning probabilities to those potential causes.

You are mentally moving some up the list and some down the list.

And you're trying to get one diagnosis to be probable enough to act on.

That's the goal.

You want to cross a treatment threshold, meaning you are confident enough to actually do something.

Treatment threshold.

I like that.

It implies you don't always need 100% certainty.

You almost never have it.

You just need enough certainty to justify the action.

Correct.

And that threshold changes.

If the treatment is giving an aspirin, the threshold is pretty low.

If the treatment is chemotherapy, the threshold is very, very high.

Okay.

Let's make this real.

The text gives us a case study that runs through the whole chapter.

We have a 43-year-old woman.

Let's paint the picture of this patient.

Right.

So let's look at her profile.

She comes into the emergency room or maybe the clinic with right upper quadrant abdominal pain.

Okay.

So that's the area right below the ribs on the right side, near the liver.

Correct.

Geographically, that's where the liver and gallbladder live.

She says the pain is steady, it's severe, and it's been going on for over four hours.

That sounds rough.

And here is the kicker.

It developed about an hour after she ate a fatty meal.

She's also nauseous and she's been vomiting.

Okay.

So immediately, my ears perk up at fatty meal.

Even I know that's a classic clue.

It is.

And in the language of evidence-based medicine, that piece of the history increases the likelihood of acute cholecystitis, which is just inflammation of the gallbladder.

Why the fatty meal connection?

The gallbladder squeezes bile to help you digest fat.

So if you eat fat, it squeezes.

If it's inflamed or blocked by a stone, that squeezing hurts a lot.

But it's not a done deal yet.

It's not a slam dunk.

No, because your differential diagnosis list still has other things on it.

It could be biliary colic, which is just pain without an actual infection.

It could be acute cholangitis.

That's an infection of the bile ducts themselves.

It could even be hepatitis.

So we need more data.

We need to run more tests.

But in this case, the next test is the physical exam.

Right.

So you take her vitals.

She has a fever of 38 degrees Celsius, so about 100.4 Fahrenheit.

Her heart rate is up, at 110 beats per minute.

And when you press on that right upper quadrant, it hurts.

It's tender, very tender.

And then there's this specific move mentioned, the Murphy sign.

It sounds like a character from a detective novel.

The Murphy sign.

It's a classic maneuver.

It's very specific to this region.

You press your thumb under the right costal margin.

That's the edge of the rib cage.

And you ask the patient to take a deep breath.

Why a deep breath?

What does that do?

Because when you inhale, your diaphragm pushes down and that pushes the liver and gallbladder down.

If the gallbladder is inflamed, it descends right into your stationary fingers.

And that causes a sharp increase in pain.

It makes the patient abruptly stop inhaling.

That sudden stop.

That is a positive Murphy sign.

So it's like a little trap you set for the pain.

It's a great way to put it.

Okay.

So we have the fatty meal story, the fever, the tenderness, and now this positive Murphy sign.

And you synthesize that.

The text says that while jaundice isn't always present, the combination of fever, tenderness, and a positive Murphy sign significantly increases the probability of cholecystitis.

But are we at 100%?

Have we crossed that threshold?

Not yet.

We order labs.

Her white blood cell count is elevated, which suggests an infection is brewing, but her liver function tests are normal.

Which is actually good news, right?

Because it makes one of those other things less likely.

It is.

Normal liver function makes hepatitis pretty unlikely.

So hepatitis drops way down your probability list.

Cholecystitis moves further up.

What?

But the text makes a really important point here.

It says, no single element of the history, physical examination, or laboratory results is sufficient to help you cross the treatment threshold.

So you're still in the gray zone.

You're pretty sure.

Yeah.

But you wouldn't like cut her open yet based solely on your fingers and a question about dinner.

Exactly.

So you need the closer.

In this case, it's an ultrasound.

The ultrasound shows gallstones and a thickened gallbladder wall.

Okay.

It even shows a sonographic Murphy sign, which is pain when the ultrasound probe itself pushes on the gallbladder.

And boom, diagnosis confirmed.

Diagnosis confirmed.

She gets admitted for antibiotics and surgery.

But notice the journey.

Every question, every touch, every lab value was a data point that shifted the probability until the doctor was confident enough to act.

It wasn't magic.

It was data accumulation.

It's a detective story with math.

It is exactly that.

And to understand that math, we have to look at how we measure the quality of these tests.

Right.

So, speaking of math, that brings us to the next section.

If we're going to treat questions and physical exams like they're actual tests, we need to know how good those tests are.

We do.

And that introduces two heavy hitters, sensitivity and specificity.

These are the pillars of what we call validity.

First, we have to define validity.

It just means, does the test actually measure what it's supposed to measure?

Does it align with reality?

And to know that, you have to compare it to the truth.

You need an answer key.

You need the gold standard.

That's the term for the absolute best measure of truth we have for a specific condition.

For a lung nodule, the gold standard is a biopsy, actually looking at the cells under a microscope.

For colon cancer, it's a colonoscopy.

So we compare our simple tests like the Murphy sign against that gold standard to see how often it was right.

The text describes a two-by-two table here.

I want to try to visualize this for everyone listening, because if you can see this grid in your head, everything else makes a lot more sense.

It really does.

It's just a square divided into four smaller squares.

A grid.

So on the top, you have the truth, the gold standard, in two columns.

The left column is disease present.

The right column is disease absent.

Correct.

This is reality.

You either have it or you don't.

Then on the side, you have your test result, two rows.

The top row is test positive.

The bottom row is test negative.

So this gives you four possible outcomes.

Let's walk through those four boxes.

Box A, top left.

The disease is present and the test is positive.

That's a true positive.

That's a win.

We got it right.

Box B, top right.

The disease is absent, but the test says positive.

That's a false positive.

Whoops.

That's a false alarm.

Okay.

Box C, bottom left.

The disease is present, but the test says negative.

That's a false negative.

That's dangerous.

That means we missed it.

The worst case scenario probably.

And box D, bottom right.

The disease is absent and the test says negative.

A true negative.

That's the all clear.

Another win.

Okay.

I've got the grid in my head.

Now let's define sensitivity using this grid.

Sensitivity is the true positive rate.

It looks only at that left column, the people where the disease is actually present.

It asks a simple question.

Of all the people who actually have the disease, what percentage of them will this test catch?

So if a test has 100% sensitivity, it catches every single person in that column.

There are zero false negatives.

Exactly.

It sweeps up 100% of the sick people.

And because of that, we have this very useful mnemonic, SnNOut.

S-N-N-O-U-T.

It stands for: a sensitive test with a negative result rules out the disease.

Unpack that for me.

Why does it rule it out?

Well, think about it.

If a test is super sensitive, it almost never misses the disease.

So if you take that test and it comes back negative, you can be extremely confident you don't have the disease.

Because if you did have it, the test would have screamed positive.

It would have.

So a negative result on a sensitive test is very, very powerful for exclusion.

Okay, that makes sense.

It's like a really, really wide fishing net.

If the net comes up empty, you could be pretty sure there were no fish in that spot.

Perfect analogy.

Now the flip side, specificity.

Okay.

Specificity is the true negative rate.

It looks at the right column, the people where the disease is absent.

It asks, of all the healthy people, what percentage of them will this test correctly identify as healthy?

So a test with 100% specificity never gives a false alarm, never points a finger at an innocent person.

Right.

It ignores everyone who doesn't have the specific target it's looking for.

And the mnemonic here is SpPIn.

SpPIn.

S-P-P-I-N.

A specific test with a positive result rules in the disease.

Because if this very picky test flags you, it's rarely wrong.

Exactly.

If the test is extremely specific and it picks you out of the crowd, you probably have the disease.

It helps confirm or rule in the diagnosis.
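
For anyone who wants to see the grid as arithmetic, here is a minimal sketch in Python. The function names and the example counts are made up for illustration and do not come from the text; the only thing taken from the discussion is the layout of the four boxes (A true positive, B false positive, C false negative, D true negative).

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: box A divided by the whole 'disease present' column (A + C)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: box D divided by the whole 'disease absent' column (B + D)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for the four boxes (illustrative only).
box_a, box_b, box_c, box_d = 90, 90, 10, 810

print(f"sensitivity = {sensitivity(box_a, box_c):.0%}")
print(f"specificity = {specificity(box_d, box_b):.0%}")
```

Notice that each number uses only one column of the grid, which is why neither one depends on how common the disease is.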

The text gives a really good clinical example of this using low back pain.

I think back pain is something almost everyone can relate to.

Yes.

Specifically looking for a herniated disc or sciatica.

We have two different physical maneuvers we can do right at the bedside.

First is the straight leg raise.

You just lift the patient's straight leg while they are lying down.

This test has a sensitivity of about 92%, but a specificity of only 28%.

Okay.

So high sensitivity.

That means it's a SnNOut test.

Exactly.

So if you lift their leg and it doesn't hurt, a herniated disc becomes less likely, because this test misses only about 8% of the people who have one.

It's great for ruling it out.

It's a great screening test.

But if it does hurt, that doesn't tell you much.

Because the specificity is so low, only 28%.

Lots of things could make that hurt.

Tight hamstrings, hip arthritis, you name it.

A positive result is weak evidence.

So for that, you need the other test.

The crossed straight leg raise.

This is a really interesting one.

This is where you lift the opposite leg, the one that doesn't hurt.

Wait, the good leg?

The good leg.

If lifting the good leg causes that classic shooting pain down the bad leg, that is a very specific finding.

The sensitivity is low, only 28%.

So it misses a lot of people with the condition.

But the specificity is huge, around 80%.

Some studies say even higher.

So SpPIn.

You got it.

If that crossed straight leg raise is positive, you can be very confident in ruling in the diagnosis.

It's highly likely to be a herniated disc.

So logically, you'd use the sensitive test first to screen and rule out.

Then if that's positive or unclear, use the specific test to confirm and rule in.

That is the ideal diagnostic strategy.

You're using the strengths of each tool in the right order.

But.

There's always a but in these deep dives.

Knowing the sensitivity and specificity of a test is great for the doctor who's evaluating the test itself, but it doesn't really answer the question the patient is actually asking.

No, it doesn't.

The patient isn't asking, how good is this test at finding disease in a large population?

The patient is asking, I just tested positive.

Do I have the disease?

And that sounds like it should be the same question, but mathematically, it is a whole different beast.

It is.

That brings us to predictive values,

specifically the positive predictive value, or PPV.

This is the probability that a person with a positive test actually has the disease.

Okay, but isn't that what we were just talking about?

I mean, if the test is good, shouldn't the PPV be high?

You would think so.

But there is a crucial third variable we haven't added to the equation yet, prevalence.

Prevalence.

So how common the disease is in the first place?

Exactly.

How common is the disease in the specific group of people we are testing?

This is so important.

Okay, this is the prevalence trap.

The text has these two boxes, Box 7-4 and Box 7-5, that illustrate this perfectly.

It shows how the exact same test can be either really useful or totally useless, depending on who you are testing.

Let's walk through the math experiment.

It's really stark.

Imagine we have a test.

It's a pretty good test.

90% sensitivity and 90% specificity.

Sounds solid.

If I bought a gadget that was 90% accurate, I'd probably be pretty happy.

Okay.

Now let's run this test on population A.

In this group of 1,000 people, the disease is pretty common.

Let's say 10% of them have it.

So that's 100 sick people and 900 healthy people.

Got it.

100 sick, 900 healthy.

If we run the numbers, with 90% sensitivity, we're going to catch 90 of the 100 sick people, so that's 90 true positives.

Okay.

Now look at the 900 healthy people.

With 90% specificity, we correctly identify most of them as healthy, but we get it wrong 10% of the time.

The false positive, right?

Right.

So 10% of 900 is 90.

So we have 90 true positives from the sick group and 90 false positives from the healthy group.

Exactly.

So if you test positive in this group, you are one of 180 people who got a positive result, but only 90 of you are actually sick.

So my chance of actually having the disease, my positive predictive value, is 90 out of 180.

It's 50%.

A true coin toss.

Okay.

50% isn't amazing, but it's something.

Now let's change the scene.

Now we go to population B.

Same size, 1,000 people, but now the disease is rare.

The prevalence is only 1%.

So only 10 people in this whole group have it and 990 are healthy.

And we use the exact same test.

90% sensitivity, 90% specificity, no change.

Right.

So of the 10 sick people, we still catch 9.

That's our 9 true positives.

But now look at the 990 healthy people.

The test still has a 10% false positive rate.

What is 10% of 990?

So now we have 9 true positives and 99 false positives.

Whoa.

So if I get a positive result in this group, I'm one of 108 people, but only 9 of us are actually sick.

Correct.

So if you do the division, 9 divided by 108, the math comes out to a positive predictive value of just 8.3%.

That is wild.

A positive result on a 90% accurate test means there's less than a 10% chance I'm actually sick.

That is the prevalence trap in a nutshell.
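
If you want to check the two populations yourself, here is a rough sketch in Python. The function name is invented for this illustration; the inputs are the figures from the discussion (90% sensitivity, 90% specificity, and prevalences of 10% and 1%). It works with proportions rather than head counts, which gives the same ratio.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a person with a positive test actually has the disease."""
    true_pos = prevalence * sensitivity               # sick and flagged
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but flagged
    return true_pos / (true_pos + false_pos)


# Same test, two different populations.
ppv_a = positive_predictive_value(prevalence=0.10, sensitivity=0.90, specificity=0.90)
ppv_b = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)

print(f"Population A (10% prevalence): PPV = {ppv_a:.1%}")
print(f"Population B (1% prevalence):  PPV = {ppv_b:.1%}")
```

Nothing about the test changes between the two calls. Only the prevalence argument moves, and that alone drags the answer from a coin toss down to under one in ten.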

And this is why we don't just screen everyone for everything all the time.

In a low prevalence population like screening the general public for a rare cancer, a positive test is overwhelmingly likely to be a false positive.

And that leads to real harm, right?

It's not just a math error on a page.

It leads to what we call cascades of care.

The patient is terrified.

We order more tests, maybe invasive ones like a biopsy or a catheterization.

Those have real risks, bleeding, infection.

We spend a ton of money.

We cause immense psychological stress.

All for a disease the patient never had in the first place.

All for a false alarm.

The utility of a test depends entirely on the pre-test probability, the prevalence in that person's group.

So predictive values are super important to the patient, but they're shifty.

They change depending on where you are.

Exactly.

Which is why researchers and clinicians wanted a metric that was more stable.

Something that told us about the power of the test itself, regardless of who we are testing.

And that brings us to the likelihood ratio.

The likelihood ratio, or LR.

This is the expert's favorite metric.

It helps us avoid the mental gymnastics of recalculating PPV every single time.

So why is it better?

What does it do?

Because it combines sensitivity and specificity into a single elegant number that tells you how much should I shift my suspicion.

It doesn't tell you the final answer.

It tells you how much to move the needle.

Okay, let's define it.

Technically, it's the ratio of the probability of a finding in a diseased person versus a non-diseased person.

But the easier way to think about it is this.

Does this test result make the disease more likely or less likely?

And by how much?
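
To make that definition concrete: when the finding is a positive test, the ratio works out to sensitivity divided by (1 minus specificity), and when the finding is a negative test, it is (1 minus sensitivity) divided by specificity. Here is a small sketch in Python, assuming a simple positive-or-negative test. The function name is made up for this example, and the inputs are the 90% sensitivity and 90% specificity used earlier in this discussion.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR for a positive result, LR for a negative result)."""
    # P(positive | disease) / P(positive | no disease)
    lr_positive = sensitivity / (1 - specificity)
    # P(negative | disease) / P(negative | no disease)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.90)
print(f"LR for a positive result: {lr_pos:.1f}")
print(f"LR for a negative result: {lr_neg:.2f}")
```

Prevalence does not appear anywhere in the function, which is the sense in which the likelihood ratio describes the test itself rather than the population.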

The text gives us a great little scale to interpret these numbers in Box 7-6.

I think this is really helpful for students.

It is.

It's a cheat sheet.

First, if the LR is equal to 1, the test tells you absolutely nothing.

The probability hasn't changed.

It's useless information.

If the LR is greater than 1, the probability of disease goes up.

If the LR is less than 1, the probability of disease goes down.

And the magnitude matters.

It's not just up or down.

It's how much up or down.

A huge amount.

An LR greater than 10 is a slam dunk.

It creates a large, conclusive shift in probability.

An LR between 5 and 10 is moderate.

An LR between 2 and 5 is a small, sometimes trivial change.

And on the flip side, for ruling things out.

An LR less than 0.1 is also a large change that basically rules out the disease.

So let's go back to our woman with the gallbladder pain.

We had those physical exam findings.

Let's look at them through this lens of likelihood ratios.

This really explains the logic part we talked about at the beginning.

Right.

This clarifies why we were so unsure earlier.

So, right upper quadrant tenderness.

The text says this has an LR of 2.7.

2.7.

So looking at our scale, that's squarely in the small change category.

It increases the probability, but not by a whole lot.

Exactly.

Then we have the Murphy sign.

The LR for that is 3.2.

Still pretty small.

A little better than just general tenderness, but it's not a smoking gun on its own.

It's not a 10.

Right.

Neither of these is an LR of 10.

That explains why, even with both of them, the text said we still hadn't crossed that treatment threshold yet.

We were building a case brick by brick, but we needed the ultrasound to finish the wall.

What about if she didn't have any tenderness at all?

The absence of tenderness has an LR of 0.4.

So that's less than 1, which lowers the probability of cholecystitis.

But it's not 0.1.

So it doesn't completely rule it out either.

It just makes it a bit less likely.

I really like this because it quantifies that gray area feeling that clinicians must have all the time.

It's not just maybe, it's this moves the needle a little bit, but we need more.

Precisely.

It turns clinical intuition into a calculation.

It keeps you honest.

You can't just say, I have a strong feeling it's cholecystitis.

You have to look at the LRs and realize, actually, my exam findings alone aren't strong enough to be sure.

So we've got likelihood ratios.

Now let's apply them to breast cancer screening, which is a very real and relevant topic for millions of people.

Yes.

Box 7-8.

We have a 57-year-old woman.

She's had an abnormal mammogram and she wants to know, very reasonably, do I have cancer?

And here are the stats the text provides.

The prevalence of breast cancer for her age is about 1%.

The mammogram sensitivity is 90% and its specificity is 91%.

This looks very, very similar to our prevalence trap example from earlier.

We are dealing with a low prevalence condition, just 1%.

It does.

Now the text introduces this really cool visual tool called the Fagan Nomogram.

I love the name.

It sounds like a sci-fi weapon or something.

The Fagan Nomogram.

It's a very clever low-tech calculator.

It allows you to use likelihood ratios without doing any complex algebra.

So what does it look like?

Imagine three vertical lines, like rulers standing next to each other.

The line on the left is the pre-test probability, which is the prevalence.

The line in the middle is the likelihood ratio, and the line on the right is the post-test probability.

That's your answer.

So how do you use it?

You take a ruler or any straight edge.

You put one end on the left line at her pre-test probability, which is 1%.

Then you pivot the ruler through the middle line at the likelihood ratio for the test.

A positive mammogram has an LR of roughly 10.

Okay.

And then you just see where the ruler hits the right line.

That number is your post-test probability.

So for this woman, we start at 1% on the left.

We pass through an LR of 10 in the middle.

Where do we land on the right?

We land at about 9%.

9%.

So even with a positive mammogram, the chance she actually has cancer is less than 1 in 10.

Correct.

And that is a very hard thing to explain to a patient who just heard the words abnormal mammogram.

They hear abnormal and they think, I have cancer.
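
The ruler trick is a paper shortcut for a small piece of arithmetic: turn the pre-test probability into odds, multiply by the likelihood ratio, and turn the result back into a probability. Here is a minimal sketch in Python, using the 1% pre-test probability and the LR of roughly 10 from this example. The function name is invented for illustration.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a probability with a likelihood ratio by working in odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Left line of the nomogram: 1%. Middle line: LR of about 10.
after_mammogram = post_test_probability(0.01, 10)
print(f"Post-test probability: {after_mammogram:.1%}")
```

This lands a shade above 9%, which matches where the ruler hits the right-hand line.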

So the text actually suggests a better way to explain this.

One that avoids percentages altogether because they can be so abstract.

It suggests using natural frequencies.

Yes.

Our brains aren't wired for probability.

We're wired for counting things, for stories about people.

So instead of saying 90 % sensitivity, you tell a story with whole numbers.

Like the population examples we used earlier, just making it a crowd of people.

Exactly.

Let's look at Box 7-10.

It lays it out perfectly.

Imagine 1,000 women like this 57-year-old.

Based on the 1% prevalence, 10 of them have breast cancer.

The other 990 do not.

Okay.

We have our two groups.

Of the 10 women with cancer, the mammogram has 90% sensitivity.

So it will catch nine of them.

Those are our true positives.

Of the 990 women without cancer, the mammogram has 91% specificity, which means it has a 9% false positive rate.

So it will still flag about 89 of them by mistake.

So the total number of positive tests is 9 real ones plus 89 false alarms.

That's 98 positive tests in total.

So when you look at the patient, you say, Ms. Jones, we screened 1,000 women just like you.

98 of them got a result like yours.

Of those 98, only 9 actually turned out to have cancer.

The other 89 were perfectly fine.

That is so much clearer and honestly more reassuring than saying you have a 9% positive predictive value.

It is.

It frames it in reality.

You are one of this group.

Most of the people in this group are okay.

It reduces panic without lying.

It gives the patient a realistic view of their situation.

I love that.

It's compassionate accuracy.
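
The head-count version is easy to reproduce. Here is a rough sketch in Python that turns the percentages from this example (1% prevalence, 90% sensitivity, 91% specificity) into whole people. The function name, the rounding to whole numbers, and the wording of the printed sentence are choices made for this illustration, not something prescribed by the text.

```python
def natural_frequencies(population, prevalence, sensitivity, specificity):
    """Convert test characteristics into whole-person counts."""
    with_disease = round(population * prevalence)
    without_disease = population - with_disease
    true_positives = round(with_disease * sensitivity)
    false_positives = round(without_disease * (1 - specificity))
    return with_disease, true_positives, false_positives


women = 1000
sick, caught, false_alarms = natural_frequencies(women, 0.01, 0.90, 0.91)
flagged = caught + false_alarms

print(f"Of {women:,} women screened, {flagged} had an abnormal mammogram.")
print(f"Of those {flagged}, {caught} had cancer and {false_alarms} did not.")
```

Rounding is what makes this a story about people: 89.1 false alarms becomes 89 women, which is a number a patient can picture.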

Okay, moving on to section six.

We've talked about the tests themselves, but what about the people performing them?

The text raises a really kind of awkward question.

Do doctors actually agree with each other?

This is the problem of reproducibility, or inter-rater reliability.

If I examine you and I think I find a positive Murphy sign, and then my colleague comes in and examines you five minutes later, will they find it too?

You'd hope so.

I mean, if medicine is a science, it should be repeatable, right?

You would hope, but so much of medicine is subjective.

Our human senses are involved.

So we have to measure this agreement and we use a statistic called the Kappa score, or just K.

Kappa.

What does that measure exactly?

It measures agreement beyond chance.

Because look, if we both just guessed yes or no by flipping a coin, we'd agree 50% of the time just by accident.

Kappa tells us how much better we are doing than that random coin toss.

So how do we score it?

What's a good score?

A Kappa of zero means we are no better than random guessing.

A Kappa of one is perfect agreement.

We see exactly the same thing every single time.
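
Here is one way to see "agreement beyond chance" as a calculation. This sketch assumes the usual two-examiner form of the statistic, often called Cohen's kappa: observed agreement minus the agreement expected by chance, divided by one minus the agreement expected by chance. The function name and the patient counts are hypothetical, invented only to show the mechanics.

```python
def cohens_kappa(both_positive, only_a_positive, only_b_positive, both_negative):
    """Agreement between two examiners, corrected for chance agreement."""
    total = both_positive + only_a_positive + only_b_positive + both_negative
    observed = (both_positive + both_negative) / total

    # How often each examiner calls the sign positive.
    a_positive_rate = (both_positive + only_a_positive) / total
    b_positive_rate = (both_positive + only_b_positive) / total

    # Agreement expected if the two examiners' calls were unrelated.
    expected = (a_positive_rate * b_positive_rate
                + (1 - a_positive_rate) * (1 - b_positive_rate))
    return (observed - expected) / (1 - expected)


# Hypothetical: two doctors each examine the same 100 patients.
kappa = cohens_kappa(both_positive=40, only_a_positive=10,
                     only_b_positive=15, both_negative=35)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two doctors agree on 75 of 100 patients, but about 50 of those agreements would be expected by chance alone, so the kappa comes out at 0.5 rather than 0.75.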

The text gives a breakdown in Box 7-11.

A Kappa of 0.5 is considered moderate agreement.

Greater than 0.8 is excellent.

And what's a bad score?

Less than 0.2?

Less than 0.2 is poor, which basically means we are guessing.

The text gives some pretty humbling examples here.

It does.

For the Murphy sign, our gallbladder test, the Kappa is around 0.5.

It's moderate.

Two competent doctors will disagree a fair amount of the time on whether it's really there or not.

Maybe one presses a little harder.

Maybe the patient winces differently that time.

And what's an example of a poor one where the agreement is really low?

Listening for an S3 gallop, that's a specific extra heart sound that can be a sign of heart failure.

The Kappa for that is 0.18.

Ouch.

So basically, hearing that sound is extremely subjective.

Very.

It's a faint sound.

It's subtle.

It's hard to hear.

So relying on that finding as your only data point is incredibly risky.

On the other hand, checking for pulses in patients with arterial disease has a Kappa of 0.80.

That's excellent.

We can trust that finding much, much more.

This connects right back to what we said at the start, using physical exams as tests.

You have to know the reliability of your instrument.

And if the instrument is my ears listening for a very faint sound, you have to accept that your instrument might be noisy.

Exactly.

The text also mentions precision, which is slightly different.

Precision is about consistency.

If I test the same person repeatedly, do I get the same number?

This applies more to lab tests than physical findings.

Right.

If I send my blood to the lab and they test my cholesterol twice from the same tube, I don't want two different numbers.

You don't.

We measure that with something called the coefficient of variation.

You want that number to be very low.

High precision means consistency.

If a lab test bounces all over the place, it's not precise and it's dangerous to make decisions based on it.
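
The coefficient of variation is just the standard deviation of repeated measurements divided by their mean. Here is a minimal sketch in Python using the standard library. The cholesterol readings are hypothetical values made up for this illustration, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption on my part.

```python
from statistics import mean, stdev


def coefficient_of_variation(measurements):
    """Standard deviation as a fraction of the mean; lower means more precise."""
    return stdev(measurements) / mean(measurements)


# Hypothetical repeat cholesterol readings (mg/dL) from the same tube of blood.
steady_lab = [198, 200, 202, 199, 201]
jumpy_lab = [170, 230, 195, 215, 190]

print(f"steady lab CV = {coefficient_of_variation(steady_lab):.1%}")
print(f"jumpy lab CV  = {coefficient_of_variation(jumpy_lab):.1%}")
```

Both sets of readings average 200, so the mean alone would not tell the two labs apart. The coefficient of variation is what exposes the one that bounces around.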

We're rounding the corner here.

We've covered the exam, the math of diagnosis, and the reliability of the doctors themselves.

Let's zoom out to that big research evidence circle from our Venn diagram.

How do we actually read and judge a medical study?

This is the skill of critical appraisal.

The text references the famous Users' Guides to the Medical Literature, and it boils it down to three big questions you should ask every single time you pick up a scientific paper.

Okay.

Question number one, are the results valid?

Which basically means, can I believe this?

Or was the study rigged or flawed in some way?

This brings us to the whole topic of bias.

The text lists four big types of bias.

Let's run through them quickly because these are the traps that researchers fall into.

First up, selection bias.

This happens when the two groups you're comparing are different from the very start.

Like if you test a new drug but you give it to the healthier, younger patients and you give the placebo to the sicker, older patients.

Of course the drug will look good.

Of course, but it's not the drug, it's the selection.

The fix for this is randomization.

You flip a coin, essentially, to decide who gets what.

That washes out those baseline differences.

Okay, second type, performance bias.

This is when the groups get treated differently during the study, aside from just the drug itself.

Maybe the doctors pay more attention to the group getting the fancy new treatment because they're excited about it, so they check on them more often.

That extra care improves outcomes, not just the drug.

And the fix for that?

Blinding.

The patients shouldn't know what they're getting and if possible, neither should the doctors.

If everyone is blind, they treat everyone the same.

Okay, third, detection bias.

This is all about how you measure the outcome.

If I'm a researcher and I know you took the new drug, I might subconsciously look a little bit harder for signs of improvement.

Or if I know you took the placebo, I might be more likely to see failure.

And the fix?

Blind the assessors.

The person who is actually measuring the result, reading the x-ray or checking the blood pressure, shouldn't know which group the patient was in.

Makes sense.

And finally, attrition bias.

This is about people dropping out of the study.

If half the people taking the new drug quit because of terrible side effects and you just ignore them and only look at the people who stayed in, your drug is going to look a lot safer and more effective than it really is.

So what's the fix for that?

You can't force people to stay.

The fix is a type of analysis called intention to treat.

You count everyone who started, regardless of whether they finished or not.

If they dropped out, you still count them in the analysis in their original group.

It keeps the numbers honest.

It simulates the real world where people do stop taking their meds.
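Here is a small sketch of why that matters, comparing a completers-only analysis with intention to treat. The trial data are entirely hypothetical, and counting every dropout as "not improved" is a simplifying assumption made for this example, not a rule stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    arm: str         # group assigned at randomization
    completed: bool  # finished the study
    improved: bool   # good outcome (dropouts counted as not improved here)


def success_rate(participants, arm, completers_only):
    """Share of an arm that improved, optionally ignoring dropouts."""
    group = [
        p for p in participants
        if p.arm == arm and (p.completed or not completers_only)
    ]
    return sum(p.improved for p in group) / len(group)


# Hypothetical trial: 10 per arm, and 5 drug patients quit over side effects.
trial = (
    [Participant("drug", True, True)] * 4
    + [Participant("drug", True, False)] * 1
    + [Participant("drug", False, False)] * 5
    + [Participant("placebo", True, True)] * 4
    + [Participant("placebo", True, False)] * 6
)

for arm in ("drug", "placebo"):
    completers = success_rate(trial, arm, completers_only=True)
    itt = success_rate(trial, arm, completers_only=False)
    print(f"{arm}: completers only {completers:.0%}, intention to treat {itt:.0%}")
```

In this made-up trial the drug looks twice as good as placebo if you only count the people who stayed in, and no better than placebo once everyone who started is counted.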

OK, so once we've decided the study is valid, we move to question two.

What are the results?

And here we get into the really critical difference between relative risk and absolute risk.

This is crucial for anyone listening who reads health news headlines.

This is maybe the most important statistical distinction for a consumer of medical information.

The text uses the National Lung Screening Trial as a case study.

They compared using low-dose CT scans versus standard chest x-rays for screening heavy smokers.

What were they looking for?

The outcome was death from lung cancer.

In the CT scan group, the death rate was 1.8 percent or 0.018.

In the x-ray group, the death rate was 2.1 percent or 0.021.

OK, so 1.8 percent versus 2.1 percent.

It's a difference, but it looks pretty small.

It does look small, but let's calculate the relative risk reduction.

The ratio of the risks is 0.86.

That means the CT scan group had a 14 percent lower risk of death compared to the x-ray group.

14 percent.

Now that sounds impressive.

If I see a headline that says new scan reduces cancer death by 14 percent, I'm signing up immediately.

Right, and that is the relative risk reduction.

It makes for a great headline.

But now let's look at the absolute risk reduction.

You just subtract the two numbers, 2.1 percent minus 1.8 percent.

The difference is 0.3 percent.

Oh, so my actual personal risk of dying dropped by less than half of 1 percent?

Exactly.

Both numbers are true, but they feel profoundly different.

And this leads to my favorite statistic of all.

The number needed to treat, or NNT.

How do we calculate that one?

It is the inverse of the absolute risk reduction.

So you take one and you divide it by 0.003, and that equals about 333.

So what does that number, 333, mean in plain English?

It means we need to screen 333 people with a CT scan for three years to prevent just one lung cancer death.
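The arithmetic from the last few minutes fits in one short function. This is a sketch using the rounded death rates quoted above (0.018 and 0.021). The function and key names are made up for illustration.

```python
def risk_measures(experimental_rate, control_rate):
    """Derive the common effect-size measures from two event rates."""
    relative_risk = experimental_rate / control_rate
    absolute_risk_reduction = control_rate - experimental_rate
    return {
        "relative_risk": relative_risk,
        "relative_risk_reduction": 1 - relative_risk,
        "absolute_risk_reduction": absolute_risk_reduction,
        "number_needed_to_treat": 1 / absolute_risk_reduction,
    }


# Rounded lung cancer death rates: CT scan group versus x-ray group.
measures = risk_measures(experimental_rate=0.018, control_rate=0.021)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The same two inputs produce the 14 percent headline, the 0.3 percentage point difference, and the roughly 333 people screened per death prevented. Which one gets quoted is a choice about presentation, not about the data.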

Wow.

So 332 people get screened, get the radiation.

Maybe they get false alarms and invasive follow-ups and all that stress.

And they get no benefit in terms of preventing death.

Only one person out of that 333 is actually saved from dying of lung cancer by the screening.

Correct.

That is the reality of screening.

It puts that 14 percent benefit into a very, very sober perspective.

It helps us weigh the costs and the harms versus the benefits.

That leads perfectly into the final section of the chapter.

Communicating all of this to patients.

Because how you say it completely changes what they decide.

It's called the framing effect.

The text cites a fascinating study on this.

When patients were told a test reduced their risk by a relative 50 percent, 80 percent of them wanted the test.

Okay.

When they were told the absolute risk reduction, or the NNT, only about 40 to 50 percent wanted it.

The number of people wanting the test was cut in half just by changing how the statistic was presented.

So if a doctor really wants to sell a test or procedure, they should use relative risk.

If they want to be neutral and let the patient decide, they should probably use absolute risk or NNT.

The text is very clear that clinicians must present the data neutrally.

We aren't salesmen.

We are partners.

We have to give the patient the full picture so they can make a decision that aligns with their values.

That brings us right back to that second circle of the Venn diagram.

This is the core of shared decision making.

The text lists some frameworks like the five A's: ask, advise, assess, assist, arrange.

But the core message is, give the patient the real numbers, check their understanding, and help them align the medical science with their own personal values.

Exactly.

Maybe for one patient, the peace of mind from screening is worth the risk of a false positive.

For another, the potential stress isn't worth it.

There isn't one right answer for everyone.

There's only the informed answer for that individual.

That's it.

And tools like decision aids, pamphlets, videos, websites that are mentioned in the text can really help patients digest all this complex information.

So we have come full circle.

We started with that simple Venn diagram.

We went through probability, the math of diagnosis, sensitivity, specificity, the prevalence trap, likelihood ratios, bias in studies, and finally, how to sit down and talk about it all.

It's a lot.

But it transforms the physical exam from a simple ritual into a true scientific instrument.

And it transforms the patient from a passive recipient of care into an active participant who actually understands the odds they're dealing with.

Well said.

So here's my final thought for the listener to mull over.

We always want certainty.

We go to the doctor, and we want them to say, you definitely have this, or you are definitely fine.

But what this chapter shows so clearly is that medicine is actually the science of managing uncertainty.

It's never zero or 100.

It's always somewhere in between.

So the real question is, how comfortable are you, personally, with living in that gray zone?

That is the question at the heart of it all.

Thanks for listening to this deep dive into Bates' Guide, Chapter 7.

A warm thank you from the Last Minute Lecture team.

We'll see you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary: What this audio overview covers

Evidence-based clinical practice integrates patient history and physical examination findings as diagnostic tools that systematically alter disease probabilities through quantifiable performance metrics. The foundation of this approach rests on comparing clinical and diagnostic test results against a gold standard using contingency tables to calculate sensitivity and specificity, fundamental measures that determine a test's ability to correctly identify diseased and nondiseased individuals respectively. Mnemonics such as SnNOUT and SpPIN provide practical frameworks for clinicians to rule conditions in or out based on these performance characteristics. Predictive values extend this analysis by calculating the actual probability of disease given a positive or negative test result, though these values vary substantially with disease prevalence in the population being evaluated. Likelihood ratios offer a more flexible and clinically robust alternative by quantifying how much a test result changes the odds of disease, enabling transformation of pretest probability into posttest probability through graphical tools like the Fagan nomogram or natural frequency approaches. Reproducibility of clinical findings receives careful attention through inter-observer agreement measurement using kappa statistics, which account for agreement beyond chance levels, while coefficient of variation provides a precision metric for continuous measurements. Rigorous critical appraisal of medical literature requires identifying potential sources of systematic error including selection bias, performance bias, detection bias, and attrition bias that can distort study conclusions. Therapeutic and prevention trials are evaluated by calculating event rates in experimental and control groups, from which relative risk, absolute risk reduction, and the number needed to treat or harm emerge as clinically meaningful effect size measures. 
Throughout this evidence evaluation process, effective communication of statistical concepts to patients becomes essential for minimizing framing effects and enabling authentic shared decision-making regarding diagnostic testing and treatment options.
