Chapter 9: Inferences from Two Samples

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Have you ever found yourself needing to compare two things directly, you know, to make a decision?

Oh, absolutely.

It happens all the time.

Like maybe you're looking at two different health treatments or trying to figure out if a new teaching method actually works better than the old one.

Right.

The big question always is, is the difference you're seeing real?

Exactly.

Is it a significant difference or could it just be, you know, random chance?

It's fundamental.

Whether you're doing research, analyzing business data, or just trying to make sense of the news, you need to know if a difference is meaningful or just, well, noise.

Absolutely.

And that's our session today on the deep dive.

We're here to cut through all that information overload and give you the real nuggets of knowledge from some pretty complex sources.

Today, our journey takes us right into the heart of statistical comparison.

We're looking at inferences from two samples.

Yeah, we're taking a chapter from Elementary Statistics by Mario Triola.

It's a classic text and we're going to break it down into insights you can actually use.

So maybe in the past, we've focused on understanding just one group.

But today, today we're tackling situations where you want to compare two separate groups or conditions.

It's like moving from looking at one puzzle piece to seeing how two pieces fit together or maybe if they even connect at all.

And to make this really concrete, we're going to use this running example throughout our chat.

It's about comparing e-cigarettes versus traditional nicotine replacement therapies.

It's a great real world scenario, really helps illustrate these statistical methods we're talking about.

Okay, so where do we start?

Let's begin with comparing two proportions.

Right.

So this is when you're looking at success rates, basically, or percentages, frequencies, whatever you want to call it for some outcome in two different groups.

Like in our smoking example, what proportion of people quit using e-cigs versus, say, nicotine patches?

Exactly.

So for that smoking cessation study, researchers had 884 smokers.

They randomly put them into one of two groups, e-cigarettes or traditional nicotine replacement, like patches or gum.

After a year, 52 weeks, they checked the results.

In the e-cigarette group, 79 out of 438 people had quit.

That's 18%.

18%.

And in the nicotine replacement group, it was 44 out of 446 who quit.

So that's about 9.9%.

Right.

So you look at those numbers, 18% versus 9.9%.

Obviously 18 is higher.

Yeah, seems pretty clear.

But, and this is the crucial statistics question, is that difference statistically significant?

Or could you get a difference like that just by random luck, even if both methods were, deep down, equally effective?

Okay, that's where the testing comes in.

That's exactly where hypothesis testing comes in.

So how do we actually test that?

What's the first move?

Well, you start by setting up your hypotheses.

You have the null hypothesis, H0, which is kind of your default assumption.

Usually it's that there's no difference.

So the success rates are the same, p1 equals p2.

Precisely.

And then you have the alternative hypothesis, H1.

That's what you're actually trying to find evidence for.

Like maybe p1 doesn't equal p2, or maybe you suspect p1 is specifically greater than p2.

Okay, hypothesis set.

Then we need that significance level, right?

Often 0.05.

Yep, 0.05 is pretty standard.

It's your threshold.

How unlikely does the result have to be under the no difference assumption before you say, okay, I don't believe there's no difference anymore?

Got it.

And then comes the calculation.

Then you calculate a test statistic.

For comparing proportions, it's a z-score.

This z basically tells you how far away your observed difference is from zero difference measured in standard deviations.

Okay.

And that leads to the p-value.

Exactly.

The p-value is the probability of getting a result as extreme as yours, or even more extreme, if that null hypothesis of no difference were actually true.

So for our smoking study, the null is that the rates are equal, and the alternative is that they're not.

The z-score was calculated as 3.51.

What kind of p-value does that give us?

That gives us a p-value of 0.00045.

Wow, 0.00045.

That sounds incredibly small.

It is.

It's tiny.

It means there's only a 0.045% chance, less than one in 2,000, of seeing a difference this big just by random chance if the two methods were equally effective.

So it's super unlikely this is just luck.

What does that mean for our null hypothesis then?

It means we reject it, because that p-value of 0.00045 is way smaller than our significance level of 0.05.

Right.

So we conclude there is enough statistical evidence to say there's a significant difference in the success rates.

Based on this study, e-cigarettes seem more effective for quitting.
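For readers following along in writing, here is a minimal Python sketch of the two-proportion z-test just described. It assumes the pooled-proportion form of the test and a two-tailed alternative; the function name `two_proportion_z_test` is made up for illustration and is not from the textbook.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for H0: p1 = p2, using the pooled proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 both groups share one success rate, so pool the counts.
    p_pool = (x1 + x2) / (n1 + n2)
    std_error = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / std_error
    # Two-tailed p-value: area in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts quoted in the discussion: 79 of 438 (e-cigarettes), 44 of 446 (nicotine replacement).
z, p_value = two_proportion_z_test(79, 438, 44, 446)
print(f"z = {z:.2f}, two-tailed p-value = {p_value:.5f}")
```

With the counts from the study, this should land close to the z-score of 3.51 and the p-value of about 0.00045 mentioned above.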

Okay, so there is a difference.

But knowing that isn't always the whole story.

You often want to know, well, how big is the difference?

Good point.

And that's where confidence intervals come in handy.

For this study, a 95% confidence interval for the difference in proportions, p1 minus p2, goes from 0.0363 up to 0.127.

0.0363 to 0.127.

And importantly, 0 is not in that interval.

Exactly.

Since 0 isn't in there, it confirms our hypothesis test result: there is a significant difference.

But it also gives you a range for how big that difference likely is.

So the e-cigarette success rate is probably somewhere between, what, 3.6 and 12.7 percentage points higher?

That's what the interval suggests, yes.

With 95% confidence.
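The interval can be reproduced with a short Python sketch. It assumes the usual unpooled standard error for a confidence interval on p1 minus p2; the function name `diff_proportion_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Critical z value, e.g. about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std_error = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    margin = z_crit * std_error
    diff = p1_hat - p2_hat
    return diff - margin, diff + margin


# Same counts as in the smoking cessation discussion.
low, high = diff_proportion_ci(79, 438, 44, 446)
print(f"95% CI for p1 - p2: ({low:.4f}, {high:.4f})")
```

Note that the interval uses each sample's own proportion rather than the pooled one, because it does not assume the null hypothesis is true.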

That seems statistically quite strong.

But the source material we looked at makes a really important point here.

Even with an 18% success rate for e-cigs, that still means over 80% of users didn't quit.

A critical point.

So how do we square that statistical significance with, like, the practical significance or the bigger health picture?

It's a crucial distinction, and Triola emphasizes this.

Statistical significance doesn't automatically mean practical importance or that something is universally good.

Right.

Even if e-cigarettes help some people quit smoking traditional cigarettes, the source warns they shouldn't be seen as healthy.

They have their own risks, harmful ingredients, addiction, potential lung issues.

So the stats are one piece, but you always need that wider context.

Health recommendations aren't just about one number.

Absolutely.

Always consider the bigger picture, potential harms, everything.

And while we're talking pitfalls, I remember reading about a common mistake.

Something about comparing two separate confidence intervals.

Ah, yes.

Trying to see if two individual confidence intervals overlap.

Yeah, that's the one.

It's a tempting shortcut, but it's flawed.

That method is too conservative.

It often fails to detect a real difference.

You should always use the proper hypothesis test or calculate the confidence interval for the difference between the proportions.

Don't just eyeball the overlap of separate intervals.

Good tip.

So besides smoking cessation, where else might we use these two proportion tests?

Oh, they're everywhere.

Classic example is evaluating vaccine effectiveness, like the Salk polio vaccine trials, comparing the proportion of kids getting polio in the vaccine group versus the placebo group.

Makes sense.

Or maybe analyzing gender differences in certain traits.

The text mentions absolute pitch hearing, comparing the proportion of women versus men who have it.

Anytime you compare rates or percentages between two groups.

Okay, that covers proportions.

Let's switch gears now.

What about comparing averages or means?

That's Section 9-2.

Two means: independent samples.

Right.

Now we're looking at the average value of something in two groups.

But first, a really key distinction.

Independent versus dependent samples.

What's the difference there?

Independent samples are just what they sound like.

The values from one group have no natural connection or pairing to the values in the other group.

Think heights of randomly chosen men versus randomly chosen women.

Okay, they're separate pools of people.

Exactly.

Dependent samples or matched pairs, which we'll get to next, are linked, like before and after measurements on the same person or comparing test scores of twins.

Gotcha.

And there's a hint in the book too, right?

About sample sizes.

Yeah, a useful one.

If your two samples have different sizes and you haven't lost any data, they have to be independent.

If the sizes are the same, they could be dependent, but you need to check if there's a logical pairing.

Okay, so let's stick with independent samples for now.

The example used is comparing heights of US Army men from 1988 versus 2012.

The question is, are people getting taller?

Specifically, is the 1988 mean height less than the 2012 mean?

Right, so the null hypothesis is that the mean heights are equal.

μ1 equals μ2.

The alternative is that the 1988 mean is less than the 2012 mean.

μ1 is less than μ2.

And since we probably don't know the true population standard deviations for height back then.

We use a t-test, just like with a single sample mean when sigma is unknown.

Okay, so the data.

In 1988, the sample had 12 guys.

Mean height 1739.4 millimeters.

In 2012, 15 guys, mean 1777.8 millimeters.

What does the t-test tell us?

When you run the numbers, the calculated t-statistic is negative 1.679.

And the p-value associated with that?

The p-value comes out to 0.0546.

0.0546.

That's really close to our usual 0.05 cutoff.

It's actually just above it.

It is very close.

So technically, since 0.0546 is greater than 0.05, we fail to reject the null hypothesis.

That's the formal conclusion.

We fail to reject the null hypothesis.

So even though the 2012 sample average is higher, we don't have enough evidence from these specific samples to statistically support the claim that the 1988 mean was lower.

Correct.

Based on this data, the evidence isn't strong enough to make that conclusion at the 0.05 significance level.

It's kind of borderline, really.

It doesn't mean heights aren't increasing, just that these samples don't provide compelling proof.
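The discussion gives the sample means and sizes but not the sample standard deviations, so the t-statistic of negative 1.679 cannot be recomputed here. As a rough sketch of how an unpooled (Welch-style) two-sample t-statistic is formed from summary statistics, the Python below uses made-up numbers that are not the Army height data. The function name `welch_t` and the use of the Welch-Satterthwaite degrees of freedom are assumptions; the textbook may also present a simpler, more conservative degrees-of-freedom rule.

```python
from math import sqrt


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Unpooled two-sample t-statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Hypothetical summary statistics, chosen only to make the arithmetic easy to follow.
t_stat, df = welch_t(mean1=10.0, s1=2.0, n1=8, mean2=12.0, s2=2.0, n2=8)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A negative t here simply reflects that the first sample mean is below the second, which is the same direction as the 1988 versus 2012 comparison.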

It really shows how statistics demands a certain level of proof, doesn't it?

Close isn't quite enough sometimes.

It does.

And again, we can look at the confidence interval.

A 90% confidence interval for the difference in means for this data runs from negative 77.9 millimeters to plus 1.1 millimeters.

Ah, and that interval includes zero.

Right.

Since zero is a plausible value for the true difference between the means, it lines up perfectly with failing to reject the null hypothesis.

No significant difference detected here.

So the practical takeaway isn't necessarily that people aren't getting taller, but maybe we needed larger samples, or perhaps this specific population isn't representative.

Could be either or both.

Larger samples generally give you more power to detect smaller differences if they exist.

Makes sense.

And these independent means tests are used all over.

The book mentions studies on real estate agents selling their own homes.

Right.

Do they get better prices than for clients?

And that fun one about snack bowls at Super Bowl parties.

People apparently ate way more from bigger bowls.

Yeah, about 56% more.

Definitely something to keep in mind next game day.

Use a smaller bowl.

Okay, let's move on to the other side of that coin.

Matched pairs, Section 9-3.

Right.

So now we're talking about those situations where the data points are naturally linked or paired up.

Like before and after measurements.

Exactly.

Or comparing husbands and wives, twins, two different treatments applied to the same subject at different times.

The key is that there's a meaningful connection between each pair of values.

And the big advantage here is...

Control.

By pairing things up, you control for a lot of the extra variation between subjects.

If you measure the same person before and after, you've eliminated the variation between different people.

This makes the test more sensitive to detecting real differences.

Like those early Crest toothpaste trials using twins.

One twin gets Crest, the other gets a placebo.

Perfect example.

Controls for genetics, environment, lots of stuff.

Makes the comparison of the toothpaste effect much clearer.

So let's look at the example of weight reporting.

The common idea is that people, maybe men in this case, report weighing less than they actually do.

Uh-huh.

A common suspicion.

So we have data with pairs of measured weights and reported weights for, I think, eight male subjects.

What's the absolute first step here?

Crucial first step.

Calculate the difference for each pair.

You subtract reported weight from measured weight, or vice versa; just be consistent.

So measured weight minus reported weight.

Okay.

So if someone measured 152.6 pounds and reported 150, the difference is plus 2.6 pounds.

Exactly.

You do that for every single pair in your sample.

In the book's example, those differences were 2.6, 1.3, 4.8, 0.5, 9.9, 0.3, negative 8.6 (one person reported higher), and 0.6 pounds.

Got it.

So now we have a single list of differences.

What do we do with that?

You treat that list of differences as a single sample, and you test if the mean of those differences, called μd (mu sub d), is significantly different from zero.

Oh, okay.

So the null hypothesis is H0: μd equals zero.

On average, there's no difference between measured and reported weight.

Correct.

And the alternative, if we suspect underreporting, would be H1: μd is greater than zero.

Measured weight is greater than reported, so the difference is positive on average.

All right.

So for these eight differences, what did the stats show?

The average difference, d-bar, was 1.425 pounds.

Running a t-test on these differences gives a t-statistic of 0.778.

And the p-value for that?

The p-value is 0.231.

0.231.

Okay.

That's definitely bigger than 0.05.

Much bigger.

So again, we fail to reject the null hypothesis.

Correct.

Based on this small sample, there isn't sufficient evidence to conclude that men significantly underreport their weight on average.

That kind of feels surprising, doesn't it?

It seems like such a common belief.

It does.

And it's a great example of how data can challenge our intuitions.

What seems obvious sometimes just doesn't hold up when you actually measure it.

Our gut feelings can be wrong.

Yeah.

Definitely a lesson there.

And the confidence interval?

A 90% confidence interval for the mean difference, μd, goes from negative 2.05 pounds to plus 4.90 pounds.

And again, that interval contains zero.

Yep.

Reinforces the conclusion.

The true average difference could plausibly be zero or even slightly negative based on this data.
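The matched-pairs arithmetic can be checked with a few lines of Python. This is only a sketch: the eight differences are entered as best they can be read from the discussion (they average to the stated 1.425 pounds), and the critical value 1.895 is the standard t-table value for a 90% interval with 7 degrees of freedom, hard-coded here as an assumption rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Measured minus reported weight, in pounds, for the eight subjects.
differences = [2.6, 1.3, 4.8, 0.5, 9.9, 0.3, -8.6, 0.6]

n = len(differences)
d_bar = mean(differences)   # mean of the paired differences
s_d = stdev(differences)    # sample standard deviation of the differences
std_error = s_d / sqrt(n)

# Test statistic for H0: mu_d = 0.
t_stat = d_bar / std_error

# 90% confidence interval; 1.895 is the table value for df = n - 1 = 7.
T_CRIT_90_DF7 = 1.895
margin = T_CRIT_90_DF7 * std_error
ci_low, ci_high = d_bar - margin, d_bar + margin

print(f"d-bar = {d_bar:.3f}, t = {t_stat:.3f}")
print(f"90% CI for mu_d: ({ci_low:.2f}, {ci_high:.2f})")
```

Once the differences are computed, everything else is a one-sample procedure, which is the point made in the conversation above.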

Okay.

Very clear.

Now, we've compared proportions.

We've compared means.

What about comparing how spread out the data is?

Section 9-4, two variances or standard deviations?

Right.

Sometimes you're not just interested in the average, but in the consistency or variability within each group.

Why would that matter?

Give me an example.

Well, imagine you're comparing two machines that fill bags of chips.

Both might average the correct weight, say 10 ounces.

But if one machine fills bags anywhere from 8 to 12 ounces and the other consistently fills them between 9.9 and 10.1 ounces.

The second one is much more consistent.

Less variation.

You'd probably prefer that one.

Exactly.

Less variation often means better quality control or maybe less risk if you're talking about financial investments with the same average return.

So how do we test if the variances or standard deviations, since they're related, of two populations are different?

We use something called the F-test, named after the statistician Sir Ronald Fisher.

The F-test.

Okay.

Is there anything special we need to know about this one?

Yes.

And this is really important.

The F-test has a very strict requirement.

Both populations you're comparing must have normal distributions.

Normal distributions.

Like the bell curve.

Exactly.

And the F-test is not robust to violations of this assumption.

If your data isn't normally distributed, the F-test results can be seriously misleading.

Unlike the t-test, which handles some non-normality pretty well, especially with larger samples.

Wow.

Okay.

So you really need to check for normality before using an F-test.

That seems like a major catch.

It is.

It's a common pitfall.

You can't just assume normality.

Does that sensitivity ever make it useful or just risky?

If you know you have normal data, it's a perfectly good test.

But because real-world data often isn't perfectly normal, statisticians frequently look for more robust alternatives, tests that aren't so sensitive to that normality assumption.

Like what?

Are there other ways?

Yeah.

The book mentions a couple, like the count five method or the Levene-Brown-Forsythe test.

These are designed to be less affected if your data isn't perfectly bell-shaped.

Good to know there are options.

Let's quickly look at the example.

Testing if the variation in army weights changed between 1988 and 2012.

Okay.

The null hypothesis here is that the population variances are equal.

The alternative could be that they're not equal.

How's the F-statistic calculated?

It's a ratio of the two sample variances.

Conventionally, you put the larger sample variance on top.

So F equals s1 squared over s2 squared, where s1 squared is the larger sample variance.

Got it.

For the weights, the 1988 standard deviation was about 12.4 kilograms and 2012 was about 15.5 kilograms.

So the 2012 variance is larger.

Right.

Calculating the F-statistic gives you 1.5638.

And the p-value for that?

Using a two-tailed test, the p-value is 0.4779.

0.4779.

That's quite large, much bigger than 0.05.

Definitely.

So we fail to reject the null hypothesis.

Meaning we don't have enough evidence to say the variation in weights changed between 1988 and 2012.

Exactly.

Based on these samples, there's no significant evidence that the standard deviations or variances are different.

The spread seems to have remained similar.
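The F-statistic itself is simple to form. The Python sketch below follows the larger-variance-on-top convention described above, using the rounded standard deviations quoted in the conversation; the function name `f_statistic` is hypothetical. The sample sizes are not stated in the discussion, so the p-value is not recomputed here.

```python
def f_statistic(s_a, s_b):
    """Ratio of two sample variances with the larger variance in the numerator."""
    var_a = s_a ** 2
    var_b = s_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


# Rounded sample standard deviations (kg) quoted for 1988 and 2012.
f_stat = f_statistic(12.4, 15.5)
print(f"F = {f_stat:.4f}")
```

Because the inputs are rounded, the result comes out slightly different from the quoted 1.5638, which was presumably computed from unrounded sample values.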

Okay.

That brings us to the last main section, 9-5, which sounds pretty modern.

Resampling.

Using technology for inferences.

Yes.

This is a really powerful set of techniques that rely heavily on computers.

Why do we need these if we have all those traditional tests we just talked about?

Well, because those traditional tests, like the t-test and especially the F-test, often come with assumptions like normality or minimum sample sizes.

And real data doesn't always play by the rules.

Pretty much.

Real world data can be messy.

Resampling methods give us ways to make inferences without relying so heavily on those theoretical assumptions.

They're more flexible.

Okay.

So what are the main types of resampling?

I see bootstrapping and randomization.

Those are the two big ones discussed here.

They serve different purposes.

Bootstrapping is mainly used for creating confidence intervals.

How does that work?

Conceptually?

With bootstrapping, you take your original sample and you repeatedly draw new samples from it with replacement, meaning you can pick the same data point multiple times in one bootstrap sample.

Okay.

So you're creating lots of new samples based on your original one.

Exactly.

Thousands of them.

You calculate your statistic, like the difference in means or proportions, for each bootstrap sample.

The distribution of those thousands of statistics gives you an empirical basis for your confidence interval.

It shows you the variability of your statistic based purely on the data you have.
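Here is one way the bootstrap idea might look in Python for the smoking cessation proportions. It is a sketch under stated assumptions: it uses the simple percentile method for the interval, 2,000 resamples, and a fixed random seed so the run is repeatable. The function name `bootstrap_diff_proportions` is made up for illustration.

```python
import random


def bootstrap_diff_proportions(x1, n1, x2, n2, reps=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for p1 - p2."""
    rng = random.Random(seed)
    # Rebuild each sample as a list of 1s (quit) and 0s (did not quit).
    group1 = [1] * x1 + [0] * (n1 - x1)
    group2 = [1] * x2 + [0] * (n2 - x2)
    diffs = []
    for _ in range(reps):
        # Resample each group WITH replacement, keeping the original sizes.
        resample1 = rng.choices(group1, k=n1)
        resample2 = rng.choices(group2, k=n2)
        diffs.append(sum(resample1) / n1 - sum(resample2) / n2)
    diffs.sort()
    alpha = 1 - confidence
    low = diffs[int((alpha / 2) * reps)]
    high = diffs[int((1 - alpha / 2) * reps) - 1]
    return low, high


boot_low, boot_high = bootstrap_diff_proportions(79, 438, 44, 446)
print(f"Bootstrap 95% CI for p1 - p2: ({boot_low:.4f}, {boot_high:.4f})")
```

The exact endpoints vary with the seed and the number of resamples, so they should be read as approximate.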

Interesting.

And randomization, is that for hypothesis testing?

Yes, primarily.

With randomization, you take all the data from both original groups, shuffle it all together, and then randomly deal it back out into two groups of the original sizes.

And why do that?

You're simulating the null hypothesis.

If there's truly no difference between the groups, then shuffling the labels shouldn't matter much.

You repeat the shuffling thousands of times and calculate the difference for each shuffled version.

This builds a distribution of what differences look like if the null is true.

Ah, I see.

And then you compare your original actual difference to that simulated no difference distribution.

Precisely.

If your original difference is way out in the tails of that randomization distribution, meaning it's very unusual under the null hypothesis, then you have strong evidence to reject the null.

You get a p-value based on how often the shuffled results are as extreme as your actual result.
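The shuffling procedure can be sketched in Python as follows, again using the smoking cessation counts. The assumptions are a two-tailed comparison, 2,000 shuffles, and a fixed seed; the function name `randomization_test_proportions` is hypothetical.

```python
import random


def randomization_test_proportions(x1, n1, x2, n2, reps=2000, seed=0):
    """Two-tailed randomization test for a difference in proportions."""
    rng = random.Random(seed)
    observed = x1 / n1 - x2 / n2
    total_successes = x1 + x2
    # Pool every outcome from both groups: 1 = quit, 0 = did not quit.
    pooled = [1] * total_successes + [0] * (n1 + n2 - total_successes)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        # Deal the shuffled outcomes back into groups of the original sizes.
        successes1 = sum(pooled[:n1])
        successes2 = total_successes - successes1
        diff = successes1 / n1 - successes2 / n2
        if abs(diff) >= abs(observed):
            extreme += 1
    # Share of shuffles at least as extreme as the real result.
    return observed, extreme / reps


observed_diff, rand_p_value = randomization_test_proportions(79, 438, 44, 446)
print(f"observed difference = {observed_diff:.4f}, randomization p-value = {rand_p_value:.4f}")
```

With only a few thousand shuffles, a very small p-value may print as 0.0000 simply because no shuffle was as extreme as the observed difference; more shuffles give a finer estimate.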

That feels quite intuitive, actually.

You're literally seeing how likely your result is if there's no real effect just by shuffling the data around.

It is very intuitive.

It moves away from relying on theoretical formulas and distributions like the t-distribution or z-distribution, and instead builds the distribution directly from your data.

It simulates the uncertainty.

And I assume this takes a lot of computing power.

Oh yeah.

Doing thousands of resamples by hand would be impossible.

This is where software like StatCrunch or even Excel add-ins becomes essential.

Can we quickly see how these apply to our examples, like the smoking cessation proportions?

Sure.

Bootstrapping the difference in proportions gives a confidence interval very similar to the traditional method, like .0363 to .123, still not containing zero.

Reinforces the difference.

And randomization testing gives an extremely small p-value, like .000, confirming that the observed 18% versus 9.9% is highly unlikely to be due to chance.

What about the army heights, the one that was borderline?

Resampling there might also give borderline results.

The bootstrap confidence interval might still contain zero, and the randomization p-value might hover around .05.

It reinforces that the evidence in that particular data set just wasn't compelling either way.

Sometimes the answer really is maybe.

And the matched pairs weight example.

Similar story.

Resampling methods would likely confirm the earlier finding.

The confidence interval for the mean difference would include zero, and the p-value would be well above .05, indicating no significant difference.

So these modern methods often confirm the traditional ones, but they seem more robust, especially if the assumptions are shaky.

That's a good way to put it.

They provide a different, often more intuitive way to understand the uncertainty and significance grounded directly in the data itself.

Wow, okay.

We have covered a lot of ground today.

It feels like quite a journey.

It definitely was.

We went from comparing simple percentages, like quit rates, to looking at averages in independent groups, like heights.

Then diving into matched pairs, like the weight reporting.

And even comparing the spread or variation in data with the F -test, keeping that crucial normality assumption in mind.

And wrapping up with those powerful resampling techniques enabled by technology.

Bottom line seems to be that this deep dive has given us, and hopefully you listening, a toolkit.

Tools to analyze claims about two groups, to figure out if differences are real or just random noise.

Absolutely.

It equips you to be a more critical consumer of information in this data-flooded world we live in.

So maybe a final thought for everyone to chew on.

As you go about your week, start noticing how often claims are made by comparing two groups.

Maybe it's in the news, an advertisement, or a work presentation.

Two teaching methods, two medical treatments, two marketing strategies.

Exactly.

And now you can start asking critical questions.

What are they comparing?

Means?

Proportions?

Is it independent or matched?

What assumptions are they making?

Maybe implicitly.

And what might the data really be telling us underneath the headline?

Keep that statistical lens handy.

It's incredibly useful.

It really is.

Well, thank you so much for joining us on this deep dive into making inferences from two samples.

It was a pleasure.

Hope it was helpful.

We certainly hope you continue to explore these ideas and apply them.

Until next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Comparing two populations requires specialized statistical techniques that differ fundamentally from single-sample inference, and this material develops the conceptual and computational foundation for such comparisons. The distinction between independent and dependent sample structures shapes everything that follows: independent samples arise when observations in one group have no systematic relationship to observations in another group, while dependent samples occur when data points are naturally matched or come from repeated measurements on identical subjects. For independent samples with known population standard deviations, two-sample z-tests provide a direct testing approach, but real applications almost always require two-sample t-tests that estimate these parameters from the data itself. A major practical advancement involves the separate variance method, which removes the assumption that both populations have equal variances, a requirement often violated in genuine datasets and unnecessarily restrictive for modern statistical practice. Dependent or paired samples present a fundamentally different scenario where the natural matching within the data can be leveraged to reduce noise and increase the sensitivity of hypothesis tests; paired-sample t-tests accomplish this by focusing on within-pair differences rather than between-group comparisons. When comparing proportions across two populations, researchers use pooled proportion estimators that combine information from both samples under the null hypothesis of equal population parameters, with inference conducted through two-proportion z-tests. Successful application requires careful decision-making about which test procedure suits a particular study design, verification that underlying assumptions hold, and thoughtful interpretation of P-values and confidence intervals in context. 
Students must learn to confirm approximate normality in the sampled data, ensure sample sizes are adequate for the chosen procedure, and compute or understand degrees of freedom adjustments that depend on the variance structure and sample structure. Applications spanning clinical trials, survey research, and organizational assessment illustrate how these inferential tools support evidence-based conclusions about whether observed differences between groups reflect genuine population differences or arise from random sampling variation.
