Chapter 10: Comparing Two Populations or Treatments

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Have you ever found yourself, you know, wrestling with two choices wondering which one is genuinely better?

Maybe you're trying to nail down if that new study method actually boosts your grades.

Or like, which diet plan really works?

Exactly.

Or if one diet plan truly outshines another, how do we move beyond just a hunch and really know if one thing is different from another when we're staring at, well, a pile of data?

That's really the heart of it, isn't it?

When we see any difference in data, it could be test scores, survey responses, recovery rates, whatever the big question is always, is this a real underlying distinction?

Or is it simply the natural variability we expect just from random chance?

That's precisely what statistics helps us figure out, gives us the tools to tell the difference.

Welcome to the deep dive.

We're the place where we take complex information and unpack it, giving you the essential insights, maybe some surprising facts, to make you truly well informed.

And today, yeah, we're diving deep into a foundational topic for anyone wanting to make sense of comparisons,

understanding how to compare two groups using data.

And our mission today is, well, it's custom tailored for you, the AP Statistics learner.

That's right.

We're going to navigate Chapter 10 of The Practice of Statistics.

We'll break down how to compare two populations or treatment groups.

We'll cover the key statistical concepts, the vocabulary, the formulas.

All in plain language, hopefully.

Our goal isn't just for you to ace your next problem, though that'd be great.

Yeah, definitely.

But to truly grasp the why behind the what.

So you can analyze and interpret data like, well, like a pro.

OK, so to kick us off, let's unpack this really interesting mini experiment.

It's called Who Likes Tattoos?

Oh, yeah, I remember this one.

Two students, Sarah and Miranda, they wanted to see if an interviewer's appearance could actually sway survey responses.

Pretty clever.

They went to the Tucson mall, approached 60 shoppers and asked a simple question.

Do you like tattoos?

But here's their design.

Half the shoppers were interviewed by a student whose tattoos were clearly visible.

They were wearing a tank top.

Got it.

The other half saw the interviewer with tattoos hidden by long sleeves.

And crucially, they randomly assigned each shopper to one condition or the other.

That random assignment is key.

So what happened?

Well, the results were quite telling.

When the interviewer's tattoos were visible, 18 out of 30 shoppers said yes.

18 out of 30?

OK.

But when the tattoos were hidden, only 14 out of 30 said yes.

14 out of 30, a bit lower.

So the proportion of yes responses with visible tattoos was 18 divided by 30, which is 0.600.

60 percent.

And with hidden tattoos, it was 14 divided by 30, or about 0.467.

OK, so around 47 percent.

Right.

That gives us an observed difference of 0.600 minus 0.467, which comes out to 0.133.

Ah, and this brings us right to the core statistical puzzle for this whole deep dive.

Is that 0.133 difference a genuine effect of the interviewer's appearance?

Or could it just be a random fluke?

You know, the kind of variation you'd expect just from how they happen to randomly assign people.

Exactly.

So we'll use this tattoo example as we go to kind of illustrate the tools that help us answer exactly this kind of question.
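To get a feel for how much the two groups can differ by chance alone, here is a minimal simulation sketch. It assumes the interviewer's appearance has no effect at all, so the same 32 shoppers would say yes under either condition, and it simply redoes the random assignment many times. The function name, the seed, and the number of repetitions are illustrative choices, not part of the original activity.

```python
import random

def simulate_random_assignment(total_yes=32, n_total=60, n_visible=30,
                               reps=10_000, seed=1):
    """Redo the random assignment many times, assuming appearance has no effect."""
    rng = random.Random(seed)
    # 1 = shopper says yes under either condition, 0 = shopper says no.
    responses = [1] * total_yes + [0] * (n_total - total_yes)
    count_diffs = []
    for _ in range(reps):
        rng.shuffle(responses)
        yes_visible = sum(responses[:n_visible])   # first 30 go to "visible"
        yes_hidden = total_yes - yes_visible       # remaining 30 go to "hidden"
        count_diffs.append(yes_visible - yes_hidden)
    return count_diffs

diffs = simulate_random_assignment()
observed_count_diff = 18 - 14  # 4 more yes answers with tattoos visible
share_at_least_observed = sum(d >= observed_count_diff for d in diffs) / len(diffs)
print(f"Share of re-randomizations with a gap of 4 or more: {share_at_least_observed:.3f}")
```

The printed share estimates how often a gap at least as large as the observed one shows up purely from the luck of the assignment.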

All right, let's jump into the first major tool, then.

Comparing two proportions.

This is what we use when we want to compare the proportion of individuals with a certain characteristic in two different groups.

Like men versus women on some opinion.

Yeah, or like the success rates of two different treatments in an experiment.

Our main focus statistic here is the difference in the sample proportions, usually written as p-hat 1 minus p-hat 2.

Right, p-hat 1 minus p-hat 2.

And what's really fascinating here is understanding the sampling distribution of that difference.

OK, what do you mean by that?

Well, imagine we could repeat Sarah and Miranda's tattoo experiment like thousands of times.

Each time we'd get a slightly different difference in those sample proportions just by chance.

If you were to plot all those thousands of differences, they'd form a distribution, a pattern.

Ah, OK.

And does that pattern have a shape?

It does.

And under the right conditions, this distribution of differences is approximately normal.

It gets that familiar bell-shaped curve.

OK, under the right conditions, what are those?

Well, the main one here is called the large counts condition.

It basically means you need to have enough successes and failures in both of your samples,

specifically at least 10 successes and at least 10 failures in group one and at least 10 successes and 10 failures in group two.

Got it.

10 successes, 10 failures for both.

Yep.

If that holds, then the sampling distribution is roughly normal.

And the center of this theoretical distribution, its mean, is exactly the true difference in the population proportions.

p1 minus p2.

The thing we're usually trying to estimate.

Exactly.

And its variability, the standard deviation of this sampling distribution.

Well, there's a formula for that, too.

It involves the population proportions and the sample sizes.

This standard deviation formula also relies on the idea that your samples are independent and reasonably small compared to their populations.

That's the 10% condition we sometimes need to check.

OK, maybe an example helps here, like the goldfish crackers.

Perfect.

Let's say bag one of Goldfish has 25% red ones.

So p1 equals 0.25.

And bag two has 35% red ones.

So p2 equals 0.35.

OK.

If we take a random sample of, say, 50 crackers from bag one and 40 from bag two.

Different sample sizes.

OK.

Yep, that's fine.

The expected difference between our sample proportions, p-hat 1 minus p-hat 2, would be the true difference.

0.25 minus 0.35, which is negative 0.10.

So on average, we'd expect the sample from bag one to have 10 percentage points fewer red crackers than the sample from bag two.

Exactly.

And if we quickly check large counts, for bag one, 50 times 0.25 is 12.5 successes.

50 times 0.75 is 37.5 failures, both over 10.

OK.

For bag two, 40 times 0.35 is 14 successes.

40 times 0.65 is 26 failures, also both over 10.

So large counts is met.

Yep.

Which means the distribution of sample differences we'd get, if we repeated the sampling many times, would be approximately normal, centered at that true difference of negative 0.10.

OK, that makes sense.
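The Goldfish numbers can be checked with a short sketch. It uses the standard formula for the standard deviation of a difference in sample proportions, the square root of p1(1 - p1)/n1 + p2(1 - p2)/n2, which the discussion mentions but does not spell out. The function name is illustrative.

```python
import math

def diff_in_proportions_distribution(p1, n1, p2, n2):
    """Mean, standard deviation, and Large Counts check for p-hat1 - p-hat2."""
    mean = p1 - p2
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Expected successes and failures in each sample.
    counts = [n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)]
    large_counts_ok = all(c >= 10 for c in counts)
    return mean, sd, large_counts_ok

mean_diff, sd_diff, large_counts_ok = diff_in_proportions_distribution(0.25, 50, 0.35, 40)
print(f"mean = {mean_diff:.3f}, sd = {sd_diff:.4f}, large counts met: {large_counts_ok}")
```

With a standard deviation near 0.10 and a mean of negative 0.10, a sample difference of zero or even a small positive value would not be unusual for samples this small.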

But in the real world, we usually don't know the true population proportions, right?

Like with the tattoos, we don't know the true proportion of all shoppers who like tattoos under each condition.

Precisely.

So how do we estimate that true difference?

That's where confidence intervals for p1 minus p2 come in.

Our goal is to create an interval of plausible values for the true difference based on our sample data with a certain level of confidence, like 95 percent or 99 percent.

And we need conditions for this, too, I bet.

You bet.

Three key ones.

First, random.

This means the data has to come from independent random samples or from a randomized experiment like the tattoo study.

Second, the 10 percent condition.

If you're sampling without replacement from a finite population, your sample size should be no more than 10 percent of the population size.

This keeps the calculations for standard deviation accurate.

Not needed for experiments, though.

Got it.

And third.

Third is large counts again.

But wait, since we don't know the true p1 and p2, how can we check n1 times p1 and the rest?

Ah, right.

Good question.

We use our sample proportions, the p-hats, instead.

So we check that n1 times p-hat 1, n1 times (1 minus p-hat 1), n2 times p-hat 2, and n2 times (1 minus p-hat 2) are all at least 10.

We use our data as the best guess.

OK, makes sense.

Use the sample data to check the condition.

Exactly.

And because we're using sample proportions in our variability calculation, instead of the unknown true proportions, p1 and p2, we don't call it standard deviation anymore.

We call it the standard error of the difference.

Standard error.

So it's like an estimated standard deviation for our statistic.

That's a great way to think about it.

It estimates how much the difference in sample proportions typically varies from the true difference just due to random sampling or assignment.

The formula looks almost identical, just with hats on the p's inside the square root.

OK.

And the confidence interval formula itself.

It follows that general pattern you might remember.

Statistic plus or minus critical value times standard error.

So for two proportions, it's the observed difference plus or minus the critical value times that standard error we just talked about.

And that critical value depends on our confidence level, like 1.96 for 95 percent confidence.

Precisely.

And we call this whole thing a two-sample z interval for a difference between two proportions.

OK, let's apply this to the tattoo experiment.

We had p-hat 1 of 0.600 with tattoos visible and p-hat 2 of 0.467 with tattoos hidden.

The difference was 0.133, and the sample sizes were 30 each.

Right.

First, conditions. Random assignment?

Yes.

Ten percent? Not needed.

It's an experiment.

Large counts?

Let's check using the sample data.

OK.

Group one, visible: 18 successes, 12 failures, both at least 10.

Good. Group two, hidden: 14 successes, 16 failures, both at least 10.

OK.

Conditions met so we can calculate the interval.

We plug in the numbers, find the standard error, which is about 0.128, use z-star of 1.96 for 95 percent confidence, and we get an interval of approximately negative 0.117 to 0.384.

So how do we interpret that in context?

We'd say we are 95 percent confident that the interval from negative 0.117 to 0.384 captures the true difference, visible minus hidden, in the proportions of shoppers at this mall who say they like tattoos. The visible-tattoo proportion could be anywhere from about 11.7 percentage points lower to 38.4 percentage points higher.

And notice that the interval from negative 0.117 to 0.384 includes zero.

That's the key insight: zero is a plausible value for the true difference based on our interval.

So we do not have convincing evidence at the 95 percent confidence level of a real difference in response rates based on the interviewer's appearance.

So no statistical mic drop moment for this interval.

Not this time.

Yeah.

If zero had not been in the interval, say it ran from 0.05 to 0.30, then we could rule out zero as the true difference.

We'd say zero is not plausible, and therefore we would have convincing evidence of a difference.

Got it.
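For readers who want to reproduce the interval without a graphing calculator, the following is a minimal sketch of a two-sample z interval computed from the tattoo counts (18 of 30 and 14 of 30). The function name is illustrative, and the critical value comes from Python's standard-library `statistics.NormalDist` rather than a z table.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Two-sample z interval for p1 - p2, using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

low, high = two_prop_z_interval(18, 30, 14, 30)
contains_zero = low <= 0 <= high
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f}); contains zero: {contains_zero}")
```

Checking whether zero falls inside the printed interval is the quickest way to see whether "no difference" remains a plausible value.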

And a quick tip for the AP exam.

Absolutely.

Using your calculator's 2-PropZInt feature is totally fine.

Even encouraged. For the calculations, it saves time.

Definitely.

But you must always do two things on your paper.

First, name the procedure you're using, a two-sample z interval for p1 minus p2, in your Plan step.

And second,

report the calculated interval along with your confidence level and your interpretation in context in your Do and Conclude steps.

Don't just write down calculator-speak. Name it, report it, interpret it.

Got it.

Perfect.

OK, so we've estimated the difference, but sometimes the question is more direct.

Is this difference we saw actually statistically significant?

Is it unlikely to be just random noise?

Exactly.

That's where significance tests for p1 minus p2 come into play.

We're making a decision, not just estimating.

So we need hypotheses.

Yep.

The null hypothesis, H0, almost always states there's no difference.

So H0: p1 minus p2 equals zero, or you could just write H0: p1 equals p2.

They mean the same thing.

No effect, no difference.

Right.

The alternative hypothesis, Ha, specifies what kind of difference you suspect based on the question.

It could be p1 greater than p2, or p1 less than p2, or just p1 not equal to p2 if you don't have a prior expectation about direction.

OK.

And conditions.

Same as the interval.

Pretty much the same: random, 10 percent if sampling, and large counts using the sample proportions.

But I remember something being different in the calculation.

Something about pooling.

Ah, you have a good memory.

Yes, for significance tests, there's a twist, because we do the calculations assuming H0 is true.

That means assuming p1 and p2 are actually equal.

We should use all the data together to get the best single estimate of that supposedly common proportion.

So you combine the samples?

In a sense,

we calculate a pooled or combined sample proportion, usually called p-hat sub C, or just p-hat C.

It's simply the total number of successes in both samples combined divided by the total number of individuals in both samples combined.

OK.

Total successes over total people.

Exactly.

p-hat C equals x1 plus x2, over n1 plus n2.

And why do we do this pooling for the test?

Because we're starting from the assumption that the two population proportions are equal. That's H0.

Then combining the samples gives us a better estimate of that single shared value than using p-hat 1 or p-hat 2 alone.

We use this pooled p-hat C when we calculate the standard error for the test statistic.

Ah, so the standard error formula for the test is slightly different from the standard error for the interval.

Precisely.

The standard error for the significance test is the square root of p-hat C times (1 minus p-hat C) over n1, plus p-hat C times (1 minus p-hat C) over n2.

Compare that to the interval's SE, which used p-hat 1 and p-hat 2 separately.

OK.

Subtle but important difference.

Use pooled for the test, not for the interval.

You got it.

Then our standardized test statistic is a z-score.

It's calculated as the observed difference, p-hat 1 minus p-hat 2, minus the hypothesized difference, all divided by the standard error.

So z equals p-hat 1 minus p-hat 2, minus zero, over the SE, using p-hat C?

Perfect.

This z-score tells us how many standard errors our observed sample difference is away from the zero difference claimed by the null hypothesis.

And then we find the P value from that Z score.

Yep.

Using the standard normal distribution.

The P value is the probability of getting a Z score as extreme as or more extreme than the one we calculated.

Assuming the null hypothesis is true.

OK.

Let's run the tattoo example through this significance test framework.

H0: p1 minus p2 equals zero.

Let's do a two-sided alternative, Ha: p1 minus p2 is not equal to zero.

OK.

Conditions checked earlier are still good.

Now we need the pooled proportion.

Total successes: 18 plus 14, so 32.

Total individuals: 30 plus 30, so 60.

So p-hat C is 32 over 60, which is about 0.533.

OK.

Now we use that p-hat C to calculate the standard error for the test.

Right.

Plug in p-hat C of 0.533,

n1 of 30,

and n2 of 30 into the pooled SE formula, and then calculate z as 0.133 minus zero, over that standard error.

What does the Z come out to?

The calculation gives z of approximately 1.03.

OK.

z is about 1.03.

Now the P-value for a two-sided test.

We look up the area to the right of 1.03 on the standard normal curve and double it.

And that P-value comes out to be approximately 0.30.

P-value 0.30.

So how do we interpret that P-value?

It means if the interviewer's appearance truly had no effect on whether people say they like tattoos, that is,

if H0 were true, then there would be about a 30 percent chance of observing a difference in sample proportions as large as 0.133 or even larger in either direction just due to the randomness of the assignment.

A 30 percent chance.

Now we compare that to our significance level alpha.

Let's say we chose alpha equals 0.05 beforehand.

OK.

Since our P-value of 0.30 is greater than alpha of 0.05.

What's our conclusion?

We fail to reject the null hypothesis.

Exactly.

We don't have convincing statistical evidence at the alpha equals 0.05 level to conclude that the interviewer's appearance, tattoos visible versus hidden, causes a difference in the proportion of shoppers who say they like tattoos.

So even though we saw a difference of 0.133, it's plausible that a difference that large could have happened just by chance based on this P-value.

So does the test line up with the 95 percent confidence interval for these data?

Good question.

Let's check.

The interval uses the unpooled standard error, about 0.128, so the margin of error is 1.96 times 0.128, about 0.250, and the interval is 0.133 plus or minus 0.250, or roughly negative 0.117 to 0.384.

That interval includes zero, so both approaches tell the same story.

A 95 percent confidence interval corresponds closely to a two-sided test at the alpha equals 0.05 level. For two proportions the match isn't exact, because the test pools the proportions in its standard error and the interval doesn't.

Since our P-value is well above 0.05, an interval for these data that excluded zero would be a red flag.

It would mean someone should recheck the arithmetic.

Right.

With z at about 1.03, the P-value is 2 times the probability that Z is at least 1.03.

Yeah, that's about 0.30.

And the interval and the test could only disagree in a borderline case, which this isn't.

So there's no conflict here.

The P-value of 0.30 is definitely greater than 0.05.

So fail to reject H0.

OK, so based on the test, we don't have convincing evidence.

Right.

And this highlights something important.

Failing to reject H0 doesn't prove H0 is true.

It just means we didn't find enough evidence against it.

Maybe the real effect is small and we needed a bigger sample size to detect it reliably.

That's a very common reason.

Or maybe there really is no effect.

We just can't be sure based on this data alone.

Got it.
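The pooled test described above can be sketched in a few lines. This is an illustration built from the tattoo counts, with an illustrative function name; the P-value uses the standard normal distribution from Python's standard library instead of a table lookup.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided, two-sample z test of H0: p1 - p2 = 0 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)            # total successes over total individuals
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat - 0) / se              # hypothesized difference is zero
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_stat, p_value = two_prop_z_test(18, 30, 14, 30)
print(f"z = {z_stat:.2f}, two-sided P-value = {p_value:.3f}")
```

Comparing the printed P-value with the chosen alpha gives the reject or fail-to-reject decision. Note that this function pools the proportions, which is appropriate for the test but not for the confidence interval.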

OK, that covers comparing proportions.

What about comparing means like average scores or average heights?

Right.

Let's shift gears to Section 10.2, comparing two means.

This is our tool when we want to compare the average value of some quantitative variable between two independent groups.

Independent groups like comparing a random sample of men to a random sample of women.

Exactly.

Or comparing outcomes for patients randomly assigned to treatment A versus patients randomly assigned to treatment B.

The key is the individuals in one group have no connection to the individuals in the other.

And the statistic we focus on is the difference in sample means, x-bar 1 minus x-bar 2.

Precisely.

And just like with proportions, we need to understand its sampling distribution.

What kind of differences, x-bar 1 minus x-bar 2, would we get if we repeated the sampling or the experiment many times?

Does it also follow a normal distribution?

It can.

The sampling distribution of x-bar 1 minus x-bar 2 is normal if both of the original population distributions are normal.

OK.

What if they aren't normal?

Then we rely on the central limit theorem.

If both sample sizes, n1 and n2, are large enough, and we usually say at least 30 as a guideline, then the sampling distribution of the difference in means will be approximately normal, regardless of the original population shapes.

Ah, the CLT saves the day again.

So, normal if populations are normal, or approximately normal if samples are large, what about the center and spread?

The center, or mean, of this sampling distribution is just the true difference in population means, mu1 minus mu2.

Makes sense.

And the standard deviation of the difference depends on the population standard deviations, sigma1 and sigma2, and the sample sizes, n1 and n2.

The formula is the square root of sigma1 squared over n1, plus sigma2 squared over n2, again, assuming the 10% condition holds if we're sampling.

OK, but just like with proportions, we usually don't know the population standard deviations, sigma1 and sigma2.

Exactly.

That's the practical reality.

So, when we build confidence intervals or run significance tests, we have to estimate them using the sample standard deviations, s1 and s2.

And when we use sample standard deviations instead of population standard deviations?

We have to use a t distribution instead of the normal distribution, and our measure of variability becomes the standard error of the difference.

The SE is the square root of s1 squared over n1, plus s2 squared over n2.

Got it.

So, procedures for comparing means will involve t stuff, not z stuff, because we're using s1 and s2 instead of sigma1 and sigma2.

You got it.

Let's talk confidence intervals for mu1 minus mu2.

The goal is to estimate the true difference between the population means.

Conditions first.

Always.

Random: data from two independent random samples or two groups in a randomized experiment.

10% condition if sampling without replacement.

And the big one, Normal/Large Sample.

Okay.

Unpack that Normal/Large Sample condition for means.

This is crucial.

The condition is met if either both population distributions are stated to be normal or both sample sizes, n1 and n2, are large.

Typically at least 30.

What if the populations aren't necessarily normal and the sample sizes are small, say less than 30?

Ah, then you're in a trickier spot.

If a sample size is less than 30 and the population shape is unknown, you must examine graphs of your sample data for each group separately.

Look at dot plots, histograms, or box plots.

What are we looking for in those graphs?

You're looking for reasons to doubt that the underlying population could be reasonably normal.

Specifically, check for strong skewness or the presence of outliers in your sample data.

If you see those, using the t procedures isn't safe.

But if the sample graphs look roughly symmetric and have no outliers, even with small samples, then it's generally considered okay to proceed with the t procedures, assuming the populations themselves aren't drastically non-normal.

You just need to state that you checked the graphs.

Okay, so conditions met.

The formula for the confidence interval.

It's the familiar structure.

Statistic plus or minus critical value times standard error.

For two means, that's x-bar 1 minus x-bar 2, plus or minus t-star times the square root of s1 squared over n1 plus s2 squared over n2.

We call this a two-sample t interval for a difference between two means.

And that critical value comes from the t distribution.

But what about the degrees of freedom?

df.

I remember this being weird for two samples.

Yes, the degrees of freedom for the two-sample t procedures are calculated with a rather complicated formula.

The good news...

Let the calculator handle it.

Exactly.

Your calculator or statistical software will calculate the degrees of freedom using that complex formula, often called the Welch-Satterthwaite approximation.

It usually results in a non-integer df.

Always use this value.

It's the most accurate.

What if we have to do it by hand or the calculator isn't available?

There's a conservative, simpler option.

Use the smaller of n1 minus 1 and n2 minus 1 as your degrees of freedom.

This is easier to calculate but results in a slightly wider, less precise interval.

So always prefer the calculator's df if possible.

Okay, always use technology's df.

Got it.
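As a rough illustration of the two options for degrees of freedom, the sketch below computes the Welch-Satterthwaite value and the conservative value side by side. The sample standard deviations (100 and 60) are hypothetical numbers chosen only for the example, and the function names are illustrative.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom for two-sample t procedures."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2          # per-sample variance of the mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

def conservative_df(n1, n2):
    """By-hand fallback: the smaller of n1 - 1 and n2 - 1."""
    return min(n1, n2) - 1

# Hypothetical standard deviations for two samples of size 10 each.
df_welch = welch_df(s1=100.0, n1=10, s2=60.0, n2=10)
df_cons = conservative_df(10, 10)
print(f"Welch df = {df_welch:.2f}, conservative df = {df_cons}")
```

The conservative value is never larger than the Welch value, which is why it leads to a larger critical value and a wider interval.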

Example time.

How about that apartment rent comparison?

Good one.

A student samples 10 one-bedroom apartments and 10 two-bedroom apartments in a large city to compare mean rents.

For each sample we have a mean and a standard deviation: x-bar 1 and s1 for the one-bedrooms, x-bar 2 and s2 for the two-bedrooms.

Both n1 and n2 are 10.

Okay, conditions.

Independent random samples.

Assume yes.

10%?

Assume the apartments in a large city form a huge population.

So yes. Normal/Large Sample?

Ah, sample sizes are small.

Both are just 10.

So we'd need to look at dot plots or box plots of the rents for the 1BR sample and the 2BR sample.

Let's assume the problem states there's no strong skewness or outliers in either sample.

Okay, assume graphs look okay.

Conditions met, now let's calculate a 90% confidence interval for mu1 minus mu2.

We'd use the formula: x-bar 1 minus x-bar 2, plus or minus t-star times the square root of s1 squared over n1 plus s2 squared over n2.

We plug in the means and standard deviations.

We find the standard error.

Then we need t-star for 90% confidence with the calculator's degrees of freedom.

And the calculator would give us the df, maybe something like 17.8?

Exactly, something like that.

We find the t-star for 90% confidence and df of 17.8 using the invT function.

Let's use the book data.

The sample means are about $874.80 for the one-bedrooms and $952.80 for the two-bedrooms, so the difference is negative $78.

The 90% interval comes out to about negative $144.80 to negative $11.20.

How do we interpret this?

Remember, it's mu1 minus mu2, one-bedroom minus two-bedroom.

Okay, we are 90% confident that the true mean monthly rent for one-bedroom apartments in this city is between $11.20 and $144.80 less than the true mean monthly rent for two-bedroom apartments.

And again, the key insight, does the interval contain zero?

It does not.

Both endpoints are negative.

Since zero is not in the interval, we have convincing evidence at the alpha equals 0.10 level, corresponding to 90% confidence, that there is a real difference in mean rents between one- and two-bedroom apartments in this city.

Specifically, two bedrooms cost more on average.

Makes sense.

And the AP exam tip is probably similar?

Yep.

Name the procedure.

Two-sample t interval for mu1 minus mu2.

Report the interval and definitely report the degrees of freedom your calculator gave you.

Then interpret in context.

Okay.

Now for significance tests.

H0 is usually mu1 minus mu2 equals zero?

Almost always, yes.

The alternative, Ha, specifies the direction of difference you're testing for: greater than, less than, or not equal.

Conditions are the same as for the interval.

Random, 10%, Normal/Large Sample.

Correct.

And the test statistic is a t-score.

It's t equals the observed difference, x-bar 1 minus x-bar 2, minus the hypothesized difference,

all divided by the standard error,

the square root of s1 squared over n1 plus s2 squared over n2.

Exactly.

This is the two-sample t test for a difference between two means.

And we find the P-value using the t distribution with the same weird degrees of freedom from the calculator.

The very same df.

Use your calculator's tcdf function with the calculated t-score and degrees of freedom to find the P-value.

Let's try the longer workweek example.

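Comparing mean hours worked in 1975 versus 2014 using GSS data.

Okay.

For 1975: n is 764, x-bar is 38.97, and s is 13.13. For 2014: n is 1501, x-bar is 41.91, and s is 14.35.

Let's test if the mean hours worked changed.

Hypotheses: H0 says the two population means are equal, and Ha says they differ, since we're asking whether the mean changed.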

Independent random samples, assumed from the GSS.

10% condition?

Sure.

The population of U.S. workers is huge.

Normal/Large Sample?

Yes.

Both sample sizes, 764 and 1501, are way bigger than 30.

So the CLT applies.

Conditions met.

Good.

Now calculate the standard error using s1, n1, s2, and n2, then calculate the t-statistic.

t equals 41.91 minus 38.97, divided by the standard error.

Okay, the difference is about 2.94 hours.

Plugging into the formulas, what does t come out to?

The calculation yields t of approximately 4.88.

Wow, that seems like a large t-score.

Degrees of freedom?

The calculator would give something probably large.

Yeah, with those sample sizes, the df will be quite large, roughly 1,660 or so.

So the P-value for t of 4.88 with that many degrees of freedom in a two-sided test.

That's going to be tiny, right?

Extremely tiny.

The calculator yields a P-value of approximately 0.000001, like a one in a million chance.

Okay, conclusion.

Comparing to alpha equals 0.05.

Since the P-value, basically zero, is much less than alpha of 0.05, we reject H0.

So there is very strong convincing evidence of a statistically significant difference in the true mean number of hours worked per week by Americans between 1975 and 2014.

And the sample means suggest it increased.

Yes, the sample data suggests the average workweek got longer.
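The workweek calculation can be sketched from the summary statistics alone. The block below assumes the values quoted in the discussion (764 workers in 1975 with mean 38.97 and standard deviation 13.13; 1501 workers in 2014 with mean 41.91 and standard deviation 14.35). Because the Python standard library has no t distribution, the P-value is approximated by numerically integrating the t density with Simpson's rule; that helper and all function names are illustrative stand-ins for a calculator's tcdf, not a library API.

```python
import math

def welch_t_from_stats(mean1, s1, n1, mean2, s2, n2):
    """Unpooled two-sample t statistic and Welch-Satterthwaite df from summary stats."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def t_two_sided_p(t, df, width=60.0, steps=20_000):
    """Approximate two-sided P-value: Simpson's rule on the t density over [|t|, |t| + width]."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = width / steps                      # steps must be even for Simpson's rule
    total = pdf(a) + pdf(a + width)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(a + i * h)
    return 2 * total * h / 3               # double the upper tail for a two-sided test

# 2014 minus 1975, using the summary statistics quoted in the discussion.
t_stat, df = welch_t_from_stats(41.91, 14.35, 1501, 38.97, 13.13, 764)
p_value = t_two_sided_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.0f}, two-sided P-value = {p_value:.2e}")
```

The integration cutoff of 60 units past the test statistic is an assumption that is more than adequate here; for very small degrees of freedom a wider range would be safer.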

Okay, one last thing on two-sample t tests.

You mentioned pooling for proportions tests.

Is there pooling for t tests?

Ah, excellent question.

Yes, some older methods and some software offer a pooled two-sample t test.

This procedure averages the two sample variances,

s1 squared and s2 squared, and it's only appropriate if you are willing to assume that the population variances, sigma1 squared and sigma2 squared, are equal.

Assume equal population variances.

Is that usually safe?

Generally, no.

It's often not a realistic assumption.

And importantly, the standard two-sample t test, the one we just discussed that doesn't assume equal variances, works very well even when variances are different.

So the non-pooled test is better.

It's more robust, yes.

The pooled test can give misleading results if the population variances aren't actually equal.

So the strong advice for AP Statistics is, never use the pooled t test or pooled t interval for comparing two means unless you are specifically told to assume the population variances are equal, which is rare.

Always use the standard unpooled procedures your calculator defaults to.

Got it.

No pooling for means unless forced to.

Pooling is okay for proportion tests because H0 assumes p1 equals p2.

You've nailed the distinction.

Alright, now for the last section, 10.3.

Comparing two means again, but this time with paired data.

What's that about?

This is where study design becomes absolutely critical.

Paired data arises when our two sets of measurements are not independent.

They come in pairs.

Pairs.

Like how?

Two main ways.

First, you might have two measurements on the same individual.

Think before score and after score on a test.

Or measuring the flexibility of someone's left arm versus their right arm.

Okay, same person, two measurements.

What's the other way?

The second way is when you have two distinct but very similar individuals that have been intentionally matched up based on certain characteristics.

And then one member of the pair gets treatment A, the other gets treatment B.

Like using identical twins in a study.

Or matching two patients based on age and health status.

Or using two adjacent plots of land in an agricultural experiment.

So the goal is to compare the two treatments or conditions, but using these related pairs.

Exactly.

And the key idea for analyzing paired data is brilliantly simple, yet powerful.

What is it?

You don't analyze the two original lists of data separately.

Instead, you first calculate the difference within each pair.

Like after minus before.

Or twin A minus twin B.

Precisely.

For every single pair, you compute one difference value.

This transforms your two lists of data into a single list of differences.

Ah.

And then what do you do with that list of differences?

You analyze that single list of differences using the methods we already know for a single sample mean.

Specifically, you use a one-sample t interval or a one-sample t test on the differences.

So paired data analysis is just one-sample t procedures applied to the differences.

That's the secret.

It simplifies everything.

Why is this better than just treating them as two independent samples?

Because pairing often reduces variability.

Think about the identical twins IQ example.

Twins share genetics and often early environments.

Their IQs are likely to be more similar to each other than to randomly selected individuals.

If you just compared the group of high income twins to the group of low income twins using a two sample t test, the natural variation between different pairs of twins might obscure the effect of income.

You'd see lots of overlap.

But if you calculate the difference in IQ for each pair, high-income twin IQ minus low-income twin IQ, you effectively cancel out the shared genetic and background factors unique to that pair.

You're isolating the effect of income within each pair.

Exactly.

And when you look at the distribution of those differences, the effect of income becomes much clearer.

You might see almost all the differences are positive, even if the original IQ scores overlapped a lot.

It increases the power of your analysis to detect a real difference if one exists.

Wow, that's clever.

Turn a two group problem into a one group problem on the differences.

It's a very common and powerful experimental design technique.

OK, so if we're doing a confidence interval for the true mean difference, which we often call $\mu_d$, we're essentially doing a one sample t interval on that list of differences.

So the conditions must be for the differences, right?

Absolutely.

Condition one, random.

The pairs themselves should be randomly selected.

Or if it's an experiment, treatments should be randomly assigned within pairs or the order randomized, like in the caffeine study.

Condition two, 10 percent condition applies if you are sampling pairs without replacement.

Check if the number of pairs is less than 10 percent of the population of pairs.

Condition three, Normal/Large Sample, for the differences.

So the distribution of the differences needs to be approximately normal or the number of pairs needs to be large.

Correct.

And if $n_d$ is small, you need to make a plot (dot plot, histogram, or boxplot) of the differences and check for strong skewness or outliers in that plot.

Makes sense.

Check the differences. And the formula for the interval?

It's just the one sample t interval formula applied to the differences: $\bar{x}_d \pm t^* \frac{s_d}{\sqrt{n_d}}$.

Where $\bar{x}_d$ is the mean of the differences, $s_d$ is the standard deviation of the differences, and $n_d$ is the number of pairs.

Yep.

And the degrees of freedom for $t^*$ is simply $n_d - 1$, much easier than the two sample case.

Definitely.

We call this a paired t interval or a one sample t interval for a mean difference.

Both names are common.
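As a rough sketch of the interval arithmetic, the helper below applies the one-sample t interval to summary statistics of the differences. The function name `paired_t_interval` and the numbers are hypothetical, and the critical value is passed in by hand from a t table rather than computed.

```python
import math

def paired_t_interval(mean_d, sd_d, n_d, t_star):
    """One-sample t interval applied to the differences:
    mean_d +/- t_star * sd_d / sqrt(n_d)."""
    margin = t_star * sd_d / math.sqrt(n_d)
    return mean_d - margin, mean_d + margin

# Hypothetical summary statistics for 16 pairs (not from the chapter).
# t* = 2.131 is the table value for 95% confidence with df = 16 - 1 = 15.
low, high = paired_t_interval(mean_d=5.0, sd_d=2.0, n_d=16, t_star=2.131)
print(low, high)
```

Passing `t_star` explicitly keeps the sketch free of extra libraries; on a calculator or in statistical software the critical value would come from an inverse t function with $n_d - 1$ degrees of freedom.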

And for significance tests for $\mu_d$?

Same idea.

It's just a one sample t test performed on the list of differences.

Hypotheses are about the mean difference.

$H_0: \mu_d = 0$, and $H_a$ will be $\mu_d > 0$, $\mu_d < 0$, or $\mu_d \neq 0$.

Correct.

Conditions are the same as for the paired t interval.

Check randomness of the pairs or of the assignment, 10 percent if needed, and Normal/Large Sample for the differences.

And the test statistic.

It's the one sample t statistic for the differences.

$t = \frac{\bar{x}_d - 0}{s_d / \sqrt{n_d}}$.

Okay.

Find the p value using the t distribution with $df = n_d - 1$.

Exactly.

Let's look at that caffeine dependence experiment.

11 volunteers measured depression score after placebo and after caffeine.

Order randomized.

They calculated the difference.

Placebo score minus caffeine score.

They expect withdrawal (the placebo condition) to increase depression, so they expect this difference to be positive.

Right.

Sample size is $n_d = 11$, which is small.

So they'd need to check a plot of the 11 differences.

Let's assume it showed no strong skewness or outliers.

Random assignment of order was done, so conditions are met.

The sample data showed a mean difference $\bar{x}_d = 7.364$ and standard deviation of differences $s_d = 6.918$.

Okay.

Now calculate the t statistic.

$t = \frac{7.364 - 0}{6.918 / \sqrt{11}}$, which is approximately 3.53.

Now we need the p value.

It's a one-sided test.

So we find the area to the right of $t = 3.53$ in a t distribution with $df = 11 - 1 = 10$.

Okay.

tcdf with lower bound 3.53, upper bound infinity, and df 10.

What does that give?

The p value is approximately 0.0027.

It's a tiny p value compared to $\alpha = 0.05$.

Since the p value of 0.0027 is less than $\alpha$, we reject $H_0$.

Conclusion?

We have convincing evidence that the true mean difference in depression score, placebo minus caffeine, is greater than zero.

In other words, caffeine deprivation increases depression scores for caffeine dependent individuals like these volunteers.

And because it was a randomized experiment.

We can actually conclude causation.

The caffeine deprivation caused the increase in depression scores.

That's the power of randomization in experiments.

Great example.
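For anyone who wants to check the caffeine calculation without a calculator's tcdf, here is a minimal sketch that uses only the summary statistics quoted above (mean difference 7.364, standard deviation 6.918, 11 pairs). The helper names `t_pdf` and `t_upper_tail` are made up for this illustration, and the tail area is approximated with Simpson's rule rather than a library routine.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate area to the right of t (assumes t >= 0 and an even
    number of steps) by integrating the density on [0, t] with Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary statistics of the 11 differences (placebo minus caffeine).
xbar_d, s_d, n_d = 7.364, 6.918, 11

t_stat = (xbar_d - 0) / (s_d / math.sqrt(n_d))   # one-sample t on the differences
p_value = t_upper_tail(t_stat, n_d - 1)          # one-sided test, df = 10

print(round(t_stat, 2), round(p_value, 4))
```

Under these assumptions the sketch should give a t statistic near 3.53 and a one-sided p value of roughly 0.003, small enough to reject the null hypothesis at any common significance level.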

Okay.

This distinction seems really important.

How do we know when to use paired procedures versus the two sample procedures from section 10.2?

This sounds like a classic AP question.

It absolutely is.

It's one of the most common points of confusion.

Your choice of inference procedure, paired t or two sample t, depends entirely on how the data were collected, that is, on the study design.

So lay it out for us.

When do we use two sample t procedures?

Use two sample t procedures when your data come from two independent groups.

This means, one, two separate independent random samples were taken.

Example, a random sample of boys and a separate random sample of girls.

Or two, in an experiment, individuals were randomly assigned to one of two different treatment groups.

Example, treatment A or treatment B, but not both.

This is called a completely randomized design.

A key hint here.

If the two groups have different sample sizes, it must be two independent samples.

It cannot be paired.

Excellent point.

If $n_1 \neq n_2$, it's automatically a two sample situation.

OK.

And when do we use paired t procedures?

Use paired t procedures when the data are collected in pairs.

This means, one, you have two measurements on the same individual: before and after, left and right, condition one and condition two.

Or two, you have measurements on two individuals who were deliberately matched based on relevant characteristics before the experiment began.

Twins, matched patients, matched plots.

The essence is that the two measurements within a pair are related or linked somehow.

They're not independent.

Precisely.

And you analyze the differences within those pairs.

Let's test this with that "Are You All Wet?" scenario. First, scuba fins.

Each diver tests both types of fins. Paired or two sample?

Each diver provides two measurements.

That's paired data.

Analyze the differences in time for each diver with a paired t test.

OK.

Piranha fish lengths: comparing lengths from a sample caught then versus an independent sample caught now.

Two independent samples from different time periods.

Two sample t test.

Wetsuit and shark bites.

Two groups of identical drums.

One group gets wetsuits.

The other doesn't.

Randomly assigned.

Bites measured.

Two distinct groups of drums.

Random assignment to group.

That sounds like a completely randomized design.

Two sample t test.

Even though the drums are identical they aren't paired in the sense of one drum getting both treatments.

Got it.

The design dictates the analysis.

Always.
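The decision rule from this part of the discussion can be sketched as a tiny helper. The function name, its parameters, and the sample sizes plugged in for the three scenarios are all assumptions made for illustration; the only logic it encodes is what was said above.

```python
def choose_procedure(same_individual_measured_twice, deliberately_matched, n1, n2):
    """Pick between paired t and two-sample t based on the study design."""
    # Unequal group sizes cannot be paired data.
    if n1 != n2:
        return "two-sample t"
    # Two measurements per individual, or matched individuals, means paired data.
    if same_individual_measured_twice or deliberately_matched:
        return "paired t"
    # Otherwise: independent samples or a completely randomized design.
    return "two-sample t"

# The three scenarios, with hypothetical sample sizes.
fins = choose_procedure(True, False, n1=20, n2=20)       # each diver tries both fins
piranha = choose_procedure(False, False, n1=60, n2=82)   # independent samples, then vs now
wetsuit = choose_procedure(False, False, n1=8, n2=8)     # random assignment to two groups
print(fins, piranha, wetsuit)
```

The wetsuit case shows why equal sample sizes alone do not make data paired: the design, not the counts, decides.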

And let's quickly revisit that get your heart beating activity to really drive home why paired designs are often preferred if possible.

Right.

The pulse rate example.

If we did a completely randomized design, randomly assigning some students to stand and others to sit, and then compared the two groups' pulse rates.

The natural variation in resting pulse rates between different people is huge.

Some people just have higher pulse rates than others.

This noise makes it hard to see the signal, the effect of standing.

You might get a p value around point zero nine and fail to find significant evidence.

But if we use a matched pairs design where each student measures their pulse both sitting and standing.

And then we calculate the difference, standing pulse minus sitting pulse, for each student.

We cancel out that person to person baseline variability.

Now we're just looking at how much each individual's pulse changed.

Exactly.

The variability in the differences will be much much smaller.

And when you run a paired test on those differences you'll likely get a p value very close to zero.

So the paired design was much more powerful.

It was better able to detect the real effect of standing on pulse rate because it controlled for individual variation.

That's the aha moment.

By reducing variability paired designs often give you more statistical power.
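Here is a small sketch of why the paired analysis gains power. The pulse rates are made-up numbers, chosen so that people differ a lot from each other but each person's pulse rises by only a few beats when standing. The same lists are analyzed two ways purely to compare the arithmetic; in a real completely randomized design the two groups would contain different students.

```python
import math
import statistics

# Hypothetical pulse rates (beats per minute) for eight students.
sitting = [60, 72, 85, 66, 90, 78, 70, 95]
standing = [64, 78, 90, 73, 93, 84, 75, 99]
n = len(sitting)

# Analysis 1: pretend the lists are independent groups (unpooled two-sample t).
se_two = math.sqrt(statistics.variance(standing) / n + statistics.variance(sitting) / n)
t_two = (statistics.mean(standing) - statistics.mean(sitting)) / se_two

# Analysis 2: paired t, a one-sample t statistic on the differences.
diffs = [st - si for st, si in zip(standing, sitting)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)
t_paired = statistics.mean(diffs) / se_paired

print(round(t_two, 2), round(t_paired, 2))
```

Both analyses see the same average change of 5 beats per minute. The two-sample standard error is inflated by person-to-person variation, so its t statistic is below 1, while the paired standard error reflects only the variation in the changes and gives a t statistic above 10.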

Fantastic summary.

OK.

Let's try to wrap this all up today.

We've really taken a deep dive into comparing two groups.

We navigated the different methods for proportions, for independent means, and for paired means.

Yeah.

We covered how to think about sampling distributions, how to build and interpret confidence intervals, how to perform significance tests for each scenario, and, crucially, how to check those conditions.

Right.

The conditions are key for validity.

If we connect this to the bigger picture, these methods aren't just abstract formulas.

They're about making informed decisions from data.

It's about separating a real effect from random chance.

Like deciding if a new drug actually works better than a placebo, or if one teaching method is genuinely more effective, or even if seeing tattoos influences survey answers.

It gives you the tools to challenge claims and interpret research critically.

So this raises a question for you, the listener.

Now that you have these powerful tools for comparison, what two things are you curious about comparing in the world around you?

What difference would you want to investigate?

That's a great thought to leave people with.

We really hope this deep dive has brought some clarity and confidence for your AP statistics journey as you tackle Chapter 10.

Thanks for tuning in.

Thank you for joining us on this deep dive.

This has been a last minute lecture production.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Comparing two distinct groups or treatments requires selecting statistical procedures that align with how the data were collected and structured. Independent samples arise when different individuals populate each group, while paired designs occur when the same subjects provide multiple observations or when measurements are deliberately matched across groups. For independent samples involving proportions, constructing confidence intervals and performing two-sample z-tests for proportions demands calculating standard errors that reflect variability from both groups and confirming adequate sample sizes and proper random selection. When comparing means from independent samples with unknown population standard deviations, the two-sample t-test becomes the appropriate method, with the test statistic following a t-distribution and requiring verification that the populations are approximately normal or the samples are large; the unpooled procedure used in this chapter does not require equal variances. Paired data structures, including before-and-after measurements, matched-pair studies, and repeated observations within subjects, are analyzed differently through the paired t-test, which transforms the problem into examining a single sample of differences between matched observations. Sound inference depends on verifying critical conditions: observations must arise from random collection processes, sampling distributions should approximate normality, and data points must be independent within groups and between groups. P-values and confidence intervals provide the statistical framework, but their interpretation must always connect to the practical context and substantive meaning of the differences discovered. A common error involves applying two-sample methods to paired data or conflating statistical significance with real-world importance; these distinctions require careful attention.
Mastery of this material enables students to identify which procedure matches their data structure, verify the assumptions underlying each method, use technology to compute confidence intervals and test statistics accurately, and formulate conclusions that respect both mathematical evidence and problem context.
