Chapter 11: Inference for Distributions of Categorical Data
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Hey everyone, have you ever, you know, opened a bag of M&Ms and kind of wondered if Mars, Inc.
is really giving you the color mix they claim?
Or maybe you've seen, like, an election poll showing opinions across different age groups and thought, hmm, is that really how it breaks down?
Well, these scenarios, they're all about categorical data.
And today we're doing a deep dive into how to analyze this kind of data using some
really powerful statistical tools.
Welcome to the deep dive.
Our mission today is to help you get a solid handle on Chapter 11 from The Practice of Statistics, 6th edition.
This chapter introduces a family of tests called chi-square tests.
Think of this as your shortcut to understanding these crucial tools, especially if you're prepping for the AP stats exam.
We'll break down the key ideas, the vocabulary, the formulas, show you how they work in the real world, and importantly, point out common mistakes to avoid.
So up to this point, we've dealt with categorical data before, right?
Like testing a single proportion or maybe comparing two proportions in chapters nine and ten.
Usually just success/failure or comparing two groups.
That's right.
But what happens when things get a bit more complex?
What if you have a single variable, but it has, say, three or four or even six categories like those M&M colors?
Ah, OK.
More than just two options.
Exactly.
Or what if you want to compare how something is distributed across more than two groups?
Maybe comparing product preferences across three different regions or treatment effectiveness across four different drug dosages.
Right.
You can't just use a two-sample z-test for that.
Precisely.
That's where these chi-square tests come in.
They're designed specifically for situations with one categorical variable having two or more categories, or when you're comparing distributions across two or more populations or treatments.
They let us handle this richer data.
OK, so let's make this concrete.
The book uses this "Candy Man Can" activity with M&Ms.
Mars claims specific percentages, right?
12.5% each for brown, red, yellow, and green, and then 25% each for orange and blue.
The idea is you grab a random sample, let's say 60 candies.
You count how many of each color you actually got.
Those are your observed counts.
Then you compare that to what you'd expect to get if the company's claimed percentages were perfectly true for your sample size.
And the chi-square test statistic, the symbol is χ², that's the key calculation here. It essentially boils down the total difference between what you observed and what you expected into a single number.
So a bigger chi-square number means a bigger overall difference.
Exactly.
A larger χ² suggests a bigger gap between reality, your sample, and the claim.
It gives you stronger evidence against that claim.
We can even think about simulation.
Like imagine doing this over and over.
Right.
Take loads of samples, calculate χ² for each.
If you plot those χ² values, you'd see a distribution.
Then you can see just how unusual your sample's χ² value is.
That leads right into the idea of a p-value,
the probability of getting a difference as big as yours or bigger, just by random chance, if the claim is true.
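The simulation idea just described can be sketched in a few lines. This is only an illustration, not the textbook's activity: the seed, the number of simulated samples, and the "observed" statistic of 10.0 are all made-up assumptions, while the color percentages are the ones quoted above.

```python
import random

# Claimed color proportions quoted in the discussion above.
CLAIMED = {
    "brown": 0.125, "red": 0.125, "yellow": 0.125,
    "green": 0.125, "orange": 0.25, "blue": 0.25,
}

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)

def simulate_chi_squares(n_candies, n_samples, rng):
    """Draw many samples that follow the claim and return each sample's statistic."""
    colors = list(CLAIMED)
    weights = [CLAIMED[c] for c in colors]
    expected = {c: n_candies * CLAIMED[c] for c in colors}
    stats = []
    for _ in range(n_samples):
        draws = rng.choices(colors, weights=weights, k=n_candies)
        observed = {c: draws.count(c) for c in colors}
        stats.append(chi_square(observed, expected))
    return stats

rng = random.Random(11)  # arbitrary seed so the run is repeatable
stats = simulate_chi_squares(n_candies=60, n_samples=2000, rng=rng)

my_statistic = 10.0  # hypothetical value from one real sample (an assumption)
share_as_extreme = sum(s >= my_statistic for s in stats) / len(stats)
print(f"Share of simulated statistics >= {my_statistic}: {share_as_extreme:.3f}")
```

The printed share is a simulation-based estimate of the p-value: the smaller it is, the more unusual the sample would be if the claim were true.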
Okay, so the chi-square value measures the difference.
How do we formally decide if that difference is significant?
Is it just chance, or is something else going on?
This brings us to the first specific test.
The chi-square test for goodness of fit.
You use this one when you have a single categorical variable and you want to test if its distribution matches some hypothesized distribution.
Perfect example.
The M &M colors.
Or testing if a die is fair, meaning do all six sides come up about one-sixth of the time.
That's a goodness of fit question.
And we use the familiar four-step process for significance tests, yeah.
State, plan, do, conclude.
Absolutely.
Consistency is key in stats.
Okay, step one.
State.
You need your null hypothesis, H0, and your alternative, Ha.
And you state these in words, in context.
For the M&Ms, H0 would be something like, the company's claimed color distribution for M&Ms is correct.
Or, the true distribution is the same as the claimed one.
Yeah, and Ha is the opposite.
The true distribution of M&M colors is not the same as the claimed distribution.
Right, and an important AP exam tip here.
For Ha, don't say all the proportions are wrong.
Just say the distribution isn't the claimed one, or at least one proportion is different.
That's safer and more accurate.
The proportions still have to add up to one after all.
Good tip.
Okay, step two.
Plan.
This is where we check conditions.
Three big ones for chi-square tests.
First, random.
Data needs to come from a random sample or randomized experiment, standard stuff.
Second, the 10% condition.
If you're sampling without replacement, which you usually are with things like M&M bags, your sample size, lowercase n, should be no more than 10% of the total population size, capital N.
Keeps the selections independent enough.
And third, the crucial one, large counts.
This one's a bit different.
All expected counts have to be at least five, not the observed counts, the expected ones.
Yes, that's super important.
And on the exam, you absolutely need to show you calculated these expected counts and check that every single one is five or more.
Don't just say large counts condition met.
List them out.
Right.
Which brings up the question, how do we calculate those expected counts?
It's actually pretty straightforward.
For any category in a goodness of fit test, the expected count is just the total sample size n times the hypothesized proportion p_i for that category.
So for our 60 M&Ms, if brown is claimed to be 12.5%.
You'd calculate 60 times 0.125, which equals 7.5.
So you'd expect 7.5 brown M&Ms on average, if the claim is true.
Okay, 7.5, even though you can't actually have 7.5 M&Ms.
Exactly.
Expected counts are theoretical averages.
They don't have to be whole numbers.
And definitely do not round them during your calculations.
Keep the decimals.
Got it.
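As a quick illustration of the expected-count rule (sample size times hypothesized proportion) and the Large Counts check, here is a minimal sketch using the 60-candy sample and the claimed percentages from the conversation. The variable names are just illustrative.

```python
# Claimed proportions from the discussion above; n is the sample size.
claimed = {
    "brown": 0.125, "red": 0.125, "yellow": 0.125,
    "green": 0.125, "orange": 0.25, "blue": 0.25,
}
n = 60

# Expected count = n * hypothesized proportion. Do not round.
expected = {color: n * p for color, p in claimed.items()}

# Large Counts condition: every EXPECTED count must be at least 5.
large_counts_met = all(count >= 5 for count in expected.values())

for color, count in expected.items():
    print(f"{color:>6}: {count}")
print("Large Counts condition met:", large_counts_met)
```

Listing every expected count, as the loop does, mirrors the advice to show them all rather than just asserting the condition is met.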
Okay, so conditions checked, expected counts calculated.
Now we do the main calculation, the chi-square test statistic itself.
What's that formula again?
The formula is χ² = Σ (observed count − expected count)² / expected count.
The sigma means you sum that up for every category, right?
Yeah, precisely.
For each color of M &M, you find the difference between observed and expected, square it, divide by the expected, and then add all those pieces together.
Why square the difference, and why divide by expected?
Good questions.
We square the difference for two main reasons.
First, it makes all the contributions positive, so differences don't cancel out.
Second, squaring gives more weight to larger differences.
Being off by four feels much more significant than being off by two.
Okay, makes sense.
And dividing by expected.
That puts the difference in perspective.
Think about it.
Being off by five candies isn't a big deal if you expected 100, but it's huge if you only expected eight.
Dividing by the expected count standardizes the squared difference relative to what you expected for that category.
It tells you about the relative size of the surprise.
I see.
So a difference of, say, 4.5 from an expected 7.5 for brown M&Ms is actually a bigger deal statistically than a difference of 5 from an expected 15 for blue M&Ms, even though 5 is numerically larger than 4.5.
You got it.
The contribution to χ² from brown would be larger in that case.
And remember, use the actual counts in this formula, not proportions or percentages.
For the AP exam, show the calculation for the first couple of terms, then you can use "+ ..." notation.
Counts, not proportions.
Got it.
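The "relative size of the surprise" comparisons mentioned above can be checked directly. This sketch uses a hypothetical helper named `contribution`, and the observed counts are chosen only so that the differences match the ones in the conversation.

```python
def contribution(observed, expected):
    """One category's piece of the chi-square sum."""
    return (observed - expected) ** 2 / expected

# Off by 5 when 100 were expected vs. off by 5 when only 8 were expected.
off_by_5_of_100 = contribution(105, 100)
off_by_5_of_8 = contribution(13, 8)

# Brown: off by 4.5 from an expected 7.5. Blue: off by 5 from an expected 15.
brown = contribution(12, 7.5)
blue = contribution(20, 15)

print(f"off by 5, expected 100: {off_by_5_of_100:.3f}")
print(f"off by 5, expected 8:   {off_by_5_of_8:.3f}")
print(f"brown: {brown:.3f}, blue: {blue:.3f}")
```

Brown's term comes out larger than blue's even though blue's raw difference is bigger, which is the point of dividing by the expected count.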
What else do we need in the do step?
Degrees of freedom?
Yes, degrees of freedom, or df.
For goodness of fit, it's super simple.
df equals number of categories minus 1.
So for six M&M colors, df = 6 − 1 = 5.
Correct.
And this df value is important because it determines the exact shape of the chi-square distribution we use to find the p-value.
Right, the chi-square distribution.
It's always skewed to the right, right?
Only positive values.
Yep, and it gets less skewed and more spread out as the degrees of freedom increase.
You use this distribution along with your calculated χ² value and the df to find the p-value.
Which is the probability of getting a chi-square statistic as big as ours, or even bigger, if the null hypothesis is actually true.
Exactly.
You can use Table C in the textbook, which gives you a range for the p-value, or use a calculator function like χ²cdf for a more precise value.
For instance, if our M&M sample gave χ² = 9.8 with df = 5, the calculator gives a p-value around 0.081.
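For readers without a calculator handy, the upper-tail area can also be computed with the Python standard library. This is a sketch, not the calculator's algorithm: the function name `chi_square_p_value` is made up, and it relies on the standard closed-form expressions for the chi-square upper tail when df is a whole number.

```python
import math

def chi_square_p_value(statistic, df):
    """P(X >= statistic) for a chi-square distribution with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    half = statistic / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum of x^k / (1*3*...*(2k+1))
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= statistic / (2 * k + 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * statistic / math.pi) * math.exp(-half) * total

categories = 6
df = categories - 1  # goodness of fit: number of categories minus 1
p_value = chi_square_p_value(9.8, df)
print(f"df = {df}, p-value = {p_value:.3f}")
```

With the statistic 9.8 and df = 5 from the example, this gives a p-value of about 0.081, matching the value quoted above.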
Okay, final step, conclude.
We compare that p-value to our significance level, alpha, which is usually 0.05, unless stated otherwise.
Right.
If p-value < α, we reject H0.
If p-value ≥ α, we fail to reject H0.
So for our M&M example with p equals 0.081, since that's greater than 0.05.
We would fail to reject H0.
And then we state that conclusion in context.
Since our p-value of 0.081 is greater than α = 0.05, we fail to reject H0.
We do not have convincing evidence that the true color distribution of M&Ms in this bag is different from the distribution claimed by Mars, Inc.
Perfect.
And remember that important distinction.
Failing to reject H0 doesn't prove H0 is true.
It just means our sample didn't give us enough evidence to say it's false.
Maybe it is false, maybe it isn't, but our data wasn't strong enough to make that call.
That M&M example really helps clarify the goodness of fit idea.
But you mentioned it's not just for candy.
Let's talk about that NHL player birthday example from Malcolm Gladwell's Outliers.
Ah, yes.
The theory that being born earlier in the year gives kids an edge in age-grouped sports like hockey.
So we can test this using goodness of fit.
We'd hypothesize a uniform distribution of birthdays across the four quarters of the year, right?
Exactly.
Yeah.
H0 would be:
Birthdays are uniformly distributed.
25% in each quarter.
Ha would be:
Birthdays are not uniformly distributed.
Let's say we sample 80 NHL players.
Under H0, we'd expect 20 players born in each quarter.
Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec.
But maybe we observe something different, like 32 players born in Jan-Mar and only 12 in Oct-Dec.
If you run the numbers for those observed counts against the expected 20s, you might calculate a chi-square statistic of say 11.2.
With four quarters, the degrees of freedom would be 4 − 1 = 3.
Okay.
χ² = 11.2, df = 3.
What's the p-value?
The p-value for that comes out to be about 0.011.
Ah, so 0.011 is less than 0.05.
Correct.
So we reject H0.
Our conclusion.
We have convincing evidence that the birthdays of NHL players are not uniformly distributed throughout the year.
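The whole hockey test can be run end to end in a short sketch. One assumption to flag: the conversation only gives the first and last quarter counts (32 and 12), so the middle two counts (20 and 16) are filled in here as an assumption, chosen so the total is 80 and the statistic matches the quoted 11.2. The p-value line uses the closed-form upper tail that applies to df = 3 specifically.

```python
import math

quarters = ["Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"]
observed = [32, 20, 16, 12]  # the middle two counts are assumed, not from the source

n = sum(observed)
expected = [n * 0.25 for _ in quarters]  # H0: 25% of birthdays in each quarter

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(quarters) - 1

# Upper-tail area for df = 3 only: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
p_value = (math.erfc(math.sqrt(chi_square / 2))
           + math.sqrt(2 * chi_square / math.pi) * math.exp(-chi_square / 2))

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi_square:.1f}, df = {df}, p-value = {p_value:.3f}: {decision}")
```

The result agrees with the numbers in the conversation: a statistic of 11.2, df = 3, a p-value of about 0.011, and a rejection of H0 at the 0.05 level.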
Now, if we get a significant result like that, we shouldn't just stop there, right?
We should dig deeper.
Absolutely.
A significant chi-square tells you that there's a difference, but not how or where.
You need to do a follow-up analysis.
Look back at the individual components of the chi-square sum, the (observed − expected)² / expected pieces for each category.
Find the ones that contributed the most to that big chi-square value.
Exactly.
In the hockey example, the Jan-Mar quarter, observed 32, expected 20, and the Oct-Dec quarter, observed 12, expected 20, would have the largest components.
So we'd point out that significantly more players were born in Jan-Mar than expected, and significantly fewer were born in Oct-Dec than expected, assuming uniformity.
Precisely.
That gives context and meaning to the rejection of H0, and it supports Gladwell's idea.
That follow-up is crucial for interpretation.
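A follow-up analysis amounts to ranking the individual components. The sketch below does that for the hockey data. As before, the two middle quarter counts (20 and 16) are an assumption, since only 32 and 12 are stated in the conversation.

```python
# Observed counts per quarter; the middle two values are assumed.
observed = {"Jan-Mar": 32, "Apr-Jun": 20, "Jul-Sep": 16, "Oct-Dec": 12}
expected_each = sum(observed.values()) / len(observed)  # 20 per quarter under H0

components = {
    quarter: (count - expected_each) ** 2 / expected_each
    for quarter, count in observed.items()
}

# Largest contributions first, with the direction of each difference.
ranked = sorted(components, key=components.get, reverse=True)
for quarter in ranked:
    if observed[quarter] > expected_each:
        direction = "more than expected"
    elif observed[quarter] < expected_each:
        direction = "fewer than expected"
    else:
        direction = "exactly as expected"
    print(f"{quarter}: component = {components[quarter]:.1f} ({direction})")
```

Jan-Mar and Oct-Dec come out on top, which is where the written follow-up should focus.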
Okay, that covers testing a single distribution's fit.
But what about comparing distributions between different groups, or looking for connections between two categorical variables?
That's where two-way tables come in.
Right, and this leads to two other types of chi-square tests,
the test for homogeneity and the test for independence.
Students often mix these up, but the key difference lies in how the data were collected.
Let's start with the chi-square test for homogeneity.
What's the setup for that one?
Homogeneity means sameness.
You use this test when you want to know if the distribution of a single categorical variable is the same across two or more different populations or treatment groups.
Okay, so multiple groups, one variable we're comparing across them, like the example of background music in a restaurant.
Perfect.
You randomly assign customers to one of three groups.
No music, French music, or Italian music.
Those are your three populations or treatment groups.
Then you record the single categorical variable, entree choice, maybe pasta, fish, steak.
And you want to see if the distribution of entree choices is the same for all three music groups?
Exactly.
Is the pattern of choices homogeneous across the groups?
The data comes from multiple independent samples or one randomized experiment with multiple groups.
So the hypotheses for homogeneity would be H0:
The true distributions of entree choices are the same for all three music conditions.
And Ha: the true distributions of entree choices are not all the same.
At least one group has a different distribution.
Conditions similar to before?
Very similar.
Random: independent random samples or random assignment to treatment groups.
10%: needed for each sample if sampling without replacement, but not for experiments.
Large counts: all expected counts must still be at least five.
Okay, that's homogeneity.
Now what about the chi-square test for independence?
How is that different?
Independence is about association.
You use this test when you have data from a single random sample drawn from one population.
And for each individual in that sample, you've measured two different categorical variables.
Ah, one sample, two variables, like the anger level and heart disease example.
Exactly.
You take one random sample of people.
For each person, you classify them based on their typical anger level.
Right.
Low, moderate, high.
And their heart disease status.
Yes, no.
And the question is, are these two variables related?
Is there an association between anger level and heart disease in this population?
Or are they independent?
Precisely.
That's the core question.
So hypotheses for independence, H0:
There is no association between anger level and heart disease status in the population, or the variables are independent.
Right.
And Ha: there is an association between anger level and heart disease status in the population.
The variables are not independent.
Conditions again?
A single random sample this time.
Still need 10% if sampling without replacement for that single sample.
And still need large counts, all expected counts at least five.
The core conditions persist.
It's really helpful that the core conditions are basically the same.
Now, what about the calculations for these two-way table tests?
Is it the same chi-square formula?
It is.
The mechanics are surprisingly similar once you have the expected counts.
Calculating those expected counts is slightly different for a two-way table, though.
How do we do that?
For any cell inside the two-way table, the expected count is calculated as (row total × column total) / table total.
Okay.
Row total times column total divided by the grand total for the table.
Yep.
You do that for every single cell on the table.
Again, don't round these expected counts.
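Here is a minimal sketch of the two-way expected-count rule. The table of counts is entirely hypothetical (entree choice by music condition, loosely modeled on the restaurant example) and is not the textbook's data.

```python
# Hypothetical counts: rows are entrees, columns are music conditions
# (none, French, Italian). These numbers are made up for illustration.
observed = [
    [30, 20, 40],  # pasta
    [25, 35, 20],  # fish
    [45, 45, 40],  # steak
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = row total * column total / table total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])  # rounded for display only
print("Large Counts met:", all(value >= 5 for row in expected for value in row))
```

The rounding happens only when printing. The unrounded values in `expected` are the ones that would feed into the test statistic.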
And once we have all the observed counts from the data and all the calculated expected counts?
Use the exact same chi-square formula.
χ² = Σ (observed − expected)² / expected.
You just sum this calculation over all the cells in the table, not including the totals.
Gotcha.
And degrees of freedom for two-way tables?
That's also a bit different.
It's calculated as df = (number of rows − 1) × (number of columns − 1).
Rows minus one times columns minus one.
Okay.
Then just like before, you use your calculated statistic and this df value to find the p-value using the chi-square distribution.
Interpretation is the same.
Compare p-value to alpha.
Exactly the same logic.
Reject H0 if p-value < α, fail to reject if p-value ≥ α, and state your conclusion in context.
Clearly indicate whether you found evidence of a difference in distributions (homogeneity) or evidence of an association (independence).
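Putting the two-way pieces together, the following sketch computes the statistic, the degrees of freedom, and a p-value for a 3-by-3 table. The counts are hypothetical, and the p-value line uses the closed form that holds for df = 4 specifically, so it would need to change for a table of a different size.

```python
import math

# Hypothetical 3x3 table (rows: entrees, columns: music conditions).
observed = [
    [30, 20, 40],
    [25, 35, 20],
    [45, 45, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell, totals excluded.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail area for df = 4 only: exp(-x/2) * (1 + x/2)
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}: {decision}")
```

For these made-up counts the p-value falls below 0.05, so the sketch would report rejecting H0. With real data the conclusion would then be stated in context.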
Any specific AP exam tips for these tests?
Definitely.
Always clearly state which test you're using: chi-square test for homogeneity or chi-square test for independence.
Show the formula for expected counts and maybe one calculation.
Show the chi-square formula set up with the first term or two.
Report the calculated χ² value, the df, and the p-value.
And be careful with calculator output, especially scientific notation for tiny p-values.
4.82E-5 is not 4.82, it's 0.0000482.
Very small.
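The scientific-notation trap is easy to demonstrate. This short sketch reads the calculator-style text "4.82E-5" and prints it in ordinary decimal form. The variable names are just illustrative.

```python
# Calculator-style scientific notation: 4.82E-5 means 4.82 * 10^-5.
calculator_output = "4.82E-5"
p_value = float(calculator_output)

decimal_form = f"{p_value:.7f}"  # fixed-point display with 7 decimal places
print(decimal_form)

alpha = 0.05
print("p-value < alpha:", p_value < alpha)
```

Written out, the value is 0.0000482, far below 0.05, so misreading it as 4.82 would flip the conclusion.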
Okay.
And just like with goodness of fit, if we get a significant result here, meaning we reject H0, we should do a follow-up analysis.
Yes, absolutely.
Essentially, if you conclude there is a difference (homogeneity) or there's an association (independence), you need to investigate where those differences or associations lie.
Look at the individual cell contributions to the chi-square statistic again.
Find the cells with the biggest discrepancies between observed and expected.
Right.
Identify those cells and describe the nature of the difference.
For example, we observed many more high anger individuals with heart disease than expected and fewer low anger individuals with heart disease than expected, assuming independence.
This gives substance to your conclusion.
That makes sense.
Now, you mentioned earlier, especially with the independence test, association isn't causation.
Crucial point.
Especially when the data comes from an observational study, like the anger and heart disease example, finding a statistically significant association between anger and heart disease does not prove that anger causes heart disease.
Why not?
Because there could be confounding variables.
Maybe people with high anger levels are also more likely to smoke or have poor diets or have a genetic predisposition.
Those other factors might be the real cause or part of the cause.
Association points to a relationship that might be worth investigating further, maybe with an experiment, if possible, but it doesn't seal the deal on causation in observational studies.
Okay.
Really important reminder.
Any other special things to keep in mind with these tests?
Well, just a couple of quick notes.
If you happen to have a two-by-two table, two rows, two columns, doing a chi-square test for homogeneity is mathematically the same as doing the two-sample z-test for comparing two proportions that we learned back in Chapter 10, assuming you're doing a two-sided test.
Oh, interesting.
So two ways to get the same answer.
Pretty much.
For that specific case.
Though if you need a one-sided test or a confidence interval for the difference in proportions, stick with the Chapter 10 methods.
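The two-by-two equivalence can be checked numerically: the square of the pooled two-sample z statistic equals the chi-square statistic. The counts below (30 of 100 successes versus 45 of 100) are invented for illustration.

```python
import math

# Hypothetical data: successes and sample sizes for two groups.
successes = [30, 45]
sizes = [100, 100]

# Two-sample z statistic with the pooled (combined) proportion.
p1, p2 = successes[0] / sizes[0], successes[1] / sizes[1]
pooled = sum(successes) / sum(sizes)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / sizes[0] + 1 / sizes[1]))

# The same data as a 2x2 table: each row is [successes, failures] for a group.
table = [[s, n - s] for s, n in zip(successes, sizes)]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(2)
    for j in range(2)
)

print(f"z = {z:.4f}, z squared = {z ** 2:.4f}, chi-square = {chi_square:.4f}")
```

Both routes give 4.8 here, and because the statistics line up this way, the two-sided z-test and the chi-square test lead to the same p-value.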
Okay.
And what if that large counts condition fails?
What if some expected counts are less than five?
Good question.
If that happens, the chi-square distribution might not be an accurate model for the test statistic.
The standard advice is to try collapsing categories.
That means combining similar rows or columns together to increase the counts.
Like if you had strongly agree, agree, neutral, disagree, strongly disagree, maybe combine strongly agree and agree and disagree and strongly disagree.
Exactly.
You try to combine them in a way that makes sense in the context until all the new expected counts are at least five.
You lose some detail, but it allows you to proceed with the test if possible.
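Collapsing categories can be sketched for a single five-level survey question. Everything numeric here is an assumption made up for illustration: the sample size of 40, the hypothesized proportions, and the helper names. The same merging would also have to be applied to the observed counts (or to whole rows or columns of a two-way table) before running the test.

```python
n = 40  # hypothetical sample size
hypothesized = {
    "strongly agree": 0.10, "agree": 0.30, "neutral": 0.25,
    "disagree": 0.25, "strongly disagree": 0.10,
}

# Which original categories are merged into each new category.
groups = {
    "agree": ["strongly agree", "agree"],
    "neutral": ["neutral"],
    "disagree": ["disagree", "strongly disagree"],
}

def expected_counts(sample_size, proportions):
    return {name: sample_size * p for name, p in proportions.items()}

def collapse(proportions, grouping):
    """Add up the hypothesized proportions of the categories being merged."""
    return {new: sum(proportions[old] for old in olds) for new, olds in grouping.items()}

def large_counts_met(expected):
    return all(count >= 5 for count in expected.values())

before = expected_counts(n, hypothesized)
after = expected_counts(n, collapse(hypothesized, groups))

print("Before:", before, "-> Large Counts met:", large_counts_met(before))
print("After: ", after, "-> Large Counts met:", large_counts_met(after))
```

Before collapsing, the two extreme categories have expected counts of 4, so the condition fails. After merging, every expected count is at least 5, at the cost of no longer distinguishing "strongly" from plain agreement.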
Okay.
That's a practical workaround.
So let's zoom out.
We've covered quite a bit on chi-square tests.
We have.
They're a really versatile set of tools.
It seems like they really expand our ability to analyze categorical data beyond those simple two-category or two-group situations.
Right.
Whether it's checking if a claimed distribution holds up like with the M&Ms.
That's goodness of fit.
Or comparing distributions across several groups like the music and entree choices.
Homogeneity.
Or looking for a relationship between two variables within one population like anger and heart disease.
Independence.
These chi-square tests give us a way to approach all of those scenarios statistically.
They really do.
Understanding how to choose the right test, check the conditions, perform the calculations, and especially interpret the results, including that follow-up analysis, is a huge asset.
It's crucial for making sense of data you'll encounter everywhere and certainly key for doing well on the AP exam.
It's about moving beyond just looking at percentages and asking if the patterns we see are statistically meaningful, right?
That's the core idea.
Yeah.
Are the differences or associations we observe likely just due to random chance?
Or do they represent something real happening in the population?
Chi-square helps us answer that.
So as a final thought for everyone listening, think about all the categorical data you encounter daily.
Election polls, survey results about preferences, maybe even data related to your hobbies or school activities.
Yeah.
It's everywhere when you start looking.
What questions could you ask and potentially answer using these powerful chi-square tools?
What hidden patterns might be waiting in the data around you?
Well, thank you so much for joining us for this deep dive into chi-square tests.
From the entire team here at the deep dive, we wish you the very best on your AP statistics journey.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.