Chapter 11: Goodness-of-Fit and Contingency Tables

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Have you ever looked at a headline, you know, one claiming a link between two things, maybe about some medical study or a surprising trend, and just wondered, okay, but how did they figure that out?

Or maybe, on a completely different track, how do people doing fraud detection, or cybersecurity experts, actually spot weird things in data that would look perfectly normal to the rest of us?

Yeah, those anomalies can be subtle.

Exactly.

So today we're taking a deep dive into, let's say, the strategic side of data analysis.

We're unpacking Chapter 11 of Mario Triola's Elementary Statistics.

A really foundational chapter.

It is.

And we want to equip you with two really powerful statistical tools, the goodness of fit test and contingency tables.

Our mission here is to give you not just the definitions, because you can look those up, right, but the why and the how to think about these methods.

Yeah, the intuition behind them.

And their incredible real-world uses, you know, public health, spotting digital scams, and, crucially, how you can avoid common mistakes when you're looking at the numbers.

That last part is key.

Interpretation matters.

It really does.

This isn't just about formulas.

It's about understanding the silent language of data.

And what's really powerful about these specific tests, these chi-square tests, is that they let us move beyond just describing data, you know, means, medians, that sort of thing.

They actually let us put claims to the test, statistically speaking.

We can ask, does the data we observed actually fit a pattern we expected?

Right, like does it match up?

Exactly.

Or are two different characteristics we measured in our data truly independent of each other?

It's about quantifying those connections or maybe the lack of connections in a really objective, robust way.

OK, let's start with maybe a classic example for that first one, the goodness of fit test.

Imagine you're playing a board game, right, and you suspect the die isn't quite fair.

We've all been there.

Totally.

You roll it, say, 100 times.

How do you go from those raw results, the counts for each number, to actually statistically proving your hunch?

Or maybe another example.

You're looking at a data set of reported weights.

How can you tell if those numbers just look a bit too neat to be actual, precise measurements?

And this is precisely where the goodness of fit test becomes, well, kind of like your statistical lie detector.

OK, I like that, a lie detector.

Its main job, its objective, is to test the hypothesis that an observed frequency distribution, so what you actually saw in your sample, lines up with some claimed or expected distribution.

So comparing what you see to what you should see.

Exactly.

Comparing what you see versus what you should see, if a certain claim about the data's pattern were actually true.

It helps you decide if a single set of counts, like your die rolls, genuinely fits a specific distribution.

Like is it uniform, meaning fair?

Is it uniform, or does it maybe follow a normal curve, or does it fit some other set of proportions someone claimed?

Got it.

And for this statistical lie detector to work properly, there are some rules, right?

Requirements.

Yes, absolutely.

A few critical requirements.

First, like most good statistics, your data needs to be randomly selected.

Can't be biased from the start.

Makes sense.

Second, you're dealing with frequency counts.

So how many times something occurred in different categories, like how many times you rolled a one, a two, and so on.

OK.

Counts.

But the third one, this is really the crucial check.

Your expected frequency for each category has to be at least five.

Expected, not what you actually saw.

Correct.

The expected count needs to be five or more.

This is important because it ensures that the chi-square math we use is actually a good approximation.

It's totally fine, by the way, if you observe zero for a category.

Oh, interesting.

Yeah, as long as what you expected for that category was five or more, the test is valid.

OK.

That's a good distinction.

So how do we set up the actual tests, like the hypotheses?

So when you set up this test, your null hypothesis, that's H0, always states that the observed frequency counts do agree with the claimed distribution.

They fit.

So for the suspicious die, H0 would be?

It's fair.

All sides are equally likely.

Exactly.

H0, the die is fair.

The alternative hypothesis, H1, just states the opposite.

It says the frequency counts do not agree with the claimed distribution.

Meaning at least one side is off?

Precisely.

At least one category significantly deviates from what we expected.

The die is loaded, basically.

OK.

And how do we measure that deviation?

There's a statistic, right?

Yes.

The test statistic is chi-square, written as χ².

And this number essentially sums up how much each observed count differs from its expected count.

How does it do that?

Well, for each category, you take the difference, observed minus expected, you square that difference, and then you divide by the expected value.

And you just add all those up across all your categories.

OK.

(O - E)² / E, summed up.

You got it.

And a small chi-square value means, well, O and E are pretty close, which suggests a good fit.

And a large value.

A large value means there's a big difference, a significant discrepancy suggests a poor fit.

And degrees of freedom.

How does that work here?

Simple for this test.

It's just the number of categories minus one.

So for a standard die, with six categories, that's five degrees of freedom.
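
As a rough sketch of the steps just described, the die example can be worked through in a few lines of Python. The six observed counts below are hypothetical (invented for illustration, not taken from the source), and the critical value 11.070 is the standard table value for 5 degrees of freedom at a 0.05 significance level.

```python
# Hypothetical counts of faces 1..6 from 100 rolls (invented for illustration).
observed = [10, 15, 18, 17, 14, 26]

n = sum(observed)                 # total number of rolls: 100
k = len(observed)                 # number of categories: 6
expected = [n / k] * k            # a fair die: each face expected n/6 times

# Requirement check: every expected count must be at least 5.
assert all(e >= 5 for e in expected)

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1                        # degrees of freedom: categories minus one

# Right-tailed critical value for df = 5 at alpha = 0.05 (standard table value).
CRITICAL_05_DF5 = 11.070
reject_h0 = chi2 > CRITICAL_05_DF5

print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```

With these made-up counts the statistic works out to 8.6, which is below 11.070, so this particular sample would not be enough to call the die unfair.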

Got it.

And one last crucial point.

Goodness-of-fit tests are always right-tailed tests.

We're only interested if the differences are significantly large.

OK.

Always right-tailed.

That simplifies things.

It does.

Now, shall we try putting this into practice with an example from the source material?

Yeah, let's do it.

That weights example sounded interesting.

OK.

The measured or reported weights example.

It really highlights a fascinating aspect of human behavior and data.

When people self-report data, especially something personal like weight, the last digits often look very different than if the weights were actually measured using a scale.

How so?

Well, genuinely measured weights tend to have last digits that occur with roughly the same frequency.

You know, zero through nine should be pretty evenly spread out.

A uniform distribution.

OK.

Makes sense.

Random variation.

Right.

But look at Table 11-2 in the source.

It shows data on the last digits of 2,784 reported weights from males.

And you see things like, wow, 1,175 zeros.

Whoa.

That's a lot.

And only 44 ones.

It just looks off, right?

The claim here, the thing we want to test, is that these last digits do not occur with the same frequency.

So we're claiming it's not uniform.

Exactly.

So our null hypothesis, H0, would be that all probabilities for the digits zero through nine are equal.

That it is a uniform distribution.

And H1 is that at least one probability is different.

Perfect.

Now if H0 were true, if it was uniform, what would we expect for each digit?

The total is 2,784.

Ten digits.

So 2,784 divided by 10 is 278.4.

Precisely.

Our expected frequency, E, for each digit is 278.4.

And importantly, that's way above our minimum requirement of five.

So we're good to proceed with the chi -square test.

OK.

Now we won't grind through the full calculation here.

But when you plug in all those observed values, like 1,175 for zero, 44 for one, and the expected value, 278.4 for all, you calculate the chi-square statistic.

And what does it come out to be?

It comes out incredibly large.

4,490.174.
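
To get a feel for why the statistic is so large, here is a small sketch that uses only the numbers quoted in the discussion: the total of 2,784 weights and the observed counts for the digits 0 and 1. The other eight counts are not given here, so this computes only a partial sum, which is a lower bound on the full statistic.

```python
total = 2784                      # reported weights in the sample
expected = total / 10             # uniform last digits: 278.4 per digit

# Only the two observed counts quoted in the discussion.
quoted = {0: 1175, 1: 44}

# Each category contributes (O - E)^2 / E to the chi-square statistic.
contributions = {d: (o - expected) ** 2 / expected for d, o in quoted.items()}
partial_chi2 = sum(contributions.values())

for digit, value in contributions.items():
    print(f"digit {digit}: contribution {value:.1f}")
print(f"partial chi-square from two digits: {partial_chi2:.1f}")
```

Those two digits alone contribute roughly 3,085, while the standard right-tailed critical value for 9 degrees of freedom at the 0.05 level is only 16.919, so the remaining digits cannot rescue the fit.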

Wow.

That sounds big.

It is.

And with nine degrees of freedom, ten categories minus one, the p-value associated with that chi-square value is minuscule.

It's less than 0.00001.

OK, p-value less than 0.00001.

So time for our mnemonic.

If the p is low, the null must go.

That p is definitely low.

So we reject H0.

We emphatically reject the null hypothesis.

So what does this actually tell us?

It tells us there's really strong evidence that the last digits don't occur with the same frequency.

Exactly.

And this raises that important question.

What does this imply about the source of the data?

Well, like you said, people probably weren't actually measuring carefully.

They were reporting.

Right.

The incredibly high frequencies for the digits zero and five are the giveaway.

It strongly suggests these weights were reported by the individuals because people tend to round usually to the nearest zero or five.

Yeah, I guess I'd do that too.

Most people would.

It's not necessarily fraud, just human nature.

But it vividly shows why understanding how data was collected is absolutely critical before you interpret it.

Measured versus reported makes a huge difference here.

That's a fantastic illustration.

It really shows how the test can reveal something hidden about data integrity.

And you mentioned another application, Benford's law for cybersecurity.

Yes, Benford's law.

This one is just fascinating.

It's this surprising mathematical observation that in many, many naturally occurring data sets.

Like what kind of data sets?

Oh, things like financial transactions, stock prices, tax return numbers, population figures, even the lengths of rivers.

In these kinds of data, the leading digit isn't uniformly distributed.

OK.

So what does it look like?

The digit one appears as the first digit about 30.1 percent of the time.

Two appears about 17.6 percent.

Three is about 12.5 percent, and so on, with the higher digits appearing less and less frequently.
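
The percentages quoted here match the usual mathematical statement of Benford's law, which the discussion does not spell out: the probability that the leading digit is d equals log10(1 + 1/d). A minimal sketch that reproduces the quoted figures:

```python
import math

# Benford's law: P(leading digit = d) = log10(1 + 1/d), for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The nine probabilities sum to 1, and the first three round to 0.301, 0.176, and 0.125.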

Wow, that's counterintuitive.

You'd think they'd be roughly equal.

You would.

But it holds true remarkably often, and it's not just a mathematical curiosity, it's become a really practical tool.

How so?

Well, tax agencies, accounting firms, they use it to flag potential fraud.

If financial data doesn't follow Benford's law, it might be fabricated.

Ah, clever.

And more recently, it's being used in cybersecurity.

It turns out that things like the time intervals between internet traffic packets under normal conditions tend to follow Benford's law pretty closely.

OK, so if it doesn't follow the law?

That could be an anomaly.

A significant departure from Benford's law in the leading digits of those inter-arrival times could be an instant signal that something unusual is happening, maybe a denial-of-service attack or some other kind of intrusion.

That's brilliant.

It's like a tripwire based on number patterns.

It is.

It's elegant.

It works in real time.

And it's actually quite hard for an attacker to consciously make their traffic patterns conform to Benford's law.

OK, so let's see the test in action here.

The source has an example.

In the cybersecurity example from Table 11-4, they take a sample of 271 leading digits from internet traffic inter-arrival times.

The claim being tested is that these digits do fit the distribution described by Benford's law.

So this time, the claim is that it fits.

Correct.

So our null hypothesis, H0, is that the proportions for each leading digit, 1 through 9, match the specific percentages given by Benford's law: p1 = 0.301, p2 = 0.176, et cetera.

And H1 would be that at least one proportion doesn't match Benford's law.

Exactly.

We calculate the expected frequencies for our 271 observations based on those Benford proportions.

For example, for digit 1, we'd expect 271 times 0.301, which is about 81.6.

We check, and all expected frequencies are safely above 5.

Good.

Requirement met.

Then we calculate the chi-square statistic.

For this data, it comes out to 11.2792.

OK, 11.2792.

Is that big or small?

Need the p-value.

Right.

With 8 degrees of freedom, 9 digits minus 1, the p-value for this chi-square statistic is 0.186.
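
The expected counts and the p-value can be checked with the standard library alone. This is only a sketch: it takes the test statistic 11.2792 as quoted (the observed counts are not listed in the discussion), uses the log10 form of Benford's law for the proportions, and relies on a closed-form right-tail formula that holds only when the degrees of freedom are even. The helper name `chi2_sf_even_df` is made up for this example.

```python
import math

n = 271                                          # leading digits in the sample
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
expected = [n * p for p in benford]              # expected count per digit

# Requirement check: every expected frequency is at least 5.
assert min(expected) >= 5


def chi2_sf_even_df(x: float, df: int) -> float:
    """Right-tail area of the chi-square distribution for an even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


chi2 = 11.2792        # test statistic quoted in the discussion
df = 9 - 1            # nine leading digits minus one
p_value = chi2_sf_even_df(chi2, df)

print(f"expected count for digit 1: {expected[0]:.1f}")
print(f"p-value: {p_value:.3f}")
```

For general degrees of freedom a statistics library would normally be used instead; the even-df shortcut is just enough for this 8-degree case.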

0.186.

OK.

Comparing that to our usual alpha, like 0.05.

It's greater.

Yeah.

0.186 is much larger than 0.05.

So if the p is high, well, the mnemonic is about low p.

If the p is not low…

The null can fly.

Or, more formally, we fail to reject the null hypothesis.

Right.

Fail to reject H0.

So what's the conclusion here?

It means there is not enough statistical evidence to reject the claim that these leading digits fit Benford's law.

Meaning the traffic looks normal.

Exactly.

In the context of cybersecurity, this result implies that the observed traffic data is consistent with normal expected patterns.

Based on this test, there's no statistical reason to suspect a cyber attack has occurred.

It's a simple, yet pretty robust check for anomalies.

That really shows the versatility of goodness of fit.

OK, so that test is great for checking if one set of frequencies fits a specific pattern.

But what if we have data broken down by two different categories, like vaccination status and autism diagnosis, and we want to know if those two things are related?

Ah, now you're moving into the territory of contingency tables and tests of independence.

Contingency tables.

Sometimes called two-way frequency tables.

They're essentially grids, right?

You've got frequency counts for categorical data, where one variable defines the rows and another variable defines the columns.

Like vaccinated, unvaccinated in the rows and autism in the columns?

Perfect example.

And the word contingent itself kind of implies a dependence.

So these tables are specifically designed to let us test for independence between that row variable and that column variable.

So the test of independence, what's its main goal?

Its objective is straightforward.

To determine if there's a statistically significant relationship, a dependency between the two categorical variables, or if they are effectively independent of each other.

No connection.

And the requirements, similar to before.

Very similar.

You need randomly selected sample data.

The data has to be frequency counts arranged in that two-way table format, and critically, again, there's the expected frequency requirement.

For every single cell in that table, the expected frequency, E, must be at least five.

Every cell this time, not just every category.

Every single cell.

Yeah.

If even one cell has an expected frequency below five, the standard chi-square test might not be reliable.

Okay, good to know.

Hypotheses for this test.

For a test of independence, the null hypothesis H0 is always that the row and column variables are independent.

No relationship.

So for the vaccine example, H0 would be vaccine status and autism diagnoses are independent.

Exactly.

And the alternative hypothesis, H1, is that the row and column variables are dependent.

There's some kind of association between them.

And the test statistic, still chi-square?

Still the same chi-square formula.

The sum of (O - E)² / E, calculated across all the cells in the table this time.

Okay.

But degrees of freedom must be different, right, since it's a table?

It is.

The degrees of freedom for a contingency table test are calculated as (number of rows minus one) times (number of columns minus one).

Ah, (r - 1)(c - 1).

Makes sense.

And just like goodness-of-fit, these tests of independence are always right-tailed tests.

We're looking for significant discrepancies indicating dependence.

Okay, the tricky part here seems to be calculating those expected frequencies for each cell.

How does that work?

It's not just total divided by number of cells, is it?

No.

It's a bit more involved, but quite clever.

For any given cell in the table, the expected frequency, E, is calculated by taking that cell's row total, multiplying it by that cell's column total, and then dividing the result by the grand total of all observations in the table.

Hmm.

Row total times column total, over grand total.

Why does that work?

It comes directly from probability theory, specifically the idea of independent events.

If two events, like being in a certain row category and being in a certain column category, are independent, the probability of both happening is just the product of their individual probabilities.

Okay.

The probability of being in a specific row is row total / grand total.

The probability of being in a specific column is column total / grand total.

If they're independent, the probability of being in that specific cell is (row total / grand total) × (column total / grand total).

To get the expected count for that cell in our sample, we multiply that joint probability by the total sample size, the grand total.

So (row total / grand total) × (column total / grand total) × grand total simplifies nicely, too.

(Row total × column total) / grand total.
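
The simplification just described can be written compactly. The symbols are shorthand introduced here, not notation from the discussion: R is the cell's row total, C is its column total, and n is the grand total.

```latex
\[
E \;=\; n \cdot \frac{R}{n} \cdot \frac{C}{n} \;=\; \frac{R \cdot C}{n}
\]
```

One factor of n cancels, which is why the expected count reduces to row total times column total over grand total.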

Ah, okay.

That makes sense now.

It's the count you'd expect if they were truly independent.

Precisely.

Okay.

Let's connect this directly back to that really important high-stakes question we started with.

Is there a link between the MMR vaccine and autism?

Right.

A topic with a lot of history and unfortunately a lot of misinformation stemming from that retracted study.

But our source material uses data from a more recent rigorous study by Jain and colleagues, which concluded no harmful association.

Let's use their data from Table 11-1 and run this test of independence.

Okay.

So the table laid it out clearly.

Among the unvaccinated kids, 25 had autism, 362 did not.

That's 387 total unvaccinated.

Among the vaccinated kids, 64 had autism, 1,427 did not.

That's 1,491 total vaccinated.

And the overall study totals were 89 kids with autism and 1,789 without, from a grand total of 1,878 children.

Got it.

So let's walk through the hypothesis test, just reporting the statistical findings impartially.

Right.

The claim we're testing essentially is the conclusion of the study, that autism is independent of vaccination status.

So our null hypothesis, H0, is that autism diagnosis is independent of MMR vaccine status.

And H1: autism diagnosis and MMR vaccine status are dependent.

There is an association.

Let's use a standard significance level, say alpha = 0.05.

Okay.

First, we need those expected frequencies.

For example, the top left cell, unvaccinated and autism.

The expected count would be the row total for unvaccinated times the column total for autism, divided by the grand total.

So 387 × 89 / 1,878.

Exactly.

Which calculates out to about 18.34.

Okay, that's above five.

Do we need to check all four?

Technically, yes.

You calculate E for each of the four cells.

In this case, they all turn out to be well above five, so the requirement is met.

We can trust the chi-square test.

Good.

Degrees of freedom.

It's a two-by-two table.

So (2 rows - 1) times (2 columns - 1) equals 1.

Just one degree of freedom.

And now for the test statistic and p-value.

When you run the numbers comparing the observed counts, 25, 362, 64, and 1,427, to their corresponding expected counts using the chi-square formula, the test statistic comes out to 3.198.

3.198.

And the p-value for that with one degree of freedom?

The resulting p-value is 0.074.
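
The whole test can be reproduced from the four counts quoted above. The sketch below uses only the Python standard library; the one shortcut it leans on is that, for exactly one degree of freedom, the chi-square right-tail area equals erfc(sqrt(x / 2)). The variable names are illustrative.

```python
import math

# Observed counts: rows are unvaccinated / vaccinated,
# columns are autism / no autism (as quoted in the discussion).
observed = [[25, 362],
            [64, 1427]]

row_totals = [sum(row) for row in observed]              # [387, 1491]
col_totals = [sum(col) for col in zip(*observed)]        # [89, 1789]
grand_total = sum(row_totals)                            # 1878

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
assert all(e >= 5 for row in expected for e in row)

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)        # (2 - 1)(2 - 1) = 1

# For df = 1 the right-tail area has a closed form via the
# complementary error function: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"E(unvaccinated, autism) = {expected[0][0]:.2f}")
print(f"chi2 = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Note that this is the uncorrected statistic; applying a continuity correction, as some software does by default for two-by-two tables, would give a somewhat smaller value.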

Okay, p-value equals 0.074.

Now we compare that to our alpha of 0.05.

And 0.074 is?

It's greater than 0.05.

So if the p is low, independence must go.

But our p isn't low here.

Right.

Since the p-value, 0.074, is greater than the significance level, 0.05, we fail to reject the null hypothesis.

Fail to reject H0, which means?

It means that based on the data from this particular study, there is not sufficient statistical evidence to conclude that autism diagnosis is dependent on MMR vaccine status.

So the data support the idea that they're independent.

The data are consistent with independence.

Yes, they support the study's conclusion that there is no harmful association found in this data set.

It's a really clear example of how these tools let us evaluate claims using objective, data-driven evidence.

Absolutely.

A very powerful demonstration, especially given the context.

And it's so important to remember that caution you mentioned earlier.

Even if we had rejected H0, if the p-value had been small.

Right.

Even if we found a statistically significant dependency, it would not automatically mean that the vaccine causes autism.

Correlation doesn't equal causation.

The classic mantra.

It would only indicate a statistical association exists in the data.

Proving causation requires much more, usually different types of study designs.

Also worth noting, for these two-by-two tables specifically, the chi-square test of independence gives you the same conclusion as doing a z-test for the difference between two population proportions, like we might have discussed in earlier chapters.
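
That equivalence can be checked numerically on the MMR counts: the square of the pooled two-proportion z statistic equals the uncorrected chi-square statistic, and the two-tailed z p-value equals the right-tailed chi-square p-value. A minimal sketch, with variable names chosen for illustration:

```python
import math

# Same counts as the MMR example: autism cases out of each group's total.
x1, n1 = 25, 387        # unvaccinated
x2, n2 = 64, 1491       # vaccinated

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)

# Two-proportion z statistic with the pooled standard error.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-tailed p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, z^2 = {z * z:.3f}, p-value = {p_value:.3f}")
```

Squaring z gives about 3.198 and the p-value is about 0.074, matching the chi-square results quoted for this table.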

Ah, okay.

Good connection back.

Now, building on these core ideas of goodness of fit and independence, there are actually a few important variations of chi-square tests for slightly different situations.

Okay, like what?

Well one is the test of homogeneity.

It sounds similar to independence, but there's a subtle difference.

How's it different?

This raises a really good question.

In a test of homogeneity, you're typically sampling from different populations explicitly.

For example, maybe you sample people from several different cities.

Okay, multiple groups.

Right.

And you're asking if these different populations have the same proportion of some characteristic.

Like do cities A, B, and C have the same proportion of people who return lost wallets?

Ah, I see.

So you're comparing proportions across populations.

Exactly.

You want to know if the populations are homogeneous, meaning the same with respect to that characteristic.

The mechanics, like the chi-square calculation and degrees of freedom, (r - 1)(c - 1), are actually identical to the test of independence.

Huh.

Same math, different question setup.

Pretty much.

The hypotheses are framed differently.

H0 is that the populations do have the same proportions.

H1 is that they don't.

Is there an example of that?

Yeah, the source mentions that lost wallet experiment done by Reader's Digest.

They lost wallets in 16 different cities.

They ran a test of homogeneity to see if the return rate was the same across cities.

The p-value is very small, 0.002.

So they rejected the null hypothesis.

Meaning the return rate wasn't the same.

Correct.

The evidence suggested that the proportion of returned wallets significantly depends on the city.

The cities were not homogeneous in their wallet-returning behavior.

Interesting.

Okay, what other variations are there?

You mentioned that requirement about expected frequencies needing to be at least five.

What if they aren't?

Great question.

That rule, E greater than or equal to five, is for the standard chi-square test because it relies on an approximation that works well with larger counts.

When you have a two-by-two contingency table, specifically, and one or more of your expected frequencies dips below five…

The standard test isn't reliable.

It might not be accurate.

The approximation can break down.

In that situation, the preferred method is Fisher's exact test.

Fisher's exact test, okay.

This test provides a precise p-value directly from the probabilities of the table configurations without relying on the chi-square approximation.
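
To make "probabilities of the table configurations" concrete, here is a sketch of a one-sided version of Fisher's exact test. With the row and column totals held fixed, the top-left count follows a hypergeometric distribution, and the p-value adds up the probabilities of every table at least as extreme as the one observed. The function name and the small table are hypothetical, invented for illustration; statistical software may report a two-sided p-value instead.

```python
from math import comb


def fisher_exact_right_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    With all row and column totals held fixed, the top-left count follows
    a hypergeometric distribution. The p-value sums the probabilities of
    tables whose top-left count is at least the observed value a.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)                    # largest feasible top-left count
    denom = comb(n, col1)
    return sum(
        comb(row1, k) * comb(n - row1, col1 - k) / denom
        for k in range(a, upper + 1)
    )


# Hypothetical small table (invented for illustration): every expected count is 2.
p_value = fisher_exact_right_tail(3, 1, 1, 3)
print(f"one-sided p-value: {p_value:.4f}")
```

For this made-up table the p-value is 17/70, about 0.243, obtained by exact counting rather than from the chi-square curve.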

It's computationally more intensive, which is why it was harder to do by hand, but modern software handles it easily.

So it's the accurate choice for small expected counts in two-by-two tables.

Exactly.

It's the gold standard then.

The source mentions the MythBusters yawning experiment data.

When they analyzed it, one of the expected frequencies was only 4.480, just under five.

Ah, violates the rule.

Right.

So using Fisher's exact test was appropriate.

The p-value came out to 0.513.

Which is high.

Very high.

So they failed to reject the null.

There was no statistically significant evidence from their experiment to support the contagious yawning myth.

Huh.

Okay, busted, statistically speaking.

One more test was mentioned.

Something about matched pairs.

Yes, McNemar's test.

This one is specifically for situations where your data in a two-by-two table isn't from independent groups, but rather from dependent or matched pairs.

What counts as matched pairs?

Classic examples are before and after measurements on the same subject.

Or maybe comparing two different treatments, but applying them to, say, the left eye and the right eye of the same person.

Or, like the example in the book, comparing a protected hip to an unprotected hip on the same nursing home resident.

Okay, so the observations are linked somehow.

Precisely.

They aren't independent samples.

McNemar's test focuses on the discordant pairs.

The pairs where the outcome was different between the matched observations.

For example, a fracture on the unprotected side but not the protected side, or vice versa.

It tests if the proportions of these two types of discordant outcomes are significantly different.

So for the hip protectors example?

The null hypothesis H0 would be that the proportion of fractures on protected hips is the same as the proportion of fractures on unprotected hips, considering only those discordant cases.

Makes sense.

What did the test show?

The chi-square statistic calculated using McNemar's method was 0.640, leading to a p-value of 0.424.
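
For reference, a common continuity-corrected form of McNemar's statistic uses only the two discordant counts, b and c: (|b - c| - 1)² / (b + c), with one degree of freedom. The sketch below assumes that form. The discordant counts 10 and 15 are illustrative assumptions, not figures given in this discussion; they were picked because they reproduce the quoted statistic and p-value.

```python
import math


def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction, from the two discordant counts.

    Returns (chi-square statistic, right-tailed p-value with 1 degree of freedom).
    """
    # max(..., 0) keeps the corrected numerator from going negative when b == c.
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square right tail for df = 1
    return chi2, p_value


# Illustrative discordant counts (an assumption, not given in the discussion).
chi2, p_value = mcnemar(10, 15)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
```

The concordant pairs, where both hips had the same outcome, do not enter the calculation at all.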

High p-value again.

Yep.

Greater than 0.05, so we fail to reject H0.

Meaning?

The conclusion from this test is that, based on this data, the hip protectors do not appear effective in preventing hip fractures.

There wasn't a significant difference in fracture rates between the protected and unprotected sides in those discordant pairs.

Wow.

Okay, so we've covered quite a bit of ground with these chi-square-related tests.

Oh, we really have.

We started with goodness of fit, right?

Checking if one set of numbers matches an expected pattern, revealing things about reported weights or even spotting cyber threats using Benford's law.

Powerful for validating assumptions about data.

Then we jumped into contingency tables and the test of independence, looking for relationships between two categorical variables.

And that led us to tackle that really critical question about the MMR vaccine and autism, finding that data supported independence.

Showing how statistics can bring objective evidence to important public health discussions.

And we even touched on homogeneity, comparing proportions across groups, and the specialized tools like Fisher's exact test and McNemar's test for specific data structures like small counts or matched pairs.

Exactly.

It's a versatile toolkit.

If we connect all this back to the bigger picture, these tools really do empower you, the listener, to look at data and claims more critically.

Absolutely.

They give you a framework for asking, is this pattern real or just random chance?

Is this claimed connection supported by the numbers?

But always remembering those key points.

Check the requirements, especially that expected count rule.

Crucial.

It needs to be at least five for the standard tests.

And never jump from statistical dependence straight to causation.

Never.

Association is not causation.

It's a signpost, perhaps, for further investigation, but not the final word on cause and effect.

These tests provide objective evidence, but interpreting it correctly within the context is vital.

So the next time you see a statistic, maybe about some link between two things or a claim that data behaves a certain way, pause for a second, ask yourself, is this maybe a goodness of fit situation or are they testing for independence in a contingency table?

Think about the underlying structure.

And most importantly, does the evidence presented, does the data actually support the conclusion they're drawing?

Your statistical superpowers definitely leveled up today.

Hopefully.

So here's a final thought.

What other claims, maybe ones you see every day, could you now put to the test using these

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Categorical data analysis and comparing multiple group means represent two interconnected statistical challenges that arise frequently in research. The chi-square goodness-of-fit test addresses whether observed frequencies in a dataset match what would be expected under a particular theoretical distribution, allowing researchers to validate distributional assumptions or test claims about population proportions. When examining relationships between two categorical variables, the chi-square test for independence employs contingency tables to determine if observed cell frequencies suggest genuine association or merely random variation consistent with independence. A parallel procedure, the chi-square test for homogeneity, extends this logic to assess whether categorical distributions remain consistent across multiple populations or whether meaningful differences exist in proportions across groups. Each chi-square approach requires computing a test statistic that quantifies the discrepancy between what was actually observed and what would be predicted under the null hypothesis, then comparing this value to the appropriate critical region in the chi-square distribution. Critical assumptions underpin the validity of these tests, most notably that expected frequencies in contingency table cells must meet minimum thresholds to ensure the approximation works reliably. Beyond categorical comparisons, analysis of variance provides a unified method for determining whether population means differ significantly across three or more groups. Rather than conducting numerous pairwise t-tests, ANOVA partitions the total observed variation into between-group variation reflecting differences among group means and within-group variation reflecting natural fluctuation within each group. This decomposition yields the F-statistic, which follows the F-distribution when all group means are truly equal. 
The approach assumes that observations are independent, that populations are normally distributed, and that variances are homogeneous across groups. When ANOVA detects significant differences among groups, post-hoc tests such as Tukey's method or Scheffe's test pinpoint which specific pairs of means actually differ, while controlling for the inflated error rates that would result from conducting multiple independent comparisons.
