Chapter 12: The Chi-Squared Test for a Distribution

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Imagine, ah, you're at a casino, right, you're standing at the craps table and you're rolling the dice.

Oh yeah, best place to be.

Right, but the shooter just keeps hitting sixes, like roll after roll after roll.

Okay, yeah, that's suspicious.

Exactly.

In your gut, you know these dice are loaded, like the universe feels like it's cheating you.

But if you call the casino manager over, a gut feeling isn't going to cut it.

No, they'll just laugh at you.

Yeah, you have to actually prove it.

You need a mathematical way to look at that pile of data, all those sixes, and definitively state whether this is just a lucky streak or if the underlying reality is actually rigged.

And that is exactly what we're tackling today.

Yep.

Welcome to another Deep Dive.

Today we are serving up a special one-on-one tutoring session specifically for you, especially if you're a college student trying to wrap your head around error analysis.

We're going to decode that daunting wall of math you might be staring at.

We really are.

We're unpacking Chapter 12 of Introduction to Error Analysis.

The chi-squared test for a distribution.

That's the one.

It is honestly the ultimate tool for proving whether the data you observe in reality actually matches the theoretical distribution you think it should follow.

So if you've been following along, you're probably familiar with limiting distributions by now, you know, the Gauss, the binomial, the Poisson.

Right, those are great in theory.

But the million-dollar question is, when you run an actual experiment in the lab, how do you mathematically prove your data fits the distribution you think it does?

And to give you a roadmap for this Deep Dive, we're going to teach this chapter exactly as it appears in the text.

No outside fluff.

Right, no outside fluff.

We'll start with a concrete projectile example, then we'll build the chi-squared formula itself and decode the whole degrees of freedom thing.

Which always trips people up.

Oh, constantly.

Then we'll look at probability tables and finally test three real-world examples.

Awesome.

Let's just jump right in.

Before we start throwing formulas at you, let's visualize the problem using the book's first concrete example.

I like that approach.

So imagine you are firing a projectile from a gun, like, 40 times, and you're measuring the range, you know, how far it travels before it hits the ground.

Okay.

Now, nature loves a bell curve, so you suspect these ranges follow a normal Gaussian distribution.

Most shots are going to land near the average, with, you know, a few falling short and a few flying unusually far.

But right away, you hit a pretty big physical hurdle with this kind of measurement.

Which is, well, the range of a projectile is a continuous variable.

It can literally be any number.

Oh, right.

Yeah, so you can't ask, how many times did the projectile land at exactly 730.145 centimeters?

Because the odds of it hitting that exact microscopic point are, like, practically zero.

Exactly.

Infinitesimally small.

So you can't just tally up exact hits.

You have to create buckets.

Or as the text calls them, bins.

Bins.

Got it.

You have to group this infinite range of possible values into defined intervals.

And to do this, you first use your 40 measurements to find the best estimates for your distribution.

Okay, so you calculate the mean, right?

Which gives you the center of the bell curve.

Yes.

And in the text's example, that's 730.1 centimeters.

And then you calculate the standard deviation, which tells you, like, how wide or fat that curve is.

Right.

Here, it's 46.8 centimeters.

So once you have the center and the width, you can carve your expected bell curve into bins.

I always think of this, like, setting up differently sized mail slots at the post office.

Oh, that's a good way to look at it.

Yeah.

So, like, bin one is a slot for any shot that falls more than one standard deviation below the mean.

Bin two is a slot for anything between that lower mark and the center.

Right.

Bin three is from the center to one standard deviation above.

And bin four is a slot for the unusually long shots beyond that.

And because you're assuming this is a Gaussian curve, you actually already know the theoretical probabilities for each of those slots.

Because the geometry of a normal distribution is, like, locked in.

Exactly.

It's well documented.

The two outer tails each hold about 16% of the probability, and the two central areas each hold about 34%.

So that's 16%, 34%, 34%, and 16%.

Yep.

So you take those percentages and simply multiply them by your total number of shots.

Which is 40.

Right.

And that gives you the expected number of measurements for each bin.

The text calls this e sub k with the little k just representing the bin number.

Okay, let me do that math real quick.

16% of 40 is 6.4.

Correct.

And 34% of 40 is 13.6.

So our expected numbers for these four slots are 6.4, 13.6, 13.6, and 6.4.

Exactly.

That is what our perfect theoretical mail carrier should be delivering into those slots.
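The expected counts just described can be reproduced with a short sketch. It assumes the four bins are cut at one standard deviation below the mean, at the mean, and at one standard deviation above it, as in the mail-slot picture; the helper name `normal_cdf` is only illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

N = 40  # total number of shots

# Bin edges measured in standard deviations from the mean (assumed layout).
edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]

# Probability of landing in each bin, then the expected count per bin.
probs = [normal_cdf(b) - normal_cdf(a) for a, b in zip(edges, edges[1:])]
expected = [N * p for p in probs]

for k, (p, e) in enumerate(zip(probs, expected), start=1):
    print(f"bin {k}: probability {p:.3f}, expected count {e:.2f}")
```

The unrounded probabilities are closer to 15.9% and 34.1%, which gives roughly 6.35 and 13.65 expected shots; the episode's 6.4 and 13.6 come from rounding the percentages to 16% and 34% first.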

Right.

But experiments are messy.

Very messy.

You have to look at the actual mail that arrived.

You have to look at your observed numbers, which the book calls o sub k.

You literally count how many of your 40 shots landed in each of those four bins.

And let's say we look at that outer tail.

We only expected about six shots to land in that extreme outer bin.

But what if we count them up and we actually observe, I don't know, 16 shots sitting in that bin?

That is where the alarm bells start ringing.

Right.

Your tiny mail slot is just overflowing.

Yes, the heart of error analysis lies in that gap.

You are looking at the deviation, the observed number minus the expected number, o sub k minus e sub k.

Because we obviously never expect perfect agreement, right?

Random fluctuations just happen.

Of course.

But if your underlying hypothesis about the distribution is correct, those deviations should be reasonably small.

But if the deviation is massive, your hypothesis is probably totally wrong.

Exactly.

But, you know, small and massive are incredibly subjective words.

Like if I'm writing a research paper for a lab, I can't just write, well, the gap looked pretty massive to me.

No, your professor would not accept that.

Right.

We need a universal, ruthless mathematical formula to decide if our deviations are acceptable.

And that brings us to the absolute hero of this chapter, the chi-squared formula.

The Greek letter chi, spelled C-H-I, pronounced "kye," squared.

Yes.

This formula gives you a single objective number that represents how well your observed data fits your expected model.

Okay, let's build it for the listener.

The formula defines chi-squared as the sum over all your bins of a very specific fraction.

Right.

So for each bin, you take the deviation, your observed minus expected, and you square it.

That's the top of the fraction.

Yep.

Then you divide that squared deviation by the expected number.

You do this for every single bin, add them all up, and boom, you have your total chi-squared score.
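As a minimal sketch of that recipe, the function below sums the squared deviation over the expected count for every bin. The expected counts are the ones worked out for the projectile example; the observed counts are an assumption for illustration only, since the episode does not read them out, chosen so that the total lands on the 1.80 quoted later for this example.

```python
def chi_squared(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [6.4, 13.6, 13.6, 6.4]  # from 16%, 34%, 34%, 16% of 40 shots
observed = [8, 10, 16, 6]          # assumed counts, for illustration only

chi2 = chi_squared(observed, expected)
print(f"chi-squared = {chi2:.2f}")
```

Each bin contributes its own term, so printing the individual terms is an easy way to see which bin is driving a large total.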

The mechanics of it are actually quite elegant.

Yeah.

But it's really vital to understand the why behind those mechanics.

Yeah.

I'm actually going to push back on the formula right here because this is where I know a lot of students get lost.

Okay, lay it on me.

I totally get why we square the deviation on the top of the fraction.

Like if you have a positive error of three in one bin and a negative error of three in another bin, you don't want them canceling each other out to zero when you add everything together.

Right.

Squaring them makes all the errors positive so they accumulate.

It captures the total amount of wrongness, so to speak.

Exactly.

But why do we divide by the expected number, the e sub k, on the bottom?

Why is that necessary?

That fundamentally comes down to relative scale.

Think about an absolute error of, say, five.

Your observed number was five off from what you expected.

Is an error of five a big deal?

Um, I guess it depends.

It entirely depends.

If you were running an experiment where you only expected two events to happen and you were off by five, that is a catastrophic failure of your model.

Oh, wow.

Yeah, that's huge.

But what if you expected 10,000 events to happen and you were off by five?

In that case, an error of five is practically a miracle of precision.

Exactly.

It's a tiny meaningless fluctuation.

By dividing the squared deviation by the expected number, you scale the error relative to the expectation.

Oh, that makes so much sense.

Yeah, it transforms a raw number into a meaningful statistical indicator of severity.

That is brilliant.

It penalizes errors heavily if they happen in bins where events are supposed to be rare.

Precisely.
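The two cases from this exchange, a miss of five when two events were expected versus a miss of five when 10,000 were expected, can be put side by side. This is only a sketch; the function name `contribution` is made up for the illustration.

```python
def contribution(deviation, expected):
    """One bin's share of chi-squared: squared deviation over expected count."""
    return deviation ** 2 / expected

rare_bin = contribution(5, 2)          # expected 2 events, off by 5
common_bin = contribution(5, 10_000)   # expected 10,000 events, off by 5

print(rare_bin)    # 12.5
print(common_bin)  # 0.0025
```

The same raw miss of five adds 12.5 to the total in the rare bin and only 0.0025 in the common one, which is the scaling the hosts describe.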

And speaking of rare events, the source text points out a crucial vulnerability in this formula.

A golden rule of bins.

Okay, what is it?

This is a pitfall you absolutely must avoid.

The mathematics behind the chi-squared test relies on continuous probability approximations.

And those approximations basically shatter if your expected number in any bin gets too small.

The strict rule from the text is that your expected number, e sub k, should be roughly five or larger in every single bin.

Wait, so what happens if you are analyzing something incredibly rare?

Say you're rolling dice and the expected number of throws with five aces is only like 0.03.

You just throw out the experiment?

Oh, no, you don't throw it out.

You adapt.

How?

You simply combine adjacent bins until the total expected number climbs back up to that safe threshold of five.

So you would merge the four aces bin and the five aces bin into a single four or five aces bin.

As long as the expected total for that combined bucket is above five, the math remains solid.
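Here is one way the merging could be sketched for the dice case, assuming 200 throws of five fair dice (the setup used later in the episode) and folding bins together from the rare upper tail only. The function name `merge_small_tail` and the label format are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def merge_small_tail(labels, expected, minimum=5.0):
    """Fold the last bin into its neighbor until its expected count reaches minimum."""
    labels, expected = list(labels), list(expected)
    while len(expected) > 1 and expected[-1] < minimum:
        tail_count = expected.pop()
        tail_label = labels.pop()
        expected[-1] += tail_count
        labels[-1] = f"{labels[-1]}+{tail_label}"
    return labels, expected

throws, dice, p_ace = 200, 5, 1 / 6
raw_expected = [throws * binomial_pmf(k, dice, p_ace) for k in range(dice + 1)]
raw_labels = [str(k) for k in range(dice + 1)]

labels, merged = merge_small_tail(raw_labels, raw_expected)
for label, e in zip(labels, merged):
    print(f"{label} aces: expected {e:.1f}")
```

Under these assumptions a combined four-or-five-aces bin still expects fewer than one throw, so the loop keeps folding until three, four, and five aces share a single bin with an expected count of about 7. That leaves four bins in total.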

Okay, so we've carefully set up our bins, we've respected the rule of five, we've run our formula, and we get a total chi-squared score.

Let's say, going back to our projectile example, the math spits out a score of 1.80.

But like, an error score in a vacuum is meaningless.

Is 1.80 good?

Is it terrible?

We can't know.

Not until we account for how many moving parts our experiment actually had.

The raw chi-squared score has to be compared to the number of degrees of freedom in your calculation.

We denote degrees of freedom with the letter d.

The text gives a simple equation for this.

It's d equals n minus c.

So the degrees of freedom equals the number of bins, that's lowercase n, minus the number of constraints, which is c.

Exactly.

Finding n is easy.

If we have four bins, n is four.

But let's slow down here.

What physically is a constraint?

The book says it's parameters calculated from the data, but that just sounds like textbook jargon.

Yeah, let's visualize it physically.

Imagine I tell you I have four opaque buckets of data, and the total number of measurements across all of them is 40.

Okay, 40 total.

I lift the lid off the first bucket, it has 10 items.

I lift the lid off the second, it has 15.

The third bucket, five.

At this point, do I even need to lift the lid off the fourth bucket for you to know what's inside?

No, because 10 plus 15 plus five is 30.

And if the total has to be 40, that last bucket absolutely must contain exactly 10 items.

It's totally locked in.

It is mathematically locked.

That fourth bucket has absolutely no freedom to be anything else.

Right.

By relying on the total number of measurements, which we call capital N, you force that last bin into a specific, unavoidable value.

You lost one degree of freedom.

That is wild.

Therefore, the total number, capital N, is considered a constraint.

Mind blown.

Okay, so capital N is always one constraint because the bins have to add up to the total.

Always.

But in our projectile example, we didn't just use capital N.

We also had to calculate the mean to find the center of our bins, and we had to calculate the standard deviation to find the width of our bins.

Yes, and because you had to extract those two additional parameters from the data itself to build your model, they act as two more constraints.

So they force the data to conform to a specific shape.

Exactly.

Further reducing the freedom of the bins to vary randomly.

Okay, so we have the total, capital N, the mean, and the standard deviation.

That is three constraints.

So c equals three.

Right.

And we had four bins.

So d equals four minus three.

We have exactly one degree of freedom.

Spot on.

Now that you have your degrees of freedom, you can calculate the reduced chi-squared.

Which you'll see written as the Greek letter chi with a little tilde squiggle on top.

Yep.

And the math here is incredibly straightforward.

You just take your total chi-squared score and divide it by your degrees of freedom.

So reduced chi-squared equals chi-squared divided by d.

Yes.
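The bookkeeping for degrees of freedom and the reduced score fits in two small functions. This is a sketch with made-up function names; the numbers plugged in are the bin counts, constraint counts, and totals quoted in the episode.

```python
def degrees_of_freedom(n_bins, n_constraints):
    """d = n - c: number of bins minus number of constraints."""
    d = n_bins - n_constraints
    if d < 1:
        raise ValueError("need more bins than constraints")
    return d

def reduced_chi_squared(chi2, d):
    """Chi-squared divided by the degrees of freedom."""
    return chi2 / d

# Projectile example: 4 bins; constraints are the total, the mean, and the
# standard deviation.
d_projectile = degrees_of_freedom(4, 3)
print(d_projectile, reduced_chi_squared(1.80, d_projectile))

# Anthropologist example: 8 bins, same three constraints, chi-squared 17.5.
d_heights = degrees_of_freedom(8, 3)
print(d_heights, reduced_chi_squared(17.5, d_heights))
```

The check on `d` is a design choice for the sketch: with as many constraints as bins there is nothing left free to test.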

This step is vital because it standardizes your score.

The golden benchmark, as the text explicitly states, is that the expected average value of the reduced chi-squared should be one.

Okay.

So if your reduced chi-squared is roughly one or less, there is no reason to doubt your distribution.

Your observed data fits your expected model.

Right.

But if it is significantly larger than one, your hypothesis is likely totally wrong.

That is the baseline rule of thumb.

But as you know, science rarely gives us perfect ones.

We have to deal with the gray areas.

We do.

In our projectile example, our total chi-squared was 1.80.

We divide that by our one degree of freedom and our reduced chi-squared is still 1.80.

Correct.

Now 1.80 is definitely larger than one, so I'm looking at this and getting a little suspicious.

But is it significantly larger?

Is it large enough that I need to throw out my entire experiment and admit defeat?

To answer that question objectively, you turn to probability tables, like Appendix D in the source text.

These tables translate your raw score into a concrete percentage.

They tell you the probability of getting a reduced chi-squared value that large or larger, assuming your expected distribution is actually true.

So I go to the table, I find the row for one degree of freedom, and I slide over to my value of 1.80.

The table tells me the probability is approximately 18%.

Yep.

Let's translate that into plain English for you.

This means, if my gun's range was truly perfectly Gaussian, and I ran this exact 40-shot experiment thousands of times, I would get a result this bad or worse 18% of the time, entirely due to random chance.

And an 18% chance is a routine, everyday fluctuation.

It happens nearly one out of every five times.

Right.

Because it is so common, you have absolutely no statistical grounds to reject your expected distribution.

The hypothesis survives the lie detector.
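If no table is at hand, the same tail probability can be computed directly. The sketch below is not the textbook's table; it evaluates the chi-squared tail for a whole number of degrees of freedom using the complementary error function and a standard recurrence for the incomplete gamma function. The function name is an assumption.

```python
import math

def chi2_tail_probability(chi2, d):
    """P(chi-squared >= chi2) for d degrees of freedom (d a positive integer)."""
    if d < 1 or chi2 < 0:
        raise ValueError("need d >= 1 and chi2 >= 0")
    half = chi2 / 2.0
    # Closed forms for the two starting cases.
    if d % 2 == 1:
        prob, dof = math.erfc(math.sqrt(half)), 1
    else:
        prob, dof = math.exp(-half), 2
    # Step up two degrees of freedom at a time:
    # Q(dof + 2) = Q(dof) + half^(dof/2) * exp(-half) / Gamma(dof/2 + 1)
    while dof < d:
        prob += half ** (dof / 2.0) * math.exp(-half) / math.gamma(dof / 2.0 + 1.0)
        dof += 2
    return prob

# Projectile example: total chi-squared 1.80 with one degree of freedom.
p_projectile = chi2_tail_probability(1.80, 1)
print(f"{p_projectile:.2f}")
```

The function takes the total chi-squared, so a reduced value has to be multiplied by `d` first. For the projectile numbers it returns about 0.18, in line with the 18% read from the table.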

But this raises a massive question.

Where is the cliff?

Like, when do we officially draw the line and say, nope, this data does not fit?

The text outlines two standard, ruthless statistical boundaries.

The first is the 5% boundary.

If you look at the table and the probability drops below 5%, statisticians call that significant disagreement.

At that point, you reject the expected distribution.

Wow.

That is pretty strict.

I mean, it means you are willing to accidentally throw out perfectly good, truthful data one in 20 times, just to be absolutely sure you aren't accepting a false model.

It's a conservative shield against bad science.

But there is an even stricter boundary, the 1% mark.

If the probability is less than 1%, that is termed highly significant disagreement.

You strongly reject the expected distribution without hesitation.

The data simply does not fit the model.
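The two boundaries translate into a simple decision rule. The function name and the wording of the "no reason to reject" outcome are assumptions for this sketch; the probabilities passed in are the ones quoted in the episode's examples.

```python
def verdict(probability):
    """Classify a tail probability using the 5% and 1% boundaries."""
    if probability < 0.01:
        return "highly significant disagreement"
    if probability < 0.05:
        return "significant disagreement"
    return "no reason to reject"

for name, prob in [("projectile", 0.18), ("heights", 0.005),
                   ("dice", 0.007), ("cosmic rays", 0.85)]:
    print(f"{name}: {verdict(prob)}")
```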

OK, we have all the pieces on the board now.

We understand bins, constraints, degrees of freedom, the formula and the probability boundaries.

We do.

So to perfectly cement this for you, let's look at how this plays out in the wild.

The text provides three fantastic, real-world examples that cover the three major distributions.

Gauss, binomial and Poisson.

Perfect.

Let's tackle example one, the anthropologist.

This is testing a Gauss, or normal, distribution.

Right.

So an anthropologist visits an island and measures the heights of 200 adult men.

He naturally suspects their heights should be normally distributed, you know, most men around average height, a few very tall, a few very short.

He groups his measurements into eight bins.

OK, let's run the constraint checklist.

He uses the total number of men measured, which is capital N.

To set up his expected bell curve, he calculates the mean height of his sample and he calculates the standard deviation.

Yep.

That is three constraints, so eight bins minus three constraints leaves him with five degrees of freedom.

Exactly.

The anthropologist plugs his observed bin counts and his expected theoretical counts into the chi-squared formula.

The total comes out to 17.5.

Dividing that by his five degrees of freedom gives a reduced chi-squared of 3.5.

OK, 3.5 is way bigger than our benchmark of one.

I'm highly suspicious.

And the table confirms your suspicion.

When he looks up a value of 3.5 with five degrees of freedom, the probability is approximately 0.5 percent.

Half of one percent.

That is well below the strict one percent threshold we just talked about.

Which forces a definitive conclusion.

This is highly significant disagreement.

The islanders' heights do not follow a normal distribution.

Perhaps there are two distinct ethnic groups on the island with different average heights skewing the data.

Whatever the reason, the single Gaussian hypothesis is completely rejected.

Bam.

The math catches the anomaly.

Let's move to example two, more dice.

This is testing a binomial distribution which deals with discrete events like win or lose or heads or tails.

We are rolling five dice at a time, 200 times in total, and we are counting the number of aces in each throw.

Now, because dice rolls are discrete, we don't need continuous ranges.

Our bins can just be the exact number of zero aces, one ace, two aces, and so on.

We set up four bins.

And the constraints here are fascinating.

Because dice are a known physical system, the probability of rolling an ace is fixed at one in six.

And we know we are rolling exactly five dice.

We have all the parameters of the theoretical binomial function entirely in advance.

We don't have to look at our data to calculate the mean or the width.

Oh, that's cool.

Yeah, the only parameter we take from the data is capital N, the total number of throws, which is 200.

So here, there is only one single constraint.

So four bins minus one constraint equals three degrees of freedom.

We do the chi-squared math, and the reduced score comes out to 4.16.

We check the table for a score of 4.16 at three degrees of freedom, and the probability is a measly 0.7%.

Once again, less than the 1% boundary.

Yep.

It is highly significant disagreement.

We must reject the hypothesis that the dice are fair.

The conclusion is inescapable.

The dice are almost certainly loaded.

The casino is cheating.

They absolutely are.

I love it.

It's so definitive.

Finally, example three, cosmic rays.

Ah, yes.

This tests the Poisson distribution, which is used for counting random independent events over time.

We set up a Geiger counter and count the arrival of cosmic rays hitting our lab over 100 separate one-minute intervals.

To handle the data, we group the minute intervals into six bins based on how many rays arrived, and we make sure to merge any small outer tails, so every bin has an expected count of five or more, following the golden rule.

Naturally.

Now, for the constraints on a Poisson distribution, we obviously need capital N, our 100 intervals, and we have to calculate the expected mean count, mu, from our observed data.

Wait, what about the standard deviation?

Don't we need that constraint to know how wide the distribution is?

Ah, the catch there is the unique nature of the Poisson distribution.

In a Poisson model, the standard deviation is automatically defined as the square root of the mean.

Oh, really?

Yes, they are mathematically locked together.

So once you calculate the mean from your data, you essentially have the standard deviation for free.

You don't impose a third separate constraint.
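The claim that the Poisson width comes for free can be checked numerically. This sketch picks a hypothetical mean of 3 counts per interval (not a value from the episode), builds the Poisson probabilities, and compares the resulting standard deviation with the square root of the mean.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k counts when the mean count is mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

mu = 3.0        # hypothetical mean count per interval
ks = range(60)  # the tail beyond 60 counts is negligible for this mu

mean = sum(k * poisson_pmf(k, mu) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, mu) for k in ks)

print(f"mean = {mean:.4f}")
print(f"standard deviation = {math.sqrt(variance):.4f}")
print(f"sqrt(mean) = {math.sqrt(mean):.4f}")
```

Because the width follows from the mean, fitting a Poisson model uses up only the total and the mean, which is why the constraint count stops at two.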

Got it.

Okay, so our constraints are just n and the mean, that's two constraints.

Six bins minus two constraints gives us four degrees of freedom.

The chi-squared math is done, and our reduced score comes out to a tiny 0.35.

We check the table, and the probability of getting a value that large or larger is a whopping 85%.

Which means we have absolutely no reason to doubt the Poisson distribution.

The fit is highly satisfactory.

You know, I'm going to push back here.

Oh?

Yeah, if I'm grading a lab report, and I see a reduced chi-squared score of 0.35, I'm actually suspicious.

Why is that?

Because one is the benchmark.

If a score is way less than one, doesn't that mean it's too good to be true?

Did someone overfit or massage the data to make it look perfect?

That is a very sharp instinct.

And in some fields, an unusually low score actually does trigger an audit for fake data.

I knew it.

But the text addresses this explicitly for standard error analysis.

An unusually low value, like 0.35, just indicates a very good fit, which is highly likely to be the result of a large, perfectly natural chance fluctuation toward the expected mean.

OK, so it's not automatically cheating.

No.

It doesn't give you magical extra certainty about your conclusion, but it absolutely confirms that the Poisson distribution is a completely valid model for the cosmic rays.

OK, that makes sense.

Well, we've journeyed from a single projectile to loaded dice and deep space cosmic rays.

We've built the formula, visualized the constraints, and used probability to draw hard boundaries between fact and random chance.

We've really covered a lot.

It really is a mathematical lie detector for your data.

It is.

And if you can internalize the logic of those three examples, you have the analytical tools to apply the chi-squared test to almost any data set you will ever encounter.

But as we wrap up this deep dive, it leaves you with a lingering, almost philosophical question to chew on.

Oh, I like a good philosophical question.

The chi-squared math is incredibly rigorous.

It can definitively tell you if your observed data fits your expected distribution.

But what if your expected distribution was chosen for the wrong reasons?

The math completely assumes your underlying theoretical logic is sound.

If you test for the wrong reality, say, testing for a normal bell curve when the physics of the universe actually demands a Poisson distribution, even a perfect math score won't save your experiment.

That is so true.

You still have to know why you are asking the question in the first place.

The math is flawless, but the scientist still has to be wise.

Exactly.

The tools only work if you point them in the right direction.

Keep questioning the data.

Keep looking for the underlying reality.

A warm thank you from the Last Minute Lecture team, and we will see you next time on the Deep Dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary: What this audio overview covers

The chi-squared test for distributional goodness of fit evaluates whether a dataset is consistent with a proposed theoretical model by comparing what was actually observed against what theory predicts. Data are organized into bins or categories, and the test measures discrepancies between the frequency of observations in each bin and the frequency anticipated if the theoretical distribution were true. The chi-squared statistic itself quantifies these discrepancies by calculating squared deviations between observed and expected counts, then normalizing each deviation by dividing it by the expected count in that bin. This normalization is crucial because it accounts for natural statistical variation; a given difference between observed and expected frequencies carries different weight depending on whether the expected count is large or small.

When the theoretical model describes the data well, the chi-squared value tends to be close to the number of degrees of freedom, which equals the number of bins minus one minus the count of parameters estimated from the sample itself. Values substantially exceeding this expectation suggest the data deviate significantly from the proposed distribution. The reduced chi-squared statistic, formed by dividing chi-squared by degrees of freedom, provides a standardized metric with an expected value near one under the null hypothesis, enabling easier comparison across studies employing different numbers of bins.

Practical implementation requires careful binning decisions and adherence to the rule that each bin must contain at least five expected observations to ensure the test maintains its intended statistical properties. Determining whether observed discrepancies warrant rejecting the theoretical model involves calculating tail probabilities from the chi-squared distribution and comparing them to conventional thresholds such as five percent or one percent. The method applies flexibly across different distributional assumptions, including normal, Poisson, and binomial models.
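For reference, the quantities described in this summary can be written compactly. The symbols are an assumption about notation, following what the episode reads aloud: $O_k$ and $E_k$ are the observed and expected counts in bin $k$, $n$ is the number of bins, and $c$ is the number of constraints, meaning one for the total plus one for each parameter estimated from the sample.

```latex
\[
\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k},
\qquad
d = n - c,
\qquad
\tilde{\chi}^2 = \frac{\chi^2}{d}
\]
```

With this convention, $c = 3$ for a Gaussian fitted with its mean and width, $c = 2$ for a Poisson fitted with its mean, and $c = 1$ for the dice, where every parameter is known in advance.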
