Chapter 8: Hypothesis Testing

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive, where we really try to transform complex information into clear, actionable insights.

And today, we're unlocking a really powerful tool for navigating all the data that's thrown at us.

Hypothesis testing.

I mean, every day you're hit with claims.

Right.

A new diet promises results, an ad boasts about durability.

Maybe a news headline claims some big shift in opinion.

Yeah, all the time.

But the real question is how do you actually figure out if those claims hold water?

Exactly.

And that's really our mission today.

This deep dive, well, it's going to guide you through the core ideas of hypothesis testing. We're drawing from Elementary Statistics by Mario Triola.

Classic.

It is.

And we're going to equip you with the knowledge to rigorously test claims about large populations using just a carefully selected sample of data.

So by the end of this deep dive, you should understand the main concepts, the process, sure, but also the practical uses and maybe just as important, the common pitfalls.

Right.

Helping you move beyond just taking information at face value so you can critically evaluate the evidence behind claims yourself.

And to keep us grounded, we'll keep coming back to a central question from the chapter.

Do most internet users utilize two-factor authentication?

Which really means, are more than 50 percent, you know, the majority, actually using 2FA?

That'll be our running example.

Okay.

Sounds good.

So let's jump in.

When statisticians talk about a hypothesis, what are we really getting at?

And how does that lead to a hypothesis test?

Okay.

So at its core, a hypothesis in statistics is really just a statement.

It's a claim about some characteristic, some property of a whole population.

Like average body temperature, maybe, or the proportion we just mentioned.

Exactly.

And a hypothesis test, or sometimes called a test of significance, that's the formal structured procedure we use to see if our data gives us enough evidence to either support or, well, challenge that initial claim.

So this isn't just for stats nerds hidden away somewhere.

Oh, absolutely not.

That's what's so fascinating.

These methods, they're fundamental to decision making pretty much everywhere.

Like where?

Well, think medicine, deciding if a new drug actually works.

Or business, A/B testing on websites to see what converts better.

Even in courtrooms, analyzing evidence.

Wow.

Yeah.

So understanding things like the null hypothesis or a p-value, it means you're learning a language that genuinely shapes how decisions are made all around us.

That's a really powerful perspective.

Now you mentioned testing a claim.

I understand we actually need two competing hypotheses to do that.

That's right.

We always set up two.

First is the null hypothesis.

We use H0 for that.

This is the statement that the population parameter, like the proportion or the mean, equals some specific value.

So it's kind of the default assumption, the status quo.

Precisely.

It's the working assumption we hold, unless the evidence strongly suggests otherwise.

And crucially, H0 always includes equality, like p = 0.5.

Okay.

And the other one?

That's the alternative hypothesis, H1.

This is the statement that challenges the null.

It says the parameter is actually different somehow.

Maybe it's greater than or less than, or simply not equal to the value in H0.

And H1 is often what you're actually trying to find evidence for.

Often, yes.

It usually represents the new claim or the effect you suspect might be true.

So let's take our 2FA example.

The claim is most users use 2FA.

That means the proportion p is greater than 0.5.

How do we set up H0 and H1 there?

Good question.

The claim itself, p > 0.5, doesn't include equality.

So that becomes your alternative hypothesis, H1.

H1 is p > 0.5.

Right.

And the null hypothesis, H0, has to state equality.

So H0 would be p = 0.5.

The test then kind of assumes H0 is true, and we see how likely our sample results are under that assumption.

Okay, got it.

And there's always talk about a cutoff point in these tests.

What's that about?

Ah, yes.

That's the significance level.

We call it alpha, the Greek letter α.

It's a probability you, the researcher, set before you even look at the data.

A probability of what?

It's the maximum risk you're willing to accept of making a specific kind of mistake called a type I error.

Which is?

A type I error is when you mistakenly reject the null hypothesis when it's actually true, kind of like a false alarm.

Okay.

So alpha is the chance of a false alarm.

Essentially, yes.

Common choices are 0.05, which means a 5% risk, or sometimes 0.01 for a 1% risk if the consequences of that false alarm are really serious.

Right.

So we have our claim, our H0 and H1, our risk level alpha.

How does our actual sample data come into play?

Through the test statistic. This is a key value.

It takes your sample information, like the sample proportion or mean, and converts it into a standardized score, like a z score or a t score.

And what does that score tell us?

It tells us how many standard deviations away your sample result is from what the null hypothesis predicted.

It measures how surprising your data is, assuming H0 is true.

And the type of score, like Z or T, depends on what you're testing.

Exactly.

For proportions, like our 2FA example, we typically use a Z statistic, assuming certain conditions are met.

For means, especially when we don't know the population standard deviation, we use a T statistic.

Okay.

So we have this test statistic.

How do we use it to decide whether to reject H0 or not?

I hear about p-values and critical values.

Right.

Two main paths to the same decision, usually.

Let's talk p-value first.

The p-value is a probability.

Okay.

It's the probability of getting a test statistic at least as extreme as the one you actually calculated from your sample, if the null hypothesis were actually true.

So a small p-value means your result was really surprising if H0 is true.

Precisely.

A small p-value suggests that your observed data is unlikely under the null hypothesis.

That gives you evidence against H0.

The common rule of thumb is, if the p is low, the null must go.

Yeah, I like that.

Okay.

And the other way, critical values.

Critical values work a bit differently.

They are threshold values on the distribution curve, like the Z or T curve.

They mark the boundaries of what we call the critical region.

The rejection zone.

Basically, yeah.

If your calculated test statistic falls into this critical region beyond the critical value, you reject the null hypothesis.

These critical values are determined by your alpha level and whether your test is left-tailed, right-tailed, or two-tailed.

And that tail depends on the alternative hypothesis, H1.

Exactly.

If H1 uses less than, it's left-tailed.

If it uses greater than, it's right-tailed.

If it uses not equal to, it's two-tailed, meaning you're looking for an extreme result in either direction.

So you compare your test statistic to the critical value or your p-value to alpha.

Then what?

How do you state the final conclusion?

That seems important.

It's absolutely crucial and often where people stumble.

The decision rule is straightforward.

If the p-value is less than or equal to alpha, reject H0.

If the p-value is greater than alpha, you fail to reject H0.

Using critical values: if the test statistic is in the critical region, reject H0.

If not, fail to reject H0.

Fail to reject, not accept.

Yes, that wording is critical.

We never say we accept the null hypothesis.

Why?

Because hypothesis testing isn't about proving H0 is true.

It's about seeing if we have enough evidence to reject it.

If we don't have enough evidence, we simply fail to reject H0.

It means the data wasn't strong enough to overturn the status quo.

Okay, that's a key distinction.

And the conclusion should be in plain language.

Yes.

Related back to the original claim in simple, non-technical terms.

Figure 8-5 in the Triola text actually gives great templates for wording this correctly based on whether the original claim included equality or not and whether you rejected H0 or failed to reject it.

Right.

So let's quickly apply this to our 2FA example.

We had 926 users, 52% used 2FA, claim p > 0.5, H0: p = 0.5.

The test statistic was z = 1.25 and the p-value was 0.1059.

So we fail to reject H0.

Okay.

So compare the p-value to alpha.

0.1059 is definitely greater than 0.05.

So we fail to reject H0.

Correct.

And the conclusion in non-technical terms relating to the original claim, p > 0.5, would be something like: there is not sufficient evidence to support the claim that most internet users utilize two-factor authentication.

Doesn't mean it's false.

Just not proven by this sample.

Exactly.

The evidence just wasn't strong enough at that 5% significance level.
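
For readers following along, the 2FA numbers above can be reproduced with a short Python sketch. It is only an illustration: the success count of 482 is an assumption (52% of 926, rounded to a whole person), since only the percentage is quoted, and the function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(x, n, p0):
    """Right-tailed z test for a proportion, using the normal approximation."""
    p_hat = x / n
    # The standard error uses the null value p0, not the sample proportion.
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value

# Assumed count: 482 of 926 users (about 52%) report using 2FA.
z, p_value = one_proportion_z_test(x=482, n=926, p0=0.5)
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

Under that assumed count, the z statistic and p-value should land close to the 1.25 and 0.1059 quoted in the discussion.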

Okay.

That lays a really solid groundwork.

Now let's dig into testing claims about specific parameters, starting with proportions like our 2FA example.

Right.

So testing claims about a population proportion p (think percentages, probabilities) has specific requirements.

You need a simple random sample, obviously.

Yes.

And the conditions for a binomial distribution should roughly hold: a fixed number of trials, independent trials, two categories like success and failure, and a constant probability.

And importantly, for the math to work well using the normal curve...

Ah, the np ≥ 5 and nq ≥ 5 rule.

Exactly.

Both the expected number of successes, n times p, and the expected number of failures, n times q, where q = 1 − p, under the null hypothesis should be at least 5.

This lets us use the normal distribution to approximate the binomial distribution, which makes calculations much easier.

So for 2FA: claim p > 0.5, H0: p = 0.5, sample size n = 926, sample proportion p-hat = 0.52.

We already found z = 1.25 and a p-value of 0.1059.

Can you walk through the critical value approach too?

Sure.

With alpha = 0.05 and a right-tailed test, because H1 is p > 0.5, we look up the critical z value that cuts off the top 5% of the standard normal distribution.

That value is z = 1.645.

And our test statistic was z = 1.25.

Right.

Since 1.25 does not fall into the critical region, which starts at 1.645 and goes to the right, our decision is the same.

Fail to reject H0.

Same conclusion, different path.

Makes sense.

Both methods should lead to the same conclusion for means and proportions when done correctly.
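
The critical value lookup described here can be sketched in a few lines. This is a hedged illustration using only Python's standard library; the helper name `critical_z` and its `tail` argument are invented for the example.

```python
from statistics import NormalDist

def critical_z(alpha, tail):
    """Critical value of the standard normal curve for a given tail type."""
    std_normal = NormalDist()
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    if tail == "two":
        # Split alpha evenly between the two tails; reject if |z| exceeds this.
        return std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'right', 'left', or 'two'")

z_stat = 1.25                       # test statistic quoted for the 2FA example
z_crit = critical_z(0.05, "right")  # boundary of the critical region
in_critical_region = z_stat >= z_crit
print(f"critical z = {z_crit:.3f}, reject H0: {in_critical_region}")
```

The same helper gives the familiar 1.96 boundary for a two-tailed test at alpha = 0.05.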

You mentioned confidence intervals earlier.

For proportions, they might not always perfectly match the p-value method.

Why is that again?

It's a bit technical, but it boils down to which standard deviation estimate is used.

The p-value and critical value methods use the proportion assumed in the null hypothesis to calculate the standard deviation.

Okay.

But a confidence interval uses the proportion found in your sample, p-hat, to estimate the standard deviation.

Because p and p-hat are usually slightly different, the standard deviation calculations differ slightly.

And occasionally, this can lead to a borderline case where one method suggests rejecting H0 and the other doesn't quite.

Interesting subtlety.

So for 2FA, the 90% confidence interval was 0.494 to 0.548.

What does that tell us?

Since that interval contains the null hypothesis value of 0.5, it supports our conclusion of failing to reject H0.

We can't be 90% confident that the true proportion is strictly greater than 0.5.
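
A rough sketch of the confidence interval calculation follows, to show where the sample proportion enters the standard error. As an assumption, it again takes 482 successes out of 926 (about 52%), and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n
    # Unlike the hypothesis test, the standard error here uses p_hat.
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

# Assumed count: 482 of 926 users report using 2FA.
low, high = proportion_confidence_interval(x=482, n=926, confidence=0.90)
contains_null = low <= 0.5 <= high
print(f"90% CI: {low:.3f} to {high:.3f}, contains 0.5: {contains_null}")
```

Because the interval is built around p-hat with its own standard error, it can occasionally disagree with the test in borderline cases, which is the subtlety mentioned above.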

Got it.

What about a quick example going the other way, maybe a left-tailed test, like fewer than 30% of adults have sleepwalked?

Okay.

Claim: p < 0.30.

So H1 is p < 0.30 and H0 is p = 0.30.

Let's say a large survey, n = 19,136, found the sample proportion p-hat was 0.292.

That's close to 0.30.

It is, but the sample size is huge.

When you run the numbers, the test statistic is z = −2.41.

Since it's a left-tailed test, the p-value is the area to the left of z = −2.41.

Which is small.

Very small.

The p-value is 0.0080.

If we're using alpha = 0.05, then 0.0080 is definitely less than or equal to 0.05.

So we reject H0 this time.

Correct.

We reject H0: p = 0.30.

And the conclusion: there is sufficient evidence to support the claim that fewer than 30% of adults have sleepwalked.
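
To see why the huge sample matters, here is a small sketch that runs the same left-tailed test twice. The first call uses the survey figures from the discussion; the second, with n = 1,000, is a purely hypothetical comparison added for illustration.

```python
from math import sqrt
from statistics import NormalDist

def left_tailed_proportion_test(p_hat, n, p0):
    """Return (z, p_value) for a left-tailed test of a proportion."""
    se = sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    return z, NormalDist().cdf(z)  # p-value = area to the left of z

# Survey figures from the discussion.
z_big, p_big = left_tailed_proportion_test(p_hat=0.292, n=19_136, p0=0.30)

# Hypothetical: the same sample proportion from a much smaller survey.
z_small, p_small = left_tailed_proportion_test(p_hat=0.292, n=1_000, p0=0.30)

print(f"n = 19,136: z = {z_big:.2f}, p-value = {p_big:.4f}")
print(f"n = 1,000:  z = {z_small:.2f}, p-value = {p_small:.4f}")
```

With the smaller hypothetical sample, the same 0.292 no longer yields a p-value below 0.05, so the identical gap from 0.30 would not count as significant.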

Okay.

That covers proportions pretty well.

What about testing claims about a population mean?

You said we usually don't know the population standard deviation, sigma.

Exactly.

That's the more realistic scenario.

When sigma is unknown, which is most of the time, we use the sample standard deviation s as an estimate.

And this means we can't use the standard normal z distribution.

We use the Student t distribution instead.

You got it.

The Student t distribution.

The requirements are a simple random sample, and then either the original population needs to be normally distributed or our sample size n needs to be reasonably large.

Usually n > 30 is considered sufficient for the central limit theorem to work its magic.

And the test statistic formula looks similar, but uses s and t.

Right.

It's t = (sample mean − hypothesized mean) / (s / √n).

How does this t distribution differ from the normal z distribution?

It's also bell shaped and centered at zero, but it's more spread out.

It has wider tails.

This reflects the extra uncertainty introduced by having to estimate sigma.

And its shape changes with sample size.

Yes.

Its precise shape depends on the degrees of freedom, which for a single sample mean test is simply df = n − 1.

As the sample size n, and thus the degrees of freedom, gets larger, the t distribution gets closer and closer to the standard normal distribution.

Okay.

Let's try an example claim.

Mean adult sleep is less than seven hours, so mu < 7.

We have a small sample, n = 12.

The sample mean is x-bar = 6.83 hours and the sample standard deviation is s = 1.99 hours.

First, since n = 12 is small, we'd need to be reasonably sure the population of sleep times is roughly normal.

Maybe check a histogram or quantile plot of the sample data.

Assuming it looks okay, we calculate the t statistic, right?

Plugging in the numbers, t = (6.83 − 7) / (1.99 / √12), which comes out to about −0.290 with the unrounded sample values.

Okay.

And the p-value for that?

It's a left-tailed test with df = 11, which is 12 − 1.

The p-value for t = −0.290 is quite large.

It's 0.3887.

Definitely larger than alpha = 0.05, way larger.

So we fail to reject H0.

Conclusion.

There is not sufficient evidence to support the claim that the mean adult sleep time is less than seven hours.

Even though the sample mean was below seven.

Yes, because it wasn't significantly below seven considering the sample size and the variation in the data, the standard deviation.

This really highlights the difference between a sample result and a conclusion about the population.
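
The sleep example can be sketched from its summary statistics alone. Two caveats: plugging in the rounded values 6.83 and 1.99 gives t of roughly −0.296 rather than the quoted −0.290, a gap that appears to come from rounding of the inputs, and the t-curve area is computed here with a simple numerical integration (a stand-in chosen to keep the sketch free of third-party libraries; a statistics package would normally supply it).

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the Student t distribution."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t) via Simpson's rule on [0, |t|], using symmetry about zero."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the curve between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

# Rounded summary statistics quoted in the discussion.
n, x_bar, s, mu0 = 12, 6.83, 1.99, 7.0
t_stat = (x_bar - mu0) / (s / sqrt(n))
df = n - 1
p_value = t_cdf(t_stat, df)  # left-tailed test: area to the left of t
print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Either way the p-value sits far above 0.05, so the decision to fail to reject H0 does not hinge on the rounding.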

Absolutely.

Now think about that body temperature example. Claim: mu = 98.6 degrees Fahrenheit.

This is a two-tailed test because the alternative would be mu ≠ 98.6 degrees Fahrenheit.

Since n > 30, we can use the t distribution.

The test statistic t calculates to −6.61.

That sounds extreme.

It is.

For a two-tailed test, the p-value is the area in both tails combined.

With t = −6.61, the p-value is incredibly tiny, practically 0.0000.

Much less than 0.05.

Definitely.

So we reject H0.

There is sufficient evidence to reject the common belief that the mean human body temperature is 98.6 degrees Fahrenheit.

The evidence suggests it's actually lower.

Which brings us back to that crucial idea.

Statistical significance versus practical significance.

Yes, this is so important.

The body temperature result is statistically significant.

The difference from 98.6 is almost certainly real, not just chance. But is a 0.4 degree difference practically significant in everyday life?

Probably not for most people.

Like the Belsomra drug example you mentioned.

Sleeping maybe 16 minutes longer was statistically significant.

Right.

The p-value was likely very small.

But does 16 minutes make a practical difference that justifies the cost or potential side effects?

That's a different question.

One that statistics alone can't answer.

You always need context and domain knowledge.

Good point.

Okay, one more parameter.

Standard deviation or variance.

Why test claims about variation?

Think quality control.

In manufacturing, you want products to be consistent.

Reducing variation is often the goal.

So testing claims about population standard deviation or variance is crucial there.

And this uses a different distribution.

Yes, for tests concerning variance or standard deviation, we use the chi-square distribution.

Chi-square, okay.

What are the requirements?

Simple random sample, of course.

But here's a big one.

The population itself must have a normal distribution.

This normality requirement is much stricter for chi-square tests on variance than it is for t-tests on means.

So you really need to check that normality assumption carefully.

You do.

If the population isn't normal, the results can be quite inaccurate.

The test statistic formula is chi-square = (n − 1) s² / sigma², where s² is the sample variance and sigma² is the hypothesized population variance from H0.

And the chi-square distribution looks different too.

Yeah, it's not symmetric like the normal or t distributions.

It starts at zero and is skewed to the right.

Its shape also depends on degrees of freedom, which again is df = n − 1.

Let's use that minting quarters example.

Claim.

New process reduces standard deviation of weights to less than 0.062 g.

So sigma < 0.062.

Right.

H1 is sigma < 0.062.

So H0 is sigma = 0.062.

Sample n = 24 quarters.

Sample standard deviation s = 0.0480164 g.

First, check normality for the sample weights.

Assuming that's okay, calculate the chi-square statistic.

We plug n = 24, s = 0.0480164, and sigma = 0.062 from H0 into the formula.

The result is chi-square = 13.795.

And the p-value?

It's a left-tailed test.

With df = 23, the p-value, the area to the left of 13.795, is 0.0674.

How does that compare to alpha = 0.05?

0.0674 is greater than 0.05.

So fail to reject H0 again.

Correct.

Conclusion.

There is not sufficient evidence to support the claim that the new minting procedure reduced the variation, the standard deviation, in quarter weights below 0.062 g.

Even though the sample standard deviation was lower?

Yes.

It wasn't low enough given the sample size to be statistically significant at the 0.05 level.
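
The quarters calculation can be checked with the sketch below. The chi-square area is computed with a textbook series for the incomplete gamma function, which is an implementation choice made here to avoid third-party libraries, not something taken from the chapter; the function name is likewise hypothetical.

```python
from math import exp, lgamma, log

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable, via the power series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= half_x / (a + k)
        total += term
        k += 1
    return total * exp(a * log(half_x) - half_x - lgamma(a))

# Values quoted in the discussion.
n, s, sigma0 = 24, 0.0480164, 0.062
chi2_stat = (n - 1) * s**2 / sigma0**2
df = n - 1
p_value = chi2_cdf(chi2_stat, df)  # left-tailed: area to the left
decision = "reject H0" if p_value <= 0.05 else "fail to reject H0"
print(f"chi-square = {chi2_stat:.3f}, df = {df}, p-value = {p_value:.4f}, {decision}")
```

The statistic and p-value should come out near the 13.795 and 0.0674 mentioned above, which is close to the 0.05 cutoff but still on the fail-to-reject side.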

Okay.

We've covered the mechanics for different parameters.

Now let's peel back another layer.

Let's talk about errors, power, and maybe some alternative methods.

You mentioned type I error earlier.

Right.

The type I error is rejecting H0 when H0 is actually true.

A false positive.

The probability is alpha.

A memory trick could be rejecting a true null, RTN.

And the consequence.

Like concluding a treatment works when it doesn't.

Exactly.

Leading to wasted resources or unnecessary procedures.

What's the other kind of error?

That's the type II error.

This is failing to reject H0 when H0 is actually false.

So it's a false negative.

The probability is called beta.

Like missing a real effect.

A treatment works, but your test doesn't detect it.

Precisely.

A memory hint.

Failing to reject a false null, FRFN.

This could mean a beneficial treatment gets overlooked, which can also have serious consequences.

And there's a trade-off between alpha and beta.

Generally, yes.

If you make alpha smaller, reducing the risk of type I error, you often increase beta, increasing the risk of type II error, and vice versa, assuming the sample size stays the same.

You have to decide which error is more critical to avoid in your specific situation.

This seems related to the power of a test.

What exactly is power?

Power is defined as 1 − beta.

So if beta is the probability of missing a real effect, type II error, then power is the probability of correctly detecting a real effect.

Meaning, it's the probability of correctly rejecting a false null hypothesis.

You got it.

Power measures how sensitive your test is to detecting a difference or effect if one truly exists.

You want tests with high power.

What makes a test more powerful?

Several things influence power.

A larger sample size generally increases power.

A larger significance level also increases power, but that means accepting more type I errors.

And importantly, the size of the actual effect matters.

It's easier to detect a large difference than a small one.

So researchers might actually calculate the needed sample size to achieve a certain power.

Absolutely.

That's good research design.

They might say, we want 80% power to detect an effect of this specific size at an alpha of 0.05, and then calculate the minimum sample size needed.

It ensures the study isn't underpowered, meaning it has a good chance of finding a real result if it's there.
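
As a rough sketch of that kind of planning, the code below estimates power for a right-tailed proportion test using the normal approximation, then searches for the smallest sample size that reaches a target power. The scenarios (a true proportion of 0.55 or 0.52 against H0: p = 0.5) are hypothetical, and the function names are invented for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed_proportion(p0, p_true, n, alpha):
    """Approximate power of a right-tailed z test of H0: p = p0."""
    std_normal = NormalDist()
    se_null = sqrt(p0 * (1 - p0) / n)
    se_true = sqrt(p_true * (1 - p_true) / n)
    # Smallest sample proportion that would lead to rejecting H0.
    cutoff = p0 + std_normal.inv_cdf(1 - alpha) * se_null
    # Probability of landing beyond the cutoff when p_true is the real value.
    return 1 - std_normal.cdf((cutoff - p_true) / se_true)

def smallest_n_for_power(p0, p_true, alpha, target_power):
    """Brute-force search for the minimum n that reaches the target power."""
    n = 2
    while power_right_tailed_proportion(p0, p_true, n, alpha) < target_power:
        n += 1
    return n

# Hypothetical planning question: 80% power to detect a true p of 0.55.
n_needed = smallest_n_for_power(p0=0.5, p_true=0.55, alpha=0.05, target_power=0.80)

# Hypothetical: power of a 926-person study if the true proportion were 0.52.
power_926 = power_right_tailed_proportion(p0=0.5, p_true=0.52, n=926, alpha=0.05)
print(f"n needed: {n_needed}, power at n = 926: {power_926:.2f}")
```

Under these assumptions, a study the size of the 2FA survey would have only about a one-in-three chance of detecting a true proportion of 0.52, which shows how a small real effect can easily go unnoticed.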

Makes sense.

Now what about situations where the assumptions for these traditional tests, like normality, aren't met?

Are there other options?

I think you mentioned resampling methods.

Yes, exactly.

Traditional methods, as we saw, often rely on assumptions.

Resampling methods like bootstrapping and randomization are powerful alternatives, especially with modern computing power.

What's the advantage?

Their biggest advantage is they often don't require assumptions about the underlying distribution of the data.

It doesn't need to be normal.

They also don't usually have minimum sample size requirements and can handle complex situations where formulas are tricky.

Okay.

How does bootstrapping work?

Roughly.

Bootstrapping typically involves creating many, many new samples by drawing repeatedly with replacement from your original sample.

You calculate the statistic of interest, like a mean or proportion, for each new sample, creating a distribution of these statistics.

From this distribution, you can construct a confidence interval.

And use that confidence interval to test the hypothesis.

Yes, similar to how we discussed using confidence intervals earlier.

If the null value falls outside the bootstrap confidence interval, you'd reject H0.
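
Here is a minimal sketch of a percentile bootstrap interval for a mean. The sleep times are made-up values for illustration, not the chapter's data, and the percentile method is just one common way to build a bootstrap interval.

```python
import random
from statistics import mean

def bootstrap_mean_ci(sample, confidence=0.95, resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    boot_means = sorted(
        mean(rng.choices(sample, k=n))  # draw n values with replacement
        for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    low = boot_means[int(tail * resamples)]
    high = boot_means[int((1 - tail) * resamples) - 1]
    return low, high

# Hypothetical sleep times in hours (made-up data).
sleep_hours = [4.0, 8.5, 6.0, 7.5, 9.0, 5.5, 6.5, 10.0, 7.0, 4.5, 8.0, 5.5]
low, high = bootstrap_mean_ci(sleep_hours, confidence=0.90)
null_value = 7.0
reject_h0 = not (low <= null_value <= high)
print(f"90% bootstrap CI: {low:.2f} to {high:.2f}, reject H0: {reject_h0}")
```

No normality assumption is needed here; the spread of the resampled means stands in for the theoretical sampling distribution.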

And randomization.

Randomization methods directly test the null hypothesis.

They work by shuffling or modifying the observed sample data in a way that's consistent with H0 being true.

Then they repeatedly resample from this modified data to see how often a result as extreme as the original sample occurs just by chance under H0.

This generates a p-value.
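
One simple version of this idea, sketched below for the 2FA claim, is to simulate many samples directly from the null model (p = 0.5) and count how often they look at least as extreme as the observed one. This is a simplification of the randomization procedure described above, and the observed count of 482 out of 926 is an assumption based on the quoted 52%.

```python
import random

def simulated_p_value(x_observed, n, p0, simulations=5_000, seed=0):
    """Estimate a right-tailed p-value by simulating samples under H0."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    at_least_as_extreme = 0
    for _ in range(simulations):
        # One simulated sample of n users, generated as if H0 were true.
        successes = sum(rng.random() < p0 for _ in range(n))
        if successes >= x_observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / simulations

# Assumed observed count: 482 of 926 users report using 2FA.
p_sim = simulated_p_value(x_observed=482, n=926, p0=0.5)
print(f"simulated p-value: {p_sim:.3f}")
```

The simulated p-value should fall in the same neighborhood as the one from the z test, well above 0.05, so the conclusion stays the same.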

So for our 2FA, sleep, or quarters examples, we could use these methods as a check or alternative.

Definitely.

They provide a really valuable way to confirm findings or get results when traditional assumptions are questionable.

This whole process seems rigorous, but you also hear about controversies and pitfalls.

Like p-hacking.

What's that?

P-hacking is a serious issue.

It refers to questionable research practices, where scientists manipulate their data collection or analysis process specifically to get a p-value under the magic 0.05 threshold.

Like stopping data collection as soon as p < 0.05, or trying lots of different analyses.

Exactly.

Or selectively removing outliers, or only reporting tests that worked out.

It's basically cherry-picking results to achieve significance, and it really undermines the scientific process.

And didn't one journal actually ban p-values?

Yes.

Basic and Applied Social Psychology.

They argued p-values were too often misused or misinterpreted, leading to weak or irreproducible findings.

It sparked a lot of debate, but it highlighted legitimate concerns about over-reliance on a single number.

There's also the danger of running too many tests, right?

The aspirin and zodiac sign story.

Yeah, that's a classic cautionary tale from Dr. Richard Peto.

When asked to analyze his aspirin trial data by countless subgroups, he jokingly included an analysis by zodiac sign, finding aspirin was ineffective for Geminis and Libras.

Showing that if you slice the data enough ways, you will find something significant just by chance.

Precisely.

Run 20 tests at alpha = 0.05, and you'd expect about one of them to be significant purely by random luck, even if nothing real is going on.
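
The arithmetic behind that warning is short enough to sketch. The second function assumes the tests are independent and that every null hypothesis is true, which is an idealization added here for illustration.

```python
def expected_false_positives(tests, alpha):
    """Expected number of significant results when every null is true."""
    return tests * alpha

def prob_at_least_one_false_positive(tests, alpha):
    # Assumes independent tests with every null hypothesis true.
    return 1 - (1 - alpha) ** tests

expected = expected_false_positives(20, 0.05)
at_least_one = prob_at_least_one_false_positive(20, 0.05)
print(f"expected false positives: {expected:.1f}")
print(f"chance of at least one: {at_least_one:.2f}")
```

Under those assumptions, 20 tests give an expected count of one false positive and a better-than-even chance of seeing at least one.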

It warns against data dredging or testing hypotheses that weren't planned in advance.

So with all these methods, nuances, and pitfalls, what's the big takeaway for us trying to evaluate claims?

The key is really holistic statistics.

Don't get tunnel vision on just the p-value. Look at the whole picture.

Like?

How is the sample collected?

Is it representative?

What does the data distribution look like?

Are there outliers?

What's the effect size?

Is the difference practically meaningful?

Consider confidence intervals alongside the hypothesis test.

Maybe even look at resampling results, if available.

Use multiple viewpoints.

Don't put all your faith in one number.

Exactly.

A thoughtful, comprehensive approach gives you a much clearer and more reliable understanding.

Yeah, there you have it.

A pretty deep dive into hypothesis testing.

We've gone from the basics, null and alternative hypotheses, alpha, test statistics, p-values, through testing specific claims about proportions, means, and standard deviations.

And we also covered the potential errors, the concept of power, resampling methods as alternatives, and some really important pitfalls to watch out for.

Hopefully, you now feel you have a more powerful framework to critically look at claims based on data, whether it's in the news, in research, or even things you hear in everyday life.

It really is about moving beyond just accepting information and being able to start asking, OK, but what's the evidence?

How strong is it?

Right.

So here's a final thought to maybe chew on.

We've seen hypothesis testing as rigorous, but it's still just a tool, not some kind of magic oracle.

The assumptions we make, the quality of the data we gather, and especially how we interpret a failure to find evidence, failing to reject H0, that's all just as critical as the final p-value itself.

So knowing this framework now, what claims that you encounter might you start to see a little differently?

Something to think about.

A warm thank you from the Last Minute Lecture team for diving deep with us today.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary: What this audio overview covers

Inferential statistics relies on hypothesis testing as a systematic method for evaluating whether observed sample data provide sufficient evidence to support claims about larger populations. The framework begins by formulating two competing statements: the null hypothesis posits no effect or difference exists, while the alternative hypothesis represents the claim under investigation. The directional nature of the test determines which outcomes would lead to rejection, with left-tailed tests examining whether parameters fall below a threshold, right-tailed tests assessing whether they exceed it, and two-tailed tests evaluating departures in either direction from the hypothesized value. Central to hypothesis testing is the significance level, denoted alpha, which establishes the maximum acceptable probability of incorrectly rejecting a true null hypothesis, an error known as Type I error. Complementing this is Type II error, the risk of failing to reject a false null hypothesis, revealing an inherent tension in test design between controlling these two error types. Two main decision-making approaches guide the process: the critical value method determines rejection by comparing a computed test statistic to a predetermined threshold, while the p-value method calculates the probability of obtaining results as extreme as observed data if the null hypothesis held true. Selecting appropriate tests depends on context and available information, with z-tests for means suitable when population standard deviation is known and t-tests preferred when it must be estimated, particularly with smaller samples. Testing categorical data involves similar logic but adapted for proportions. Real-world application requires distinguishing statistical significance, which indicates results unlikely attributable to random variation, from practical significance, which addresses whether findings hold meaningful consequences in applied settings. 
The chapter provides detailed procedural frameworks, illustrative examples from quality assurance and clinical domains, and essential guidance on verifying underlying assumptions before test selection and execution. Students develop competence in constructing appropriate hypotheses, selecting valid test methods, computing statistics, and interpreting conclusions while avoiding common pitfalls in p-value reasoning and results communication.
