Chapter 13: Nonparametric Tests

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Have you ever found yourself wondering if, say, those super expensive smartphones actually deliver better quality?

Or maybe you've looked at lottery numbers and thought, is there a pattern?

Or even just, is my body temperature really that classic 98.6 degrees?

Sometimes the data we're dealing with, like rankings or maybe just categories like yes or no,

it doesn't quite fit the strict rules of the usual statistical tools.

Right.

It's like the data doesn't fit the mold for those standard tests, the ones you probably learned first.

Exactly.

And what's fascinating here is that this is precisely where nonparametric tests are really essential.

Our deep dive today is all about these, well, robust statistical methods.

They let us analyze data without making those really strict assumptions about how it's distributed.

Whether it fits that perfect bell curve shape.

Exactly.

That's a common one.

So we're going to explore why these tests are so incredibly versatile and maybe more importantly when you should actually reach for them when you're looking at data.

Okay.

Let's unpack this then.

Our mission today is to cut through the jargon, give you a clear, practical handle on these tools.

We're drawing insights from a key chapter in Elementary Statistics by Mario Triola.

We want to give you the ins and outs, real world examples, and maybe crucially the common mistakes people make.

So let's start right at the basics.

What's the fundamental difference between a parametric test and a nonparametric one?

Well, think of parametric tests like having a really specific checklist.

They demand that your population data meets certain conditions, like as you said, being normally distributed.

Okay.

Nonparametric tests, though, they're much more flexible.

People often call them distribution free.

Distribution free, meaning they don't care about the shape of the data.

Pretty much.

They don't impose those strict assumptions about the underlying shape, which is a huge advantage sometimes.

Although, you know, while nonparametric is the common term, it can be a tiny bit misleading.

Distribution free is probably more precise.

How so?

Because some nonparametric tests do actually relate to parameters like the median.

They just don't need a specific distribution shape for the data itself.

It's a subtle point, but yeah, good to keep in mind.

I see.

And this flexibility,

it must have some big plus points.

Oh, absolutely.

First, because they have fewer rigid rules, you can use them in way more situations, much wider applicability.

Right.

And second, they can handle different kinds of data.

Data that's already in ranks, like comparing smartphone reviews or even just categorical data, like counting yes versus no.

Makes sense.

But are there downsides?

There must be tradeoffs.

Yeah, there definitely are.

A big one is that they tend to, well, waste information sometimes.

Waste information.

How?

OK, take a simple sign test looking at weight loss.

You might just put a minus sign if someone lost weight, but you ignore how much weight they lost.

The actual number, the magnitude gets thrown out.

I get it.

You just know the direction, not the distance.

Exactly.

And secondly, and this is important, nonparametric tests are generally less efficient.

Less efficient.

What does it mean in practice?

It means that compared to a parametric test, if that parametric test's assumptions are met, that's the key caveat.

The nonparametric test often needs stronger evidence to reject the null hypothesis, like you might need a larger sample size, or the differences in your data need to be bigger.

That efficiency point sounds really important strategically, it's not just theory.

Not at all.

It's a real trade-off.

For instance, rank correlation, which is a nonparametric method we'll get to, has an efficiency of about 0.91 compared to the standard linear correlation.

So 0.91.

What does that mean for me if I'm collecting data?

It means, roughly, that using the nonparametric rank correlation might require, say, 100 observations in your sample to get the same statistical power that you could achieve with only 91 observations using the parametric linear correlation.

But only if all the conditions for that parametric test are perfectly met.

Got it.

So it's about optimizing your effort.

Precisely.

Now, to really understand a lot of these tests, we need to talk about ranks.

Okay, ranks.

Like first, second, third.

Yeah.

Exactly that simple.

You sort your data, maybe smallest to largest, and assign a number.

One for the first, two for the second, and so on.

Simple enough.

What if there are ties?

Like two things have the same value.

Ah, good question.

That's crucial.

If you have ties, you assign each tied item the mean of the ranks they would have taken up.

The mean.

Okay, give me an example.

Sure.

Let's say you have values, and three of them are identical.

And they would have been ranked, say, second, third, and fourth.

Right.

You calculate the average of two, three, and four, which is two plus three plus four, divided by three, which equals three.

So each of those three tied items gets the rank of three.

Okay, that makes sense.

Average the ranks they would have occupied.

You got it.
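As a minimal sketch of that tie rule in Python (the function name `mean_ranks` and the sample values are illustrative assumptions, not taken from the Triola text), each tied value gets the mean of the ranks it would have occupied:

```python
def mean_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) correspond to ranks start+1..end+1.
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# Hypothetical data: the three 7s would have taken ranks 2, 3, and 4.
print(mean_ranks([5, 7, 7, 7, 9]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```

The three tied values each receive rank 3, matching the spoken example.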

All right.

Let's maybe dive into our first specific test then.

You mentioned a simple one earlier.

Yeah, let's start with the sign test.

It's remarkably straightforward.

It really just uses plus and minus signs to test a claim.

Plus and minus signs.

Okay.

When would you use this?

It's got three main uses, really.

First, for claims about matched pairs of data.

Think before and after measurements on the same person.

Like testing a diet.

Before weight, after weight.

Perfect example.

Second, it works for claims with nominal data that only has two categories.

Yes, no answers.

Male, female counts.

Things like that.

And third, you can use it for claims about the median of a single population.

Not the mean.

The median.

Right.

The middle value.

So how does it work?

You just count pluses and minuses.

Pretty much.

You convert your data into signs, maybe positive for an increase or one category, negative for a decrease, or the other category.

Any data points that give you a zero difference or land exactly on the median you're testing, you just discard them.

Get rid of the zeros.

Okay.

Then your test statistic, usually called X, is just the count of whichever sign appears less often.

Less frequent sign.

Okay.

Then what?

Then, if your sample size N, that's after discarding zeros, is 25 or less, you look up critical values in a special table, like table A7 in the Triola text.

If N is bigger than 25, you use a z-score approximation.

The trusty z-score.

Yep.

And there's a little tweak there.

You usually add or subtract 0.5 from your X count.

It's called a continuity correction.

Helps adjust for using a continuous z-distribution with discrete count data.

Okay, continuity correction.

Got it.

Can we walk through an example?

Sure.

Let's use that measured and reported male weights idea.

You ask men their weight, then you weigh them, calculate the difference, measured minus reported.

Right.

Maybe measured is higher sometimes, positive, lower other times, negative.

Let's say you get six positives, three negatives, and one guy was spot on, zero difference.

So discard the zero, N is nine.

Exactly.

The less frequent sign is negative, there are three of them.

So X is three.

X is three.

Now what?

Now you look at table A7 for N equals nine and your chosen significance level, say 0.05. The critical value might be one.

So we need X to be one or less to be significant.

Correct.

Since our X is three, which is not less than or equal to one, we fail to reject the null hypothesis.

Meaning, based on this small sample, we don't have enough evidence to say there's a significant difference between what men report as their weight and what they actually weigh.
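The table lookup in this example can be cross-checked with an exact binomial calculation. The sketch below is an alternative to the table, not the textbook procedure: it assumes a two-tailed test and treats each sign as equally likely under the null hypothesis. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(n_plus, n_minus):
    """Two-tailed exact sign test p-value; zero differences must already be discarded."""
    n = n_plus + n_minus
    x = min(n_plus, n_minus)  # test statistic: count of the less frequent sign
    # P(X <= x) for a binomial with n trials and p = 0.5.
    tail = sum(comb(n, k) for k in range(x + 1)) / 2 ** n
    return x, min(1.0, 2 * tail)


# Six positive signs, three negative signs, one zero difference already dropped.
x, p = sign_test_p_value(n_plus=6, n_minus=3)
print(x, round(p, 4))  # 3 0.5078
```

A p-value of about 0.51 is far above 0.05, which agrees with the decision to fail to reject the null hypothesis.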

OK, that seems pretty clear.

Now you mentioned pitfalls earlier.

Is there something specific with the sign test?

Yes, and it applies more broadly too, but it's very clear here.

You have to use common sense alongside the statistics.

Sometimes your data might scream contradiction to your claim, but the test could technically show significance just in the wrong direction.

What do you mean wrong direction?

OK, take those gender selection studies.

Imagine a method claims to increase the chance of having a boy, so the proportion p should be greater than 0.5.

Right, more boys.

But then, in a large sample, say 945 births, you only observe 66 boys.

That's less than 7%.

Your gut tells you immediately this doesn't support the claim.

Yeah, absolutely not.

But mathematically, it's possible a sign test comparing this to 0.5 could yield a significant result simply because it's so far away from 0.5.

But it's significant in the opposite direction of the claim.

Ah, so the stats might say, wow, that's really far from 50-50.

But it's far in the direction that disproves your theory.

Precisely.

Statistical significance just measures unlikeliness under the null hypothesis.

It doesn't automatically validate your alternative hypothesis.

You always need to check if the result actually makes sense in the context of your claim.

Is it significant and in the direction you predicted?

That's a really critical point.

Don't just trust the p-value blindly.

Look at the actual data.
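To see that pitfall in numbers, here is a rough sketch of the large-sample sign test applied to the 945 births, using the continuity correction mentioned earlier. The formula is the usual normal approximation with x as the count of the less frequent sign. The function name `sign_test_z` and the separate direction check are illustrative assumptions.

```python
from math import sqrt


def sign_test_z(x, n):
    """Normal approximation for the sign test (n > 25); x is the less frequent sign count."""
    return ((x + 0.5) - n / 2) / (sqrt(n) / 2)


n, boys = 945, 66
girls = n - boys
z = sign_test_z(min(boys, girls), n)

# The claim was "more boys", so the data only support it if boys outnumber girls.
claim_direction_ok = boys > girls
print(round(z, 2), claim_direction_ok)  # -26.41 False
```

The z-score is extreme, yet the direction check fails, so the result cannot be read as support for the claim.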

And just quickly, another sign test application.

Testing the median body temperature claim.

Is it less than 98.6 degrees Fahrenheit?

You assign signs based on above or below 98.6.

It might lead you to reject the idea that the median is 98.6, similar to a parametric test, but often the p-value from the sign test will be higher.

It's less sensitive, less powerful.

Because it ignores how far below 98.6 someone's temperature is.

Exactly.

It just sees below, not way below.

That's the information trade-off we talked about.

Okay, so the sign test is simple, maybe too simple sometimes because it ignores magnitude.

What's the next step up?

That leads us nicely to the Wilcoxon signed-ranks test.

This is a clever improvement because it does bring those magnitudes back into play, but using ranks.

This generally makes it more powerful than the sign test.

Signed ranks.

Okay.

How does that work?

And when do you use it?

Still primarily for matched pairs, testing if the median difference is zero, or, like the sign test, you can adapt it for a single population median claim by pairing each value with the claimed median.

Right.

Let's use the weights example again.

Measured versus reported, how would Wilcoxon handle it?

Okay, first step is the same.

Calculate the differences, measured minus reported. Again, discard any zero differences.

Got it.

Now, here's the key difference.

Ignore the signs for a moment and rank the absolute values of those differences from smallest to largest.

Handle ties like we discussed before, assign the mean rank.

Rank the absolute differences.

Then you put the original signs, plus or minus, back onto those ranks.

Ah, so now a rank might be plus three or a negative five, for example.

Exactly.

Now you sum up all the positive ranks, then you sum up all the negative ranks and take the absolute value of that sum.

Two sums.

Sum of positive ranks, absolute sum of negative ranks.

Right.

Your test statistic, T, is whichever of those two sums is smaller.

The smaller sum is T.
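Those steps can be sketched in a few lines of Python. The differences below are made-up values for illustration, and `wilcoxon_T` is a hypothetical helper name. The result would still need to be compared against a critical value from a table.

```python
def wilcoxon_T(differences):
    """Return (T, n) for the Wilcoxon signed-ranks test on matched-pair differences."""
    d = [x for x in differences if x != 0]  # discard zero differences
    abs_d = [abs(x) for x in d]
    ordered = sorted(abs_d)
    # Rank the absolute differences; tied values share the mean of their ranks.
    ranks = [ordered.index(a) + (ordered.count(a) + 1) / 2 for a in abs_d]
    # Reattach the signs and sum each side separately.
    positive_sum = sum(r for r, x in zip(ranks, d) if x > 0)
    negative_sum = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(positive_sum, negative_sum), len(d)


# Hypothetical measured-minus-reported differences (one zero gets discarded).
T, n = wilcoxon_T([2.5, -1.0, 4.0, 0.0, -1.5, 3.0, 6.0, 0.5, -2.0, 5.0])
print(T, n)  # 9.0 9
```

Here the negative ranks are 2, 3, and 4, so their absolute sum of 9 is the smaller of the two sums.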

Then compare T to a critical value?

Yep.

There's another table, table A8 in Triola, for n up to 30.

For larger samples, again, there's a z-score approximation.

Okay, so for our weights example, maybe T comes out as seven, and the table says the critical value is, say, four for our n and significance level.

Right.

Since seven is not less than or equal to the critical value of four, you'd again fail to reject the null hypothesis.

Same conclusion as the sign test in this hypothetical case.

Could be.

But the Wilcoxon test used more information on the relative sizes of the differences, so if there was a real difference, it would generally be more likely to detect it than the sign test.

It's more powerful.

Okay, that makes sense.

Now, both of those were for matched pairs or single medians.

What if you have two separate groups, like comparing men's heights and women's heights?

Ah, now you need a different tool.

For two independent samples where you want to test if they come from populations with the same median, you use the Wilcoxon rank sum test.

Rank sum.

Is there an easy way to remember which Wilcoxon is which?

I like the mnemonic IRS.

Independent samples use the rank sum test.

IRS.

Independent rank sum.

Okay.

I can remember that.

What are the advantages here?

Well, same nonparametric benefits.

No normality needed.

Can use ordinal data.

But what's really impressive is its efficiency rating.

It's about 0.95 compared to the parametric t-test for independent samples.

0.95.

Wow.

That's really high.

So it's almost as good as the t-test even if the t-test assumptions are met.

Pretty much.

It's a very robust and powerful alternative.

So how does this one work?

Let's take comparing male heights from two different time periods, say the ANSUR I and ANSUR II surveys, independent groups.

Right.

First step.

Combine all the height data from both surveys into one big list.

Cool.

Everything together.

Yep.

Then rank every single height in that combined list from smallest to largest, handling ties as usual.

Okay.

Rank the whole pool.

Then you go back to your original groups and sum up the ranks for just one of the samples.

Let's say you sum the ranks for all the men in the ANSUR I survey.

Call that sum R1.

So you get one big sum of ranks for one group.

Then what?

Then you use that sum R1 along with the sample sizes to calculate a Z statistic.

The formula basically compares the rank sum you got R1 to what you'd expect if the two groups really came from populations with the same median and it accounts for the variability.

Calculate a Z score from the rank sum.
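As a hedged sketch of that calculation, the code below pools two samples, ranks them, sums the ranks of the first sample, and standardizes with the usual large-sample mean and standard deviation of a rank sum. The heights are invented for illustration (they are not the ANSUR data), and `rank_sum_z` is a hypothetical name. The normal approximation is usually reserved for samples with more than 10 values each.

```python
from math import sqrt


def rank_sum_z(sample1, sample2):
    """Return (R1, z) for the Wilcoxon rank-sum test using the normal approximation."""
    pooled = list(sample1) + list(sample2)
    ordered = sorted(pooled)
    # Rank the pooled data; tied values share the mean of their ranks.
    ranks = [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in pooled]
    n1, n2 = len(sample1), len(sample2)
    R1 = sum(ranks[:n1])                          # rank sum for the first sample
    mu_R = n1 * (n1 + n2 + 1) / 2                 # expected rank sum if medians are equal
    sigma_R = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # its standard deviation
    return R1, (R1 - mu_R) / sigma_R


# Hypothetical heights in centimeters for two independent groups of 11.
group1 = [161, 163, 165, 167, 169, 171, 173, 175, 177, 179, 181]
group2 = [170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190]
R1, z = rank_sum_z(group1, group2)
print(R1, round(z, 2))  # 87.0 -2.59
```

With these made-up numbers z falls outside plus or minus 1.96, so the null hypothesis of equal medians would be rejected at the 0.05 level.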

Right.

So in that height comparison example, maybe the calculation gives you z equals negative 2.00.

You compare that to the standard critical values for your significance level, often plus or minus 1.96.

And negative 2.00 is outside plus or minus 1.96.

Correct.

So you would reject the null hypothesis.

Meaning there is a difference in median height between the two survey periods.

Based on that calculation, yes.

It suggests a statistically significant difference.

Interestingly though, Triola mentions that when this same comparison is done using the full, much larger ANSUR data sets, the conclusion actually flips.

They find no significant difference.

Really?

So sample size made a huge difference there.

Absolutely.

It's a great reminder that statistical significance can depend heavily on sample size, especially when effects are small.

More data gives you more power to detect differences or more confidence that there isn't one.

Okay.

Good point.

So we've done matched pairs, single median, two independent samples.

What if you have three or more groups, like comparing patient outcomes across three different drug treatments?

Now you're moving into the territory covered by the Kruskal-Wallis test.

This is essentially the nonparametric version of ANOVA, analysis of variance.

It's designed specifically for testing if three or more independent samples come from populations with the same median.

Kruskal-Wallis.

Okay.

And the main advantage over ANOVA is?

No normality requirement.

That's the big one.

ANOVA needs the populations to be normally distributed, which isn't always the case.

Kruskal-Wallis frees you from that assumption.

How does it work?

Is it also based on ranks?

Yes, very much so.

The test statistic is called H.

It basically measures how much the sum of ranks for each group varies.

If the groups come from populations with the same median, their rank sums should be pretty similar and H will be small.

If one group's ranks are consistently high or low, the rank sums will differ a lot and H will be large.

So large H means likely difference.

Exactly.

It's a right-tailed test.

You only reject the null hypothesis that the medians are equal if H is significantly large.

Can we use an example?

Maybe those head injury criterion measurements in car crash tests comparing small, mid-size, and large cars.

Three independent groups.

Perfect.

Similar procedure to the Wilcoxon rank sum, actually.

You pool all the HIC measurements from all car sizes together.

Rank everything.

Rank everything in the combined list.

Then calculate the sum of ranks separately for the small cars, the mid-size cars, and the large cars.

Get three rank sums.

Right.

Then you plug those sums and the sample sizes for each group into the formula for the H statistic.

Okay, and let's say H comes out as 6.309.

You'd compare that H value to a critical value from the chi-square distribution, since H follows that distribution approximately.

Using degrees of freedom equal to 1 less than the number of groups.

So for three groups, that's 2 degrees of freedom.

If the critical chi-square value at your significance level, say 0.05, is 5.991, then our H of 6.309 is bigger than that.

Correct.

So you reject the null hypothesis.

You conclude that there is a statistically significant difference among the median HIC measurements for small, mid-size, and large cars.

At least one group is different from the others.

It doesn't tell you which one, just that they aren't all the same.
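Here is a small sketch of the H calculation, assuming the standard formula without a tie correction. The measurements are invented (not the crash test data), and the names `kruskal_wallis_H` and `chi2_tail_df2` are hypothetical. For exactly 2 degrees of freedom the chi-square right-tail probability has the closed form exp(-x/2), which is what the second helper uses. It does not apply to other degrees of freedom.

```python
from math import exp, log


def kruskal_wallis_H(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for independent samples."""
    pooled = [v for g in groups for v in g]
    ordered = sorted(pooled)
    # Rank the pooled data; tied values share the mean of their ranks.
    rank_of = {v: ordered.index(v) + (ordered.count(v) + 1) / 2 for v in set(pooled)}
    N = len(pooled)
    # Sum of (rank sum squared / sample size) over the groups.
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)


def chi2_tail_df2(x):
    """Right-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return exp(-x / 2)


# Hypothetical measurements for three independent groups.
small = [12, 15, 18, 20, 25]
mid = [14, 17, 22, 27, 30]
large = [24, 28, 31, 33, 35]
H = kruskal_wallis_H(small, mid, large)
critical = -2 * log(0.05)  # chi-square critical value for df = 2 at the 0.05 level
print(round(H, 2), round(critical, 3), H > critical)  # 7.22 5.991 True
```

The same tail function applied to the H of 6.309 from the discussion gives a p-value of roughly 0.043, which is consistent with rejecting at the 0.05 level.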

Okay, that's the Kruskal-Wallis test for three or more groups.

Now let's shift gears a bit.

What about correlation?

We talked about linear correlation earlier.

Is there a nonparametric version for that?

Yes, there is.

It's called Spearman's rank correlation test, or often just rank correlation.

It tests for an association between two variables, but it uses the ranks of paired data.

Spearman's?

Yeah.

Okay.

How is it different from the regular correlation, r?

We usually use the symbol r_s, r sub s, for Spearman's rank correlation to keep them distinct.

The big advantages are you can use it with data that's already ranked, or easily converted to ranks.

No normality assumption needed for the variables.

And interestingly, r_s can sometimes pick up on relationships that aren't strictly linear but are monotonic.

Monotonic, meaning as one variable increases, the other consistently increases or consistently decreases, even if not in a straight line.

Exactly.

Regular r is really focused on linear relationships, but r_s is broader in that sense.

Let's revisit our very first question then.

Do better smartphones cost more?

We had quality ranks and costs.

How would we use Spearman's r_s here?

First step, you need ranks for both variables.

The quality might already be ranked.

You'd need to take the costs and rank them from lowest to highest, handling any ties by assigning the mean rank, just like before.

So you end up with pairs of ranks.

Yeah.

Quality rank and cost rank for each phone.

Precisely.

Then you calculate the difference between the ranks for each pair, square those differences, sum them up, and plug that sum into the formula for r_s.
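A minimal sketch of that calculation follows, assuming there are no ties within either set of ranks (the shortcut formula below is exact only in that case). The ranks are invented for illustration and are not the smartphone data. `spearman_rs` is a hypothetical name.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical paired ranks for five items.
quality_rank = [1, 2, 3, 4, 5]
cost_rank = [2, 1, 4, 3, 5]
rs = spearman_rs(quality_rank, cost_rank)
print(round(rs, 3))  # 0.8
```

Identical rank orders give 1, and perfectly reversed rank orders give negative 1.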

Okay.

Let's say doing that for the smartphone data gives us r_s equals negative 0.796.

Now you need to see if that's statistically significant.

You compare r_s to critical values.

Again, there's a table, table A9 in Triola.

If your sample size n is 30 or less.

For larger n, there's a formula involving a t-distribution.

So for our example, maybe n equals 10 phones, the table might give critical values of plus or minus 0.648 at the 0.05 significance level.

Right.

And since our calculated r_s of negative 0.796 is outside that range, it's more negative than negative 0.648, we reject the null hypothesis of no correlation.

Meaning there is a correlation between quality rank and cost rank.

Yes.

A statistically significant one.

In this case, it's negative.

Wait, wait.

Let me recheck that interpretation.

If higher rank means better quality and higher rank means higher cost, let's rerun that.

Suppose higher rank, better quality, and we rank costs from low to high.

If r_s is negative, it means better quality phones, higher rank, tend to have lower cost ranks, lower cost.

If r_s were positive, it would mean better quality phones tend to have higher costs.

Let's assume the example calculation gave r_s equals plus 0.796 for discussion.

Okay.

Let's assume r_s equals plus 0.796. That's outside plus or minus 0.648.

So reject null.

There is a correlation.

It seems higher quality rank goes with higher cost rank.

Correct.

So based purely on that, it looks like paying more gets you better quality.

And there is the massive pitfall we need to hammer home.

The statistics show a correlation, an association.

They do not show that paying more causes better quality.

Right.

Correlation is not causation.

We hear it all the time.

But why is it so vital here?

Because it's incredibly easy to jump to that conclusion.

Maybe better quality phones require more expensive components, causing the higher price.

Or maybe brands known for quality can charge more.

Or maybe there's some other factor, like marketing budget, linked to both perceived quality and price.

The correlation itself doesn't tell you which way the causal arrow points or if there's an arrow at all.

So making a decision based only on the correlation could be a mistake.

A potentially expensive one.

Think about business strategy or even personal purchasing.

Understanding this difference protects you from acting on spurious relationships.

It's honestly one of the most important takeaways from any stats course.

Good reminder.

OK, one more test to cover from this chapter, right?

The runs test.

Yes.

The runs test for randomness.

This one's a bit different.

It's used to determine if the order in which data occurs in a sequence is random.

Randomness.

Like flipping a coin, heads, tails, should be random.

Exactly.

Or say the sequence of male-female births or maybe the political party of presidents elected over time.

The test looks for evidence of nonrandom patterns, like clustering or systematic cycling.

How does it do that?

What's a run?

A run is just a sequence of identical data characteristics surrounded by different characteristics or by the start or end of the data.

So in F, F, F, M, M, F, M, M, M, you have a run of three Fs, then two Ms, then one F, then three Ms.

That's four runs total.
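Counting runs is easy to sketch in Python. The helper name `count_runs` is an assumption for illustration. It simply counts the groups of consecutive identical symbols.

```python
from itertools import groupby


def count_runs(sequence):
    """Number of runs: maximal blocks of consecutive identical items."""
    return sum(1 for _ in groupby(sequence))


# Three Fs, two Ms, one F, three Ms.
print(count_runs("FFFMMFMMM"))   # 4
# Heavy clustering gives very few runs.
print(count_runs("FFFMMMMM"))    # 2
# Strict alternation gives as many runs as there are items.
print(count_runs("FMFMFMFMFM"))  # 10
```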

OK.

So you count the runs.

How does that tell you about randomness?

The core idea is this.

If the sequence is truly random, you expect a moderate number of runs.

If you have too few runs, like F, F, F, M, M, M, M, M, only two runs, it suggests clustering or a trend.

If you have too many runs, like F, M, F, M, F, M, F, M, F, M, 10 runs,

it suggests some kind of systematic switching back and forth.

Both extremes point away from randomness.

So you reject randomness if the number of runs is very low or very high.

Precisely.

You're looking for an unusual number of runs, either too few or too many.

How do you apply it?

You mentioned political parties.

Right.

Let's take a sequence like R, D, D, R, D, R, D, R, D, R, D, R, D, R, D, R, D, R.

First, identify the categories, R and D.

Count how many of each?

n1, Republicans, equals 8, and n2, Democrats, equals 7.

Then count the number of runs, G.

Let's trace it.

R1, D, D2, R3, D, D4, R5, D6, R7, D8, R9, D10, R11.

So G is 11.

G is 11 runs.

Now if your sample sizes n1 and n2 are small, usually 20 or less, and you're using a standard significance level like 0.05, you look up critical values in another table, table A10.

For n1 equals 8 and n2 equals 7, the table might give critical values of say 4 and 13.

So we reject randomness if G is 4 or less, or 13 or more?

Correct.

Our G is 11.

That's not less than or equal to 4, and it's not greater than or equal to 13.

It falls within the range expected for random sequences.

So we fail to reject randomness.

The sequence of presidents seems random in this example.

Based on this test, yes.

Now what if you have lots of data, like 50 years of global average temperatures?

Can't use the table, then.

Right.

If either n1 or n2 is greater than 20, you use a z-score approximation again.

First you need to categorize the numerical data.

A common way is to find the mean or median temperature over the 50 years.

Then, for each year, label it A if it's above the mean and B if it's below the mean.

This creates a long sequence of As and Bs.

Got it.

Convert numbers to categories based on the mean.
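That conversion step might look like the sketch below. The temperatures are invented, the name `above_below` is hypothetical, and dropping values exactly equal to the mean is an assumption made here (the discussion above does not say how to treat them).

```python
from itertools import groupby
from statistics import mean


def above_below(values):
    """Label each value 'A' (above the mean) or 'B' (below); values equal to the mean are dropped."""
    center = mean(values)
    return ["A" if v > center else "B" for v in values if v != center]


# Hypothetical yearly temperatures in degrees Celsius, in time order.
temps = [14.0, 14.1, 14.2, 14.5, 14.6, 14.8]
labels = above_below(temps)
n1, n2 = labels.count("A"), labels.count("B")
G = sum(1 for _ in groupby(labels))  # number of runs
print("".join(labels), n1, n2, G)  # BBBAAA 3 3 2
```

A steadily rising series produces one block of Bs followed by one block of As, so the run count is as low as it can be.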

Yep.

Then you count N1, number of As, N2, number of Bs, and the number of runs G.

Let's say you get n1 equals 24 As, n2 equals 26 Bs, and G equals 8 runs.

Only 8 runs in 50 years.

That sounds low.

It does seem low, suggesting clustering.

You'd plug n1, n2, and G into the z-score formula for the runs test.

Maybe it gives you z equals negative 5.14.

Wow.

Negative 5.14.

That's way out there.

Way outside the typical critical values of plus or minus 1.96 for alpha equals 0.05.

So you'd strongly reject the null hypothesis of randomness.
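The z-score in this example can be reproduced with the standard large-sample formulas for the mean and standard deviation of the number of runs. This is a sketch under that assumption, and `runs_test_z` is a hypothetical name.

```python
from math import sqrt


def runs_test_z(n1, n2, G):
    """Normal approximation for the runs test: z = (G - mu_G) / sigma_G."""
    mu_G = 2 * n1 * n2 / (n1 + n2) + 1  # expected number of runs under randomness
    sigma_G = sqrt(
        2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
        / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    )
    return (G - mu_G) / sigma_G


# 24 years above the mean, 26 below, and only 8 runs.
z = runs_test_z(24, 26, 8)
print(round(z, 2))  # -5.14
```

About 26 runs would be expected under randomness here, so observing only 8 is far into the lower tail.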

Conclusion: the sequence of global temperatures above and below the mean is not random.

The very low number of runs suggests a trend.

Likely, the below-mean temperatures are clustered early on and the above-mean temperatures are clustered later, which points towards an increasing temperature trend.

Fascinating how that test can pick up a trend just by looking at the sequence order.

It's quite powerful for that specific question of randomness.

So there you have it, quite a journey through these nonparametric tests.

It really feels like opening up a whole new part of the statistical toolkit, doesn't it?

Ways to tackle data when those classic parametric rules just don't fit.

Absolutely.

We've seen how they help us look at everything from smartphone quality versus cost to whether temperature changes are random all without needing those strict assumptions about bell curves and things like that.

It seems the trade-off is sometimes efficiency.

They might need more data, but the payoff is huge versatility.

That's a great summary.

They let you get insights from ranked data, categorical data, data where you're just not sure about the underlying distribution.

It really highlights how important it is to understand your data before you just jump in with a standard test.

And I think the biggest lesson woven through all of this isn't just about the specific formulas for T or H or r_s.

It's about the critical thinking, right?

Knowing when to use these, what their limits are, remembering correlation isn't causation.

Exactly.

That's what empowers you to make genuinely informed decisions, not just run numbers.

So the next time you're looking at data that feels a bit different, maybe it's rankings, maybe it's just categories, or you're worried about assumptions.

Yeah.

Remember the sign test, the Wilcoxon test, Kruskal-Wallis, rank correlation, the runs test.

Give them some thought.

They might just show you patterns or answer questions in ways you didn't expect.

It makes you wonder what other real world problems out there are just waiting for a nonparametric approach.

We'd love to hear what you, our listeners, think might be interesting areas for these kinds of tests.

Definitely.

Keep exploring that data.

Until next time,

keep digging for those nuggets of knowledge.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Distribution-free statistical methods become necessary when research data violate the stringent assumptions underlying parametric tests, such as normality or homogeneity of variance. Nonparametric approaches work with ordinal data, severely skewed distributions, small sample sizes, or situations where the researcher cannot reasonably assume the data follow a specific probability distribution. The sign test is the simplest nonparametric alternative for median hypothesis testing, requiring only directional information about whether observations fall above or below a hypothesized median value. When more statistical power is desired while keeping distributional flexibility, the Wilcoxon signed-ranks test incorporates rank information from paired differences, letting researchers use magnitude information without assuming normality. For comparisons of two independent groups, the Wilcoxon rank-sum test (equivalent to the Mann-Whitney U test) ranks all observations across both groups and determines whether group positions differ systematically, functioning as the rank-based counterpart to the independent samples t-test. Extending this logic to three or more independent groups, the Kruskal-Wallis test performs omnibus group comparisons through rank analysis, replacing one-way analysis of variance when parametric conditions cannot be satisfied. The Spearman rank correlation coefficient measures bivariate association without the assumptions of linear correlation, evaluating monotonic relationships by correlating ranks rather than raw values. The runs test for randomness rounds out the toolkit by checking whether the order of a sequence of observations is random.
Practical application requires students to understand selection criteria based on research design, sample characteristics, and variable types, as well as computational procedures that transform raw observations into ranks before calculating test statistics. A fundamental tradeoff exists wherein nonparametric methods sacrifice statistical power compared to parametric alternatives when all parametric assumptions are actually satisfied, yet they provide dependable, robust inferences when standard assumptions demonstrably fail, making them indispensable across applied research in behavioral sciences, clinical medicine, and natural resource management.
