Chapter 3: Describing, Exploring, and Comparing Data


Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to The Deep Dive, the show where we take source material and really pull out the key nuggets of knowledge to help you get informed, fast.

Have you ever looked at a pile of numbers, maybe customer reviews, sports stats, economic news, and just wondered how do I actually make sense of this?

Today, we're diving into some essential tools that help us decode those numbers, tell their story.

It's all about helping you become a more informed, critical consumer of information.

Absolutely, and it's such a critical foundation for understanding pretty much anything you can measure.

You know, the theme park wait times, economic trends, it's everywhere.

We'll be drawing from Elementary Statistics, the 14th edition by Mario Triola.

It's a really excellent guide to these core ideas in descriptive statistics.

Yeah, it is.

And our mission today is simple, make these concepts clear, thorough, and honestly easy for you to get a handle on.

We'll use real world examples, practical applications.

We're going to unpack the key definitions, look at the what, but maybe more importantly, the why it matters.

Ready to jump in?

Let's do it.

Okay, so first things first, let's try to find the heart of the data.

This idea of measures of center.

When you've got a whole mess of numbers, what's like the best single value to represent the middle?

That's a great starting point, and it's more nuanced than you might think.

What is the center, really?

There's not just one answer.

We're going to look at four main ways.

Okay.

The mean, the median, the mode, and the midrange.

Okay, let's start with the mean.

This is what most people, casually, call the average.

You just add everything up, divide by how many numbers you have.

Simple enough.

It seems simple, yeah, but that word average, statisticians tend to be a bit wary of it.

Why?

Well, because average can sometimes hide what's really going on.

The mean is mathematically specific, sum of values divided by the count.

That precision matters.

Right, and here's where it gets kind of tricky, isn't it?

The mean's big vulnerability: outliers, or extreme values.

Imagine Disney World wait times for Space Mountain.

Say you have times like 50 minutes, 25, 75, 35, stuff like that.

For 11 wait times, the mean might be, say, 39.1 minutes.

Okay, sounds reasonable.

But what if one time, maybe the ride broke down temporarily, was suddenly 300 minutes?

That single, huge number.

It pulls the mean way, way up.

Suddenly, the average wait time looks much longer than what people usually experience.

Exactly, and this leads to a really important idea.

Resistance: is a statistic resistant?

Think of it like this.

A resistant statistic is solid.

It doesn't get easily pushed around by one or two extreme values, those outliers.

The mean, it's not resistant.

It's sensitive to those extremes.

So if the mean is susceptible to being pulled around by these outliers, is there another measure of center that, well, holds its ground better?

Yes, absolutely, and that's where the median comes in.

It's a star player here.

The median is just the middle value when you line all your data up in order, smallest to largest.

If you have an odd number of data points, it's the one smack dab in the middle.

If it's an even number, you take the two middle ones and find their mean.

And the key thing, it is resistant to outliers.

Those extreme values don't really affect it much.

Okay, going back to the Space Mountain example.

With the original 11 times, the mean was 39.1, but maybe the median was 35 minutes.

Now, add that crazy 300-minute wait time.

The mean skyrockets, right, but the median, it probably barely nudges, maybe shifts just a tiny bit.

Precisely.

It holds its ground.
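To make the resistance idea concrete, here is a minimal Python sketch. The eleven wait times are hypothetical values, chosen only so that they reproduce the mean of 39.1 and median of 35 quoted in the conversation; they are an assumption, not data taken from the textbook.

```python
from statistics import mean, median

# Hypothetical wait times in minutes (an assumption for illustration),
# picked so the mean is about 39.1 and the median is 35.
wait_times = [50, 25, 75, 35, 50, 25, 30, 50, 45, 25, 20]

# The same data with one extreme 300-minute wait added.
with_outlier = wait_times + [300]

print(f"mean:   {mean(wait_times):.1f} -> {mean(with_outlier):.1f}")
print(f"median: {median(wait_times)} -> {median(with_outlier)}")
```

With these assumed values the mean jumps by more than 20 minutes, while the median only moves from 35 to 40.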

There's actually a powerful real world example with Stephen Jay Gould, the biologist.

He was diagnosed with a cancer that had a median survival time of, I think, eight months, which sounds devastating.

But Gould, understanding statistics, knew the median wasn't his personal forecast, because the median isn't dragged down by those who only survived a short time, nor pulled up dramatically by the few who lived for years.

He knew it just meant half lived longer, half shorter.

He realized the distribution likely had a long tail, meaning longer survivals were possible, and the median didn't capture that for him.

It's a great reminder to interpret stats carefully.

That's, wow, that really puts it into perspective.

Okay, so we have mean and median.

What's next?

The mode.

The mode, yep.

This one's pretty straightforward.

It's simply the value or values that appear most frequently in your data set.

A cool thing about the mode is it's the only measure of center you can really use for qualitative data.

Things that aren't numbers, like favorite colors, types of cars, that kind of stuff.

You could have one mode (unimodal), two modes (bimodal), many modes (multimodal), or sometimes no mode at all if every value is unique.

Right, so probably less useful for, say, precise measurements where every number might be slightly different.

Often, yes.

If you're measuring weights to many decimal places, you might not get any repeats.

But think about something like jersey numbers on a sports team.

Taking the median jersey number doesn't make any sense.

They're just labels.

But the mode.

If number 10 is worn by more players than any other number, that might tell you something, maybe about popularity or position.
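As a rough illustration of the mode, the sketch below uses Python's standard `statistics.multimode`. The helper name `modes` and all three sample lists are hypothetical. The helper returns an empty list when no value repeats, which is one way to express the "no mode" convention described above.

```python
from statistics import multimode

def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    found = multimode(values)
    # multimode returns every distinct value when each appears exactly once;
    # in that case we report "no mode".
    return [] if len(found) == len(values) else found

# Hypothetical samples, for illustration only.
favorite_colors = ["blue", "red", "blue", "green", "red", "blue"]  # qualitative data
two_peaks = [25, 25, 25, 50, 50, 50, 30]                           # bimodal
all_unique = [12.01, 12.07, 11.98]                                 # no repeats

print(modes(favorite_colors))
print(modes(two_peaks))
print(modes(all_unique))
```

Note that the first list is not numeric at all, which is exactly the case where the mean and median cannot be computed but the mode still can.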

Gotcha.

Okay, one more.

The midrange.

Ah, the midrange.

Super easy to calculate.

You just take the highest value, maximum, and the lowest value, minimum, add them together, and divide by two.

Finds a point exactly halfway between the extremes.

But because it only uses those two extreme values, it's extremely sensitive to outliers.

Even more so than the mean, maybe.

So honestly, it's not used very often in practice.

Easy, but not very robust.
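A small sketch of the midrange and its sensitivity to a single extreme value. The function name `midrange` and the wait times are hypothetical, used only for illustration.

```python
from statistics import mean

def midrange(values):
    """Halfway point between the largest and smallest value."""
    return (max(values) + min(values)) / 2

# Hypothetical wait times in minutes (an assumption for illustration).
wait_times = [50, 25, 75, 35, 50, 25, 30, 50, 45, 25, 20]
with_outlier = wait_times + [300]

# How far each measure moves when one 300-minute wait is added.
midrange_shift = midrange(with_outlier) - midrange(wait_times)
mean_shift = mean(with_outlier) - mean(wait_times)

print(f"midrange: {midrange(wait_times)} -> {midrange(with_outlier)}")
print(f"midrange shift = {midrange_shift:.1f}, mean shift = {mean_shift:.1f}")
```

For this assumed sample the midrange moves several times further than the mean does, which is consistent with the point that it is even less resistant.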

Okay, so wrapping up this section on center, the key takeaway for you, the listener, is this: when you see an average reported, pause and ask yourself, which measure of center are they using?

Mean? Median?

Is it the right one for the data?

Could outliers be messing with the picture?

Think about reports on the average American's height or weight.

If they use the mean, a few very tall or very heavy individuals could skew it.

If they use measurements instead of self-reports, that matters too.

People tend to fudge those numbers, right?

I certainly do.

So knowing how that central value was found is crucial for accurate understanding, especially when designing things like airplane seats or clothing sizes.

Okay, perfect.

So we found the center.

But knowing the middle isn't the whole story, is it?

You could have two data sets with the exact same mean or median, but they could look totally different in terms of how spread out the numbers are.

This is where variation comes in.

Oh, absolutely.

Variation is, you could argue, maybe the most important concept in all of statistics.

Understanding spread is critical.

Imagine two soda bottling plants.

Plant A produces Coke cans.

Plant B produces special cola.

Let's say both have a mean fill volume of exactly 12.2 ounces.

Okay, so on average, they're the same.

Right.

But now imagine you look at the actual fill volumes for hundreds of cans from each plant.

Plant A's volumes are all tightly packed, around 12.2 ounces, very consistent.

Plant B, their volumes are all over the map.

Some cans have 11 ounces, some have 13.

Huge variation.

If you only looked at the mean, you'd think they're identical.

But variation tells a crucial story about quality control, consistency, everything.

Companies strive to reduce variation.

That makes sense.

So how do we actually put a number on this spread?

We need measures, right?

We do.

The three main ones are the range, standard deviation, and variance.

The range is the simplest, just like the midrange was for center.

It's just the maximum value minus the minimum value.

So for our Space Mountain waits, if the highest was 75 minutes and the lowest was 20, the range is 75 minus 20, which equals 55 minutes.

Easy peasy.

But like the midrange, it only uses the two extreme values, so it's not resistant to outliers and doesn't tell you about the spread of the data in between.

Okay, so useful for a quick glance, but limited.

What's the main tool then?

Standard deviation.

Exactly.

The standard deviation, usually denoted as s for a sample, is the real workhorse for measuring variation.

It basically measures the typical amount the data values deviate or stray away from the mean.

A larger standard deviation means more spread, more variation.

A smaller one means the data is more tightly clustered around the mean.

Some key properties.

It's never negative.

It can only be zero if all the data values are exactly the same.

And its units are the same as the original data units, like minutes or ounces or inches, which makes it easier to interpret than variance.

And the calculation.

It involves looking at distances from the mean, squaring them.

It sounds a bit involved.

Conceptually, yeah, you find how far each point is from the mean, the deviation.

You square those deviations so positives and negatives don't cancel out.

You average them in a slightly special way, dividing by n minus 1 for a sample, which is a statistical adjustment for better estimation.

And then you take the square root to get back to the original units.

We usually use technology to calculate it, but understanding the concept, measuring typical deviation from the mean, is key.

That n minus 1 thing?

It just makes our sample variance a better, unbiased guess of the true population variance.

A technical point, but an important one for accuracy.
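The steps just described (deviations, squares, divide by n minus 1, square root) can be sketched in a few lines of Python. The wait times are hypothetical values used for illustration, and the last line cross-checks the hand calculation against the standard library's `statistics.stdev`, which also divides by n minus 1.

```python
from math import sqrt
from statistics import stdev

# Hypothetical wait times in minutes (an assumption for illustration).
wait_times = [50, 25, 75, 35, 50, 25, 30, 50, 45, 25, 20]

n = len(wait_times)
x_bar = sum(wait_times) / n                        # the sample mean
deviations = [x - x_bar for x in wait_times]       # distance of each value from the mean
squared_deviations = [d ** 2 for d in deviations]  # squaring stops +/- from cancelling
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1, not n
s = sqrt(sample_variance)                          # back to the original units (minutes)

print(f"mean = {x_bar:.1f}, variance = {sample_variance:.1f}, s = {s:.1f}")
print(f"library check: {stdev(wait_times):.1f}")
```

The raw deviations always sum to zero (up to rounding), which is why they have to be squared before averaging.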

Okay, so we get a number.

The standard deviation.

But what does that number mean, practically?

Like, if the standard deviation of wait times is 15 minutes, is that big?

Small?

Great question.

This is where rules of thumb come in handy.

The first is the range rule of thumb.

It's pretty rough, but useful.

It helps identify values that might be significant or unusual.

Generally, values are considered significantly low if they're below the mean minus two standard deviations: mean minus 2s.

And they're significantly high if they're above the mean plus two standard deviations: mean plus 2s.

Anything in between is usually considered not significant, within the usual range.

Okay, so for those female pulse rates: mean 74, standard deviation 12.5.

Two standard deviations is 2 times 12.5, which equals 25.

So significantly high would be above 74 plus 25 equals 99.

That pulse of 102 we mentioned earlier, definitely significant by this rule.

Exactly.

The range rule also gives a very crude way to estimate the standard deviation if you only know the range: s is roughly the range divided by four.

Very rough, used with caution, but sometimes helpful for a quick check.
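Both uses of the range rule of thumb can be sketched as follows, using the pulse-rate figures (mean 74, standard deviation 12.5) and the 75 and 20 minute extremes quoted in the conversation. The function names are hypothetical, chosen for this illustration.

```python
def usual_limits(mean, s):
    """Return (low, high): the mean minus and plus two standard deviations."""
    return mean - 2 * s, mean + 2 * s

def classify(x, mean, s):
    """Label a value using the range rule of thumb."""
    low, high = usual_limits(mean, s)
    if x < low:
        return "significantly low"
    if x > high:
        return "significantly high"
    return "not significant"

def estimate_s(maximum, minimum):
    """Very rough estimate of the standard deviation: range divided by 4."""
    return (maximum - minimum) / 4

print(usual_limits(74, 12.5))   # limits for the pulse rates
print(classify(102, 74, 12.5))  # the pulse of 102
print(estimate_s(75, 20))       # from a range of 55 minutes
```

The range-over-four figure is only a ballpark; when the raw data are available, computing s directly is the safer choice.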

Now, what if the data has that nice bell-shaped curve, that normal distribution?

Is there something more precise?

Yes.

For bell-shaped data, we have the wonderful empirical rule, also known as the 68-95-99.7 rule.

It's incredibly powerful.

It tells us about 68% of the data falls within one standard deviation of the mean, between mean minus s and mean plus s.

About 95% falls within two standard deviations, between mean minus 2s and mean plus 2s.

And about 99.7%, almost everything, falls within three standard deviations, between mean minus 3s and mean plus 3s.

So for IQ scores, with a mean of 100 and standard deviation of 15, about 95% of people would have IQs between 100 minus 2 times 15 and 100 plus 2 times 15.

That's 70 to 130.

Precisely.

The empirical rule gives you a really good mental map of where most of the data lies in a normal distribution.
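One way to check the empirical rule numerically is with `statistics.NormalDist` from Python's standard library. This sketch assumes the IQ scores follow an exact normal distribution with the mean of 100 and standard deviation of 15 quoted above; real data would only approximate this.

```python
from statistics import NormalDist

# Figures quoted in the conversation; exact normality is an assumption.
MEAN, SD = 100, 15
iq = NormalDist(mu=MEAN, sigma=SD)

intervals = {}  # k -> (low, high) limits
shares = {}     # k -> proportion of the distribution inside those limits
for k in (1, 2, 3):
    low, high = MEAN - k * SD, MEAN + k * SD
    intervals[k] = (low, high)
    shares[k] = iq.cdf(high) - iq.cdf(low)
    print(f"within {k} sd: {low} to {high}, share = {shares[k]:.4f}")
```

The computed shares land close to the rounded 68%, 95%, and 99.7% figures, which is where the rule gets its name.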

But not all data is bell-shaped, right?

What if it's skewed or just weirdly shaped?

Good point.

For any data set, regardless of its shape, we have Chebyshev's theorem.

The trade-off is that it's much less precise than the empirical rule.

It gives minimum percentages.

It states that for any data set, at least 1 − 1/K² of the data values lie within K standard deviations of the mean, where K is any number greater than 1.

So for K = 2, within two standard deviations, it guarantees at least 1 − 1/4 = 3/4, or 75%, of the data is in that range.

For K = 3, it guarantees at least 1 − 1/9 = 8/9, or about 89%.

It's a lower bound, but it applies universally.

Okay.

That's useful as a baseline guarantee.
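Chebyshev's bound is easy to compute and to sanity-check on deliberately skewed data. In this sketch the function names and the sample (which includes one extreme 300-minute value) are hypothetical. The population standard deviation (`pstdev`) is used for the check, which is a choice made here for illustration.

```python
from statistics import mean, pstdev

def chebyshev_minimum(k):
    """Guaranteed minimum proportion within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    center = mean(values)
    spread = pstdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

# Hypothetical, strongly right-skewed sample (an assumption for illustration).
skewed = [20, 25, 25, 25, 30, 35, 45, 50, 50, 50, 75, 300]

for k in (2, 3):
    print(f"k={k}: guaranteed >= {chebyshev_minimum(k):.2f}, "
          f"observed {fraction_within(skewed, k):.2f}")
```

Even on this lopsided sample the observed proportions sit above the guaranteed minimums, as the theorem says they must.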

Now, earlier you mentioned comparing variation between different types of things like heights and weights.

They have different means, different units.

How do we compare their relative variation?

Ah, yes.

That's where the coefficient of variation, or CV, comes in.

It's designed exactly for this purpose, comparing variation between data sets with different scales or units.

You calculate it by taking the standard deviation, dividing it by the mean, and then multiplying by 100% to express it as a percentage.

CV = (s / mean) × 100%.

It essentially tells you how large the standard deviation is relative to the mean.

So if we looked at men's heights and weights, maybe heights have a standard deviation of, say, three inches and a mean of 70 inches.

Weights might have a standard deviation of 30 pounds and a mean of 180 pounds.

Which one varies more relatively?

Well, let's see.

For height, CV = (3 / 70) × 100%, or about 4.3%.

For weight, CV = (30 / 180) × 100%.

That's 16.7%.

So even though the standard deviation for weight, 30 pounds, looks much bigger than for height, three inches, the relative variation, the CV, is much higher for weight, 16.7%, than for height, 4.3%.

Weights are more variable relative to their average value than heights are.

The CV lets you make that apples to apples comparison.
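The height and weight comparison can be reproduced with a one-line function. The function name is hypothetical; the figures (3 and 70 inches, 30 and 180 pounds) are the illustrative ones quoted in the conversation.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation as a percentage of the mean."""
    return s / mean * 100

# Illustrative figures quoted in the conversation.
cv_height = coefficient_of_variation(3, 70)    # inches
cv_weight = coefficient_of_variation(30, 180)  # pounds

print(f"height CV = {cv_height:.1f}%")
print(f"weight CV = {cv_weight:.1f}%")
```

Because the units cancel in the division, the two percentages can be compared directly even though one started in inches and the other in pounds.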

That's really clever.

Okay, so we've got center, we've got spread.

What about understanding where one specific data point fits into the whole picture?

Like, how does my test score compare to everyone else's?

Excellent question.

This brings us to measures of relative standing.

How does one value stand relative to the others?

And the absolute king here is the Z score, sometimes called the standard score.

It's incredibly useful.

A Z score tells you exactly how many standard deviations a particular data value, let's call it x, is away from the mean.

The formula is z = (x − mean) / standard deviation.

If Z is positive, the value is above the mean.

If negative, it's below.

If it's zero, it's exactly the mean.

Crucially, Z scores have no units, which makes them perfect for comparisons.

We usually round them to two decimal places.

No units.

So I could calculate a Z score for my height and a Z score for my exam result and compare them.

You absolutely could.

That's the power.

It standardizes everything onto the same scale of standard deviations from the mean.

Remember the significant value idea from the range rule of thumb being more than two standard deviations away, that translates directly to Z scores.

A Z score greater than plus two or less than minus two is generally considered significant or unusual.

Okay, so let's take someone famous like Tom Brady's height.

Say we found the mean and standard deviation for adult male height, calculated his Z score, and it came out as plus 2.66.

What does that tell us?

It tells you immediately that Tom Brady is significantly tall compared to the general adult male population.

He's over two and a half standard deviations above the average height.

It quantifies just how tall he is relative to the group.

And you mentioned comparing different things like a coin's weight and body temperature.

Yep.

Let's say a specific quarter's weight has a Z score of plus 2.26 relative to other quarters, and someone's body temperature of 99 degrees Fahrenheit has a Z score of plus 1.29 relative to typical body temperatures.

Even though 99 degrees Fahrenheit might seem further from the average temperature, 98.6 degrees, in raw terms than the quarter's weight is from the average quarter weight, the Z score tells us the quarter's weight, at plus 2.26, is actually more extreme or unusual relative to its group than the body temperature, at plus 1.29, is relative to its group.

Z scores level the playing field.
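A minimal sketch of the Z score calculation and the plus-or-minus 2 cutoff. The function names are hypothetical. The inputs reuse figures quoted earlier in the conversation: the pulse of 102 with mean 74 and standard deviation 12.5, and the two Z scores of 2.26 and 1.29.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s

def is_significant(z):
    """Range rule of thumb: more than 2 standard deviations from the mean."""
    return abs(z) > 2

pulse_z = z_score(102, 74, 12.5)
print(f"pulse of 102: z = {pulse_z:.2f}, significant: {is_significant(pulse_z)}")

# Comparing values from different scales by their z scores alone.
quarter_z, temperature_z = 2.26, 1.29
print("quarter weight significant:", is_significant(quarter_z))
print("body temperature significant:", is_significant(temperature_z))
```

Only the Z scores are needed for the final comparison; the original units (grams, degrees) have already been divided out.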

That's incredibly useful for comparison.

Okay, what else helps us understand relative standing?

Percentiles?

Percentiles, exactly.

They're another way to describe location within a data set.

There are 99 percentiles, P1 through P99.

They divide the sorted data into 100 groups with about 1% of the data values falling in each group.

So the 72nd percentile, P72, means that about 72% of the data values are below that value.

And P50, that's the 50th percentile.

It's just another name for the median.

50% of values are below it.

So if a Space Mountain wait time of 45 minutes is at the 72nd percentile, that means 72% of the wait times in that data set were less than 45 minutes.

Precisely.

It tells you where that specific wait time ranks within the distribution of all wait times.
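Finding the percentile of a given value boils down to counting how many values fall below it. The sketch below assumes that simple definition (share of values strictly less than x, rounded to a whole number); the function name and the wait times are hypothetical, so the result differs from the 72nd percentile used in the conversation.

```python
def percentile_of(value, data):
    """Percent of data values strictly below `value`, rounded to a whole number."""
    below = sum(1 for x in data if x < value)
    return round(below / len(data) * 100)

# Hypothetical wait times in minutes (an assumption for illustration).
wait_times = [50, 25, 75, 35, 50, 25, 30, 50, 45, 25, 20]

print(percentile_of(45, wait_times))  # 6 of the 11 values are below 45
print(percentile_of(75, wait_times))  # 10 of the 11 values are below 75
```

Other percentile conventions exist (for example counting ties as half), so results from software may differ slightly from this count-below approach.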

And quartiles are related to this, right?

Very closely related.

Quartiles are just specific percentiles that divide the data into four equal parts, like quarters of a dollar.

Q1, the first quartile, is the same as P25.

It separates the bottom 25% of the data from the top 75%.

Q2, the second quartile, is P50, the median again.

It splits the data in half.

And Q3, the third quartile, is P75.

It separates the bottom 75% from the top 25%.

You find them using basically the same process as finding percentiles.

For that Space Mountain data, maybe Q1 was 25 minutes.

Okay, minimum, Q1, median, Q2, Q3, maximum.

Does putting those together tell us something?

It tells you a lot.

That specific set of values is called the five-number summary.

Minimum, Q1, median, Q3, maximum.

It gives you a concise summary of the center, through the median, and of the spread, through the range and the distance between quartiles.

For Space Mountain, the five-number summary might be a 10-minute minimum, a 25-minute Q1, a 35-minute median, a 50-minute Q3, and a 110-minute maximum.
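A five-number summary can be sketched with `statistics.quantiles` from Python's standard library. Quartile conventions vary between textbooks and software, so the default "exclusive" method used here is an assumption and may not match a given textbook's procedure exactly. The function name and the sample are hypothetical, so its minimum and maximum differ from the 10 and 110 minutes mentioned above.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4)  # default method is "exclusive"
    return min(data), q1, q2, q3, max(data)

# Hypothetical wait times in minutes (an assumption for illustration).
wait_times = [50, 25, 75, 35, 50, 25, 30, 50, 45, 25, 20]

summary = five_number_summary(wait_times)
print(summary)
```

These five values are exactly what a basic box plot draws: the two whisker ends, the two box edges, and the line inside the box.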

And this five-number summary leads directly into box plots, right?

Those diagrams.

Exactly.

Box plots, sometimes called box-and-whisker diagrams, are visual representations of the five-number summary.

They're fantastic.

You draw a scale, then create a box that stretches from Q1 to Q3.

You draw a line inside the box at the median, Q2.

Then you typically draw whiskers, lines extending out from the box to the minimum and maximum values.

Their real power comes when comparing multiple data sets.

You can draw several box plots side by side on the same scale.

Imagine box plots for Space Mountain, Tower of Terror, and Avatar Flight of Passage wait times, all lined up.

You'd instantly see which ride tends to have longer waits (a box sitting higher on the scale, a higher median) and which has more variation (a wider box or longer whiskers), and you could compare their minimums and maximums.

So they give a quick visual comparison of center and spread.

Can they show skewness, too, like if the data is bunched up on one side?

They can give you clues, definitely.

If one whisker is much longer than the other, or if the median line isn't centered within the box, it suggests the data might be skewed in that direction.

A longer whisker or larger part of the box on the right suggests right skewness, a tail towards higher values.

Longer on the left suggests left skewness.

But here's a key point.

While box plots show symmetry or skewness, they don't reveal the detailed shape of the distribution like a histogram can.

You can't see if there are peaks or valleys within the box, for example.

Got it.

And one last thing related to box plots.

Outliers, again.

Can box plots help identify them?

Yes, especially modified box plots.

These use a specific rule to formally identify potential outliers.

First, you calculate the interquartile range, or IQR, which is simply Q3 minus Q1; that's the width of the box.

Then a value is flagged as an outlier if it falls more than 1.5 times the IQR below Q1, or more than 1.5 times the IQR above Q3.

In a modified box plot, instead of extending the whiskers all the way to the minimum and maximum, the whiskers only go out to the lowest and highest data points that are not outliers.

The outliers themselves are then plotted individually, often as asterisks or dots.

So for Space Mountain, maybe Q1 is 25 and Q3 is 50, so the IQR is 50 minus 25, which equals 25.

Then 1.5 times the IQR is 1.5 times 25, which equals 37.5.

So anything below Q1 minus 37.5, which is 25 minus 37.5, or negative 12.5, or above Q3 plus 37.5, which is 50 plus 37.5, or 87.5, would be an outlier.

Those wait times of 105 and 110 minutes we mentioned earlier, they'd definitely be marked as outliers on a modified box plot.
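The outlier rule for modified box plots can be sketched as below, using the Q1 of 25 and Q3 of 50 quoted in the conversation. The function name and the short list of wait times (which includes the 105 and 110 minute values) are hypothetical, chosen only to exercise the rule.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) fences: 1.5 * IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = outlier_fences(25, 50)
print(f"fences: {lower} and {upper}")

# Hypothetical wait times in minutes (an assumption for illustration).
waits = [10, 25, 35, 50, 105, 110]

outliers = [x for x in waits if x < lower or x > upper]
regular = [x for x in waits if lower <= x <= upper]

# In a modified box plot the whiskers stop at the most extreme non-outliers.
whisker_low, whisker_high = min(regular), max(regular)

print("outliers:", outliers)
print("whiskers run from", whisker_low, "to", whisker_high)
```

A negative lower fence simply means no wait time can be flagged as unusually short here, since waits cannot go below zero.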

Exactly right.

And identifying them is so important.

Why?

Because outliers can seriously distort the mean and standard deviation.

Ignoring them can lead you to completely wrong conclusions about the data.

So when you see an outlier, you don't just delete it, you investigate.

Is it a typo?

A measurement error?

Or is it a genuinely unusual event that's actually really interesting?

That makes perfect sense.

Wow.

Okay.

So that really covers a lot of ground for describing and comparing data.

We've gone from finding the center with mean, median, mode, midrange, to understanding the spread with range, standard deviation, variance, and CV.

And finally, figuring out where individual points stand using z-scores, percentiles, quartiles, and visualizing it all with the five-number summary and box plots.

That's quite the toolkit.

It really is a powerful toolkit.

And remember, the goal isn't just calculating these numbers.

It's about the critical thinking, the interpretation behind the numbers.

The applications are truly everywhere, managing businesses better, understanding health data, making sense of scientific research.

It empowers you to engage with quantitative information more effectively.

These aren't just tools for statisticians.

They're essential tools for anyone navigating our data-filled world.

Absolutely.

So the next time you encounter a statistic in an article, a report, anywhere, you can now step back and ask those key questions.

Okay, what measure of center are they actually using here?

And why did they choose it?

How much variation or spread is there really?

Are there any outliers that might be influencing the picture?

Keep asking those questions.

Keep digging a little deeper.

That's how you move from just seeing numbers to truly understanding the story they tell.

Thank you so much for joining us on this deep dive.

Keep learning and keep questioning.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Quantifying and visualizing datasets requires a systematic approach to understanding their central tendency, variability, and structure. Measures of center serve as reference points for describing where data clusters, with the mean providing the arithmetic average but remaining vulnerable to extreme values, the median offering stability by representing the middle position regardless of outliers, the mode identifying the most frequently occurring observation, and the midrange serving as a simple measure derived from the extreme values. Choosing the appropriate center measure depends on recognizing whether outliers are present and whether the data contains meaningful repeated values. Beyond center, understanding how tightly or loosely observations cluster around that central point requires measures of variation such as range for a quick span estimate, variance for average squared deviations from the mean, and standard deviation for expressing this dispersion in the original measurement units. When comparing datasets with different scales, the coefficient of variation provides a scale independent alternative by expressing spread as a percentage relative to the mean. Locating individual observations within the broader distribution involves relative standing measures including z-scores that indicate distance from the mean in standardized units, percentiles that show what proportion of data falls below a given value, and quartiles that divide the distribution into four equal segments. Exploratory data analysis techniques like boxplots translate the five number summary into visual form, immediately revealing data shape, symmetry properties, and potential anomalies through graphical representation. Constructing and interpreting these displays requires integration of numerical summaries with visual confirmation, as charts can expose patterns that single numbers might obscure. 
Practical competence involves recognizing when different measures lead to conflicting conclusions, understanding how data structure affects which summaries are most informative, and developing the judgment to present evidence that truthfully represents the underlying patterns rather than oversimplifying complex characteristics.
