Chapter 3: Describing Relationships


Audio Overview


Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Ever wondered if there's a, you know, a predictable link between how many miles are on a used truck and its price?

Yeah, or maybe whether a baseball team's huge payroll actually guarantees more wins.

Exactly.

These aren't just random thoughts.

They're questions about relationships between different factors.

And well, that's exactly what we're tackling today.

We are.

We're kind of taking a shortcut to being well informed about one of the most fundamental ideas in stats: understanding relationships between two quantitative variables.

That's right.

Our focus today is unpacking Chapter 3 of The Practice of Statistics, which is really a cornerstone for any AP Statistics learner.

Yeah, absolutely.

And our mission really is to distill the most important stuff from this chapter, you know, from scatter plots to correlation and regression, giving you a clear, comprehensive guide.

Right.

We'll break down the vocabulary, the formulas, show you how these tools actually apply in real world scenarios, and we'll include those important AP exam tips and common pitfalls to watch out for.

Think of this as your essential toolkit for figuring out how two quantitative variables dance or sometimes how they don't.

Exactly.

So to kick things off,

before we even talk about predicting, say how many candies you can grab,

we need to define the roles our variables play.

Like what's influencing what?

Right.

Before we even think about plotting anything, we need to understand which variable is doing the explaining or predicting and which one is the response.

We call these the explanatory variable, usually X, and the response variable, usually Y.

The explanatory variable helps predict or account for changes in the response variable.

The response variable, well, that measures the outcome.

Got it.

Now, a quick but really important AP exam tip here.

You might hear these called independent and dependent variables sometimes, but in AP Stats, we stick to explanatory and response.

It just avoids confusion, because independent and dependent have other very specific meanings later on.

That makes sense.

Keep it clear.

So let's ground this.

Remember that activity maybe where you measured hand span and grabbed candy?

Yeah, the candy grab.

In that case, hand span is the explanatory variable, right?

What we think might explain how many candies you grab.

Precisely.

And the number of candies grabbed.

That's the response.

Okay.

Or think about diamond pricing.

The weight of the diamond, that's explanatory; it usually helps predict the price, which is the response.

Makes sense.

But what about something like, a student's SAT math score versus their SAT reading and writing score?

It's interesting.

Either could predict the other, couldn't they?

Exactly.

And that highlights a really crucial point.

Just because we label something as explanatory doesn't automatically mean it causes the response.

Right.

It's about association,

not necessarily cause and effect.

Exactly.

It suggests a potential predictive relationship.

You know, the amount of alcohol given to a mouse might actually cause a change in body temperature,

but your SAT math score doesn't cause your SAT reading score.

It's about prediction and association.

Okay.

So once we've figured out our explanatory and response variables, how do we actually see their relationship?

Well, that's where the scatter plot comes in.

It's really our best friend for this.

It gives you a visual snapshot of how these two quantitative variables interact.

The visual story.

Totally.

A scatter plot just displays the relationship between those quantitative variables.

One goes on the horizontal axis, the X axis.

The explanatory one, usually.

Usually, yeah.

And the other, the response, goes on the vertical or Y axis.

Each individual in your data set, like each baseball team or each student, becomes a single point on the graph.

So if we were looking at that MLB payroll versus wins data from 2016,

payroll in millions would be on the horizontal axis, number of wins on the vertical,

and each dot is a team.

Right.

And if you actually picture that scatter plot, you generally see the points trending upwards as you move from left to right.

Meaning higher payrolls tend to go with higher wins.

Not perfectly, but a general trend.

Exactly.

There's an upward movement, even if it's a bit scattered.

So once we have that plot,

how do we like read it properly?

What are we looking for?

Good question.

We look for four key characteristics, just like we did back in chapter one for describing single variables.

Okay, four things.

What are they?

We assess direction, form, strength, and any unusual features.

Direction, form, strength, unusual features.

So for direction,

is it a positive association?

Meaning do higher values of one variable tend to go with higher values of the other, like that payroll and wins example?

Or is it a negative association where higher values of one tend to go with lower values of the other?

Think maybe more runs allowed by a baseball team probably means fewer wins.

Right.

Or sometimes there's just no association at all.

Knowing one variable tells you basically nothing about the other, like maybe student height and hours of sleep.

Okay.

So direction, positive, negative, or none.

What's next?

Form.

Yes, form.

Does the pattern look basically linear, like it follows a rough straight line, or is it nonlinear, showing some kind of curve?

Linear or curved?

Simple enough.

Then strength.

Strength.

This is about how closely the points actually follow that form you identified.

Are they tightly packed around the line or curve?

That's a strong association.

Okay.

Or are they really spread out, widely scattered?

That would suggest a weak association or somewhere in between, like moderate.

Makes sense.

Strong, moderate, weak.

And the last one, unusual features.

Yeah, always look for unusual features.

These could be outliers, points that clearly fall outside the overall pattern, or sometimes you see distinct clusters, where groups of points seem separate from the main cloud of data.

Okay.

And this is where an AP exam tip comes in, right?

Yeah.

You have to mention all four.

Absolutely crucial.

When asked to describe a scatter plot on the AP exam, you must discuss all four characteristics.

Direction, form, strength, and any unusual features.

And this is just as important, you must do it in the context of the problem.

Meaning use the actual variable names.

Don't just say it's positive and strong.

Exactly.

Say there is a moderately strong positive linear association between MLB team payroll and number of wins with no apparent outliers.

Context is everything.

Got it.

Can you give an example, maybe Old Faithful?

Oh, Old Faithful is a great one.

If you plot the duration of an eruption versus the time until the next eruption, you see overall a strong positive linear relationship.

Longer eruptions mean longer waits, but if you look closely, you actually see two distinct clusters.

One cluster for shorter eruptions and shorter waits, another for longer eruptions and longer waits.

Ah, so it's not just one simple line.

Those clusters suggest maybe two different types of eruptions.

Precisely.

It's a deeper insight that just saying linear might miss.

Interesting.

What about a nonlinear one?

Think about average income versus fertility rate across many countries.

If you visualize that plot, you'd likely see a moderately strong negative relationship (higher income, lower fertility), but it's clearly nonlinear.

How so?

It's curved.

The fertility rate drops more steeply at lower income levels and then kind of flattens out as income gets higher.

Why is it curved?

That's a great question it prompts you to ask.

Okay, so describing the plot is key, but our eyes aren't perfect, especially for judging strength, right?

We need a number.

Exactly.

Our eyes can be fooled.

For linear relationships, we need a numerical summary of strength and direction, and that number is the correlation, usually written as R.

Correlation R.

What does it tell us?

R measures both the direction and the strength of a linear association between two quantitative variables.

Only linear.

That's critical.

Okay, only linear.

And what are its properties?

Well, it's always a number between negative one and one, inclusive.

So, −1 ≤ R ≤ 1.

Right.

The sign tells you the direction.

Positive R means a positive association, negative R means negative.

Makes sense.

An R of exactly negative one or one only happens if the points fall perfectly on a straight line.

No scatter at all.

Okay, perfection.

The closer R gets to one or negative one, the stronger the linear relationship.

Points are tightly clustered around a line.

The closer R gets to zero, the weaker the linear relationship.

And you emphasized linear.

What if the relationship is curved?

That's the big catch.

Correlation R is only appropriate for describing linear relationships.

You could have a really strong, obvious curved pattern, maybe a perfect U shape, but its R value could be close to zero because it's not linear.

That's why looking at the scatter plot first is non-negotiable.

Always plot the data.
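A quick numerical check of this caution, sketched in Python. The data are made up for illustration (a perfect U shape, y = x²), and the `correlation` helper is a hand-rolled Pearson correlation rather than anything from the chapter.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect U shape, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = correlation(xs, ys)
print(f"r for the perfect U shape: {r_curve:.3f}")
```

Every point sits exactly on the curve, yet r comes out as zero, because r only measures how well a straight line describes the pattern.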

So, back to the MLB example, the correlation R was about 0.613.

What does that number tell us?

Okay, 0.613.

It's positive, so it confirms the positive association: more spending, more wins, generally.

And 0.613 is, well, it's moderately far from zero, but also pretty far from one.

So, it indicates a moderately strong linear relationship.

Meaning payroll helps predict wins, but it's definitely not the whole story.

There's still quite a bit of variation.

Exactly.

High payroll helps, but it doesn't guarantee a championship ring.

Now, this next bit feels really important, especially for AP students.

Cautions about correlation.

Hugely important.

First, and maybe the biggest one in all of statistics,

correlation does not imply causation.

Correlation does not imply causation.

Just because two things are strongly related doesn't mean one causes the other.

Like the classic ice cream sales and drowning example.

Perfect example.

They're highly correlated.

But does eating ice cream cause drowning?

Of course not.

There's a lurking variable, the summer season with its hot weather, that causes both to increase.

So, you always have to ask, could something else be causing both things we're seeing?

Always.

Think about the strong correlation between per capita chocolate consumption and Nobel laureates by country.

That's real.

But does eating chocolate make you win a Nobel?

Probably not, sadly.

Highly unlikely.

It's probably lurking variables like national wealth, education levels, research funding, things that enable both high chocolate consumption and Nobel level research.

Okay.

Correlation isn't causation.

Got it.

What's the next caution?

Correlation, like the mean and standard deviation we learned about earlier, is not resistant to outliers.

Meaning one weird point can mess it up.

Big time.

A single outlier can dramatically change the value of R.

It can make a weak relationship look strong, a strong one look weak, or even flip the sign from positive to negative.

Which is yet another reason to always plot your data first.

See those outliers.

Understand their potential impact.

Don't just trust the R value blindly.
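To see how fragile R can be, here is a small Python sketch. The five base points and the added outlier are invented for illustration; they are not data from the chapter.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a fairly strong positive linear pattern.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r_without = correlation(x, y)

# Add one made-up outlier: far to the right, with a very low y value.
r_with = correlation(x + [10], y + [0])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

In this toy example a single extra point is enough to turn a positive correlation into a negative one, which is the sign flip described above.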

Okay, makes sense.

Yeah.

How is R actually calculated?

I know we use tech, but conceptually?

Conceptually, R is basically an average of the products of the standardized scores, the z-scores, of the two variables for each individual.

Z-scores, okay.

So if a point is above average on both x and y, both positive z-scores, it contributes positively to R.

If it's below average on both, both negative z-scores, negative times negative is positive, so it also contributes positively.

Points in the upper right and lower left quadrants of a scatter plot centered at the means tend to make R positive.

And points in the upper left or lower right make it negative.

Exactly.
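The z-score description can be sketched in a few lines of Python. This assumes the usual textbook convention of dividing the sum of products by n − 1 (hence "basically an average"), and the data are hypothetical.

```python
import statistics

# Hypothetical data, chosen only to illustrate the calculation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 3, 5]

def z_scores(values):
    """Standardize each value: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return [(v - mean) / sd for v in values]

zx = z_scores(x)
zy = z_scores(y)

# One product per individual; its sign depends on the point's quadrant.
products = [a * b for a, b in zip(zx, zy)]
r = sum(products) / (len(products) - 1)

for xi, yi, p in zip(x, y, products):
    print(f"({xi}, {yi}): z_x * z_y = {p:+.3f}")
print(f"r = {r:.3f}")
```

The point (2, 4) is below average in x and above average in y, so its product is negative and pulls R down; the points that are high on both or low on both push R up.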

Quick facts about R.

Sure.

Both variables must be quantitative.

Can't correlate categorical data like favorite color with quantitative data like income.

Right.

It makes no difference which variable you call x and which you call y.

The correlation is the same either way.

Changing the units of measurement, like feet to meters or pounds to kilograms, doesn't change R.

Useful.

And finally, R itself has no units.

It's just a pure number between negative one and one.

Which leads to that AP tip.

Don't ever write units on your R value.

Don't say R equals 0.75 inches.

Please don't.

It's just 0.75.
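Two of these facts are easy to check numerically. The sketch below uses invented data, with x imagined as lengths in feet, and a hand-written `correlation` helper.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: x in feet, y in any quantitative unit.
x_feet = [1, 2, 3, 4, 5]
y = [2, 4, 1, 3, 5]

r_xy = correlation(x_feet, y)
r_yx = correlation(y, x_feet)                 # swap the roles of x and y
x_meters = [v * 0.3048 for v in x_feet]       # change units: feet to meters
r_metric = correlation(x_meters, y)

print(f"r(x, y)       = {r_xy:.3f}")
print(f"r(y, x)       = {r_yx:.3f}")
print(f"r(x in m, y)  = {r_metric:.3f}")
```

All three values agree, and none of them carries a unit.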

Okay.

So if we've looked at the scatter plot, decided a linear relationship seems reasonable, and maybe checked the correlation R, what's next?

Yeah.

Can we make predictions?

Yes.

If a linear model seems appropriate, we can summarize that relationship with a specific line and use it to make predictions.

This is the least squares regression line, often called the LSRL.

LSRL.

Okay.

How's it written?

The equation looks like this.

ŷ = b₀ + b₁x.

Okay.

That's y-hat.

Exactly.

Y-hat, and the hat is super important.

It signifies that this is the predicted value of y for a given value of x, not the actual observed y value.

Predicted value.

Got it.

So b₁ is the slope and b₀ is the y-intercept.

Let's use that used Ford F-150 example again.

Miles driven, x, versus price, y.

We said it was a moderately strong negative linear relationship.

Right.

R was about negative 0.815.

The LSRL equation might look something like: predicted price = 38,257 − 0.1629 × (miles driven).

Okay.

So if I have a truck with, say, 100,000 miles, how do I predict its price using this line?

You just plug 100,000 in for miles driven: predicted price = 38,257 − 0.1629 × 100,000.

Okay.

Do the math.

And you get $21,967.

So our model predicts that a Ford F-150 with 100,000 miles will cost, on average, about $21,967.

On average.

That's key too.

It's a prediction, not exact.

Definitely.

It's the predicted average for trucks with that mileage.
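The plug-in step can be written as a tiny Python function. The intercept and slope are the values quoted in the discussion; the function and constant names are just illustrative choices.

```python
# Coefficients quoted for the used Ford F-150 example.
INTERCEPT = 38257.0   # predicted price in dollars at 0 miles
SLOPE = -0.1629       # predicted change in price (dollars) per additional mile

def predicted_price(miles_driven):
    """Predicted price (y-hat) from the least squares regression line."""
    return INTERCEPT + SLOPE * miles_driven

price_100k = predicted_price(100_000)
print(f"Predicted price at 100,000 miles: ${price_100k:,.0f}")
```

The result is a predicted average price for trucks with that mileage, not a guarantee for any single truck.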

Now what happens if we try to predict way outside our data?

What about a truck with 300,000 miles?

Ah, good question.

If you plug 300,000 into that same equation...

38,257 − 0.1629 × 300,000.

You actually get a negative number, like negative $10,613.

Which makes absolutely no sense for a truck price.

Exactly.

And this highlights the huge danger of extrapolation.

Extrapolation.

That's predicting outside the range of your original X values.

Precisely.

Using the regression line to make predictions for X values far beyond the ones you actually observed.

It's dangerous because you have no guarantee the linear relationship holds true out there.

Maybe the price drop slows down after 200,000 miles, or maybe the truck just falls apart.

The line doesn't know.

Don't extrapolate.
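One way to guard against this in code is to flag any prediction whose x value falls outside the observed range. In the sketch below the coefficients come from the discussion, but the observed mileage range (5,000 to 150,000 miles) is an assumption made purely for illustration, since the range of the data is not stated.

```python
INTERCEPT = 38257.0
SLOPE = -0.1629

# ASSUMED range of mileages in the data set; not given in the discussion.
OBSERVED_MILES_MIN = 5_000
OBSERVED_MILES_MAX = 150_000

def predict_with_check(miles_driven):
    """Return (predicted price, True if this prediction is an extrapolation)."""
    price = INTERCEPT + SLOPE * miles_driven
    outside = not (OBSERVED_MILES_MIN <= miles_driven <= OBSERVED_MILES_MAX)
    return price, outside

for miles in (100_000, 300_000):
    price, outside = predict_with_check(miles)
    note = "EXTRAPOLATION, do not trust" if outside else "within observed range"
    print(f"{miles:>7,} miles -> ${price:>10,.0f}  ({note})")
```

The line happily returns a negative price at 300,000 miles; the flag is what tells you the number should not be used.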

Okay.

Now you said the line gives predicted values,

but reality is messy.

No line hits every point perfectly.

That's right.

The difference between the actual observed y value and the predicted y value for any given point is really important.

We call that difference a residual.

Residual.

Like leftovers.

Exactly.

Residual = actual y − predicted y, or y − ŷ.

It's the vertical distance from the data point down or up to the regression line on the scatter plot.

So if a truck with, say, 70,583 miles actually sold for $21,994, but our line predicted maybe $26,759 for that mileage...

Then the residual for that specific truck would be $21,994 − $26,759, which is negative $4,765, meaning that truck is about $4,765 cheaper than the model predicted based on its mileage alone.

Right.

Maybe it had some issues.

Maybe it was just a good deal.

The residual captures that prediction error for each point.
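The same residual can be reproduced in Python from the numbers quoted above. Only the variable and function names are invented here.

```python
INTERCEPT = 38257.0
SLOPE = -0.1629

def predicted_price(miles_driven):
    """Predicted price (y-hat) from the regression line."""
    return INTERCEPT + SLOPE * miles_driven

actual_price = 21994.0   # observed selling price of the truck
miles = 70583            # its mileage

predicted = predicted_price(miles)
residual = actual_price - predicted   # residual = actual y - predicted y

print(f"Predicted price: ${predicted:,.0f}")
print(f"Residual:        ${residual:,.0f}")
```

A negative residual means the point lies below the line: the truck sold for less than the model predicted.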

Okay.

That makes sense.

Now let's go back to the equation itself.

ŷ = b₀ + b₁x.

What do b₀ and b₁, the intercept and slope, actually mean in context?

Great question.

The slope tells you the predicted change in the response variable Y for every one-unit increase in the explanatory variable X.

Predicted change for a one-unit increase in X.

For the F-150s, the slope was negative 0.1629.

That means for each additional mile driven, a one-unit increase in X, the predicted price Y goes down by $0.1629, or about 16 cents.

Okay.

That AP tip again.

Always use the word predicted when interpreting these.

Always, always, always.

It's the predicted change based on the model, not the actual guaranteed change.

Got it.

What about the Y intercept, b₀?

The Y intercept b₀ is the predicted value of the response variable Y when the explanatory variable X is equal to zero.

Predicted Y when X is zero.

Right.

For the F-150s, the intercept was $38,257.

So that's the predicted price of a truck with zero miles, basically a brand new one.

In this context, that intercept makes some sense.

It's a plausible price for a new truck, but sometimes the Y intercept doesn't make practical sense.

Remember that candy grab example?

Uh-huh.

Hand span versus candies.

If the Y intercept for that line was, say, negative 29.8, well, what does an X of zero mean?

A hand span of zero centimeters.

That's impossible.

Right.

So the predicted negative 29.8 candies for a zero-centimeter hand span is mathematically part of the line, but it's meaningless in reality.

It's often an extrapolation.

So the intercept isn't always practically interpretable, even if it's part of the best fit line equation.

Exactly.

You have to consider if X of zero is even meaningful or within the range of your data.

Okay.

We can draw lots of lines through data points.

Why is this specific line called the least squares regression line?

What makes it the best?

Ah, it's called the least squares line because it's the unique line that minimizes the sum of the squared residuals.

Sum of the squared residuals.

Minimize that.

Yeah.

Think about all those vertical distances, residuals, from the points to the line.

Some are positive, some negative.

If you just added them up, they might cancel out.

So we square each residual first, making them all positive, and then add them up.

The LSRL is the line that makes this total sum of squared errors as small as possible.

So it fits the data best in that specific mathematical sense, minimizing the squared prediction errors overall.

That's the idea.
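The "least squares" property can be checked directly. The sketch below fits a line to a small invented data set using the standard closed-form formulas, then nudges the intercept and slope and confirms that every nudged line has a larger sum of squared residuals.

```python
def fit_lsrl(xs, ys):
    """Return (intercept b0, slope b1) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b0, b1 = fit_lsrl(x, y)
best = sum_squared_residuals(x, y, b0, b1)

# Try nearby lines: shift the intercept and/or slope a little.
nudges = [(d0, d1) for d0 in (-0.5, 0, 0.5) for d1 in (-0.2, 0, 0.2)
          if (d0, d1) != (0, 0)]
worse = [sum_squared_residuals(x, y, b0 + d0, b1 + d1) for d0, d1 in nudges]

print(f"LSRL: y-hat = {b0:.2f} + {b1:.2f}x, sum of squared residuals = {best:.2f}")
print(f"Smallest sum among the nudged lines = {min(worse):.2f}")
```

Checking a handful of nearby lines is not a proof, but it illustrates what "minimizes the sum of the squared residuals" means.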

And thankfully we don't have to do that minimization by hand.

Our calculators or statistical software find the slope b₁ and intercept b₀ for the LSRL automatically.

Phew.

Okay.

So tech gives us the best line,

but how do we double check if using a line was even a good idea in the first place?

We looked at the scatter plot's form, but is there more?

Yes, there is.

After fitting the line, we should always examine a residual plot.

A plot of the residuals themselves.

Exactly.

It's a scatter plot where you plot the residuals on the vertical axis against the explanatory variable on the horizontal axis.

Sometimes people plot residuals against the predicted values too.

Okay.

What are we looking for in this residual plot?

We are looking for nothing.

Ideally, a residual plot should show just a random scatter of points around the horizontal line at residual equals zero.

No leftover pattern.

Random scatter means good.

Random scatter means a linear model is likely appropriate.

It means our line captured the main linear trend and what's left over the residuals is just random noise.

What if it's not random scatter?

What if there is a pattern?

Ah, if you see a leftover curved pattern in the residual plot, maybe a U shape or an upside-down U or any clear nonlinear shape, that's a red flag.

Red flag means?

It means the linear model is not appropriate.

The curve in the residuals tells you that a linear model is systematically overpredicting for some X values and underpredicting for others.

There's a nonlinear relationship in the original data that your straight line missed.

So if we did the diamond price versus weight regression and the residual plot looked like a smile, a U shape,

that tells us...?

It tells you that a simple linear model isn't the best way to describe the relationship between diamond weight and price.

There's likely a curve involved and a linear model isn't capturing it well.

You'd want to consider a different type of model.
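Here is a small Python sketch of that "smile". It fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals; it is not the diamond data from the chapter.

```python
def fit_lsrl(xs, ys):
    """Return (intercept b0, slope b1) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical curved data: y = x squared.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

b0, b1 = fit_lsrl(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.1f}")
```

The residuals are positive at both ends and negative in the middle, so the line underpredicts at the extremes and overpredicts in between. That leftover U shape is the signal that a straight line is the wrong model here.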

Okay, residual plots check if the linear model is appropriate.

Now, assuming it is appropriate, with random scatter in the residuals, how do we measure how well the line actually fits?

How good are the predictions typically?

Two key numbers help us quantify that goodness of fit.

The first is the standard deviation of the residuals, usually just called S.

Standard deviation, but of the residuals.

Exactly.

S measures the typical size of a prediction error, a residual.

It tells you roughly how far the actual Y values tend to fall from the predicted Y values on the regression line.

So for the F-150s, if we found S equals $5,740,

what does that mean?

It means that the actual price of a used Ford F-150 in our data set is typically about $5,740 away from the price predicted by the least squares regression line using its mileage.

It gives you a sense of the average prediction error in the units of the response variable, dollars in this case.

Okay, so S tells us the typical prediction error size.

What's the other number?

The other really important number is the coefficient of determination, which is simply R-squared.

R-squared, like, literally the correlation R multiplied by itself?

Literally just R-squared, but its interpretation is quite different and very powerful.

Okay, what does R-squared tell us?

R-squared tells us the fraction or percent of the variability in the response variable Y that is accounted for by the least squares regression line using the explanatory variable X.

Whoa, okay.

Percent of variability accounted for.

So, say, for the F-150s?

If R was negative 0.815, then R-squared would be (−0.815)², which is about 0.664.

Right, so R-squared is about 0.664, or 66.4%.

How do we interpret that 66.4%?

About 66.4% of the variability in the prices of these used Ford F-150s is accounted for by the linear model relating price to miles driven.

Meaning miles driven explains about two-thirds of why some trucks cost more than others in this data set.

Pretty much, yes.

The remaining 33.6% of the variation in price must be due to other factors not included in our simple model, things like the truck's age, condition, trim level, color, location, etc.

Got it, so S gives typical error size.

R-squared gives percent variation explained.

Both useful.

Very useful.
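Both numbers can be computed from the residuals. The sketch below uses a small invented data set and assumes the usual textbook formulas: S is the square root of the sum of squared residuals divided by n − 2, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares about the mean of y. The last line simply squares the correlation quoted for the F-150s.

```python
import math

def fit_lsrl(xs, ys):
    """Return (intercept b0, slope b1) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b0, b1 = fit_lsrl(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ssr = sum(e ** 2 for e in residuals)            # variation left over after the line
mean_y = sum(y) / len(y)
sst = sum((yi - mean_y) ** 2 for yi in y)       # total variation in y

s = math.sqrt(ssr / (len(x) - 2))               # typical size of a residual
r_squared = 1 - ssr / sst                       # fraction of variation accounted for

print(f"s   = {s:.3f}  (same units as y)")
print(f"r^2 = {r_squared:.3f}")

# For the F-150 example, squaring the quoted correlation:
f150_r_squared = (-0.815) ** 2
print(f"F-150 r^2 = {f150_r_squared:.3f}")
```

S is read in the units of the response variable, while R-squared is a unitless fraction.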

And on the AP exam, you'll often see computer output that gives you the slope, intercept, S, and R-squared.

You need to be able to find and interpret all of them.

Okay.

Anything else on regression before we wrap up with some wisdom?

Just a quick note on calculating the LSRL.

If you don't have the raw data but you do have summary statistics like the mean and standard deviation of X, the mean and standard deviation of Y, and the correlation R, you can actually calculate the slope and intercept using formulas.

Oh!

Yeah, the slope is b₁ = R × (s_y / s_x), where s_y and s_x are the standard deviations, and the intercept is b₀ = ȳ − b₁x̄, where ȳ and x̄ are the means.

It also shows that the LSRL always passes through the point (x̄, ȳ).

Yeah, the point of averages.

Good to know.
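A short Python sketch of those two formulas. The summary statistics are invented for illustration, and `lsrl_from_summary` is a hypothetical helper name.

```python
def lsrl_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the LSRL from summary statistics alone."""
    b1 = r * sd_y / sd_x          # slope: r times (s_y / s_x)
    b0 = mean_y - b1 * mean_x     # intercept: y-bar minus slope times x-bar
    return b0, b1

# Hypothetical summary statistics.
mean_x, sd_x = 10.0, 2.0
mean_y, sd_y = 50.0, 8.0
r = 0.75

b0, b1 = lsrl_from_summary(mean_x, sd_x, mean_y, sd_y, r)
y_hat_at_mean = b0 + b1 * mean_x

print(f"y-hat = {b0:.1f} + {b1:.1f}x")
print(f"Prediction at x-bar: {y_hat_at_mean:.1f} (y-bar is {mean_y:.1f})")
```

Plugging in the mean of x returns the mean of y, which is the "point of averages" property.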

And you mentioned regression earlier, regression to the mean.

Ah, yes, regression to the mean.

It's a fascinating concept.

It basically means that if you select a value of X that is a certain number of standard deviations away from its mean,

the predicted value of Y will be fewer standard deviations away from its mean.

Specifically, it's only R times as many standard deviations away.

So if R is less than 1, the prediction is closer to the average Y value.

Exactly.
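The "R times as many standard deviations" statement follows from the slope formula, and it can be checked with a quick sketch. The heights, standard deviations, and correlation below are hypothetical round numbers, not Galton's data.

```python
# Hypothetical summary statistics for parent and child heights, in inches.
mean_parent, sd_parent = 68.0, 3.0
mean_child, sd_child = 68.0, 3.0
r = 0.5

# LSRL from summary statistics.
b1 = r * sd_child / sd_parent
b0 = mean_child - b1 * mean_parent

parent_height = 74.0                                   # a very tall parent
z_parent = (parent_height - mean_parent) / sd_parent   # SDs above the mean

predicted_child = b0 + b1 * parent_height
z_predicted = (predicted_child - mean_child) / sd_child

print(f"Parent is {z_parent:.1f} SDs above average")
print(f"Predicted child height: {predicted_child:.1f} in, "
      f"{z_predicted:.1f} SDs above average")
```

With these made-up numbers, a parent 2 standard deviations above average has a predicted child only 1 standard deviation above average: still tall, but closer to the mean.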

That's where the term regression came from.

Sir Francis Galton studied heights of parents and children.

He found that very tall parents tended to have children who were also taller than average, but on average not as tall as the parents themselves.

Their children's heights tended to regress or move back toward the mean height.

Interesting.

It applies beyond height too, right?

Like that sophomore slump in sports.

Perfect example.

An athlete has an amazing rookie year, an extreme performance.

Their second year is often still good, but maybe not as amazing closer to the average performance level.

It's regression to the mean.

Extreme performances are often followed by more typical ones.

Okay, super important concept.

Now, let's really hammer home some final crucial correlation and regression wisdom.

What are the absolute must -knows?

Okay, listen closely everyone.

This is critical.

First,

correlation and regression only describe linear relationships.

We keep saying linear.

Because it's that important.

Remember Anscombe's Quartet.

It's a famous set of four data sets.

If you just calculate the summary statistics (mean of X, mean of Y, correlation R, the LSRL equation), they are identical for all four data sets.

Identical numbers.

Identical.

But when you plot them,

they look wildly different.

One is a nice linear scatter.

One is a perfect curve.

One is linear except for one huge outlier.

And one has X values all the same except for one extreme point that basically dictates the entire line.
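The quartet is small enough to check by hand. The Python sketch below uses the commonly reproduced values of Anscombe's four data sets and a hand-written `summarize` helper. It prints the summary statistics only, so drawing the four scatter plots is left to whatever plotting tool you use.

```python
import math

def summarize(xs, ys):
    """Return (mean x, mean y, correlation r, intercept b0, slope b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return mean_x, mean_y, r, b0, b1

# Anscombe's quartet (values as commonly published).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (mx, my, r, b0, b1) in summaries.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, "
          f"r = {r:.3f}, y-hat = {b0:.2f} + {b1:.3f}x")
```

To two decimal places the four rows match, which is exactly why the numbers alone cannot tell these data sets apart.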

Wow, so the numbers alone are completely misleading without the picture.

Utterly misleading.

It's the ultimate proof.

You must, must, must always plot your data first.

Don't just trust the numbers R or the regression equation.

Look at the picture.

Plot the data.

Got it.

Second piece of wisdom.

Second,

correlation and LSRLs are not resistant to outliers.

We said R isn't resistant and the line isn't either.

Especially to certain types of outliers.

Certain types.

Yeah.

Points that are extreme in the X direction, far left or far right, can be particularly influential.

An influential point is one that, if removed, would markedly change the result of the calculation, like the slope or Y intercept of the LSRL.

So one point way out on the X axis could potentially pull the whole line towards it.

Exactly.

It can act like a lever and drastically tilt the line.

Other outliers might just be far from the line vertically, large residual, but not extreme horizontally.

They might affect S or R-squared more than the line itself.

So AP tip.

If you see an outlier, think about its potential influence.

Maybe even recalculate things without it to see how much changes.

That's an excellent strategy.
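The "recalculate without it" strategy is easy to sketch in Python. The data are invented: five base points, then one added point that is extreme only vertically, and separately one that is extreme horizontally.

```python
def fit_lsrl(xs, ys):
    """Return (intercept b0, slope b1) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical base data.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 1, 4, 3, 5]
int_base, slope_base = fit_lsrl(base_x, base_y)

# Outlier in y only: sits at the mean of x, far above the line.
int_vertical, slope_vertical = fit_lsrl(base_x + [3], base_y + [10])

# Outlier in x: far to the right of the other points.
int_leverage, slope_leverage = fit_lsrl(base_x + [15], base_y + [2])

print(f"Base data:           y-hat = {int_base:.2f} + {slope_base:.3f}x")
print(f"With vertical point: y-hat = {int_vertical:.2f} + {slope_vertical:.3f}x")
print(f"With far-right point: y-hat = {int_leverage:.2f} + {slope_leverage:.3f}x")
```

In this toy example the vertical outlier shifts the line up but leaves the slope unchanged, while the far-right point drags the slope from clearly positive to roughly flat. Comparing the fit with and without a suspicious point is how you judge its influence.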

Okay.

And the final absolutely critical piece of wisdom.

The one we can't repeat enough.

Association does not imply causation.

Even a perfect correlation, R equals one or negative one, with a beautiful linear fit, does not prove that changes in X cause changes in Y.

Never ever.

Never ever. Based on observational data alone, there could always be lurking variables.

Remember the cars and longevity example.

Right.

Owning more cars is associated with living longer.

But buying a fleet of cars won't magically extend your lifespan.

Yeah.

Cars are likely a proxy for wealth.

And wealth is associated with better health care, nutrition, safer environments, education.

Those are the likely causal factors for longer life, not the cars themselves.

The cars are associated, but not causal.

Always look for lurking variables.

Always be skeptical about cause and effect claims based only on association.

Wow.

Okay.

That was definitely a deep dive into describing relationships.

We covered scatter plots, explanatory and response variables, direction, form, strength, unusual features, correlation R, the LSRL, prediction, extrapolation, residuals, residual plots, R-squared, S, interpretation,

influence, and the huge caution about causation.

We covered a lot of ground.

It's the foundation for understanding how variables interact.

So the final thought for our listeners.

As you go through your day, start noticing quantitative relationships around you.

When you see two things trending together, maybe online searches for flu symptoms and sales of tissues, ask yourself, is one causing the other?

Or is there a lurking variable like flu season itself, driving both?

Be that critical thinker about associations.

Don't just accept them at face value.

Great advice.

Always question the connection.

Thank you so much for joining us on the deep dive.

We really hope this exploration of Chapter 3 helps you confidently analyze and interpret the relationships in data you encounter.

Yeah.

Keep exploring.

Keep asking questions.

Until next time, keep learning.

This has been a deep dive.

Brought to you by The Last Minute Lecture Team.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Analyzing relationships between two quantitative variables requires both visual exploration and numerical quantification, beginning with scatterplots that reveal three fundamental characteristics of bivariate data: direction shows whether variables increase together or move in opposite directions, form indicates whether the relationship is linear or curved, and strength describes how closely observations cluster around the underlying trend. Recognizing outliers as anomalous data points and identifying influential points that disproportionately affect statistical estimates are essential skills for accurate interpretation. Correlation provides a standardized measure ranging from negative one to positive one that captures the magnitude and direction of linear association, though a strong correlation never implies causal connection between variables. Least-squares regression extends this analysis by fitting a line through bivariate data to model the relationship and generate predictions, with slope representing the expected change in the response variable for each unit increase in the explanatory variable and intercept indicating the predicted response when the explanatory variable is zero. Residuals, calculated as the differences between observed values and regression predictions, form the basis for model evaluation through residual plots that reveal whether linearity assumptions hold and whether patterns in the data suggest model inadequacy. The coefficient of determination quantifies the proportion of variation in the response variable explained by the explanatory variable through the fitted regression model. 
Successful application requires understanding fundamental limitations: extrapolating predictions beyond the observed data range produces unreliable estimates, outliers and high leverage points can distort regression lines substantially, and practitioners must verify that linear regression is genuinely appropriate for their data before applying the model to real-world questions or making consequential decisions based on its predictions.
