Chapter 10: Correlation and Regression
Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
Okay, let's unpack this.
Have you ever noticed how some things just seem to, well, go together?
Like when a Powerball jackpot gets truly astronomical, the lines to buy tickets appear to snake around the block for miles.
Is that just something we think happens, or is there something statistically really going on there?
Today, we're taking a deep dive into this fascinating world of how variables relate to each other.
We're using a great chapter from Mario Triola's Elementary Statistics, specifically the one focused on correlation and regression.
Our mission, really, is to give you the tools to understand, analyze, and maybe even predict these relationships in data,
and, crucially, help you avoid some surprisingly common pitfalls along the way.
Yeah, what's really cool here is we move beyond just sort of noticing associations.
We're going to actually quantify them, build models.
This chapter helps us tackle core questions.
Is there a real connection, a correlation?
If there is, can we describe it with an equation?
Can we predict?
And maybe the biggest question, when can we truly say one thing causes another, and when are we just seeing a pattern?
Statistical echo, maybe.
Okay, so where do we start unraveling this?
How do we even know if two things are connected in a way that matters statistically?
What does it really mean for two variables to be correlated?
Right, that's step one.
Correlation.
Basically, it means the values of one variable tend to be associated with the values of another.
And when we talk about linear correlation, we're looking for a specific pattern.
If you plotted the data points, they'd look roughly like a straight line.
Now the huge warning, the one we always start with in stats, is this.
Correlation does not imply causality.
That's the classic one, isn't it?
Yeah.
So tempting to think it does.
Like with Powerball, it feels like the giant jackpot makes people buy tickets.
Exactly.
It feels intuitive.
But the statistical correlation alone doesn't prove causation.
It just shows a strong pattern, a link.
Think about that famous example, storks and babies.
Seriously, historical data in some areas showed a linear correlation between the stork population and human births.
Okay, but storks don't actually deliver babies.
Right.
Of course not.
There's what we call a lurking variable.
Maybe growing towns meant more houses with chimneys for storks and more people having babies.
So both things increase together, but one doesn't cause the other.
That's a spurious correlation.
It shows why you can't just look at numbers.
You need common sense, too.
So before we jump into calculations, then, how do we get a first look?
How do we visually explore these possible connections?
You start with a scatter plot.
It's simple, just a graph.
Each dot represents a pair of data, like one jackpot amount, and the tickets sold for it.
And how those dots arrange themselves tells you a lot, visually.
What kind of patterns are we looking for?
Well, if they generally trend upwards, left to right, that suggests a positive linear correlation, like more jackpot, more tickets.
If they slope downwards, that's negative linear correlation.
As one goes up, the other goes down.
Sometimes it's just a random cloud of points, no linear pattern.
Or you might see a clear curve that's a nonlinear pattern.
The relationship isn't a straight line.
And really importantly, a scatter plot makes outliers jump out at you.
Those odd points far from the main group.
OK, that visual check seems crucial.
But how do we put a number on that?
How do we objectively measure the strength of that straight line connection we might be seeing?
That's the job of the linear correlation coefficient, which we call r.
It's just one number, but it measures both the strength and the direction of the linear link.
It always falls between minus one and one.
Close to one means strong positive linear correlation.
Close to minus one means strong negative.
And close to zero suggests, well, no linear correlation.
So r gives us a standard way to compare correlations across different datasets.
Precisely.
An r of 0.8 means a strong positive linear link, whether it's house prices and square footage or, I don't know, rainfall and crop yield.
But a key thing about r: it's sensitive to outliers.
One weird data point can really pull it up or down.
That's another reason the scatter plot first is so important.
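As a rough illustration of how r is computed and how one odd point can move it, here is a minimal Python sketch. The data values are made up for this example and do not come from the textbook; the function name `pearson_r` is also just an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data (not from the textbook)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)                            # about 0.77

# Same data with one made-up outlier appended
r_with_outlier = pearson_r(x + [6], y + [0])   # drops below zero
```

With these particular numbers, a single extra point is enough to turn a fairly strong positive r into a slightly negative one, which is why the scatter plot comes first.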
All right.
So we've got our r value.
We've looked at the scatter plot.
How do we know if that r value is actually statistically significant?
Like is it strong enough to say there's really a linear relationship or could it just be chance?
Yeah, we need to test it.
Usually we look at the P value.
If the P value is small, typically less than 0.05, we have enough evidence to conclude there is a linear correlation.
It wasn't just random luck.
For that Powerball data, r was really high, 0.947, and the P value was tiny, like 0.0001.
So it's much smaller than 0.05.
Way smaller.
It basically screams significant linear correlation.
The link between jackpot size and ticket sales is statistically solid, not just chance.
But even with a strong r and a tiny P value,
we still have to be careful about assuming causation, right?
Like that other example you mentioned.
Margarine and divorce rates.
Oh yeah, that's a classic textbook example.
There's data showing an incredibly strong correlation, like r equals 0.993, between margarine consumption in Maine and the divorce rate there.
The P value is basically zero.
But come on, eating less margarine isn't saving marriages.
Exactly.
It's statistically significant but makes absolutely no real-world sense.
It's another spurious correlation, probably driven by some other broad societal trends happening over time.
It hammers home that stats needs critical thinking.
You can't just blindly trust the numbers.
Right, that common-sense filter is vital.
So okay, let's say we do have a genuine, statistically significant linear correlation, like with Powerball.
What does that number actually explain about the relationship?
That leads us to r squared, the coefficient of determination.
This is super useful.
It tells you the proportion of the variation in your Y variable, the one you're trying to predict or explain, that is explained by its linear relationship with the X variable.
Okay, so for Powerball,
r was 0.947, squaring that gives?
Gives you about 0.897.
So what that means is about 89.7%, let's call it roughly 90%, of the up and down variation we see in ticket sales can be explained by the changes in the jackpot amount.
So jackpot size accounts for almost 90% of why ticket sales fluctuate.
According to this linear model, yes. The other 10% or so, that's unexplained variation.
It's due to other factors we didn't measure, or maybe just random chance, random noise in the system.
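The arithmetic behind that split is short enough to check directly. A minimal sketch, using only the r value quoted for the Powerball data:

```python
r = 0.947                                # linear correlation quoted for the Powerball data
r_squared = r ** 2                       # coefficient of determination

explained = round(r_squared, 3)          # share of variation in ticket sales explained
unexplained = round(1 - r_squared, 3)    # share left to other factors or chance

print(explained, unexplained)
```

Rounded to three decimals, the explained share is 0.897 and the unexplained share is 0.103.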
Okay, establishing that link is powerful.
But I think for many people, the real payoff is prediction.
If we know jackpot size and ticket sales are linked, can we use the jackpot size to actually forecast how many tickets might be sold?
Absolutely.
And that's where regression comes in.
Correlation tells you if there's a linear link.
Regression helps you describe what that link is with an equation.
Regression analysis finds the equation of the straight line that best fits your data points on the scatter plot.
We call this the regression line, or sometimes the line of best fit, or the least squares line.
The equation itself is the regression equation, usually written as ŷ = b0 + b1x.
Okay, let's break that down.
ŷ, that's the predicted value?
Yep, ŷ (y-hat) is your predicted value of y, the response variable.
x is your predictor variable.
And the key parts are b1, that's the slope.
It tells you how much ŷ is predicted to change for every one-unit increase in x.
And b0 is the y-intercept.
It's where the line would cross the y-axis, theoretically, when x is zero.
And finding b0 and b1?
I assume we don't usually do that by hand.
Oh, definitely not these days.
The formulas exist, but standard practice is to use software, from calculators to statistical packages.
For our Powerball data, using software, the regression equation came out as ŷ = -10.9 + 0.174x.
Remember,
x is jackpot in millions.
ŷ is predicted tickets sold in millions.
Now, interpreting that slope, the b1 value of 0.174, that's where you get practical insight.
It means for every extra $1 million added to the jackpot, a one-unit increase in x, we predict ticket sales will increase by 0.174 million.
Or 174,000 tickets.
Exactly.
So you can see how lottery officials could use that, right?
They can estimate the impact of boosting the jackpot on ticket revenue.
It helps inform decisions.
That seems incredibly useful.
But here's a really important question, I think.
When should you actually use this equation to make predictions?
And when shouldn't you?
Are there rules for that?
Absolutely critical question.
You use the regression equation when you have a good model.
That means a few things check out.
One, the scatter plot shows a decent linear pattern visually.
Two, your correlation r is statistically significant.
And three,
this is crucial.
The x value you're plugging in to make the prediction is within the scope of your original data.
Don't go too far outside the range of x values you used to build the model.
Like with the Powerball example.
Right.
Using that equation to predict tickets for a $625 million jackpot, which was within the range of jackpots in the data used, gave a prediction of 97.9 million tickets.
The actual sales were around 90 million.
Pretty close.
So the model worked reasonably well there.
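Here is a hedged sketch of that prediction step with the scope check built in. The slope 0.174 is the value quoted in the lecture; the intercept of about -10.9 is an assumption, chosen because it is what the slope and the 97.9 million prediction at x = 625 imply. The `x_min` and `x_max` bounds are hypothetical placeholders, since the lecture does not state the actual range of jackpots in the data.

```python
B0 = -10.9   # y-intercept (assumed; implied by the 97.9 prediction at x = 625)
B1 = 0.174   # slope quoted in the lecture

def predict_tickets(jackpot_millions, x_min, x_max):
    """Predicted ticket sales in millions; refuses to extrapolate."""
    if not (x_min <= jackpot_millions <= x_max):
        raise ValueError("jackpot is outside the range of the data used to fit the line")
    return B0 + B1 * jackpot_millions

# x_min and x_max are made-up bounds for illustration only
prediction = predict_tickets(625, x_min=40, x_max=1600)
print(round(prediction, 2))
```

With these rounded coefficients the result is 97.85, in line with the roughly 97.9 million tickets mentioned above. Raising an error outside the fitted range is one design choice; returning the prediction with a warning would be another.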
But if you have a bad model, maybe there was no significant linear correlation in the first place like trying to predict IQ from height.
Right, you said r would be near zero there.
Exactly.
In that case, the regression equation is useless.
Your best prediction for any person's IQ, regardless of their height, is simply the average IQ of the group, which is typically around 100.
You just predict the mean of y.
And finally, the big warning,
avoid extrapolation.
That means predicting for x values far beyond what you observed.
Why is that so risky?
Because the relationship might change outside your observed range, that linear trend might curve off, or something else might happen.
The classic example is cricket chirps and temperature.
The linear model works well within a certain temperature range, but extrapolate to freezing or boiling temperatures and the predictions become absurd.
That makes sense.
Sticking within the scope of your data is key.
This also makes me think about those outliers we mentioned seeing on the scatterplot.
How do they affect the regression line itself?
Is there a difference between just an outlier and something called an influential point?
Good question.
Yes, there's a distinction.
An outlier is just any point that lies far from the general pattern of the other points.
An influential point, though, is an outlier that, if you were to remove it, would significantly change the regression line itself, its slope or intercept or both.
So it has a big impact on the equation.
A really big impact.
Imagine in our Powerball data, we had one weird point, maybe a huge $980 million jackpot, where for some reason, maybe a hurricane hit, only 12 million tickets were sold.
That single point is way off the trend.
If you included it, it could dramatically pull the regression line down, changing the slope completely. Removing it would make the line snap back.
That's an influential point.
It shows how just one or two unusual data points can sometimes really distort your model if you're not careful.
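To see that effect in numbers, the sketch below fits a least squares line twice: once on a small made-up dataset, and once with the hypothetical $980 million, 12 million ticket point added. The five base data points are invented for illustration and are not the textbook's Powerball data.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up jackpots (millions of dollars) and ticket sales (millions)
jackpots = [100, 200, 300, 400, 500]
tickets = [8, 24, 41, 59, 76]

b0_clean, b1_clean = fit_line(jackpots, tickets)

# Add the hypothetical point: a $980 million jackpot with only 12 million tickets sold
b0_all, b1_all = fit_line(jackpots + [980], tickets + [12])

print(round(b1_clean, 3), round(b1_all, 4))
```

On this toy data the slope is 0.171 without the unusual point and close to zero with it, so that one point nearly erases the relationship the other five show.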
So how does the math actually find this line of best fit when points are scattered around?
How does it decide which line is best?
It uses something called the least squares property.
It's actually pretty clever.
First, think about residuals.
For each data point, the residual is the vertical distance between the actual observed y value and the y value the regression line predicts for that same x.
It's the error or the miss for that point.
So the distance from the dot to the line vertically.
Exactly.
Now, the least squares property says that the regression line, the line of best fit, is the unique line that makes the sum of the squares of all those residuals as small as possible.
Why square them?
Squaring them does two things.
It makes all the errors positive, so misses above and below the line don't cancel out, and it heavily penalizes larger errors.
So the line is pulled towards minimizing those big misses.
That's how it finds the line that overall fits closest to all the points according to that specific criterion.
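The least squares property can be checked numerically. This is a minimal sketch on made-up data: it fits the line with the standard formulas, then confirms that nudging the intercept or slope in any direction makes the sum of squared residuals larger. The nudge sizes are arbitrary.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (not from the textbook)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)                       # about 2.2 and 0.6
best = sum_squared_residuals(b0, b1, xs, ys)    # about 2.4

# Every nearby line should have a larger sum of squared residuals
nudged = [
    sum_squared_residuals(b0 + d0, b1 + d1, xs, ys)
    for d0 in (-0.5, 0, 0.5)
    for d1 in (-0.2, 0, 0.2)
    if (d0, d1) != (0, 0)
]
print(all(s > best for s in nudged))
```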
And can we use those residuals, those errors, to check how well our model is actually working?
I think you mentioned residual plots.
Yes, absolutely.
A residual plot is another diagnostic tool.
You just plot your original x values against their corresponding residuals, the e values.
What you want to see in a residual plot for a good linear model is basically nothing.
Just a random scatter of points above and below the zero line.
No pattern.
No obvious pattern.
If you see a curve, or a funnel shape where the points get more spread out, that suggests your initial linear model might not be the best fit, or that the variability isn't constant. A random scatter supports the idea that the linear model is appropriate.
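A small sketch of what a patterned residual plot looks like in numbers. The data here is deliberately curved (y equals x squared) and invented for this example, so a straight line is the wrong model on purpose.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Made-up data that follows a curve, not a line
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)

# Residual = observed y minus predicted y, one per data point
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals come out positive at both ends and negative in the middle, a clear U shape rather than a random scatter, which is the kind of pattern that argues against a linear model.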
Okay, so we have this regression line, ŷ = -10.9 + 0.174x for Powerball.
It gives us a single best prediction, like 97.9 million tickets for a $625 million jackpot, but you know, that's just one number.
How much uncertainty is there around that specific prediction?
How accurate is it likely to be?
That's where prediction intervals come in.
They address exactly that uncertainty for a single forecast.
It's really important to distinguish this from a confidence interval.
A confidence interval estimates a population parameter, like the average height of all men.
A prediction interval gives you a range for a single predicted value of y based on a specific value of x.
It's about forecasting an individual outcome.
So its purpose is to give a range, saying we predict 97.9 million, but we're pretty confident the actual value will fall somewhere between x and y.
Exactly.
It provides a range that is likely to contain the actual future y value for that specific x.
Using our Powerball example for that $625 million jackpot prediction of 97.9 million tickets, a 95% prediction interval might be something like, say, 73.7 million to 122 million tickets.
Wow, that's quite a wide range.
Why would it be so wide sometimes?
Several things can contribute.
A small sample size used to build the model definitely widens the interval.
Also, the inherent variability in the data, how spread out the points are around the regression line.
And making predictions for x values further away from the average x value in your data set also tends to result in wider prediction intervals, even if you're still within the overall scope.
There's just more uncertainty out towards the edges.
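For reference, the standard textbook form of a prediction interval for an individual y at a given x0 makes those three effects visible. This is the usual simple linear regression formula, written here as a sketch; the lecture itself does not state it, and the symbols x0, s_e, and E are notation introduced for this block.

```latex
\hat{y} - E < y < \hat{y} + E,
\qquad
E = t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}},
\qquad
s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}
```

Here the critical value t has n - 2 degrees of freedom. A small n inflates both the t value and the 1/n term, a large spread around the line inflates s_e, and an x0 far from the mean of x inflates the last term under the square root.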
And this width, this uncertainty, ties right back to r squared.
Remember, r squared told us the proportion of explained variation.
Right.
For Powerball, about 90% was explained by the jackpot.
So the total variation in ticket sales can be split into explained variation, the part linked to jackpot size, about 90%, and unexplained variation, the other 10%, due to other factors or randomness.
That prediction interval essentially reflects the magnitude of that unexplained variation around your regression line.
The more unexplained variation, the wider the interval has to be to capture the likely actual outcome.
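The split into explained and unexplained variation can be computed directly. A minimal sketch on made-up data (not the Powerball data), showing that the two pieces add up to the total and that their ratio is r squared:

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical data (not from the textbook)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
mean_y = sum(ys) / len(ys)

total = sum((y - mean_y) ** 2 for y in ys)                    # total variation
explained = sum((p - mean_y) ** 2 for p in predicted)         # explained by the line
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # left in the residuals

r_squared = explained / total
print(round(total, 3), round(explained, 3), round(unexplained, 3), round(r_squared, 3))
```

For this toy data the total is 6, with 3.6 explained and 2.4 unexplained, so r squared is 0.6.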
That makes sense.
But what if we suspect that ticket sales aren't just about the jackpot?
What if other things matter, too?
Like maybe whether it's a holiday weekend, or how long it's been since the last big win, or competitor lottery jackpots.
Can we include more factors?
Absolutely.
That's precisely what multiple regression is designed for.
It lets you model a response variable, y, using two or more predictor variables, x1, x2, etc.
So instead of ŷ = b0 + b1x, your equation looks more like ŷ = b0 + b1x1 + b2x2, and so on.
I imagine the math gets complicated quickly there.
Oh, definitely.
For multiple regression, using statistical software isn't just helpful, it's essential.
The calculations become very complex, very fast.
When you evaluate a multiple regression model, you look at a few key things.
One is the adjusted R squared.
Adjusted R squared? How's that different from regular R squared?
Regular R squared will always go up, or stay the same, when you add more predictor variables, even if they're totally useless.
Adjusted R squared penalizes you for adding variables that don't actually improve the model significantly, considering the sample size.
So it gives a more honest assessment of the model's explanatory power, especially when comparing models with different numbers of predictors.
You want a high adjusted R squared.
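The penalty is easy to see with the standard adjusted R squared formula, where n is the sample size and k is the number of predictor variables. The R squared values and sample size below are hypothetical, picked only to show a case where a nearly useless extra predictor raises R squared but lowers the adjusted value.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for a model with k predictors fit to n data points."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical comparison: a fourth predictor nudges R squared from 0.800 to 0.802
three_predictors = adjusted_r_squared(0.800, n=30, k=3)
four_predictors = adjusted_r_squared(0.802, n=30, k=4)

print(round(three_predictors, 4), round(four_predictors, 4))
```

Here the adjusted value falls from about 0.777 to about 0.770, so by this measure the three-predictor model would be preferred.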
You also look at the overall p-value for the entire regression model.
A low p-value indicates that the combination of predictor variables taken together does significantly help predict y.
So how do you figure out which predictors to include?
What's the process for finding the best multiple regression equation?
It sounds like it could get messy just throwing variables in.
It definitely requires more than just number crunching.
Finding the best model involves judgment and practical considerations.
Some guidelines.
First, use common sense and subject matter knowledge.
Is it plausible that this variable actually influences the outcome?
Like a doctor's shoe size probably doesn't predict their patient's recovery time.
Second, look for models with a low overall p-value, indicating significance, and a high adjusted R squared.
Third, aim for parsimony.
Generally, a simpler model with fewer variables that explains things well is better than a super complex one.
Fourth, watch out for multicollinearity.
That's when your predictor variables are highly correlated with each other.
It can make the individual coefficient estimates unstable and hard to interpret.
The textbook had a great example on this, didn't it?
Predicting height from crime scene evidence.
Yeah, that was perfect.
You might have potential predictors like age, foot length, shoe print length, shoe size.
Statistically, a model using all four might give the highest adjusted R squared.
But think practically.
Right.
Criminals don't usually leave their driver's license with their age.
Exactly.
And shoe size is basically just a rounded version of foot length, so they're highly correlated: multicollinearity.
And you're far more likely to find a shoe print than a clear bare footprint.
So even if the stats look slightly better with more variables?
The practical realities of collecting the evidence mean a simpler model using just say shoe print length might be the best usable model in that real world context.
It highlights that best depends on usability, not just maximizing a statistic.
And sometimes the variables you want to include aren't numbers, they're categories like male or female or treatment group versus control group.
How do you handle those in regression?
You use dummy variables.
A dummy variable is just a numerical variable you create that uses zero and one to represent the two categories of a qualitative variable.
For instance, if predicting a child's height, you could use father's height and mother's height as x1 and x2, and then create a dummy variable for sex, maybe setting it to zero for female and one for male.
And what does the coefficient for that dummy variable tell you?
Its coefficient, in this case, would represent the estimated difference in predicted height between males, coded as one, and females, coded as zero, after accounting for the parents' heights.
So it directly quantifies the impact of that categorical variable.
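A short sketch of how a dummy variable behaves inside a regression equation. Every coefficient below is hypothetical and chosen only to make the arithmetic easy to follow; none of them come from the textbook or from fitted data.

```python
# Hypothetical coefficients for illustration only (heights in inches)
B0 = 25.0        # intercept
B_FATHER = 0.30  # coefficient for father's height
B_MOTHER = 0.30  # coefficient for mother's height
B_SEX = 5.0      # coefficient for the dummy variable

def predict_child_height(father_in, mother_in, sex):
    """sex is the dummy variable: 0 = female, 1 = male."""
    return B0 + B_FATHER * father_in + B_MOTHER * mother_in + B_SEX * sex

# Same parents, only the dummy variable changes
son = predict_child_height(70, 64, sex=1)
daughter = predict_child_height(70, 64, sex=0)
gap = son - daughter

print(round(son, 1), round(daughter, 1), round(gap, 1))
```

Holding the parents' heights fixed, the gap between the two predictions equals the dummy coefficient, 5.0 in this made-up case.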
That's clever.
And just quickly, what if the category itself is what you're trying to predict?
Like predicting if someone will vote yes or no based on their income and age.
Ah, okay.
If your response variable y is categorical, like yes or no coded as zero and one, then standard linear regression isn't the right tool.
You'd typically use something called logistic regression.
It's a different technique designed specifically for predicting categorical outcomes.
We won't dive deep into it now, but just be aware it exists for those situations.
Okay, so far we've mostly talked about straight line relationships.
But life isn't always linear, right?
Sometimes things curve or grow exponentially.
But what do we do then?
Exactly.
Not all relationships fit a straight line.
That's where nonlinear regression comes into play.
We try to fit different types of curves to the data instead of just a line.
The chapter introduces four common mathematical models beyond linear.
Quadratic, which makes a U shape or inverted U, a parabola.
Think maybe projectile motion or cost minimizing at a certain production level.
Logarithmic, which starts steep, then flattens out.
Maybe learning curves or diminishing returns.
Exponential, which shows rapid growth or decay.
Think population growth, compound interest, or radioactive decay.
And power models, which often describe relationships where one variable is proportional to a power of another, like area and radius.
So you've got these different curve shapes.
How do you choose which one fits your data best?
Is it just guesswork?
Not guesswork, but it involves a similar process to what we've discussed.
First, visual inspection.
Look at the scatter plot.
Does it clearly bend?
Does it look like a U shape or is it leveling off?
That gives you initial candidates.
Second, you can compare R squared values.
Fit several plausible models, linear, quadratic, exponential, etc., using software.
And generally, the model with the higher R squared explains more variation and might be preferred.
But small differences in R squared might not be practically significant.
And third, critically, use your brain.
Think.
Does the model make sense in the context of the problem?
Does it lead to realistic predictions, especially if you were to extrapolate slightly?
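As one way to picture the comparison step, the sketch below fits a straight line and an exponential curve to the same made-up, roughly doubling data and compares how much variation each explains. Two assumptions to flag: the exponential model is fit by a least squares line on ln(y), which is a common shortcut, and R squared is computed here on the original y scale. Statistical software may compute either step differently.

```python
from math import exp, log

def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def r_squared(ys, predicted):
    """Proportion of variation in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total

# Made-up data that roughly doubles at each step
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 3.9, 8.2, 15.8, 32.5]

# Linear model: y-hat = b0 + b1 * x
b0, b1 = fit_line(xs, ys)
linear_pred = [b0 + b1 * x for x in xs]

# Exponential model: fit a line to ln(y), then transform back
a0, a1 = fit_line(xs, [log(y) for y in ys])
exp_pred = [exp(a0 + a1 * x) for x in xs]

r2_linear = r_squared(ys, linear_pred)
r2_exponential = r_squared(ys, exp_pred)
print(round(r2_linear, 3), round(r2_exponential, 4))
```

On this toy data the exponential fit explains noticeably more variation than the line, but the number alone does not settle the choice; the model still has to make sense for the process being described.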
That danger of extrapolation seems even bigger with curves, maybe.
It can be, absolutely.
The U.S. population example in the book is a fantastic cautionary tale.
A quadratic model fit the historical population data from 1800 to 2020 incredibly well.
With an R squared of like 0.9995, it predicted about 399 million for 2040, which seems plausible.
That's like a great fit.
For the observed data, yes.
But if you take that same quadratic model and extrapolate way back in time to 1492, it predicts a population of 663 million, which is completely absurd.
Right, because population growth wasn't following that same parabolic curve back then.
Exactly.
It shows that even a model with a near-perfect statistical fit on your data can produce nonsense outside that range.
Common sense has to override the R squared value sometimes.
The COVID-19 modeling example also touched on this.
Choosing between quadratic and exponential fits required thinking about the underlying process of disease spread, not just which curve had a slightly higher R squared.
Wow.
So we've covered a lot of ground here, from spotting a simple correlation with Powerball to building predictive regression lines, checking residuals, adding multiple factors, and even fitting curves for nonlinear trends like population growth.
We really have.
And stepping back, you can see how these statistical tools, correlation, regression, both simple and complex, are incredibly powerful for making sense of data and informing decisions.
But the big takeaway, I think, is that they are just tools.
Their value depends entirely on the critical thinking you apply alongside them.
The numbers tell part of the story.
But you are the interpreter.
You constantly have to ask, does this relationship make real-world sense?
What are the limitations here?
Could there be lurking variables?
Am I extrapolating dangerously?
It really brings up a final thought, doesn't it?
In this world, absolutely overflowing with data, with claims about this causing that being thrown around constantly, how can using these statistical ideas, understanding correlation isn't causation, checking model fit, being wary of extrapolation, how can that empower you, the listener, to better cut through the noise, to see the genuine connections, maybe question the spurious ones, and really understand the true story the data is trying to tell?
That's the challenge and the opportunity.
Being statistically literate isn't just for researchers anymore, it's becoming a vital skill for navigating the modern world.
Absolutely.
So next time you encounter a claim about two things being related, hopefully you'll feel better equipped to look closer, ask the right questions, and decide for yourself what the data is truly revealing, or perhaps what it is hiding.
ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.