Chapter 7: Psychometrics & Individual Differences

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Our mission here is pretty simple.

We take a dense stack of sources, articles, research, specialized chapters, and we distill the most important nuggets of knowledge into a powerful, comprehensive conversation.

It really is your shortcut to being genuinely well-informed.

And today, we are undertaking a really fundamental deep dive.

We're looking into the language that underpins all modern psychological research.

Yeah, we're stepping back a bit.

We're not talking about specific findings on behavior or cognition today.

Instead, we're focusing on the statistical and the mathematical basis that lets us study individual differences at all.

Our source material is a concentrated chapter on psychometrics.

Yeah, I know what a lot of you are probably thinking.

Statistics.

Isn't this supposed to be about the mind?

Right.

And honestly, for a lot of people who didn't come from a strong math background, that word statistics, it can feel, I don't know, immediately intimidating, almost irrelevant to understanding human nature.

And that is the central misunderstanding we really need to correct right away.

The whole purpose of this discussion is to show you why this knowledge is absolutely foundational.

If you want to read a sophisticated study on personality or understand an argument about intelligence or even interpret a finding in a medical journal, you have to grasp this logical basis.

You just have to.

So we're not trying to turn you into a statistician overnight.

No, not at all.

But we are trying to give you the crucial logical toolkit.

Psychometrics is, well, it's essentially the statistical backbone of psychology.

It's the application of these methods to measure psychological phenomena.

Think of it like this.

Biometrics applies statistics to biology, right?

That helps us understand genetics, growth, epidemiology.

Well, psychometrics does the same thing for the non-observable internal processes of the mind: intelligence, personality, aptitude.

It provides the mathematical scaffolding that validates everything else we study.

It helps us move from just, you know, simple observation to actual scientific inference.

So without this statistical lens, every finding you read, every claim about human behavior, is what?

It's just an anecdote dressed up in fancy words.

That's a great way to put it.

Yeah.

Understanding psychometrics gives you the power to evaluate if a finding is true, if it's important, and maybe most critically, if it's replicable.

All right.

So let's get into it.

Where do we begin?

Let's start with the most basic function of statistics.

Just simple description.

Imagine you, as a researcher, have just measured the IQ scores of 10,000 children.

Okay.

So I've got this massive, just unusable list of numbers.

Exactly.

That list doesn't tell you anything immediately useful.

How do we take that chaos, that data, and turn it into a digestible, meaningful picture?

We visualize it.

When we plot most psychological and biological characteristics, things like height, weight, or intelligence, we almost always get a consistent shape, the normal frequency distribution.

Which people probably know by a few other names.

Oh, yeah.

The Gaussian curve, named after Johann Gauss.

Or sometimes it's called the curve of error.

I find, looking at this graph, it's just an incredible shortcut to understanding a whole population.

So you plot the IQ scores along the bottom, the horizontal axis.

Right, the abscissa.

And then on the vertical axis, the ordinate, you're plotting the number of kids who actually got that score.

And what just leaps out at you is the symmetry.

It's the classic bell curve shape.

It tells us that most people cluster right around the center.

And as you move out towards the extremes, the frequency just drops off dramatically and symmetrically.

And for the standard IQ distribution, that clustering is really, really pronounced, isn't it?

It is.

A full 50% of the entire population falls in a very narrow range, between 90 and 110 IQ points.

Half the population is right there in that average band.

So what happens as we move outward, away from the middle?

The drop-off is sharp.

The next 10-point band on either side, so 80 to 90 and 110 to 120, holds only about 16% of the population on each side.

So roughly 32% in total.

Okay.

And then the next step out.

Even smaller.

Between 70 and 80 and 120 and 130, you're only looking at about 7% on each side.

So about 14% in total.

Wow.

So by the time you get to an IQ of 130, you're already in the top.

The top 2%, exactly.

2% of the population.

And out at the far, far ends, like scores below 60 or above 140?

You're talking about tiny fractions, only about 0.4% of the population in each of those tails.

It really emphasizes just how rare the extremes actually are.

And that visualization, that's what gives meaning to a single score.

Right.

That's the key.

If you tell me a child has an IQ of 120, that number is only meaningful when you compare it to the group. Then we can immediately say, okay, that child is brighter than about 90% of their peers.

And if they score 140?

Then they're in a league where only about one child in 250 is brighter.

The score doesn't exist in a vacuum.

Its meaning is entirely relative to the distribution.

Now here's where the statistical elegance really comes in for me.

We started with 10,000 individual scores.

We managed to turn them into a picture, the curve.

But we can summarize that entire complex picture, all 10,000 of those scores, with just two figures.

That's the principle of parsimony in statistics.

It's beautiful.

We need two primary measurements.

First, you need the mean.

The average?

The simple arithmetic average.

In the standardized IQ distribution, the mean is set at 100.

That defines the center of our curve.

And the second number, that's the one that tells us about the shape of the curve, how spread out or tightly clustered the data is.

That's the standard deviation or SD.

It's the measure of dispersion.

Conceptually, the SD represents the typical distance of a score from that mean; strictly speaking, it's the square root of the average squared distance.

So if the SD is small, the curve is tall and skinny.

Right.

And if it's large, the curve is wide and flat.

I find the visual marker for the SD on the curve itself fascinating.

If you're looking at the bell curve, the SD corresponds to the exact point where the curvature changes.

The inflection point.

Yeah.

Where the curve stops bowing outward, or convex, and starts bowing inward, or concave, as you move away from the center.

And in our IQ example, that visual change happens at scores of 85 and 115.

Through mathematical calculation, we confirm the SD is 15.

And there's a rule for that, right?

For what percentage falls within one standard deviation?

There is.

Approximately 68% of all scores will fall within one standard deviation above and below the mean.

So in this case, between 85 and 115.

So just by knowing the mean is 100 and the SD is 15, we've replaced 10,000 data points with just two numbers.

And we can still recreate the entire shape and all the percentages.

That's tremendous efficiency.
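To make that concrete, here is a minimal Python sketch of how the two numbers alone regenerate the percentages. It assumes IQ follows an exact normal curve with the mean of 100 and SD of 15 used in this discussion; the helper name `share_between` is just an illustrative choice.

```python
from statistics import NormalDist

# Assumed model: IQ scores follow a normal curve with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

def share_between(low, high, dist=iq):
    """Fraction of the population scoring between low and high."""
    return dist.cdf(high) - dist.cdf(low)

middle_band = share_between(90, 110)    # the "average" band
within_one_sd = share_between(85, 115)  # mean plus or minus one SD
below_120 = iq.cdf(120)                 # share of peers scoring under 120
above_130 = 1 - iq.cdf(130)             # share scoring over 130

print(f"Between 90 and 110: {middle_band:.1%}")
print(f"Between 85 and 115: {within_one_sd:.1%}")
print(f"Below 120:          {below_120:.1%}")
print(f"Above 130:          {above_130:.1%}")
```

The same `share_between` call answers any "what fraction falls in this range" question, which is the sense in which the mean and SD stand in for the full list of scores.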

It is.

But the second, and I would argue even more profound, advantage of the SD is that it provides a dimension-free comparison.

The SD isn't tied to inches or IQ points or seconds.

It's a universal statistical measure of relative position.

So we're converting a raw score into a kind of universal metric.

Precisely.

We stop saying the child has an IQ of 130, and instead we can say the child is two standard deviations above the mean.

And that statement is generalizable to any variable that distributes normally.

Okay.

So let's use that height example from the text.

So we measure the height of those same 10,000 children, and we find the average height, the mean, is 60 inches, and the SD is two inches.

Okay.

Now we can compare apples and oranges statistically.

A child who is 64 inches tall is two SDs above the mean in height.

That's 60 plus two times two.

Right.

And the child with an IQ of 130 is two SDs above the mean in intelligence.

100 plus two times 15.

So even though one is measured in inches and the other in IQ points, they are statistically equivalent.

I see the power of that now.

In terms of their relative standing within their own populations, that 64 inch child and 130 IQ child hold the exact same relative position of high achievement.
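The conversion being described is usually called a z-score (a standard term, though not one used in this discussion). A small sketch using the height and IQ figures from the example:

```python
def z_score(raw, mean, sd):
    """How many standard deviations a raw score sits above (+) or below (-) the mean."""
    return (raw - mean) / sd

# Figures taken from the example above.
height_z = z_score(64, mean=60, sd=2)    # 64 inches in a group averaging 60
iq_z = z_score(130, mean=100, sd=15)     # IQ 130 in a group averaging 100

print(f"Height: {height_z:+.1f} SD, IQ: {iq_z:+.1f} SD")
```

Both values come out the same, which is the "statistically equivalent" standing just described.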

And this dimension-free comparison, that is the essential foundation that allows us to ask scientific questions about the relationships between different measures.

And that, of course, is the definition of correlation.

Okay.

So once we can compare these different dimensions, we can move on to actually testing relationships.

For instance, is there any truth to that old societal image of the small, non-athletic, but highly intellectual child?

Right.

If that stereotype were accurate, we would expect to see a positive correlation between small stature and high intelligence, which is to say a negative correlation between height and IQ.

But we need an objective measure to test that.

We can't just eyeball the data.

And critically, like we just said, we can't compare inches and IQ points directly.

But we can compare them as standard deviations from their respective means.

And this is exactly what the product-moment correlation formula does.

It takes those normalized scores and it quantifies the extent of the relationship between variable X, let's say height, and variable Y, let's say IQ.

We don't need to get lost in the math, but it helps to understand the components of that formula conceptually.

It gives us the correlation coefficient, which is symbolized by the letter R.

Yes.

And the core of that formula involves basically summing up the product of the deviations of X and Y from their means.

So if a high score on X, like being tall, a positive deviation, consistently happens along with a high score on Y, a high IQ, also a positive deviation, then the product of those deviations is going to be a big positive number.

Exactly.

And if a high score on X consistently coincides with a low score on Y, which is a negative deviation, the product will be negative.

The summation of all those products across the entire sample is what determines the sign and the magnitude of R.
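As a rough sketch of that idea, the function below computes the product-moment correlation as the average product of standardized deviations. The five height and IQ pairs are made-up illustration data, not figures from the source, and they are far more strongly related than real height and IQ data would be.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation: mean product of standardized deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / n

# Hypothetical data for five children (illustration only).
heights = [56, 58, 60, 62, 64]
iqs = [95, 90, 105, 100, 110]

r = pearson_r(heights, iqs)
print(f"r = {r:.2f}")
```

Pairs where both deviations share a sign push the sum up, and pairs with opposite signs pull it down, so the sign and size of the result reflect how consistently the two variables move together.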

So R is a quantification of agreement.

How do we interpret the value of R?

What do the numbers mean?

The correlation coefficient, R, ranges from positive one down through zero to negative one.

A value of plus one signifies a perfect positive agreement.

This is a situation where the variables vary perfectly together.

So the tallest kid is the brightest, the next tallest is the next brightest, and so on all the way down.

Perfect lockstep.

A value of zero on the other hand indicates a complete lack of relationship.

The tall boy is just as likely to be bright or dull or average as the short boy.

No connection.

And negative one.

That would mean perfect inverse correspondence.

This would be the scenario where that swot stereotype is perfectly true.

The tallest is the dullest and the shortest is the brightest.

Okay.

So let's get to the real data.

What is the empirical finding for height and IQ?

Does that swot stereotype hold up?

Well, the empirical studies, they generally contradict the stereotype.

The correlations between height and IQ are typically found to be small, somewhere around positive 0.2, maybe a little lower, but they're consistently positive.

So tall children are, on average, slightly brighter than small children.

That's what the data suggests.

Yes.

A correlation of 0.2 sounds pretty low.

How should we interpret that value in the context of other psychological correlations?

That's the right question to ask.

It's helpful to place it next to other values we see in the field.

For example, the reliability of a very good measure, like giving an IQ test today and then giving it again next week, might give you an R around 0.95.

That's nearly perfect.

That's just showing the test itself is consistent.

Right.

Now, when you look at something like IQ correlated with academic achievement, say in English or history, the R often falls between 0.5 and 0.8.

Which makes sense.

Intelligence is a strong predictor of school success, but it's not a perfect one.

Other things matter, like motivation, teaching quality, all sorts of things.

Then you look at correlations in areas where the variables are more complex, like physical measures, height and weight.

They correlate reasonably well, maybe 0.5 or 0.6.

And for social outcomes, things that are really hard to predict.

The predictability drops way down.

If you look at, say, betting experts ranking football or baseball teams and you correlate those preseason rankings with the final order at the end of the season, that correlation is often only around 0.3.

That low number reflects all the random, unpredictable factors: injuries, weather, team dynamics, everything that can happen over a season.

So that 0.3 figure for predicting reality is a really good anchor point.

It helps you understand just how low a 0.2 correlation for height and IQ really is.

But that brings us to what might be the most critical warning about error.

Yes.

Correlation coefficients are not percentages.

This is the most common interpretive mistake, right?

Especially in the popular science reporting.

It's everywhere.

People see an R of 0.50 and they assume that 50% of the variance in one variable is explained by the other.

And that is fundamentally wrong.

The actual rule is that if we accept that variable X is causing variable Y, or that they share common elements, the percentage of overlapping variance is given by the square of the correlation, or R-squared.

Let's illustrate just how dramatic that difference is.

If R were plus or minus one, then R-squared is one.

100% commonality.

Perfect overlap.

Now, if you want a scenario where exactly 50% of the variance is shared, you have to find the square root of 0.50, which is approximately 0.71.

So you need an R of 0.71 to say that there's 50% common elements.

If we look at that academic range where correlations are often around 0.7, that's where we see substantial overlap.

Almost half the reason for high grades is shared with high IQ.

But now look at the midpoint.

What if we have an R of 0.50?

You square that and you get 0.25.

Wow.

So a correlation that seems moderate, halfway up the scale, is actually only indicating 25% common elements.

Exactly.

Three quarters of the variance is totally unexplained by the other variable.

It fundamentally changes how you view a correlation of 0.50.

It sounds like a strong relationship, but it's only explaining a quarter of the story.

And those expert sports rankings with an R of 0.30?

You square that, you get 0.09.

That means the experts' rankings shared only 9% of common elements with the final outcome.
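The squaring rule is easy to tabulate. The sketch below runs through the correlation values mentioned in this discussion; the function names are illustrative choices, not anything from the source.

```python
from math import sqrt

def shared_variance(r):
    """Proportion of common variance implied by a correlation r."""
    return r ** 2

def r_needed_for(share):
    """Size of correlation needed to reach a given proportion of shared variance."""
    return sqrt(share)

for r in (1.0, 0.71, 0.50, 0.30, 0.20):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.0%}")

print(f"r needed for 50% shared variance: {r_needed_for(0.50):.2f}")
```

Reading down the printed table shows how quickly the shared variance shrinks as r falls, which is why small correlations are so easy to overrate.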

When you're reading psychological studies, if you fail to remember this R-squared rule, you will routinely and massively exaggerate the importance of small or even moderate correlations.

Okay.

So these statistical tools are only useful if the data you're feeding into them is actually sound.

But as soon as we start applying statistics, we run head first into some really big practical difficulties, starting with the problem of sampling.

It's a huge problem.

Any generalization we make in psychology is in theory about a massive population, all humans, all teenagers, all people with anxiety, but we never test the whole population.

We only test a small sample.

And if that sample is flawed, the generalization is useless.

Completely useless.

The ideal sample would be perfectly representative, but that's often just not practical.

So let's look at the methods researchers actually use.

The source material highlights three main types.

The first and unfortunately least satisfactory, but probably the most common, is accidental sampling.

This is just using whoever is available.

Pretty much.

And the source makes a rather pointed generalization here: that the subjects in leading American psychological journals are, rather depressingly, rats, lunatics, or sophomores.

That's a famous, somewhat cynical summation, but it highlights that the subjects often aren't representative of the average person at all.

Not at all.

And even within those groups, they're constrained.

The rats are often specific, highly uniform lab strains.

The lunatics are typically neurotics who are already in treatment or psychotics who are medicated.

And the sophomores are often psychology undergraduates just trying to get course credit.

Acting under duress, as the text says.

This kind of constrained, accidental sampling makes extrapolating results to the general human population extremely problematic.

So what does the statistical ideal actually look like?

The ideal is a truly representative sample.

That's where every single member of the population has an equal random chance of being selected.

You'd need a national list of names, and a computer would just randomly pick subjects.

But the logistics of that, tracking down every single person chosen, no matter where they live or if they even want to participate, it's often called an impossible problem.

It's logistically prohibitive for most university research.

It's just too expensive and time consuming.

So we seek a compromise.

And that compromise is the quota sample.

Right.

This is what you see polling organizations use a lot.

Here, the sample's makeup in terms of social class, sex proportions, and so on, is carefully structured to mirror the overall population.

It's reasonably accurate for descriptive studies.

And then there's the third method, which involves actively manipulating the sample, usually for an experimental design.

Yes.

The experimental design sample, which is often used with a technique called analysis of variance.

This is a deliberately artificial sample, and it's designed to maximize the information you can get about how different variables interact with each other.

Can you walk us through how that would be designed?

Sure.

Suppose a researcher wants to investigate how diet affects IQ, but they suspect the effect might vary by race, sex, and social class.

A natural sample wouldn't give them equal numbers for every single one of those combinations.

They might end up with way more white middle class male subjects than, say, black working class female subjects.

So they create an analysis of variance design.

They make sure there's equal representation in every cell.

If they use two races, two sexes, and six social classes, that gives them 24 cells, and they put an equal number of subjects in each one.

And that's crucial because it lets the researcher move beyond a simple comparison, like men versus women.

They can ask, does this finding hold true for all subgroups, or does diet only affect IQ in, for example, working class blacks?

It allows you to identify these inhomogeneities and complexities that a simple sample would miss. It makes the findings far more nuanced and robust.
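The cell structure of that design can be sketched in a few lines. The category labels and the figure of ten subjects per cell are placeholders chosen for illustration; the source only specifies two races, two sexes, six social classes, and equal numbers in each cell.

```python
from itertools import product

# Placeholder labels; only the counts (2 x 2 x 6) come from the example above.
races = ["race A", "race B"]
sexes = ["female", "male"]
social_classes = [1, 2, 3, 4, 5, 6]
subjects_per_cell = 10  # hypothetical; any equal number per cell works

# Every combination of the three factors is one cell of the design.
cells = list(product(races, sexes, social_classes))
design = {cell: subjects_per_cell for cell in cells}
total_subjects = sum(design.values())

print(f"{len(cells)} cells, {total_subjects} subjects in total")
```

Because every combination is filled equally, a comparison within any one subgroup rests on as many subjects as a comparison within any other.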

Before we leave sampling, we have to address the whole issue of generalizing from animal studies.

If psychologists are using these very specific strains of rats, they often defend it by arguing that there are evolutionary uniformities linking rat and human behavior.

What's the psychometric verdict on that defense?

The defense is partial.

It's not completely wrong.

Evolution dictates that we share fundamental biological systems, a central nervous system, a cortex, the autonomic nervous system that governs emotions.

So results from other mammals can be suggestive for human behavior.

Suggestive is the perfect word, but suggestion is not proof.

Before you can transfer those results and make generalizations about humans, the applicability of that animal model has to be specifically demonstrated in the human context.

So we have to be critical.

You have to be critical.

You can't just accept or reject all animal studies wholesale.

You have to assess each finding on its own merits and its biological proximity to the human system.

Okay, so now we shift from the structure of the sample to the second major statistical challenge.

Dealing with the role of pure chance.

In other words, determining the significance of statistical data.

Right.

Since we only ever deal with samples, our results will inherently contain some amount of error.

Our sample mean of, say, 102 will almost always differ at least slightly from the true population mean of 100.

The mathematical question is how much more accurate is a bigger sample?

Intuition would tell you that if you double the sample size, you should double the accuracy.

And intuition would be wrong.

The mathematics of chance deviation is very counterintuitive and it's financially critical for research.

Accuracy increases not linearly with the number of people, which we call n, but as the square root of the number of people.

That has profound implications.

So if I want my results to be twice as accurate to cut my chance error in half, I don't double my sample size.

I have to multiply it by four.

Exactly.

And if you want 10 times the accuracy, you need 100 times the subjects.

This mathematical reality governs the budget and the scope of psychological research all over the world.
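One way to see the square-root rule in action is a small simulation. This is a sketch under stated assumptions: scores are drawn from a normal curve with mean 100 and SD 15, and the sample sizes, trial count, and seed are arbitrary illustrative choices.

```python
import random
from statistics import pstdev

def spread_of_sample_means(n, trials=1000, mu=100, sigma=15, seed=0):
    """Draw `trials` random samples of size n and return the SD of their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return pstdev(means)

# Each step multiplies the sample size by four.
spreads = {n: spread_of_sample_means(n) for n in (25, 100, 400)}
for n, spread in spreads.items():
    print(f"n = {n:3d}: sample means vary with an SD of about {spread:.2f}")
```

Each fourfold increase in n should roughly halve the spread of the sample means, rather than cutting it to a quarter.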

And this cost of accuracy is what necessitates the second key function of statistics.

Quantifying our confidence in the results, the analysis of significance.

We have to know if the results we got from our sample are likely to be real, to replicate in the true population, or if they just popped up by chance.

And we quantify this confidence using probability, or the p-value.

So let's go back to a classic example.

We test 100 men and 100 women.

The men's mean IQ is 102.

The women's is 97.

That's a five -point difference.

Is that difference statistically significant?

We use established customs to determine that threshold.

A difference is usually considered significant if it could have arisen by chance no more than once in 20 trials.

We express that as p is less than 0.05.

And very significant?

That's when the chances are 100 to 1 against finding that result by chance.

So p is less than 0.01.

To figure out the p-value, though, we need to understand how sample means behave, not just individual scores.

Yes, that's the key.

If you took 100 different samples of 100 people each and calculated the mean for each sample, those 100 means would themselves form a normal curve.

And it would be centered precisely on the true population mean of 100.

But that curve of means would be much less spread out than the curve of individual scores.

Far less spread out.

So we need to calculate the standard deviation of that curve of means.

It's called the standard deviation of the mean, SDM, or the standard error of the mean.

And its formula is the population standard deviation, sigma, divided by the square root of the sample size n.

So in our IQ example, the individual SD is 15 and the sample size n is 100.

So the SDM is 15 divided by the square root of 100, which is 15 divided by 10.

So it's 1.5.

And that figure, 1.5, defines the statistical behavior of our sample mean.

Which means that 68% of all samples of 100 people will have a mean IQ that falls between 98.5 and 101.5.

And only 5% of samples will deviate by more than two SDMs.

So above 103 or below 97.
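The arithmetic for the standard error can be checked directly. This sketch uses the SD of 15 and sample size of 100 from the example, and assumes the sample means follow an exact normal curve.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(sd, n):
    """Standard deviation of the mean (SDM): population SD over the square root of n."""
    return sd / sqrt(n)

sem = standard_error(15, 100)
band_68 = (100 - sem, 100 + sem)          # about 68% of sample means land here
band_95 = (100 - 2 * sem, 100 + 2 * sem)  # roughly 95% land here

# Share of sample means falling outside two SDMs, assuming a normal curve of means.
curve_of_means = NormalDist(mu=100, sigma=sem)
outside_two_sdm = 1 - (curve_of_means.cdf(band_95[1]) - curve_of_means.cdf(band_95[0]))

print(f"SDM = {sem}")
print(f"68% band: {band_68[0]} to {band_68[1]}")
print(f"95% band: {band_95[0]} to {band_95[1]}")
print(f"Share outside two SDMs: {outside_two_sdm:.1%}")
```

The share outside two SDMs comes out slightly under the conventional 5%, because the exact 5% cutoff sits at about 1.96 SDMs rather than 2.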

And that SDM allows us to test the null hypothesis, which is the hypothesis that there is no true difference between men and women.

The five-point difference we observed, 102 versus 97, is right at the threshold of that 5% chance boundary.

So given a sample size of only 100 for each group, that five point difference isn't necessarily something we can be super confident in.

It's barely significant.

Barely significant.

It might be real, but it might just be chance.

This same kind of analysis also applies to correlation coefficients.

Let's go back to our height and IQ correlation of r equals 0.2 in a sample of 100 boys.

Is that finding statistically robust?

Well, in that case, with an n of 100 and an r of 0.2, the p-value is almost exactly 0.05.

So again, it's just significant.

A correlation that size would turn up by chance only about once in 20 studies if there were no real relationship, which is good, but it's not overwhelmingly strong.

We can visualize this relationship between sample size and the required confidence level in what the source calls Figure 7.2.

Yeah, that figure plots the r values you need to reach the p equals 0.05 and p equals 0.01 levels against the sample size n.

And what you see is that when n is small, the correlation has to be massive to rule out chance.

Right.

For a sample size of 20, you need an r of 0.43 just to be significant and 0.55 to be highly significant.

But the graph shows that as n increases, the required r just plummets, following that square root rule.

For a sample size of 150, you only need an r of 0.16 to reach that p equals 0.05 level.

And that visually confirms why these massive large scale studies can validate very small correlations that smaller studies could never distinguish from random noise.

It's all about separating the signal from the noise, and statistics gives us the tools to do that rigorously.
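A curve of this kind can be approximated in code. The sketch below uses the Fisher z approximation for a two-tailed test, which is an assumption on our part and not necessarily how the source's figure was produced, so its values may differ slightly from numbers read off that figure.

```python
from math import sqrt, tanh
from statistics import NormalDist

def critical_r(n, alpha=0.05):
    """Approximate smallest |r| reaching two-tailed significance at level alpha.

    Uses the Fisher z approximation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3) when the true correlation is zero.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))

for n in (20, 50, 100, 150, 1000):
    print(f"n = {n:4d}: r >= {critical_r(n):.2f} for p = 0.05, "
          f"r >= {critical_r(n, alpha=0.01):.2f} for p = 0.01")
```

The required r falls steeply at first and then flattens out, which is the same shape the figure is described as showing.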

Okay, so we've established how to find a relationship, that's r, and how to determine if that relationship is likely real and replicable, that's the p-value.

Now we come to what you've called the most dangerous pitfall in all of psychological statistics.

Without a doubt, correlation does not imply direct causation.

This is the interpretive trap that catches everyone from first year students to seasoned sociologists.

It really does.

Even a very large, very significant correlation between variable x and variable y does not, on its own, mean that x caused y or that y caused x.

The classic illustration of this danger is, well, it's notorious in statistics. Give us the details on that highly correlated but causally meaningless data.

That would be the famous or infamous study that found a correlation of 0 .93 between the number of registered prostitutes in Yokohama, we'll call that x, and the number of iron ingots exported from Pittsburgh, we'll call that y, over a period of decades.

A correlation of 0 .93 is nearly perfect.

If we just followed the r value, we'd have to conclude that increasing the number of prostitutes in Japan somehow directly drives up U.S. steel exports, which is logically impossible.

Completely absurd.

The causal link, therefore, must be an invisible third variable, which we can call z.

In this case, z was the massive population explosion and global industrialization that was happening over that time period.

Increased population drove up activity in all large city centers, which simultaneously increased the number of registered prostitutes, x, and also increased the need for raw materials, like iron ingots, y.

So the correlation between x and y is just a statistical shadow.

It's a ghost created by the mutual influence of the true cause, z.
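A toy simulation shows how a hidden z can manufacture a strong x-y correlation. All numbers here are synthetic and arbitrary. The partial correlation at the end, which statistically holds z constant, is a standard technique that we are adding for illustration; it is not mentioned in the source.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.3) for zi in z]     # x depends only on z
y = [zi + rng.gauss(0, 0.3) for zi in z]     # y depends only on z

r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)

# Correlation between x and y once z is held constant.
partial_xy_given_z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"r(x, y) = {r_xy:.2f}")
print(f"r(x, y | z) = {partial_xy_given_z:.2f}")
```

In this setup x never influences y, yet the raw correlation is high; it collapses toward zero only once z is taken into account, which requires having measured z in the first place.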

And this phenomenon is absolutely rampant in complex social research.

Let's apply this to a more realistic example.

The common finding of a correlation between delinquency, x, and broken homes, y.

Delinquents are statistically more likely to come from broken homes.

And the mistaken but often assumed explanation, hypothesis A, is that the broken home directly causes the delinquency.

But the correlation itself offers at least two other equally plausible explanations that mask a third factor.

Consider hypothesis B.

Maybe parents who have, say, bad genes are genetically predisposed to instability.

This leads them to both break up their marriage and to pass on traits that predispose their children to criminality.

So in that case, genetics is the z variable, causing both the broken home and the delinquency.

Exactly.

And we also have to consider the direction of the influence.

That's hypothesis C, reverse causation.

Perhaps the child, who is already exhibiting potentially criminal behavior or is just unusually troublesome, creates so much strain and conflict that their behavior is the factor that causes the parents' marriage to break up.

I see.

All three of those scenarios and likely combinations of them are perfectly consistent with the observed correlation.

The simple calculation of R only shows us that there is some connection.

It takes much more rigorous, often longitudinal, experimental design to figure out the causal nature and direction of that connection.

So if we want to be critically informed readers of psychological research, this warning about causation has to be the anchor.

Every single time a study finds a correlation, we have to immediately ask, what is the unmeasured z variable that could be causing both x and y?

That is the single most important question you can ask.

Okay.

So as psychometrists, the goal is to build tests that can reliably and validly measure these complex traits like intelligence and personality.

And these are the two indispensable statistical requirements for any useful psychological test.

They are.

Let's define them statistically, starting with reliability.

That's just the test's consistency.

Its ability to give you similar results if you apply it repeatedly.

The most direct way to measure that is test-retest reliability.

Right.

You administer the test today, you administer it again next week, and then you correlate the two sets of scores.

If the r value is high, say 0.90 or better, the test is considered reliable.

But there's a problem there.

If you wait too long between tests, the person might actually change.

But if you wait too short a time, they might just remember the answers and artificially inflate their second score.

That's the challenge.

And it leads to alternatives.

We often use alternate forms reliability, where we create two technically interchangeable versions of the test and give them on different occasions.

That gets around the memory effect.

There's a more efficient method, too, that only needs one administration of the test: odd-even reliability, or the split-half method.

In that method, if you have a hundred-item test, you correlate the score from all the odd-numbered items with the score from all the even-numbered items.

If that correlation is high, it indicates good internal consistency of the test.

The assumption being that the odd items and the even items are measuring the same thing equally well.

Precisely.
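The odd-even procedure can be sketched on simulated answer sheets. Everything about the data is an assumption made for illustration: 200 simulated test-takers, 100 right-or-wrong items, and a simple logistic link between a person's ability and their chance of answering any item correctly.

```python
import random
from math import exp, sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(7)  # fixed seed so the run is repeatable
N_PEOPLE, N_ITEMS = 200, 100

def simulate_answers(ability):
    """One synthetic answer sheet: 1 = correct, 0 = wrong."""
    p_correct = 1 / (1 + exp(-ability))
    return [1 if rng.random() < p_correct else 0 for _ in range(N_ITEMS)]

answers = [simulate_answers(rng.gauss(0, 1)) for _ in range(N_PEOPLE)]

# Items 1, 3, 5, ... sit at list indexes 0, 2, 4, ...
odd_scores = [sum(sheet[0::2]) for sheet in answers]
even_scores = [sum(sheet[1::2]) for sheet in answers]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"Odd-even correlation: {split_half_r:.2f}")
```

Because every simulated item taps the same underlying ability, the two half-scores agree closely; a test whose items measured unrelated things would show a much lower figure.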

Okay, so that's reliability.

Now we move to validity.

And this is considered the more important and significantly more difficult requirement.

A test is valid to the degree that it measures what it's actually supposed to measure.

And the first major way we try to establish validity is through external validity.

This is where we correlate the test scores with some external criterion that is, in theory, linked to the quality we're trying to measure.

Let's use the classic example of testing aptitude for a job or a skill, like being a pilot.

Okay.

A test battery is given to 100 Air Force candidates.

Their scores on that battery are then correlated with their eventual success on a final objective flying ability test given at the end of their training.

That final flying test is the external criterion.

The resulting correlation is the validity coefficient.

But this method immediately runs into two huge problems.

The first is the lack of criteria.

Absolutely.

What is the true external criterion for intelligence?

Is it success in school?

Is it lifetime earnings?

Social class achieved?

Teacher ratings?

Because intelligence is such a vast complex concept, there is no single simple external measure that can serve as a perfect yardstick.

The criterion itself is often really fuzzy.

And the second problem is maybe even more damning.

Even when a criterion seems appropriate, it might be faulty, meaning it lacks reliability itself.

And the famous World War II U.S. Air Force experience proves this perfectly.

They developed this extensive flying test battery, and they used expert instructor ratings of maneuvers like loops and rolls as their external criterion.

Initially, the test battery seemed completely invalid because it showed zero correlation with the instructor ratings.

So the test seemed useless, but the problem wasn't the test.

The problem was the criterion.

The researchers discovered that the instructor ratings had zero reliability.

When two independent instructors rated the same maneuver performed by the same student, their scores didn't correlate with each other at all.

So the criterion was worthless.

It was just subjective noise.

Pure noise.

And once they fixed the criterion, made it objective and reliable, the test battery proved to be highly valid.
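The logic of that episode can be sketched with simulated numbers. Everything below is hypothetical: the sample size, the noise levels, and the variable names are assumptions chosen only to show how a criterion with no reliability hides a valid test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000  # hypothetical number of trainees

skill = rng.normal(size=n)                  # unobserved true flying skill
battery = skill + 0.5 * rng.normal(size=n)  # aptitude test battery score

# Subjective criterion: two instructors whose ratings are unrelated to skill.
rating_a = rng.normal(size=n)
rating_b = rng.normal(size=n)

# Objective criterion: tracks skill with modest measurement error.
objective = skill + 0.5 * rng.normal(size=n)

def r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

inter_rater = r(rating_a, rating_b)         # reliability of the ratings
validity_subjective = r(battery, rating_a)  # battery vs. noisy criterion
validity_objective = r(battery, objective)  # battery vs. reliable criterion

print(round(inter_rater, 2), round(validity_subjective, 2), round(validity_objective, 2))
```

The battery is the same in both comparisons. Only the criterion changes, and with it the apparent validity.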

And this is why psychometrists are so wary of subjective measures like school essay grades or subjective teacher ratings.

They often introduce this criterion problem.

So that leads us to the second type of validity, internal validity.

Right.

When a single external criterion is impossible to define or is just unreliable, we have to measure validity by how well the test fits into an entire theoretical structure, which is called a nomological network.

So we're not correlating the test against one single thing, but against an entire web of established facts and theories.

That's it.

Exactly.

The test is considered valid to the degree that when you place it within this network of facts, it makes successful predictions possible.

And the statistical tool that allows us to explore this internal consistency and structure is factor analysis.

Okay.

So factor analysis.

This is a powerful, highly specialized statistical technique that was developed specifically for this kind of internal validation, particularly in complex fields like intelligence and personality.

It allows us to determine if a hypothetical underlying factor is necessary to explain the relationships we observe in the data.

So to understand the logic, let's start with the foundational theory, the existence of a general intelligence factor, which we'll call G.

We hypothesize that various mental tests measure this G to different degrees.

Okay.

Now, if we had a perfect measure of G, the correlation between any given test and that perfect measure would be its factor saturation.

That's its validity coefficient relative to G.

So let's construct a hypothetical example, which the source calls Table 7.1.

We're going to assume we have six different tests and they have factor saturations of 0.9, 0.8, 0.7, 0.6, 0.5, and 0.4 respectively with this general factor G.

From those hypothesized saturations, we can then mathematically calculate the expected inter -correlations between all six tests.

And the rule is that the correlation between any two tests is simply the product of their factor saturations.

Right.

So test one with its 0.9 saturation correlated with test two with its 0.8 saturation would be 0.9 times 0.8, which is 0.72.

Or test four with 0.6 correlated with test six with 0.4 is 0.6 times 0.4, or 0.24.

So the table of observed correlations is just a matrix of these products.

And we also see values along the diagonal in brackets, which are the reliabilities, the correlation of each test with itself, and those are just the factor saturations squared.

Right.

So for test one, it's 0.9 squared, which is 0.81.
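As a rough sketch, the whole hypothetical table can be rebuilt in a few lines of Python from the six saturations 0.9 down to 0.4. The variable names are my own, and the layout may not match the source's Table 7.1 exactly.

```python
import numpy as np

# Hypothesized saturations of the six tests with the general factor G.
saturations = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

# Each cell is the product of two saturations; the diagonal holds the squares.
R = np.outer(saturations, saturations)

print(np.round(R, 2))
```

Row one, column two holds the 0.72 from the example, row four, column six holds the 0.24, and the top-left diagonal cell holds the 0.81.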

Now here is the conceptual leap that factor analysis performs.

In real-world research, we start with the observed correlations, the products, like 0.72 and 0.24.

We do not know the factor saturations.

We don't know the 0.9, the 0.8, and so on.

We want to use the observed correlations to deduce the unobservable saturations and therefore the validity of the tests relative to this hypothetical G.

We're starting with the product and trying to find the factors that made it.

It sounds difficult, but the statistical structure helps us.

How so?

When the resulting matrix of correlations is due to a single underlying factor, it exhibits this amazing regularity.

It's called the super diagonal form.

The correlations get progressively smaller as you move away from the top left corner.

But more importantly, there's a proportionality.

The ratio between the correlations holds constant.

Yes.

For instance, in our hypothetical matrix, if you look at test one and test two, the correlations in column one are proportionally related to the correlations in column two.

The ratio 0.54 divided by 0.45 has to equal the ratio 0.48 divided by 0.40, and so on.

Mathematically, this regularity means the matrix has a rank of unity.
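Both claims, the constant ratios and the rank of one, can be checked numerically. This is a small sketch using the same six hypothetical saturations, with NumPy's `matrix_rank` standing in for the algebra.

```python
import numpy as np

a = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
R = np.outer(a, a)  # single-factor correlation matrix, diagonal included

# Rows for test one and test two, columns for test four and test five.
ratio_row1 = R[0, 3] / R[0, 4]  # 0.54 / 0.45
ratio_row2 = R[1, 3] / R[1, 4]  # 0.48 / 0.40

# Column one relative to column two, using only tests three to six.
column_ratios = R[2:, 0] / R[2:, 1]

rank = int(np.linalg.matrix_rank(R))
print(ratio_row1, ratio_row2, column_ratios, rank)
```

Every entry of `column_ratios` comes out at nine-eighths, which is the proportionality the next step relies on.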

Because of that consistency and proportionality, we can algebraically work backward to calculate the missing pieces, those diagonal reliability values, even though we never measured them directly.

Let's take the calculation example from the text.

Since all the values in column one are nine-eighths of those in column two, we can deduce that the reliability value for test one must be equal to nine-eighths of the correlation 0.72, and that result is 0.81.

By taking the square root of that derived reliability, 0.81, we get back to the original factor saturation, 0.9.

So without ever knowing what the perfect criterion for G is, we've used the pattern of inter-correlations between the tests to statistically deduce the validity, the factor saturation of each test, with respect to that hypothetical factor.
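Here is one way to sketch that backward step in Python. It hides the diagonal, then recovers each saturation from the off-diagonal correlations alone. The helper `recover_saturation` and its choice of reference column are illustrative assumptions, and the approach only works this cleanly because the made-up matrix has exactly one factor and no sampling error.

```python
import numpy as np

a_true = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
R = np.outer(a_true, a_true)
np.fill_diagonal(R, np.nan)  # the reliabilities are treated as unknown

def recover_saturation(R: np.ndarray, i: int) -> float:
    """Estimate test i's saturation from off-diagonal correlations only."""
    j = (i + 1) % len(R)  # any other test serves as the reference column
    others = [k for k in range(len(R)) if k not in (i, j)]
    ratio = np.mean(R[others, i] / R[others, j])  # column i relative to column j
    diagonal = ratio * R[i, j]                    # implied reliability of test i
    return float(np.sqrt(diagonal))

recovered = [recover_saturation(R, i) for i in range(len(R))]
print(np.round(recovered, 2))
```

For test one the ratio is nine-eighths, the implied diagonal is nine-eighths of 0.72, and the square root returns the saturation of 0.9.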

The analysis confirms the existence of a single unifying factor underlying all the test performance, and it also tells us that the overall battery's validity, in this case, is impressively high, over 0.95.

That is a remarkable demonstration of statistical deduction.

So the question then becomes, does real-world data from actual psychological tests arrange itself in this neat, mathematically consistent pattern?

And historically, the answer is yes.

Professor C. Spearman was the first to notice this orderly appearance of inter-correlation matrices among intelligence tests, which is what led him to postulate his famous theory of G in the first place.

But the pure G model turned out to be a little too simple.

Sir Cyril Burt added a really important clarification, noting that in addition to G, there are also special abilities or group factors.

Right, things like verbal ability, memory, numerical ability, which exist alongside G.

These group factors complicate the matrix a little bit, but they don't negate the existence of G.

And we see this complexity beautifully illustrated in the work of Professor L.L. Thurstone, who used factor analysis to identify what he called primary mental abilities.

And Table 7.2 in the source summarizes the inter-correlations between six sets of tests designed to measure those distinct primary abilities: reasoning, word fluency, verbal, numerical, memory, and spatial.

And despite being categorized as these different primary abilities, we still see that the tests all correlate with each other.

Reasoning correlates with numerical ability, verbal correlates with spatial ability.

This empirical similarity to that orderly pattern we saw in Table 7.1 shows that even these distinct primary factors must share some kind of general underlying influence.

And factor analysis also gives us a clear metric for which of those primary abilities best captures that general underlying influence.

It does.

The factor saturations that were derived from Thurstone's own data show that reasoning ability is a far superior measure of general intelligence, with a saturation of .84.

On the other hand, rote memory ability has a much lower saturation, only .47.

And that aligns with our general understanding of what intelligence structure looks like.

Now, factor analysis isn't limited to intelligence.

It's also fundamental to personality research.

But here, the technique is often used to see if factors are entirely independent or orthogonal.

A great example is looking at two prominent personality dimensions, neuroticism, or N, and extraversion, or E.

The hypothesis is that these two dimensions are independent.

They don't correlate with each other.

So researchers compile test items.

Six of them are keyed to measure N, like, do you sometimes feel happy, sometimes depressed?

And six are keyed to measure E, like, do you prefer action to planning for action?

And when you administer those, the correlation matrix, Table 7.4, shows a very specific predicted pattern.

You get consistently high correlations within the six N items and high correlations within the six E items.

Which confirms that those items are measuring something shared.

Right, but critically, when you correlate an N item with an E item, the correlations are near zero.

And that near zero correlation is the statistical confirmation that N and E are independent dimensions.

The factor analysis confirms that the N items have saturations only with the N factor, and the E items have saturations only with the E factor.
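The pattern just described can be sketched from a loading matrix. The saturations of 0.7 and 0.6 below are invented for illustration and are not the values in the source's table.

```python
import numpy as np

# Hypothetical loadings: rows are the 12 items, columns are the N and E factors.
n_items = np.column_stack([np.full(6, 0.7), np.zeros(6)])  # load on N only
e_items = np.column_stack([np.zeros(6), np.full(6, 0.6)])  # load on E only
loadings = np.vstack([n_items, e_items])

# With independent factors, the implied correlations are products of loadings.
R = loadings @ loadings.T
np.fill_diagonal(R, 1.0)  # each item correlates perfectly with itself

within_n = R[0, 1]  # two N items
within_e = R[6, 7]  # two E items
across = R[0, 6]    # one N item with one E item

print(within_n, within_e, across)
```

The matrix falls into two blocks: positive correlations inside each set of six items and zeros wherever an N item meets an E item.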

And we can visualize this beautifully, as shown in Figure 7.3, using a geometric representation.

Since correlation can be represented as the cosine of an angle, a zero correlation corresponds to the cosine of 90 degrees.

So independent factors are represented by axes that are perpendicular, or orthogonal.

You plot the N axis and the E axis at a right angle, and the 12 items form two distinct independent clusters.

One group near the N axis and one near the E axis, showing that statistical separation of the two dimensions.

This objective result leads us to a really fascinating philosophical discussion.

What are these factors we've deduced?

Are they actual causal agencies inside the brain?

Are they just principles of organization?

Or are they merely statistical artifacts?

And the composite answer from the source material suggests they could be all three, depending entirely on the context.

It depends on the original theories, the specific tests you used, and the population you studied.

So it's perhaps best to view them as scientific concepts or constructs, kind of like gravitation or intelligence itself.

We can't physically see or touch extraversion, but it is an artifact that's immensely useful for prediction and description.

Asking if extraversion really exists is as pointless as asking if gravity really exists.

Exactly.

Factor analysis doesn't generate truth out of thin air.

It is a powerful tool.

A good handmaiden, but not a good mistress.

If you feed it bad data that was collected without any theoretical guidance, you will get meaningless, uninterpretable factors.

But when it's guided by theory, factor analysis offers tremendous descriptive economy.

It lets a researcher take a massive matrix of correlations, say 150 different tests, all correlated with each other, which is a 150-by-150 table of 22,500 entries, and summarize all that complexity into just a few dozen meaningful factor saturations.
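A quick bit of arithmetic shows where the 22,500 comes from. It is 150 times 150, so it counts every cell of the square table, including the diagonal and each pair of tests twice. The count of distinct pairs is my own addition, not a figure from the source.

```python
n_tests = 150

matrix_entries = n_tests ** 2                # every cell of the square table
unique_pairs = n_tests * (n_tests - 1) // 2  # each pair of tests counted once

print(matrix_entries, unique_pairs)
```

Either way, the point about descriptive economy stands: thousands of correlations are being summarized by a far smaller set of saturations.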

We do have to acknowledge one complication in factor analysis, though.

While the pattern of the items in that dimensional space is invariant, the precise placement of the axes we use to describe that pattern can be a bit arbitrary.

So if we rotated the N and E axes slightly in that Figure 7.3, the items' relationship to each other wouldn't change, but their factor saturations relative to the new axes would.
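That invariance is easy to demonstrate. In this sketch, the four items, their loadings, and the 20-degree rotation are all hypothetical. What matters is that the correlations implied by the loadings stay the same while the loadings themselves move.

```python
import numpy as np

def rotate(loadings: np.ndarray, degrees: float) -> np.ndarray:
    """Re-express two-factor loadings on orthogonal axes turned by `degrees`."""
    t = np.radians(degrees)
    rotation = np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]])
    return loadings @ rotation

# Hypothetical loadings of four items on two factors.
loadings = np.array([[0.7, 0.0],
                     [0.6, 0.1],
                     [0.0, 0.6],
                     [0.1, 0.5]])
rotated = rotate(loadings, 20)

# The implied item-by-item correlations are identical before and after...
same_pattern = bool(np.allclose(loadings @ loadings.T, rotated @ rotated.T))
# ...but the saturations on the new axes are different numbers.
loadings_changed = not np.allclose(loadings, rotated)

print(same_pattern, loadings_changed)
```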

But that's the problem.

If the placement of the axes is arbitrary, then our resulting factor saturations, our interpretation of validity, could also be arbitrary.

So how do psychometrists ensure objectivity in the final description?

They apply an objective rule for the rotation.

Thurstone developed the rule of simple structure.

This objective position is defined as the axis placement that results in the largest possible number of zero factor saturations.

A factor saturation below 0.10 is generally treated as zero, meaning that test doesn't load on that specific factor.

And in the neuroticism and extraversion example, those original orthogonal right-angle axes already achieved maximal simple structure.

All the N items loaded on N and had zero saturation on E, and vice versa.
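As a rough illustration of the rule, the sketch below tilts a clean N and E pattern by 30 degrees and then searches whole-degree rotations for the placement with the most near-zero saturations. The loadings are invented, the search is a brute-force stand-in rather than Thurstone's actual procedure, and it only tries orthogonal axes.

```python
import numpy as np

def rotate(loadings: np.ndarray, degrees: float) -> np.ndarray:
    """Re-express two-factor loadings on orthogonal axes turned by `degrees`."""
    t = np.radians(degrees)
    rotation = np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]])
    return loadings @ rotation

def count_near_zero(loadings: np.ndarray, threshold: float = 0.10) -> int:
    """Number of saturations small enough to be treated as zero."""
    return int(np.sum(np.abs(loadings) < threshold))

# Hypothetical clean structure: six N items, six E items.
clean = np.vstack([np.column_stack([np.full(6, 0.7), np.zeros(6)]),
                   np.column_stack([np.zeros(6), np.full(6, 0.6)])])
tilted = rotate(clean, 30)  # the same items described on badly placed axes

# Try every whole-degree rotation and keep the one with the most zeros.
# Several nearby angles can tie because of the 0.10 threshold.
best_angle = max(range(-90, 91), key=lambda d: count_near_zero(rotate(tilted, d)))

zeros_before = count_near_zero(tilted)
zeros_after = count_near_zero(rotate(tilted, best_angle))
print(zeros_before, zeros_after, best_angle)
```

On the tilted axes no saturation counts as zero. After the search, all twelve cross-saturations do, which is the simple structure the N and E example already had.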

But the rule of simple structure often forces the factors to be oblique, meaning correlated.

It challenges the assumption that factors have to be independent and orthogonal.

We see this tension in the analysis of mental abilities, as illustrated in Figure 7.4, which plots spatial and numerical ability tests.

Yeah, that figure shows three possible interpretations of the same data pattern.

Burt, who was a champion of G, might have preferred the original orthogonal axes, G and SN.

This suggests one big general factor, G, and a secondary factor that contrasts spatial and numerical abilities.

Thurstone, in his early work, might have preferred the other orthogonal axes, S1 and N1, arguing that general intelligence didn't exist, only two independent primary factors, spatial and numerical.

But if we apply that objective rule of simple structure, we're led to the oblique axes, S2 and N2, which are represented by broken lines.

These axes are rotated to best fit the item clusters, maximizing the number of zero factor saturations.

And the critical result is that these axes are not at a right angle.

They're about 75 degrees apart.

And since the correlation between the axes is the cosine of the angle between them, the factors themselves are correlated.

The cosine of 75 degrees is 0.26.

So forcing the statistical description to be objective by maximizing simple structure reveals that the primary factors of spatial and numerical ability are, in fact, correlated at r equals 0.26.
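The angle-to-correlation conversion is a one-liner. The function name below is just a label for this sketch.

```python
import math

def correlation_from_angle(degrees: float) -> float:
    """Correlation between two factor axes separated by the given angle."""
    return math.cos(math.radians(degrees))

orthogonal = correlation_from_angle(90)  # right-angle axes, as with N and E
oblique = correlation_from_angle(75)     # the oblique spatial and numerical axes

print(round(orthogonal, 2), round(oblique, 2))
```

Axes at a right angle give a correlation of zero to within floating-point error, and axes 75 degrees apart give about 0.26.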

Which perfectly aligns with Thurstone's own later empirical finding that his primary mental abilities were intercorrelated, just like we saw back in Table 7.2.

And the existence of these intercorrelated or oblique factors necessitates one final statistical step.

Right.

You have to postulate higher order factors to account for these correlations.

If factor A correlates with factor B, there must be some common factor at a higher level that's causing that correlation.

And this is how Thurstone reconciled his findings with Spearman's original work.

He suggested that Spearman's G factor emerges as a second order factor from the intercorrelations among his primary factors.

And this reconciliation leads us to the comprehensive, current understanding of intelligence: the hierarchical model of abilities, which is visualized in Figure 7.5.

That structure really defines the relationships clearly.

At the bottom, level one, you have the individual tests.

Those tests load onto the primary factors at level two, things like verbal ability, numerical ability, reasoning, and so on.

And finally, the intercorrelations among those primary factors define the highest level, the ubiquitous general factor G.

So this hierarchical structure provides the objectivity and the clarity that psychometrics strives for.

It takes the incredible complexities of human cognition and organizes them through rigorous objective statistical logic.

It's a really elegant synthesis of decades of research.

This has been a truly comprehensive deep dive into psychometrics, the statistical bedrock that's so necessary for understanding individual differences in psychology.

We started by establishing the descriptive power of statistics, showing how the normal curve, the mean, and the standard deviation let us summarize these vast datasets into just two meaningful numbers, providing that crucial dimension-free comparison.

Then second, we analyzed the correlation coefficient r and we really hammered home that indispensable lesson that r is not a percentage and that the variance explained is proportional to r squared.
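A tiny sketch makes the r-squared point tangible. The three example correlations are arbitrary choices.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2

for r in (0.3, 0.5, 0.9):
    # Print r next to the share of variance it accounts for, as a percentage.
    print(r, f"{variance_explained(r):.0%}")
```

A correlation of 0.5 accounts for a quarter of the variance, not half of it.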

And most importantly, we stressed that correlation only signals a connection.

It never ever implies direct causation, often masking a hidden third factor.

And finally, we explored the complex statistical logic required for mental testing.

We defined reliability, the challenges of external validity and the criterion problem, and the mechanics of factor analysis.

And we saw how this objective method lets researchers deduce the internal validity of theoretical constructs, culminating in that robust hierarchical model of abilities.

So here is a final provocative thought for you to consider as you encounter psychological claims in the future.

We spent a good amount of time detailing the advanced concepts of factor analysis, like rotating axes to find simple structure for neuroticism and extraversion.

But every single one of those higher-order constructs, every factor, every dimension, is fundamentally built upon the most basic descriptive statistic we discussed, the standard deviation.

When psychometrists successfully label extraversion as an independent dimension of personality, they are relying on the assumption that the underlying individual scores distribute normally, and that the spread of those scores, the SD, is consistent across samples.

Think about how the reliability of those complex factor structures rests entirely on the consistency of the simplest statistical measures.

And if you're looking to explore the mathematical logic further, the source material suggests consulting elementary statistics textbooks or diving into Sir Cyril Burt's classic work, The Factors of the Mind, for the deep logic that underpins factor analysis.

Thank you for joining us as we unpack this critical foundation of psychological science.

We really hope you feel far more equipped now to evaluate the numbers behind human behavior.

We'll see you next time on The Deep Dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Psychometrics represents the systematic application of statistical methodology to quantify and analyze psychological constructs and the natural variation that exists across individuals. The discipline begins with foundational statistical concepts, particularly the normal distribution, which provides a mathematical framework for understanding how psychological traits and abilities spread across populations. By calculating the mean and standard deviation, researchers compress extensive datasets into interpretable parameters that enable meaningful comparison across different measurement domains, whether examining intelligence scores or physical characteristics. The correlation coefficient functions as a primary analytical tool, measuring both the strength and direction of relationships between variables on a scale ranging from negative one to positive one. A critical conceptual distinction emphasized throughout is the fundamental difference between correlation and causation; two variables may move together systematically without one causing the other, a confusion that frequently misleads interpretation of research findings involving social, behavioral, or demographic variables. Sampling methodology directly affects the validity of statistical findings, requiring researchers to employ random or quota-based techniques rather than accidental convenience sampling to ensure representative data collection. The determination of statistical significance relies on probability theory and p-values, which establish whether observed patterns are likely reproducible findings or merely random fluctuations within the dataset. Two essential quality criteria define measurement instruments: reliability refers to the consistency of results across repeated administrations or internal items, while validity addresses whether an instrument actually measures the psychological construct it purports to assess. 
Factor analysis provides a sophisticated mathematical approach to identifying latent structures underlying observed correlations among variables. Through this technique, researchers discover both general factors, such as the g factor representing overall cognitive ability, and specific group factors corresponding to narrower skill domains like verbal or mathematical reasoning. The chapter concludes by presenting hierarchical conceptualizations of human cognitive abilities and dimensional frameworks for personality organization, demonstrating how statistical techniques transform raw numerical relationships into coherent psychological theories that capture the multifaceted nature of individual differences in behavior and cognition.
