Chapter 3: Defining and Measuring Variables

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

You know, we measure stuff all the time in daily life, right?

Steps, screen time, coffee intake,

maybe the tip we leave at a restaurant, usually pretty simple.

Yeah, straightforward stuff.

But what about when you want to measure something abstract?

Like, how do you measure someone's anxiety

or their motivation or even say pride or shame?

That's where it gets, well, way more complex, especially for researchers in behavioral science.

Absolutely.

And that's really the core challenge we're diving into today.

How do researchers take these intangible ideas, the feelings, thoughts, internal states that theories talk about and turn them into something they can actually, you know, measure?

This whole deep dive is focused on chapter three of Research Methods for the Behavioral Sciences, sixth edition.

It's all about nailing down that fundamental process.

So our mission today is basically to unpack how researchers define these tricky variables, how they grapple with measuring things you can't just see or touch, and crucially introduce the yardsticks they use, validity and reliability, to figure out if their measurements are even, well, any good.

Yeah, you need to know if your tool works.

Exactly.

Think of this as your guide to a foundational piece of the research puzzle.

If you can't measure your variables well, your whole study kind of falls apart.

For sure.

We'll kick off with those abstract ideas,

the constructs, and the clever trick researchers use, operational definitions, to try and capture them.

Then, yeah, a big chunk on the essentials,

validity and reliability.

What are they, how do you check them, why they matter so much together?

Okay.

Then we'll look at the different ways we classify measurements, the actual methods for getting data,

and definitely need to cover the pitfalls.

Things like bias and reactivity that can totally skew your results.

All right, let's jump in.

So variables,

the basic building blocks, they're characteristics or conditions that change or have different values for different people.

Right, like in that waitress example from the book, shirt color, tip amount, easy peasy, you can see it, you can count it.

But then you have things like resilience or maybe creativity in kids.

You can't just point a measurement device at that.

And that's the heart of it for behavioral science.

Theories pop up trying to explain why people do things and they propose these internal mechanisms or attributes exist.

The book calls them constructs or hypothetical constructs.

Like intelligence, motivation, anxiety, stuff you can't directly observe.

Exactly, you can't see motivation itself.

But a theory might say, okay, an external thing like getting a reward influences this internal motivation construct.

And that then influences behavior, like how well someone performs a task.

So you see the reward, you see the performance, but that middle part, the motivation is invisible.

How do you measure that?

That's where the operational definition comes in.

It's the researcher's practical solution.

Since you can't measure the construct directly, you define a specific procedure for measuring an external observable behavior that you assume reflects the construct.

And that procedure itself becomes your working definition and measure of the construct for that study.

You're operationalizing it.

Making it workable, measurable.

The classic example is intelligence, right?

Intelligence is the construct.

An IQ test is the operational definition.

It measures how you answer questions, observable behavior, and that score becomes the standard measure for intelligence.

Perfectly put, and it applies to variables you manipulate too.

If your study is about hunger, you might operationalize it as, say, 12 hours since last food intake.

That's concrete, that's measurable.

But, and this seems like a really important but, the operational definition isn't the same as the construct itself.

It's just a proxy, an indirect link.

Precisely, and that indirectness is where the limitations creep in.

Problem one, the operational definition might leave out key parts of the construct.

Like measuring depression only by how much time someone spends in bed.

It misses the thoughts, the feelings.

Exactly, it's incomplete.

The book does mention using multiple operational definitions can help capture more angles, but that gets complicated.

And the second problem.

It might include extra stuff that isn't part of the construct.

So maybe your anxiety questionnaire score is influenced not just by anxiety, but also by how well someone reads or how willing they are to admit they feel anxious.

That's noise.

Okay, so they're necessary approximations.

If you're a researcher, how do you pick one?

Just invent it?

Huh, probably not the best idea.

The standard practice is to look at previous research.

See how others studying the same thing defined and measured it.

You usually find the details in the methods section of their papers.

Using established methods makes your study comparable to others, right?

Exactly, builds on existing work.

But, and the book encourages this, you should still think critically, is that established way really the best?

Maybe a different approach would be better, more valid.

Questioning those things can spark new research directions.

So we have our variables, especially these constructs, and we've operationalized them.

We have a measurement procedure.

How do we know if it's any good?

Which brings us, I guess, to the big two,

validity and reliability.

The cornerstones.

Validity asks, is this procedure actually measuring what I claim it's measuring?

And reliability asks, is it consistent?

If I measure the same thing again, will I get a similar result?

And you often check these by looking at how consistent measurements are with each other.

Yeah, often we look at the relationship between two sets of measurements.

The book uses scatter plots to show this visually.

Right, the dots on the graph.

If high scores on one measure generally pair with high scores on another, that's a consistent positive relationship.

Like height and weight, generally.

And if high on one pairs with low on the other, that's a consistent negative relationship.

Like maybe number of errors and test score.

Good example.

And if the dots are just all over the place, no pattern, that's no consistent relationship.

Shoe size and IQ?

Ha.

Hopefully.

And we use correlation, that number from plus one to negative one, to put a value on that consistency.

A strong correlation, positive or negative, between two measures that are supposed to be measuring the same thing is often good evidence for both validity and reliability.
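
The three scatter-plot patterns described above can be put into numbers. This is a minimal sketch of the Pearson correlation, assuming small made-up data sets for the height/weight and errors/test-score examples; the function name `pearson_r` and every value below are hypothetical and not taken from the book.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations: positive when high goes with high
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)

# Hypothetical data, invented for illustration only
heights = [60, 62, 65, 68, 70, 73]        # inches
weights = [115, 120, 140, 155, 165, 190]  # pounds
errors = [0, 1, 2, 4, 5, 7]               # mistakes on a test
test_scores = [98, 95, 90, 82, 80, 70]    # percent correct

r_pos = pearson_r(heights, weights)    # high pairs with high
r_neg = pearson_r(errors, test_scores) # high pairs with low

print(f"height vs weight: r = {r_pos:.2f}")
print(f"errors vs score:  r = {r_neg:.2f}")
```

The sign of r gives the direction of the relationship, and its distance from zero gives the consistency.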

Okay, let's really zero in on validity first.

This feels like the trickier one, especially for constructs.

Is my IQ test really measuring intelligence?

Is my anxiety measure really anxiety?

It's the million dollar question.

Because like you said, there's no physical intelligence thing to compare it to.

People used to try measuring brain bumps or skull size for intelligence. Totally invalid.

Or think about that absent-minded professor idea.

Super high IQ score, maybe, but can't find their keys.

Does that IQ score capture everything we mean by intelligence?

That's questioning its validity.

The book lays out six different types, or maybe facets of validity.

Ways to gather evidence.

Yeah, think of them as different lines of evidence.

First is face validity.

Does it just look right superficially?

Like IQ test questions seem like they're about thinking and problem solving.

Exactly, it's the weakest type, very subjective.

And sometimes if it's too obvious, participants might guess what you're after and change their answers.

Okay, what's next?

Concurrent validity.

Does your new measure give scores that relate strongly to scores from an established, already validated measure of the same thing?

So if your quick new depression scale correlates well with the scores from a long trusted depression inventory, that's concurrent validity.

Shows consistency between the new and the old method.

Right, then there's predictive validity.

Does your measure accurately predict some future behavior or outcome that, according to theory, it should predict?

Like does an aptitude test score predict job performance later on?

Perfect example.

Or does a measure of school readiness predict first grade reading scores?

Got it, then the big one, construct validity.

Yeah, this is often seen as the ultimate goal, especially for constructs, but it's also the hardest to nail down.

It means your measurement procedure behaves exactly the way the theory says the construct itself behaves.

So if theory says anxiety increases under pressure,

a measure with construct validity should show scores going up when you put people under pressure.

Precisely, and it's not just one finding.

Construct validity is built up gradually, study by study, showing your measure consistently fits within the network of theoretical relationships surrounding the construct.

It's an ongoing process.

Okay, two more, convergent validity.

This is about showing that different ways of measuring the same construct give you similar results.

So if you measure happiness using a self-report questionnaire, and by observing smiling behavior, and maybe even some physiological measure, and the scores from these different methods are strongly related, they're converging on the construct of happiness.

Different paths leading to the same place.

You got it.

And the flip side is divergent validity.

This means showing your measure has a weak relationship or no relationship with measures of different unrelated constructs.

So your happiness measure should correlate with other happiness measures.

Convergent.

But it shouldn't correlate strongly with, say, intelligence or maybe political affiliation.

Exactly, divergent.

Showing both convergent and divergent validity together provides really strong evidence that your measure is hitting the specific target you want and not just measuring some general positive feeling or something else entirely.

Makes sense.

Okay, so that's validity, measuring the right thing.

Reliability is about consistency, getting the same score repeatedly.

Right.

Stability.

If you measure something stable, like maybe adult height, you expect to get pretty much the same number each time, assuming you're measuring carefully.

The idea here is that any score you get, the measured score, is a combination of the person's true score plus some random error.

Error, like background noise? Random fluctuations?

Yeah, exactly.

Things like the person's mood changing slightly, a distraction in the room, maybe guessing on one or two answers on a test, the experimenter maybe reading a measurement slightly differently.

It's random variation.

So lots of error means low reliability.

Scores jump around.

Yep.

And very little error means high reliability. Scores are stable and consistent.

Think about measuring reaction time.

One trial can be way off just by chance.

Lots of error.

That's why they do lots of trials and average them.

Exactly.

Averaging smooths out the random error, making the average score much more reliable than any single trial score.
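
The idea that a measured score is a true score plus random error can be simulated. This is a minimal sketch, assuming a hypothetical true reaction time of 300 ms and normally distributed error with a standard deviation of 50 ms; both values are invented for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_SCORE = 300.0  # hypothetical true reaction time, in ms
ERROR_SD = 50.0     # hypothetical size of the random error, in ms

def one_trial():
    """Measured score = true score + random error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

def averaged_score(n_trials):
    """Mean of several trials from the same person."""
    return statistics.mean(one_trial() for _ in range(n_trials))

# Repeat each kind of measurement 500 times and see how much it jumps around
single_scores = [one_trial() for _ in range(500)]
averaged_scores = [averaged_score(25) for _ in range(500)]

sd_single = statistics.stdev(single_scores)
sd_averaged = statistics.stdev(averaged_scores)

print(f"spread of single-trial scores: {sd_single:.1f} ms")
print(f"spread of 25-trial averages:   {sd_averaged:.1f} ms")
```

The averaged scores cluster much more tightly around the true score, which is why the average counts as the more reliable measurement.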

What kind of things cause this error?

The book mentions a few key sources.

Observer error, like two judges scoring a gymnastics routine slightly differently.

Environmental changes, maybe the room gets warmer or noisier during testing.

And participant changes, the person gets tired, bored, more focused, less focused.

And just like validity, there are different ways to check reliability.

Yep, depending on what kind of consistency you're interested in.

Okay.

You've got test-retest reliability:

give the same measure to the same people at two different times, then correlate the scores, check stability over time.

Makes sense.

Parallel forms reliability is similar, but you use two different versions of the test at the two times to avoid people just remembering their answers from the first time.

Ah, clever.

When you have observers rating behavior, you use inter-rater reliability.

You check how consistent the ratings are between the different observers, usually using correlation or percent agreement for that.

And for things like questionnaires with lots of items, there's split-half reliability.

You basically split the test in half, say odd versus even items, score each half separately, and see how well those two scores correlate.

It tells you about the internal consistency: are all the items basically measuring the same underlying thing?

Got it.

So, validity and reliability.

How do they connect?

Is one more important or?

There's a really crucial relationship.

Reliability is absolutely necessary for validity.

Think about it.

If your measure gives wildly different results every time you use it, it can't possibly be accurately measuring anything specific.

Like that IQ example, 75 one day, 160 the next.

That's not measuring intelligence consistently, so it can't be valid.

Exactly.

It's too unstable to be meaningful.

So reliability sets a limit on validity.

But being reliable isn't enough on its own, right?

Nope.

Reliability does not guarantee validity.

This is key.

You can have a measure that's perfectly consistent, perfectly reliable, but totally wrong for what you wanna measure.

Like your example of measuring head circumference for intelligence.

Perfect.

You'll get a very reliable measurement of head size every time.

Super consistent, but it's completely invalid as a measure of intelligence.

It's measuring the wrong thing reliably.

Consistent doesn't mean correct.

Got it.

The book also mentioned accuracy briefly.

How's that different?

Accuracy is about matching an established standard.

Like does your bathroom scale match the official weight standard?

Does your ruler match the standard inch?

In behavioral science, we often lack those agreed upon standard units for things like depression or creativity.

So while accuracy is important in physics, validity and reliability are usually the main concerns for us.

Okay, that clarifies things.

So we've got our operational definition.

We've thought about its validity and reliability.

Now we actually use it to collect data.

And how we classify that data matters, right?

This brings us to scales of measurement.

Right, measuring involves assigning individuals to categories or giving them scores.

The kind of information those categories or scores convey determines the scale.

And this is super important because it dictates what statistical analysis you can actually do later.

There are four main scales.

Starting with the simplest.

Nominal scale.

Yep, categories are just names, labels,

qualitative differences, like classifying people by their college major, art, chem, psych,

or by their gender identity or preferred type of music.

So you can only say if people are the same or different on that variable.

That's it, no order, no quantity.

Art isn't more or less than chem on the nominal scale of major, just different categories.

Then ordinal scale.

Here the categories have names and they have a logical order or rank, like finishing places in a race, first, second, third, or T-shirt sizes,

small, medium, large, or maybe socioeconomic class,

low, middle, high.

So now you know the direction of difference.

First is faster than second, large is bigger than small.

Exactly, you know the direction, but, and this is key, you don't know the magnitude of the difference between categories.

The gap between first and second place might be tiny, while the gap between second and third might be huge.

The intervals aren't necessarily equal.

Like on a satisfaction survey, the difference between satisfied and very satisfied might not be the same size as the difference between neutral and satisfied.

Precisely, ordinal tells you order, but not by how much.

All right, then the scales that often get grouped together because they allow more math, interval and ratio.

Right, both of these have ordered categories and the intervals between the categories are equal in size.

Think of a standard thermometer measuring temperature in Fahrenheit or Celsius.

The difference between 10 degrees and 20 degrees is the same amount of temperature change as the difference between 70 degrees and 80 degrees.

So you know the direction and the magnitude of the difference.

Yes.

The big difference between interval and ratio scales lies in the zero point.

The arbitrary versus absolute zero thing.

You got it.

An interval scale has an arbitrary zero.

Zero doesn't mean the complete absence of the thing being measured.

Zero degrees Celsius isn't the absence of all heat, it's just the freezing point of water.

Zero altitude is sea level, but you can go below sea level.

Okay, and ratio scale.

A ratio scale has a true or absolute zero point.

Zero really means none of the variable exists.

Height, weight, reaction time, number of errors on a test.

Zero inches means no height.

Zero errors means no mistakes were made.

And because of that true zero, you can make ratio comparisons.

Like eight seconds is twice as long as four seconds.

Exactly, you can talk about ratios.

20 pounds is twice as heavy as 10 pounds.

You can't say 20 degrees C is twice as hot as 10 degrees C because the zero point is arbitrary.
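
A quick way to see why the arbitrary zero rules out ratio statements is to change units and watch what happens to the ratio. This is a minimal sketch using the standard Celsius-to-Fahrenheit and pound-to-kilogram conversions; the helper names are hypothetical.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def pounds_to_kilograms(lb):
    return lb * 0.45359237

# Interval scale: the "twice as hot" ratio does not survive a change of units
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_celsius, ratio_fahrenheit)  # 2.0 in Celsius, 68/50 in Fahrenheit

# Equal intervals do survive: both of these gaps are the same size
gap_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
gap_high = celsius_to_fahrenheit(80) - celsius_to_fahrenheit(70)
print(gap_low, gap_high)

# Ratio scale: zero means none, so "twice as heavy" holds in any unit
ratio_pounds = 20 / 10
ratio_kilograms = pounds_to_kilograms(20) / pounds_to_kilograms(10)
print(ratio_pounds, ratio_kilograms)
```

The Celsius ratio of 2 becomes 1.36 in Fahrenheit, so it was never a fact about temperature itself, while the weight ratio stays at 2 in either unit.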

That makes sense.

But the book seemed to suggest that for a lot of statistical analyses, the difference between interval and ratio isn't always critical.

Often, yes.

For many common statistical tests, like t-tests, ANOVA, calculating means and standard deviations, the crucial feature is having those equal intervals, which both interval and ratio scales provide.

So while the true zero is conceptually important, the practical choice of statistical tool often doesn't change between interval and ratio.

But the jump from nominal or ordinal up to interval or ratio is a big deal for stats.

Huge deal.

You can do much more sophisticated quantitative analysis with interval and ratio data, calculating means, variances, running powerful tests, stuff you generally can't do meaningfully with just nominal categories or ordinal ranks.

That's why researchers often prefer to get interval or ratio data if they can.
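
As a rough illustration of how the scale limits the summary you can compute, the sketch below pairs each scale with one statistic. The pairing (mode for nominal, median for ordinal, mean for interval or ratio) is a common rule of thumb assumed here, not something stated in this overview, and all the data are hypothetical.

```python
import statistics

# Nominal: categories are only labels, so the most frequent category
# is about the only "typical value" that makes sense
majors = ["psych", "art", "psych", "chem", "psych", "art"]
typical_major = statistics.mode(majors)

# Ordinal: categories have an order, so a middle rank is meaningful,
# but the numeric codes are only ranks, not amounts
size_labels = {1: "small", 2: "medium", 3: "large"}
shirt_codes = [1, 2, 2, 3, 3, 3, 2]
typical_size = size_labels[statistics.median_low(shirt_codes)]

# Interval or ratio: equal intervals make the arithmetic mean meaningful
reaction_times_ms = [310, 295, 330, 305]
mean_reaction_time = statistics.mean(reaction_times_ms)

print(typical_major, typical_size, mean_reaction_time)
```

Averaging the major labels is impossible, and averaging the shirt codes would treat ranks as if they were equally spaced amounts, which is the concern raised in the discussion of ordinal scales.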

What about those common measures like IQ scores or those Likert scales on surveys, like one to five rating scales?

Are they truly interval?

The gap between strongly agree and agree might feel different to someone than the gap between neutral and agree.

That's a great point.

And it's what the book calls an equivocal measurement situation.

Technically, you can argue they're ordinal because we can't prove the psychological distance between, say, a four and a five is exactly the same as between a one and a two.

But researchers seem to treat them as interval all the time.

They do.

By longstanding convention and for practical reasons, measures like IQ scores and some scores from rating scales are typically treated as if they are interval scales.

It allows for more powerful statistical analysis.

It's an assumption that the intervals are close enough to equal and it's been debated, but it's standard practice.

So choosing your measurement procedure also involves thinking about the scale you'll get and whether that scale lets you answer your research question.

Absolutely.

What kind of comparison do you need to make?

Just difference? Direction? Magnitude? Ratio?

The scale you choose limits the information you get and the conclusions you can draw.

Okay, deep breath.

We've covered defining variables,

operationalizing constructs, checking validity and reliability, and classifying the data using scales.

Now,

how do researchers actually collect this data?

What are the different modalities of measurement?

Right.

These are the different ways or categories of methods researchers use to get those external observable measures that stand in for the construct.

The book uses fear of flying as an example to show the three main types.

First up, self -report measures.

The most direct way in a sense.

Just ask a person, questionnaires, surveys, interviews.

On a scale of one to 10, how anxious does flying make you feel?

Strength seems obvious.

You're asking the person who knows their internal state best, high face validity too.

True, but the major weakness is distortion.

People might not tell the truth, maybe social desirability bias kicks in, wanting to look good, they might misunderstand.

Their answers could be swayed by how the question is worded or even who's asking.

Validity can be shaky.

Okay,

second modality, physiological measures.

This involves measuring biological responses assumed to link to the construct.

For fear of flying, maybe tracking heart rate, galvanic skin response (GSR, basically sweatiness),

maybe even brain activity using fMRI while they imagine flying.

Strength here seems to be objectivity.

The machine gives you a number.

Exactly, it's less subjective, potentially more precise.

Weaknesses though, often needs expensive, specialized equipment.

The measurement setting itself can be artificial and stressful, potentially affecting the readings.

And there's still a validity question.

Is that higher heart rate definitely fear?

Or could it be excitement or just physical exhaustion from being wired up?

The link isn't always one to one.

Makes sense.

And the third one,

behavioral measures.

This is about observing and measuring overt actions.

Could be anything from counting how many times a child shares a toy for altruism to reaction time on a computer task for alertness to how many problems someone solves for ability.

Or for fear of flying, maybe how close someone is willing to walk towards an airplane.

Good one.

Or whether they avoid watching movies about planes.

The strength is the sheer variety. You can often find or design a behavior directly relevant to your construct.

Sometimes the behavior is the thing you ultimately care about.

Weaknesses?

Behavior can be context dependent.

A single behavior might not capture a complex construct fully.

Again, people might act differently if they know they're being watched.

Which leads us nicely into some of those challenges and potential pitfalls, even if you have a decent measure.

The book talks about artifacts.

Yeah, artifacts are like unwanted contaminants in your measurement process.

External factors that distort the scores.

They mess with validity because you're measuring the artifact's influence too, and potentially with reliability if the artifact isn't constant.

And two really big ones in behavioral science are?

Experimenter bias and participant reactivity.

Huge potential problems.

Let's take experimenter bias first.

The researcher's own expectations messing things up.

It's a classic finding.

The researcher knows the hypothesis or expects certain groups to perform differently.

And that expectation consciously or unconsciously influences how they interact with participants or even how they record data.

Like being slightly more encouraging to the group they expect to do well.

Exactly.

Subtle cues, tone of voice, body language, or even just interpreting ambiguous responses in line with the hypothesis.

The famous Rosenthal and Fode study with the maze-bright and maze-dull rats showed this powerfully.

The rats were assigned randomly, but the experimenters' beliefs about them actually affected the rats' maze performance.

Wow, that's sobering.

How do you fight that?

Standardization helps: using scripts, maybe automated instructions. But the gold standard is blinding.

In a single-blind study, the experimenter interacting with participants doesn't know which condition the participant is in.

For example, real drug versus placebo.

In a double-blind study, neither the experimenter nor the participant knows the group assignment or expected outcome.

This is common and really important in clinical trials.

And just to clarify, this is about the researcher's knowledge being blinded, not just the participant being unaware of the specific hypothesis.

Got it, so that's the researcher side.

What about the participant side?

We're not measuring inert objects, people react.

Exactly, and that leads to demand characteristics and participant reactivity.

Humans are smart, active participants.

Demand characteristics are clues in the study.

Yeah, cues in the setting, the instructions, the questions, anything that might hint to the participant what the study is really about or what the researcher expects to find.

And participant reactivity is how they respond to those clues or just to the fact they're being watched.

Precisely, they modify their natural behavior because they're in a study.

Like Orne's crazy study where people kept doing tedious addition for hours. They reacted by figuring the real test was about endurance, not math, and played along with what they thought was expected.

And they might adopt different subject roles based on how they react.

Right, the book lists a few.

The good subject.

Tries to help the researcher confirm the hypothesis.

Problematic because results might not generalize.

The negativistic subject.

Tries to figure out the hypothesis and actively mess it up.

Also problematic.

The apprehensive subject.

Worries about being judged, tries to look good.

Social desirability.

Not giving truthful responses.

And the ideal one, the faithful subject.

Follows instructions carefully, acts naturally, doesn't try to guess or play along.

That's what you hope for.

So how do researchers try to get more faithful subjects and minimize reactivity?

Doing research in a natural setting, a field study, helps, as people might not even know they're being observed.

Disguising the true purpose, maybe embedding key measures within others, using less obvious low face validity measures can work, though ethical considerations are paramount with deception.

Also just reassuring participants about confidentiality and anonymity and stressing the importance of honest natural responses can encourage that faithful role.

Man, it's like navigating a minefield sometimes.

So many things to consider just in the measurement step.

Pulling it all together, how does a researcher actually select a measurement procedure?

It's definitely not trivial.

The absolute best place to start is reviewing previous research.

How have others measured this variable?

Are there established measures with known validity and reliability?

Using a well-vetted established procedure is often the safest bet, and it makes your results comparable to the existing literature.

But don't just take it off the shelf without thinking.

Right, you still need to ask,

is this specific measure right for my research question, for my population?

Is it sensitive enough to pick up the changes I expect to see?

Does it give me the right scale of measurement for the stats I need to run and the conclusions I want to draw?

And questioning the status quo can be good too.

Absolutely, maybe the standard way isn't great.

Maybe you have an idea for a better way to measure it.

That can lead to important new research.

Just be aware that developing a new measure and doing all the work to demonstrate its validity and reliability from scratch, that's a major project in itself.

This has been incredibly clarifying.

It really hits home how defining and measuring variables properly isn't just some preliminary step, it's like the absolute bedrock of the entire research process.

Totally.

We went from constructs, these abstract ideas, to operational definitions as the way to measure them, then dove deep into why validity, measuring the right thing, and reliability, measuring it consistently, are non-negotiable.

Yeah, and how they relate: reliability is needed for validity but doesn't guarantee it.

Then the scales, nominal, ordinal, interval, ratio, determining what kind of information we get, and the different ways of getting it: self-report, physiological, behavioral.

And finally, watching out for those artifacts like experimenter bias and participant reactivity that can really distort things.

It's a lot,

but so crucial.

So as you, the listener, encounter research, maybe you start asking these questions.

How exactly did they define that key variable?

What was their operational definition?

Does it seem like a valid way to measure it?

Was it likely reliable?

What scale did they end up with?

And what does that allow them to say?

Yeah, it gives you a much more critical eye.

Definitely.

So maybe a final thought for everyone listening.

Pick something you're curious about regarding human behavior.

Anything, kindness, burnout, political tolerance, whatever.

Now really think,

how would you even start to define that operationally?

What specific measurable thing would you track?

What would be the biggest hurdles in making sure your measurement was actually capturing that specific thing, doing it consistently and wasn't being thrown off by all these potential artifacts we talked about?

It really makes you appreciate the challenge, doesn't it?

It's often much harder than it looks on the surface.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Converting abstract theoretical constructs into measurable, observable variables represents a foundational challenge in psychological research. Operational definitions serve as the critical bridge between conceptual ideas—such as intelligence, anxiety, or self-esteem—and the concrete procedures that generate quantifiable data amenable to systematic analysis. Establishing measurement quality depends on two interconnected properties: validity ensures that an instrument genuinely captures what it purports to measure, encompassing multiple forms including surface-level appearance, correlation with concurrent external criteria, prediction of future outcomes, and alignment with underlying theoretical constructs verified through patterns of correlation and differentiation with related and unrelated variables; reliability guarantees that repeated applications of the same instrument under comparable conditions yield consistent results, demonstrated through stability across time intervals, consistency among independent raters or judges, and internal coherence across items measuring the same construct. The measurement landscape includes four distinct scales—nominal for categorical assignments, ordinal for ranked orderings, interval for equally-spaced numerical values, and ratio for data with an absolute zero point—each permitting different statistical operations and levels of mathematical sophistication. Researchers choose among three major measurement approaches: direct questioning of participants regarding their thoughts and feelings, recording of biological or physiological indicators tied to psychological processes, and systematic documentation of actual behaviors in experimental or naturalistic contexts, each presenting unique tradeoffs between scientific rigor, practical implementation, cost, and acceptability to research participants. 
Numerous technical obstacles complicate measurement, including ceiling effects and floor effects that compress variability at scale extremes, constrained score ranges that reduce sensitivity to detecting meaningful differences, and contaminating influences such as researcher expectations affecting participant responses, implicit cues about research objectives shaping behavior, and the tendency for measurement processes themselves to alter the phenomenon under investigation. Effective measurement selection requires navigating tensions among competing objectives: achieving strong validity and reliability while preserving feasibility and participant comfort, adhering to ethical standards, and retaining adequate sensitivity to identify genuine effects in the construct of interest.
