Chapter 9: Experimental Designs: Within-Subjects Design

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Okay, let's unpack this.

Welcome back to the deep dive.

Today we're tackling a really core idea in how researchers figure things out, especially in behavioral sciences.

We're diving headfirst into the within subjects experimental design.

Think of it as the strategy where you measure the same exact person under different conditions.

Our map for this is Chapter 9 from Research Methods for the Behavioral Sciences, the sixth edition.

Yep.

And understanding this design is really key because you see it everywhere.

In studies looking at how people change over time maybe, or how different experiences impact the same folks.

Or comparing treatments where you really need that consistency.

Exactly.

So, our mission today is to pull out the essential pieces of this chapter: what makes within-subjects designs unique, the sneaky problems they can run into, how researchers try to solve those problems, and where you typically see this design in action.

Right.

So you should walk away knowing exactly what researchers gain and maybe what they risk when they go this route.

That's the plan.

Okay, so the absolute core idea: we're using the same people across different conditions.

How is that fundamentally different from, say, a between-subjects design? What's the defining thing here?

The defining characteristic is exactly that.

Every single person in your study gets all of the different treatment conditions you're comparing.

Ah, okay.

So instead of having group A get treatment one, and a totally separate group B get treatment two, you just have one group, and they go through everything.

So say I was testing, I don't know, two different meditation apps, app A and app B, on stress levels. In a between-subjects design, I'd have one group use only A and another use only B. But within subjects?

You'd have one group of people, and they would all use app A, and they would all use app B.

Maybe app A one week, app B the next, measuring stress after each.

Gotcha.

And because you're getting multiple measurements from the same individuals, you sometimes call it something else too.

Yep.

It's often called a repeated measures design.

Makes sense, right?

You're repeating the measurement on the same person under different circumstances.

Repeated measures.

Yeah, like physics.

And the book mentioned this can happen in a couple of ways, structure wise.

It's not always just one thing right after another.

Exactly.

It can happen sequentially over time, like that swearing in ice water study example.

Same people put their hand in ice water while swearing, then later, maybe after a break, they did it again while saying a neutral word.

Two conditions spread out.

Or it can happen altogether in one session, maybe a memory experiment where you see a list with, say, some funny words and some neutral words all mixed together, and then you recall them.

The conditions, remembering funny versus neutral words, are kind of experienced within that single session.

Whether it's spread out or all at once, the big advantage seems obvious straight off.

You've got equivalent groups because it's the same group.

That's a massive point.

You don't have to worry that the group who got app A was just naturally less stressed to begin with than the group who got app B.

You completely eliminate those confounds that come from differences between the groups because, well, there are no differences between the groups.

They're the same people.

It seems like a huge win.

It is.

And what's really interesting and maybe less obvious is that this gives the within-subjects design a pretty big statistical advantage.

It often makes it more powerful.

More powerful, meaning more likely to actually find a real effect if there is one.

Precisely.

Because you're looking at the change within each person, the stats can actually sort of isolate and mathematically remove the noise caused by those stable individual differences.

The fact that some people just generally score higher or lower than others.

It subtracts out the baseline noise for each person.

Kind of, yeah.

It lets the effect of the treatment stand out more clearly against that background variation between people.

That's a key takeaway right there.

Within-subjects designs can be statistically more powerful because they control for individual differences so effectively.

Each person serves as their own baseline.
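
To see that power advantage in numbers, here's a minimal Python sketch. The scores are made up for illustration, not taken from the textbook, and the helper names independent_t and paired_t are just labels chosen here. It runs the same twelve scores through two formulas: one that treats them as two separate groups, and one that works on each person's difference score.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical scores for the same six people under two apps.
app_a = [40, 55, 62, 48, 70, 51]
app_b = [44, 58, 65, 53, 73, 54]

def independent_t(x, y):
    """Welch-style t that treats the two sets of scores as separate groups."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se

def paired_t(x, y):
    """Repeated-measures t computed on each person's difference score."""
    diffs = [b - a for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

t_between = independent_t(app_a, app_b)
t_within = paired_t(app_a, app_b)
print(f"treated as two groups:        t = {t_between:.2f}")
print(f"treated as repeated measures: t = {t_within:.2f}")
```

Both statistics share the same mean difference of 3.5 points. Only the error term changes: the paired version drops the stable person-to-person spread, so its t comes out much larger.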

Okay, that makes a ton of sense.

Getting rid of that between group noise, boosting power, sounds great.

But measuring the same person multiple times, especially if it's spread out over time, it feels like that could open up a whole different set of problems.

And you'd be right.

This is where we hit the unique threats to internal validity for within-subjects designs.

When you repeat measurements, things other than your treatment can change and affect the scores.

Like what kind of things?

Still worried about room temperature and stuff?

Well, yeah.

You still have those general environmental variables, sure.

You control them like you would in any study: randomization, holding things constant if you can.

But the big new worries here are time-related factors.

Time-related.

Just stuff that happens because time is passing during the experiment.

Exactly.

If your treatments happen at different times, the simple passage of time can introduce confounds.

The textbook details several key ones.

Okay, like history.

What's that?

Sounds dramatic.

Huh, yeah.

History just means any environmental event outside the study that happens between your measurements and affects participants' scores.

Think back to our meditation app study, app A one week, app B the next.

What if, during the week they used app B, there was a massive storm and a power outage that stressed everyone out?

Oh, right.

So their stress levels might go up, but it had nothing to do with app B.

It was the storm.

Exactly.

That external event messes with your results.

You might wrongly blame app B.

Okay, external stuff.

What about maturation?

Is that like people getting older?

It can be, especially in long studies.

Maturation refers to systematic changes within the participants themselves just due to time passing, not outside events.

Think about a study teaching kids a new skill over several months.

They're going to get better just because they're developing naturally.

Yeah, makes sense.

So an improvement you see later might just be that natural maturation, not necessarily the effect of the specific teaching method used in the later phase.

Or think about elderly participants maybe experiencing some cognitive decline over a long study.

Got it.

Internal changes over time.

Then there's instrumentation.

Is that like the scale breaking?

It could be equipment calibration drifting, yes, but it also covers changes in the observer or the scoring criteria.

If you have people rating, say, participant anxiety levels by observing them, maybe the raters get more experienced or their standards shift slightly between the first observation session and the second.

Ah, so the measuring tool itself, even if it's a human, changes.

Right.

So a difference in scores might be due to the changing instrument, not the treatment.

Okay.

And statistical regression or regression toward the mean.

This one always feels a bit weird.

It is a bit counterintuitive.

Statistical regression is this tendency for really extreme scores, super high or super low on one measurement, to be less extreme, closer to the average, on a second measurement.

Why does that happen?

Because any single score is usually a mix of stable things, like underlying ability or a stable trait, and unstable things, like luck, mood that specific day, or random error.

If someone scores way off the charts high the first time, it might be high ability plus unusually good luck.

The next time, their ability is probably still high, but their luck is likely to be more average.

So their score naturally drifts back towards their own average or the group's average.

Exactly.

It regresses toward the mean.

Is it mainly a threat if you select participants for your study because they had extreme scores initially?

If you pick only the most anxious people for your meditation app study based on a pretest.

Then test them again after they use the app.

Some of the decrease in their anxiety scores might just be them regressing back to their more typical, slightly less extreme anxiety level, not purely the effect of the app.
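
A quick simulation can make regression toward the mean less mysterious. This is only a sketch under simple assumptions: each score is a stable part plus fresh random luck on each test, nobody gets any treatment, and all the numbers (10,000 simulated people, a mean of 50, standard deviations of 10) are invented.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Each simulated person has a stable level plus independent luck on each test.
stable = [rng.gauss(50, 10) for _ in range(10_000)]
test1 = [s + rng.gauss(0, 10) for s in stable]
test2 = [s + rng.gauss(0, 10) for s in stable]

# Select the 10% with the most extreme (highest) scores on the first test.
cutoff = sorted(test1)[int(0.9 * len(test1))]
selected = [i for i, score in enumerate(test1) if score >= cutoff]

first_mean = mean(test1[i] for i in selected)
second_mean = mean(test2[i] for i in selected)
print(f"selected group, test 1: {first_mean:.1f}")
print(f"selected group, test 2: {second_mean:.1f}")
```

With no treatment at all, the selected group's second average falls back toward 50, which is exactly the kind of drop a researcher could mistake for a treatment effect.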

Okay.

That's subtle, but important if you're targeting extremes.

So those are all related to time passing, but then there are order effects.

That sounds different.

More about the sequence.

That's the key difference.

Order effects happen specifically because the experience of participating in one treatment condition influences how you score in a later treatment condition.

It's not just about time passing between sessions, but about what you did in the earlier session.

Like getting better at the task just from doing it.

Yep.

Practice effects are super common.

People get better at the task, figure out the procedure, get more comfortable.

Their scores might improve just from repetition.

Or the opposite.

Right?

Fatigue effects.

People might get tired, bored, lose focus over multiple conditions.

Their performance might decline just because it's the third or fourth thing they've had to do.

The source calls these general trends progressive error.

Progressive error.

Are there more specific kinds of order effects?

Yes.

Carryover effects are a big one.

This is when the effect of a specific treatment lingers and influences performance in the next condition.

Maybe app A teaches a specific relaxation technique.

And then even when using app B, they might still unconsciously use that technique they learned from A.

Exactly.

The effect of app A carries over and contaminates the measurement for app B.

Or think about drug studies.

The effects of one drug might not have fully worn off before the next one is administered.

Ok, carryover.

What about contrast effects?

Contrast effects are more subjective.

Your perception or experience of a later treatment is influenced by how it compares to the one right before it.

If treatment A was incredibly difficult, maybe treatment B feels surprisingly easy by comparison, even if it's objectively moderately hard.

Like a room feeling really dark right after you were outside in bright sunlight.

Perfect analogy.

The contrast makes you judge the second condition differently than you would have in isolation.

Ok, so just to recap the difference.

Time-related factors are about things changing during the gap between treatments: external events, internal maturation, the measurement tool itself.

Order effects are about the experience of the first treatment directly impacting performance on the second.

That's a great summary.

And crucially, the time-related threats are mostly a concern when your treatments are separated by a decent chunk of time.

If everything happens super quickly in one go, history and maturation are less of a worry.

But order effects like practice, fatigue, carryover, contrast, can still absolutely be relevant.

I'm thinking about that hypothetical data the book showed, Table 9.1, where just a simple five-point practice effect, if everyone did treatment one and then treatment two, made it look like treatment two was better.

Exactly.

That systematic influence tied to the order totally confounds your results.

You can't separate the real effect of treatment two from the leftover effect of having done treatment one first.

It can completely mislead you.

So these time issues, and especially these order effects, they sound like they could really mess up your findings pretty easily.

If they're such a big risk, how do researchers actually use within-subjects designs without just getting garbage results?

How do they handle this stuff?

Right, it's a major consideration.

They have a few main strategies.

Sometimes they try to minimize the effects.

Sometimes they have to change the design entirely.

What's the first line of defense?

Can they just adjust the timing?

They can try.

Controlling the time between treatments is one approach.

But it's tricky.

Make the time gap too short, and temporary order effects like fatigue or maybe a mood induced by the first treatment might not have faded away yet.

But make the gap too long, and you increase the risk of history effects, something happening in the outside world, or maturation becoming a bigger factor.

It's a balancing act.

And some things just don't fade, right?

Like if app A taught me a permanent new skill.

Exactly.

Increasing the time gap does nothing for permanent carryover effects.

Once you've learned that skill, you've learned it.

So what if you expect really strong permanent order effects?

Like comparing two fundamentally different ways of teaching reading, once you learn one way, it might mess up learning the second way.

In situations like that, where you anticipate really substantial, irreversible order effects that you just can't easily manage, the best call might be to say, you know what, a within subjects design just isn't right for this question.

Ah, so you switch.

Yeah, you switch to a between subjects design.

Each person only gets one condition, and that completely eliminates order effects.

Of course, you lose that built-in control over individual differences we liked so much.

Trade-offs.

Okay, switch if you have to.

But if you really want those within subjects benefits, maybe you need the statistical power or participants are scarce.

Is there a way to keep the design but manage the order effects?

Yes, absolutely.

And this is the main technique researchers rely on.

Counterbalancing.

That sounds like balancing things out, mixing up the order.

Precisely.

The whole idea of counterbalancing is to vary the order in which participants experience the conditions.

You systematically change the sequence from one participant to the next.

Why?

What does changing the order achieve?

The goal is to break the link between any specific treatment condition and any specific time point or position in the sequence.

You want to ensure that across all your participants, the treatments are matched with respect to time and order.

Okay, matched with respect to time.

How does that work with our app A and app B example?

It means, for instance, that half your participants would use app A first, then app B.

The other half would use app B first, then app A.

Oh, okay.

Now look what happens.

App A appears in the first position for half the people and in the second position for the other half.

Same for app B.

So neither app is consistently stuck being the first one or the second one.

I think I see it now with that other data table example, Table 9.2, even if there is still that five-point practice effect happening for everyone when they do their second app.

Right, that effect is still there.

People doing their second task, whichever it is, might score five points higher just due to practice.

Because half the people did A second and half did B second, that five-point practice boost gets added to both treatments' overall scores kind of equally.

Exactly.

The average score for treatment A now includes scores from people who did it first, no practice boost, and people who did it second with the practice boost.

Same for treatment B.

The practice effect is distributed evenly across both conditions.

So, wait, the effect is still in the scores?

The average scores for A and B might both be inflated a bit by the practice?

Yes, that's possible.

Counterbalancing doesn't eliminate the order effect from individual scores, or even necessarily from the overall treatment means.

And what does it do?

It prevents the order effect from becoming a confounding variable.

Because the effect is balanced across conditions, it doesn't systematically bias the difference between the treatment means.

You can still make a fair comparison between the average score for A and the average score for B, even with the order effect present in the background noise.

It isolates the treatment difference.
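
The arithmetic behind that claim can be sketched in a few lines of Python. The scores below are invented rather than copied from Table 9.2: app B is assumed to be truly 3 points better, and whichever app comes second gets the five-point practice boost.

```python
from statistics import mean

PRACTICE = 5  # hypothetical boost for whichever app is used second

# Hypothetical "true" scores for four people; app B is truly 3 points better.
true_a = [20, 24, 28, 32]
true_b = [23, 27, 31, 35]

def observed(order_for_each):
    """Return (scores_a, scores_b) after adding the practice boost to the second app."""
    scores_a, scores_b = [], []
    for a, b, order in zip(true_a, true_b, order_for_each):
        scores_a.append(a + (PRACTICE if order == "BA" else 0))
        scores_b.append(b + (PRACTICE if order == "AB" else 0))
    return scores_a, scores_b

fixed = observed(["AB", "AB", "AB", "AB"])     # everyone uses A first
balanced = observed(["AB", "BA", "AB", "BA"])  # half use A first, half use B first

fixed_diff = mean(fixed[1]) - mean(fixed[0])
balanced_diff = mean(balanced[1]) - mean(balanced[0])
print("fixed order, B minus A:   ", fixed_diff)
print("counterbalanced, B minus A:", balanced_diff)
```

In the fixed order the difference comes out as 8, the real 3 plus 5 points of practice. With counterbalancing it comes back to 3, even though both treatment means are still inflated by 2.5 points each.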

That's pretty clever.

Distribute the problem evenly so it cancels out in the comparison.

But it sounds maybe not perfect?

Are there catches?

Oh, definitely limitations.

First, as we just said, the order effects are still hanging around in the data.

They don't vanish.

What this does is add variability within each treatment group.

How so?

Because within, say, the treatment A scores, you have scores from people who did it first, no order effect bump, and scores from people who did it second with the order effect bump.

That makes the scores within treatment A more spread out, more variable.

And more variability.

Didn't we say that can make it harder to see a real difference between the groups?

It hides the signal in the noise?

Exactly.

So counterbalancing can increase the within treatment variance, which can slightly reduce the statistical power of the design.

It slightly undermines that power advantage we talked about earlier.

It's a trade-off.

Okay, adds noise.

What else?

Counterbalancing works best if the order effects are symmetrical, meaning the effect of going from A to B is roughly the same size and type as going from B to A.

But what if they're not?

What if app A makes you way more tired than app B does?

Right.

Or what if the carryover from A to B is much stronger than any carryover from B to A?

If the effects are asymmetrical, then simply swapping the order, A then B versus B then A, won't perfectly balance them out.

One sequence might still leave a bigger residual impact than the other.
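
Here's a small sketch of why asymmetry matters, again with invented numbers. App B is assumed to be truly 3 points better, the orders are counterbalanced, and the only thing that varies is how much carryover flows in each direction (the function and parameter names are hypothetical).

```python
from statistics import mean

# Hypothetical true scores; app B is truly 3 points better than app A.
true_a = [20, 24, 28, 32]
true_b = [23, 27, 31, 35]
orders = ["AB", "BA", "AB", "BA"]  # counterbalanced: half start with each app

def observed_difference(carry_a_to_b, carry_b_to_a):
    """Mean of B minus mean of A when the first app's effect carries into the second."""
    scores_a = [a + (carry_b_to_a if order == "BA" else 0)
                for a, order in zip(true_a, orders)]
    scores_b = [b + (carry_a_to_b if order == "AB" else 0)
                for b, order in zip(true_b, orders)]
    return mean(scores_b) - mean(scores_a)

symmetric = observed_difference(carry_a_to_b=5, carry_b_to_a=5)
asymmetric = observed_difference(carry_a_to_b=6, carry_b_to_a=0)
print("symmetric carryover: ", symmetric)
print("asymmetric carryover:", asymmetric)
```

With equal carryover in both directions, the comparison recovers the true 3-point difference. When only A carries over into B, the observed difference doubles to 6 even though the orders were perfectly counterbalanced.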

Hmm, okay.

And I bet this gets crazy complicated if you have more than two treatments.

That's the biggest practical hurdle.

Complete counterbalancing, where you use every single possible order of the treatments, becomes mathematically explosive very quickly.

The number of possible sequences is n factorial, written as n!.

N factorial, right.

So if we had, say, four different app features to test, A, B, C, D, that's four times three times two times one, 24 different orders.

Correct.

And for complete counterbalancing, you technically need to use all 24 of those unique sequences.

You'd need at least 24 participants, or 24 groups of participants, one for each sequence.

Okay, 24 might be doable sometimes, but what if you had six features?

Six factorial.

Six times five times four times three times two times one.

That's 720 different sequences.

Whoa, 720?

You'd need at least 720 participants just to cover all the orders.

That sounds almost impossible for most studies.

It's usually totally impractical.

So researchers rarely use complete counterbalancing when they have more than, say, three or maybe four conditions.
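
The factorial growth is easy to check with Python's standard library. This sketch just counts and lists orders; the condition labels A to D stand in for the four hypothetical app features from the example.

```python
from itertools import permutations
from math import factorial

# The number of possible treatment orders for n conditions is n!.
for n in (2, 3, 4, 6):
    print(f"{n} conditions -> {factorial(n)} possible orders")

# List every order for four hypothetical app features, A to D.
all_orders = ["".join(p) for p in permutations("ABCD")]
print(len(all_orders), "orders, starting with", all_orders[:3])
```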

So what do they do instead?

Just give up on balancing?

No.

They use partial counterbalancing.

Instead of using all possible sequences, they select a smaller, carefully chosen subset of sequences.

How do they choose which ones?

Is it just random?

Sometimes random selection is involved, but often it's more systematic.

The goal is to pick sequences such that across the chosen set, each treatment condition still appears in each possible ordinal position, first, second, third, et cetera, roughly equally often.

Is there a standard way to do that?

A very common technique is using a Latin square.

It's basically a grid setup.

For our four treatments, A, B, C, D, a simple Latin square might give you these four sequences to use.

A, B, C, D; B, C, D, A; C, D, A, B; and D, A, B, C.

Ah, I see.

Each letter appears once in each position, first, second, third, fourth, across those four sequences.

And each letter appears only once in each sequence.

Exactly.

So now, instead of needing 24 sequences for complete counterbalancing, you only need these four.

Much more manageable. You'd assign different participants to each of these four orders.

That makes way more sense.

Is the Latin square perfect, though, or does partial counterbalancing have its own issues?

It's not perfect.

Yeah.

It ensures each condition is in each position, but it doesn't guarantee balancing of all potential sequence effects.

For example, in that simple square, the sequence A, B happens, but B, A never happens.

Oh, right.

So if the carryover from A to B is different from B to A, that specific comparison isn't balanced.

Potentially, yes.

There are more complex ways to construct Latin squares or randomization procedures to try and mitigate this, but it's a reminder that partial counterbalancing is an approximation.

It's usually good enough, but it doesn't capture every possible order interaction like complete counterbalancing theoretically does.
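
The simple square described above can be built by shifting the list of conditions one step per row. The sketch below is one way to do that (the function name cyclic_latin_square is made up here), and it also checks the two properties just discussed: balance across positions, and which conditions ever directly follow which.

```python
def cyclic_latin_square(conditions):
    """Build a square where each row is the previous row shifted left by one."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

conditions = ["A", "B", "C", "D"]
square = cyclic_latin_square(conditions)

# Every condition should show up exactly once in each ordinal position.
position_balanced = all(
    sorted(row[pos] for row in square) == sorted(conditions)
    for pos in range(len(conditions))
)

# Collect every "X immediately followed by Y" pair that occurs in any sequence.
adjacent_pairs = {(row[i], row[i + 1]) for row in square for i in range(len(row) - 1)}

for row in square:
    print(", ".join(row))
print("balanced by position:", position_balanced)
print("A then B occurs:", ("A", "B") in adjacent_pairs)
print("B then A occurs:", ("B", "A") in adjacent_pairs)
```

The check confirms the limitation: positions are balanced, but B is never immediately followed by A, so any carryover specific to that direction goes unmeasured in this particular square.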

Wow, okay.

Counterbalancing is quite the rabbit hole.

So we've seen the design, the potential pitfalls, like time and order effects, and the ways researchers try to control them.

Let's pull back a bit.

How does this within-subjects approach really stack up against the between-subjects one?

When do you actually pick one over the other?

Yeah, it really comes down to weighing those pros and cons, which often seem like mirror images of each other.

There are about three main things to consider.

Okay, factor one.

Individual differences.

We touched on this.

This is a huge plus for within-subjects and a potential headache for between-subjects.

Right, because in between-subjects, maybe your groups just happen to be different from the start, which confuses things.

Those pre-existing differences between people in different groups can mask your treatment effect or even create a fake one.

They add a lot of noise or variance.

But within-subjects.

Since it's the same people, that between-group difference is gone.

And crucially, as we discussed, the statistical analysis can mathematically account for and remove the variance caused by the stable differences between individuals.

That power boost again.

Yes.

The source examples, like Table 9.3, Figure 9.2, and Box 9.1, really illustrate this: how removing that individual-difference noise makes the actual treatment effect pop out much more clearly.

It's often the biggest reason to choose within-subjects, especially if you expect people will vary a lot on whatever you're measuring.

Okay, so point one.

Individual differences definitely favors within-subjects for control and power.

What's point two?

Must be the flip side.

You got it.

Time-related factors and order effects.

These are the Achilles heel of within-subjects designs, but they are completely avoided in between-subjects designs.

Because in between-subjects, each person is only measured once, so there's no later condition to be affected by an earlier one or by time passing.

Precisely.

No history effects between measurements, no maturation, no practice or fatigue effects from repeating the task, no carryover.

So if you strongly suspect that order effects are going to be massive, really persistent or maybe asymmetrical in a way that counterbalancing can't handle.

Then maybe between-subjects is the safer way to go, even if you lose some power.

That's often the calculation, yes.

And the third factor.

Just practicality.

Number of participants.

Within-subjects designs are generally much more efficient.

Fewer people needed.

Way fewer, typically.

Since each person gives you data for all your conditions, you just don't need as large a total sample size compared to a between-subjects design where you need a whole separate group for every single condition.

So if participants are really hard to come by, like you mentioned Olympic athletes or maybe people with a very specific rare condition, then using a within-subjects design might be the only feasible way to conduct the research.

A between-subjects design might require more participants than you could ever realistically recruit.

Okay, let me try to sum that trade-off up.

You lean towards within-subjects if participants are scarce or you expect huge individual differences and really need that statistical power to see an effect.

And you lean towards between-subjects if you anticipate really nasty uncontrollable order effects that would mess up a within-subjects design.

That's the core logic, well put.

Is there?

Is there anything that tries to split the difference, get some of that individual difference control without the order effect problem?

Ah, yes.

There is a design that attempts that.

The matched-subjects design.

Sometimes just called matched pairs or matched groups.

Matched-subjects, how does that work?

It's kind of a hybrid.

Like a between-subjects design, you have separate groups of participants for each treatment condition.

So nobody experiences more than one condition, which avoids the time and order effects.

Okay, separate groups.

But how does it handle individual differences then?

Before the study starts, the researcher matches participants across the groups on one or more variables that are thought to be relevant to the outcome variable.

So, back to our app study.

If we thought maybe baseline stress level was important, we'd find pairs of people with similar baseline stress.

And then randomly assign one person from that pair to use app A and the other person to use app B.

You do this for all your participants, creating groups that are intentionally equivalent or matched on that baseline stress variable.
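
That matching procedure can be sketched in a few lines. Everything here is hypothetical: the participant IDs, the baseline stress scores, and the choice to pair people who sit next to each other once they are ranked by baseline. It's one reasonable way to do the matching, not necessarily the textbook's procedure.

```python
import random
from statistics import mean

# Hypothetical baseline stress scores for eight recruited participants.
baseline = {"P1": 12, "P2": 30, "P3": 14, "P4": 29,
            "P5": 21, "P6": 20, "P7": 8, "P8": 9}

def matched_assignment(scores, seed=0):
    """Pair people with neighboring baseline scores, then split each pair at random."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # participant IDs, lowest baseline first
    group_a, group_b = [], []
    # Step through the ranking two at a time; an unpaired leftover is dropped.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random assignment within the matched pair
        group_a.append(pair[0])
        group_b.append(pair[1])
    return group_a, group_b

group_a, group_b = matched_assignment(baseline)
print("app A:", group_a, "mean baseline", mean(baseline[p] for p in group_a))
print("app B:", group_b, "mean baseline", mean(baseline[p] for p in group_b))
```

Because each pair differs by only a point or two at baseline, the two groups end up with nearly identical average baseline stress no matter how the coin flips land within pairs.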

I see.

So you're trying to build in that group equivalence that within-subjects gets automatically.

Exactly.

The goal is to get the advantages of controlling for individual differences, reducing that noise, potentially increasing power.

But by using separate groups, you avoid the disadvantages of time-related and order effects.

The statistics used for matched-subjects designs actually account for that matching.

Similar to how repeated-measures stats account for the same person being in all conditions.

But it's not quite the same as using the exact same person, is it?

No, it's definitely an approximation.

You can only match on variables you think are important and that you can accurately measure.

You can't possibly match people on every single subtle difference between them.

So while it's usually better than a standard unmatched between-subjects design at controlling individual differences, it's generally not considered as effective or as powerful as a true within-subjects design where the person is their own perfect match.

So it's a good option maybe when within-subjects is too risky due to order effects, but you're still really worried about individual differences messing up a standard between-subjects design.

That's exactly its niche, a useful compromise in certain situations.

Okay, great.

We've got the concepts, the threats, the controls, the comparisons.

Let's talk about how these designs actually get used.

What do they look like in practice, and how do researchers analyze the data they get from them?

Right.

So as we said, researchers often pick within-subjects designs when participants are hard to find, or when they expect those individual differences to be really large and want the extra statistical power.

What's the absolute simplest version you see?

That would be the two-treatment design, just comparing two conditions, say condition A versus condition B.

Like our app A versus app B example, what are the pros of keeping it that simple?

Well, it's obviously easy to conduct and for people to understand.

Counterbalancing is dead simple, just A then B for half, B then A for the other half.

You can often choose two conditions that are quite distinct, which maximizes your chance of actually seeing a difference between them if one exists.

Makes sense.

What's the downside of only having two conditions?

The main limitation is that you only get two data points for your independent variable.

You can find out if there's a difference between A and B, but you don't get a sense of the functional relationship.

Functional relationship, meaning?

Meaning, how does the outcome change as you vary the independent variable across a wider range?

If A and B are, say, two different dosages of a caffeine pill, you know the effect of dose A and dose B, but you don't know what happens at doses in between or higher or lower.

Is the effect linear?

Does it level off?

Does it curve?

You can't see the shape of the relationship with just two points.

Ah, okay.

You can't really draw the curve between just two dots.

How do they analyze the data from these two-treatment within-subjects studies?

Depends on the type of data.

If you have interval or ratio scale data, like scores on a test, reaction times, things like that, the standard test is a repeated-measures t-test.

Or you could use a single-factor repeated-measures ANOVA, which essentially gives the same result when there are only two conditions.

And what do those tests tell you?

They compare the average score in condition A to the average score in condition B, taking into account that the scores came from the same people.

They tell you if the observed difference between the means is statistically significant, basically, is it bigger than what you'd likely see just due to random chance or measurement error.
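
For reference, the repeated-measures t statistic is usually written in terms of each person's difference score. The notation here is the conventional one and not necessarily the symbols the chapter uses: D_i is participant i's score in condition B minus their score in condition A, and n is the number of participants.

```latex
\[
  t = \frac{\bar{D}}{s_D / \sqrt{n}}, \qquad
  \bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i, \qquad
  s_D = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(D_i - \bar{D}\right)^2}, \qquad
  df = n - 1
\]
```

Because only the difference scores enter the formula, anything that shifts both of a person's scores by the same amount drops out before the test is even computed.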

Okay.

What if the data isn't like scores on a scale?

What if it's just ranks?

Like which app did people prefer?

If you have ordinal data, ranks, you typically use a nonparametric test like the Wilcoxon signed-ranks test.

And if your data is even simpler, just directional, like did each person improve or decline from A to B, you could use the sign test.

Got it.

So specific tests for different data types.
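
The sign test is simple enough to compute by hand. The sketch below assumes the usual exact binomial version, where under the null hypothesis each person is equally likely to improve or decline and ties have already been set aside. The function name and the counts are invented for illustration.

```python
from math import comb

def sign_test_p(n_improved, n_declined):
    """Two-sided exact sign test p-value; ties are assumed to be dropped already."""
    n = n_improved + n_declined
    k = max(n_improved, n_declined)
    # Chance of a split at least this lopsided in one direction if each outcome is 50/50.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical result: 9 of 10 people improved going from app A to app B.
p_value = sign_test_p(9, 1)
print(f"p = {p_value:.4f}")
```

Only the direction of each change is used, so the test stays valid for very coarse data, though it ignores how large each person's change was.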

Now what if you do want to see that functional relationship, see the curve?

Then you need to use a multiple treatment design with three or more conditions.

So instead of just app A and app B, maybe app A, app B, and app C, or maybe five different dosages of that caffeine pill.

Exactly.

Having multiple levels of your independent variable gives you more data points.

This makes it much more likely that you can actually see the shape of the relationship, whether it's linear, curved, levels off, et cetera.

It can also provide a more convincing demonstration of cause and effect.

If you see performance consistently go up as you increase the caffeine dose across five levels, that's stronger evidence than just seeing a difference between two doses.

Makes sense.

More points make a clearer picture.

What are the downsides of adding more treatments in a within subjects design?

Well, a few things.

First, the differences between adjacent conditions might get smaller.

If you test 10 slightly different caffeine doses, the difference between dose three and dose four might be tiny and hard to detect statistically.

Second, more treatments mean each participant has to spend longer in the study doing more tasks.

This increases the risk of participant attrition: people getting tired, bored, or just dropping out before they finish all the conditions.

And if people drop out non-randomly, that can seriously bias your results.

And I bet counterbalancing gets worse.

Oh yeah, as we discussed, that n! problem explodes.

Complete counterbalancing becomes impossible, so you have to rely on partial counterbalancing like Latin squares with all their potential limitations.

It just gets logistically much more complex.

So more complex to run, more risk of people leaving, harder to counterbalance, but you get a better picture of the relationship.

How do you analyze the data when you have three or more within subjects conditions?

For interval or ratio data, the standard tool is the repeated-measures ANOVA, or analysis of variance.

What does that tell you?

The overall ANOVA test tells you if there is a statistically significant difference somewhere among the means of your multiple conditions.

It doesn't tell you which specific means are different from each other, just that at least one is different from another.
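Here is a rough sketch of the arithmetic behind a one-factor repeated-measures ANOVA, written in plain Python so every step is visible. The scores are made up (4 participants, 3 conditions) and the function name is hypothetical. It stops at the F-ratio, because turning F into a p-value needs an F distribution table or a statistics library.

```python
def repeated_measures_anova(scores):
    """scores[i][j] is the score of participant i under condition j."""
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of conditions
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in cond_means)
    # Variability due to stable individual differences, removed from the error term
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subj_means)
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_conditions / df_conditions) / (ss_error / df_error)
    return {
        "ss_conditions": ss_conditions,
        "ss_subjects": ss_subjects,
        "ss_error": ss_error,
        "df": (df_conditions, df_error),
        "F": f_ratio,
    }

# Hypothetical data: rows are participants, columns are conditions A, B, C
data = [
    [3, 4, 8],
    [1, 3, 5],
    [4, 6, 8],
    [2, 3, 7],
]
result = repeated_measures_anova(data)
print(result)
```

With these invented numbers, 15 of the 17 units of leftover variability belong to consistent differences between people, so only 2 remain in the error term. Pulling that individual-differences piece out of the error is the source of the extra sensitivity in a within-subjects analysis.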

So if the ANOVA says yes, there's a difference, then what?

Then you typically follow up with post hoc tests, which are specific comparisons between pairs of conditions or combinations of conditions, to pinpoint exactly where the significant differences lie.

Is A different from B?

Is B different from C?

Is the average of A and B different from C?

And so on.
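One common way to run those pairwise follow-ups, sketched here under our own assumptions, is a paired t statistic for every pair of conditions with a Bonferroni-adjusted alpha. The chapter summary does not say which post hoc procedure to use, so treat Bonferroni as an illustrative choice. The data and the `paired_t` helper are invented for the example.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic for two lists of scores from the same people."""
    diffs = [b - a for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical scores for the same 4 participants under conditions A, B, C
scores = {"A": [3, 1, 4, 2], "B": [4, 3, 6, 3], "C": [8, 5, 8, 7]}

pairs = list(combinations(scores, 2))   # A-B, A-C, B-C
alpha_per_test = 0.05 / len(pairs)      # Bonferroni: split alpha across the comparisons
t_values = {(a, b): paired_t(scores[a], scores[b]) for a, b in pairs}

for pair, t in t_values.items():
    print(pair, round(t, 3))
print("alpha per comparison:", round(alpha_per_test, 4))
```

Each t value would then be checked against the critical t for n minus 1 degrees of freedom at the adjusted alpha, which this sketch leaves out. Dividing alpha by the number of comparisons keeps the overall risk of a false positive from ballooning as you test more pairs.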

Okay, so ANOVA for the overall picture, then follow-up tests for the details.

That's the usual approach, and of course the analysis can get even more complex if you have multiple independent variables in your within-subjects design, but that goes beyond this chapter.

You'd need a more advanced text for those factorial repeated-measures designs.

Wow.

Okay, that really was a deep dive.

We went from just the basic idea, testing the same people multiple times, all the way through time threats, order effects, counterbalancing wizardry, statistical power, and design choices.

Quite the journey.

We definitely covered the ground.

We saw the core characteristic is using that single group for all conditions.

We identified those key validity threats unique to measuring over time.

History, maturation, instrumentation, regression, and especially those pervasive order effects: practice, fatigue, carryover, contrast.

And we spent a lot of time on how researchers fight back.

Controlling the time gap, sometimes just switching to between subjects, but mostly using counterbalancing, understanding that it balances order effects rather than eliminating them, and that it has its own complexities, like asymmetrical order effects and the N-factorial issue, leading to partial counterbalancing and Latin squares.

Then we stacked it up against between-subjects designs, highlighting that core trade-off.

The within-subjects advantage in handling individual differences and needing fewer participants versus its vulnerability to time and order effects.

And we slotted in the matched-subjects design as that hybrid approach.

And finally, we looked at applying it: the simpler two-treatment design versus multiple treatments for seeing functional relationships, noting the pros and cons of each, like complexity and attrition, and the stats involved: t-tests, the repeated-measures ANOVA, Wilcoxon, and the sign test. The key is understanding why those within-subjects analyses are powerful: they can factor out that individual difference variance.

So, by hitting all those sections, the concepts, methods, design choices, data collection issues like time and dropout, the analysis tools, and examples, and implicitly touching on ethical things like ensuring valid conclusions by handling bias, we've pretty thoroughly worked through this chapter's take on within-subjects design.

I think so.

You should definitely have a solid grasp now of what this design entails, why it's used, its strengths, and importantly, its weaknesses, and how researchers try to manage them.

So, tying this all together for you, the listener, what does this mean when you encounter research?

Well, next time you read a study where the same people were tested under different conditions or at different times, you're now equipped to think critically.

Yeah, you can ask, how did they handle potential order effects?

Was it counterbalanced?

If so, how? Complete, partial, Latin square?

Was the time between sessions sensible?

Could history or maturation have played a role, given the study's length?

Knowing the design's potential issues helps you judge the quality of the study and how much confidence you should really have in its conclusions.

And it leaves us with a really interesting question to mull over, I think.

Given how tricky order effects can be, especially carryover or asymmetrical ones, how much does even careful counterbalancing truly let us see the pure effect of a single treatment in isolation?

Or is the effect we measure in these designs always fundamentally tangled up with the sequence and the interaction of the experiences themselves?

Is the effect context dependent on the order?

That's definitely something to think about next time you see a repeated-measures study.

And that is a wrap on this deep dive.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Designs employing within-subjects methodology require all research participants to engage with every experimental condition, enabling researchers to track individual performance across varying levels of the independent variable. This approach produces powerful statistical benefits by eliminating variability stemming from differences between participants, thereby lowering error variance and increasing sensitivity to detect authentic treatment effects. Because the same individuals serve across all conditions, researchers require substantially fewer total participants compared to between-subjects approaches while gaining richer comparative data from each person. The critical trade-off involves managing threats to internal validity that emerge uniquely from this repeated-exposure structure. Sequential administration of conditions introduces time-based confounds including history effects where external events influence outcomes, maturation where natural changes occur over time, instrumentation effects when measurement procedures shift, and regression toward the mean following initial extreme scores. More problematic are order effects that systematically alter performance based on condition sequence: practice effects enhance responding through repeated exposure, fatigue effects diminish performance as fatigue accumulates, and carryover effects allow previous conditions to contaminate performance in subsequent conditions. Counterbalancing strategies mitigate these concerns by systematically manipulating the sequence order across participants. Complete counterbalancing presents every possible arrangement, partial counterbalancing selects a subset of arrangements strategically, and Latin square designs offer efficient balanced orderings that control for position effects while minimizing the number of distinct sequences needed. 
An intermediate approach, matched-subjects design, groups individuals according to relevant characteristics prior to condition assignment, thereby blending advantages of within-subjects and between-subjects frameworks. Analyzing repeated-measures data demands appropriate statistical procedures: repeated-measures analysis of variance handles multiple conditions simultaneously, repeated-measures t-tests compare pairs of conditions, and non-parametric tests including sign tests and Wilcoxon signed-ranks tests serve when assumptions of normality are violated. Successful implementation requires careful consideration of whether within-subjects methodology suits the research question and rigorous application of control procedures addressing the specific threats inherent to repeated measurement.
