Chapter 15: Evaluation Studies: Controlled & Natural Settings

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

Today, we're taking a shortcut through what I think is a really crucial part of interaction design: evaluation studies.

Absolutely.

This is really where it all comes together, right?

How do you actually prove that the product you design works for the people who are supposed to use it?

Exactly.

And you know, technology is everywhere now.

It's not just on a desk anymore.

It's in our homes, on us, in public spaces.

And that complexity means you can't just test everything in a quiet room.

You need methods that cover the whole spectrum from a super controlled lab to what we call in the wild.

Okay.

So to frame this, let's use a fun example from our sources.

Imagine you've built an app for school kids and their parents to take care of the class hamster over the holidays.

Oh, that's a great one.

Yeah.

It schedules feeding, tracks its health, lets everyone communicate.

So the question is, how do you evaluate that?

You need something that works for a kid, a busy parent, and a teacher.

And that's the perfect lead-in to the three big evaluation types we're going to unpack.

They really exist on a continuum of control.

Continuum, okay.

Yeah.

You have usability testing in lab-like settings.

Then you have experiments in research labs for, you know, real statistical rigor.

And finally, field studies, which happen right in the natural setting, the home, the school.

That's our roadmap.

All right.

Let's start with usability testing.

This is probably the one most people have heard of.

It's usually done in a controlled setting.

And the goal is pretty straightforward.

Right.

You're measuring how effective the product is.

Can users actually get the tasks done?

And, you know, are they satisfied using it?

The key here is controlling the environment, isn't it?

It is.

The whole point is to remove all those outside variables.

You don't want someone's phone ringing or a colleague interrupting them.

You want them focused purely on the interface.

And the data collection for this is pretty organized.

Oh, yeah.

It's a combination of methods.

You've got video recording to see their expressions.

You're logging every single keystroke, every tap, every swipe.

And then there's the think aloud technique, which I find fascinating.

It's critical.

You're literally having the user describe what they're thinking as they do the task.

It gives you this amazing window into why they clicked on the wrong thing.

So you're getting inside their head.

You are.

And you pair all that qualitative stuff with hard numbers, with quantitative measures.

Okay.

So what are we tracking there?

Things like time to complete a task, and the number and even the type of errors they make.

How many people actually finish the task successfully?

You even track if they go to the help manual.

Oh, absolutely.

That's a huge red flag for a frustration point.
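
As a rough sketch of how those quantitative measures might be tallied after a session, assuming a simple hand-made record per participant: the field names, the `summarize` helper, and all of the numbers below are hypothetical and only illustrate the measures named above (task time, error count, completion, and help use).

```python
from statistics import mean

# Hypothetical session records: one dict per participant for a single task.
sessions = [
    {"user": "P1", "seconds": 42.0, "errors": 1, "completed": True, "used_help": False},
    {"user": "P2", "seconds": 75.5, "errors": 3, "completed": False, "used_help": True},
    {"user": "P3", "seconds": 38.2, "errors": 0, "completed": True, "used_help": False},
    {"user": "P4", "seconds": 51.3, "errors": 2, "completed": True, "used_help": True},
    {"user": "P5", "seconds": 60.0, "errors": 1, "completed": True, "used_help": False},
]

def summarize(records):
    """Return the quantitative measures for one task across participants."""
    n = len(records)
    return {
        "mean_seconds": mean(r["seconds"] for r in records),
        "mean_errors": mean(r["errors"] for r in records),
        # True counts as 1, so the sum is the number of participants who finished.
        "completion_rate": sum(r["completed"] for r in records) / n,
        "help_rate": sum(r["used_help"] for r in records) / n,
    }

summary = summarize(sessions)
print(summary)
```

Numbers like these only say where people struggled; the video and think-aloud data are still needed to say why.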

So the big question everyone always has, how many people do you actually need for this?

Well, the classic answer was always, you know, five to 12 users will find most of the big problems, but it's more flexible now.

Oh, so if you're just doing a quick check on a small change, maybe two or three users is enough to get fast feedback.

But for our hamster app, for something complex, you definitely want a broader group.

And this used to happen in those custom-built labs with the one-way mirrors, right?

Yeah, it did.

But those are incredibly expensive and honestly, often not necessary anymore.

Which brings us to mobile and remote testing.

Exactly.

With a lab-in-a-box or lab-in-a-suitcase setup, you can have portable eye-tracking glasses (our sources mentioned the Tobii glasses) or facial recognition systems, all just running off a laptop.

So you can take the lab to the user, essentially.

You can.

And that lets you run studies in more ecologically valid places, places that are closer to real life.

Like that study in the London shopping mall.

That's a perfect example.

They had people wear eye tracking glasses while they were shopping just to see if they noticed digital ads.

You're getting rigorous data, but outside the lab.

And for big global products, I imagine remote testing is everything.

It's essential.

Users can do the tasks in their own homes on their own time.

It's a game changer for testing across different markets or with users who have specific accessibility needs.

You know, the early iPad testing from 2010 is a great case study for this.

It really is.

Apple was under massive time pressure.

Everything was secret.

So they used mobile usability testing with just seven experienced iPhone users.

And they were testing really practical things like checking a recipe or finding something on [...]

And the think-aloud method immediately found huge problems.

Links were too small to tap.

Fonts were hard to read.

And people felt lost without a back button.

Yes.

But the really big lesson came from a different study, one in Namibia.

Researchers there learned that you can't just rely on questionnaires.

Why not?

Because you can get glowing, positive feedback on a questionnaire.

But when you actually watch people or interview them, you realize they're struggling.

Culture plays a huge role.

So the takeaway is you have to use multiple methods.

You have to.

Observation, interviews, performance data.

You need the whole picture to get to the truth.

Okay.

So that idea that usability testing finds out what is broken, that leads us perfectly to the second type of evaluation: controlled experiments.

That's right.

So if usability testing shows the links are too small, an experiment proves, with stats to back it up, why one size is fundamentally better than another.

It's about rigor.

Exactly.

An experiment is all about testing a specific hypothesis and the relationship between variables.

So you'd start with a hypothesis like, "a context menu will be faster than a cascading menu."

Perfect example.

And in the hypothesis, you have your variables.

The independent variable is the thing you, the researcher, change.

So in your example, that's the type of menu.

And the dependent variable is what you measure.

It's what you measure, yes.

The thing you expect to change because of your manipulation.

So the time it takes to select something or maybe the number of errors.

And you set up two competing hypotheses.

Correct.

The null hypothesis is the default assumption.

It says there's no difference.

So menu type has no effect on time.

The whole point of the experiment is to try and reject that.

And the alternative hypothesis is that there is a difference.

There is a difference, yes.

And your goal is to set everything up so perfectly, keeping the screen resolution, the lighting, everything else constant, that you can say with confidence that the change you saw was only because of the variable you manipulated.

This sounds tricky, especially assigning the participants.

It is.

And there are three main ways to do it, each with a big trade-off.

Okay, let's break them down.

First up is the different participant design.

Or between subjects.

This is where you have different people in different conditions.

So for our hamster app, one group of parents only ever sees the blue design, and a totally separate group only sees the red one.

The good thing there is you don't have what you call order effects, right?

No order effects, but the downside is massive.

Individual differences, say one person is just faster with tech than another, can skew the results.

You need a lot of people to average that out.
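
A minimal sketch of how participants might be split for a between-subjects design, assuming simple random assignment into equal-sized groups. The function name, the participant IDs, and the seed are all made up for illustration; the sources do not prescribe a procedure.

```python
import random

def assign_between_subjects(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into the conditions."""
    rng = random.Random(seed)          # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical pool of 20 parents; each one sees exactly one design.
parents = [f"P{i}" for i in range(1, 21)]
groups = assign_between_subjects(parents, ["blue", "red"], seed=15)
print(groups)
```

Random assignment does not remove individual differences; it only makes it unlikely that they line up with one condition, which is why larger groups help.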

So if you need that many people, why not just have everyone do everything?

The same participant design.

The within-subjects design.

It seems more efficient, and it is, because individual differences are gone and, with two conditions, you need half the participants.

There's a catch.

There's a huge catch.

It's called a training effect.

If someone uses the blue hamster app first, they learn how it works.

So when they switch to the red one, they're faster, not because red is better, but because they're already an expert.

Oh, so you have to control for that.

You do with something called counterbalancing.

You have to make sure half the people do blue then red, and the other half do red then blue to cancel out that learning effect.

That makes a ton of sense.
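
Counterbalancing for two conditions can be sketched as below, assuming an even number of participants and the two orders described above. The helper name, participant IDs, and seed are hypothetical.

```python
import random

def counterbalance(participants, orders, seed=None):
    """Give each participant one presentation order, cycling through the
    orders so that each order is used equally often."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)              # who gets which order is left to chance
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}

orders = [("blue", "red"), ("red", "blue")]
schedule = counterbalance([f"P{i}" for i in range(1, 13)], orders, seed=15)
for person, order in sorted(schedule.items()):
    print(person, "->", " then ".join(order))
```

With more than two conditions the number of possible orders grows quickly, so this simple alternation would need to be replaced by a fuller scheme.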

And the last one is matched participant design.

Right.

Here, you try to get the best of both worlds.

You match people in pairs based on key things, like their level of tech expertise, then randomly split the pair into the two conditions.

So you reduce those individual differences without the order effects.

In theory, yes, but the con is you can never be 100% sure you've matched every single thing that matters.

What about motivation or just, you know, how tired they are?
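
One way to picture the matching step, as a sketch only: rank people on a single measured characteristic, pair neighbours in that ranking, and flip a coin within each pair. The expertise scores, the `matched_pairs` helper, and the choice to match on one score are assumptions for illustration.

```python
import random

def matched_pairs(scores, seed=None):
    """scores maps participant -> expertise score.
    Sort by score, pair adjacent participants, then randomly split each pair."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)
    assignment = {}
    # Step through the ranking two at a time; an odd leftover is not assigned.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        assignment[pair[0]] = "blue"
        assignment[pair[1]] = "red"
    return assignment

# Hypothetical tech-expertise scores on a 1-10 scale.
expertise = {"P1": 2, "P2": 9, "P3": 5, "P4": 3, "P5": 8, "P6": 6, "P7": 1, "P8": 7}
assignment = matched_pairs(expertise, seed=15)
print(assignment)
```

The sketch matches on expertise alone, which is exactly the weakness raised above: anything unmeasured, such as motivation or fatigue, stays unmatched.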

Okay, so once you have all this data, you run statistical tests.

The t-test is the big one.

It's the most common in this field.

Yeah.

It compares the averages and the variability across your conditions to see if the difference is actually meaningful.

And this is where the famous p-value comes in.

It is.

And the p-value is actually pretty simple.

It's the probability of seeing a difference at least as big as the one you got if the null hypothesis were true, so if only chance were at work.

And you're aiming for a p-value of less than 0.05.

That's the gold standard, typically.

Which means?

Well, exactly.

It means that if there were really no difference between your blue menu and your red menu, a result like yours would turn up less than 5% of the time.

So if you get that result, you can confidently reject the null hypothesis and say, yes, this difference is real.

Design B is actually faster.
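
To make the statistics concrete, here is a minimal sketch with made-up selection times for the two menu types from the earlier hypothesis. It computes Welch's t statistic, one common form of the t-test. Because Python's standard library has no t distribution, the p-value comes from an exact permutation test rather than a t table; that swap is this sketch's assumption, not something from the sources, and the function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def exact_permutation_p(a, b):
    """Two-sided p-value: the share of all possible relabelings of the pooled
    data whose |t| is at least as large as the observed |t|."""
    pooled = a + b
    observed = abs(welch_t(a, b))
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so floating-point ties count as "at least as large".
        if abs(welch_t(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical selection times in seconds (between-subjects, six people each).
context_menu = [1.8, 2.1, 1.9, 2.4, 2.0, 2.2]
cascading_menu = [2.9, 3.1, 2.6, 3.4, 2.8, 3.0]

t_value = welch_t(context_menu, cascading_menu)
p_value = exact_permutation_p(context_menu, cascading_menu)
print(f"t = {t_value:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: menu type appears to affect selection time.")
```

A small p-value says the data would be surprising if menu type made no difference; it does not say how large or how important the difference is, so the means themselves still need reporting.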

Okay, so now we got to make that final leap to the complete opposite end of the spectrum, field studies.

To the messy real world.

Exactly.

As tech gets more, you know, ambient and mobile, we have to see how it fits into people's actual lives.

You have to see how the hamster app works when a parent is trying to make dinner, not sitting in a silent lab.

And field studies are, by definition, messy.

There are interruptions everywhere.

Phone calls, kids, the doorbell.

You can't really isolate causation the way you can in an experiment.

But you gain something else.

You gain something way more valuable, in my opinion.

Ecological validity.

A true sense of how the product works in the real world.

The focus shifts from pure precision to getting rich, qualitative data: stories, patterns of behavior.

And how do you collect that data in the wild?

A lot of observation, interviews, taking very detailed notes.

And for longer studies, you use something called the Experience Sampling Method, or ESM.

Which is like an electronic diary.

Basically, yeah.

The device itself, like their phone, will ping the person at different times and ask them to record something specific.

Maybe an interruption they just had,

or a problem they ran into.
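
A sketch of how an ESM prompt schedule could be generated, assuming one common approach: split the waking day into equal blocks and pick one random moment in each. The function name, the date, the 09:00 to 21:00 window, and the six prompts per day are all hypothetical choices, not details from the sources.

```python
import random
from datetime import datetime, timedelta

def esm_schedule(start, end, prompts, seed=None):
    """Return one random prompt time inside each of `prompts` equal blocks
    between start and end, so pings are spread out but unpredictable."""
    rng = random.Random(seed)
    block = (end - start) / prompts    # length of each block as a timedelta
    return [start + block * i + block * rng.random() for i in range(prompts)]

start = datetime(2024, 1, 8, 9, 0)     # hypothetical study day, 09:00
end = datetime(2024, 1, 8, 21, 0)      # prompts stop at 21:00
times = esm_schedule(start, end, prompts=6, seed=15)
for t in times:
    print(t.strftime("%H:%M"), "- What were you doing? Any interruption or problem?")
```

Randomizing within blocks keeps participants from anticipating the ping while still covering the whole day.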

I can imagine the logistics are a nightmare.

Especially with ethics and privacy.

It's a huge challenge.

If you're studying something in a public space, how do you get informed consent from everyone?

And if you put a prototype in someone's home and it breaks, you need a plan.

Which leads to in-the-wild studies.

This is where you're not just observing, you're actively seeing how a new technology changes people's lives.

That's the key.

You're looking at sustained use over weeks or months.

And what you find is that how people use tech in the wild is almost always totally different from how they behave in a lab.

The PainPad case study is a fantastic example of a field study.

It's a perfect one.

So the context is that it's really hard to get good data on patient pain levels after surgery.

The PainPad was a physical device meant to help with that.

And they did usability testing first, just to make sure it worked.

Of course.

Then for the field study, they gave it to 54 patients who were recovering from joint replacement surgery.

The pad would prompt them to log their pain score every two hours.

While the nurses were still collecting the scores the old way, verbally.

Right.

And the results were pretty amazing.

Patients loved the PainPad.

They found it easy to use.

But the compliance data was the real shocker.

What did it show?

Patients recorded way more pain scores using the device than the nurses collected.

We're talking 824 scores from the device versus 645 from the nurses.

Wow.
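
For a sense of scale, the two totals quoted above can be compared directly. This is plain arithmetic on the reported figures; it assumes nothing about how many readings each method could have collected.

```python
device_scores = 824   # pain scores patients logged on the device (reported above)
nurse_scores = 645    # pain scores nurses collected verbally (reported above)

extra = device_scores - nurse_scores
relative_increase = extra / nurse_scores

print(f"{extra} more scores, {relative_increase:.1%} above the nurse-collected total")
# prints "179 more scores, 27.8% above the nurse-collected total"
```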

And the patients were much better at sticking to the two-hour schedule when the tech prompted them.

And the qualitative feedback was useful, too.

Incredibly.

They found that some older patients had trouble with the keypad.

And the audio alerts were polarizing.

Some found them annoying.

Others barely noticed them.

But overall, the study proved the device worked in a really complex, high-stakes environment.

And just to wrap up the participant numbers question.

Right.

The advice is still pretty solid.

For usability, five users can find most of the big problems.

For experiments, you really need at least 15.

But honestly, you have to talk to a statistician.

And for field studies.

It varies wildly.

It could be one family or a whole community.

The focus is on the depth of the insights, not necessarily the size of the sample.

So we've really traced this whole evaluation continuum.

We've gone from the super high-control precision of experiments.

Where you're testing one specific thing.

All the way to the ecological validity of field studies.

Where you see how products survive in the messy real world.

And I think the biggest takeaway here is that the best evaluations don't just pick one method.

Right.

They mix them.

They absolutely do.

You use usability testing to find the obvious bugs.

Then maybe run an experiment to validate a big design choice.

And then you deploy it in the wild to see if people will actually adopt it long term.

You know, if we go back to that PainPad study for a second.

There's a really provocative idea there.

The fact that patients gave more frequent and more reliable data to a machine than they did to a human nurse.

What does that say?

It's a huge question, isn't it?

This shift where we might trust or at least adhere to a technology more than a human system.

What does that imply for the future of health care?

Or any of these really sensitive, high-stakes environments?

That's a powerful thought to end on.

It really speaks to the changing dynamics between us and our technology.

Thank you for joining us for this deep dive into evaluation studies.

We hope you feel thoroughly informed.

And we look forward to next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Evaluation studies in interaction design operate across a spectrum ranging from highly controlled laboratory conditions to naturalistic field environments, each approach offering distinct advantages for understanding user behavior and system performance.

Usability testing represents the most accessible evaluation method, typically conducted in specialized facilities or temporary controlled spaces where researchers systematically measure how effectively users can accomplish specific tasks, track the occurrence and nature of errors, and assess overall satisfaction with a product. Data collection during usability sessions employs multiple recording and observation techniques, including video capture, interaction logging, concurrent verbal reporting where users articulate their thoughts and decisions in real time, and standardized satisfaction instruments. Equipment supporting these evaluations ranges from dedicated observation rooms with one-way glass to portable systems incorporating eye-tracking sensors and affective computing tools, sometimes integrated into portable evaluation kits. The Apple iPad early-stage evaluation demonstrated how real-world constraints like compressed timelines necessitate pragmatic adaptations to evaluation protocols, uncovering usability concerns such as inadequate touch target sizing, navigation complexity, and orientation-related complications.

Formal experiments advance beyond usability assessment by imposing stricter methodological controls to test targeted theoretical predictions about user behavior. Experimental methodology requires researchers to deliberately manipulate independent variables representing specific design features and measure corresponding changes in dependent variables reflecting user performance outcomes. Hypothesis formulation involves specifying both a null hypothesis asserting no relationship and an alternative hypothesis proposing a measurable effect, which may be directional or nondirectional. Experimental rigor depends heavily on systematic management of confounding factors and strategic assignment of participants to comparison conditions using between-subjects allocation where groups experience different conditions, within-subjects allocation where individual participants experience all conditions with systematic sequencing procedures preventing bias, or matched-pairs approaches equating groups on relevant characteristics before condition assignment. Statistical analysis subsequently determines whether observed performance differences reach conventional significance thresholds, customarily interpreted at the 0.05 probability level.

Field studies embrace the complexity of authentic usage contexts including residential settings, clinical environments, and other real-world spaces, making them particularly valuable for evaluating mobile devices, pervasive computing systems, and connected technologies. This category of evaluation yields rich qualitative insights into how users actually appropriate, modify, and incorporate technologies within the complicated patterns of daily life. In-the-wild evaluation specifically denotes the extended deployment of working prototypes within natural environments over meaningful time periods, permitting observation of adaptation and integration processes. The Painpad investigation exemplified field evaluation methodology by examining adherence patterns and situational influences on a postoperative pain assessment tool deployed within a hospital context.
