Chapter 14: Introducing Evaluation in Interaction Design

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome back to the Deep Dive, where we tear through dense foundational texts to give you the actionable knowledge you need.

Today we are focusing on something absolutely critical.

The process that determines if a great idea actually becomes a great product.

Evaluation.

Exactly.

We're basing this whole conversation on Chapter 14, introducing evaluation from a major interaction design textbook.

To set the scene, think about this: you've built a new app for teenagers.

It looks amazing.

But how do you actually know if they'll use it or, you know, enjoy it after the first week?

That is the billion dollar question.

And evaluation is how you start to answer.

It's not just some final exam for your product.

No, it's a core process that runs through the entire design lifecycle.

It's all about collecting and analyzing data on how users experience.

Well, anything.

It could be a sketch on a napkin, a working prototype, or the finished app.

And the goal is always the same, right?

Always.

Design improvement.

And we look at that through two main lenses: usability (is it easy to use?) and user experience, which is more about, you know, is it satisfying, is it enjoyable?

So it's the essential reality check.

It makes sure designers don't just build things that only they understand how to use.

Precisely.

This chapter really lays out the whole toolkit for doing that.

So let's start with the fundamentals.

The why, what, where, and when of it all.

Okay, so why evaluate in the first place?

Isn't a good design just obvious?

Well, not really.

Because user expectations have completely shifted.

It used to be enough for something to just be functional.

Now people want a pleasing, engaging experience.

And from a business perspective?

It's absolutely crucial.

Well-designed products sell.

Evaluation focuses your team on real user problems, and most importantly, it lets you fix those problems before you ship, which saves a ton of money.

So what kinds of things do we evaluate?

Is it just about a whole app?

Well, it could be anything.

The scope is huge.

You could be testing something super specific, like the search speed on a web browser.

Or something really broad, like the overall aesthetic.

Exactly.

Or even safety features, like the controls for a city's traffic system or a kid's toy.

And that means you have to be careful about who you're testing with.

Oh, absolutely.

You can't just grab anyone.

The book has this great example about Facebook usage.

Right.

The one comparing adults and teenagers.

Teenagers are all about social identity, posting, sharing.

Adults might be more focused on family news or politics.

If you only evaluate with one of those groups, your design will almost certainly fail for the other.

That makes perfect sense.

Okay, so that's the what.

Let's talk about the where.

This feels like a huge, practical question.

It is.

And it's basically a tug of war between control and context.

On one side, you have very controlled settings.

The classic usability lab.

The classic lab, yeah.

It's perfect when you need to systematically measure something, like say how accessible a website is or the layout of keys on a tiny device.

You can isolate the variables.

But what if you're trying to measure something less concrete?

Like how long a kid will actually play with a new toy before getting bored?

Then you have to get out of the lab.

You need to be in a natural setting.

We call them in-the-wild studies.

Because the lab just isn't real life.

Not at all.

In a lab, people know they're being watched.

They're given specific tasks.

In the wild, you see their true, spontaneous behavior.

You see how interruptions and boredom actually happen.

The chapter also mentions a compromise here.

The living lab.

Right.

So a living lab is something like a home or a gym that's been kitted out with monitoring tech.

It tries to get the best of both worlds.

But I wonder, doesn't that just make the natural setting feel kind of artificial?

That's the big dilemma.

And it's a great question.

You get long-term data, but you do risk influencing the very behavior you're trying to observe.

It's a trade-off.

A balancing act, as always.

So last of the big four.

When do we evaluate?

All the time.

Evaluation that happens during the design process on sketches, on early prototypes.

That's called formative evaluation.

It's forming the product.

And then the big test at the end.

That's summative evaluation.

That's when you assess the success of the finished product.

Yeah.

But, you know, in modern agile design, evaluation can be almost continuous.

Okay.

That gives us the map.

Let's get into the actual methods.

The book groups them into three main categories, right?

Yep.

Three big buckets based on that control versus context idea we were just talking about.

Okay.

Lay them out for us.

First, you have controlled settings with users.

That's your labs.

Great for finding specific usability problems, but you lose the real-world context.

Second.

Natural settings with users.

So field studies.

Here you get all that rich context, but it can be messy and take a lot of time.

You have very little control.

And the last one is a bit different.

It is.

It's any setting, but without directly involving users.

These are methods where experts or consultants critique a design, or you use modern analytics.

It's fast, but you might miss some really unpredictable user behavior.

This is where it gets really interesting.

Let's break down those methods, starting with that high-control group.

What's usability testing, really?

So it's that structured process.

You bring users into a lab, you give them tasks, and you watch what happens.

You're observing, interviewing, maybe logging their clicks.

And the goal is just to see if they can do the task.

That's the core of it, yeah.

Can a typical user do a typical task?

But critically, the findings often go into what's called a usability specification.

Which sounds very formal.

It is, and it's important.

It's not just "make the button bigger." It's "a user must be able to complete this task in under 15 seconds, 95% of the time."

It sets a clear, measurable benchmark.
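To make that kind of benchmark concrete, here is a minimal sketch in Python of checking recorded task times against a usability specification. The class name `UsabilitySpec`, the function `meets_spec`, and the sample times are all hypothetical and only illustrate the "under 15 seconds, 95% of the time" example; they do not come from the textbook.

```python
from dataclasses import dataclass


@dataclass
class UsabilitySpec:
    """A measurable target: a share of attempts must finish within a time limit."""
    max_seconds: float      # time limit per attempt
    required_share: float   # fraction of attempts that must beat the limit


def meets_spec(completion_times, spec):
    """Return (passed, share) for a list of task times in seconds.

    A failed attempt can be recorded as float("inf") so it counts as a miss.
    """
    if not completion_times:
        raise ValueError("need at least one recorded attempt")
    within_limit = sum(1 for t in completion_times if t < spec.max_seconds)
    share = within_limit / len(completion_times)
    return share >= spec.required_share, share


# Hypothetical times from 20 test sessions (illustrative numbers only).
spec = UsabilitySpec(max_seconds=15.0, required_share=0.95)
times = [9.2, 11.0, 14.8, 12.3, 8.7, 10.1, 13.9, 9.9, 16.4, 12.0,
         11.5, 10.8, 9.4, 13.2, 12.7, 14.1, 10.3, 11.9, 17.2, 9.0]
passed, share = meets_spec(times, spec)
print(f"{share:.0%} of attempts under {spec.max_seconds:g}s, pass: {passed}")
```

In this made-up data set two of the twenty attempts run over 15 seconds, so the design would fall short of the 95% target even though most users finished comfortably.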

And we're seeing this kind of rigor pop up everywhere, like in healthcare, with things like Fitbit or other monitoring devices.

Oh, for sure.

Because even if a device looks beautiful, if the basic usability is bad, it could have serious consequences.

Aesthetics can't compromise function in those cases.

Now, if you need even more control, you go beyond testing.

You run an experiment.

Exactly.

An experiment is the gold standard for proving cause and effect.

You strip away every other variable you can.

Like that example with the tablet keyboards.

Right.

Comparing a virtual keyboard to a physical one to a swiping one, you measure speed and errors.

In a hospital, finding the keyboard that's even 5% faster could genuinely save time in an emergency.

That's why that level of control matters.
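As a rough illustration of how results from such an experiment might be summarized, the sketch below compares three keyboard conditions on speed and errors. The numbers, the units (words per minute, errors per 100 characters), and the function name `summarize` are assumptions made up for this example; the chapter's actual data is not reproduced here.

```python
from statistics import mean, stdev

# Hypothetical trial data per condition: (words per minute, errors per 100 characters).
# The keyboard type is the independent variable; speed and errors are the
# dependent variables being measured.
trials = {
    "virtual":  [(32.0, 4.1), (30.5, 3.8), (33.2, 4.5), (31.1, 4.0)],
    "physical": [(41.0, 2.2), (39.4, 2.6), (42.3, 2.0), (40.1, 2.4)],
    "swipe":    [(36.2, 3.1), (37.8, 2.9), (35.5, 3.4), (36.9, 3.0)],
}


def summarize(samples):
    """Return mean speed, speed spread, and mean error rate for one condition."""
    speeds = [speed for speed, _ in samples]
    errors = [err for _, err in samples]
    return {
        "mean_wpm": mean(speeds),
        "sd_wpm": stdev(speeds),
        "mean_errors": mean(errors),
    }


summary = {name: summarize(samples) for name, samples in trials.items()}
fastest = max(summary, key=lambda name: summary[name]["mean_wpm"])

for name, stats in summary.items():
    print(f"{name:9s} {stats['mean_wpm']:5.1f} wpm "
          f"(sd {stats['sd_wpm']:.1f}), {stats['mean_errors']:.2f} errors")
print("fastest condition:", fastest)
```

A real analysis would go on to test whether the differences are statistically significant; this sketch stops at descriptive statistics.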

Okay, let's flip to the complete opposite end of the spectrum.

Natural settings, field studies.

Field studies are all about discovery.

You go out into the world to see what people's needs are to find opportunities for new tech.

And a specific type of that is the in-the-wild study.

Yeah, that's where you actually deploy a new prototype, maybe even a disruptive technology, into people's everyday lives.

The researchers have to give up control.

They rely on things like diaries or check-in interviews.

Because you can't be there 24/7.

You can't.

The most interesting things might happen when you're not around.

And we can do this online too, right, in games or communities?

For sure.

That's a virtual field study.

The researcher might become a participant themselves, just to observe how people interact naturally in that digital space.

Alright, let's hit that third category.

The methods that don't directly involve users.

These are your inspection methods.

They rely on expert knowledge.

A common one is heuristic evaluation.

So you have an expert look at your design and apply a set of rules of thumb, or heuristics.

Exactly.

It's fast, it can be cheap, but there's a big catch.

Bias.

A huge potential for bias.

If your evaluators are inexperienced, or just have their own pet peeves, they can send your team down a completely wrong path.

And there's another one in this category, the cognitive walkthrough.

Right.

This one's a bit different.

You're simulating a user's thought process, step by step, as they try to figure something out.

It's really focused on evaluating how easy a system is to learn.

This is also where big data comes in.

Analytics.

Yes.

Web analytics, learning analytics.

You're logging and analyzing data from thousands, maybe millions of users remotely.

Stuff like Google Analytics.

How long people stay on a page, where they click.

All that stuff.

It gives you a huge amount of quantitative data.
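A minimal sketch of one such metric, average time on page, computed from a logged stream of page views. The log format `(session_id, timestamp_seconds, page)` and the function name `time_on_page` are assumptions for illustration; this is not how any particular analytics product stores or computes its data.

```python
from collections import defaultdict

# Hypothetical page-view log: (session_id, timestamp_seconds, page).
events = [
    ("s1", 0, "/home"), ("s1", 40, "/search"), ("s1", 100, "/item"),
    ("s2", 5, "/home"), ("s2", 25, "/item"),
]


def time_on_page(events):
    """Average seconds spent on each page, per visit.

    Time on a page is estimated as the gap until the same session's next
    page view, so the last page of each session contributes no duration.
    """
    by_session = defaultdict(list)
    for session, timestamp, page in events:
        by_session[session].append((timestamp, page))

    durations = defaultdict(list)
    for visits in by_session.values():
        visits.sort()  # order each session's views by timestamp
        for (ts, page), (next_ts, _) in zip(visits, visits[1:]):
            durations[page].append(next_ts - ts)

    return {page: sum(d) / len(d) for page, d in durations.items()}


averages = time_on_page(events)
print(averages)
```

With the sample log, `/home` averages 30 seconds (40 in one session, 20 in the other) and `/search` 60 seconds, while `/item` has no estimate because it is always the last page viewed.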

And finally, in this group, we have models.

Models are predictive.

The classic example is Fitts' Law.

It's a mathematical formula that can predict how long it will take you to move your mouse and click on a target of a certain size at a certain distance.

So designers can use that to optimize a layout before anyone even touches it.

That's the idea.

It proves, mathematically, why making a button too small or too far away is a bad idea.

It builds friction right into the system.
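The conversation does not spell out the formula, so the sketch below assumes the Shannon formulation commonly used in HCI, MT = a + b * log2(D / W + 1), where D is the distance to the target and W is its width. The constants `a` and `b` must be fitted from measurements for a given device and user group; the default values here are placeholders, not real data.

```python
import math


def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Predicted movement time in seconds (Shannon formulation of Fitts' law).

    distance and width must share the same unit (for example, pixels).
    a and b are placeholder constants; real values come from fitting
    observed pointing data.
    """
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    index_of_difficulty = math.log2(distance / width + 1)  # in bits
    return a + b * index_of_difficulty


# Compare a large nearby button with a small distant one.
near_and_big = fitts_movement_time(distance=300, width=100)
far_and_small = fitts_movement_time(distance=900, width=20)
print(f"near/big: {near_and_big:.2f}s, far/small: {far_and_small:.2f}s")
```

Whatever constants are fitted, the predicted time grows as the target gets smaller or farther away, which is the point the hosts make about button size and placement.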

So it's clear that you rarely just use one of these methods in isolation.

Almost never.

You combine them.

You might do a lab study and follow it up with some observation in the field.

And you mentioned opportunistic evaluations, too.

Right.

Which are just super informal checks, taking a sketch to a coffee shop, getting five minutes of feedback.

It's a quick way to see if an idea even has legs before you invest in building a prototype.

To really make this concrete, let's look at those two case studies.

First, the experiment on game engagement.

That was a super controlled lab study.

It was.

Mandryk and Inkpen, in 2004. They had people play an ice hockey game either against the computer or against a friend sitting right next to them.

And they wanted to measure something deeper than just performance.

They wanted to measure fun.

Pretty much.

And they did it in two ways.

First, they used questionnaires.

Unsurprisingly, people reported having way more fun playing against their friends.

Well, the really cool part was the other data they collected.

The physiological data.

They hooked people up to sensors that measured things like sweat and heart rate, classic indicators of excitement and arousal.

And what did that show?

It showed that, objectively, people were more physiologically aroused, more excited when playing against another person.

It was concrete proof that backed up the subjective feelings.

It was a huge step in showing we can actually measure these deeper user experience goals.

OK, now for the second case study, which is the complete opposite.

The Ethnobot field study.

Right.

This was a true in-the-wild study.

It took place at the Royal Highland Show, which is a massive outdoor agricultural event.

A place that would be impossible for a researcher to cover on their own.

Totally.

So they built a chatbot app called Ethnobot.

Participants had it on their phones and the bot would prompt them for feedback as they walked around the event.

So the bot was the data collector.

It was.

It would ask them to, say, press a button if they learned something new.

And it collected lots of that kind of quantitative data.

But it also let them leave open-ended comments.

And they followed up with real interviews, too.

They did.

And that was the key finding.

The bot was great for collecting data at scale in a really difficult environment.

But the human interviews.

That's where they got their really rich, detailed, qualitative stories.

So the lesson is that combining the two approaches gives you the best results.

Exactly.

An automated bot for scale and a human for depth.

It's a great model for these kinds of mobile, hard-to-access situations.

These studies really show the incredible range of evaluation.

Before we wrap, though, we have to touch on the really critical, non-methodological stuff.

Ethics.

Absolutely.

This is non-negotiable.

You have a duty to protect your participants from any kind of harm.

And there are formal structures for this, right?

Like IRBs.

Yes.

Institutional review boards in the U.S. and similar bodies elsewhere.

They have to approve your study plan.

And a core part of that is the informed consent form.

And that form has to spell everything out.

Everything.

What they'll be doing, how their data will be used, and crucially, that they have the right to withdraw at any time for any reason with no penalty.

And in a corporate setting, there's often another document involved.

The NDA, or non-disclosure agreement.

If you're testing a secret new product, the company needs to protect that information.

It just shows how serious data security is in commercial evaluation.

Beyond ethics, there's quality control.

Let's quickly clarify reliability and validity.

They sound similar, but they're very different.

They are.

Think of it like darts.

Reliability is consistency.

It's hitting the same spot over and over, even if that spot is way off from the bullseye.

So a controlled experiment would be highly reliable.

Very.

Whereas validity is about accuracy.

It's asking, are you actually measuring what you think you're measuring?

It's hitting the bullseye itself.
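The darts analogy can be sketched numerically. In the Python below, the spread of the throws around their own center stands in for reliability (lower spread means more consistent), and the distance from that center to the bullseye stands in for validity (lower offset means more accurate). The function names and coordinates are invented for illustration and are not formal definitions from the chapter.

```python
import math


def centroid(points):
    """Average position of a set of (x, y) throws."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def spread(points):
    """Mean distance of throws from their own centroid (consistency)."""
    cx, cy = centroid(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


def offset_from_target(points, target=(0.0, 0.0)):
    """Distance from the centroid of the throws to the bullseye (accuracy)."""
    cx, cy = centroid(points)
    return math.hypot(cx - target[0], cy - target[1])


# Hypothetical throws, with the bullseye at (0, 0).
tight_but_off = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9)]
scattered_on_target = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]

for name, throws in [("tight but off", tight_but_off),
                     ("scattered on target", scattered_on_target)]:
    print(f"{name}: spread={spread(throws):.2f}, "
          f"offset={offset_from_target(throws):.2f}")
```

The first set of throws is highly consistent but far from the bullseye, which mirrors a reliable measure with low validity; the second is centered on the bullseye but scattered, which mirrors the reverse.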

So a lab experiment to understand how a family uses a new smart thermostat would have low validity.

Terribly low validity.

You'd need to be in their home for that.

That brings up ecological validity, which is just about how the environment itself affects the results.

A lab has low ecological validity.

And that's tied to the Hawthorne effect, right?

Exactly.

The fact that people change their behavior just because they know they're being studied.

We also have to watch out for bias, like an expert evaluator who is too focused on one type of flaw.

And circling back just for a second to a practical method: crowdsourcing.

Right.

Tools like Mechanical Turk.

You can get data from thousands of people incredibly fast and cheap.

One study found it was a sixth of the cost of a lab study.

But there are ethical questions there, too.

There are.

Around fair pay and proper acknowledgement for the people doing that work.

It's a huge efficiency gain, but you have to balance that with treating people fairly.

This has been a really thorough tour of the whole evaluation world.

So to boil it all down, remember those three categories.

Controlled settings, natural settings, and no user settings.

And the difference between formative evaluation, which guides design, and summative, which judges it.

And no matter what method you choose, informed consent and a sharp eye for validity, reliability, and bias are just, they're mandatory.

Absolutely.

So what this all really means is that evaluation isn't just one test you run.

It's a whole toolkit, a continuous process where you have to be really thoughtful about matching the right tool to the right problem at the right time.

It's all about balancing that lab precision with real-world messiness.

And that leaves us with a final provocative thought for you to mull over.

That Mandryk and Inkpen study was so powerful because it used physiological data (heart rate, sweat) to get at a deeper human experience like fun.

Right.

It went beyond just time and errors.

So as we move into more immersive technologies like VR and AR, where engagement is so much deeper and more complex, what new physiological measures or data sources, beyond the ones we already use, might we need to truly evaluate the depth of those kinds of experiences?

Something for you to chew on as you move forward.

Thank you for joining us for this deep dive.

We hope this has fully equipped you with the framework needed to critique and execute effective design evaluation.

We'll catch you next time.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Evaluation serves as an integrated and essential component of interaction design, functioning as a systematic process for gathering and interpreting data about how users interact with design artifacts ranging from preliminary sketches and wireframes to fully realized interactive systems. The fundamental purpose of evaluation is to measure both usability (the ease with which users can learn and execute tasks) and the broader user experience, which encompasses emotional responses, engagement levels, and subjective satisfaction with the artifact. Conducting effective evaluation requires deliberate choices across four dimensions: understanding the rationale behind evaluation (identifying target user populations, discovering design flaws before deployment), determining what aspects merit examination (visual design, task workflows, accessibility compliance, security measures), selecting appropriate evaluation environments (laboratory settings offering controlled conditions, real-world contexts providing authentic usage patterns, or hybrid approaches like living labs that blend both), and timing evaluations within the design cycle (ongoing formative evaluation during iterative refinement or terminal summative evaluation after development concludes).

The methodological approaches available fall into distinct categories based on participant involvement and context. Evaluations conducted in controlled settings with active user participation include usability testing, which combines systematic observation with interviews and experimental protocols, alongside hypothesis-driven experiments designed to minimize confounding variables. Field-based evaluations also involving users encompass ethnographic observation and in-the-wild studies that prioritize ecological realism over experimental control to understand authentic user behaviors and establish design requirements.

Non-participant evaluation methods include inspection techniques such as heuristic evaluation, which applies established design principles, and cognitive walkthroughs, which trace hypothetical user problem-solving sequences to predict learning difficulty. Additional non-user approaches employ computational modeling of user performance using established laws like Fitts' law and leverage analytics platforms that track system usage patterns to inform optimization.

Evaluation typically integrates multiple methods to achieve comprehensive insight, with rapid informal assessments providing timely feedback on nascent concepts. Ethical and methodological rigor demands informed consent procedures, institutional oversight through review boards, and careful attention to data quality factors including reliability, validity, ecological validity as the realism of testing conditions, generalizability, and awareness of systematic biases such as the Hawthorne effect. Scalable evaluation becomes feasible through crowdsourcing platforms that mobilize large participant pools for distributed evaluation tasks, extending the reach and speed of user research.
