Welcome to Last Minute Lecture.
This free chapter overview is designed to help students review and understand key concepts.
These summaries supplement, not replace, the original textbook and may not be redistributed or resold.
For complete coverage, always consult the official text.
How do you start your day?
Maybe you ask a smart speaker for the weather forecast or tap your card at a transit turnstile or use a security pass to get into your office.
Every single one of those actions creates data.
And most of the time you're not even thinking about it.
Exactly.
The amount of data we're all generating is just staggering.
We're talking about data at a massive scale, what everyone now calls big data.
And it's not just numbers anymore, is it?
Not at all.
It's images, video, endless text from social media, even data from environmental sensors.
And it's all being collected at an exponential rate.
Like the stat about YouTube videos.
Right, something like 400 hours of new video are uploaded every single minute.
When you try to wrap your head around that, you start to get a sense of the sheer volume we're dealing with.
Okay, so let's unpack this.
Our mission for this deep dive is to really get into what this data flood means for us socially.
We need to understand the methods.
How is it collected and analyzed?
How do you turn all that raw information into something meaningful?
And maybe the most important part: how do we design these systems ethically?
Because there's this huge core tension here, isn't there?
On one hand, the potential benefits are just transformative.
Improving health care, city planning.
Right, reducing traffic, making streets safer,
all amazing things.
But on the other hand, people are worried.
Very worried about privacy and about fairness.
I mean, what if an algorithm denies you a loan for some reason you can't even understand?
That lack of transparency is a huge concern.
The power is immense, and the risk that comes with it is equally large.
So let's jump right in.
Let's talk about the approaches for collecting and analyzing all this data.
Because really, collecting it is becoming the easy part.
The real challenge is, what do you do with it?
How do you analyze it, collate it, and then act on it in a way that's socially acceptable?
You see this in big public spaces, like an airport.
Absolutely.
They might have a public display showing passenger flow.
Maybe the north security line is way busier than the south.
That's helpful.
For travelers and for the airport staff.
Sure.
But that public data is just the tip of the iceberg.
The real question is, what else are they collecting about everyone walking through there?
And how clear are they about that process?
That leads us right into one of the main methods, scraping, and what's called second source data.
Right.
So web scraping is just pulling data directly from websites.
But the second source idea is where researchers mine huge publicly available data sets.
Like social media posts and, crucially, search terms.
Exactly.
This is where a researcher like Seth Stephens-Davidowitz found some incredible insights.
He realized that what people type into a search box reveals truths they would never admit in a formal survey.
Like comparing search trends for "I hate my boss" versus "I love my boss."
Precisely.
People treat that search box like a private diary.
They confess things.
So the insight is that with big data, you have to ask the right questions to find those hidden patterns.
So from the massive crowd down to the individual, let's talk about the quantified-self movement.
That really kicked off around 2008.
It did.
And now it's everywhere.
All the wearables, all the apps on our phones, tracking everything.
Our mood, our sleep, energy levels, screen time.
And we're usually tracking ourselves against some kind of target or threshold we've set.
And for a lot of people, this is medically vital.
Oh, absolutely.
Yeah.
Monitoring blood glucose if you have diabetes or maybe figuring out your migraine triggers.
It helps people see patterns in their own lives and make really important changes.
But there's a design challenge there, isn't there?
A huge one.
If you just stream raw physiological data at someone, like a constant real-time feed of their heart rate or brain waves, it can create a lot of anxiety.
Yeah, I can see that.
So how you present that data is just as important as the data itself.
Now speaking of getting data from lots of people, let's talk about crowdsourcing.
This is more about active participation.
Right.
This is where citizens and researchers collaborate.
You see it in what's called crowd research, where hundreds of scientists might work together on a problem like climate change.
And in citizen science projects.
The perfect example is eBird.org.
It's a massive global database of bird sightings, all contributed by regular people, by citizen birders.
By mid-2018, it had over half a billion observations.
That sounds amazing for conservation, but it also raises a big red flag for me.
Privacy and location.
It's a critical issue.
How do you share the data without putting an endangered species or even the person who reported it at risk?
So how do they handle that?
A platform like iNaturalist.org has a really smart solution.
They use geoprivacy settings.
You can set your observation as open, obscured, or private.
Obscured.
What does that do?
It hides the exact location for vulnerable species.
It shares enough data for research, but doesn't give poachers, for instance, a precise map.
It's a necessary balancing act.
That is smart.
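As a rough illustration of that balancing act, here is a minimal Python sketch of how an "obscured" setting could work. It is not taken from iNaturalist's code: the function names, the 0.2-degree cell size, and the sample coordinates are all assumptions made for the example.

```python
import random

def obscure_location(lat, lon, cell_deg=0.2, rng=None):
    """Return a random point inside the grid cell that contains (lat, lon).

    cell_deg is a hypothetical cell size chosen for illustration.
    """
    rng = rng or random.Random()
    # Snap to the south-west corner of the containing grid cell.
    cell_lat = (lat // cell_deg) * cell_deg
    cell_lon = (lon // cell_deg) * cell_deg
    # Report a random point in that cell instead of the true position.
    return (cell_lat + rng.uniform(0, cell_deg),
            cell_lon + rng.uniform(0, cell_deg))

def shared_location(lat, lon, geoprivacy="open", rng=None):
    """Decide what location to publish for one observation."""
    if geoprivacy == "private":
        return None                      # nothing is shared
    if geoprivacy == "obscured":
        return obscure_location(lat, lon, rng=rng)
    return (lat, lon)                    # "open": exact coordinates

# Made-up coordinates for a sighting.
exact = shared_location(44.57, -93.26, "open")
fuzzy = shared_location(44.57, -93.26, "obscured", rng=random.Random(0))
hidden = shared_location(44.57, -93.26, "private")
```

The published point stays in the same broad area, which is enough for regional research, but it no longer marks the exact spot.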
Okay, let's shift gears a bit.
How do we make sensed data more meaningful?
I was fascinated by the Physikit project.
Yes.
This was a reaction against traditional dashboards. The Smart Citizen dashboard, for example, used what are called canonical visualizations: the standard, often very complex time-series graphs and charts that, frankly, a lot of people found confusing.
So Physikit's goal was different.
They called it human data design.
Exactly.
They wanted to turn data into a physical ambient presence in the home.
And that's where the PhysiCubes come in.
These are little cubes that can light up or rotate parts based on sensor readings.
And the best example was a household that connected a cube to a basil plant.
The cube's rotation was tied to the humidity level in the kitchen.
So what happened?
If the humidity was too high, the cube would stop rotating.
The basil plant would then naturally lean towards the nearest window for light, and its shape would become a living, growing physical visualization of the room's humidity over time.
That's incredible.
It connects the data back to the environment in such a tangible way.
It's almost artistic.
Moving back to large scale analysis, let's talk about sentiment analysis.
This is a way to infer what a crowd is feeling.
You can do it by scoring phrases, say from negative 10 to positive 10, or by classifying specific emotions like joy or anger.
And it's used by marketing teams, researchers.
All the time.
Studying public opinion on anything from a new product to public transport.
But computers often get this wrong, don't they?
They do.
It's not objective.
It's a heuristic.
And it misses nuance all the time.
Slang, sarcasm, irony.
The classic example is a teenager texting, "I am weak."
A sentiment analysis tool might flag that as negative, thinking they feel ill.
When it actually means, "That's hilarious."
Exactly.
It's slang for "that made me laugh so hard."
You have to train the system on culture, not just words.
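To make the scoring idea concrete, here is a minimal Python sketch of a word-list scorer on the negative 10 to positive 10 scale. The word scores and the slang table are invented for illustration, and real tools are far more sophisticated; the point is only to show why a word-by-word heuristic trips over slang.

```python
import re

# Hypothetical word scores on the -10..+10 scale (invented for this example).
LEXICON = {"love": 8, "great": 6, "hate": -8, "weak": -5, "late": -3}

# Hypothetical slang table: whole phrases whose meaning differs from their words.
SLANG = {"i am weak": 7}

def score(text, use_slang=False):
    """Score a phrase by summing word scores, clamped to the range [-10, 10]."""
    # Lowercase the text and strip punctuation.
    cleaned = " ".join(re.findall(r"[a-z']+", text.lower()))
    # Check whole-phrase slang first, if enabled.
    if use_slang and cleaned in SLANG:
        return SLANG[cleaned]
    total = sum(LEXICON.get(word, 0) for word in cleaned.split())
    return max(-10, min(10, total))

naive = score("I am weak.")                  # word-by-word reading
aware = score("I am weak.", use_slang=True)  # phrase-level slang reading
```

Here `naive` comes out negative because "weak" is scored as a negative word, while `aware` comes out positive once the phrase is recognized as slang.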
Okay, another big one for analyzing groups, social network analysis or SNA.
SNA comes from social network theory, and it's all about visualizing relationships and social ties.
So you have two main things, nodes and edges.
Right.
The nodes are the people or topics or organizations, and the edges are the links between them.
Then you can use metrics to see who's most central or influential in that network.
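One of the simplest of those metrics is degree centrality: the share of the other nodes that a given node is directly linked to. The Python sketch below assumes an undirected network given as a list of edges; the names in it are made up.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Map each node to the fraction of other nodes it is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Made-up friendship network: each pair is one edge between two nodes.
edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ana", "Dee"), ("Ben", "Cy")]
centrality = degree_centrality(edges)
most_central = max(centrality, key=centrality.get)
```

In this toy network Ana is linked to everyone else, so she gets the highest score. Other metrics, such as betweenness, capture different notions of influence.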
And there is no more striking example of this than the visualizations of U.S. Senate voting patterns.
It's an incredible visualization.
If you look at the graph from 1989, the Republican nodes in red and the Democrat nodes in blue are pretty intermingled.
Showing they voted together on a lot of issues.
A lot.
But by 2013, the graph is completely different.
The two colors have pulled apart into two distinct separate clusters with almost no links between them.
It visualizes political polarization more powerfully than any table of statistics ever could.
It really does.
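One simple way to put a number on that pulling apart is to count what share of the links cross between the two groups. The Python sketch below uses tiny invented networks, not real Senate records, and the function name is an assumption for this example.

```python
def cross_group_share(edges, group_of):
    """Return the fraction of edges that connect nodes from different groups."""
    if not edges:
        return 0.0
    crossing = sum(1 for a, b in edges if group_of[a] != group_of[b])
    return crossing / len(edges)

# Invented toy data: two "red" nodes and two "blue" nodes.
group_of = {"r1": "red", "r2": "red", "b1": "blue", "b2": "blue"}

# Intermingled network: half of the links cross between groups.
mixed = [("r1", "b1"), ("r2", "b2"), ("r1", "r2"), ("b1", "b2")]
# Polarized network: links only exist inside each group.
split = [("r1", "r2"), ("b1", "b2")]

mixed_share = cross_group_share(mixed, group_of)
split_share = cross_group_share(split, group_of)
```

A share near zero corresponds to the two separate clusters described above; a higher share corresponds to the intermingled picture.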
Now, before we move on, we have to talk about the Quantified Toilets project.
Ah, yes.
A very provocative probe.
It was a way to test people's reactions to being tracked without them really knowing it was a study.
So researchers set up this fake service at a public convention.
They claimed it could analyze your urine for all kinds of health data: blood alcohol, drugs, even pregnancy.
And they had fake real-time data feeds online.
It was all completely fake, but the reactions from the public were very, very real.
And what were they?
All over the place.
Disapproval, of course.
But also resignation, some approval, and even voyeurism.
The project went viral almost instantly.
And it started a huge public debate.
A debate that a simple survey never could have sparked.
It really exposed that gap between the potential personal benefits and the, you know, the creepy feeling of being under constant surveillance.
That's a perfect lead-in to the idea of combining data sources.
Researchers know they need both the automatic sensor data and what people report themselves.
The StudentLife study from Dartmouth College is really the gold standard for this.
This is where they tracked 48 students for 10 weeks.
Using their smartphones.
They logged everything.
When they woke up, how much they walked, how often they were in conversations, where they went on campus.
All passively.
And then they compared that to surveys and their actual grades.
And the findings were so clear.
Behavioral factors.
Things like physical activity, conversation frequency, and just showing up to class all correlated directly with their grades.
And the timeline visualization is just classic college student behavior.
It's predictable, but seeing it in the data is amazing.
High activity and very little sleep at the start of the term.
Partying.
Followed by a huge drop in class attendance and sleep right before final exams.
It's measurable proof of that whole experience.
Okay, so that's how we collect data.
But making it useful means we have to be able to see it to understand it.
Let's get into visualizing and exploring data.
And to start, you need something called visual literacy.
Which is just the skill set to look at a map or a graph and actually understand what it's telling you.
Exactly.
There's a clear path to how data becomes meaningful.
It starts with data analysis, which leads to a presentation.
The user then perceives and interprets that presentation.
Cognition basically.
Right.
And that leads to understanding and communication.
It's a whole process.
And the ultimate goal of any visualization, as researcher Stu Card said, is to amplify human cognition.
Yes.
To help us see patterns and trends and anomalies that would just be lost in a giant spreadsheet.
There's that great mantra from Ben Shneiderman for how to interact with these displays.
Overview first, zoom and filter, and then details on demand.
It's the perfect summary.
Now, we have standard bar charts and scatter plots, but visualizations have evolved quite a bit.
Let's talk about treemaps.
So treemaps were originally invented in computer science.
The goal was to visualize a computer's file system to see which folders were taking up the most disk space.
But they didn't stay there.
No, they jumped fields completely.
Now, financial reporters use them as market maps to show the stock market.
Size and color instantly tell you how a stock is performing.
It's a fantastic adaptation.
Another really interesting one is the spectrogram.
This is for visualizing sound, like bird calls.
They can compress the visual representation so much that one single pixel can represent a whole minute of sound.
So a birder could look at a single screen and see the sonic patterns of an entire day.
That's it.
It lets them see patterns and can even evoke the memory of being out in the field and hearing those sounds.
It's a very powerful tool.
Of course, when we're dealing with huge, constantly updating data sets, the interface we see most is the dashboard.
A dashboard is basically an interactive control panel.
You've got your widgets, like sliders and checkboxes, and multiple displays showing things like heat maps or word clouds.
And the key is that everything is linked.
Everything.
It all draws from the same underlying data set.
Tools like Tableau or Power BI are all about this.
And then for more specialized custom stuff, there's D3.
D3, or Data-Driven Documents.
It's a very powerful JavaScript library.
Journalists especially use it to create those amazing web-based interactive graphics you see that tell really complex data stories.
Which brings us to our final section, and honestly the most important part of this whole discussion, ethical design concerns.
Right.
And when we talk about ethics, we mean standards of conduct that separate right from wrong.
Professional bodies like the ACM and IEEE are very clear on this, stressing human rights, fairness, and transparency.
A core philosophy here is privacy by design.
Yes.
The idea is to build privacy protections in from the very start, not try to bolt them on later.
So you avoid collecting excessive sensitive data in the first place.
Or you do things like analyzing data on the device itself on your phone instead of uploading it all to some central cloud server.
It's also about establishing trustworthiness and social acceptability.
Do people actually agree to how their data is being used?
But what are the clear boundaries?
Especially for sensitive data like your health history or criminal record.
The DeepCam system is a perfect, really thorny ethical dilemma here.
It is.
So this is a system that passively monitors people in stores using facial recognition to try and identify potential shoplifters.
Now the designers made a choice to anonymize the data so the faces aren't linked to names or addresses.
But the question remains: is that practice socially acceptable?
Does reducing crime outweigh that creepy feeling of being constantly monitored by an AI?
There's no easy answer.
That tension leads us directly to the four core principles for ethical design.
People often refer to them by the acronym FATE.
FATE.
Fairness, accountability, transparency, and explainability.
These are at the heart of modern AI discussions and regulations like GDPR.
Okay, so first, fairness.
Fairness is about ensuring impartial treatment.
It means you have to actively find and fix biases in your data sets so that your system doesn't make unfair decisions.
Like a loan algorithm that discriminates against a certain demographic.
Next, accountability.
This means that an intelligent system has to be able to explain its decisions and someone has to be responsible for them.
Is it the person who wrote the code?
The organization using the system?
We have to know who is accountable.
Then transparency.
You have to make the system's decision making process visible.
We really need to critique this trend of black box AI where a system gives you a recommendation or a diagnosis but no reason why.
And finally, explainability.
The system has to provide explanations that a normal person, a layperson, can actually understand.
And research shows that when a system can explain why it did something, it builds a huge amount of trust with the user.
And the need for these principles is made so incredibly clear by real world bias.
The facial recognition algorithm example is just chilling.
A 35% error rate for darker-skinned women compared to less than 1% for lighter-skinned men.
And that's because the training data was overwhelmingly white and male.
Exactly.
If you don't enforce fairness, the system will just bake in and even amplify existing societal biases.
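A first step toward finding that kind of bias is simply to measure the error rate separately for each group rather than overall. The Python sketch below shows the idea on a handful of invented records; the group labels and outcomes are placeholders, not figures from any real study.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, predicted, actual) tuples.

    Returns a dict mapping each group to its share of wrong predictions.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Invented toy records: (group, what the system predicted, the true label).
records = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "match"),   # an error
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "no_match"),
    ("group_b", "match", "match"),
    ("group_b", "match", "match"),
]
rates = error_rate_by_group(records)
```

An overall error rate would hide the difference between the two groups here; splitting it out makes the gap visible so it can be investigated and fixed.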
To end on a more positive design note, some projects are building ethics in from the ground up.
The Living Room of the Future project is a great example.
To ensure that any personalization is transparent and trustworthy, all the personal data is stored on a local, home-based data server.
A Databox.
A Databox that never leaves the living room.
The data never goes to the cloud.
The individual is in complete control, full stop.
What an incredible journey.
We started with just asking for the weather and ended up at global ethical frameworks and the need for data autonomy.
I think the main takeaway is this.
Data at scale gives us revolutionary power.
And that power comes from combining all these diverse sources of information.
But that power is only beneficial if it's balanced by strong ethical principles.
We have to demand fairness, accountability, transparency, and explainability.
Every single time.
That really is the ultimate challenge.
So for you, our listener, here's a final thought to chew on.
If an automated system were making a critical decision that would severely impact your life, maybe denying you a visa or a job or a mortgage, what specific feature based on those FATE principles would you demand?
Would you want the ability to interrogate the system's logic, to negotiate with the parameters it's using?
Think about that tension between automation and your own human autonomy the next time you go about your day, creating your next data point.