Chapter 5: Discrete Probability Distributions

Audio Overview

Welcome to Last Minute Lecture.

This free chapter overview is designed to help students review and understand key concepts.

These summaries supplement, not replace, the original textbook and may not be redistributed or resold.

For complete coverage, always consult the official text.

Welcome to the Deep Dive.

We're here to unpack complex topics, pull out the key info, and get you smart fast.

That's the plan.

Today we're tackling something that often feels like random chance, but sometimes there are hidden patterns.

Think about the NFL's overtime coin toss.

Yes, a classic.

So between 1974 and 2011, there were 460 games decided in overtime.

And the team that won the coin toss, they won the game 252 times.

That's 54.8%.

A bit over half.

Exactly.

Then they changed the rules in 2012.

Since then, out of 121 games, the coin toss winner won 67 times.

Still up there at 55.4%.

So 55%, is that like genuinely an advantage or could it just be, you know, random fluctuation?

My gut kind of leans towards randomness.

Well, that's precisely the intuition we want to explore today.

Our mission really is to give you the statistical tools to tell the difference.

We're going to dive deep into discrete probability distributions.

It's about how you quantify chance, figure out what's truly random versus statistically significant.

And apply it to real stuff like that football example, or even, I hear, lotteries.

Absolutely.

We'll even touch on how a couple, Jerry and Marge, legally beat the lottery using these ideas.

The goal is to help you cut through all the data noise and quickly grasp what's important when chance is involved.

All right.

So to really get into this, we need to build on the idea of random variables.

These are numerical outcomes determined by chance, right?

Exactly right.

And there are two main types: continuous ones, like, say, body temperature, which can take infinitely many values on a scale.

Right.

But today, we're zeroing in on discrete random variables.

These are the ones you can actually count, like the number of heads you get flipping a coin twice, or the number of females in two births.

Finite or countable outcomes.

Precisely.

And that's where probability distributions really come into play.

They show us the patterns.

So a probability distribution is basically the blueprint for a random variable.

You could put it that way.

It's a description, which could be a table, a formula, or sometimes a graph, that gives you the probability for each possible value of that discrete random variable.

It's like a theoretical model of what we expect.

Okay.

But not just any table or list qualifies, right?

There are rules.

Three essential requirements for it to be a valid probability distribution.

First, the random variable, let's call it X, must be numerical, and each X value has to have a probability linked to it.

So no categories like country names.

Exactly.

Numerical outcomes.

Second, if you sum up all the individual probabilities for every possible outcome, it has to equal one.

Representing 100% of the possibilities.

Precisely.

We might allow for tiny rounding errors like 0.999 or 1.001, but it should be essentially one.

And the third rule?

Each individual probability, P(x), must be between zero and one inclusive.

Probability can't be more than 100%.

Makes sense.

Those are the guardrails.

Absolutely.

Let's use that simple example.

Number of females in two births.

If X is the number of females, X could be zero, one, or two.

Right.

And if we, just for simplicity, assume boys and girls are equally likely.

Which we'll maybe question later.

Yes.

But for now, assume P(boy) = 0.5 and P(girl) = 0.5. Then the probability of zero females, meaning two boys, is 0.25.

Probability of one female, a boy and a girl in either order, is 0.50.

And probability of two females is 0.25.

So let's check our rules.

Okay.

X is numerical, zero, one, two, and has probabilities.

Check.

Sum of probabilities, 0.25 plus 0.50 plus 0.25, equals 1.0.

And each probability, 0.25, 0.50, 0.25, is between zero and one.

Check.

So that's a valid probability distribution.

Perfect.

And it's just as vital to spot when something isn't valid.

Our source material mentions a software piracy example.

Where they listed countries and percentages of unlicensed software.

Right.

That failed on two counts.

The values were country names, not numerical.

Not a random variable.

And the proportions, when added up, came to something like 2.09, way over one.

Clearly not a probability distribution.

So always check the rules.

Definitely.
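As an aside, the three requirements are easy to check mechanically. The following is a minimal Python sketch, not something from the source: the function name `is_valid_distribution`, the `tol` rounding allowance, and the "Country A/B/C" figures are all assumptions made up for illustration (the source's actual piracy data is not reproduced here).

```python
import math

def is_valid_distribution(dist, tol=0.001):
    """Check the three requirements for a discrete probability distribution.

    dist maps each value x to its probability P(x).
    tol is an assumed allowance for rounding error in the sum.
    """
    # Requirement 1: every value of the random variable is numerical.
    numeric = all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in dist
    )
    # Requirement 3: every probability lies between 0 and 1 inclusive.
    in_range = all(0 <= p <= 1 for p in dist.values())
    # Requirement 2: the probabilities sum to 1, up to rounding error.
    sums_to_one = math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    return numeric and in_range and sums_to_one

# Number of females in two births, assuming boys and girls are equally likely.
two_births = {0: 0.25, 1: 0.50, 2: 0.25}

# Hypothetical values invented for illustration: labels instead of numbers,
# and proportions that add up to well over 1.
not_a_distribution = {"Country A": 0.90, "Country B": 0.70, "Country C": 0.49}

print(is_valid_distribution(two_births))          # True
print(is_valid_distribution(not_a_distribution))  # False
```

The second table fails for the same two reasons described in the conversation: its values are not numerical, and its proportions do not sum to one.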

And sometimes it helps to visualize these.

We use something called a probability histogram.

Looks like a regular histogram.

Pretty much.

But the vertical axis shows the probability, P(x), instead of frequency.

The bars are centered over the x values.

And the area of each bar actually equals the probability for that outcome.

Gives you a nice visual sense of the distribution.

Okay.

So we know what they are.

How to check if they're valid.

How to visualize them.

Now, how do we, you know, extract meaning?

How do we quantify the patterns?

That's where we get into the parameters of the distribution.

Since a probability distribution describes the entire theoretical population of outcomes, we calculate parameters, not sample statistics.

We're talking about the mean, variance, and standard deviation.

Got it.

Population parameters.

How do we find the mean?

The mean is calculated by summing up the product of each value x and its probability P(x).

So μ = Σ[x · P(x)].

Okay.

Multiply each outcome by its chance, then add them all up.

Exactly.

And this mean has another really important name, the expected value, E.

They're the same thing.

E, expected value.

Expected value.

We'll come back to that.

What about variance and standard deviation?

Measuring the spread.

Right.

The variance measures the average squared deviation from the mean.

The formula is σ² = Σ[(x − μ)² · P(x)].

Looks a bit involved.

It can be.

There's actually a shortcut formula that's often easier for manual calculation.

You square each x, multiply by its probability, sum those up, and then subtract the square of the mean you already calculated.

That sounds a bit more manageable.

And the standard deviation is just the square root of the variance.

It brings the measure of spread back into the original units of x.
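To make those formulas concrete, here is a small Python sketch (an illustration, not the source's own code) that computes the mean, the variance by both the definition and the shortcut, and the standard deviation for the two-births distribution. The function name `summarize` is an assumption chosen for this example.

```python
import math

def summarize(dist):
    """Return (mean, variance, shortcut_variance, standard_deviation).

    dist maps each value x to its probability P(x).
    """
    # Mean (expected value): sum of x * P(x).
    mean = sum(x * p for x, p in dist.items())
    # Variance by definition: sum of (x - mean)^2 * P(x).
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())
    # Shortcut: sum of x^2 * P(x), minus the square of the mean.
    shortcut = sum(x ** 2 * p for x, p in dist.items()) - mean ** 2
    return mean, variance, shortcut, math.sqrt(variance)

# Number of females in two births, assuming P(girl) = 0.5.
two_births = {0: 0.25, 1: 0.50, 2: 0.25}
mean, variance, shortcut, sd = summarize(two_births)

print(mean)          # 1.0
print(variance)      # 0.5
print(shortcut)      # 0.5
print(round(sd, 1))  # 0.7
```

Both variance formulas agree, and the standard deviation of about 0.7 females is in the same units as x.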

And rounding.

We should probably mention that.

Good point.

Generally carry one more decimal place than the x values have.

If x values are integers, round to one decimal place.

But use common sense.

The source mentions an example about jet engines where the mean was like 3.999714.

Yeah, rounding that to 4.0 would lose important precision.

Exactly.

Sometimes you need those extra digits.

Okay, let's circle back to expected value, E.

You said it's the same as the mean.

What does it really represent?

It represents the theoretical average outcome.

If you were to repeat the procedure like infinitely many times, it's the long run average you'd expect to get.

You might never actually get that exact value in a single trial, but it's the central tendency over the long haul.

And this has real world uses.

You mentioned casinos.

Oh, definitely.

It's crucial for understanding games of chance and making informed decisions, even if it's just deciding which bet is slightly less bad for you.

Uh-huh, right.

Less bad.

The source compares a $5 bet on a single number like seven in roulette versus a $5 bet on the pass line in craps.

Okay, roulette, you bet $5 on number seven.

What are the chances?

On a standard American wheel, there are 38 slots: 1 through 36, plus 0 and 00.

So there's a 1 out of 38 chance you win.

If you win, you get your $5 back plus $175 profit.

And a 37 out of 38 chance you lose your $5.

Correct.

So the expected value is the loss amount times the probability of losing, plus the win amount times the probability of winning.

That's (−$5 × 37/38) plus ($175 × 1/38).

Crunching the numbers.

It comes out to approximately negative 26 cents.

Minus 26 cents.

So on average, every time you make that $5 bet, you expect to lose 26 cents in the long run.

Exactly.

Now let's look at the $5 pass line bet in craps.

The probabilities are more complex, but they work out to a 251/495 chance of losing $5 and a 244/495 chance of winning $5.

Okay, so the expected value there.

It's (−$5 × 251/495) plus ($5 × 244/495), which comes out to about negative seven cents.

Wow, only negative seven cents.

Right.

So even though the roulette payout looks much bigger, $175 versus $5, the craps bet is actually better in the long run because you lose less per dollar bet on average.

Precisely.

Expected value cuts through the allure of the big win and shows you the underlying mathematical reality.

That difference, seven cents versus 26 cents, is how casinos structure their games to ensure profitability.
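The two bets can be compared with a few lines of Python. This is a sketch under the stated odds: a 1/38 chance of winning $175 on a single roulette number, and pass-line odds of 244/495 to win and 251/495 to lose. The function name `expected_value` and the list-of-pairs layout are assumptions for illustration.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of (net amount * probability) over all outcomes of a bet."""
    return sum(amount * prob for amount, prob in outcomes)

# $5 on a single number in American roulette: 38 slots, 35-to-1 payout.
roulette = [(-5, Fraction(37, 38)), (175, Fraction(1, 38))]

# $5 on the pass line in craps: even-money payout.
craps = [(-5, Fraction(251, 495)), (5, Fraction(244, 495))]

print(round(float(expected_value(roulette)), 2))  # -0.26
print(round(float(expected_value(craps)), 2))     # -0.07
```

Exact fractions are used so that the results, −5/19 and −7/99 of a dollar, carry no rounding error until the final print.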

It really reframes risk, but you mentioned someone actually beat the system using this, Jerry and Marge.

Yes, Jerry and Marge Selbee.

It's a fantastic story.

They didn't cheat, they used math.

Jerry noticed a specific state lottery game, Winfall, had a feature where if the jackpot wasn't won and reached a certain limit, the prize money would roll down and significantly increase the lower tier prizes.

Ah, so the expected value changed.

Dramatically.

During these roll down weeks, Jerry calculated that the expected return on a $1 ticket was actually greater than $1, so the expected value of buying a ticket became positive.

So he and Marge, and eventually a group they formed, started buying massive numbers of tickets during roll downs.

They basically treated it like an investment.

Exactly.

They spent millions, had huge losses sometimes, but because the expected value was positive over time, they came out way ahead.

They ended up doing the same with lottery games of similar structure, all completely legally.

It's maybe the best real world example of the power of expected value.

That's incredible.

Okay, so we have the mean, the standard deviation, the expected value.

How do we use these to formally decide if a result is significant or just a random blip?

There are a couple of common approaches.

The first is a quick guideline called the range rule of thumb.

Range rule of thumb.

How does that work?

It's pretty simple.

It basically says that most values should fall within two standard deviations of the mean.

Okay.

So we define significantly low values as being μ − 2σ or lower, and significantly high values as μ + 2σ or higher.

Anything in between those limits is considered not significant or usual.

Let's apply that to the two births example again.

The mean, μ, was 1.0 female, and the standard deviation, σ, was 0.7 females.

Is getting two females in two births significantly high using this rule?

Okay, we calculate the upper limit.

μ + 2σ = 1.0 + 2(0.7) = 1.0 + 1.4 = 2.4 females.

And our outcome was two females.

Since two is not greater than or equal to 2.4, the range rule of thumb tells us that two females in two births is not significantly high.

It's within the usual range.

Simple enough.

What's the other approach?

The second approach uses probabilities directly.

It's often considered more precise and ties into the rare event rule for inferential statistics.

The rare event rule.

How does that work?

It defines significance based on probability thresholds.

An outcome x is considered significantly high if the probability of getting x or more successes is less than or equal to a small value, usually 0.05.

P(x or more) ≤ 0.05.

Right.

And x is significantly low if the probability of getting x or fewer successes is less than or equal to 0.05.

P(x or fewer) ≤ 0.05.

And that 0.05 is standard.

It's a very common convention in many fields, but it's not absolutely rigid.

Sometimes 0.01 or 0.10 might be used depending on the context, but 0.05 is the typical default.

So what's the underlying logic?

The rare event rule part.

The core idea is this.

If you make an assumption, like say a coin is fair, p = 0.5, and under that assumption the probability of observing what you actually observed is really small, like less than or equal to 0.05, then you start to doubt your initial assumption.

Because if the outcome occurs anyway, it suggests the assumption is probably wrong.

This is fundamental to hypothesis testing in statistics.

Let's bring back the NFL coin toss.

252 wins in 460 games.

The assumption is random chance, p = 0.5.

Is 252 significantly high using this probability rule?

Okay, so we need to calculate the probability of getting 252 or more wins in 460 trials if p = 0.5.

Using statistical methods like the binomial distribution we'll discuss shortly, that probability turns out to be 0.0224.

0.0224.

And how does that compare to our threshold?

Well, 0.0224 is definitely less than or equal to 0.05.

So according to the rare event rule, 252 wins is indeed a significantly high number of wins.

The probability of getting that many wins or more just by chance is really low.

Which casts serious doubt on the assumption that the coin toss outcome didn't matter.

Precisely.

This statistical evidence strongly suggested that winning the coin toss did provide an advantage, justifying the NFL's rule change in 2012.

It shows how statistics can directly lead to policy changes.

That's a powerful connection.
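For readers who want to reproduce that tail probability, here is a hedged Python sketch that sums the exact binomial probabilities for 252 through 460 wins out of 460 games with p = 0.5. The function name `binomial_tail_at_least` and the `threshold` variable are assumptions for illustration; the source only states that technology gives about 0.0224.

```python
from math import comb

def binomial_tail_at_least(k, n, p):
    """P(X >= k) for a binomial random variable with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# NFL overtime games, 1974 to 2011: 252 coin toss winners won out of 460 games.
p_tail = binomial_tail_at_least(252, 460, 0.5)
threshold = 0.05  # conventional cutoff for "significantly high"

print(p_tail)               # a value close to the 0.0224 quoted above
print(p_tail <= threshold)  # True
```

Because the tail probability falls below the threshold, the rare event rule flags 252 wins as significantly high under the assumption of no advantage.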

Okay, let's get into some of those specific types of distributions you mentioned.

First up, binomial probability distributions.

Right.

These are super important and come up a lot.

The binomial distribution applies to situations where you have a procedure that meets four specific conditions.

Four conditions.

Okay, what's the first one?

One, there must be a fixed number of trials.

Let's call it n.

You know exactly how many times you're repeating the procedure, like flipping a coin 10 times, so n = 10.

Got it.

Fixed trials.

Second.

Two, the trials must be independent.

The outcome of one trial cannot influence the outcome of any other trial.

Flipping a coin, each flip is independent.

Okay.

Third.

Three, each trial must have outcomes that fall into exactly two categories.

We usually label them success, S, and failure, F.

Success and failure.

But success doesn't always mean something good, right?

Absolutely not.

It's just a label for the outcome we're interested in counting.

If you're studying defects, a success might be finding a defective item.

It's just convention.

Okay.

Context matters.

And the fourth condition.

Four.

The probability of success, p, must remain constant for every single trial.

The probability of failure, q, is then just 1 − p.

So fixed trials, independent trials, two outcomes, constant probability.

If all four hold, you've got a binomial setting.

You got it.

We also have standard notation: n for the number of trials, x for the specific number of successes we're interested in, p for the probability of success, q for the probability of failure, and P(x) for the probability of getting exactly x successes in n trials.

Now, that independence condition, what if you're sampling without replacement, like drawing names from a hat?

Technically, the probabilities change slightly each time.

That's a great point.

Technically, sampling without replacement violates independence.

However, there's a practical guideline.

The five percent rule for treating selections as independent.

Okay.

What's the rule?

If your sample size n is no more than five percent of the total population size N, so n ≤ 0.05N, you can generally treat the selections as independent, even though they technically aren't.

The change in probability is so small, it usually doesn't significantly affect the results.

It makes calculations much easier.

That's very practical.

How do we find the actual binomial probabilities, P(x)?

Well, there is a formula, the binomial probability formula, formula five in this source.

It involves factorials and powers of p and q: P(x) = n! / [(n − x)! x!] · p^x · q^(n − x).

It's complicated to do by hand.

It can be, especially for larger N.

While understanding the formula is good, the reality is that nowadays, we almost always use technology.

Statistical software like StatCrunch, Minitab, SPSS, or even graphing calculators like the TI-83/84 have built-in functions to compute binomial probabilities instantly.

So rely on the tech.

Pretty much.

Using tables like table A1 mentioned in the source is becoming quite obsolete.

Tech is faster and more accurate.

Can we walk through a quick example?

Let's say the probability an adult smartphone owner is cashless is 0.05.

What's the probability that if we randomly select 10 adults, exactly two of them are cashless?

Okay, here n = 10 trials (selecting 10 adults), x = 2 successes (being cashless), p = 0.05 is the probability of success, and q = 1 − 0.05 = 0.95.

We plug this into our calculator or software.

And it spits out.

P(2) ≈ 0.0746.

So there's about a 7.5% chance that exactly two out of 10 randomly selected smartphone owners would be cashless, given that initial 5% probability.

See, much easier with tech.
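The "tech" step here can be as simple as a few lines of Python. This sketch applies the binomial probability formula directly with n = 10, x = 2, and p = 0.05; the function name `binomial_pmf` is an assumption for illustration, standing in for whatever calculator or statistics package is at hand.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = n! / ((n - x)! x!) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)

# Exactly 2 cashless adults among 10 selected, with P(cashless) = 0.05.
p_two = binomial_pmf(2, 10, 0.05)
print(round(p_two, 4))  # 0.0746
```

`math.comb(n, x)` computes the n! / ((n − x)! x!) factor without building the large factorials by hand.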

Definitely.

Now, another neat thing about binomial distributions is that their parameters (mean, variance, standard deviation) have much simpler formulas than the general ones we discussed earlier.

Oh, simpler how?

For a binomial distribution, the mean is just n times p.

μ = np.

The variance is n times p times q.

σ² = npq, even better.

And the standard deviation is simply the square root of npq.

Those are really handy shortcuts.

They really are.

Let's use them to revisit the NFL overtime coin toss one more time.

Remember, 460 games, so n = 460, and we assumed no advantage.

So p = 0.5 and q = 0.5.

Okay, using the simple formulas, the mean number of wins we'd expect for the coin toss winner is μ = np = 460 × 0.5 = 230.0 games.

Exactly.

And the standard deviation is σ = √(npq) = √(460 × 0.5 × 0.5) ≈ 10.7 games.

Now, let's apply the range rule of thumb with these values.

Significantly high wins would be μ + 2σ = 230.0 + 2(10.7) = 230.0 + 21.4 = 251.4 games or more.

And the actual observed result was 252 wins.

Which is greater than 251.4.

So using these binomial parameters and the range rule, we again reach the conclusion that 252 wins was significantly high.

It reinforces our earlier finding using the probability method.

The evidence points towards a real advantage.

Consistent results using different methods.

That builds confidence.
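The same check is quick to script. This Python sketch uses the binomial shortcut formulas (mean np, variance npq) with n = 460 and p = 0.5, then applies the range rule of thumb to the 252 observed wins. The function name `binomial_parameters` and the cutoff variable names are assumptions for illustration.

```python
import math

def binomial_parameters(n, p):
    """Return (mean, variance, standard deviation) of a binomial distribution."""
    q = 1 - p
    mean = n * p
    variance = n * p * q
    return mean, variance, math.sqrt(variance)

mean, variance, sd = binomial_parameters(460, 0.5)

# Range rule of thumb: values at least 2 standard deviations from the mean
# are treated as significantly low or significantly high.
low_cutoff = mean - 2 * sd
high_cutoff = mean + 2 * sd

print(mean)                   # 230.0
print(round(sd, 1))           # 10.7
print(round(high_cutoff, 1))  # 251.4
print(252 >= high_cutoff)     # True
```

The script keeps the unrounded standard deviation when forming the cutoff, which still rounds to the 251.4 games quoted in the discussion.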

Okay, let's move to the next special distribution.

Poisson probability distributions.

What's the deal with Poisson?

The Poisson distribution is perfect for modeling the number of times an event occurs over a specified interval.

This interval could be time, like the number of calls to a help center per hour; distance, like the number of potholes per mile of road; area, like the number of flaws per square meter of fabric; or even volume.

So it's about occurrences within a continuous interval.

What are the requirements?

Key requirements are the occurrences must be random.

They must be independent of each other.

One occurrence doesn't make another more or less likely.

And they must be uniformly distributed over the interval.

The event is equally likely to occur at any point within the interval.

Random, independent, uniform.

Got it.

The formula to calculate the probability of exactly x occurrences in an interval is P(x) = (μ^x · e^(−μ)) / x!.

Here, μ is the mean number of occurrences in that specific interval.

And e, what's that?

e is Euler's number, a fundamental mathematical constant, approximately 2.71828.

It pops up naturally in growth and decay processes.

Your calculator will have an e button.

And x! is x factorial.

Okay.

And the parameters for Poisson?

Nicely simple again.

The mean is just μ.

And the standard deviation is the square root of the mean, σ = √μ.

Mean and standard deviation.

Let's see an example.

The source mentions Atlantic hurricanes.

Right.

Between 1900 and 2017, a period of 118 years, there were 652 Atlantic hurricanes recorded.

So the mean number per year would be 652 divided by 118.

Which is about 5.5 hurricanes per year.

So μ is 5.5 for an interval of one year.

Okay.

Let's say we want to find the probability of having exactly six hurricanes in a randomly selected year.

We use the Poisson formula with x = 6 and μ = 5.5.

P(6) = (5.5^6 · e^(−5.5)) / 6!, plugging that into a calculator.

Gives about 0.157.

So roughly a 15.7% chance of exactly six hurricanes in a given year based on that historical average.

And we can check how well the model fits reality.

In that 118 year period, there were actually 16 years that had exactly six hurricanes.

How many would the model predict?

We'd expect about 118 × P(6) = 118 × 0.157 ≈ 18.5 years.

18.5 expected versus 16 observed.

That's pretty close.

Suggests the Poisson distribution is a reasonable model for hurricane occurrences per year.

It does seem to fit well.
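Here is a short Python sketch of the hurricane calculation, using the Poisson formula P(x) = μ^x · e^(−μ) / x! with the rounded mean μ = 5.5 and x = 6. The function name `poisson_pmf` is an assumption for illustration; the 118 years and 16 observed years come from the discussion above.

```python
import math

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x! for a Poisson distribution with mean mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

mu = 5.5  # mean hurricanes per year, rounded from 652 / 118
p_six = poisson_pmf(6, mu)

# Expected number of years (out of 118) with exactly six hurricanes.
expected_years = 118 * p_six

print(round(p_six, 3))           # 0.157
print(round(expected_years, 1))  # 18.5
```

Comparing the 18.5 expected years with the 16 observed years is an informal fit check, not a formal goodness-of-fit test.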

Now, there's one more interesting thing about Poisson.

It can sometimes be used to approximate the binomial distribution.

Approximate the binomial?

When would you do that?

It's useful when you have a binomial situation where n, the number of trials, is very large, and p, the probability of success, is very small.

Large n and small p.

Are there specific rules for that?

Yes.

The common guidelines are n ≥ 100 and np ≤ 10.

If both those conditions are met, the Poisson distribution provides a good approximation to the binomial.

And how do you set it up?

You simply set the Poisson mean equal to the binomial mean, μ = np.

Then you use the Poisson formula with that μ.

Oh, why bother approximating?

Don't we have tech for binomial?

We do, but sometimes, especially with extremely large n, calculations can still be cumbersome or slow even for computers, or the Poisson might just offer a simpler perspective.

Plus, it's a neat theoretical connection.

Okay, example time.

The source uses the Maine Pick 4 lottery.

You play once a day for a year, 365 days.

Right.

The probability of winning Pick 4 on any given day is 1 in 10,000.

So p = 0.0001, and n = 365 days.

Let's check the conditions for approximation.

n = 365, which is ≥ 100.

Check.

And np = 365 × 0.0001 = 0.0365.

That's definitely ≤ 10.

Check.

So we can use the Poisson approximation.

What's μ?

μ = np = 0.0365.

That's the average number of wins you'd expect in a year.

Very small.

Let's find the probability of winning at least once in 365 days.

That's 1 minus the probability of winning 0 times, right?

Exactly.

Easier to calculate P(0) using the Poisson formula with μ = 0.0365 and x = 0.

P(0) = (0.0365^0 · e^(−0.0365)) / 0!.

Anything to the power 0 is 1, and 0! is 1.

So it's just e^(−0.0365).

Which calculates to about 0.9642.

That's the probability of going a whole year playing daily and never winning.

So the probability of winning at least once is 1 − 0.9642 = 0.0358.

About 3.6%.

So even playing every single day for a year, your chance of winning even one time is less than 4%.

It certainly is.

And it shows the utility of these different distributions.
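To see how close the approximation is, this Python sketch computes the chance of at least one win in 365 daily plays at p = 1/10,000 two ways: with the Poisson approximation (μ = np) and with the exact binomial complement 1 − (1 − p)^n. The variable names are assumptions for illustration.

```python
import math

n = 365          # one play per day for a year
p = 1 / 10_000   # chance of winning on any single day

# Guidelines for using the Poisson approximation: n >= 100 and np <= 10.
mu = n * p
conditions_met = n >= 100 and mu <= 10

# Poisson: P(0) = mu^0 * e^(-mu) / 0! = e^(-mu), so P(at least one) = 1 - e^(-mu).
approx = 1 - math.exp(-mu)

# Exact binomial: P(0) = (1 - p)^n, so P(at least one) = 1 - (1 - p)^n.
exact = 1 - (1 - p) ** n

print(conditions_met)    # True
print(round(approx, 4))  # 0.0358
print(round(exact, 4))   # 0.0358
```

The two answers agree to four decimal places, which is the sense in which the Poisson distribution is a good stand-in for the binomial here.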

Okay, let's try and wrap this up.

We've covered a lot of ground today.

We really have.

We started with the basics, what random variables are, focusing on discrete ones.

Then we defined probability distributions, those crucial requirements for them to be valid.

Right.

Numerical variable, probabilities sum to 1, each probability between 0 and 1.

Then we moved into quantifying them, calculating the mean, variance, standard deviation, and crucially understanding the mean as the expected value, E.

That expected value concept seemed really powerful, especially with the casino and lottery examples.

It's about the long run average.

Absolutely.

And then we looked at how to tell if an outcome is statistically significant, unusual enough to make us question our assumptions, using the range rule of thumb or the rare event rule with a probability threshold of 0.05.

The NFL coin toss example really drove that home, showing how statistics detected an advantage and influenced policy.

And finally, we explored two key specific discrete distributions, binomial, for fixed trials with two outcomes.

Like coin flips or cashless shoppers.

And Poisson for counting occurrences over an interval.

Like hurricanes or maybe even daily lottery wins.

And we saw how Poisson can sometimes approximate binomial when n is large and p is small.

So the big takeaway here is that behind what looks like randomness, there are often quantifiable patterns.

These tools, probability distributions, expected value, significance tests, give you a lens to see those patterns.

Exactly.

It's not just abstract math.

It's about understanding the world better, making more informed decisions, whether you're analyzing sports data, playing a game of chance, or evaluating any situation involving probability.

You really do gain a powerful new perspective.

Now, for a final thought to leave you with, we briefly touched on the assumption that P(boy) = P(girl) = 0.5 earlier.

Right.

But the source material actually mentions that the current biological understanding is slightly different, maybe closer to 105 boys born for every 100 girls.

That would make the probability of a boy, P(boy), about 0.512, not 0.5.

Slightly skewed.

So the provocative thought is, how might applying what we've learned today about expected values, and especially about identifying significantly high or significantly low results using probability, allow you to test other commonly held beliefs?

Are there other things we assume are 50-50, or purely random, that might show statistically significant deviations if we looked closely?

Hmm.

That's interesting.

Makes you want to go out and test some assumptions.

A great place to leave it.

Thanks for joining us on this deep dive.

My pleasure.

ⓘ This audio and summary are simplified educational interpretations and are not a substitute for the original text.

Chapter Summary

What this audio overview covers

Discrete probability distributions model scenarios with countable, well-defined outcomes where each result carries a measurable likelihood. Any valid distribution requires probabilities between zero and one inclusive, with all individual probabilities summing to exactly one—foundational constraints that ensure mathematical consistency. Building distributions from empirical data involves organizing observed frequencies and converting them into probability statements, then using these models to answer practical questions about what outcomes are likely. The expected value represents the weighted average outcome of a distribution, calculated by multiplying each outcome by its probability and summing the results, providing a single-number summary of the distribution's center. Variance and standard deviation quantify how spread out the outcomes are around this center, with variance measuring average squared deviation and standard deviation returning to the original measurement units for easier interpretation. The binomial distribution emerges as a powerful model for experiments satisfying specific structural requirements: a fixed number of independent trials, two mutually exclusive outcomes on each trial, constant probability of success across all trials, and interest in counting the total number of successes. Using binomial notation where n denotes the number of trials and p represents the probability of success on any single trial, the binomial probability formula calculates the probability of observing exactly a given number of successes. The complement rule offers computational efficiency by calculating the probability of the desired outcome indirectly when direct calculation would be tedious. Binomial distributions possess derived formulas for mean, variance, and standard deviation that depend only on n and p, avoiding the need for lengthy calculations from first principles. 
Identifying unusual or suspicious outcomes relies on the range rule of thumb, which uses the mean and standard deviation to establish boundaries for typical values. When sample sizes grow large, the binomial distribution's shape increasingly resembles a normal distribution, allowing analysts to use normal probability methods for approximation in such cases. Applications permeate quality assurance contexts where items are classified as acceptable or defective, medical testing situations where results are positive or negative, and countless other domains where outcomes reduce to success or failure categories. Technology and statistical software streamline the computational burden of repetitive binomial probability calculations, particularly for larger datasets.

Using this chapter to study? Last Minute Lecture is free and student-run. If it helped, consider supporting the project.
