1. Notations

  • P(X) = Probability of event X.
  • P(X|Y) = Probability of X given that Y has occurred.

2. Pre-requisites

  • You need to have knowledge of basic probability theory. If you are comfortable calculating probabilities of discrete events and comfortable with the sum rule and the product rule, then you're good to go (if you're not, don't worry, I've tried to give a terse explanation using an example below). Try out this question.

Que - Bag A has 5 red and 3 blue balls, Bag B has 6 red and 4 blue balls. The probability that a person chooses Bag A is 0.3, and he'll choose Bag B with probability 0.7. What is the probability of the person selecting a blue ball from Bag B? What is the total probability of him coming out with a red ball?

Ans -

  • Probability of choosing a blue ball from B = (he chooses B) and (then he selects a blue ball from B), i.e. ${P(B)\times P(blue\mid B)}$. If your answer is 0.28, then you're comfortable with the product rule.

  • Probability of him coming out with a red ball = (he chooses a red ball from A) or (he chooses a red ball from B), i.e. ${P(A)\times P(red\mid A)+ P(B)\times P(red\mid B)}$. If your answer comes out to be 0.6075, then you're comfortable with the sum rule.

Intuitively, the sum rule comes into play when there is a choice between mutually exclusive events (these events are generally separated by an "or"), e.g. the event of (choosing A and then a red ball) or (choosing B and then a red ball), as shown in the second bullet point above. The product rule comes into play when two events occur together, e.g. choosing Bag B and then selecting a blue ball from it, as shown in the first bullet point above.
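If you want to check the arithmetic, here is a minimal Python sketch of the two calculations above (the variable names are my own, chosen purely for illustration):

```python
# Probabilities given in the question
p_bag_a, p_bag_b = 0.3, 0.7
p_blue_given_a, p_blue_given_b = 3 / 8, 4 / 10
p_red_given_a, p_red_given_b = 5 / 8, 6 / 10

# Product rule: choosing Bag B AND then drawing a blue ball from it
p_blue_from_b = p_bag_b * p_blue_given_b
print(round(p_blue_from_b, 4))  # ≈ 0.28

# Sum rule over the mutually exclusive ways of ending up with a red ball
p_red = p_bag_a * p_red_given_a + p_bag_b * p_red_given_b
print(round(p_red, 4))  # ≈ 0.6075
```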

3. Formula and basic jargon

Let me state the formula for Bayes' theorem quickly. Subsequently, I'll explain every term of the formula in detail using an example and introduce some basic jargon along the way.

${\displaystyle \large P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}}$ (3.0)

Let's try to understand this formula by investigating a theft incident.

There are two secret vaults, A and B. Vault A contains 6 gold bars and 2 diamonds. Vault B contains 2 gold bars and 8 diamonds. Vault A is easier to break open, and therefore if a thief encounters both vaults he's more likely to break into A than into B. Let's say a thief will break into A with probability 0.7 and into B with probability 0.3. Unfortunately, despite all the security, a theft occurs. Because of the security alarms, the thief had to rush and could only take whichever piece of ornament he could get his hands on (in other words, he didn't have time to think, "diamonds are more precious, so I should pick a diamond instead of gold"). After the heroic efforts of the police, the thief is caught and is found with a diamond. You are a detective, but, like Sherlock Holmes, an unauthorised one. You have a personal history with the thief and want to get more of a lead on this case by figuring out which vault the theft occurred from. Sadly, being unauthorised, you're not allowed to enter the crime scene. Thus, you have to guess the vault from which the thief stole the diamond. Let's say you hypothesise that he must have stolen from A because it was easier to break into. Subsequently, you go home and sit down with pen and paper to test the validity of your hypothesis using Bayes' theorem.

Let us define some symbols first,

  • ${P(A)}$ = probability of breaking into A which is given as 0.7.
  • ${P(B)}$ = probability of breaking into B which is given as 0.3.
  • ${P(G)}$ = probability of stealing the gold bar.
  • ${P(D)}$ = probability of stealing the diamond.

You have to compute the probability that the thief broke into A given that a diamond was found in his hand, i.e. ${P(A\mid D)}$. Expanding this term using formula 3.0 gives us: ${\displaystyle P(A\mid D)={\frac {P(D\mid A)P(A)}{P(D)}}}$ (3.1)

The sentence "The thief was found having a diamond" is called the Evidence. The probability of observing the Evidence is written mathematically as ${P(D)}$, which is the denominator of Eq. 3.1. To solve the question, we begin by listing all the ways in which the Evidence could have occurred. There are two cases in which the diamond could have been stolen:

  1. He broke into A and then picked up a diamond, i.e. ${P(A)\times P(D\mid A)}$, or
  2. He broke into B and then picked up a diamond, i.e. ${P(B)\times P(D\mid B)}$.

All these listed ways make up ${P(D)}$ in Eq. 3.1. Thus, ${P(D)}$ written mathematically is -> ${P(A)\times P(D\mid A) + P(B)\times P(D\mid B)}$ (3.2)

Now, you write down your Prior belief in the correctness of your hypothesis, as it stood before you saw the evidence. What are the odds that the thief broke into A? The odds are simply ${P(A)}$, or 0.7. Thus, ${P(A)}$ is called the Prior, because it reflects your prior belief about the correctness of your hypothesis before seeing the Evidence. Now, what are the chances that he stole the diamond given that he infiltrated A? In other words, what are the chances of seeing the Evidence given that your hypothesis is true? Mathematically speaking, ${P(D\mid A)}$. This entity is called the Likelihood. The Likelihood can be described as the answer to the question "How likely is the evidence to occur if I assume my hypothesis is true?"

The Likelihood and the Prior make up the numerator in (3.1). Thus, your numerator is ${P(A)\times P(D\mid A)}$ (3.3).

Why is the entity (3.3) our numerator and not anything else? If you look carefully, (3.3) is the probability of case no. 1 (he broke into A and then picked up a diamond). Intuitively, we can think of it as follows: out of all the cases that make up your evidence ${P(D)}$, you are only interested in the ones in which your hypothesis holds true. Thus, only case no. 1 from the above list interests you, and you put its probability in the numerator. And if you remember, that is basic probability; we calculate probability using the formula -> (cases that interest us)/(total no. of cases), e.g. what are the odds of selecting a red card from a deck of cards -> 26/52, or 0.5.

Thus, finally, after putting all the pieces together, we can calculate:

${\displaystyle P(A\mid D) = {\frac {P(A)\times P(D\mid A)}{P(D)}} = {\frac {\text{Eq. 3.3}}{\text{Eq. 3.2}}} = {\frac {P(A)\times P(D\mid A)}{P(A)\times P(D\mid A) + P(B)\times P(D\mid B)}} \approx 0.42}$

The chances that your hypothesis is true are 42%. Put another way, the chances of your hypothesis being wrong are 58%. Hence, it is more likely that he broke into B and not A! Now, reflecting on the details of the theft, you suddenly realise: "Ah! That makes sense. There are more diamonds in B than in A, and the thief hurriedly picked up whatever ornament he could get his hands on and came away with a diamond. Thus, if he broke into B, he had a better chance of blindly picking a diamond than if he had broken into A. So he must have stolen from B." The calculated probability ${P(A\mid D)}$ is called the Posterior. This probability is an updated version of our Prior based on new evidence. Now that we have found the evidence that the thief had a diamond, we believe that he broke into A with probability 0.42 (posterior) instead of 0.7 (prior). In other words, your belief in your hypothesis went down in light of the new evidence. This is the main motive of Bayes' theorem: it helps us continuously update our prior beliefs by collecting new evidence.
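To make the arithmetic concrete, here is a minimal Python sketch of the calculation above; the only inputs are the vault contents and the break-in probabilities given in the story (the variable names are my own):

```python
# Priors: probability of breaking into each vault
p_a, p_b = 0.7, 0.3

# Likelihoods of grabbing a diamond:
# Vault A has 6 gold bars + 2 diamonds, Vault B has 2 gold bars + 8 diamonds
p_d_given_a = 2 / 8
p_d_given_b = 8 / 10

# Evidence (Eq. 3.2): total probability of the thief ending up with a diamond
p_d = p_a * p_d_given_a + p_b * p_d_given_b

# Posterior (Eq. 3.1): probability that he broke into A, given the diamond
p_a_given_d = (p_a * p_d_given_a) / p_d
print(round(p_a_given_d, 2))  # ≈ 0.42
```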

Quick summary and technique to solve problems involving Bayes' Theorem :

  • Find out what is given in the problem. The given information serves as the Evidence, which aids us in assessing our Hypothesis.

  • List out all the ways in which the Evidence could have occurred, calculate the probabilities of those ways using the product rule and the sum rule, and write their sum as the denominator.

  • Pick out, amongst the listed ways, the one in which your Hypothesis holds true and put its probability in the numerator.
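These three steps translate almost line for line into code. Below is a small, generic Python helper (a sketch; the function name and structure are my own): for each mutually exclusive way the Evidence could occur, you pass in its prior and likelihood, and mark the case(s) in which your Hypothesis holds true.

```python
def bayes_posterior(cases):
    """Each case is a tuple (prior, likelihood, hypothesis_holds).

    Denominator: sum of prior * likelihood over all the ways the
    evidence could have occurred (sum rule over product-rule terms).
    Numerator: the same products, kept only for the cases in which
    the hypothesis holds true.
    """
    evidence = sum(prior * lik for prior, lik, _ in cases)
    numerator = sum(prior * lik for prior, lik, holds in cases if holds)
    return numerator / evidence

# The vault example from section 3:
print(round(bayes_posterior([
    (0.7, 2 / 8, True),    # broke into A, then picked a diamond (our hypothesis)
    (0.3, 8 / 10, False),  # broke into B, then picked a diamond
]), 2))  # ≈ 0.42
```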

4. Interesting Study Demonstrating The Counter-Intuitiveness Of Bayes' Theorem

(This part of the blog is inspired by a great video by 3 Blue 1 Brown.)

Let me ask you an interesting question: "Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail." Having read this description, what do you think is the profession of Steve, a librarian or a farmer?

(This question was asked by Nobel laureate Daniel Kahneman and Amos Tversky in studies they conducted showing that humans are intuitively bad statisticians (even those with PhDs in statistics) and sometimes overestimate the correctness of their prior beliefs. Daniel Kahneman has written about these studies in his book "Thinking, Fast and Slow".)

Most people would guess that Steve is a librarian because he fits the stereotypical image of one. Let's look at this problem from a Bayesian perspective. Let's say that the description of Steve quoted above is our evidence. Now we hypothesise that Steve is a librarian. Let's calculate the validity of our hypothesis.

Steve is a random person taken from a representative sample.

  • Let's say the probability of observing the above traits in a random person is ${P(E)}$.
  • Let the probability of a random person being a farmer be ${P(F)}$.
  • Let the probability of a random person being a librarian be ${P(L)}$.

We would have to consider following questions to calculate the probability of our hypothesis given the evidence :

  1. Out of 100 librarians, how many do you think fit the description quoted above? We are allowed to incorporate our stereotypes in estimating the answer to this question. Let's say 85 out of 100 librarians fit the evidence. Mathematically speaking, given that a person is a librarian, the probability of him fitting the above evidence (he is shy and a "meek and tidy soul") is ${P(E\mid L)}$ = 0.85.

  2. Out of 100 farmers, how many do you think fit the description quoted above? Let's say 30 out of 100 farmers fit the evidence (because we all stereotypically think that farmers are less likely to be shy or a "meek and tidy soul"). Mathematically speaking, given that a person is a farmer, the probability of him fitting the above evidence is ${P(E\mid F)}$ = 0.3.

We also need to take into account some statistical facts to decide our prior beliefs. At the time this study was conducted, there were 20 farmers for every 1 librarian in America. Thus, out of every 210 such people, 10 are librarians and 200 are farmers. Therefore, the probability of a random person being a farmer, i.e. ${P(F)}$, is roughly 0.95, and the probability of a random person being a librarian, i.e. ${P(L)}$, is roughly 0.05 (assuming our representative sample contains only farmers and librarians).

Listing all the ways in which the evidence can occur:

  1. The person selected at random is a librarian and he is a "meek and tidy soul", i.e. ${P(L)\times P(E\mid L)}$, or
  2. The person selected at random is a farmer and he is a "meek and tidy soul", i.e. ${P(F)\times P(E\mid F)}$.

Writing this mathematically -> ${P(E) = P(L)\times P(E\mid L) + P(F)\times P(E\mid F)}$

The case which interests us is case 1. Thus, ${\displaystyle P(L\mid E) = \frac{P(L)\times P(E\mid L)}{P(L)\times P(E\mid L) + P(F)\times P(E\mid F)}}$

After doing the above calculation, we find that the probability of Steve being a librarian is a mere 13%. In other words, if you assembled 100 meek and tidy souls like Steve, only around 13 of them would turn out to be librarians. This seems surprising and counter-intuitive because we incorporated our stereotypes into our calculations (by saying that 85 out of 100 librarians fit the evidence), yet the final calculation concludes that our hypothesis (which complied with our stereotypes) was probably wrong.

An intuitive way of thinking about this is as follows:

There are way more farmers in the general population than librarians, therefore there are way more "meek and tidy souls" ploughing the fields (roughly 87 out of every 100 such souls, as per our calculations) than meticulously keeping the books in the library. Take a sample of 210 people, for example, out of which 10 are librarians and 200 are farmers. According to our stereotypical estimates, 85% of the 10 librarians, or ~9 librarians, are shy, while 30% of the 200 farmers, or ~60 farmers, are shy. Hence, out of 210 people, about 69 are shy and tidy souls, the majority of whom are farmers. Thus, if we randomly pick a guy named Steve and he turns out to be shy, he probably belongs to the group of 60 farmers.
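Here is a short Python sketch of the numbers above, using the rounded priors (0.05 and 0.95) and the stereotype-based likelihoods (0.85 and 0.3) from the text:

```python
# Priors, rounded from the 10-librarians-to-200-farmers ratio
p_librarian, p_farmer = 0.05, 0.95

# Stereotype-based likelihoods of fitting the "meek and tidy soul" description
p_e_given_librarian, p_e_given_farmer = 0.85, 0.30

# Evidence: all the ways a randomly picked person can fit the description
p_e = p_librarian * p_e_given_librarian + p_farmer * p_e_given_farmer

# Posterior: probability that Steve is a librarian, given the description
p_librarian_given_e = p_librarian * p_e_given_librarian / p_e
print(round(p_librarian_given_e, 2))  # ≈ 0.13

# Head-count version with 210 people: ~8.5 shy librarians vs ~60 shy farmers
print(0.85 * 10, 0.30 * 200)  # 8.5 60.0
```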

5. Bayes' Theorem As A Way Of Updating Our Priors And Belief Systems

(This part of the blog is inspired by a great video by Veritasium.)

Suppose you go to a doctor and he tells you that the results of your test for a disease are, unfortunately, positive. It is known that 0.1% of the population has the disease. You know that the test you took gives correct results 99% of the time. Thus, you may be disheartened, because such an accurate test has declared you sick with a rare disease. Intuitively, you would think that there is a 99% chance of you having the disease. But let's look at this from a Bayesian perspective.

  • Evidence -> The test result is positive. Since the test is correct 99% of the time, the likelihood of a positive result given that you are sick is ${P(E\mid D)}$ = 0.99 (and the false-positive rate is ${P(E\mid \neg D)}$ = 0.01).
  • Hypothesis -> You have the disease. We want the probability of this hypothesis given the evidence, ${P(D\mid E)}$.
  • Prior belief before seeing the evidence -> Probability of you having the disease before you went for the test. ${P(D)}$ = 0.001 (because 0.1% of the population has it and you're part of the population).

Ways In Which The Evidence Can Be Observed (The Test Result Can Come Out Positive):

  1. You have the disease and the test comes out positive, i.e. ${P(D)\times P(E\mid D)}$, or
  2. You don't have the disease and the test (incorrectly) shows positive, i.e. ${P(\neg D)\times P(E\mid \neg D)}$.

Mathematically -> ${P(E) = P(D)\times P(E\mid D) + P(\neg D)\times P(E\mid \neg D)}$

We are interested in case 1.

Thus, probability of you having the disease given positive test results ${P(D\mid E)}$ = ${\displaystyle \frac {P(D)\times P(E\mid D)}{P(D)\times P(E\mid D) + P(\neg D)\times P(E\mid \neg D)}}$.

After the calculation, the probability of you having the disease comes out to be a mere 9%, which again seems counter-intuitive. Even after being declared positive by a pretty accurate test, you are probably healthy and the test result is probably a false positive!

This counter-intuitive result stems from the fact that the probability of our hypothesis given the evidence is directly proportional to our prior, i.e. the probability of our hypothesis being correct before seeing the evidence (${P(D)}$ in the above calculation). In this particular example, the probability of us having the disease before having the test result in hand was so low (0.001) that even the strong new evidence couldn't tilt the odds in favour of our hypothesis that we have the disease.

Think of just 1000 people, including you. According to the given data, 1 out of these 1000 is sick with the disease. Let's say that he goes for the test and is correctly identified as positive. The other 999 also go for tests. The test will falsely identify 1% of the 999 healthy people as positive, i.e. about 10 healthy people are shown positive. So now, there are 11 people in this group with positive test results, and you are one of them. Out of these 11 positive test results, only 1 is correct. That's why getting a positive result on the first test is not as bad as you might think!
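Here is the same calculation as a minimal Python sketch, including the head count over 1000 people described above (the names are my own):

```python
# Prior: 0.1% of the population has the disease
p_disease = 0.001
p_healthy = 1 - p_disease

# Likelihoods: the test is correct 99% of the time
p_pos_given_disease = 0.99   # true-positive rate
p_pos_given_healthy = 0.01   # false-positive rate

# Evidence: all the ways a positive result can be observed
p_pos = p_disease * p_pos_given_disease + p_healthy * p_pos_given_healthy

# Posterior: probability of actually being sick, given one positive test
posterior = p_disease * p_pos_given_disease / p_pos
print(round(posterior, 3))  # ≈ 0.09, i.e. about 9%

# Head count out of 1000 people: ~1 true positive vs ~10 false positives
print(1000 * p_disease * p_pos_given_disease)  # ≈ 0.99 true positive
print(1000 * p_healthy * p_pos_given_healthy)  # ≈ 9.99 false positives
```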

But What If You Took A Second Test And It Also Came Out Positive?

Suppose, just to be sure, you get tested by a different lab and the result again comes out positive (assuming that this lab also gives correct results 99% of the time). Now, what are the chances that you have the disease? Let's again hypothesise that you have the disease and test the validity of our hypothesis. Everything in the data remains the same except the prior. The basic definition of the prior is "the probability that your hypothesis is true before collecting the evidence". Thus, in this case, the prior is the probability of you having the disease without having seen the result of the second test. Therefore, the prior for this second calculation should be 9%, or 0.09 (the posterior from the first test). Even though the first positive result was probably false, it served us by updating our prior from 0.001 to 0.09, because it was strong evidence.

The probability of having the disease given that the second test result is also positive = ${\displaystyle \frac {0.99\times 0.09}{0.99\times 0.09 + 0.01\times 0.91}} \approx {91\%}$.

Thus, you now have a 91% chance of being sick, and intuitively this makes sense because the chance of two such accurate tests both showing false positives is pretty low.

You started with the hypothesis that you are sick, with 0.1% odds. Then, you collected evidence by taking a test, and that evidence updated your belief in your hypothesis to 9%. Subsequently, you collected more evidence by taking another test, which further updated your belief in the hypothesis to 91%.
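The whole chain of updates (0.1% -> 9% -> 91%) can be written as one small loop, in which each test's posterior becomes the prior for the next test. A minimal Python sketch, assuming both labs have the same 99% accuracy:

```python
def update(prior, p_pos_given_sick=0.99, p_pos_given_healthy=0.01):
    """One Bayesian update after observing a positive test result."""
    evidence = prior * p_pos_given_sick + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_sick / evidence

belief = 0.001  # prior before any test: 0.1% of the population is sick
for test in (1, 2):
    belief = update(belief)        # yesterday's posterior is today's prior
    print(test, round(belief, 2))  # test 1 -> ≈ 0.09, test 2 -> ≈ 0.91
```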

This case shows that Bayes' theorem serves us by updating our priors with the help of new evidence. The posterior serves as the prior the next time evidence is collected. This iterative process helps us scientifically solidify or falsify our hypotheses by regularly collecting new evidence and updating our priors accordingly.

If you notice a mistake in this blog post, please mention it in the comment section or email me at iamabhimanyu08@gmail.com, and I'll make sure to correct it right away.