Cloud. Big Data. Analytics... and so on: All About That Bayes!!!

If you search for Bayes, you will see tons of hits on the subject. It often comes up in discussions of data science and machine learning. The iconic post is Yudkowsky's article (it is long, but worth reading if you need the very fine details of Bayes).

So why another article on Bayes? Well, as they say, you truly learn when you teach. So instead of glancing over pages explaining Bayes, I decided to put together what I know about Bayes and get it all organized, here and in my head!

I believe one of the best examples to explain Bayes is the breast cancer test. It is simple to follow once explained, and it is easy to get wrong at first glance.

Let's say that over the years, doctors have observed the following data:
- 1% of women have breast cancer.
- When cancer is indeed there, the mammogram detects it 95% of the time.
- When cancer is NOT there, the mammogram still detects "false" cancer 20% of the time.

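At this point, you can put this information into a probability table to make it visually clear which value represents what.

Test result            | Patient has cancer | Patient is cancer free
Breast Cancer Positive | 95%                | 20%
Breast Cancer Negative | 5%                 | 80%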

How do you read this table? Going column by column, we have:
- The first column lists the result of the test: either positive for cancer or negative.
- The second and third columns show the condition of the patient: the second is for patients who unfortunately have cancer, and the third is for patients who are cancer free.
- The second column shows that the test correctly returns a positive result for 95% of patients who truly have cancer, and 5% of the time it shows a negative result even though the patient does have cancer.
- The third column shows that the test incorrectly returns a positive result for 20% of patients who are actually cancer free, while giving a negative result for 80% of cancer-free patients.

Now the question is: given the above information, what are the actual chances that someone who just received the unfortunate news of a positive breast cancer test actually has breast cancer? Would you jump to the conclusion that it is very likely, considering that the test comes back positive 95% of the time when cancer is present? 99.99%, since it is just your luck? Or 0%, since bad stuff can't happen to you? Well, read on for a more educated guess.

Let's look at the table to see how bad the news is...
Step 1: Result is Breast Cancer Positive - refer to the Breast Cancer Positive row to find the percentages for patients with cancer and without.
Step 2: Calculate TRUE POSITIVE: Chance of patient having cancer X Chance of positive test when cancer is present = 1% X 95% = 0.0095
Step 3: Calculate FALSE POSITIVE: Chance of patient not having cancer X Chance of positive test when cancer is absent = 99% X 20% = 0.198

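or, if you want to populate the confusion matrix entirely:

Test result            | Patient has cancer | Patient is cancer free
Breast Cancer Positive | 1% X 95% = 0.0095  | 99% X 20% = 0.198
Breast Cancer Negative | 1% X 5% = 0.0005   | 99% X 80% = 0.792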
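If you prefer to let the computer do the multiplication, here is a minimal Python sketch of the four cells of the confusion matrix. The variable names are my own picks for illustration; the three input numbers are the ones the doctors observed above.

```python
# Inputs taken from the three observations at the top of the article.
p_cancer = 0.01             # 1% of women have breast cancer
p_pos_given_cancer = 0.95   # test is positive when cancer is there
p_pos_given_healthy = 0.20  # test is positive when cancer is NOT there

p_healthy = 1 - p_cancer

# Each cell is (chance of the condition) x (chance of the test result given it).
confusion = {
    "true_positive": p_cancer * p_pos_given_cancer,
    "false_negative": p_cancer * (1 - p_pos_given_cancer),
    "false_positive": p_healthy * p_pos_given_healthy,
    "true_negative": p_healthy * (1 - p_pos_given_healthy),
}

for name, value in confusion.items():
    print(f"{name:>14}: {value:.4f}")

# The four cells cover every possible outcome, so they should add up to 1.
print(f"{'total':>14}: {sum(confusion.values()):.4f}")
```

The true_positive and false_positive cells are the same numbers as in Steps 2 and 3, and checking that all four cells add up to 1 is a cheap way to catch a typo in the inputs.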

So when one is given the unfortunate news of a positive breast cancer test, the probability of the patient actually having cancer is the probability of a true positive (i.e. the patient does have cancer and tests positive) divided by the total probability of getting a positive test back, which is the sum of True Positive and False Positive. Let's do some arithmetic!

True Positive = 0.0095
True Positive + False Positive = 0.0095 + 0.198 = 0.2075

Hence the actual chance of someone having breast cancer given a positive breast cancer test is
0.0095 / 0.2075 = 0.04578 or 4.578%

4.578% - Feeling better about that outcome? But how can it be so low? Doesn't the test predict with 95% accuracy? Yes, but it also incorrectly tells healthy people they have cancer 20% of the time, or 1 in 5. Given that this cancer only occurs in 1% of people, it is far more likely that a positive result belongs to someone from the healthy 99% of the population. Let's explain with actual numbers to make it clear. In a room of 100 people, only 1 person will have breast cancer. If all 100 were given the breast cancer test, the 1 person with cancer would very likely (95%) receive a positive result, indicating that they had better get some treatment quickly. On the other hand, out of the 99 remaining healthy people, 99 X 20% = 19.8 (or almost 20) would receive the inaccurate news that they have cancer. So only about 1 out of 21 (20+1) people who tested positive will actually have cancer!
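The room of 100 people leaves us with fractions of a person, so here is a small Python sketch of the same head count with a hypothetical crowd of 10,000 (my assumption, picked only so that every group comes out as a whole number).

```python
from fractions import Fraction

population = 10_000  # hypothetical head count, chosen so every group is a whole number

with_cancer = population * 1 // 100        # 1% have cancer
healthy = population - with_cancer         # everyone else

true_positives = with_cancer * 95 // 100   # 95% of the people with cancer test positive
false_positives = healthy * 20 // 100      # 20% of the healthy people test positive

all_positives = true_positives + false_positives
share_with_cancer = Fraction(true_positives, all_positives)

print(f"{true_positives} of {all_positives} positive tests belong to people with cancer")
print(f"that is {share_with_cancer} = {float(share_with_cancer):.3%}")
```

With whole people instead of 19.8 of them, the count gives 95 true positives out of 2,075 positive tests, which is the same 4.578% as before.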

To make this point hit home for 'visual' learners, let's draw some circles!

Area A - the 1% of the entire population who have cancer

So if you were to ask what the probability of randomly picking someone with breast cancer is, you would come up with

Formula 1:
P(A) = | A | / | Entire Population |

Area B - People who receive positive breast cancer test results (both true positive and false positive)

Similarly, the probability that someone would have a positive breast cancer result is

Formula 2:
P(B) = | B | / | Entire Population |

Now let's merge these two figures into one:

The intersection of A and B, denoted AB, is the group of people who have cancer and were diagnosed as having cancer: the true positives in our earlier example.

The probability of this is:

Formula 3:
P(AB) = | AB | / | Entire Population |

Now let's answer the same question as we had before: given the news of a positive breast cancer exam, what are the chances of cancer actually being present?

Since we are using regions to help us, we can paraphrase it as: what are the chances for someone who is in region B (i.e. positive breast cancer test) to be in region AB (i.e. positive breast cancer test for a person who is in region A and has cancer)? Or simply, the probability of A given B:

Formula 4:
P(A | B) = | AB | / | B |

If we divide both | AB | and | B | by | Entire Population | and recall formulas 2 and 3, we can rewrite formula 4 as

Formula 5:
P(A | B) = P(AB) / P(B)
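If circles are hard to picture, Python sets can stand in for the regions. This is only a sketch: the population of 10,000 and the way people are numbered are assumptions I made up so that the regions have the same sizes as in the example.

```python
# Hypothetical population of 10,000 people, identified by the numbers 0..9999.
population = set(range(10_000))

# Region A: the 1% who have cancer (people 0..99 in this made-up layout).
A = set(range(100))

# Region B: everyone who tests positive:
#   95 of the 100 people with cancer (true positives), plus
#   1,980 of the 9,900 healthy people (false positives).
B = set(range(95)) | set(range(100, 100 + 1980))

AB = A & B  # intersection: has cancer AND tested positive

p_a = len(A) / len(population)    # Formula 1
p_b = len(B) / len(population)    # Formula 2
p_ab = len(AB) / len(population)  # Formula 3

by_counting = len(AB) / len(B)    # Formula 4
by_probability = p_ab / p_b       # Formula 5

print(f"P(A) = {p_a}, P(B) = {p_b}, P(AB) = {p_ab}")
print(f"Formula 4: {by_counting:.5f}   Formula 5: {by_probability:.5f}")
```

Formulas 4 and 5 give the same answer because the size of the entire population cancels out of the division.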

While we know P(B), the probability of a positive test, let's pretend we don't know P(AB) directly... So let's ask another question. Given a randomly selected person who has breast cancer, what are the chances that the test is positive? (We actually know that number! 95%!!!) This can be written as

Formula 6:
P(B | A) = P(AB) / P(A)

We can express the unknown P(AB) as P(AB) = P(B | A) x P(A) and substitute it into formula 5

Formula 7:
P(A | B) = P(B | A) x P(A) / P(B)

In this case, we know

P(B | A) = 0.95
P(A) = 0.01
P(B) = 0.0095 + 0.198 = 0.2075

P(A | B) = 0.95 x 0.01 / 0.2075 = 0.04578 or 4.578%

and... what we derived in Formula 7 is Bayes' theorem!
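Formula 7 is small enough to wrap in a function. Below is a minimal sketch; the function name bayes_posterior and its parameter names are hypothetical, and P(B) is built the same way as earlier, as the sum of true positives and false positives.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) using Formula 7.

    prior               -- P(A), chance of having the condition
    likelihood          -- P(B | A), chance of a positive test given the condition
    false_positive_rate -- P(B | not A), chance of a positive test without it
    """
    # P(B) is the sum of true positives and false positives.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.20)
print(f"P(A | B) = {posterior:.5f}")
```

Passing the false positive rate instead of P(B) itself is a design choice on my part: P(B) is rarely handed to you directly, while the three numbers from the top of the article usually are.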

I have always found that if you can quickly derive a formula, you will always have it at hand, versus trying to memorize it and guessing where the A's and B's go. So next time someone asks you about Bayes, draw two circles and take it from there!


Saturday, March 19, 2016
