Thursday, March 31, 2016

History of Apache Storm and Lessons Learned Notes:

Article: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html

Notes:
Any successful project requires two things:

1. It solves a useful problem
2. You are able to convince a significant number of people that your project is the best solution to their problem

Make sure your software is built on a common language like Java to ensure a very large pool of potential users.

To ensure that non-JVM languages can use it, define topologies as Thrift data structures and have topologies submitted through a Thrift API. Increasing the number of languages that can use your software exposes it to a larger audience.

There are two ways you can go about releasing open source software. The first is to "go big": build a lot of hype for the project and get as much exposure as possible on release. This approach can be risky though, since if the quality isn't there or you mess up the messaging, you will alienate a huge number of people on day one. That could kill any chance the project had to be successful. The second approach is to quietly release the code and let the software slowly gain adoption. This avoids the risks of the first approach, but it has its own risk: people may view the project as insignificant and ignore it.

Advantages of releasing your application at a conference:

1. The conference would help with marketing and promotion.
2. You would be presenting to a concentrated group of potential early adopters, who would then blog/tweet/email about it all at once, massively increasing exposure.
3. You could hype your conference session, building anticipation for the project and ensuring that on the day of release, there would be a lot of eyes on the project.

People need to monitor their applications. Hence make sure to build a monitoring API, or at least log metrics for retrieval via Splunk or ELK.

ALWAYS supply as much documentation as possible with your software. If there is no documentation, users don't know how to use your application, which will discourage them from using it at all. So always make sure the documentation is extensive and up to date!

Sunday, March 27, 2016

More of a rant vs a post

Just one thing to make life easier when you are trying to list the Java processes on your box: get rid of the redundant 'grep' entry that shows up because the grep command itself matches 'java'.

So instead of listing all running processes and grepping for 'java':

ps aex | grep java

use this!

ps aex | grep -v "grep" | grep java
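A quick aside: most Linux distributions also ship pgrep, which matches against the process name directly, so the grep process itself never shows up in the output (assuming pgrep is installed on your box):

```shell
# -a prints the full command line next to each matching PID
pgrep -a java
```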

Saturday, March 19, 2016

All About That Bayes!!!

If you search for Bayes, you will see tons of hits on the subject. It often comes up in discussions of data science and machine learning. The iconic post is Yudkowsky's article (the article is long, but worth reading if you need to learn the very fine details of Bayes).

So why another article on Bayes? Well, as they say, you truly learn when you teach, so instead of glancing over pages explaining Bayes, I decided to put together what I know of Bayes to get it all organized here and in my head!

I believe one of the best examples to explain Bayes is breast cancer testing. It is simple to follow when explained, yet easy to get wrong at first glance.

Let's say that over the years, doctors were able to observe the following data:
- 1% of women have breast cancer.
- 95% of mammograms are able to detect cancer when it is indeed there.
- 20% of mammograms detect "false" cancer, i.e. it is NOT there!

At this point, you can put this information into a probability table to make it visually clear which value represents what:

Test Result    Has Cancer (1%)    Cancer Free (99%)
Positive            95%                 20%
Negative             5%                 80%

How to read this table? If we go column by column, we have:
- First column lists the result of the test, it either showed that it was positive for cancer or not.
- The second and third columns show the condition of the patient: the 2nd is for patients who unfortunately have cancer, and the 3rd is for patients who are cancer free.
- The second column shows that the test will correctly return a positive result for 95% of patients who truly have cancer, and 5% of the time it will come back negative even though the patient does have cancer.
- The third column shows that the test will incorrectly return a positive result for 20% of the patients who are actually cancer free, while giving a negative result for the other 80%.

Now the question is: given the above information, what are the actual chances that someone who has just received the unfortunate news of a positive breast cancer test actually has breast cancer? Would you jump to the conclusion that it is very likely, given that the test detects cancer 95% of the time when it is present? 99.99%, since it is just your luck? Or 0%, since bad stuff can't happen to you? Well, read on for a more educated guess.

Let's look at the table to see how bad the news is...
Step 1: The result is Breast Cancer Positive - refer to the Positive row to find the percentages for patients with cancer and without.
Step 2: Calculate TRUE POSITIVE: Chance of patient having cancer X Chance of positive test = 1% X 95% = 0.0095
Step 3: Calculate FALSE POSITIVE: Chance of patient not having cancer X Chance of positive test = 99% X 20% = 0.198

or if you want to populate the confusion matrix entirely:

Test Result     Has Cancer (1%)           Cancer Free (99%)
Positive        TP: 1% x 95% = 0.0095     FP: 99% x 20% = 0.198
Negative        FN: 1% x 5% = 0.0005      TN: 99% x 80% = 0.792

So when one is given the unfortunate news of a positive breast cancer test, to calculate the probability of the patient actually having cancer, one needs to divide the probability of a true positive (i.e. the patient does have cancer) by the total probability of getting a positive test back, which is the sum of True Positive and False Positive. Let's do some arithmetic!

True Positive = 0.0095
True Positive + False Positive = 0.0095 + 0.198 = 0.2075

Hence actual chance of someone having breast cancer given positive breast cancer test is
0.0095 / 0.2075 = 0.04578 or 4.578%

4.578% - Feeling better about that outcome? But how can it be so low? Doesn't the test predict with 95% accuracy? Yes, but it also incorrectly flags healthy people as having cancer 20% of the time, or 1 in 5. Given that this cancer only occurs in 1% of people, a positive result is far more likely to come from the 99% of the population who are healthy. Let's use actual numbers to make this clear. In a room of 100 people, only 1 person will have breast cancer. If all 100 were given the test, the 1 person with cancer would very likely (95%) receive a positive result telling them to get treatment quickly. On the other hand, out of the 99 remaining healthy people, 99 X 20% = 19.8 (almost 20) would receive the inaccurate news that they have cancer. So only about 1 out of 21 (20 + 1) people who test positive will actually have cancer!
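If you want to double-check the arithmetic above, here is a quick awk one-liner using only the numbers from the post:

```shell
awk 'BEGIN {
  p_cancer = 0.01              # 1% of women have breast cancer
  p_pos_if_cancer = 0.95       # test detects 95% of real cancers
  p_pos_if_healthy = 0.20      # 20% false positive rate

  tp = p_cancer * p_pos_if_cancer            # true positives:  0.0095
  fp = (1 - p_cancer) * p_pos_if_healthy     # false positives: 0.198
  printf "P(cancer | positive test) = %.5f\n", tp / (tp + fp)
}'
```

Running it prints P(cancer | positive test) = 0.04578, the same 4.578% as above.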

To make this point hit home for 'visual' learners, let's draw some circles!

Area A - 1% of entire population has cancer


So if you were to ask what's the probability of randomly picking someone with breast cancer, you would come up with

Formula 1:
P(A) = | A | / | Entire Population |

Area B - People who receive positive breast cancer test results (both true positive and false positive)

Similarly, the probability that someone would receive a positive breast cancer result is

Formula 2:
P(B) = | B | / | Entire Population |

Now let's merge these two figures into one:

The intersection of A and B, noted as AB, means that a person who has cancer was diagnosed to have cancer, true positive in our earlier example.

The probability of this is:

Formula 3:
P(AB) = | AB | / | Entire Population |

Now let's answer the same question as before: given the news of a positive breast cancer exam, what are the chances of cancer actually being present?

Since we are using regions to help us, we can paraphrase it as: what are the chances that someone who is in region B (i.e. has a positive breast cancer test) is also in region AB (i.e. has a positive test and, being in region A, actually has cancer)? This is simply the probability of A given B:

Formula 4:
P(A | B) = | AB | / | B |

if we recall formulas 2 and 3, we can rewrite formula 4 as

Formula 5:
P(A | B) = P(AB) / P(B)

While we know P(B) - the share of all positive tests - we don't know P(AB)... So let's ask another question: given a randomly selected person who has breast cancer, what are the chances that the test is positive? (We actually know that number: 95%!) This can be written as

Formula 6:
P(B | A) = P(AB) / P(A)

We can express the unknown P(AB) as P(AB) = P(B | A) x P(A) and substitute it into formula 5:

Formula 7:
P(A | B) = P(B | A) x P(A) / P(B)

In this case, we know

P(B | A) = 0.95
P(A) = 0.01
P(B) = 0.0095 + 0.198 = 0.2075

P(A | B) = 0.95 x 0.01 / 0.2075 = 0.04578 or 4.578%

and... what we derived in Formula 7 is Bayes' theorem!

I always found that if you can quickly derive a formula, you will always have it at hand, versus trying to memorize it and guessing where the A's and B's go. So next time someone asks you about Bayes, draw two circles and take it from there!

Monday, March 7, 2016

Making git respect .gitignore after the fact!

Imagine the situation where you wrote your code and then decided to add it to your git repo. Pretty easy right?

git init
git add .


Before you commit, you want to see what's going to be committed. So you do

git status

Now you see a whole bunch of config and target files that have no business being in the repo. Not a problem, you can use .gitignore, right? First remove what you added, create a .gitignore file, and then re-add only the source files.

git rm -r .

create .gitignore with
/target/**
.settings/**
.classpath
.project

and re-add

git add .

Check what's about to be committed... and what?!?!? The old files? How can this be? Did I mess up my regex? Spell gitignore wrong, or forget the leading period? Nope, everything seems correct...

After reading the gitignore help guide... you need to clear the index cache!!! Here is what you do.
Instead of running

git rm -r .

Run this
git rm -r --cached .

The --cached flag is the key difference: it removes the files from the index only, while leaving your working copy untouched.
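By the way, if some file still sneaks into git status after all this, git check-ignore (available since Git 1.8.2) will tell you which .gitignore rule, if any, matches a given path. The path below is just an example:

```shell
# -v prints the .gitignore file, line number, and pattern that matched the path
git check-ignore -v target/some.class
```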

After this command, re-add, verify and finally commit:

git add .
git commit -m "source files only!!!"