To predict the weather for this Friday, I will employ a simple machine learning technique!
It is very simple...
First, you need to collect the data. For this purpose I will return to the WU site to collect a year's worth of data. Like I said, the site is awesome. Not only are they pretty accurate in their predictions, at least for my area, but they also have some cool gadgets and they share their data for free!
Now... The data won't be clean, and what we get from WU won't be immediately usable. As I mentioned in one of my previous posts, one of the skills of working with data is being able to clean it. There are numerous examples out there where dirty, aka noisy, data led to errors, intentional or not. So the second step involves cleaning the data.
Now that we have the data, we will be able to use a decision tree classifier to come up with a model. When we are satisfied with the model, we will use it to come up with our own prediction for this weekend, compare the results with WU, and then wait and see who was right :)
The Plan
Step 1
The data given to us by WU will consist of several different features and the outcome for each day. For example, we will know what happened that day. Did it snow? Was it nice and sunny? Or was it raining cats and dogs? I.e. we will have the outcome! As far as features go, we will have information about temperature, humidity, wind, etc. All of these together will help us predict the weather: if it is 80 degrees and the outlook is sunny, it is unlikely to snow and most likely nice weather to enjoy outside on a beach! Hence our goal here is to gather features, or attributes, of the day that will help us determine the outcome of that day: snow, rain, etc.
Step 2
Clean it!
Step 3
Use a decision tree to come up with a model. For example, let's say that the temperature is more than 80 degrees and we see historically that it never rained or snowed when it was above 80 degrees. Then one branch of the decision tree would be the condition: if temp > 80 then "No Snow". What happens if it is less than 80? Well, we can look at the other attributes that we gather. Let's say that when it was less than 80 degrees but more than 30 and the outlook was sunny, it never snowed. Here you have yet another condition! If 30 < temp < 80 and outlook is sunny then "No Snow". Hopefully you can see the idea and where we are going with this!
To make step 3 easier, I am going to use Weka. It is machine learning software written in Java.
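To make the two example branches concrete, here is a minimal hand-written sketch of them in Python. This is not a learned model, only the conditions from the example spelled out; the function name and the "Unknown" fallback are my own assumptions, and a temperature of exactly 80 is folded into the second branch because the example leaves it unspecified.

```python
def no_snow_rule(temp_f, outlook):
    """Toy version of the two example branches (hand-written, not learned)."""
    if temp_f > 80:
        return "No Snow"
    # Reached only when temp_f <= 80, so this covers 30 < temp <= 80.
    if temp_f > 30 and outlook == "sunny":
        return "No Snow"
    # The example does not cover the remaining cases; a real tree
    # would keep splitting on other attributes here.
    return "Unknown"


print(no_snow_rule(85, "cloudy"))
print(no_snow_rule(50, "sunny"))
print(no_snow_rule(20, "cloudy"))
```

A decision tree learner builds this kind of nested if/else for you from the historical rows instead of you guessing the thresholds.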
The Action
Now let's get some data!
Go to http://www.wunderground.com/
On the menu, locate 'More' and select the 'Historical Weather' submenu.
Enter a zip code. In my case, I entered 23220 since I am trying to predict the weather in Richmond this Friday! Then hit 'Enter'.
Locate the Custom tab (Daily, Weekly, Monthly, Custom) and select a year's worth of data.
Select Get History. Then go all the way to the bottom to download the comma-delimited file.
I downloaded my file as richmond_data.csv and saved it on my laptop...
Ever since I migrated to MacOS, I have started to favor Google Docs. For one thing, I don't want to learn Mac Office and then have to deal with converting my docs and Excel spreadsheets for PC users... Also, with Google Docs my documents are already in the cloud, hence less stuff to back up and worry about.
Go to https://drive.google.com/ and create a new Google Sheets document.
File->Import and import your file (richmond_data.csv).
For the import action, pick 'Import New Sheet(s)'; for the separator character, go ahead and pick comma. Hit Import and you should see something like:
Now it is time to clean!!!
So we have a bunch of attributes and the accompanying event for each day. Take a look at the last column: 'Events'. It contains what actually happened that day, and as you can see, some of the values are empty. Let's correct that and put in 'N/A'. Instead of filling it out by hand, let's employ some Excel wizardry. Create a new column right after the 'Events' column and insert the following formula (in my case, the 'Events' column is the V column):
=if(V2="","N/A",V2)
Basically, it does a simple check to see if the value is empty: if so, it inserts "N/A"; if not, it puts the original value in. Now drag it to the bottom and you should have no empty values left at this point.
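If you would rather script this step than drag a formula, the same check can be sketched in Python with the standard csv module. The three sample rows and the "Mean TemperatureF" column name are made up for illustration; only "EST" and "Events" are column names taken from this post.

```python
import csv
import io

# Made-up sample standing in for richmond_data.csv; the real export
# has many more columns and one row per day.
sample = io.StringIO(
    "EST,Mean TemperatureF,Events\n"
    "2014-1-1,40,\n"
    "2014-1-2,30,Rain-Snow\n"
    "2014-1-3,35,Fog\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Same check as the spreadsheet formula: empty cell -> "N/A".
    if row["Events"].strip() == "":
        row["Events"] = "N/A"

print([row["Events"] for row in rows])
```

To run it against the real file, you would swap the io.StringIO sample for open("richmond_data.csv", newline="").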
Now, we can improve the 'Events' column further. As you can see, some rows list multiple events in one cell ('Rain-Snow'), and others are of little value to us ('Fog' - I don't care if it will be foggy, I have anti fog lights on!!! :) ). So let's modify our formula to give us Snow or N/A.
=if(isnumber(find("Snow", V2)),"Snow","N/A")
Now let's swap the 'Events' column with our clean column. Copy the W column, paste it back as values only, and delete the V column. Simple!
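For comparison, the Snow-or-N/A formula boils down to a one-line substring check. A minimal Python sketch (the function name is hypothetical):

```python
def snow_label(event):
    # Mirrors =if(isnumber(find("Snow", V2)),"Snow","N/A").
    # Like the spreadsheet's find(), the `in` check is case-sensitive.
    return "Snow" if "Snow" in event else "N/A"


for event in ["Rain-Snow", "Snow", "Fog", "Rain", ""]:
    print(repr(event), "->", snow_label(event))
```

Note that this collapses everything that is not snow (rain, fog, nothing at all) into a single "N/A" class, so the model only ever learns snow versus not-snow.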
More cleaning: the Precipitation column (the T column in my case) contains numeric values and the letter T (a trace amount). Let's change T to 0 so as not to throw our model into panic mode. Follow the same steps as above but with a slightly different formula, and after you are done, copy/paste the values and delete the old column.
=if(T2="T",0,T2)
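The same precipitation cleanup as a small Python sketch. The function name is hypothetical, and converting the remaining values to float goes one step beyond what the spreadsheet formula does, so treat that part as my assumption about what you would want downstream.

```python
def clean_precip(value):
    """Turn the 'T' marker into 0 and everything else into a number."""
    value = value.strip()
    if value == "T":
        return 0.0
    return float(value)


print(clean_precip("T"))
print(clean_precip("0.25"))
```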
Now, we will be predicting the weather, so we should not put the answer on the same row as that day's attributes, i.e. we cannot pair a day's weather event with the same day's measurements. Since we are predicting, we need to put tomorrow's weather in the same row as today's attributes. Select cell V3 and extend the selection all the way to the bottom, thus selecting all values for the weather, and copy them. Now go to V2 and paste. So now each row has a list of attributes and the weather event that happened the next day.
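The copy-from-V3, paste-at-V2 trick is a shift of the label column by one row. Here is a minimal Python sketch of the same idea; the function name, the "temp" key, and the three sample days are made up for illustration.

```python
def shift_labels_up(rows, label_key="Events"):
    """Pair each day's attributes with the NEXT day's label.

    The last day has no 'tomorrow', so it is dropped.
    """
    shifted = []
    for today, tomorrow in zip(rows, rows[1:]):
        new_row = dict(today)            # copy today's attributes
        new_row[label_key] = tomorrow[label_key]
        shifted.append(new_row)
    return shifted


days = [
    {"temp": 40, "Events": "N/A"},
    {"temp": 30, "Events": "Snow"},
    {"temp": 35, "Events": "N/A"},
]
for row in shift_labels_up(days):
    print(row)
```

One detail the sketch makes visible: after the paste in the spreadsheet, the very last row still holds its old same-day label, so it is probably worth deleting that row before modeling.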
(Just glancing over the data, it snowed like 6 times the entire year! It probably felt like more than that since the snow was never cleared, even though it only snowed so few times... VDOT... what can I say...)
At this point we have relatively clean data, so let's download it locally so we can feed it to Weka and do some magic with it. File->Download As->Comma separated values. (Can't go wrong with this format.)
Let's model it!!!
Start Weka and pick the 'Explorer' option.
Select 'Open file...'. Pick CSV as the file type and navigate to the location of your downloaded file.
If you look at the EST column, you will see that it has 366 unique values. That makes sense since each row is one day of the year. This will not be useful to us, so let's remove this column. (Select the checkbox next to it and click the Remove button.)
Select the 'Classify' tab.
Select 'Choose' and pick J48 under the trees submenu.
In the drop-down under 'More Options', select Events since that's what we are going to try to predict.
Hit Start.
After a few seconds, you should have your decision tree!!!
Now all that hard work of figuring out which attributes/features help you determine what will happen tomorrow has already been done for you! From here you can either create your own weather app or just look at the temperature and say that if it is less than 32 degrees and cloud cover is more than zero, it will snow.
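If you are curious how a tree learner decides where to split, here is a simplified Python sketch of the core idea: score a candidate threshold by how much it reduces the entropy of the labels (information gain). This is not J48's exact procedure; J48 is Weka's implementation of C4.5, which also normalises the gain and prunes the tree. The temperatures and labels below are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value <= threshold and value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Made-up toy data: three cold snowy days, three warmer days without snow.
temps = [25, 28, 31, 40, 55, 70]
events = ["Snow", "Snow", "Snow", "N/A", "N/A", "N/A"]

print(information_gain(temps, events, 32))  # separates the classes cleanly
print(information_gain(temps, events, 60))  # a much weaker split
```

The learner tries many such thresholds on every attribute and keeps the best one at each node, which is how a condition like "temperature below 32" ends up in the tree.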
As you can tell, the tree is not very big... but we were expecting that since it only snowed a few times in the last year.
Scroll in the window to see more interesting stuff like the confusion matrix and so on. If you are taking stats, you can use Weka to double-check your results :)
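In case the confusion matrix is new to you: it is just a count of (actual, predicted) pairs. A minimal Python sketch with made-up labels (not Weka's output):

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels for five days, purely for illustration.
actual = ["Snow", "N/A", "N/A", "Snow", "N/A"]
predicted = ["Snow", "N/A", "Snow", "N/A", "N/A"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a!r:7} predicted={p!r:7} count={count}")

# Accuracy is the share of pairs where actual and predicted agree.
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)
print("accuracy:", accuracy)
```

With a rare event like snow, plain accuracy can look great even if the model never predicts snow at all, so the off-diagonal counts are the part worth reading.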
As far as the Richmond trip goes, if the temperature is less than 32 (it will be...) and we have clouds the night before, it will snow. We shall see!!!
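Written out as code, the rule I am betting on looks roughly like this. The function and parameter names are hypothetical; the thresholds are the ones quoted above (below 32 degrees, cloud cover above zero).

```python
def predict_tomorrow(temp_f, cloud_cover):
    """Hand-copied version of the rule read off the tree in this post."""
    if temp_f < 32 and cloud_cover > 0:
        return "Snow"
    return "N/A"


print(predict_tomorrow(28, 6))
print(predict_tomorrow(45, 6))
```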