Tuesday, January 31, 2017

Creating dummy variables in pandas

When I need to apply ML techniques and the output column of my data contains categories, I often need to split that column into several columns with yes or no values.
Let me explain... Say you are working on a logistic regression and your output is a set of categories like 1, 2, 3, 4... If you treat those categories as plain numbers and predict the value directly, the model can come up with 1.4, which is really not a valid category.
In this case, you need to break the output column down into several columns with yes/no values, one per category. This way, logistic regression works by predicting either yes or no for each of them.

So how do you do it?
Pandas!

First, import pandas
>>> import pandas as pd

Then let's say you have the following data.
>>> my_data = {'name': ['John Doe', 'Jane Doe', 'Mike Roth', 'Mark Wagner', 'David Scott'],
...         'title': ['developer', 'manager', 'developer', 'manager', 'developer']}

Take a look at it...
>>> my_data
{'name': ['John Doe', 'Jane Doe', 'Mike Roth', 'Mark Wagner', 'David Scott'], 'title': ['developer', 'manager', 'developer', 'manager', 'developer']}

Convert it to a DataFrame object
>>> df = pd.DataFrame(my_data, columns=['name','title'])

>>> df.head()
          name      title
0     John Doe  developer
1     Jane Doe    manager
2    Mike Roth  developer
3  Mark Wagner    manager
4  David Scott  developer

Use the pandas get_dummies function to break the title column out into one column per category
>>> df_title = pd.get_dummies(df['title'])

Join it to the original DataFrame, side by side
>>> df = pd.concat([df,df_title],axis=1)

Tada... that's our result
>>> df.head()
          name      title  developer  manager
0     John Doe  developer        1.0      0.0
1     Jane Doe    manager        0.0      1.0
2    Mike Roth  developer        1.0      0.0
3  Mark Wagner    manager        0.0      1.0
4  David Scott  developer        1.0      0.0
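
Depending on your pandas version, the dummy columns may show up as floats (as above), small integers, or True/False. Here is a minimal sketch of the same steps as a script, assuming a pandas version recent enough that get_dummies accepts the dtype argument. It also covers the numeric 1, 2, 3, 4 case from the intro. The levels series and the 'level' prefix are made up for illustration.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    'name': ['John Doe', 'Jane Doe', 'Mike Roth', 'Mark Wagner', 'David Scott'],
    'title': ['developer', 'manager', 'developer', 'manager', 'developer'],
})

# dtype=int asks for plain 0/1 integers, so the result does not
# depend on the default type of the installed pandas version.
dummies = pd.get_dummies(df['title'], dtype=int)

# Put the yes/no columns next to the original columns.
result = pd.concat([df, dummies], axis=1)
print(result)

# Hypothetical numeric categories, like the 1, 2, 3, 4 output from the intro.
levels = pd.Series([1, 2, 3, 4, 2], name='level')

# prefix gives the new columns readable names: level_1, level_2, ...
level_dummies = pd.get_dummies(levels, prefix='level', dtype=int)
print(level_dummies)
```

Each row ends up with exactly one "yes" across the new columns, so every category gets its own yes/no column to predict.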
