Wednesday, February 7, 2018

Fun with the apply function in pandas

While doing some data munging, I came across one issue that had me running in circles and made me recheck my logic over and over...

I was getting duplicates in my output set while using the 'apply' function on a dataframe. More specifically, I used 'apply' on a grouped dataframe (the result of a groupby) to process each batch of records and apply a set of business rules, writing the results to an external file.

Like I said, I had duplicates!

So... the reason for this, as it became apparent to me, is that 'apply' needs to figure out whether you are mutating the passed data in order to take a fast or slow path to execute the code. To find out, the pandas version I was using calls the function twice on the first group (see the issues referenced below). Hence I had duplicates in my external file. You won't notice this if your function only writes to the dataframe itself: in that case it just writes the same value twice, replacing the old one, and it is transparent to you.
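Here is a minimal sketch of the situation, not my actual code. The dataframe, the column names and the `process` function are all made up for illustration, and a plain list stands in for the external file. Whether the first group really shows up twice depends on your pandas version (newer releases may not repeat the call), so I am not showing an expected output for the 'apply' part. Looping over the groups yourself visits each group exactly once, which makes it a safer choice when the function has side effects.

```python
import pandas as pd

# Hypothetical data: two batches of records.
df = pd.DataFrame({"batch": ["a", "a", "b"], "value": [1, 2, 3]})

# Side effect inside apply. On the pandas versions discussed in the
# referenced issues, the first group may be passed to the function twice.
seen_by_apply = []

def process(group):
    # Appending to a list stands in for writing to an external file.
    seen_by_apply.append(group.name)
    return len(group)

df.groupby("batch")[["value"]].apply(process)

# Explicit iteration: each group is visited exactly once.
written = []
for key, group in df.groupby("batch"):
    written.append(key)

print(seen_by_apply)  # may contain the first batch twice on older pandas
print(written)
```

If you do need 'apply', another option is to have the function only return values and do the writing once, after 'apply' has finished.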

References:
https://github.com/pandas-dev/pandas/issues/7739
https://github.com/pandas-dev/pandas/issues/6753
