Friday, March 9, 2018

Commit your changes or stash them before you can merge??!?!


When trying to update your local copy from the remote master, you may see the following error:

$ git pull origin master
error: Your local changes to the following files would be overwritten by merge:
<list of files>
Please commit your changes or stash them before you merge.
Aborting


You have several options here:

1. Commit the changes
$ git add .
$ git commit -m "committing before the update"
$ git pull origin master


2. Stash them, pull, then pop your changes back on top
$ git stash
$ git pull origin master
$ git stash pop

3. Overwrite local changes (careful: this discards all uncommitted work)
$ git reset --hard
$ git pull origin master

Thursday, March 1, 2018

Add Twistd to your Flask app

I'm using Twisted Web as the front end for my Flask WSGI app. You probably already know of it; if not, it's essential: non-blocking IO that fully scales up to thousands of concurrent connections.

Essentially, twistd combines the non-blocking IO of Tornado and the front-end container role of Gunicorn in one package.

$ pip install twisted

$ vi run.sh
#!/bin/sh
export PYTHONPATH=.:$PYTHONPATH
twistd -n web --port tcp:8006 --wsgi yourfile.app

<where .app is the flask app inside yourfile.py>

$ chmod +x run.sh
$ nohup ./run.sh &
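
For reference, a minimal yourfile.py might look like this (names are hypothetical; the only requirement is a module-level WSGI object called app):

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'Hello from Flask behind twistd!'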



Tuesday, February 13, 2018

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

While trying to debug my code and visualize a simple plot, this error popped up out of the blue: "ImportError: libGL.so.1: cannot open shared object file: No such file or directory".

A quick Google search suggested that I use
import matplotlib
matplotlib.use("Agg")

Fine... and it did work! No more error message... but I also stopped seeing my plot! :)
No good.
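
The reason: Agg is a non-interactive backend. It renders plots to files rather than to a window, so a sketch like the one below will happily save a PNG but never show anything on screen:

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, not windows
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.savefig("plot.png")  # works fine under Agg
plt.show()               # no GUI window ever opens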

After digging deeper, I discovered that Red Hat does not ship native OpenGL libraries, but does ship the Mesa libraries: an MIT licensed implementation of the OpenGL specification.
You can add the Mesa OpenGL runtime libraries on your system by installing the following package:

$ sudo yum install mesa-libGL
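
To confirm the library is now visible to the dynamic linker, you can check the cache:

$ ldconfig -p | grep libGL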

Problem was solved!

Happy coding

Wednesday, February 7, 2018

Add github master to your project

If you have an active project on your laptop with a central Git remote already in place, follow these simple steps to add GitHub as an extra backup repo for your project.


$ git remote add github https://github.com/your_name/repository_name.git
# push master to github
$ git push github master
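
You can confirm that both remotes are configured correctly with:

$ git remote -v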

Fun with apply function in pandas

While doing some data munging, I came across an issue that had me running in circles and made me recheck my logic over and over...

I was getting duplicates in my output while using the 'apply' function. More specifically, I used 'apply' on a grouped-by dataframe to process each batch of records, apply a set of business rules, and write the results to an external file.

Like I said, I had duplicates!

So... the reason, as it became apparent to me, is that 'apply' calls the function on the first group twice: it needs to figure out whether your function mutates the passed data in order to choose a fast or slow path to execute the code. Hence the duplicates in my external file. You won't notice this if your function simply returns values into the dataframe, because the second call just overwrites the same value and the double execution stays transparent to you.
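
Here's a minimal sketch of the behavior (column names are hypothetical). On the pandas versions current at the time of writing, the function is evaluated twice on the first group, so any side effect, like writing to an external file, fires twice for that group:

import pandas as pd

seen = []

def process(group):
    # side effect standing in for "write this batch to an external file"
    seen.append(group.name)
    return group['value'].sum()

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 3]})
df.groupby('key').apply(process)

print(seen)  # e.g. ['a', 'a', 'b'] -- the first group was processed twice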

References:
https://github.com/pandas-dev/pandas/issues/7739
https://github.com/pandas-dev/pandas/issues/6753

Thursday, August 3, 2017

Optimizing Pandas

We’ll review the efficiency of several methodologies for applying a function to a Pandas DataFrame, from slowest to fastest:

1. Crude looping over DataFrame rows using indices
2. Looping with iterrows()
3. Looping with apply()
4. Vectorization with Pandas series
5. Vectorization with NumPy arrays

For our example function, we’ll use the Haversine (or Great Circle) distance formula. Our function takes the latitude and longitude of two points, adjusts for Earth’s curvature, and calculates the great-circle distance between them. The function looks something like this:
import numpy as np
 
# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
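
As a quick sanity check, you can call the function directly on a pair of scalar coordinates (the second point here is an arbitrary Manhattan location, chosen purely for illustration):

# distance in miles between two sample points
print(haversine(40.671, -73.985, 40.7128, -74.0060))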

To test our function on real data, we’ll use a dataset containing the coordinates of all hotels in New York state, sourced from Expedia’s developer site. We’ll calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop called the Brooklyn Superhero Supply Store in NYC).
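
For reference, a minimal load step might look like the following; the filename is hypothetical, and any CSV with 'latitude' and 'longitude' columns will do:

import pandas as pd

# load the hotel coordinates (hypothetical filename; encoding may vary)
df = pd.read_csv('new_york_hotels.csv', encoding='cp1252')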

Crude looping in Pandas, or That Thing You Should Never Ever Do

Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.

For example, one might write something like this:
# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

To get a sense of the time required to execute the function above, we’ll use the %%timeit magic command.

%%timeit
 
# Run the haversine looping function
df['distance'] = haversine_looping(df)

This returns the following result:
645 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Our crude looping function took about 645 ms to run, with a standard deviation of 31 ms. This may seem fast, but it’s actually quite slow, considering the function only needed to process some 1,600 rows. Let’s look at how we can improve this unfortunate state of affairs.

Looping with iterrows()

A better way to loop through rows, if loop you must, is with the iterrows() method. iterrows() is a generator that iterates over the rows of the dataframe and returns the index of each row, in addition to an object containing the row itself. iterrows() is optimized to work with Pandas dataframes, and, although it’s the least efficient way to run most standard functions (more on that later), it’s a significant improvement over crude looping. In our case, iterrows() solves the same problem almost four times faster than manually looping over rows.

%%timeit
 
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series
 
166 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Better looping using the apply method

An even better option than iterrows() is to use the apply() method, which applies a function along a specific axis (meaning, either rows or columns) of a DataFrame.

Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython.

We use an anonymous lambda function to apply our Haversine function on each row, which allows us to point to specific cells within each row as inputs to the function. The lambda function includes the axis parameter at the end, in order to specify whether Pandas should apply the function to rows (axis = 1) or columns (axis = 0).

%%timeit
 
# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Swapping apply() for iterrows() has roughly halved the runtime of the function.

Vectorization over Pandas series

To understand how we can reduce the amount of iteration performed by the function, recall that the fundamental units of Pandas, DataFrames and series, are both based on arrays. The inherent structure of the fundamental units translates to built-in Pandas functions being designed to operate on entire arrays, instead of sequentially on individual values (referred to as scalars). Vectorization is the process of executing operations on entire arrays.
Pandas includes a generous collection of vectorized functions for everything from mathematical operations to aggregations and string functions (for an extensive list of available functions, check out the Pandas docs). The built-in functions are optimized to operate specifically on Pandas series and DataFrames. As a result, using vectorized Pandas functions is almost always preferable to accomplishing similar ends with custom-written looping.
So far, we’ve only been passing scalars to our Haversine function. All of the functions being used within the Haversine function, however, are also able to operate on arrays. This makes the process of vectorizing our distance function quite simple: instead of passing individual scalar values for latitude and longitude to it, we’re going to pass it the entire series (columns). This will allow Pandas to benefit from the full set of optimizations available for vectorized functions, including, notably, performing all the calculations on the entire array simultaneously.

%%timeit 
 
# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])
 
1.62 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We’ve achieved more than a 50-fold improvement over the apply() method, and more than a 100-fold improvement over iterrows() by vectorizing the function — and we didn’t need to do anything but change the input type!

Vectorization with NumPy arrays

At this point, we could choose to call it a day; vectorizing over Pandas series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of highest priority, we can call in reinforcements in the form of the NumPy Python library.
The NumPy library, which describes itself as a “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Like Pandas, NumPy operates on array objects (referred to as ndarrays); however, it leaves out a lot of overhead incurred by operations on Pandas series, such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on Pandas series.
NumPy arrays can be used in place of Pandas series when the additional functionality offered by Pandas series isn’t critical. For example, the vectorized implementation of our Haversine function doesn’t actually use indexes on the latitude or longitude series, and so not having those indexes available will not cause the function to break. By comparison, had we been doing operations like DataFrame joins, which require referring to values by index, we might want to stick to using Pandas objects.
We convert our latitude and longitude columns from Pandas series to NumPy arrays simply by using the values attribute of the series. As with vectorization on the series, passing the NumPy arrays directly into the function runs all the calculations on the entire vector at once.

%%timeit
 
# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Running the operation on NumPy arrays has achieved another roughly four-fold improvement. All in all, we’ve refined the runtime from over half a second, via looping, to about a third of a millisecond, via vectorization with NumPy!

Summary

The scoreboard below summarizes the results. Although vectorization with NumPy arrays resulted in the fastest runtime, it was a fairly marginal improvement over the effect of vectorization with Pandas series, which resulted in a whopping 56x improvement over the fastest version of looping.

Method                              Mean runtime
Crude looping                       645 ms
iterrows()                          166 ms
apply()                             90.6 ms
Vectorization with Pandas series    1.62 ms
Vectorization with NumPy arrays     370 µs

This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
The above does not, of course, make up a comprehensive list of all possible optimizations for Pandas. More adventurous users might consider, for example, further rewriting the function in Cython, or attempting to optimize the individual components of the function. However, these topics are beyond the scope of this post.





Wednesday, August 2, 2017

PCA

Each principal component is a linear combination of the original variables:

PC_j = \beta_{1j} X_1 + \beta_{2j} X_2 + \cdots + \beta_{pj} X_p

where the X_i are the original variables, and the \beta_{ij} are the corresponding weights, or so-called coefficients.
This information is included in the pca attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you can do the following.
Note: each coefficient represents the weight (loading) of a particular feature in a particular component.
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# dump each component's relation to the original features
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416
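
As a quick follow-up, the fitted PCA object also exposes explained_variance_ratio_, which tells you how much of the total variance each component captures:

# proportion of total variance explained by each component
print(pca.explained_variance_ratio_)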