Thursday, August 3, 2017

Optimizing Pandas

We’ll review the efficiency of several methodologies for applying a function to a Pandas DataFrame, from slowest to fastest:

1. Crude looping over DataFrame rows using indices
2. Looping with iterrows()
3. Looping with apply()
4. Vectorization with Pandas series
5. Vectorization with NumPy arrays

For our example function, we’ll use the Haversine (or Great Circle) distance formula. Our function takes the latitude and longitude of two points and calculates the great-circle distance between them, accounting for Earth’s curvature. The function looks something like this:
import numpy as np
 
# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # Earth's mean radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
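As a quick sanity check (using well-known city coordinates that are not part of the hotel dataset), the function returns a plausible number for the New York City to Los Angeles distance, which is roughly 2,450 miles:

```python
import numpy as np

# Same haversine function as above, repeated so this snippet runs standalone
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # Earth's mean radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# NYC (40.7128, -74.0060) to LA (34.0522, -118.2437)
d = haversine(40.7128, -74.0060, 34.0522, -118.2437)
```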

To test our function on real data, we’ll use a dataset containing the coordinates of all hotels in New York state, sourced from Expedia’s developer site. We’ll calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop called the Brooklyn Superhero Supply Store in NYC).
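The Expedia dataset isn’t reproduced here, so for the purposes of running the snippets below, a hypothetical stand-in DataFrame with the same column names can be mocked up:

```python
import pandas as pd

# Hypothetical stand-in for the Expedia hotel data: any DataFrame with
# 'latitude' and 'longitude' columns works with the snippets that follow
df = pd.DataFrame({
    'hotel_name': ['Hotel A', 'Hotel B', 'Hotel C'],
    'latitude': [40.7506, 40.6892, 42.6526],
    'longitude': [-73.9935, -74.0445, -73.7562],
})
```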

Crude looping in Pandas, or That Thing You Should Never Ever Do

Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.

For example, one might write something like this:
# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

To get a sense of the time required to execute the function above, we’ll use the %%timeit cell magic in Jupyter.

%%timeit
 
# Run the haversine looping function
df['distance'] = haversine_looping(df)

This returns the following result:
645 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Our crude looping function took about 645 ms to run, with a standard deviation of 31 ms. This may seem fast, but it’s actually quite slow, considering the function only needed to process some 1,600 rows. Let’s look at how we can improve this unfortunate state of affairs.

Looping with iterrows()

A better way to loop through rows, if loop you must, is with the iterrows() method. iterrows() is a generator that iterates over the rows of the DataFrame, returning each row’s index along with an object containing the row itself. iterrows() is optimized to work with Pandas DataFrames and, although it’s the least efficient way to run most standard functions (more on that later), it’s a significant improvement over crude looping. In our case, iterrows() solves the same problem almost four times faster than manually looping over rows.

%%timeit
 
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series
 
166 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Better looping using the apply method

An even better option than iterrows() is to use the apply() method, which applies a function along a specific axis (meaning, either rows or columns) of a DataFrame.

Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython.

We use an anonymous lambda function to apply our Haversine function to each row, which allows us to point to specific cells within each row as inputs to the function. Note that the axis parameter belongs to apply() itself, not to the lambda; it specifies whether Pandas should apply the function to rows (axis=1) or columns (axis=0).

%%timeit
 
# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Swapping apply() for iterrows() has roughly halved the runtime of the function.

Vectorization over Pandas series

To understand how we can reduce the amount of iteration performed by the function, recall that the fundamental units of Pandas, DataFrames and Series, are both based on arrays. Because of this structure, built-in Pandas functions are designed to operate on entire arrays, instead of sequentially on individual values (referred to as scalars). Vectorization is the process of executing operations on entire arrays.
Pandas includes a generous collection of vectorized functions for everything from mathematical operations to aggregations and string functions (for an extensive list of available functions, check out the Pandas docs). The built-in functions are optimized to operate specifically on Pandas series and DataFrames. As a result, using vectorized Pandas functions is almost always preferable to accomplishing similar ends with custom-written looping.
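As a minimal illustration (separate from the hotel example), a few of these built-ins operating on whole Series at once:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# Arithmetic, aggregations, and string methods all operate on entire Series
roots = s ** 0.5                                # element-wise square root
total = s.sum()                                 # aggregation over the Series
names = pd.Series(['ann', 'bob']).str.title()   # vectorized string method
```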
So far, we’ve only been passing scalars to our Haversine function. All of the functions being used within the Haversine function, however, are also able to operate on arrays. This makes the process of vectorizing our distance function quite simple: instead of passing individual scalar values for latitude and longitude to it, we’re going to pass it the entire series (columns). This will allow Pandas to benefit from the full set of optimizations available for vectorized functions, including, notably, performing all the calculations on the entire array simultaneously.

%%timeit 
 
# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])
 
1.62 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We’ve achieved more than a 50-fold improvement over the apply() method, and more than a 100-fold improvement over iterrows() by vectorizing the function — and we didn’t need to do anything but change the input type!

Vectorization with NumPy arrays

At this point, we could choose to call it a day; vectorizing over Pandas series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of highest priority, we can call in reinforcements in the form of the NumPy Python library.
The NumPy library, which describes itself as a “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Like Pandas, NumPy operates on array objects (referred to as ndarrays); however, it leaves out a lot of overhead incurred by operations on Pandas series, such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on Pandas series.
NumPy arrays can be used in place of Pandas series when the additional functionality offered by Pandas series isn’t critical. For example, the vectorized implementation of our Haversine function doesn’t actually use indexes on the latitude or longitude series, and so not having those indexes available will not cause the function to break. By comparison, had we been doing operations like DataFrame joins, which require referring to values by index, we might want to stick to using Pandas objects.
We convert our latitude and longitude data from Pandas Series to NumPy arrays simply by using the values attribute of each Series. As with vectorization on the Series, passing the NumPy arrays directly into the function performs all the calculations on the entire arrays at once.

%%timeit
 
# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Running the operation on NumPy arrays has achieved another four-fold improvement. All in all, we’ve refined the runtime from over half a second, via looping, to around a third of a millisecond, via vectorization with NumPy!
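Speed aside, both vectorized paths produce identical distances. A self-contained check, using a small mocked-up DataFrame in place of the hotel data, might look like this:

```python
import numpy as np
import pandas as pd

# Same haversine function as above, repeated so this snippet runs standalone
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return MILES * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({'latitude': [40.7506, 42.6526],
                   'longitude': [-73.9935, -73.7562]})

series_result = haversine(40.671, -73.985, df['latitude'], df['longitude'])
array_result = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

# Same numbers come back, just in different container types
assert np.allclose(series_result.values, array_result)
```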

Summary

The scoreboard below summarizes the results. Although vectorization with NumPy arrays resulted in the fastest runtime, it was a fairly marginal improvement over vectorization with Pandas series, which itself delivered a whopping 56x improvement over the fastest version of looping.

Method                              Mean runtime per loop
Crude looping                       645 ms
Looping with iterrows()             166 ms
Looping with apply()                90.6 ms
Vectorization with Pandas series    1.62 ms
Vectorization with NumPy arrays     370 µs

This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
The above does not, of course, make up a comprehensive list of all possible optimizations for Pandas. More adventurous users might consider, for example, further rewriting the function in Cython, or attempting to optimize the individual components of the function. However, these topics are beyond the scope of this post.





Wednesday, August 2, 2017

PCA

Each principal component is a linear combination of the original variables:
PC_j = β_1·X_1 + β_2·X_2 + … + β_p·X_p

where the X_i are the original variables, and the β_i are the corresponding weights, or so-called coefficients.

This information is stored in the PCA attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components relate linearly to the different features, you can label the array with the feature names:
Note: each coefficient indicates the weight of a particular feature in a particular component; on standardized data, larger magnitudes correspond to a stronger association between that feature and the component.
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# Load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Normalize data (zero mean, unit variance per feature)
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump each component's relation with the features
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416
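The “linear combination” statement above can be verified numerically: multiplying the scaled data by the transpose of pca.components_ reproduces the component scores returned by fit_transform. A sketch using the same iris setup:

```python
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

pca = PCA(n_components=2)
scores = pca.fit_transform(data_scaled)

# Each component score is the dot product of the (scaled) features with
# that component's coefficient vector
manual_scores = data_scaled.values @ pca.components_.T
assert np.allclose(scores, manual_scores)
```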