Cloud. Big Data. Analytics... and so on
Friday, March 9, 2018
Commit your changes or stash them before you can merge??!?!
When trying to update your local copy from the remote master, you may see the following error:
$ git pull origin master
error: Your local changes to the following files would be overwritten by merge:
<list of files>
Please commit your changes or stash them before you merge.
Aborting
You have several options here:
1. Commit the changes
$ git add .
$ git commit -m "committing before the update"
2. Stash them, pull, then restore your work
$ git stash
$ git pull origin master
$ git stash pop
3. Overwrite local changes (this discards all uncommitted work, so be sure you don't need it)
$ git reset --hard
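Once your working tree is clean (after committing or stashing), git status confirms it and the pull will go through (exact output may vary by git version):
$ git status
nothing to commit, working tree clean
$ git pull origin master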
Thursday, March 1, 2018
Add Twistd to your Flask app
I'm using Twisted Web as the front end for a Flask WSGI app. You probably already know of it; if not, it's essential: non-blocking IO that scales up to thousands of connections. Essentially, Twisted combines the non-blocking IO of Tornado and the front-end container of Gunicorn in one package.
$ pip install twisted
$ vi run.sh
export PYTHONPATH=.:$PYTHONPATH
twistd -n web --port tcp:8006 --wsgi yourfile.app
(where .app is the Flask application object inside yourfile.py)
$ chmod +x run.sh
$ nohup ./run.sh &
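For reference, a minimal yourfile.py could look like this (a sketch; the module and app names just need to match the --wsgi yourfile.app argument above):

from flask import Flask

app = Flask(__name__)  # "app" is the object twistd's --wsgi yourfile.app points to

@app.route('/')
def index():
    return 'Hello from Flask behind twistd!'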
Tuesday, February 13, 2018
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
When I was trying to visualize a simple plot while debugging my code, this error popped up out of the blue: "ImportError: libGL.so.1: cannot open shared object file: No such file or directory".
A quick Google search later, it was suggested that I use:
import matplotlib
matplotlib.use("Agg")
Fine... and it did work! No more error message... but I also stopped seeing my plot! :)
No good.
After digging deeper, I discovered that Red Hat does not ship native OpenGL libraries, but does ship the Mesa libraries: an MIT licensed implementation of the OpenGL specification.
You can add the Mesa OpenGL runtime libraries on your system by installing the following package:
$ sudo yum install mesa-libGL
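With the library in place, a minimal sanity check like this should render the plot again instead of raising the ImportError:

import matplotlib.pyplot as plt

# A trivial plot; with mesa-libGL installed this should display normally
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()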
Problem was solved!
Happy coding
Wednesday, February 7, 2018
Add github master to your project
If you have an active project on your laptop and some central git repository already in place, follow these simple steps to add GitHub as an extra backup repo for your project.
$ git remote add github https://github.com/your_name/repository_name.git
# push master to github
$ git push github master
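You can confirm the new remote is in place with git remote -v, which lists each remote with its fetch and push URLs:
$ git remote -v
github  https://github.com/your_name/repository_name.git (fetch)
github  https://github.com/your_name/repository_name.git (push)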
Fun with the apply function in Pandas
While doing some data munging, I came across one issue that had me running in circles and made me recheck my logic over and over...
I was getting duplicates in my output set while applying the 'apply' function to a dataframe. More specifically, I used 'apply' on a grouped-by dataframe to process each batch of records and apply a set of business rules, writing the results to an external file.
Like I said, I had duplicates!
So... the reason for this, as it became apparent to me, is that the 'apply' function needs to figure out whether you are mutating the passed data in order to choose a fast or a slow path to execute the code, and to do so it calls your function on the first group twice. Hence I would get duplicates in my external file. You won't notice this if you apply the function to the dataframe itself, in which case it simply writes the same value twice, replacing the old one, and the extra call stays transparent to you.
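A minimal sketch of the effect (hypothetical data; the double call is the behavior described in the pandas issues referenced below):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

def process(group):
    # stand-in for a side effect such as writing to an external file
    print('processing group:', group.name)
    return group['val'].sum()

df.groupby('key').apply(process)
# On affected pandas versions this prints 'a' twice: apply() runs the
# function on the first group an extra time to pick the fast or slow
# path, so any side effect fires twice for that group.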
References:
https://github.com/pandas-dev/pandas/issues/7739
https://github.com/pandas-dev/pandas/issues/6753
Thursday, August 3, 2017
Optimizing Pandas
We’ll review the efficiency of several methodologies for applying a function to a Pandas DataFrame, from slowest to fastest:
1. Crude looping over DataFrame rows using indices
2. Looping with iterrows()
3. Looping with apply()
4. Vectorization with Pandas series
5. Vectorization with NumPy arrays
For our example function, we’ll use the Haversine (or Great Circle) distance formula. Our function takes the latitude and longitude of two points, adjusts for Earth’s curvature, and calculates the straight-line distance between them. The function looks something like this:
import numpy as np

# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    total_miles = MILES * c
    return total_miles
To test our function on real data, we’ll use a dataset containing the coordinates of all hotels in New York state, sourced from Expedia’s developer site. We’ll calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop called the Brooklyn Superhero Supply Store in NYC).
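Assume the dataset has been downloaded and loaded into a DataFrame along these lines (the filename and encoding here are assumptions):

import pandas as pd

# Load the hotel coordinates into a DataFrame (filename assumed)
df = pd.read_csv('new_york_hotels.csv', encoding='cp1252')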
Crude looping in Pandas, or That Thing You Should Never Ever Do
Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.
For example, one might write something like this:
# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list
To get a sense of the time required to execute the function above, we’ll use the %%timeit magic command:
%%timeit
# Run the haversine looping function
df['distance'] = haversine_looping(df)
This returns the following result:
645 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Our crude looping function took about 645 ms to run, with a standard deviation of 31 ms. This may seem fast, but it’s actually quite slow, considering the function only needed to process some 1,600 rows. Let’s look at how we can improve this unfortunate state of affairs.
Looping with iterrows()
A better way to loop through rows, if loop you must, is with the iterrows() method. iterrows() is a generator that iterates over the rows of the dataframe and returns the index of each row, in addition to an object containing the row itself. iterrows() is optimized to work with Pandas dataframes, and, although it’s the least efficient way to run most standard functions (more on that later), it’s a significant improvement over crude looping. In our case, iterrows() solves the same problem almost four times faster than manually looping over rows.
%%timeit
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series
166 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Better looping using the apply method
An even better option than iterrows() is to use the apply() method, which applies a function along a specific axis (meaning, either rows or columns) of a DataFrame.
Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython.
We use an anonymous lambda function to apply our Haversine function on each row, which allows us to point to specific cells within each row as inputs to the function. The call to apply() includes the axis parameter at the end, in order to specify whether Pandas should apply the function to rows (axis=1) or columns (axis=0).
%%timeit
# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Swapping apply() for iterrows() has roughly halved the runtime of the function.
Vectorization over Pandas series
To understand how we can reduce the
amount of iteration performed by the function, recall that the fundamental
units of Pandas, DataFrames and series, are both based on arrays. The inherent
structure of the fundamental units translates to built-in Pandas functions
being designed to operate on entire arrays, instead of sequentially on
individual values (referred to as scalars). Vectorization is
the process of executing operations on entire arrays.
Pandas includes a generous collection
of vectorized functions for everything from mathematical operations to
aggregations and string functions (for an extensive list of available
functions, check out the Pandas docs). The built-in functions are optimized to operate
specifically on Pandas series and DataFrames. As a result, using vectorized
Pandas functions is almost always preferable to accomplishing similar ends with
custom-written looping.
So far, we’ve only been passing scalars to our Haversine function. All
of the functions being used within the Haversine function, however, are also
able to operate on arrays. This makes the process of vectorizing our distance
function quite simple: instead of passing individual scalar values for latitude
and longitude to it, we’re going to pass it the entire series (columns). This
will allow Pandas to benefit from the full set of optimizations available for
vectorized functions, including, notably, performing all the calculations on
the entire array simultaneously.
%%timeit
# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])
1.62 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We’ve achieved more than a 50-fold improvement over the apply() method, and more than a 100-fold improvement over iterrows(), by vectorizing the function. And we didn’t need to do anything but change the input type!
Vectorization with NumPy arrays
At this point, we could choose to
call it a day; vectorizing over Pandas series achieves the overwhelming
majority of optimization needs for everyday calculations. However, if speed is
of highest priority, we can call in reinforcements in the form of the NumPy
Python library.
The NumPy library,
which describes itself as a “fundamental package for scientific computing in
Python”, performs operations under the hood in optimized, pre-compiled C code.
Like Pandas, NumPy operates on array objects (referred to as ndarrays);
however, it leaves out a lot of overhead incurred by operations on Pandas
series, such as indexing, data type checking, etc. As a result, operations on
NumPy arrays can be significantly faster than operations on Pandas series.
NumPy arrays can be used in place of
Pandas series when the additional functionality offered by Pandas series isn’t
critical. For example, the vectorized implementation of our Haversine function doesn’t
actually use indexes on the latitude or longitude series, and so not having
those indexes available will not cause the function to break. By comparison,
had we been doing operations like DataFrame joins, which require referring to
values by index, we might want to stick to using Pandas objects.
We convert our latitude and longitude arrays from Pandas series to NumPy arrays simply by using the values attribute of the series. As with vectorization on the series, passing the NumPy array directly into the function will lead Pandas to apply the function to the entire vector.
%%timeit
# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)
370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Running the operation on NumPy arrays has achieved another four-fold improvement. All in all, we’ve refined the runtime from over half a second, via looping, to a third of a millisecond, via vectorization with NumPy!
Summary
The scoreboard below summarizes the results:
Crude looping: 645 ms
iterrows(): 166 ms
apply(): 90.6 ms
Vectorization over Pandas series: 1.62 ms
Vectorization with NumPy arrays: 370 µs
Although vectorization with NumPy arrays resulted in the fastest runtime, it was a fairly marginal improvement over the effect of vectorization with Pandas series, which resulted in a whopping 56x improvement over the fastest version of looping.
This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
The above does not, of course, make up a comprehensive list of all
possible optimizations for Pandas. More adventurous users might consider, for
example, further rewriting the function in Cython,
or attempting to optimize the individual components of the function. However,
these topics are beyond the scope of this post.
Wednesday, August 2, 2017
PCA
Each principal component is a linear combination of the original variables:
PC_i = Beta_i1 * X_1 + Beta_i2 * X_2 + ... + Beta_ip * X_p
where the X_j are the original variables, and the Beta_ij are the corresponding weights, or so-called coefficients.
This information is included in the pca attribute components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to see how the components are linearly related to the different features, you can do the following.
Note: each coefficient represents the correlation between a particular pair of component and feature.
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
PC-1 0.522372 -0.263355 0.581254 0.565611
PC-2 -0.372318 -0.925556 -0.021095 -0.065416