Thursday, August 3, 2017

Optimizing Pandas

We’ll review the efficiency of several methodologies for applying a function to a Pandas DataFrame, from slowest to fastest:

1. Crude looping over DataFrame rows using indices
2. Looping with iterrows()
3. Looping with apply()
4. Vectorization with Pandas series
5. Vectorization with NumPy arrays

For our example function, we’ll use the Haversine (or Great Circle) distance formula. Our function takes the latitude and longitude of two points, adjusts for Earth’s curvature, and calculates the great-circle distance between them. The function looks something like this:
import numpy as np
 
# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles

To test our function on real data, we’ll use a dataset containing the coordinates of all hotels in New York state, sourced from Expedia’s developer site. We’ll calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop called the Brooklyn Superhero Supply Store in NYC).
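A minimal setup sketch for loading the data into a DataFrame (the file name and encoding below are assumptions; any CSV containing the hotels’ latitude and longitude columns will do):

import pandas as pd

# Load the hotel coordinates (assumed file name and encoding)
df = pd.read_csv('new_york_hotels.csv', encoding='cp1252')
print(df[['latitude', 'longitude']].head())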

Crude looping in Pandas, or That Thing You Should Never Ever Do

Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.

For example, one might write something like this:
# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

To get a sense of the time required to execute the function above, we’ll use the %%timeit magic command.

%%timeit
 
# Run the haversine looping function
df['distance'] = haversine_looping(df)

This returns the following result:
645 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Our crude looping function took about 645 ms to run, with a standard deviation of 31 ms. This may seem fast, but it’s actually quite slow, considering the function only needed to process some 1,600 rows. Let’s look at how we can improve this unfortunate state of affairs.

Looping with iterrows()

A better way to loop through rows, if loop you must, is with the iterrows() method. iterrows() is a generator that iterates over the rows of the DataFrame and returns the index of each row, in addition to an object containing the row itself. iterrows() is optimized to work with Pandas DataFrames, and, although it’s the least efficient way to run most standard functions (more on that later), it’s a significant improvement over crude looping. In our case, iterrows() solves the same problem almost four times faster than manually looping over rows.

%%timeit
 
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series
 
166 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Better looping using the apply method

An even better option than iterrows() is to use the apply() method, which applies a function along a specific axis (meaning, either rows or columns) of a DataFrame.

Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython.

We use an anonymous lambda function to apply our Haversine function to each row, which allows us to point to specific cells within each row as inputs to the function. The axis parameter at the end specifies whether Pandas should apply the function to rows (axis = 1) or columns (axis = 0).

%%timeit
 
# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Swapping apply() for iterrows() has roughly halved the runtime of the function.

Vectorization over Pandas series

To understand how we can reduce the amount of iteration performed by the function, recall that the fundamental units of Pandas, DataFrames and series, are both based on arrays. The inherent structure of the fundamental units translates to built-in Pandas functions being designed to operate on entire arrays, instead of sequentially on individual values (referred to as scalars). Vectorization is the process of executing operations on entire arrays.
Pandas includes a generous collection of vectorized functions for everything from mathematical operations to aggregations and string functions (for an extensive list of available functions, check out the Pandas docs). The built-in functions are optimized to operate specifically on Pandas series and DataFrames. As a result, using vectorized Pandas functions is almost always preferable to accomplishing similar ends with custom-written looping.
So far, we’ve only been passing scalars to our Haversine function. All of the functions being used within the Haversine function, however, are also able to operate on arrays. This makes the process of vectorizing our distance function quite simple: instead of passing individual scalar values for latitude and longitude to it, we’re going to pass it the entire series (columns). This will allow Pandas to benefit from the full set of optimizations available for vectorized functions, including, notably, performing all the calculations on the entire array simultaneously.

%%timeit 
 
# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985,df['latitude'], df['longitude'])
 
1.62 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We’ve achieved more than a 50-fold improvement over the apply() method, and more than a 100-fold improvement over iterrows() by vectorizing the function — and we didn’t need to do anything but change the input type!

Vectorization with NumPy arrays

At this point, we could choose to call it a day; vectorizing over Pandas series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of highest priority, we can call in reinforcements in the form of the NumPy Python library.
The NumPy library, which describes itself as a “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Like Pandas, NumPy operates on array objects (referred to as ndarrays); however, it leaves out a lot of overhead incurred by operations on Pandas series, such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on Pandas series.
NumPy arrays can be used in place of Pandas series when the additional functionality offered by Pandas series isn’t critical. For example, the vectorized implementation of our Haversine function doesn’t actually use indexes on the latitude or longitude series, and so not having those indexes available will not cause the function to break. By comparison, had we been doing operations like DataFrame joins, which require referring to values by index, we might want to stick to using Pandas objects.
We convert our latitude and longitude arrays from Pandas series to NumPy arrays simply by using the values attribute of the series. As with vectorization on the series, passing the NumPy arrays directly into the function performs all the calculations on the entire arrays at once.

%%timeit
 
# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Running the operation on NumPy arrays has achieved another four-fold improvement. All in all, we’ve refined the runtime from over half a second, via looping, to around a third of a millisecond, via vectorization with NumPy!

Summary

The scoreboard below summarizes the results. Although vectorization with NumPy arrays resulted in the fastest runtime, it was a fairly marginal improvement over the effect of vectorization with Pandas series, which resulted in a whopping 56x improvement over the fastest version of looping.

Method                                   Mean runtime
Crude looping over rows by index         645 ms
Looping with iterrows()                  166 ms
Looping with apply()                     90.6 ms
Vectorization with Pandas series         1.62 ms
Vectorization with NumPy arrays          370 µs

This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
The above does not, of course, make up a comprehensive list of all possible optimizations for Pandas. More adventurous users might consider, for example, further rewriting the function in Cython, or attempting to optimize the individual components of the function. However, these topics are beyond the scope of this post.





Wednesday, August 2, 2017

PCA

Each principal component is a linear combination of the original variables:
PC_i = beta_i1 * X_1 + beta_i2 * X_2 + ... + beta_ip * X_p
where the X_i are the original variables and the beta_i are the corresponding weights, or so-called coefficients.
This information is included in the pca attribute components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you can do the following:
Note: each coefficient describes the relationship between a particular component and a particular feature
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416

Saturday, July 29, 2017

Adding certs to java keytool

If you have trouble connecting to an HTTPS site, the issue might be with certs. To test that, use this handy SSLPoke utility (https://gist.github.com/krinkere/8a4b526cf37a66261a7f560d81078cdb)
java SSLPoke server 443
You should get something like this when the connection is unsuccessful:
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
To install the cert, first extract it from the server:
openssl s_client -connect server:443 < /dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/server.crt
The cert was saved to /tmp/server.crt. Now let's add it to Java's keystore:
/jre/bin/keytool -import -alias server -keystore /jre/lib/security/cacerts -file /tmp/server.crt
See the list of certs: /jre/bin/keytool -list -v -keystore /jre/lib/security/cacerts
Positive test of the cert / keytool:
java SSLPoke server 443
You should get this:
Successfully connected

Friday, July 28, 2017

How to use a confusion matrix to calculate accuracy, precision, recall, and F1 score

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * Precision * Recall / (Precision + Recall)

TP = # True Positives, TN = # True Negatives, FP = # False Positives, FN = # False Negatives
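A minimal sketch of these formulas in Python; the counts below are made-up numbers purely for illustration:

# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 40, 45, 10, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1_score)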

Tuesday, July 25, 2017

An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec


Introduction
Before we start, have a look at the below examples.
  • You open Google and search for a news article on the ongoing Champions Trophy and get hundreds of search results about it.
  • Nate Silver analysed millions of tweets and correctly predicted the results of 49 out of 50 states in the 2008 U.S. Presidential Election.
  • You type an English sentence into Google Translate and get an equivalent Chinese translation.

So what do the above examples have in common?
You probably guessed it right – TEXT processing. All three scenarios deal with humongous amounts of text to perform a range of tasks: clustering in the Google search example, classification in the second, and Machine Translation in the third.
Humans can deal with text quite intuitively, but given that millions of documents are generated every single day, we cannot have humans performing the above three tasks. It is neither scalable nor effective.
So, how do we make today's computers perform clustering, classification, etc. on text data, given that they are generally inefficient at handling and processing strings or text for any fruitful output?
Sure, a computer can match two strings and tell you whether they are same or not. But how do we make computers tell you about football or Ronaldo when you search for Messi? How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?
The answer to the above questions lies in creating a representation of words that captures their meanings, semantic relationships, and the different types of contexts they are used in.
All of this is implemented using Word Embeddings, or numerical representations of text, so that computers can handle them.
Below, we will see formally what Word Embeddings are, their different types, and how we can actually implement them to perform tasks like returning efficient Google search results.

What are Word Embeddings?
In very simplistic terms, Word Embeddings are texts converted into numbers, and there may be different numerical representations of the same text. But before we dive into the details of Word Embeddings, the following question should be asked – Why do we need Word Embeddings?
As it turns out, many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form. They require numbers as inputs to perform any sort of job, be it classification, regression etc. in broad terms. And with the huge amount of data that is present in text format, it is imperative to extract knowledge out of it and build applications. Some real-world applications of text processing are sentiment analysis of reviews by Amazon etc., and document or news classification or clustering by Google etc.
Let us now define Word Embeddings formally. A Word Embedding format generally tries to map a word using a dictionary to a vector. Let us break this sentence down into finer details to have a clear view.
Take a look at this example – sentence = “Word Embeddings are Word converted into numbers”
A word in this sentence may be “Embeddings” or “numbers” etc.
A dictionary may be the list of all unique words in the sentence. So, a dictionary may look like – [‘Word’, ‘Embeddings’, ‘are’, ‘converted’, ‘into’, ‘numbers’]
A vector representation of a word may be a one-hot encoded vector, where 1 stands for the position where the word exists and 0 everywhere else. The vector representation of “numbers” in this format, according to the above dictionary, is [0,0,0,0,0,1] and of “converted” is [0,0,0,1,0,0].
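A minimal sketch of this one-hot scheme in Python, using the sentence and dictionary from the example above:

sentence = "Word Embeddings are Word converted into numbers"
# Unique words, preserving order of first appearance
dictionary = list(dict.fromkeys(sentence.split()))

def one_hot(word, dictionary):
    # 1 at the word's position in the dictionary, 0 everywhere else
    return [1 if w == word else 0 for w in dictionary]

print(dictionary)                        # ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']
print(one_hot('numbers', dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot('converted', dictionary))  # [0, 0, 0, 1, 0, 0]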
This is just a very simple method to represent a word in the vector form. Let us look at different types of Word Embeddings or Word Vectors and their advantages and disadvantages over the rest.
Different types of Word Embeddings
The different types of word embeddings can be broadly classified into two categories-
  • Frequency based Embedding
  • Prediction based Embedding
Let us try to understand each of these methods in detail.
1 Frequency based Embedding
There are generally three types of vectors that we encounter under this category.
  • Count Vector
  • TF-IDF Vector
  • Co-Occurrence Vector
Let us look into each of these vectorization methods in detail.

1.1 Count Vector
Consider a Corpus C of D documents {d1,d2…..dD} and N unique tokens extracted out of the corpus C. The N tokens will form our dictionary and the size of the Count Vector matrix M will be given by D X N. Each row in the matrix M contains the frequency of tokens in document D(i).
Let us understand this using a simple example.
D1: He is a lazy boy. She is also lazy.
D2: Neeraj is a lazy person.
The dictionary created may be a list of unique tokens (words) in the corpus = [‘He’, ‘She’, ‘lazy’, ‘boy’, ‘Neeraj’, ‘person’]
Here, D=2, N=6
The count matrix M of size 2 X 6 will be represented as –

        He   She   lazy   boy   Neeraj   person
D1       1     1      2     1        0        0
D2       0     0      1     0        1        1
Here, the rows correspond to the documents in the corpus and the columns correspond to the tokens in the dictionary. The second row in the above matrix may be read as: D2 contains ‘lazy’ once, ‘Neeraj’ once and ‘person’ once. A column can also be understood as the word vector for the corresponding word in the matrix M. For example, the word vector for ‘lazy’ in the above matrix is [2,1], and so on.
Now there may be quite a few variations while preparing the above matrix M. The variations will be generally in-
  • The way dictionary is prepared.
  • Why? Because in real world applications we might have a corpus which contains millions of documents. And with millions of document, we can extract hundreds of millions of unique words. So basically, the matrix that will be prepared like above will be a very sparse one and inefficient for any computation. So an alternative to using every unique word as a dictionary element would be to pick say top 10,000 words based on frequency and then prepare a dictionary.
  • The way count is taken for each word.
  • We may either take the frequency (number of times a word has appeared in the document) or the presence(has the word appeared in the document?) to be the entry in the count matrix M. But generally, frequency method is preferred over the latter.
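As a quick illustration, here is a minimal sketch using scikit-learn’s CountVectorizer on the toy corpus above; the fixed vocabulary simply mirrors the toy dictionary, and in a real application you would instead cap the dictionary size (for example with max_features=10000):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["He is a lazy boy. She is also lazy.",
        "Neeraj is a lazy person."]

# Use the toy dictionary as a fixed vocabulary (tokens are lowercased by default)
vectorizer = CountVectorizer(vocabulary=['he', 'she', 'lazy', 'boy', 'neeraj', 'person'])
M = vectorizer.fit_transform(docs)

print(M.toarray())
# [[1 1 2 1 0 0]
#  [0 0 1 0 1 1]]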
1.2 TF-IDF vectorization
This is another method that is based on frequency, but it differs from count vectorization in that it takes into account not just the occurrence of a word in a single document but in the entire corpus. So, what is the rationale behind this? Let us try to understand.
Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear quite frequently in comparison to the words which are important to a document. For example, a document A on Lionel Messi is going to contain more occurrences of the word “Messi” in comparison to other documents. But common words like “the” etc. are also going to be present in higher frequency in almost every document.
Ideally, what we would want is to down weight the common words occurring in almost all documents and give more importance to words that appear in a subset of documents.
TF-IDF works by penalising these common words by assigning them lower weights while giving importance to words like Messi in a particular document.
So, how exactly does TF-IDF work?
Consider the below sample table, which gives the counts of terms (tokens/words) in two documents. (The original table image is not reproduced here; the counts used in the calculations below are: Document1 contains 8 terms in total, with ‘This’ appearing once and ‘Messi’ four times; Document2 contains 5 terms in total, with ‘This’ appearing once and ‘Messi’ not at all.)
Now, let us define a few terms related to TF-IDF.

TF = (Number of times term t appears in a document)/(Number of terms in the document)
So, TF(This,Document1) = 1/8
TF(This, Document2)=1/5
It denotes the contribution of the word to the document, i.e. words relevant to the document should be frequent. E.g., a document about Messi should contain the word ‘Messi’ a large number of times.
IDF = log(N/n), where N is the number of documents and n is the number of documents the term t has appeared in.
So, IDF(This) = log(2/2) = 0.
So, how do we explain the reasoning behind IDF? Ideally, if a word has appeared in all the document, then probably that word is not relevant to a particular document. But if it has appeared in a subset of documents then probably the word is of some relevance to the documents it is present in.
Let us compute IDF for the word ‘Messi’.
IDF(Messi) = log(2/1) = 0.301.
Now, let us compare the TF-IDF for a common word ‘This’ and a word ‘Messi’ which seems to be of relevance to Document 1.
TF-IDF(This,Document1) = (1/8) * (0) = 0
TF-IDF(This, Document2) = (1/5) * (0) = 0
TF-IDF(Messi, Document1) = (4/8)*0.301 = 0.15
As you can see, for Document1 the TF-IDF method heavily penalises the word ‘This’ but assigns greater weight to ‘Messi’. So, this may be understood as: ‘Messi’ is an important word for Document1 in the context of the entire corpus.
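A minimal sketch of these calculations in Python, using base-10 logs as in the worked example; the counts are the hypothetical ones recovered above:

import math

N = 2  # number of documents
# term -> count per document, plus total number of terms per document
doc1 = {"This": 1, "Messi": 4}; doc1_total = 8
doc2 = {"This": 1};             doc2_total = 5
docs = [doc1, doc2]

def tf(term, doc, total):
    return doc.get(term, 0) / total

def idf(term):
    n = sum(1 for d in docs if term in d)
    return math.log10(N / n)

print(tf("This", doc1, doc1_total) * idf("This"))    # (1/8) * 0      = 0.0
print(tf("Messi", doc1, doc1_total) * idf("Messi"))  # (4/8) * 0.301 ~= 0.15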
1.3 Co-Occurrence Matrix with a fixed context window
The big idea: similar words tend to occur together and will have a similar context. For example – Apple is a fruit. Mango is a fruit.
Apple and mango tend to have a similar context i.e fruit.
Before I dive into the details of how a co-occurrence matrix is constructed, there are two concepts that need to be clarified – Co-Occurrence and Context Window.
Co-occurrence – For a given corpus, the co-occurrence of a pair of words say w1 and w2 is the number of times they have appeared together in a Context Window.
Context Window – A context window is specified by a number and a direction. So what does a context window of 2 (around) mean? Let us see an example below.

Quick   Brown   Fox   Jump   Over   The   Lazy   Dog

The words Quick, Brown, Jump and Over form the 2 (around) context window for the word ‘Fox’, and for calculating the co-occurrence only these words are counted. Now let us see the context window for the word ‘Over’:

Quick   Brown   Fox   Jump   Over   The   Lazy   Dog

(Here the context window of 2 around ‘Over’ covers Fox, Jump, The and Lazy.)

Now, let us take an example corpus to calculate a co-occurrence matrix.
Corpus = He is not lazy. He is intelligent. He is smart.


             He   is   not   lazy   intelligent   smart
He            0    4     2      1             2       1
is            4    0     1      2             2       1
not           2    1     0      1             0       0
lazy          1    2     1      0             0       0
intelligent   2    2     0      0             0       0
smart         1    1     0      0             0       0
Let us understand this co-occurrence matrix by looking at two entries in the table above: the count for (‘He’, ‘is’) and the count for (‘lazy’, ‘intelligent’).
(‘He’, ‘is’) – this is the number of times ‘He’ and ‘is’ have appeared together in a context window of 2, and it can be seen that the count turns out to be 4. The layout below will help you visualise the count.
He   is   not   lazy   He   is   intelligent   He   is   smart

(Scanning this sequence, ‘He’ and ‘is’ appear within two words of each other four times, giving the count of 4.)
The word ‘lazy’, on the other hand, has never appeared with ‘intelligent’ in the context window, and has therefore been assigned 0 in the matrix.
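A minimal sketch of how such a co-occurrence matrix can be built for the toy corpus, with a context window of 2 (around); the variable names are just for illustration:

from collections import defaultdict

corpus = "He is not lazy He is intelligent He is smart".split()
window = 2

# Count, for every pair of words, how often they appear within the context window
cooc = defaultdict(int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(w, corpus[j])] += 1

print(cooc[('He', 'is')])             # 4
print(cooc[('lazy', 'intelligent')])  # 0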
Variations of Co-occurrence Matrix
Let’s say there are V unique words in the corpus. So Vocabulary size = V. The columns of the Co-occurrence matrix form the context words. The different variations of Co-Occurrence Matrix are-
  • A co-occurrence matrix of size V X V. Now, for even a decent corpus V gets very large and difficult to handle. So generally, this architecture is never preferred in practice.
  • A co-occurrence matrix of size V X N, where N is a subset of V and can be obtained by removing irrelevant words like stopwords. This is still very large and presents computational difficulties.
But, remember this co-occurrence matrix is not the word vector representation that is generally used. Instead, this Co-occurrence matrix is decomposed using techniques like PCA, SVD etc. into factors and combination of these factors forms the word vector representation.
Let me illustrate this more clearly. For example, you perform PCA on the above matrix of size V X V. You will obtain V principal components. You can choose k components out of these V components. So, the new matrix will be of the form V X k.
And, a single word, instead of being represented in V dimensions will be represented in k dimensions while still capturing almost the same semantic meaning. k is generally of the order of hundreds.
So, what happens behind the scenes is that the co-occurrence matrix is decomposed (e.g., via SVD) into three matrices, U, S and V, where U and V are both orthogonal matrices. What is of importance is that the product of U and S gives the word vector representation and V gives the word context representation.
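A minimal sketch of this decomposition on the toy matrix above, using NumPy's SVD and keeping k = 2 components purely for illustration:

import numpy as np

words = ['He', 'is', 'not', 'lazy', 'intelligent', 'smart']
M = np.array([
    [0, 4, 2, 1, 2, 1],
    [4, 0, 1, 2, 2, 1],
    [2, 1, 0, 1, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
])

# Truncated SVD: keep the top k components
k = 2
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :k] * S[:k]  # each row is a k-dimensional word vector

for word, vec in zip(words, word_vectors):
    print(word, vec)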


Advantages of Co-occurrence Matrix
  • It preserves the semantic relationship between words, e.g. ‘man’ and ‘woman’ tend to be closer than ‘man’ and ‘apple’.
  • It uses SVD at its core, which produces more accurate word vector representations than existing methods.
  • It uses factorization which is a well-defined problem and can be efficiently solved.
  • It has to be computed once and can be used anytime once computed. In this sense, it is faster in comparison to others.

Disadvantages of Co-Occurrence Matrix
  • It requires huge memory to store the co-occurrence matrix.
  • However, this problem can be circumvented by factorizing the matrix outside the main system, for example on Hadoop clusters, and saving the result.



Source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/