Saturday, April 30, 2016

NumPy? SciPy? Pandas? Scikit? What!?!?!

Okay, this post is inspired by what I observed recently in my workplace... When one of my fellow developers was asked what those were, he kind of waved it off with "well, they do math stuff". This got me wondering whether that's the general umbrella people put those packages under. A little research on Stack Overflow showed that people were indeed confused about what each one is or does... So here is my attempt to put it in a few words to help people understand what each does or is used for.

NumPy is the foundation... It is a low level library built around a fast multidimensional array type and the mathematical functions that operate on it. Since its core is written in C, it performs very well on large multidimensional arrays, far better than looping over plain Python lists.
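To make that a bit more concrete, here is a minimal sketch of what working with a NumPy array looks like. The array contents and the variable names are made up purely for illustration.

```python
import numpy as np

# A 2-D array (2 rows, 3 columns); the values are illustrative only.
grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Elementwise math runs in compiled code, with no explicit Python loop.
doubled = grid * 2

# Reductions can work along a chosen axis of the array.
column_sums = grid.sum(axis=0)  # add up each column
row_means = grid.mean(axis=1)   # average each row

print(doubled)
print(column_sums)
print(row_means)
```

Notice there is not a single for loop in there; the looping happens inside NumPy.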

SciPy - People wonder about the difference between NumPy and SciPy often enough that it made the official FAQ:
What is the difference between NumPy and SciPy?
Basically, SciPy uses NumPy, and as the FAQ says: "In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, et cetera. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors. Thus NumPy contains some linear algebra functions, even though these more properly belong in SciPy"
So I think of it as NumPy on steroids, with many more capabilities built on top of the array: optimization, integration, interpolation, signal processing, statistics and so on.
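As a rough illustration of the "numerical code" that lives in SciPy rather than NumPy, the sketch below integrates a function and minimizes another. The `parabola` function is a made-up example, not anything from SciPy itself.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
# quad returns the estimate together with an error bound.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)


def parabola(x):
    """Hypothetical example function with its minimum at x = 3."""
    return (x - 3.0) ** 2 + 1.0


# Search for the x that minimizes the function.
result = optimize.minimize_scalar(parabola)

print(area)
print(result.x)
```

NumPy supplies the `sin` function here, while SciPy supplies the algorithms that do something with it.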

Pandas - Also based on NumPy, but meant to provide users with a number of tools for working with tabular data. Pandas gives Python users the DataFrame, a structure modeled on the data frame that is so highly valued in the R statistical language. So while you can play around with NumPy to process "excel" style data, Pandas makes it super easy.
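Here is a small sketch of that "excel" style processing with a DataFrame. The sales figures and the column names are invented for the example.

```python
import pandas as pd

# A tiny table with named columns; the data is made up for illustration.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [10, 4, 7, 12],
    "price": [2.5, 3.0, 2.5, 3.0],
})

# Add a computed column, much like a spreadsheet formula filled down.
sales["revenue"] = sales["units"] * sales["price"]

# Group the rows by region and total the revenue, like a pivot table.
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(sales)
print(revenue_by_region)
```

Doing the same with a bare NumPy array is possible, but you would have to keep track of which column is which yourself.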

Scikit - Short for scikit-learn here. Think of it as a framework that sits on top of the scientific Python stack:
NumPy
SciPy
Matplotlib
IPython
Sympy
Pandas
to provide a range of supervised and unsupervised machine learning algorithms: classification, regression, clustering, etc. Of the packages listed, only NumPy and SciPy are hard requirements; the others are companions you will typically use alongside it.
So think of those libraries as components that provide the low level processing, while scikit uses them in the background to run a regression model for you.
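To show what "run a regression model for you" can look like, here is a minimal sketch using scikit-learn's `LinearRegression`. The training data is made up so that it follows y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows y = 2x + 1 exactly.
# scikit-learn expects X as a 2-D array: one row per sample.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit the model; the heavy lifting happens in NumPy and SciPy underneath.
model = LinearRegression()
model.fit(X, y)

# Predict the value for an input the model has not seen.
prediction = model.predict(np.array([[4.0]]))

print(model.coef_, model.intercept_)
print(prediction)
```

The inputs and outputs are plain NumPy arrays, which is why the libraries fit together so easily.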
