I liked Jeff Cogswell’s article “Learning Enough Python to Land a Job”. He rightly calls out that Python is for more than web development with Django, Twisted, Flask and friends. I spend part of my time as a data science practitioner, mangling tens of gigabytes if not tens of terabytes daily. After several years of hard-core Matlab use, I converted to Python because of its high-performance data science stack: Pandas, SciPy, NumPy, h5py and friends.
Given my experience mentoring and collaborating with those having moderate to extensive Python experience, I will cite a few points to consider, targeted toward those with an engineering/analyst/researcher bent.
- You don’t have to port your code right away, even from very high-level languages like Matlab. Tools like f2py for Fortran 77/90+, SWIG (and numerous others) for C, Oct2Py for Matlab, etc. allow you to speedily and often straightforwardly integrate code from other popular languages.
- You really need to get familiar with (in this order): Spyder, Numpy, Matplotlib, h5py, Scipy, and Pandas. If you’re working with most real problems, you should be considering Pandas and h5py to allow you to filter/select data before reading it all from disk. I personally prefer h5py over PyTables. The fastest data to load is the data you never load: superfluous data you didn’t need only slows down the loading of the data you wanted!
- You can get started in data analysis knowing not much more than how to use dict(), list(), and numpy.array(), along with the standard basic functions and constructs that one would use in Matlab or R, such as sqrt(), for, if, and the like. Don’t worry for now about generators, sets, list comprehensions, itertools, or the other stuff they teach in first semester Python. For data analysis, you can most often do much of that just as well, and perhaps more easily, in Numpy/Pandas.
- When you find you have lots of heterogeneous but associated variables, particularly those associated by time, it’s time to use Pandas. Beyond 2-D DataFrames, xarray is the module to use. Pandas is awesome for loading and working with huge, messy datasets. Think of Pandas as SQL for doing computations.
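
To make a couple of the points above concrete, here is a minimal sketch, not a recipe. The tiny sensor log, its column names, and the variable names are all made up for illustration; a real dataset would live in a large file on disk. It shows a dict, a list and numpy.array() doing basic math, then Pandas loading only the wanted columns and doing SQL-like filtering and grouping on time-indexed data. The same "never load what you don't need" idea applies to h5py, where slicing a dataset reads only that slice from disk.

```python
import io

import numpy as np
import pandas as pd

# Plain dict + list + numpy.array() is enough for a lot of basic analysis.
samples = {"area_m2": [1.0, 4.0, 9.0]}        # hypothetical measurements
side_m = np.sqrt(np.array(samples["area_m2"]))  # elementwise sqrt, no loop needed

# Hypothetical sensor log standing in for a big file on disk.
raw_csv = io.StringIO(
    "time,temperature_c,pressure_kpa,operator_note\n"
    "2020-01-01 00:00,20.5,101.3,ok\n"
    "2020-01-01 00:30,21.0,101.1,ok\n"
    "2020-01-01 01:00,22.5,100.9,check valve\n"
    "2020-01-01 01:30,23.0,100.8,ok\n"
)

# Select columns at load time: the free-text note column is never kept.
df = pd.read_csv(
    raw_csv,
    usecols=["time", "temperature_c", "pressure_kpa"],
    parse_dates=["time"],
    index_col="time",
)

# SQL-like "WHERE": keep only the warmer readings.
warm = df[df["temperature_c"] > 21.5]

# SQL-like "GROUP BY": mean of every variable per hour of the day.
hourly_mean = df.groupby(df.index.hour).mean()

print(side_m)
print(warm)
print(hourly_mean)
```

Because the variables share one time index, the filter and the per-hour average apply to all of them at once, which is where Pandas starts to pay off over separate arrays.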
I hope this commentary on Python for data scientists and analysts considering the transition to Python from languages such as R, Matlab, Fortran, etc. has helped you.