Speed of Matlab vs. Python (NumPy, Numba, CUDA) vs. Julia vs. IDL
Related: Anaconda Accelerate: GPU from Python/Numba
The Benchmarks Game uses deep expert optimizations to exploit every advantage of each language. The benchmarks I’ve adapted from the Julia microbenchmarks are instead written the way a scientist or engineer who is competent in a language, but not an expert in it, would write them. The emphasis is on simplicity and brevity, since programmer time is far more important than CPU time. Jules Kouatchou runs benchmarks on massive clusters comparing Julia, Python, Fortran, etc. A prime purpose of these benchmarks is to show, given the ease of programming a canonical task (say, Mandelbrot), which languages perform much better or worse than others.
Julia
Julia’s growing advantage is the performance of compiled languages with the relative ease of a scripting language. The majority of analysts in astronomy, geospace, geoscience, aerospace, etc. work in Python, with Matlab in second place; C, C++, and Fortran are of course also heavily used. The stable Julia 1.0 release finally delivers the API stability whose absence was an adoption blocker in earlier Julia releases.
Julia allows abstract expression of formulas, ideas, and arrays in ways not feasible in other major analysis applications. This allows advanced analysts unique, performant capabilities with Julia. Since Julia is readily called from Python, Julia work can be exploited from more popular packages.
Python
Python often is “close enough” in performance to compiled languages like Fortran and C, by virtue of numeric libraries such as NumPy and Numba.
For particular tasks, TensorFlow, OpenCV, and directly loading compiled Fortran libraries with f2py or ctypes minimize Python’s performance penalty.
This was not the case when Julia was conceived in 2009 and first released in 2012.
Thanks to Anaconda, Intel MKL and PyCUDA, momentum and performance are solidly behind Python for scientific and engineering computing for the next several years at least.
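As a sketch of the ctypes approach mentioned above: the snippet below loads the system C math library and calls its `cos` function directly, bypassing Python-level overhead. It assumes a Unix-like system where `ctypes.util.find_library` can locate libm; the library name and lookup are the only assumptions here.

```python
import ctypes
import ctypes.util

# Locate the system C math library (libm). The exact filename is
# platform-dependent, so use find_library() rather than a hard-coded path.
# On glibc Linux, CDLL(None) also exposes libm symbols as a fallback.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# ctypes assumes int arguments/returns by default, so declare the
# C signature double cos(double) before calling.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # cos(0) == 1.0
```

The same `argtypes`/`restype` declarations work for any compiled shared library, including Fortran routines compiled with `gfortran -shared`.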
Cython
Cython has Python-like syntax that is compiled to .c code, which is much larger than the original Python source and isn’t very readable. However, substantial speed increases can result.
Don’t convert the entire program to Cython!
Just the slow functions.
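To find which functions are worth converting, profile first. This is a minimal sketch using the standard-library `cProfile`; the `slow_sum` function is a hypothetical stand-in for whatever hot spot your own profile reveals.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately slow pure-Python loop -- a typical Cython candidate.
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    return slow_sum(200_000)

# Profile main() and print the five most expensive calls by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Only the functions dominating the profile output need the Cython treatment; the rest of the program can stay as plain Python.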
PyPy
PyPy does sophisticated analysis of Python code and can also offer massive speedups, without changes to existing code.
Key language benchmarking takeaways
Matrix Multiplication:
Fortran is comparable to Python with MKL, Matlab, and Julia.
If you can use single-precision floats, Python CUDA can be 1000+ times faster than Python, Matlab, Julia, and Fortran. However, the usual “price” of GPUs is slow I/O. If huge arrays must constantly be moved on and off the GPU, special strategies may be needed to retain a speed advantage.
Iteration:
It’s worthwhile to use Numba or Cython with Python, to get Fortran-like speeds from Python, comparable with Matlab on this test.
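As a minimal sketch of the Numba approach for loop-heavy code: the kernel below sums 1/k², which converges to π²/6. The decorator JIT-compiles the loop; if Numba isn’t installed, the fallback decorator lets the same code run as plain (slower) Python, so the function name and structure are illustrative, not from the benchmark suite.

```python
try:
    from numba import njit  # JIT-compiles the decorated function
except ImportError:
    def njit(func):  # graceful fallback: run as ordinary Python
        return func

@njit
def basel_sum(n):
    # Loop-heavy kernel: sum of 1/k^2 for k = 1..n, converging to pi^2/6.
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / (k * k)
    return total

result = basel_sum(1_000_000)
print(result)
```

The first call includes Numba’s compile time; benchmark only subsequent calls to see the steady-state speed.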
Harris IDL
IDL is used mostly by astronomers, and can be replaced by GDL, the free, open-source, IDL-compatible program. A better choice in many cases would be to move from IDL/GDL to Python or Julia.
Language Benchmarking Prerequisites
* compilers/libraries: apt install libblas-dev gfortran
* install Miniconda Python and Julia
Language Benchmark Systems Tested
* HPC: a small supercomputing node.
* Intel Ivy Bridge desktop PC, Ubuntu 18.04:
  cat /proc/cpuinfo | grep 'model name' | uniq
  Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Matrix Operations Benchmark
This test multiplies two matrices that are too large to fit in CPU cache, so it is a test of system RAM bandwidth as well.
Task: multiply a 1000 × 1000 matrix by another 1000 × 1000 matrix, each filled with random double-precision (64-bit) floating-point numbers.
Results: in milliseconds, best time to compute the result
HPC
* Gfortran 7.2 (DGEMM): 74.6
* Gfortran 7.2 (SGEMM): 24930.4
* Gfortran 7.2 (64-bit): 116.9
* Gfortran 7.2 (32-bit): 65.3
* Python 3.6.3, Numpy 1.14.3: 53.2
* Matlab R2018a: 55.2
* Octave 4.0.2: 705.4
* Julia 1.0.0: 65.6
* IDL 8.6: 62.1
Ivy Bridge Desktop
* Gfortran 7.2 (DGEMM): 23.2
* Gfortran 7.2 (SGEMM): 15.9
* Gfortran 7.2 (64-bit): 115.0
* Gfortran 7.2 (32-bit): 62.5
* Python 3.7.0, Numpy 1.15.1: 22.2
* Matlab R2018a: 21.0
* Octave 4.2.2: 22.6
* Julia 1.0.0: 22.8
* GDL 0.9.7: 39.6
* Python CUDA via Anaconda Accelerate (formerly NumbaPro): note that Python CUDA scaled O(N^0.5), while Python MKL scaled roughly O(N^2.8).
See the README.rst for procedure
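The matrix task above can be reproduced with a short NumPy sketch. Timings will of course vary with hardware and BLAS build; the best-of-five reporting matches the “best time” convention used in the results.

```python
import time
import numpy as np

n = 1000
rng = np.random.default_rng()
a = rng.random((n, n))  # float64 by default
b = rng.random((n, n))

# Take the best of several runs, matching the "best time" reported above.
times = []
for _ in range(5):
    t0 = time.perf_counter()
    c = a @ b
    times.append(time.perf_counter() - t0)

print(f"best of 5: {min(times) * 1e3:.1f} ms")
```

Because the `@` operator dispatches to the linked BLAS (MKL or OpenBLAS), this one line is what the Python entries in the tables above are actually measuring.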
Pi
Task: compute π using Machin’s formula with N = 1e6.
Results: in milliseconds, best time to compute the result
HPC
* GCC 7.2: 47.7
* Intel C 18: 44.1
* Gfortran 7.2: 43.6
* Intel Fortran 18: 45.8
* Python 3.6.3: 530
* Python 3.6.3, Numba 0.38.0: 50.6
* Cython 0.28.2: 26.7
* Matlab R2018a: 31.8
* Octave 4.0.2: 3060
* Julia 1.0.0: 53.5
* IDL 8.6: 274
Ivy Bridge Desktop
* GCC 7.3: 46.9
* Intel C 2019: 43.0
* Clang 5.0.0: 45.1
* Gfortran 7.3: 39.3
* Intel Fortran 2019: 40.2
* Flang 5.0.0: 3.92 (10x faster)
* Python 3.7.0: 326
* Python 3.7.0, Numba 0.39.0: 37.9
* Cython 0.28.5: 21.8
* Matlab R2018a: 24.5
* Octave 4.2.2: 2164
* Julia 1.0.0: 47.2
* GDL 0.9.7: 831
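For reference, Machin’s formula is π/4 = 4·arctan(1/5) − arctan(1/239). The sketch below evaluates it with a plain Taylor series for arctan; it uses far fewer terms than the benchmark’s N = 1e6, since the series converges rapidly for Machin’s small arguments (the benchmark uses many iterations precisely to stress the interpreter’s loop speed, not to improve accuracy).

```python
def arctan_series(x, terms=30):
    # Taylor series: arctan(x) = x - x^3/3 + x^5/5 - ...
    # Converges very quickly for |x| << 1.
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * x ** (2 * k + 1) / (2 * k + 1)
    return total

# Machin's formula: pi/4 = 4*arctan(1/5) - arctan(1/239)
pi_estimate = 4.0 * (4.0 * arctan_series(1 / 5) - arctan_series(1 / 239))
print(pi_estimate)
```

Timing this loop in each language (with the iteration count scaled up) is what the Pi results above compare.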
Tags: conda, cuda, gpu, intelmkl, julia, numpy, pypy
Categories: benchmarking