Speed of Matlab vs. Python Numpy Numba CUDA vs Julia vs IDL


Related: Anaconda Accelerate: GPU from Python/Numba

The Benchmarks Game uses deep expert optimizations to exploit every advantage of each language. The benchmarks I’ve adapted from the Julia micro-benchmarks are instead written the way a general scientist or engineer who is competent in the language, but not an advanced expert, would write them. The emphasis is on keeping the benchmarks simple and short, since programmer time is far more important than CPU time. Jules Kouatchou runs benchmarks on massive clusters comparing Julia, Python, Fortran, etc. A prime purpose of these benchmarks is to show, given comparable ease of programming for a canonical task (say Mandelbrot), which languages are much better or worse than the others.

Julia does not have overwhelmingly compelling advantages vs. Python

What’s striking from my laptop/desktop benchmarks and Jules’ HPCC benchmarks, as well as other benchmarks seen around the web, is that for a wide range of problems Julia does not have a consistent, significant advantage over Python. The majority of analysts in astronomy/geospace/geoscience/aerospace/etc. are working in Python, with Matlab in second place. Of course, C and Fortran are right up there in usage too.

Python is often “close enough” in performance to compiled languages like Fortran and C, by virtue of numeric libraries such as Numpy and Numba. For particular tasks, Tensorflow, OpenCV, and directly loading Fortran libraries with f2py or ctypes minimize Python’s performance penalty. This was not the case when Julia was conceived in 2009 and first released in 2012. Thanks to Anaconda, Intel MKL and PyCUDA, momentum and performance are solidly behind Python for scientific and engineering computing for the next several years at least.

Julia’s downsides

Julia’s unstable core API (even after a decade of development) is noticeable almost immediately. This bodes ill for serious, comprehensive adoption by corporate users, particularly in the sciences and aerospace, who need to keep programs running on multi-decade scales. Rapid deprecation makes Julia and its associated packages feel almost like a rolling-release system, which is not suitable for many use cases. Python can freeze versions using requirements.txt and similar means, and in the worst case a Docker image can be kept so that prior work output can be reproduced far in the future. Perhaps such a capability will arise in Julia as well, but the window of change is so short that it’s hard to keep up. Typically, Python’s stable versions are maintained for a few years, not less than a year like Julia.
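As a rough illustration of the version-freezing idea: the usual tool is `pip freeze > requirements.txt`, which records every installed package. The sketch below does the same for a hand-picked list, using only the standard-library importlib.metadata module. The function name pinned_requirements and the package names passed to it are illustrative assumptions, not part of the benchmark code.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(package_names):
    """Return 'name==version' lines for the packages that are installed.

    Hypothetical helper for illustration; packages that are not
    installed in the current environment are silently skipped.
    """
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            pass  # not installed here, so there is nothing to pin
    return lines


if __name__ == "__main__":
    # Example package names only; substitute whatever the project imports.
    for line in pinned_requirements(["numpy", "numba", "scipy"]):
        print(line)
```

Saving the printed lines to requirements.txt lets `pip install -r requirements.txt` rebuild the same versions later, as long as those versions remain downloadable.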

Julia’s strengths

Julia allows abstract expression of formulas, ideas, and arrays in ways not feasible in other major analysis applications. This gives advanced analysts unique, performant capabilities. Since Julia is readily called from Python, Julia work can be exploited from more popular packages, provided the Julia API is stable enough.

Key language benchmarking takeaways

Matrix Multiplication:

Fortran is comparable to Python with MKL, Matlab, and Julia.

If you can use single-precision floats, Python CUDA can be 1000+ times faster than Python, Matlab, Julia, and Fortran. However, the usual “price” of GPUs is slow I/O. If huge arrays need to be moved constantly on and off the GPU, special strategies may be necessary to get a speed advantage.


It’s worthwhile to use Numba or Cython with Python, to get Fortran-like speeds from Python, comparable with Matlab at the given test.
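A minimal sketch of what "using Numba" looks like in practice, with a Mandelbrot escape-time loop as the canonical task. This is not the benchmark code itself. The function name escape_count is an assumption made for illustration, and the fallback decorator is a stand-in so the sketch still runs (slowly, as plain Python) where Numba is not installed.

```python
try:
    from numba import njit  # real Numba decorator, used when available
except ImportError:
    def njit(func):
        # Stand-in for illustration only: leaves the function uncompiled.
        return func


@njit
def escape_count(c_real, c_imag, max_iter):
    """Count iterations of z = z*z + c before |z| exceeds 2.

    Returns max_iter if the point never escapes within the limit.
    """
    z_real = 0.0
    z_imag = 0.0
    for i in range(max_iter):
        new_real = z_real * z_real - z_imag * z_imag + c_real
        z_imag = 2.0 * z_real * z_imag + c_imag
        z_real = new_real
        if z_real * z_real + z_imag * z_imag > 4.0:
            return i
    return max_iter


if __name__ == "__main__":
    print(escape_count(-0.75, 0.1, 1000))
```

The point of the sketch is that the loop body is unchanged plain Python; only the decorator differs. With Numba, the first call includes compilation time, so a benchmark should time a later call.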

Harris IDL (used only by astronomers?) is very slow compared to other modern computing languages, including GDL, the free open-source IDL-compatible program.

Language Benchmarking Prereq

  1. compilers/libraries
    apt install libblas-dev gfortran
  2. install Miniconda Python and Julia
  3. install Python packages
    conda install mkl accelerate

Language Benchmark Systems tested


HPC: a small supercomputing node.

Intel Ivy Bridge desktop PC, Ubuntu 14.04

cat /proc/cpuinfo | grep 'model name' | uniq

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Lenovo W541 with Quadro K1100M GPU, Ubuntu 14.04

cat /proc/cpuinfo | grep 'model name' | uniq

model name    : Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz

Dell Optiplex Core 2 Duo, Ubuntu 14.04

cat /proc/cpuinfo | grep 'model name' | uniq

Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz

Matrix Operations Benchmark

This test multiplies two matrices that are too large to fit in CPU cache, so it is a test of system RAM bandwidth as well.

Task: matrix-multiply a 5000 x 5000 array by another 5000 x 5000 array, each composed of random double-precision (64-bit) floating-point numbers.

  • Note: the HPC test was with 1000 x 1000 matrices
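A back-of-the-envelope calculation shows why these matrices cannot fit in CPU cache, and how a measured time converts to throughput. The 8 MB cache size is an assumed, typical figure for a desktop CPU of this era. The 2*N^3 operation count is the usual approximation for a dense matrix multiply, and the helper name gflops is hypothetical.

```python
N = 5000
BYTES_PER_FLOAT64 = 8
ASSUMED_CACHE_BYTES = 8 * 1024**2  # assumption: 8 MB last-level cache

# One N x N matrix of 64-bit floats.
matrix_bytes = N * N * BYTES_PER_FLOAT64

# Two inputs plus the result must all be held in memory.
working_set_bytes = 3 * matrix_bytes

# Each of the N*N outputs needs about N multiplies and N adds.
flops = 2 * N**3


def gflops(elapsed_ms, n=N):
    """Convert a measured matmul time in milliseconds to GFLOP/s."""
    return 2 * n**3 / (elapsed_ms / 1000.0) / 1e9


if __name__ == "__main__":
    print(f"one matrix:  {matrix_bytes / 1e6:.0f} MB")
    print(f"working set: {working_set_bytes / 1e6:.0f} MB")
    print(f"cache ratio: {working_set_bytes / ASSUMED_CACHE_BYTES:.0f}x")
    print(f"operations:  {flops:.2e}")
```

Each matrix takes 200 MB, so the 600 MB working set is dozens of times larger than the assumed cache, and the data has to stream from RAM.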

Results: in milliseconds, best time to compute the result

CPU/GPU Gfortran (matmul) Gfortran (dgemm) Gfortran (sgemm) Ifort 14 (matmul) Ifort 14 (dgemm) Python 3.6 (MKL) Matlab R2018a (MKL) Julia 0.6.2 IDL 8.6 GDL 0.9.6 Python 3.5 (Cuda 7.5)
HPC   723       47.3 54.8 64.4 62.2    
i7-3770  2147 2147   18967  2352 2420  2394  2161     0.261
W541/K1100M    1717 960     1160         0.161
E7500    2147               3368    
  • Python CUDA via Anaconda Accelerate (formerly NumbaPro): note that Python CUDA run time scaled as O(N^0.5), while Python MKL scaled as roughly O(N^2.8).
  • Wow! GDL is so much faster than IDL at matrix multiplication.
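Scaling exponents like the ones quoted above can be estimated from two timings at different matrix sizes: if the time follows t = c * N^p, then p = log(t2/t1) / log(N2/N1). The sketch below shows the arithmetic. The function name scaling_exponent is hypothetical and the timings in the usage lines are invented for illustration; they are not measurements from the table.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate p in t = c * N**p from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


if __name__ == "__main__":
    # Invented example: doubling N makes the run 8x slower, so p is about 3.
    print(scaling_exponent(1000, 1.0, 2000, 8.0))
    # Invented example: quadrupling N makes the run 2x slower, so p is about 0.5.
    print(scaling_exponent(1000, 1.0, 4000, 2.0))
```

Two points give only a rough estimate; fitting a line to log(t) against log(N) over several sizes is more robust.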

See the README.rst for procedure

Iterative Algorithm

Task: compute pi with N=1e6 iterations
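The exact series used by the benchmark code is not shown here, so the sketch below assumes the Leibniz series pi = 4 * (1 - 1/3 + 1/5 - ...), written as the kind of plain loop a non-expert would write. The names pi_series and best_time_ms are hypothetical. The timing helper takes the minimum over several runs, matching the "best time" convention used for the results.

```python
import math
import timeit


def pi_series(n):
    """Approximate pi with the first n terms of the Leibniz series."""
    total = 0.0
    sign = 1.0
    for k in range(n):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


def best_time_ms(func, *args, repeats=5):
    """Run func(*args) `repeats` times; return the fastest run in ms."""
    times = timeit.repeat(lambda: func(*args), number=1, repeat=repeats)
    return min(times) * 1000.0


if __name__ == "__main__":
    n = 10**6
    print(f"error vs math.pi: {abs(pi_series(n) - math.pi):.2e}")
    print(f"best of 5: {best_time_ms(pi_series, n):.1f} ms")
```

Taking the minimum rather than the mean filters out runs disturbed by other processes, which matters on a shared laptop or desktop.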

Results: in milliseconds, best time to compute the result

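| CPU | GCC 6.2 | Gfortran 6.2 | Ifort 18 | Python 3.6 MKL | Py36, Numba 0.36 MKL | Cython 0.27 | Matlab R2018a MKL | Julia 0.6.2 | IDL 8.6 |
|-----|---------|--------------|----------|----------------|----------------------|-------------|-------------------|-------------|---------|
| HPC | 46.9 | 49.1 | 46.2 | 544 | 46.5 | 28.2 | 31.4 | 58.3 | 272 |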

Task: “simple” N=1e6

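| CPU | GCC 6.2 | Gfortran 6.2 | Ifort 18 | Python 3.6 MKL | Py36, Numba 0.36 MKL | Matlab R2018a MKL | Julia 0.6.2 |
|-----|---------|--------------|----------|----------------|----------------------|-------------------|-------------|
| HPC |  | 13.8 | 8.8 | 544 | 46.5 | 16.5 | 2.3 |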
