Speed of Matlab vs. Python Numpy Numba CUDA vs Julia vs IDL


Related: Anaconda Accelerate: GPU from Python/Numba


The Benchmarks Game uses deep expert optimizations to exploit every advantage of each language. The benchmarks here, adapted from the Julia micro-benchmarks, are instead written the way a general scientist or engineer who is competent in the language, but not an advanced expert, would write them. The emphasis is on keeping the code simple and short, because programmer time is far more important than CPU time. Jules Kouatchou runs benchmarks on massive clusters comparing Julia, Python, Fortran, etc. A prime purpose of these benchmarks is to show, given comparable ease of programming for a canonical task (say Mandelbrot), which languages are much faster or slower than the others.

Julia

Julia’s growing advantage is that it combines the performance of compiled languages with the relative ease of a scripting language. Most analysts in astronomy, geospace, geoscience, aerospace and similar fields work in Python, with Matlab in second place; C, C++ and Fortran are also heavily used. The stable Julia 1.0 release finally brings the promise of API stability, the lack of which was an adoption blocker in earlier Julia releases.

Julia allows abstract expression of formulas, ideas, and arrays in ways not feasible in other major analysis applications. This gives advanced analysts unique, performant capabilities. Since Julia is readily called from Python, Julia code can be used from more popular packages.

Python

Python is often “close enough” in performance to compiled languages like Fortran and C, by virtue of numeric libraries such as Numpy and Numba. For particular tasks, Tensorflow, OpenCV, and directly loading Fortran libraries with f2py or ctypes minimize Python’s performance penalty. This was not the case when Julia was conceived in 2009 and first released in 2012. Thanks to Anaconda, Intel MKL and PyCUDA, momentum and performance are solidly behind Python for scientific and engineering computing for the next several years at least.

Cython

Cython has Python-like syntax that is compiled to C code. The generated .c file is much larger than the original Python code and isn’t very readable. However, substantial speed increases can result. Don’t convert the entire program to Cython, just the slow functions.

PyPy

PyPy does sophisticated analysis of Python code and can also offer massive speedups, without changes to existing code.

Key language benchmarking takeaways

Matrix Multiplication:

Fortran is comparable to Python with MKL, Matlab and Julia.

If you can use single-precision float, Python CUDA can be 1000+ times faster than Python, Matlab, Julia, and Fortran. However, the usual “price” of GPUs is slow data transfer between host memory and the GPU. If huge arrays need to be moved constantly on and off the GPU, special strategies may be necessary to get a speed advantage.

Iteration:

It’s worthwhile to use Numba or Cython to get Fortran-like speeds from Python, comparable with Matlab on the given test.
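
As a rough illustration of the kind of loop where Numba helps (a sketch, not the benchmark code itself), the function below sums an alternating series for Pi in a plain Python loop. It is JIT-compiled only if Numba happens to be installed. The name `pi_leibniz` and the fallback decorator are hypothetical; `numba.njit` is Numba's own decorator.

```python
try:
    from numba import njit  # Numba's JIT decorator; compiles on first call
except ImportError:
    def njit(func):
        """Fallback (assumption): run as plain Python when Numba is absent."""
        return func


@njit
def pi_leibniz(n_terms):
    """Approximate Pi with the Gregory-Leibniz series, one term per loop pass."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


print(pi_leibniz(1_000_000))
```

Only the hot loop is decorated, and the rest of a program can stay ordinary Python. That is the same advice as for Cython: speed up the slow functions, not the whole program.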

Harris IDL

IDL is used mostly by astronomers, and can be replaced by GDL, the free open-source IDL-compatible program. In many cases, a better choice is to move from IDL/GDL to Python or Julia.

Language Benchmarking Prerequisites

  1. Install compilers and libraries:
    apt install libblas-dev gfortran

  2. Install Miniconda Python and Julia.

Language Benchmark Systems tested

HPC

A small supercomputing node.

Intel Ivy Bridge desktop PC, Ubuntu 18.04

cat /proc/cpuinfo | grep 'model name' | uniq

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Matrix Operations Benchmark

This test multiplies two matrices that are too large to fit in CPU cache, so it is a test of system RAM bandwidth as well.

Task: matrix-multiply a 1000 x 1000 array by another 1000 x 1000 array, each composed of random double-precision (64-bit) floating-point numbers.
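
To make the size of this task concrete, here is a small back-of-the-envelope sketch. It computes the memory footprint of the matrices, to compare against your CPU's cache size, and converts a time in milliseconds into GFLOP/s. The helper names are hypothetical, and the conventional 2*N^3 operation count for dense matrix multiplication is an assumption about how one would count the work. The 22.2 ms example value is the Numpy time listed for the Ivy Bridge desktop.

```python
N = 1000               # matrix dimension from the task
BYTES_PER_DOUBLE = 8   # one 64-bit float


def matrix_mib(n, itemsize=BYTES_PER_DOUBLE):
    """Memory footprint of one dense n x n matrix, in MiB."""
    return n * n * itemsize / 2**20


def gflops(n, time_ms):
    """Achieved GFLOP/s, assuming 2*n**3 floating-point operations."""
    return 2 * n**3 / (time_ms * 1e-3) / 1e9


print(f"one matrix: {matrix_mib(N):.2f} MiB")
print(f"two inputs plus result: {3 * matrix_mib(N):.2f} MiB")
print(f"22.2 ms corresponds to {gflops(N, 22.2):.1f} GFLOP/s")
```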

Results: in milliseconds, best time to compute the result

HPC

  • Gfortran 7.2 (DGEMM): 74.6
  • Gfortran 7.2 (SGEMM): 24930.4
  • Gfortran 7.2 (64-bit): 116.9
  • Gfortran 7.2 (32-bit): 65.3

  • Python 3.6.3, Numpy 1.14.3: 53.2

  • Matlab R2018a: 55.2
  • Octave 4.0.2: 705.4

  • Julia 1.0.0: 65.6

  • IDL 8.6: 62.1

Ivy Bridge Desktop

  • Gfortran 7.2 (DGEMM): 23.2
  • Gfortran 7.2 (SGEMM): 15.9
  • Gfortran 7.2 (64-bit): 115.0
  • Gfortran 7.2 (32-bit): 62.5

  • Python 3.7.0, Numpy 1.15.1: 22.2

  • Matlab R2018a: 21.0
  • Octave 4.2.2: 22.6

  • Julia 1.0.0: 22.8

  • GDL 0.9.7: 39.6

  • Python CUDA via Anaconda Accelerate (formerly NumbaPro): run time scaled as about O(N^0.5) with matrix size N, while Python with MKL scaled as about O(N^2.8).
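
A scaling exponent like the ones quoted above can be estimated by timing several matrix sizes and fitting a straight line to log(time) versus log(N). The sketch below shows one way to do that fit with the standard library. The function name is hypothetical, and the timings are synthetic values generated from an exact power law, not measurements.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(size), i.e. p in time ~ N**p."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Synthetic timings from an exact power law (NOT measured data).
sizes = [250, 500, 1000, 2000]
synthetic_times = [1e-9 * n**2.8 for n in sizes]
print(f"fitted exponent: {scaling_exponent(sizes, synthetic_times):.2f}")
```

With real timings the points will not lie exactly on a line, so the fitted exponent is only as good as the range of sizes measured.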

See the README.rst for procedure

Pi

Task: compute Pi using the Machin algorithm with N=1e6

Pi Machin benchmark
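
For readers who want to try something similar, here is a minimal pure-Python sketch of Machin's formula with N series terms, timed as best-of-several runs. It is an illustration under assumptions, not the benchmark source: the actual benchmark code may use a different series or loop structure, and the function names and the `repeat` count are hypothetical.

```python
import time


def arctan_series(x, n_terms):
    """Partial sum of the Taylor series arctan(x) = x - x**3/3 + x**5/5 - ..."""
    total = 0.0
    power = x            # holds x**(2k+1) for the current k
    x_squared = x * x
    sign = 1.0
    for k in range(n_terms):
        total += sign * power / (2 * k + 1)
        power *= x_squared
        sign = -sign
    return total


def pi_machin(n_terms):
    """Machin's formula: pi = 4 * (4*arctan(1/5) - arctan(1/239))."""
    return 4.0 * (4.0 * arctan_series(1 / 5, n_terms)
                  - arctan_series(1 / 239, n_terms))


def best_time_ms(func, repeat=3):
    """Call func() `repeat` times and return the fastest wall-clock time in ms."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best * 1e3


N = 1_000_000
print(f"pi ~ {pi_machin(N)}")
print(f"best of 3: {best_time_ms(lambda: pi_machin(N)):.1f} ms")
```

Taking the best time rather than the mean reduces the influence of other processes on the machine.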

Results: in milliseconds, best time to compute the result

HPC

  • GCC 7.2: 47.7
  • Intel C 18: 44.1

  • Gfortran 7.2: 43.6
  • Intel Fortran 18: 45.8

  • Python 3.6.3: 530
  • Python 3.6.3 Numba 0.38.0: 50.6
  • Cython 0.28.2: 26.7

  • Matlab R2018a: 31.8
  • Octave 4.0.2: 3060

  • Julia 1.0.0: 53.5

  • IDL 8.6: 274

Ivy Bridge desktop

  • GCC 7.3: 46.9
  • Intel C 2019: 43.0
  • Clang 5.0.0: 45.1

  • Gfortran 7.3: 39.3
  • Intel Fortran 2019: 40.2
  • Flang 5.0.0: 3.92 (about 10x faster than Gfortran and Intel Fortran)

  • Python 3.7.0: 326
  • Python 3.7.0, Numba 0.39.0: 37.9
  • Cython 0.28.5: 21.8

  • Matlab R2018a: 24.5
  • Octave 4.2.2: 2164

  • Julia 1.0.0: 47.2

  • GDL 0.9.7: 831
