Performance Boost in Python: Empowering Your Scripts with Numba’s JIT Compiler

In most scientific Python projects, efficiency is key. Today, let’s delve into the transformative realm of parallelization using Numba’s JIT (Just-In-Time) compiler. In this post, I’ll demonstrate how we can use the Numba library to speed up the execution of 10 million Monte Carlo experiments to approximate the number pi. The potential speedup we can get from Numba is twofold:

  • First, we can get a speedup by using the JIT (just-in-time) compiler to pre-compile Python code which is executed very frequently.
  • Second, we can get another speedup by parallelizing code execution and distributing the work among the available CPU cores.

An excellent introduction to working with Numba can be found here. The speedup achieved through compilation is a low-hanging fruit, requiring no code adaptation. However, to harness the advantages of parallel execution, code adjustments are often necessary so that the work can be processed in independently running loops. As demonstrated in our example of Monte Carlo pi approximation, you will see that parallelization is also straightforward to implement.

The Basic Idea of Monte Carlo Pi Approximation

The idea of the pi approximation with the Monte Carlo method is as follows: we generate a certain number of uniformly distributed random points in the unit square and count how many of them lie within the unit circle.
Since the quarter of the unit circle contained in the unit square covers an area of pi/4, while the square itself has area 1, we can approximate the circle number pi from the ratio of the points in the circle to the total number of points generated.
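
With N points in total and M of them landing inside the circle, the estimator reads:

M / N ≈ pi/4   =>   pi ≈ 4 * M / N

This is exactly the expression 4*circle_counter/num_exp evaluated at the end of the scripts below.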

The Basic Script

Let’s start with the basic code version of the Monte Carlo pi approximation. The following script performs a predefined number of Monte Carlo experiments by calling the mc_experiment function:

import random 
import math

# perform num_exp Monte Carlo experiments
def mc_experiment(num_exp):
    circle_counter = 0
    for i in range(num_exp):
        # generate random point (x/y)
        x = random.random()
        y = random.random()
        z = math.sqrt(x**2 + y**2)
        # if point in unit circle, increase counter
        if z <= 1:
            circle_counter += 1
    return circle_counter 
    
num_exp = 10000000
circle_counter = mc_experiment(num_exp)
print(f"Pi approx.: {4*circle_counter/num_exp}.")

The script generates 10 million randomly selected points within the unit square and returns the number of points that lie within the unit circle.

To better compare the plain and Numba versions of the scripts, let’s add some additional code to collect important information such as the execution time and the number of experiments performed per second:

import time

# mc_experiment function definition
...

num_exp = 10000000
start = time.time()
circle_counter = mc_experiment(num_exp)
end = time.time()
runtime = end - start
pi_approx = 4*circle_counter/num_exp
speed = num_exp / (end - start)
print(f"Pi approx.: {pi_approx}.")
print(f"Rel. error [pc]: {(pi_approx - math.pi)/math.pi*100}.")
print(f"Execution time: {runtime}.")
print(f"Exp/s: {speed}.")

When I execute the script above on my laptop running Windows on an Intel Core i7-8665U CPU @ 1.90 GHz with 8 logical cores, I get the following output:

Pi approx.: 3.1414808.
Rel. error [pc]: -0.003560410343627188.
Execution time: 8.228349447250366.
Exp/s: 1215310.563085244.

In the basic version of the script, Python runs about 1.2 million experiments per second, which is already quite a considerable result! Now let’s introduce Numba and check how this affects the runtime. The following code adaptations are required:

  • Import the numba library
  • Add the jit annotation to the function to be precompiled
    This annotation instructs Numba to compile the function just-in-time, optimizing it for subsequent parallel executions.
  • Modify the loops to be parallelizable
    Here, this is an easy job, since the random experiments do not need to share information. Depending on the task to perform, this can be considerably more challenging.

Here is an important note on how the JIT compiler works: a function is only compiled the first time it is called, after which the machine code is cached. We therefore only benefit from the speedup of the JIT compiler from the second function call onwards. To correctly measure the speedup, we initially have to invoke the function once in order to compile it. The subsequent calls will then be based on pure machine code execution.

import random
import math
import numba
import time
  
@numba.jit(nopython=True, fastmath=True, parallel=True)
def mc_experiment_jit(num_exp):
    circle_counter = 0
    for i in numba.prange(num_exp):
        x = random.random()
        y = random.random()
        z = math.sqrt(x**2 + y**2)
        if z <= 1:
            circle_counter += 1
    return circle_counter

# first 'dummy' function call to trigger JIT-compilation
mc_experiment_jit(1)
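
Instead of the bare dummy call, you can also time the first and the second invocation to make the compilation overhead visible (a minimal sketch continuing the script above; the absolute numbers will vary from machine to machine):

start = time.time()
mc_experiment_jit(1)   # first call: triggers JIT compilation
print(f"First call (incl. compilation): {time.time() - start} s")

start = time.time()
mc_experiment_jit(1)   # second call: runs the cached machine code
print(f"Second call (compiled): {time.time() - start} s")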

The script above is the Numba version of our first script. As you can see, the function gets an annotation with the arguments nopython, fastmath and parallel: nopython=True forces compilation to pure machine code instead of falling back to slower object code; fastmath=True allows the compiler to relax strict floating-point semantics for faster math; and parallel=True enables automatic parallelization.

@numba.jit(nopython=True, fastmath=True, parallel=True)
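
As a side note, Numba also provides njit, a shorthand for jit with nopython=True, so the decorator above can equivalently be written as:

@numba.njit(fastmath=True, parallel=True)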

Please refer to the Numba documentation for more details about the function annotation here, or find a guide to parallelization with Numba here.

Apart from that, the only change we had to apply inside the function is to use numba.prange instead of the plain old Python range function, which tells Numba that the loop iterations are independent and may be distributed across cores.
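
For comparison, here is the loop header before and after the change:

# plain Python version: sequential loop
for i in range(num_exp):
    ...

# Numba version: iterations may be distributed across CPU cores
for i in numba.prange(num_exp):
    ...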

Now it’s time to run the experiments based on the compiled and parallelized version of the code and to find out what we gained by porting our script to Numba:

# mc_experiment_jit function definition
...

# JIT compile call
mc_experiment_jit(1)

# now we're ready to use the compiled version
num_exp = 10000000
start = time.time()
circle_counter = mc_experiment_jit(num_exp)
end = time.time()
runtime = end - start
pi_approx = 4*circle_counter/num_exp
speed = num_exp / (end - start)
print(f"Pi approx.: {pi_approx}.")
print(f"Rel. error [pc]: {(pi_approx - math.pi)/math.pi*100}.")
print(f"Execution time: {runtime}.")
print(f"Exp/s: {speed}.")

When I execute this second version of the script on my laptop running Windows on an Intel Core i7-8665U CPU @ 1.90 GHz with 8 logical cores, I get the following output:

Pi approx.: 3.1418324.
Rel. error [pc]: 0.007631365254598664.
Execution time: 0.04564619064331055.
Exp/s: 219076331.6514672.

If we compare the speeds achieved, we get an overall speedup factor of about 180 (roughly 219 million vs. 1.2 million experiments per second). Since parallelization alone can yield at most a speedup of 8 with 8 logical cores, we find that in this use case precompilation alone accounts for a speedup factor of at least 22.
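
For reference, here is the arithmetic behind these factors, based on the measured speeds of the two runs:

speed_plain = 1215310.563085244    # experiments/s, basic version
speed_numba = 219076331.6514672    # experiments/s, Numba version

total_speedup = speed_numba / speed_plain    # ~180
# parallelization on 8 logical cores contributes at most a factor of 8,
# so the remainder must come from compilation:
compile_speedup = total_speedup / 8          # ~22.5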

Of course, these are limited performance gains when compared with the latest options from Mojo (Modular), but they are still relatively easy to achieve. I will definitely write a post about Mojo in the near future to shed light on its possibilities.