Making the code faster requires us to first understand where are the bottlenecks in our program. By finding out pieces of code which are taking the longest time we can narrow down our focus to those and save ourselves from spending our time and energy on parts of programme which were already optimized for performance.

Profiling to find bottlenecks

Let's say we want to calculate how much time a function foo in our program takes to run. We can do this using time module and calculating the time taken wherever the function is called

import time

start = time.time() ## Noting the start time

foo(*args, **kwargs) ## calling the function

end = time.time() ## noting the end time

print(f'The function {foo.__name__} took {end-start} seconds to run')

This approach however requires us to write the code for calculating the time taken by function everywhere the function is called. If the function is called numerous times, this approach can clutter our program.

A better approach would be to use decorator

Using Decorators

def timer_func(func):
    
    def time_measurer(*args, **kwargs):
        start = time.time()
        
        reult = func(*args, **kwargs)
        
        end = time.time()
        
        print(f'The function {func.__name__} took {end-start} seconds to run')
        
        return result
    
    return time_measurer

Now we only need to "decorate" the function as follows

@timer_func
def foo(*args, **kwargs):
    ...
    
    

The above code snippet is just a fancy way of saying foo = timer_func(foo). With this approach, we only need to write the code for calculating the time taken once and then using a decorator we can convert foo into a function that prints the time taken and returns the result. Moreover, we can time any function using this decorator

But there's one problem with this approach. Whenever the function foo will be called it will print out The function time_measurer took 10 seconds to run. This is because timer_func returns a function named "time_measurer". We can circumvent this issue by a small fix.

from functools import wraps

def timer_func(func):
    
    @wraps(func)
    def time_measurer(*args, **kwargs):
        start = time.time()
        
        reult = func(*args, **kwargs)
        
        end = time.time()
        
        print(f'The function {func.__name__} took {end-start} seconds to run')
        
        return result
    
    return time_measurer

wraps decorator forces the function time_measurer to have the same attributes as that of func.

Using magic commands

In Jupyter notebooks we can use magic %timeit for timing the function. This will return mean and standard deviation of run time of several calls to the function

import julia_set

%timeit julia_set.calc_pure_python(desired_width = 1000, max_iterations = 300)
17 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

timeit can also be run from command line as:

! python -m timeit -n 5 -r 2 -s "import julia_set" "julia_set.calc_pure_python(desired_width = 1000, max_iterations = 300)"
5 loops, best of 2: 16.2 sec per loop

Note: Running timeit using magic command return mean of all the runs while while running it from command line displays the time of the best run

Using cProfile Module

cProfile is the build in profiling tool in the standard library. Using this module gives more detailed information at the cost of greater overhead. It can be used from command line as below

!python -m cProfile -s cumulative julia_set.py
         36221990 function calls in 24.540 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   24.540   24.540 {built-in method builtins.exec}
        1    0.038    0.038   24.540   24.540 julia_set.py:1(<module>)
        1    0.894    0.894   24.503   24.503 julia_set.py:21(calc_pure_python)
        1   15.991   15.991   23.277   23.277 julia_set.py:7(calculate_z_serial_purepython)
 34219980    7.286    0.000    7.286    0.000 {built-in method builtins.abs}
  2002000    0.321    0.000    0.321    0.000 {method 'append' of 'list' objects}
        1    0.010    0.010    0.010    0.010 {built-in method builtins.sum}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


To get more control over the results of cProfile we can write the results into a statistics file as below:

! python -m cProfile -o profile.stats julia_set.py

The above line of code writes the results of cProfile in a file named profile.stats. We can analyze this file in a seperate programme using the pstats module

import pstats

p = pstats.Stats("profile.stats")

p.sort_stats("cumulative")
<pstats.Stats at 0x215b957a160>

The last line of code in the cell above sorted the functions according to the cumulative time taken by them

p.print_stats()
Sat Jul 17 10:32:32 2021    profile.stats

         36221990 function calls in 24.906 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   24.906   24.906 {built-in method builtins.exec}
        1    0.039    0.039   24.906   24.906 julia_set.py:1(<module>)
        1    0.942    0.942   24.867   24.867 julia_set.py:21(calc_pure_python)
        1   16.292   16.292   23.595   23.595 julia_set.py:7(calculate_z_serial_purepython)
 34219980    7.303    0.000    7.303    0.000 {built-in method builtins.abs}
  2002000    0.319    0.000    0.319    0.000 {method 'append' of 'list' objects}
        1    0.011    0.011    0.011    0.011 {built-in method builtins.sum}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


<pstats.Stats at 0x215b957a160>

Similarly, we can sort according to total time taken as follows:

p.sort_stats("tottime")
<pstats.Stats at 0x215b957a160>
p.print_stats()
Sat Jul 17 10:32:32 2021    profile.stats

         36221990 function calls in 24.906 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   16.292   16.292   23.595   23.595 julia_set.py:7(calculate_z_serial_purepython)
 34219980    7.303    0.000    7.303    0.000 {built-in method builtins.abs}
        1    0.942    0.942   24.867   24.867 julia_set.py:21(calc_pure_python)
  2002000    0.319    0.000    0.319    0.000 {method 'append' of 'list' objects}
        1    0.039    0.039   24.906   24.906 julia_set.py:1(<module>)
        1    0.011    0.011    0.011    0.011 {built-in method builtins.sum}
        1    0.000    0.000   24.906   24.906 {built-in method builtins.exec}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


<pstats.Stats at 0x215b957a160>

To get a sense of which function was called by which, or what are the functions being profiled we can use print_callers() method:

p.print_callers()
   Ordered by: internal time

Function                                          was called by...
                                                      ncalls  tottime  cumtime
julia_set.py:7(calculate_z_serial_purepython)     <-       1   16.292   23.595  julia_set.py:21(calc_pure_python)
{built-in method builtins.abs}                    <- 34219980    7.303    7.303  julia_set.py:7(calculate_z_serial_purepython)
julia_set.py:21(calc_pure_python)                 <-       1    0.942   24.867  julia_set.py:1(<module>)
{method 'append' of 'list' objects}               <- 2002000    0.319    0.319  julia_set.py:21(calc_pure_python)
julia_set.py:1(<module>)                          <-       1    0.039   24.906  {built-in method builtins.exec}
{built-in method builtins.sum}                    <-       1    0.011    0.011  julia_set.py:21(calc_pure_python)
{built-in method builtins.exec}                   <- 
{built-in method builtins.len}                    <-       2    0.000    0.000  julia_set.py:7(calculate_z_serial_purepython)
                                                           2    0.000    0.000  julia_set.py:21(calc_pure_python)
{method 'disable' of '_lsprof.Profiler' objects}  <- 


<pstats.Stats at 0x215b957a160>

To print which function called which other functions i.e flipping the information in previous output cell, we can use p.print_callees()

p.print_callees()
   Ordered by: internal time

Function                                          called...
                                                      ncalls  tottime  cumtime
julia_set.py:7(calculate_z_serial_purepython)     -> 34219980    7.303    7.303  {built-in method builtins.abs}
                                                           2    0.000    0.000  {built-in method builtins.len}
{built-in method builtins.abs}                    -> 
julia_set.py:21(calc_pure_python)                 ->       1   16.292   23.595  julia_set.py:7(calculate_z_serial_purepython)
                                                           2    0.000    0.000  {built-in method builtins.len}
                                                           1    0.011    0.011  {built-in method builtins.sum}
                                                     2002000    0.319    0.319  {method 'append' of 'list' objects}
{method 'append' of 'list' objects}               -> 
julia_set.py:1(<module>)                          ->       1    0.942   24.867  julia_set.py:21(calc_pure_python)
{built-in method builtins.sum}                    -> 
{built-in method builtins.exec}                   ->       1    0.039   24.906  julia_set.py:1(<module>)
{built-in method builtins.len}                    -> 
{method 'disable' of '_lsprof.Profiler' objects}  -> 


<pstats.Stats at 0x215b957a160>

Visualising cProfile Output using SnakeViz

We can use snakeviz visulaiser to visualize the outputs of cProfile profiler

!pip install snakeviz
! snakeviz profile.stats

The above line of produces a graphic as shown below:

Using line_profiler for Line-by-Line measurements

Above profiling methods only give information at the function level. That's a good starting point. Once we have figured out the functions which are taking up the most time we can profile them further line by line using the line_profiler tool by Robert Kern

!pip install line_profiler

Before running the relevant line_profiler commands from the command line we have to decorate the functions we are interested in by @profile decorator. The kernprof script is used to execute the code. The argument -l specifies that we want line-by-line profiling instead of function level. The -v argument stand for verbose output. Without -v you receive an .lprof output which can be later analyzed by line-profiler module.

! kernprof -l -v julia_set.py
Wrote profile results to julia_set.py.lprof
Timer unit: 1e-06 s

Total time: 93.4638 s
File: julia_set.py
Function: calculate_z_serial_purepython at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def calculate_z_serial_purepython(maxiter, zs, cs):
     9                                               """Calculate output list using Julia update rule"""
    10         1       5846.4   5846.4      0.0      output = [0] * len(zs)
    11   1000001     656046.7      0.7      0.7      for i in range(len(zs)):
    12   1000000     655067.5      0.7      0.7          n = 0
    13   1000000     750590.3      0.8      0.8          z = zs[i]
    14   1000000     719062.3      0.7      0.8          c = cs[i]
    15  34219980   36962398.8      1.1     39.5          while abs(z) < 2 and n < maxiter:
    16  33219980   27921885.4      0.8     29.9              z = z * z + c
    17  33219980   25029336.6      0.8     26.8              n += 1
    18   1000000     763567.8      0.8      0.8          output[i] = n
    19         1         10.5     10.5      0.0      return output

Total time: 159.399 s
File: julia_set.py
Function: calc_pure_python at line 21

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           @profile
    22                                           def calc_pure_python(desired_width, max_iterations):
    23                                               """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
    24         1          9.5      9.5      0.0      x_step = (float(x2 - x1) / float(desired_width))
    25         1          2.8      2.8      0.0      y_step = (float(y1 - y2) / float(desired_width))
    26         1          1.3      1.3      0.0      x = []
    27         1          1.1      1.1      0.0      y = []
    28         1          1.2      1.2      0.0      ycoord = y2
    29      1001       1255.8      1.3      0.0      while ycoord > y1:
    30      1000       1535.0      1.5      0.0          y.append(ycoord)
    31      1000       1291.5      1.3      0.0          ycoord += y_step
    32         1          1.2      1.2      0.0      xcoord = x1
    33      1001       1235.3      1.2      0.0      while xcoord < x2:
    34      1000       1406.3      1.4      0.0          x.append(xcoord)
    35      1000       1230.4      1.2      0.0          xcoord += x_step
    36                                               # set width and height to the generated pixel counts, rather than the
    37                                               # pre-rounding desired width and height
    38         1          3.1      3.1      0.0      width = len(x)
    39         1          1.5      1.5      0.0      height = len(y)
    40                                               # build a list of co-ordinates and the initial condition for each cell.
    41                                               # Note that our initial condition is a constant and could easily be removed,
    42                                               # we use it to simulate a real-world scenario with several inputs to our
    43                                               # function
    44         1          1.4      1.4      0.0      zs = []
    45         1          1.4      1.4      0.0      cs = []
    46      1001        958.0      1.0      0.0      for ycoord in y:
    47   1001000     886158.6      0.9      0.6          for xcoord in x:
    48   1000000    1429082.3      1.4      0.9              zs.append(complex(xcoord, ycoord))
    49   1000000    1487039.5      1.5      0.9              cs.append(complex(c_real, c_imag))
    50                                           
    51                                           #     print("Length of x:", len(x))
    52                                           #     print("Total elements:", len(zs))
    53                                           #     start_time = time.time()
    54         1  155577400.0 155577400.0     97.6      output = calculate_z_serial_purepython(max_iterations, zs, cs)
    55                                           #     end_time = time.time()
    56                                           #     secs = end_time - start_time
    57                                           #     print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    58                                           
    59                                               # this sum is expected for 1000^2 grid with 300 iterations
    60         1      10861.3  10861.3      0.0      assert sum(output) == 33219980

By looking at the %Time column we can see that the while loop in function calc_pure_python takes 39.5 % of the total time it takes to run the whole programme. More specifically the line while abs(z) < 2 and n < maxiter takes up 39.5% of the time. We can't figure out which part of either side of and takes the most time yet. Also we can see that an operation as simple as n += 1 takes up 26% of the total time. This is because of the python's dynamic lookup machinery. Every time we encounter that line in the loop python checks whether the n object has an __add__ function. (If it didn't, it'll walk up any inherited classes to see whether any of them has it.), then integer 1 is passed to the __add__ function to it's thing. To reduce the cost of this rather simple operation, type specialization is the way to go (which we'll encounter later).

Now getting back to that time consuming while loop; we want to figure out which one of abs(z) < 2 and n < maxiter took greater time. Whichever takes less time and is False more often should be put on the left side of and because if first argument of an and statement is False it doesn't even need to look at the second argument.

We can breakup the two statements on either send of and into seperate lines and then run a line profiler as below:

statement1 = abs(z) < 2
statement2 = n < maxiter

while statement1 and statement2:
    ...
! kernprof -l -v julia_set.py
Wrote profile results to julia_set.py.lprof
Timer unit: 1e-06 s

Total time: 156.846 s
File: julia_set.py
Function: calculate_z_serial_purepython at line 21

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           @profile
    22                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    23                                               """Calculate output list using Julia update rule"""
    24         1       6964.2   6964.2      0.0      output = [0] * len(zs)
    25   1000001     727030.2      0.7      0.5      for i in range(len(zs)):
    26   1000000     703580.7      0.7      0.4          n = 0
    27   1000000     822355.9      0.8      0.5          z = zs[i]
    28   1000000     795890.2      0.8      0.5          c = cs[i]
    29                                                   while True:
    30  34219980   39121477.8      1.1     24.9              statement1 = abs(z) < 2
    31  34219980   27895802.9      0.8     17.8              statement2 = n < maxiter
    32  34219980   27064761.2      0.8     17.3              if statement1 and statement2: 
    33  33219980   30971767.7      0.9     19.7                  z = z * z + c
    34  33219980   27879240.0      0.8     17.8                  n += 1
    35                                                       else: 
    36                                                           break
    37   1000000     857110.0      0.9      0.5          output[i] = n
    38         1         10.5     10.5      0.0      return output

Total time: 275.555 s
File: julia_set.py
Function: calc_pure_python at line 41

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41                                           @profile
    42                                           def calc_pure_python(desired_width, max_iterations):
    43                                               """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
    44         1          8.5      8.5      0.0      x_step = (float(x2 - x1) / float(desired_width))
    45         1          2.6      2.6      0.0      y_step = (float(y1 - y2) / float(desired_width))
    46         1          1.3      1.3      0.0      x = []
    47         1          1.2      1.2      0.0      y = []
    48         1          1.2      1.2      0.0      ycoord = y2
    49      1001       1366.8      1.4      0.0      while ycoord > y1:
    50      1000       1751.9      1.8      0.0          y.append(ycoord)
    51      1000       1422.5      1.4      0.0          ycoord += y_step
    52         1          1.3      1.3      0.0      xcoord = x1
    53      1001       1414.5      1.4      0.0      while xcoord < x2:
    54      1000       1628.6      1.6      0.0          x.append(xcoord)
    55      1000       1438.4      1.4      0.0          xcoord += x_step
    56                                               # set width and height to the generated pixel counts, rather than the
    57                                               # pre-rounding desired width and height
    58         1          4.3      4.3      0.0      width = len(x)
    59         1          1.7      1.7      0.0      height = len(y)
    60                                               # build a list of co-ordinates and the initial condition for each cell.
    61                                               # Note that our initial condition is a constant and could easily be removed,
    62                                               # we use it to simulate a real-world scenario with several inputs to our
    63                                               # function
    64         1          2.1      2.1      0.0      zs = []
    65         1          1.4      1.4      0.0      cs = []
    66      1001       1065.7      1.1      0.0      for ycoord in y:
    67   1001000    1016534.9      1.0      0.4          for xcoord in x:
    68   1000000    1584552.3      1.6      0.6              zs.append(complex(xcoord, ycoord))
    69   1000000    1627592.5      1.6      0.6              cs.append(complex(c_real, c_imag))
    70                                           
    71                                           #     print("Length of x:", len(x))
    72                                           #     print("Total elements:", len(zs))
    73                                           #     start_time = time.time()
    74         1  271301669.6 271301669.6     98.5      output = calculate_z_serial_purepython(max_iterations, zs, cs)
    75                                           #     end_time = time.time()
    76                                           #     secs = end_time - start_time
    77                                           #     print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    78                                           
    79                                               # this sum is expected for 1000^2 grid with 300 iterations
    80         1      14830.6  14830.6      0.0      assert sum(output) == 33219980

It looks as if checking abs(z) < 2 takes more time than checking n < maxiter. Now we need to know which of these two statements is False more often. n < maxiter will be false on 1 of every 301 calls since maxiter = 300 in the programme.