Profiling and optimising


  • Learn how to benchmark and profile Python code

  • Understand how optimisation can be algorithmic or based on CPU or memory usage

  • Learn how to boost performance using Numba and Cython

Instructor note

  • 40 min teaching/type-along

  • 40 min exercises

Once your code is working reliably, you can start thinking of optimising it.


Always measure the code before you start optimization. Don’t base your optimization on theoretical consideration, otherwise you’ll have surprises.



One of the easy way to profile the program is to use the time function:

import time
# start the timer
# here are the code you would like to profile
a = np.arange(1000)
a = a ** 2
# stop the timer
print("Runtime: {:.4f} seconds".format(end_time - start_time))
# Runtime: 0.0001 seconds


If you’re using a Jupyter notebook, the best choice will be to use %timeit to time a small piece of code:

import numpy as np

a = np.arange(1000)

%timeit a ** 2
# 1.4 µs ± 25.1 ns per loop

One can also use the cell magic %%timeit to benchmark a full cell.


For long running calls, using %time instead of %timeit; it is less precise but faster


For more complex code, one can use the built-in python profilers, cProfile or profile.

As a demo, let us consider the following code which simulates a random walk in one dimension (we can save it as or download from here):

import numpy as np

def step():
    import random
    return 1. if random.random() > .5 else -1.

def walk(n):
    x = np.zeros(n)
    dx = 1. / n
    for i in range(n - 1):
        x_new = x[i] + dx * step()
        if x_new > 5e-3:
            x[i + 1] = 0.
            x[i + 1] = x_new
    return x

if __name__ == "__main__":
    n = 100000
    x = walk(n)

We can profile it with cProfile:

$  python -m cProfile -s time

The -s switch sorts the results by time. Other options include e.g. function name, cumulative time, etc. However, this will print a lot of output which is difficult to read.

$ python -m cProfile -o

It’s also possible to write the profile to a file with the -o flag and view it with profile pstats module or profile visualisation tools like Snakeviz or profile-viewer.


Similar functionality is available in interactive IPython or Jupyter sessions with the magic command %%prun.


The cProfile tool tells us which function takes most of the time but it does not give us a line-by-line breakdown of where time is being spent. For this information, we can use the line_profiler tool.

Demo: line profiling

For line-profiling source files from the command line, we can add a decorator @profile to the functions of interests. If we do this for the step() and walk() function in the example above, we can then run the script using the program which comes with line_profiler, making sure to include the switches -l, --line-by-line and -v, --view:

$ kernprof -l -v

line_profiler also works in a Jupyter notebook. First one needs to load the extension:

%load_ext line_profiler

If the walk() and step() functions are defined in code cells, we can get the line-profiling information by:

%lprun -f walk -f step walk(10000)
  • Based on the output, can you spot a mistake which is affecting performance?

Performance optimization

Once we have identified the bottlenecks, we need to make the corresponding code go faster.

Algorithm optimization

The first thing to look into is the underlying algorithm you chose: is it optimal? To answer this question, a good understanding of the maths behind the algorithm helps. For certain algorithms, many of the bottlenecks will be linear algebra computations. In these cases, using the right function to solve the right problem is key. For instance, an eigenvalue problem with a symmetric matrix is much easier to solve than with a general matrix. Moreover, most often, you can avoid inverting a matrix and use a less costly (and more numerically stable) operation. However, it can be as simple as moving computation or memory allocation outside a loop, and this happens very often as well.

Singular Value Decomposition

Singular Value Decomposition (SVD) is quite often used in climate model data analysis. The computational cost of this algorithm is roughly \(n^3\) where \(n\) is the size of the input matrix. However, in most cases, we are not using all the output of the SVD, but only the first few rows of its first returned argument. If we use the svd implementation from SciPy, we can ask for an incomplete version of the SVD. Note that implementations of linear algebra in SciPy are richer then those in NumPy and should be preferred. The following example demonstrates the performance benefit for a “slim” array (i.e. much larger along one axis):

import numpy as np
data = np.random.random((4000,100))

%timeit np.linalg.svd(data)
# 1.09 s ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from scipy import linalg

%timeit linalg.svd(data)
# 1.03 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit linalg.svd(data, full_matrices=False)
# 21.2 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.linalg.svd(data, full_matrices=False)
# 23.8 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CPU usage optimization


Arithmetic is one place where NumPy performance outperforms python list and the reason is that it uses vectorization. A lot of the data analysis involves a simple operation being applied to each element of a large dataset. In such cases, vectorization is key for better performance. In practice, a vectorised operation means reframing the code in a manner that completely avoids a loop and instead uses e.g. slicing to apply the operation on the whole array (slice) at one go. For example, the following code for calculating the difference of neighbouring elements in an array:

Consider the following code:


import numpy as np
a = np.arange(1000)
a_dif = np.zeros(999, np.int64)
for i in range(1, len(a)):
    a_dif[i-1] = a[i] - a[i-1]

# 564 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can the for loop be vectorized? We need to use clever indexing to get rid of the loop:


import numpy as np
a = np.arange(1000)
a_dif = a[1:] - a[:-1]

# 2.12 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

The first brute force approach using a for loop is much slower than the second vectorised form!

So one should consider using vectorized operations whenever possible, not only for performance but also because the vectorized version can be more convenient.

What if we have a function that only take scalar values as input, but we want to apply it element-by-element on an array? We can vectorize the function! Let’s define a simple function f which takes scalars input:

import math
def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

If we pass an array we get an error

x = np.ones(10000, dtype=np.int8)

# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 2, in f
# TypeError: only size-1 arrays can be converted to Python scalars

We could loop over the array:

for i in x:

# 49.9 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, in order to pass a NumPy array it is better to vectorize the function using np.vectorize() which takes a nested sequence of objects or NumPy arrays as inputs and returns a single NumPy array or a tuple of NumPy arrays:

import numpy as np
import math

def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

f_numpy = np.vectorize(f)

# benchmark
x = np.ones(10000, dtype=np.int8)
%timeit f_numpy(x,x)
# 4.84 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For high performance vectorization, another choice is to use Numba. Adding the decorator in a function, Numba will figure out the rest for you:

import numba
import math

def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

f_numba = numba.vectorize(f)

# benchmark
x = np.ones(10000, dtype=np.int8)
%timeit f_numba(x,x)

# 89.2 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Memory usage optimization


Basic operations of NumPy are elementwise, and the shape of the arrays should be compatible. However, in practice under certain conditions, it is possible to do operations on arrays of different shapes. NumPy expands the arrays such that the operation becomes viable.


Broadcasting Rules

  • Dimensions match when they are equal, or when either is 1 or None.

  • In the latter case, the dimension of the output array is expanded to the larger of the two.

  • Broadcasted arrays are never physically constructed, which saves memory.


import numpy as np
a = np.array([1, 2, 3])
b = 4
a + b

Cache effects

Memory access is cheaper when it is grouped: accessing a big array in a continuous way is much faster than random access. This implies amongst other things that smaller strides are faster:

c = np.zeros((10000, 10000), order='C')

%timeit c.sum(axis=0)
# 1 loops, best of 3: 3.89 s per loop

%timeit c.sum(axis=1)
# 1 loops, best of 3: 188 ms per loop

# (80000, 8)

This is the reason why Fortran ordering or C ordering may make a big difference on operations.

Temporary arrays

  • In complex expressions, NumPy stores intermediate values in temporary arrays

  • Memory consumption can be higher than expected

a = np.random.random((1024, 1024, 50))
b = np.random.random((1024, 1024, 50))

# two temporary arrays will be created
c = 2.0 * a - 4.5 * b

# four temporary arrays will be created, and from which two are due to unnecessary parenthesis
c = (2.0 * a - 4.5 * b) + (np.sin(a) + np.cos(b))

# solution
# apply the operation one by one for really large arrays
c = 2.0 * a
c = c - 4.5 * b
c = c + np.sin(a)
c = c + np.cos(b)
  • Broadcasting approaches can lead also to hidden temporary arrays

    • Input data M x 3 array

    • Output data M x M array

    • There is a temporary M x M x 3 array

import numpy as np
X = np.random.random((M, 3))
D = np.sqrt(((X[:, np.newaxis, :] - X) ** 2).sum(axis=-1))


  • Evaluation of complex expressions with one operation at a time can lead also into suboptimal performance

    • Effectively, one carries out multiple for loops in the NumPy C-code

  • Numexpr package provides fast evaluation of array expressions

import numexpr as ne
x = np.random.random((10000000, 1))
y = np.random.random((10000000, 1))
%timeit y = ((.25*x + .75)*x - 1.5)*x - 2
%timeit y = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
  • By default, Numexpr tries to use multiple threads

  • Number of threads can be queried and set with numexpr.set_num_threads(nthreads)

  • Supported operators and functions: +,-,*,/,**, sin, cos, tan, exp, log, sqrt

  • Speedups in comparison to NumPy are typically between 0.95 and 4

  • Works best on arrays that do not fit in CPU cache

Performance boosting

For many user cases, using NumPy or Pandas is sufficient. However, in some computationally heavy applications, it is possible to improve the performance by pre-compiling expensive functions. Cython and Numba are among the popular choices and both of them have good support for NumPy arrays.


Cython is a superset of Python that additionally supports calling C functions and declaring C types on variables and class attributes. Under Cython, source code gets translated into optimized C/C++ code and compiled as Python extension modules.

Developers can run the cython command-line utility to produce a .c file from a .py file which needs to be compiled with a C compiler to an .so library which can then be directly imported in a Python program. There is, however, also an easy way to use Cython directly from Jupyter notebooks through the %%cython magic command. We will restrict the discussion here to the Jupyter-way. For a full overview of the capabilities refer to the documentation.

Demo: Cython

Consider the following pure Python code which integrates a function:

import numpy as np

def f(x):
    return x ** 2 - x

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

def apply_integrate_f(col_a, col_b, col_N):
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f(col_a[i], col_b[i], col_N[i])
    return res

We generate a dataframe and apply the apply_integrate_f() function on its columns, timing the execution:

import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1000),
                  "b": np.random.randn(1000),
                  "N": np.random.randint(100, 1000, (1000))})

%timeit apply_integrate_f(df['a'], df['b'], df['N'])
# 321 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now import the Cython extension:

%load_ext cython

As a first cythonization step we add the cython magic command with the -a, --annotate flag, %%cython -a, to the top of the Jupyter code cell. The yellow coloring in the output shows us the amount of pure Python:


Our task is to remove as much yellow as possible by explicitly declaring variables and functions. We can start by simply compiling the code using Cython without any changes:


import numpy as np
import pandas as pd

def f_cython(x):
    return x * (x - 1)

def integrate_f_cython(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_cython(a + i * dx)
    return s * dx

def apply_integrate_f_cython(col_a, col_b, col_N):
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_cython(col_a[i], col_b[i], col_N[i])
    return res
%timeit apply_integrate_f_cython(df['a'], df['b'], df['N'])
# 276 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Simply by using Cython and a copy-and-paste gives us about 10% increase in performance. Now we can start adding data type annotation to the input variables:


import numpy as np
import pandas as pd

def f_cython_dtype0(double x):
    return x ** 2 - x

def integrate_f_cython_dtype0(double a, double b, long N):   
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_cython_dtype0(a + i * dx)
    return s * dx

def apply_integrate_f_cython_dtype0(double[:] col_a, double[:] col_b, long[:] col_N):  
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_cython_dtype0(col_a[i], col_b[i], col_N[i])
    return res
# this will not work
#%timeit apply_integrate_f_cython_dtype0(df['a'], df['b'], df['N'])
# but rather
%timeit apply_integrate_f_cython_dtype0(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())
# 41.4 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


You can not pass a Series directly since the Cython definition is specific to an array. Instead using the Series.to_numpy() to get the underlying NumPy array which works nicely with Cython.

We use C data types like double, long to define variables.

Next step, we can start adding type annotation to the functions. There are three ways of declaring functions:

  • def - Python style:

Declaring the types of arguments and local types (thus return values) can allow Cython to generate optimised code which speeds up the execution. If the types are declared then a TypeError will be raised if the function is passed the wrong types.

  • cdef - C style:

Cython treats the function as pure C functions. All types must be declared. This will give you the best performance but there are a number of consequences. One should really take care of the cdef declared functions, since you are actually writing in C.

  • cpdef - Python/C mixed:

cpdef functions combine both def and cdef: one can use cdef for C types and def for Python types. In terms of performance, cpdef functions may be as fast as those using cdef and might be as slow as def declared functions.


import numpy as np
import pandas as pd

cdef f_cython_dtype1(double x):
    return x ** 2 - x

cpdef integrate_f_cython_dtype1(double a, double b, long N):   
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_cython_dtype1(a + i * dx)
    return s * dx

cpdef apply_integrate_f_cython_dtype1(double[:] col_a, double[:] col_b, long[:] col_N):
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_cython_dtype1(col_a[i], col_b[i], col_N[i])
    return res
%timeit apply_integrate_f_cython_dtype1(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())
# 37.2 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Last step, we can add type annotation to the local variables within the functions and output.


import numpy as np
import pandas as pd

cdef double f_cython_dtype2(double x):
    return x ** 2 - x

cpdef double integrate_f_cython_dtype2(double a, double b, long N):   
    cdef double s, dx
    cdef long i
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_cython_dtype2(a + i * dx)
    return s * dx

cpdef double[:] apply_integrate_f_cython_dtype2(double[:] col_a, double[:] col_b, long[:] col_N):
    cdef long n,i
    cdef double[:] res
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_cython_dtype2(col_a[i], col_b[i], col_N[i])
    return res
%timeit apply_integrate_f_cython_dtype2(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())
# 696 µs ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Now it is over 400 times faster than the original Python implementation, and all we have done is to add type declarations! If we add the -a annotation flag we indeed see much less Python interaction in the code.



An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with Numba. Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by simply adding the decorator @jit in your function. However, the @jit compilation will add overhead to the runtime of the function, i.e. the first time a function is run using Numba engine will be slow as Numba will have the function compiled. Once the function is JIT compiled and cached, subsequent calls will be fast. So the performance benefits may not be realized especially when using small datasets.

Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack. The optimized machine code is generated by the LLVM compiler infrastructure.

Demo: Numba

Consider the integration example again using Numba this time:

import numpy as np
import numba

def f_numba(x):
    return x ** 2 - x

def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_numba(a + i * dx)
    return s * dx

def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return res
# try passing Pandas Series
%timeit apply_integrate_f_numba(df['a'],df['b'],df['N'])
# 6.02 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# try passing NumPy array
%timeit apply_integrate_f_numba(df['a'].to_numpy(),df['b'].to_numpy(),df['N'].to_numpy())
# 625 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Numba is best at accelerating functions that apply numerical functions to NumPy arrays. When used with Pandas, pass the underlying NumPy array of Series or DataFrame (using to_numpy()) into the function. If you try to @jit a function that contains unsupported Python or NumPy code, compilation will fall back to the object mode which will mostly likely be very slow. If you would prefer that Numba throw an error for such a case, you can do e.g. @numba.jit(nopython=True) or @numba.njit.

We can further add date type, although in this case there is not much performance improvement:

import numpy as np
import numba

def f_numba_dtype(x):
    return x ** 2 - x

def integrate_f_numba_dtype(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_numba_dtype(a + i * dx)
    return s * dx

def apply_integrate_f_numba_dtype(col_a, col_b, col_N):
    n = len(col_N)
    res = np.empty(n,dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_numba_dtype(col_a[i], col_b[i], col_N[i])
    return res
%timeit apply_integrate_f_numba_dtype(df['a'].to_numpy(),df['b'].to_numpy(),df['N'].to_numpy())
# 625 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Numba vs Cython

Should you use Numba or Cython? Does it matter?

  • Performance is usually very similar and exact results depend on versions of Python, Cython, Numba and NumPy.

  • Numba is generally easier to use (just add @jit)

  • Cython is more stable and mature, Numba developing faster

  • Numba also works for GPUs

  • Cython can compile arbitrary Python code and directly call C libraries, Numba has restrictions

  • Numba requires LLVM toolchain, Cython only C compiler.


NumPy is really good at what it does. For simple operations or small data, Numba or Cython is not going to outperform it. But when things get more complex these frameworks can save the day!


Profile the word-autocorrelation code

Revisit the word-autocorrelation code. To clone the repository (if you haven’t already):

$ git clone

To run the code, type:

$ python source/ data/pg99.txt processed_data/pg99.dat results/acf_pg99.dat

Add @profile to the word_acf() function, and run (or just kernprof) from the command line. What lines of this function are the most expensive?

Is the word_acf() function efficient?

Have another look at the word_acf() function from the word-count project.

def word_acf(word, text, timesteps):
    Calculate word-autocorrelation function for given word 
    in a text. Each word in the text corresponds to one "timestep".
    acf = np.zeros((timesteps,))
    mask = [w==word for w in text]
    nwords_chosen = np.sum(mask)
    nwords_total = len(text)
    for t in range(timesteps):
        for i in range(1,nwords_total-t):
            acf[t] += mask[i]*mask[i+t]
        acf[t] /= nwords_chosen      
    return acf

Do you think there is any room for improvement? How would you go about optimizing this function?

Try to implement one faster version!

Pairwise distance

Consider the following Python function:

import numpy as np

def dis_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

Start by profiling it in Jupyter:

X = np.random.random((1000, 3))
%timeit dis_python(X)

Now try to speed it up with NumPy (i.e. vectorise the function), Numba or Cython (depending on what you find most interesting). Make sure that you’re getting the correct result, and then benchmark it with %timeit.

Bubble sort

To make a long story short, in the worse case the time taken by the Bubblesort algorithm is roughly \(O(n^2)\) where \(n\) is the number of items being sorted.


Here is a function that performs bubble-sort:

def bs_python(a_list):
    N = len(a_list)
    for i in range(N):
        for j in range(1, N):
            if a_list[j] < a_list[j-1]:
                a_list[j-1], a_list[j] = a_list[j], a_list[j-1]
    return a_list

And this is how you can benchmark it:

import random
l = [random.randint(1,1000) for num in range(1, 1000)]
%timeit bs_python(l)

Now try to speed it up with Numba or Cython (depending on what you find most interesting). Make sure that you’re getting the correct result, and then benchmark it with %timeit.


  • Measure and benchmark before you start optimizing

  • Optimization can be to change algorithms, optimise memory usage or add vectorization, or to convert performance-critical functions to Numba or Cython