Performance fundamentals
Objectives
Learn Python-specific performance aspects
Instructor note
10 min teaching/type-along
Understanding the Python interpreter
Performance bottlenecks in Python
Have you ever written Python scripts that look something like this?
```python
def read_xyz_from_text_file():
    f = open("mydata.dat", "r")
    for line in f.readlines():
        fields = line.split(",")
        x, y, z = fields[0], fields[1], fields[2]
        # some analysis with x, y and z
    f.close()
```
Compared to C/C++/Fortran, this for-loop will probably be orders of magnitude slower!
This happens because, during execution, CPython mostly interprets instructions: the source code is first compiled to bytecode, and the interpreter's evaluation loop then executes that bytecode one instruction at a time, with only a limited level of optimization.
Important
While doing so, the interpreter
- evaluates the result expression by expression,
- packs and unpacks every intermediate result as an instance of `object` (Python) / `PyObject` (CPython API) behind the scenes.
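You can inspect these instructions yourself with the standard library's `dis` module (a minimal sketch; the function and its body are just an example of ours):

```python
import dis

def rms(x, y, z):
    return ((x**2 + y**2 + z**2) / 3) ** 0.5

# Print the bytecode the evaluation loop executes; every arithmetic
# instruction pops and pushes boxed Python objects on a value stack.
dis.dis(rms)
```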
Type-Along
Try out the following code in Python Tutor
```python
import io

def rms_from_text_file(f):
    """Compute root-mean-square of comma-separated values."""
    rms = 0
    n_samples = 0
    for line in f.readlines():
        fields = line.split(",")
        x, y, z = float(fields[0]), float(fields[1]), float(fields[2])
        # compute root-mean-square value
        rms += ((x**2 + y**2 + z**2) / 3) ** 0.5
        n_samples += 1
    return rms / n_samples

fake_file = io.StringIO("""\
0.27194615,0.85939776,0.76905204
0.51586611,0.59174447,0.06501842
0.23109192,0.8260391,0.08045166
""")

avg_rms = rms_from_text_file(fake_file)
```
Be aware that Python Tutor shows a simplified view of the execution, since it does not step into individual expressions. Still, it gives you an idea of the intermediate objects being returned and of how the interpreter steps through the code.
Discussion
In the previous episode, we described I/O-, memory- and CPU-bound bottlenecks. For the above use case and algorithm (not necessarily this exact code), what kind of performance issues arise
- when the file becomes long, with several million lines?
- when the file is stored on a network filesystem which is slow to respond?
- when, instead of 3 fields `x, y, z`, you have to read 10 million fields from every line of the text?
Solution
We can only guess at this point, but we can expect the above code to be
- CPU bound: the `for` loop becomes a hotspot, and vanilla CPython without a JIT does not optimize it.
- I/O bound: if more time is spent waiting for the output of the `f.readlines()` method.
- Memory bound: if a line of data does not fit in memory, the code needs to handle it in batches. The program would have to be rewritten with a nested for-loop whose batch size depends on the available memory.
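As a small illustration of the memory point: `f.readlines()` materializes the whole file as a list of strings, whereas iterating over the file object streams one line at a time. A minimal sketch of a streaming variant of the function above (same assumptions about the data format):

```python
def rms_streaming(f):
    """Like rms_from_text_file, but holds only one line in memory."""
    rms = 0.0
    n_samples = 0
    for line in f:  # iterate lazily instead of calling f.readlines()
        x, y, z = (float(v) for v in line.split(","))
        rms += ((x**2 + y**2 + z**2) / 3) ** 0.5
        n_samples += 1
    return rms / n_samples
```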
We have to keep in mind that performance depends a lot on the
- input data
- algorithm
Using a better container for the input data, or a better algorithm with lower computational complexity, can often outperform purely technical solutions.
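As an example of a better container (a hedged sketch; the `.npy` file name is ours): converting the text file to a binary format such as numpy's `.npy` removes the text-parsing cost from every subsequent load:

```python
import numpy as np

arr = np.loadtxt("mydata.dat", delimiter=",")  # parse the text once
np.save("mydata.npy", arr)                     # store in binary form

arr = np.load("mydata.npy")  # later loads skip text parsing entirely
```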
Structured approach towards optimization
The first priority is to look for a more efficient
- data container, data structure, database, etc.
- algorithm
If the above are not an option, then we move on to performance optimization (a small sketch follows the list).
- First, we evaluate the overall performance by benchmarking.
- Then, we measure the performance at either the function/method level or the line level by profiling.
- Finally, we generate optimized code.
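For the first two steps, the standard library already provides `timeit` for benchmarking and `cProfile` for function-level profiling. A minimal sketch, reusing `rms_from_text_file` from above on synthetic data of our own invention:

```python
import cProfile
import io
import timeit

data = "0.27,0.85,0.76\n" * 100_000  # synthetic input in the same format

# Benchmarking: total wall-clock time over 10 runs
t = timeit.timeit(lambda: rms_from_text_file(io.StringIO(data)), number=10)
print(f"10 runs took {t:.3f} s")

# Profiling: per-function breakdown of where the time goes
cProfile.run("rms_from_text_file(io.StringIO(data))")
```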
Any Python code can be replaced with optimized, compiled instructions. This is done by ahead-of-time (AOT) or just-in-time (JIT) compilation. The question which remains to be answered is: at which level? One can optimize
- whole programs (Nuitka, Shed Skin)
- the interpreter, compiling the slowest loops (PyPy)
- user-defined functions / methods (numba, transonic)
- calls to compiled functions (numpy / Python)
We will take a look at some of these approaches in the coming episodes.
Keypoints
- Develop a strategy on how to optimize.
- Go shopping:
  - Look for better ways of reading data or better algorithms.
  - Look for tools and libraries to help you alleviate the performance bottlenecks.