Python for High Performance Data Analytics

Prerequisites

  • Basic proficiency in Python (variables, flow control, functions)

  • Basic grasp of descriptive statistics (such as minimum, maximum, median, arithmetic mean…)

  • Basic knowledge of NumPy

  • Basic knowledge of some plotting package (Matplotlib, Seaborn, Holoviz…)

What to expect from this course

Discussion

How large are the datasets you are working with?

Both classical machine/deep learning and (generative) AI require ever-larger amounts of data to train ever-growing models. Moreover, great strides in both hardware and software for high performance computing (HPC) allow large-scale computations that were not possible before. This course focuses on high performance data analytics (HPDA). The data can come from simulations, experiments, or generally available datasets, and the goal is to pre-process, analyse and visualise it. The lesson introduces parts of the modern Python stack for data analytics: packages such as Pandas, Polars and Dask, multithreading, and Streamlit for large-scale data visualisations.

Learning outcomes

This lesson provides a broad overview of methods to work with large datasets using tools and libraries from the Python ecosystem. Since this field is fairly extensive, we will try to expose just enough detail on each topic for you to get a good idea of the overall picture and an understanding of which combination of tools and libraries will work well for your particular use case.

Specifically, this lesson covers:

  • Tools for efficiently storing data and writing/reading it to/from disk

  • Interfacing with databases and object storage solutions

  • Main libraries to work with arrays and tabular data

  • Performance monitoring and benchmarking

  • Workload parallelisation: threads and Dask
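As a small taste of the last two topics, here is a minimal sketch that benchmarks a serial and a thread-based version of the same computation using only the standard library. The function names (`summarise`, `split`, `summarise_serial`, `summarise_threaded`) and the synthetic data are hypothetical, chosen for illustration only; they are not taken from the course material.

```python
import statistics
import timeit
from concurrent.futures import ThreadPoolExecutor


def summarise(chunk):
    """Return (min, max, mean) of one chunk of numbers."""
    return min(chunk), max(chunk), statistics.fmean(chunk)


def split(data, n_chunks):
    """Split a list into n_chunks roughly equal contiguous pieces."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def summarise_serial(chunks):
    """Process the chunks one after another."""
    return [summarise(c) for c in chunks]


def summarise_threaded(chunks, max_workers=4):
    """Process the chunks on a thread pool; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarise, chunks))


# Synthetic data (an assumption for this sketch): values cycling 0..999.
data = [float(i % 1000) for i in range(200_000)]
chunks = split(data, 8)

# Both strategies must produce identical results before timing them.
serial = summarise_serial(chunks)
threaded = summarise_threaded(chunks)
assert serial == threaded

# Take the best of several repeats to reduce noise from other processes.
t_serial = min(timeit.repeat(lambda: summarise_serial(chunks), number=3, repeat=3))
t_threaded = min(timeit.repeat(lambda: summarise_threaded(chunks), number=3, repeat=3))
print(f"serial:   {t_serial:.4f} s")
print(f"threaded: {t_threaded:.4f} s")
```

On standard CPython, the global interpreter lock usually prevents threads from speeding up pure-Python, CPU-bound functions like this one, so the two timings are likely to be similar. Threads tend to help when the work is I/O-bound or is done inside libraries that release the lock, which is one reason to measure before parallelising.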

See also

Credit

Don’t forget to check out the additional course materials from Data Carpentry.

The Polars documentation, Awesome data science with Python, and PythonSpeed are also valuable resources.

License