Motivation

Objectives

  • Become familiar with the term “big data”

  • Know what to expect from this course

Instructor note

  • 10 min teaching/type-along

Big data

How large is your data?

How large is the data you are working with? Are you experiencing performance bottlenecks when you try to analyse it?

“Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. […] Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.” (from Wikipedia)

“Big data” is a buzzword used heavily in the tech industry, but many scientific research communities are also increasingly adopting high-throughput data production methods which lead to very large datasets. One driving force behind this development is the advent of powerful machine learning methods which enable researchers to derive novel scientific insights from large datasets. Another is the rapid advance of high-performance computing (HPC) hardware and the accompanying development of software libraries and packages which can efficiently exploit it.

This course focuses on high-performance data analytics (HPDA), a subset of high-performance computing concerned with working with large data. The data can come either from computer models and simulations or from experiments and observations, and the goal is to preprocess, analyse and visualise it to generate scientific results.

Python

Performance bottlenecks in Python

Have you ever written Python scripts that look something like this?

with open("mydata.dat", "r") as f:
    for line in f:  # iterate line by line without loading the whole file
        fields = line.split(",")
        x, y, z = float(fields[1]), float(fields[2]), float(fields[3])
        # some analysis with x, y and z

Compared to the equivalent loop in C, C++ or Fortran, this pure-Python loop will probably be orders of magnitude slower!
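For contrast, the same file can be parsed in a single vectorized call. Below is a minimal sketch using NumPy, assuming mydata.dat contains only numeric, comma-separated columns (the filename and column indices are carried over from the example above):

import numpy as np

# parse the whole file in one call; the parsing loop runs in
# compiled code rather than in the Python interpreter
data = np.loadtxt("mydata.dat", delimiter=",")

# take columns 1, 2 and 3 as whole arrays, with no per-line Python loop
x, y, z = data[:, 1], data[:, 2], data[:, 3]

Pushing per-element work down into optimized compiled loops like this is a recurring theme of this lesson.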

Despite early design choices of the Python language which made it significantly slower than conventional HPC languages, a rich and growing ecosystem of open source libraries has established Python as an industry-standard programming language for working with data at all levels of the data analytics pipeline. These libraries range from generic numerical libraries to special-purpose and/or domain-specific packages. This lesson focuses on introducing modern packages from the Python ecosystem for working with large data. Specifically, we will learn to use the following tools (a small speed comparison follows the list):

  • NumPy

  • SciPy

  • pandas

  • Xarray

  • Numba

  • Cython

  • multithreading

  • multiprocessing

  • Dask
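
To give a first taste of the speedups these libraries enable, here is a minimal sketch comparing a pure-Python loop to its NumPy equivalent (the array size and repetition count are illustrative choices, not prescribed by the lesson):

import numpy as np
from timeit import timeit

a = np.random.rand(1_000_000)

def python_sum(arr):
    # pure-Python loop: one interpreter iteration per element
    total = 0.0
    for value in arr:
        total += value
    return total

# np.sum performs the same reduction in compiled code
print("python loop:", timeit(lambda: python_sum(a), number=10))
print("numpy sum:  ", timeit(lambda: np.sum(a), number=10))

On typical hardware the NumPy call is orders of magnitude faster; the rest of this lesson extends the same idea to tabular data, multidimensional arrays and parallel execution.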

What you will learn

This lesson provides a broad overview of methods to work with large datasets using tools and libraries from the Python ecosystem. Since this field is fairly extensive, we will not have time to go into much depth. Instead, the objective is to expose just enough detail on each topic for you to get a good idea of the big picture and an understanding of which combination of tools and libraries will work well for your particular use case.

Specifically, the lesson covers:

  • Tools for storing data efficiently and reading/writing it to/from disk

  • How to share datasets and mint digital object identifiers (DOIs)

  • Main methods of efficiently working with tabular data and multidimensional arrays

  • How to measure and boost the performance of time-consuming Python functions

  • Various methods to parallelise Python code

The lesson does not cover the following:

  • Visualisation techniques

  • Machine learning

  • GPU-related techniques

Keypoints

  • Datasets are getting larger across nearly all scientific and engineering domains

  • The Python ecosystem has many libraries and packages for working with big data efficiently