Multithreading

Questions

  • How multithreading works in Julia?

Instructor note

  • 30 min teaching

  • 30 min exercises

Linear algebra threading

OpenBLAS is the default linear algebra backend in Julia. Because it is an external library, it manages threading by itself independent of the Julia threads. We can manage the thread count for OpenBLAS inside Julia session using the BLAS.set_num_threads() function or by setting an environment variable as follows:

export OPENBLAS_NUM_THREADS=4

Consider the multiplication of two large arrays and compare the execution time.

using LinearAlgebra

A = rand(2000, 2000)
B = rand(2000, 2000)

# Precompile the matrix multiplication
A*B

# Single thread
begin
    BLAS.set_num_threads(1)
    @show BLAS.get_num_threads()
    @time A*B
end

# All threads on the machine
begin
    BLAS.set_num_threads(Sys.CPU_THREADS)
    @show BLAS.get_num_threads()
    @time A*B
end

We should see a difference between using single and all threads.

Julia threading

We start by walking through how to use multithreading in Julia. We can set the JULIA_NUM_THREADS environment variable such that Julia will start the set thread count.

export JULIA_NUM_THREADS=4

Alternatively, we can also use the -t/--threads option to supply the thread count. Using an option will override the value in the environment variable.

# Short form
julia -t 4
# Long form
julia --threads=4

In Julia, we can import the Threads module for easier usage.

using Base.Threads

The nthreads function show us how many threads we have available:

nthreads()

There are three main ways of approaching multithreading:

  1. Using @threads to parallelize a for loop to run in multiple threads.

  2. Using @spawn and @sync to spawn tasks in threads and synchronize them at the end of the block.

  3. Using @spawn and fetch to spawn tasks and fetch their return values once they are complete.

Choosing the most suitable method depends on the problem. Julia uses dynamic scheduler for multithreading which allows us to do nested multithreading, that is, to nest @spawn statements.

using Base.Threads

# Using @threads macro with dynamic scheduling
a = zeros(Int, 2*nthreads())
@threads for i in eachindex(a)
    a[i] = threadid()
end
println(a)

Let’s see if we can achieve any speed gain when performing a costly calculation.

function sqrt_array(A)
    B = similar(A)
    for i in eachindex(A)
        @inbounds B[i] = sqrt(A[i])
    end
    B
end

We can now compare the performance:

A = rand(1000, 1000)
@btime sqrt_array(A);
@btime threaded_sqrt_array(A);

# make sure we're getting the correct value
sqrt_array(A)  threaded_sqrt_array(A)

With 4 threads, the speedup could be about a factor of 3.

Threading overhead

Julia threading has an overhead of a few microseconds (equivalent to thousands of computations). Multithreading becomes efficient for tasks that are larger than the overhead.

Pitfalls

Just like with multithreading in other languages, one needs to be aware of possible race conditions, i.e. when the order in which threads read from and write to memory can change the result of a computation.

We can illustrate this with an example where we sum up the square root of elements of an array. The serial version provides the correct value and reference execution time. The “race condition” version illustrates how a naive implementation can lead to problems. The “atomic” version shows how we can ensure a correct results by using atomic operations. The “workaround” version shows how we can refactor the code to get both correct result and speedup.

function sqrt_sum(A)
    s = zero(eltype(A))
    for i in eachindex(A)
        @inbounds s += sqrt(A[i])
    end
    return s
end

We will observe that:

  • The serial version is slow but correct.

  • The race condition version is both slow and wrong.

  • The atomic version is correct but extremely slow.

  • The migration version is incorrect because tasks can migrate between Julia threads during execution, thus threadid() is not constant.

  • The workaround is fast and correct, but required refactoring.

Bonus questions:

  • What does eltype() do?

  • What does eachindex() do?

Threading with Threads.@threads is quite straightforward, but one needs to be careful not to introduce race conditions and sometimes that requires code refactorization. Using atomic operations adds significant overhead and thus only makes sense if each iteration of the loop takes significant time to compute.

Exercises

Multithreading the Laplace function

Consider the double for loop in the lap2d!() function:

function lap2d!(u, unew)
    M, N = size(u)
    for j in 2:N-1
        for i in 2:M-1
            @inbounds unew[i,j] = 0.25 * (u[i+1,j] + u[i-1,j] + u[i,j+1] + u[i,j-1])
        end 
    end
end

Can it safely be threaded, i.e. is there any risk of race conditions?

  • Insert the Threads.@threads macro in the right location - note that @threads currently only works on outermost loops!

  • Measure its effects with @benchmark. Since it’s cumbersome to change the “Julia: Num Threads” option in VSCode and relaunch the Julia REPL over and over, create a script instead which imports BenchmarkTools and prints benchmark results:

    bench_results = @benchmark lap2d!(u, unew)
    println(minimum(bench_results.times))
    
  • Now run with different number of threads from a terminal using julia -t <N> laplace.jl and observe the scaling.

  • Try increasing the problem size (e.g. M=N=8192). Does it scale better?

Multithread the computation of π

../_images/pi_with_darts.png

Consider the following function which estimates π by “throwing darts”, i.e. randomly sampling (x,y) points in the interval [0.0, 1.0] and checking if they fall within the unit circle.

function estimate_pi(num_points)
    hits = 0
    for _ in 1:num_points
        x, y = rand(), rand()
        if x^2 + y^2 < 1.0
            hits += 1
        end
    end
    fraction = hits / num_points
    return 4 * fraction
end
num_points = 100_000_000
estimate_pi(num_points)  # 3.14147572...

Can this function be safely threaded, i.e. is there any risk of race conditions?

  • Define a new function threaded_estimate_pi() where you implement the necessary changes to multithread the loop.

  • Run some benchmarks to explore the parallel efficiency.