Data science and machine learning

Questions

  • Can I use Julia for machine learning?

  • What are the key steps in data preprocessing in Julia?

  • How can you handle missing data in Julia?

  • How can you save your current environment in Julia?

  • What are some popular machine learning algorithms available in Julia?

  • How does Julia handle large datasets in machine learning?

  • How can you implement clustering in Julia?

  • What are some classification techniques available in Julia?

Instructor note

  • 100 min teaching

  • 50 min exercises

Working with data

In the Data Formats and Dataframes lesson, we explored a Julian approach to manipulation and visualisation of data.

Here we will learn and clustering, classification, machine learning and deep learning with some toy examples.

Download a dataset

We start by downloading a dataset containing measurements of characteristic features of different penguin species.

../_images/lter_penguins.png

Artwork by @allison_horst

Exercise

To obtain the data we simply add the PalmerPenguins package.

using Pkg
Pkg.add("PalmerPenguins")
using PalmerPenguins

As it was done in the Data Formats and Dataframes lesson, we can

dropmissing!(df)

The main features we are interested in for each penguin observation are bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. What the first three features mean is illustrated in the picture below.

../_images/culmen_depth.png

Artwork by @allison_horst

Saving the Current Setup

There are several ways to save the current setup in Julia. This section will cover three parts: saving the environment to have reproducible code and saving data using CSV files or JLD.

1. Saving the Environment

To check the current status of your Julia environment, you can use the status command in the package manager.

using Pkg
Pkg.status()
Status `~/.julia/environments/v1.9/Project.toml`
   [336ed68f] CSV v0.10.11
   [aaaa29a8] Clustering v0.15.4
   [a93c6f00] DataFrames v1.6.1
   [682c06a0] JSON v0.21.4
   [8b842266] PalmerPenguins v0.1.4

This will display the list of packages in the current environment along with their versions.

To save the state of your environment, Julia uses two files: Project.toml and Manifest.toml. The Project.toml file specifies the packages that you explicitly added to your environment, while the Manifest.toml file records the exact versions of these packages and all their dependencies1.

When you add packages using Pkg.add(), Julia automatically updates these files. Therefore, your environment’s state (i.e., the set of loaded packages) is automatically saved. Project.toml and Manifest.toml are located in the directory of your current Julia environment; in our case, ~/.julia/environments/v1.9/.

If you want to replicate this environment on another machine or in another folder, you can do the following:

  1. Copy both Project.toml and Manifest.toml to the new location.

  2. In Julia, navigate to that folder and activate the environment using Pkg.activate(".").

  3. Use Pkg.instantiate() to download all the necessary packages.

More information in section Environments at https://enccs.github.io/julia-intro/development/

2. Saving Data as a CSV File

As shown in the Data Formats and DataFrames lesson, a DataFrame can easily dumped into a CSV file using the CSV.jl package, which also allows for reading tabular data.

Exercise

You can use the CSV.jl package to save a DataFrame as a CSV file, which can be re-read later.

# using Pkg
# Pkg.add("CSV")
using CSV
CSV.write("penguins.csv", df)

And you can load it back with:

df = CSV.read("penguins.csv", DataFrame)

3. Saving Data Using JLD/JLD2

Another option is to use JLD.jl The JLD.jl package provides a way to save and load Julia variables while preserving native types. It is based on HDF5, a cross-platform, multi-language data storage format most frequently used for scientific data. However, it is written in pure Julia and does not require any of the original C HDF5 implementation.

The JLD package can be imported in the usual way:

using Pkg
Pkg.add("JLD")

A DataFrame can be saved to file in the following way:

using JLD
save("penguins.jld", "df", df)

Here we’re saving df as “df” within penguins.jld. You can load this DataFrame back in with:

df = load("penguins.jld", "df")

This will return the DataFrame df from the file and assign it back to df. In the past years, the JLD2.jl package came forward as an alternative to JLD. It is also based on HDF5 and can read h5 files saved by other HDF5 implementations. It exposes an interface similar to JLD with save() and load() functions, but the more user-friendly function jldsave() is also available:

using JLD2
jldsave("penguins.jld2"; df) # This is equivalent to the save command above
df = load("penguins.jld2", "df")

Moreover, a jldopen() function provides a file-like interface. More information can be found here.

Machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. It is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines “discover” patterns and algorithms to deal with data. Classical machine learning algorithms include (non-)linear regression, logistic regression,support vector machines (SVM), k-Nearest Neigbours, xgboost and many others, spanning supervised learning (where the algorithm is shown examples of what it should do), unsupservised learning (where it learns autonomously) and reinforcement learning (where a reward policy is used to “teach”). Deep learning, which will be discussed in more depth in a later section, is a type of machine learning algorithm.

References:

Machine learning in Julia

Despite being a relatively new language, Julia already has a strong and rapidly expanding ecosystem of libraries for machine learning and deep learning. An important advantage of Julia for ML is that it is possible to extract very good performance by writing pure Julia code, without resorting to backends written in compiled languages. A distinct feature of the Julian ML ecosystem is a well-developed stack for “scientific machine learning” (SciML) (a.k.a. physics-informed learning), which is a flavour of machine learning that incorporates physics (ODEs, PDEs…) into the learning process instead of relying only on data. SciML relies heavily on automatic differentiation - the ability to automatically compute derivatives of any function and thus incorporate it into predictive models.

Traditional machine learning

Julia has packages for traditional (non-deep) machine learning:

  • ScikitLearn.jl is a port of the popular Python package.

  • MLJ.jl provides a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing over 150 machine learning models.

  • Machine Learning · Julia Packages lists various Julia packages related to machine learning, such as MLJ.jl, Knet.jl, TensorFlow.jl, DiffEqFlux.jl, FastAI.jl, ScikitLearn.jl, and many more. You can browse the packages by their popularity, alphabetical order, or update date. Each package has a brief description and a link to its GitHub repository.

  • AI · Julia Packages lists Julia packages related to Artificial Intelligence broadly

We will use a few utility functions from MLJ.jl in our deep learning exercise below, so we will need to add it to our environment:

using Pkg
Pkg.add("MLJ")

Clustering and Classification

In this episode, we will go through a Clustering example. Clustering is an unsupervised learning technique, used to discover “groups” of datapoints (called clusters) such that:

  • Points belonging to the same cluster are similar, in the sense that they share common properties

  • Points in different clusters are not similar

Being an unsupervised method, it doesn’t need labeled examples, it aims to uncover hidden structure in the data. It is commonly used in a number of situations, including customer segmentation, geospatial analysis, biomedical data analysis, even time series analysis in some cases.

In Julia, the Clustering.jl package implements a few common clustering algorithms, including k-means, density based clustering and more, with most algorithms expecting data in a matrix of shape (features x observations).

In this particular example, we will use (a sample of) the published dataset of the New York City yellow taxi rides. This data is published monthly by an agency of the municipality of New York and includes data such as pickup and dropoff locations (lat/lon or an “ID” that can be mapped to a location) and timestamps, trip length, airport fare surcharge, fare mount, tips, and more. This dataset is very big (a few GBs, 12 million rows for January 2015), thus we prepared two subsets for local exploration: one having 100k rows and one having 50k rows. In our tests, the clustering should be able to run in just a few seconds even with 100k rows. Please download the datasets now! The idea is to try to uncover patterns in the data, e.g. correlation between location and tipping/fare, whether clusters correspond to specific location, and so on. To do this, we will use the k-means clustering algorithm. The objective is to partition the dataset into k clusters by minimising the distance between points and their assigned cluster centre. The algorithm is implemented as follows:

  1. Choose the number of clusters k. This can be done either by the user based on preliminary knowledge of the data (e.g. number of cell populations in biomedical data) or with some heuristic (e.g. elbow method)

  2. Initialise k cluster centres randomly

  3. Assign each point to the nearest centre

  4. Update each centre as the mean of the assigned points

  5. Repeat 3 and 4 until convergence

The function to minimise is:

\[\sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2\]

where:

  • \(C_k\) is a cluster

  • \(\mu_k\) is its centroid

There are several other clustering techniques, such as k-medoids, DBSCAN and hierarchical clustering. Now, let us try to apply this to our example dataset!

Clustering example of NYC yellow taxi rides

using DataFrames, Parquet2

df = DataFrame(Parquet2.Dataset("taxi_100k.parquet"); copycols=true)

dropmissing!(df)

# Remove extreme outliers
df = filter(row -> -74.1 < row[:pickup_longitude] < -73.7 && 40.6 < row[:pickup_latitude] < 40.9, df)

We can now select the features we’re interested in, such as geographical information (latitude and longitude), distance and fare:

features = select(df, [
    :pickup_longitude,
    :pickup_latitude,
    :fare_amount,
    :trip_distance,
])

Moreover, we need to scale the data (we’re computing distance between clusters, so we don’t want larger fields to be overrepresented):

using Statistics

X = Matrix(features)

μ = mean(X, dims=1)
σ = std(X, dims=1)

X_scaled = ((X .- μ) ./ σ)'

Now we can go for the actual clustering:

using Clustering

k = 6
result = kmeans(X_scaled, k)

clusters = result.assignments

Visualisation (plot pickup location, colour by cluster):

using Plots

scatter(
    df.pickup_longitude,
    df.pickup_latitude,
    marker_z = clusters,
    ms = 2,
    alpha = 0.5,
    legend = false,
    xlabel = "Longitude",
    ylabel = "Latitude",
    title = "Taxi Clusters (Location + Price + Distance)"
)

You should hopefully get something similar:

../_images/taxi_clusters.png

Exercises

Exercise1: Spatial clustering only

Cluster using only longitude and latitude and compare the results with multi-feature clustering.

Exercise 2: Add time features

Try adding the tip amount to the features and see if people in certain areas tip more!

Exercise 3: Geospatial plotting

Let us now try to get a nicer plot by plotting the clusters on the land surface of the city. To do so, we use the ``GeoMakie.jl <https://geo.makie.org/stable/>`_ package, which is suitable for geospatial data plotting, and NaturalEarth.jl which is a proxy to the Natural Earth dataset, containing the landmass (and other features) of the whole planet.

Furthermore, we also have a couple of notebooks that deal with clustering and classification problems. The first one deals with another example of clustering of housing prices in California and follows a similar structure, although it uses a different plotting package. The second one shows different ways to perform a classification task on the Iris dataset from a previous episode, building on our findings to try to predict the three species using a variety of traditional machine learning techniques such as Lasso, Support Vector Machines (SVM), Ridge regression and more. They can be run in VS Code or through the JupyterLab interface.

Clustering notebook: https://github.com/ENCCS/julia-for-hpda/blob/main/notebooks/Clustering.ipynb

Classification notebook: https://github.com/ENCCS/julia-for-hpda/blob/main/notebooks/Classification.ipynb

Deep learning

Deep learning is a subset of ML. Neural networks are a particular type of ML approach which tries to loosely mimic the functioning of a human brain and consists of connected computational units called neurons; each neuron has one or more inputs and (mostly) performs three fundamental operations: - Take a weighted sum of the inputs with a vector of weights - Add an extra constant weight (bias) - Apply an (usually non-linear) activation function to this weighted sum.

In essence, given a input vector \((x_1,x_2,...,x_N)\), a neuron computes:

\[y = f(\sum_i(x_iw_i) + b)\]

where f is the activation function, \(w_i\) are the weights and b is the bias.

These neural networks attempt to simulate the behavior of the human brain — albeit far from matching its ability — allowing it to “learn” from large amounts of data. Deep learning drives many AI applications and services that improve automation, performing analytical and physical tasks without human intervention. Large language models and other generative AI models are very large neural networks with specific architectures.

For more detailed information, the Intro to deep learning course gives a good overview of the building blocks of a neural network.

At the time of writing (2026), two main frameworks are used when dealing with neural networks in Julia: Flux.jl and Lux.jl.

Flux.jl comes “batteries-included” with many useful tools built in, but also enables the user to write own Julia code for DL components.

  • Flux has relatively few explicit APIs for features like regularisation or embeddings. Core components are available, but there are few rigid, high-level APIs. Things like regularisation and embeddings are written using standard Julia code patterns

  • All of Flux is straightforward Julia code and can be inspected/extended (no DSLs)

  • Flux works well with other Julia libraries, like dataframes, images and differential equation solvers. One can build complex data processing pipelines that integrate Flux models.

Lux.jl, on the other hand, is a newer framework that has a more functional programming style. While the building blocks are similar (layers, activation functions,etc.), the models are pure functions and the state is passed around as an argument. While this may look clunkier at first, it does have some advantages when trying to optimise code, and models can be composed, transformed and differentiated with standard Julia tools without special abstractions. Moreover, a Training API is available to hide some of the boilerplate if needed.

Generally speaking, Lux shines when integration with SciML is needed, or any time neural networks are embedded in larger scientific computing applications, and new developments tend to land in Lux first. Conversely, Flux is the more mature, general purpose deep learning framework. From a Python perspective, Flux feels more like PyTorch/Keras, whereas Lux is more akin to Jax/Flax.

To install Flux:

using Pkg
Pkg.add("Flux")

Whereas for Lux:

using Pkg
Pkg.add("Lux")

Training a deep neural network to classify penguins

To train a model we need four things:

  • A collection of data points that will be provided to the objective function.

  • A objective (cost or loss) function, that evaluates how well a model is doing given some input data.

  • The definition of a model and access to its trainable parameters.

  • An optimiser that will update the model parameters appropriately.

In this case, we will train a simple neural network to be able to classify the four species of penguins from the dataset above based on their anatomical features (bill length, depth, etc.).

First we load the data:

using MLJ: partition, ConfusionMatrix
using DataFrames
using PalmerPenguins
using OneHotArrays

table = PalmerPenguins.load()
df = DataFrame(table)
dropmissing!(df)

We can now preprocess our dataset to make it suitable for training a network:

# select feature and label columns
X = select(df, Not([:species, :sex, :island]))
Y = df[:, :species]

# split into training and testing parts
(xtrain, xtest), (ytrain, ytest) = partition((X, Y), 0.8, shuffle=true, rng=123, multi=true)

# use single precision and transpose arrays
xtrain, xtest = Float32.(Array(xtrain)'), Float32.(Array(xtest)')

# one-hot encoding
ytrain = OneHotArrays.onehotbatch(ytrain, ["Adelie", "Gentoo", "Chinstrap"])
ytest = OneHotArrays.onehotbatch(ytest, ["Adelie", "Gentoo", "Chinstrap"])

# count penguin classes to see if it's balanced
sum(ytrain, dims=2)
sum(ytest, dims=2)

Next up is the loss function which will be minimized during the training. We also define another function which will give us the accuracy of the model:

using Flux
# we use the cross-entropy loss function typically used for classification
loss(model, x, y) = Flux.crossentropy(model(x), y)

# onecold (opposite to onehot) gives back the original representation
function accuracy(x, y)
    return sum(Flux.onecold(model(x)) .== Flux.onecold(y)) / size(y, 2)
end

model will be our neural network, so we go ahead and define it:

n_features, n_classes, n_neurons = 4, 3, 10
model = Chain(
        Dense(n_features, n_neurons, sigmoid),
        Dense(n_neurons, n_classes),
        softmax)

We now set up our optimizer. We have selected the standard optimizer ADAM:

opt_state = Flux.setup(Adam(), model)

Before training the model, let’s have a look at some initial predictions and the accuracy:

# predictions before training
model(xtrain[:,1:5])
ytrain[:,1:5]
# accuracy before training
accuracy(xtrain, ytrain)
accuracy(xtest, ytest)

Finally we are ready to train the model. Let’s run 100 epochs:

# the training data and the labels can be passed as tuples to train!
for i in 1:100
    Flux.train!((m,x,y) -> loss(m, x, y), model, [(xtrain, ytrain)], opt_state)
end

# check final accuracy
accuracy(xtrain, ytrain)
accuracy(xtest, ytest)

The performance of the model is probably somewhat underwhelming, but you will fix that in an exercise below!

We finally create a confusion matrix to quantify the performance of the model:

predicted_species = OneHotArrays.onecold(model(xtest), ["Adelie", "Gentoo", "Chinstrap"])
true_species = OneHotArrays.onecold(ytest, ["Adelie", "Gentoo", "Chinstrap"])
ConfusionMatrix()(predicted_species, true_species)

Exercises

Improve the deep learning model

Improve the performance of the neural network we trained above! The network is not improving much because of the large numerical range of the input features (from around 15 to around 6000) combined with the fact that we use a sigmoid activation function. A standard method in machine learning is to normalize features by “batch normalization”. Replace the network definition with the following and see if the performance improves:

n_features, n_classes, n_neurons = 4, 3, 10
model = Chain(
          Dense(n_features, n_neurons),
          BatchNorm(n_neurons, relu),
          Dense(n_neurons, n_classes),
          softmax)

Performance is usually better also if we, instead of training on the entire dataset at once, divide the training data into “minibatches” and update the network weights on each minibatch separately. First define the following function:

using StatsBase: sample

function create_minibatches(xtrain, ytrain; batch_size=32, n_batch=10)
    minibatches = Tuple[]
    for i in 1:n_batch
        randinds = sample(1:size(xtrain, 2), batch_size)
        push!(minibatches, (xtrain[:, randinds], ytrain[:,randinds]))
    end
    return minibatches
end

and then create the minibatches by calling the function.

You will not need to manually loop over the minibatches, simply pass the minibatches vector of tuples to the Flux.train! function. Does this make a difference?

More improvements

Exercise: Hyperparameter Tuning

Experiment with different hyperparameters of the model and the training process.

# Try different batch sizes in the minibatch creation.
minibatches = create_minibatches(xtrain, ytrain, batch_size=64, n_batch=10)

# Experiment with different learning rates for the ADAM optimizer.
opt_state = Flux.setup(Adam(0.05), model)

# Change the number of neurons in the hidden layer of the model.
model = Chain(
   Dense(n_features, 20, relu),
   Dense(20, n_classes),
   softmax
)

# The solution will depend on the specific hyperparameters chosen.

Exercise: Feature Engineering

Consider doing some feature engineering on your input data.

# Try normalizing or standardizing the input features.
xtrain = (xtrain .- mean(xtrain, dims=2)) ./ std(xtrain, dims=2)
xtest = (xtest .- mean(xtest, dims=2)) ./ std(xtest, dims=2)

Exercise: Different Model Architectures

Experiment with different model architectures.

# Try adding more layers to your model.
model = Chain(
   Dense(n_features, n_neurons, relu),
   Dense(n_neurons, n_neurons, relu),
   Dense(n_neurons, n_classes),
   softmax
)

Remember to experiment and see how these changes affect your model’s performance! 😊

See also

Neuromorphic | Probabilistic learning