Data Formats and Dataframes#

Questions

  • How can I manipulate and wrangle dataframes in Julia?

  • How can I handle missing data in a DataFrame in Julia?

  • How can I merge data in Julia?

  • How can I use the Fourier transform to analyze climate data in Julia?

Instructor-note

  • 35 min teaching

  • 30 min exercises

Working with data#

We will now explore a Julian approach to a use case common to many scientific disciplines: manipulating data and visualization. Julia is a good language to use for data science problems as it will perform well and alleviate the need to translate computationally demanding parts to another language.

Here we will learn how to work with data using the DataFrames package, visualize it with the Plots and StatsPlots.

DataFrame in Julia#

In Julia, a DataFrame is a two-dimensional table-like data structure, similar to a Excel spreadsheet or a SQL table. It is part of the DataFrames.jl package, which provides a powerful and flexible way to manipulate and analyze data in Julia.

A DataFrame consists of columns and rows.

../_images/01_table_dataframe.svg

The rows usually represent independent observations, while the columns represent the features (variables) for each observation. You can perform various operations on a DataFrame, such as filtering, sorting, grouping, joining, and aggregating data.

The DataFrames.jl package is Julia’s version of the pandas library in Python and the data.frame() function in R. DataFrames.jl also provides a rich set of functions for data cleaning, transformation, and visualization, making it a popular choice for data science and machine learning tasks in Julia. Just like in Python and R, the DataFrames.jl package provides functionality for data manipulation and analysis.

Download a dataset#

We start by downloading a dataset containing measurements of characteristic features of different penguin species.

../_images/lter_penguins.png

Artwork by @allison_horst#

To obtain the data we simply add the PalmerPenguins package.

Pkg.add("PalmerPenguins")
using PalmerPenguins

We will use DataFrames here to analyze the penguins dataset, but first we need to install it:

Pkg.add("DataFrames")
using DataFrames

Here’s how you can create a new dataframe:

using DataFrames
names = ["Ali", "Clara", "Jingfei", "Stefan"]
age = ["25", "39", "14", "45"]
df = DataFrame(; name=names, age=age)
4×2 DataFrame
Row │ name        age
    │ String      String
────┼────────────────────
  1 │ Ali         25
  2 │ Clara       39
  3 │ Jingfei     14
  4 │ Stefan      45

Todo

Dataframes

The following code loads the PalmerPenguins dataset into a DataFrame. Then it demonstrates how to write and read the data in CSV, JSON, and Parquet formats using the CSV, JSONTables, and Parquet packages respectively.

More about Types of scientific data one can find at ENCCS High Performance Data Analytics in Python training.

using DataFrames
# Load the PalmerPenguins dataset
table = PalmerPenguins.load()
df = DataFrame(table)

We now create a dataframe containing the PalmerPenguins dataset. Note that the table variable is of type CSV.File; the PalmerPenguins package uses the CSV.jl package for fast loading of data. Note further that DataFrame can accept a CSV.File object and read it into a dataframe!

using PalmerPenguins
table = PalmerPenguins.load()
df = DataFrame(table)

# the raw data can be loaded by
#tableraw = PalmerPenguins.load(; raw = true)

first(df, 5)
344×7 DataFrame
 Row │ species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     │ String     String     Float64?        Float64?       Int64?             Int64?       String?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie   Torgersen            39.1           18.7                181         3750  male
   2 │ Adelie   Torgersen            39.5           17.4                186         3800  female
   3 │ Adelie   Torgersen            40.3           18.0                195         3250  female
   4 │ Adelie   Torgersen       missing        missing              missing      missing  missing
   5 │ Adelie   Torgersen            36.7           19.3                193         3450  female

Inspect dataset#

Todo

We can inspect the data using a few basic operations:

# slicing
df[1, 1:3]

# slicing and column name (can also use "island")
df[1:20:100, :island]

# dot syntax (editing will change the dataframe)
df.species

# get a copy of a column
df[:, [:sex, :body_mass_g]]

# access column directly without copying (editing will change the dataframe)
df[!, :bill_length_mm]

# get size
size(df), ncol(df), nrow(df)

# find unique species
unique(df.species)

Summary statistics can be displayed with the describe function:

describe(df)
7×7 DataFrame
 Row │ variable           mean     min     median  max        nmissing  eltype
     │ Symbol             Union…   Any     Union…  Any        Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────────
   1 │ species                     Adelie          Gentoo            0  String
   2 │ island                      Biscoe          Torgersen         0  String
   3 │ bill_length_mm     43.9219  32.1    44.45   59.6              2  Union{Missing, Float64}
   4 │ bill_depth_mm      17.1512  13.1    17.3    21.5              2  Union{Missing, Float64}
   5 │ flipper_length_mm  200.915  172     197.0   231               2  Union{Missing, Int64}
   6 │ body_mass_g        4201.75  2700    4050.0  6300              2  Union{Missing, Int64}
   7 │ sex                         female          male             11  Union{Missing, String}

We can see in the output of describe that the element type of all the columns is a union of missing and a numeric type. This implies that our dataset contains missing values. More about Tidy Data concept can be found in the Python for Scientific Computing training by Aalto Scientific Computing: https://aaltoscicomp.github.io/python-for-scicomp/pandas/#tidy-data

We can remove these missing values by the dropmissing or dropmissing! functions (what is the difference between them?):

dropmissing!(df)

Alternatively, we can use:

# Missing data
# Replacing missing values with a specific value
df = coalesce.(df, 0)

The code shows how to handle missing data in the bill_length_mm column by replacing missing values with a specific value using the coalesce function or by interpolating missing values using the Interpolations package.

# Interpolating missing values
using Interpolations
mask = ismissing.(df[:bill_length_mm])
itp = interpolate(df[:bill_length_mm][.!mask], BSpline(Linear()))
df[:bill_length_mm][mask] .= itp.(findall(mask))

It throws the issue because the syntax df[column] is not supported in Julia 1.9.0. Here is the correct code:

# Interpolating missing values
using Interpolations
mask = ismissing.(df.bill_length_mm)
itp = interpolate(df[!, :bill_length_mm][.!mask], BSpline(Linear()))
df[!, :bill_length_mm][mask] .= itp.(findall(mask))

mask = ismissing.(df.bill_length_mm): This line is creating a logical mask that is true wherever there are missing values (NaN) in the bill_length_mm column and false elsewhere.

itp = interpolate(df[!, :bill_length_mm][.!mask], BSpline(Linear())): This line is creating an interpolation object itp. It’s using only the non-missing values of the bill_length_mm column (specified by df[!, :bill_length_mm][.!mask]) and a linear B-spline interpolation method (BSpline(Linear())).

df[!, :bill_length_mm][mask] .= itp.(findall(mask)): This line is replacing the missing values in the bill_length_mm column with the interpolated values. It’s finding the indices of the missing values with findall(mask) and then using the interpolation object itp to estimate values at these indices.

So, in summary, this code is filling in missing values in the bill_length_mm column by estimating their value based on a linear interpolation of the non-missing values. This can be a useful way to handle missing data when you don’t want to or can’t simply ignore those missing values. 😊

(Optional) Long vs Wide Data Format#

The data is in a so-called wide format.

In data analysis, we often encounter two types of data formats: long format and wide format. https://www.statology.org/long-vs-wide-data/

  • Long format: In this format, each row is a single observation, and each column is a variable. This format is also known as “tidy” data.

  • Wide format: In this format, each row is a subject, and each column is an observation. This format is also known as “spread” data.

The DataFrames.jl package provides functions to reshape data between long and wide formats. These functions are stack, unstack, melt, and pivot. Detailed tutorial: https://dataframes.juliadata.org/stable/man/reshaping_and_pivoting/

# To convert from wide to long format
df_long = stack(df, Not(:species))

# To convert from long to wide format
df_wide = unstack(df_long, :species, :variable, :value)

# or
# Custom combine function
function custom_combine(x)
   if eltype(x) <: Number
      return mean(skipmissing(x))
   else
      return first(skipmissing(x))
   end
end

# Unstack DataFrame with custom combine function
df_wide = unstack(df_long, :species, :variable, :value, combine = custom_combine)

(Optional) Reshaping and Pivoting#

The pivot function can be used to reshape data (from long to wide format) and also perform aggregation.

using Statistics

# Pivot data with aggregation
df_grouped = groupby(df, [:species, :island])
df_pivot = combine(df_grouped, :body_mass_g => mean)

In this example, groupby(df, [:species, :island]) groups your DataFrame by the species and island columns. Then, combine(df_grouped, :body_mass_g => mean) calculates the mean of the body_mass_g column for each group. The mean function is used for aggregation.

The result is a new DataFrame where each unique value in the :species column forms a row, each unique value in the :island column forms a column, and the mean body mass for each species-island combination fills the DataFrame.

Note that if you don’t provide an aggregation function and there are multiple values for a given row-column combination, pivot will throw an error. To handle this, you can provide an aggregation function like mean, sum, etc., which will be applied to all values that fall into each cell of the resulting DataFrame.

Creating and merging DataFrames like in SQL#

Creating DataFrames

In Julia, you can create a DataFrame from scratch using the DataFrame constructor from the DataFrames package. This constructor allows you to create a DataFrame by passing column vectors as keyword arguments or pairs. For example, to create a DataFrame with two columns named :A and :B, you can use the following code: DataFrame(A = 1:3, B = [“x”, “y”, “z”]) You can also create a DataFrame from other data structures such as dictionaries, named tuples, vectors of vectors, matrices, and more. You can find more information about creating DataFrames in Julia in the official documentation

Merging DataFrames

Also, you can merge two or more DataFrames using the join function from the DataFrames package. This function allows you to perform various types of joins, such as inner join, left join, right join, outer join, semi join, and anti join. You can specify the columns used to determine which rows should be combined during a join by passing them as the on argument to the join function. For example, to perform an inner join on two DataFrames df1 and df2 using the :ID column as the key, you can use the following code: join(df1, df2, on = :ID, kind = :inner). You can find more information about joining DataFrames in Julia in the official documentation

Plotting#

Let us now look at different ways to visualize this data. Many different plotting libraries exist for Julia and which one to use will depend on the specific use case as well as personal preference.

We will be using Plots.jl and StatsPlots.jl but we encourage to explore these other packages to find the one that best fits your use case.

First we install Plots.jl and StatsPlots backend:

Pkg.add("Plots")
Pkg.add("StatsPlots")

Here’s how a simple line plot works:

using Plots
gr()  # set the backend to GR

x = 1:10; y = rand(10, 2)
plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)

In VSCode, the plot should appear in a new plot pane. We can add labels:

xlabel!("x label")
ylabel!("y label")

To add a line to an existing plot, we mutate it with plot!:

z = rand(10)
plot!(x, z)

Finally we can save to the plot to a file:

savefig("myplot.png")
../_images/myplot.png

myplot.png#

Multiple subplots can be created by:

y = rand(10, 4)

p1 = plot(x, y); # Make a line plot
p2 = scatter(x, y); # Make a scatter plot
p3 = plot(x, y, xlabel = "This one is labelled", lw = 3, title = "Subtitle");
p4 = histogram(x, y); # Four histograms each with 10 points? Why not!
plot(p1, p2, p3, p4, layout = (2, 2), legend = false)

Todo

Visualizing the Penguin dataset

First load Plots and set the backend to GR (precompilation of Plots might take some time):

using Plots
gr()

For the Penguin dataset it is more appropriate to use scatter plots, for example:

scatter(df[!, :bill_length_mm], df[!, :bill_depth_mm])

We can adjust the markers by this list of named colors and this list of marker types:

scatter(df[!, :bill_length_mm], df[!, :bill_depth_mm], marker = :hexagon, color = :magenta)

We can also change the plot theme according to this list of themes, for example:

theme(:dark)
# then re-execute the scatter function

We can add a dimension to the plot by grouping by another column. Let’s see if the different penguin species can be distiguished based on their bill length and bill depth. We also set different marker shapes and colors based on the grouping, and adjust the markersize and transparency (alpha):

scatter(df[!, :bill_length_mm],
        df[!, :bill_depth_mm],
        xlabel = "bill length (mm)",
        ylabel = "bill depth (g)",
        group = df[!, :species],
        marker = [:circle :ltriangle :star5],
        color = [:magenta :springgreen :blue],
        markersize = 5,
        alpha = 0.8
        )
../_images/penguin_scatter.png

The scatter function comes from the base Plots package. StatsPlots provides many other types of plot types, for example density. To use dataframes with StatsPlots we need to use the @df macro which allows passing columns as symbols (this can also be used for scatter and other plot functions):

using StatsPlots

@df df density(:flipper_length_mm,
               xlabel = "flipper length (mm)",
               group = :species,
               color = [:magenta :springgreen :blue],
               )
../_images/penguin_density.png

Exercises#

Todo

Create a custom plotting function

Convert the final scatter plot in the type-along section “Visualizing the Penguin dataset” and convert it into a create_scatterplot function:

  • The function should take as arguments a dataframe and two column symbols.

  • Use the minimum() and maximum() functions to automatically set the x-range of the plot using the xlim = (xmin, xmax) argument to scatter().

  • If you have time, try grouping the data by :island or :sex instead of :species (keep in mind that you may need to adjust the number of marker symbols and colors).

  • If you have more time, play around with the plot appearance using theme() and the marker symbols and colors.

Todo

Working with DataFrames in Julia

In this exercise, you will practice reading data from CSV files into DataFrames, manipulating data in DataFrames, and visualizing data using a plotting package.

  1. Install the CSV and DataFrames packages by running the following commands in the Julia REPL:

    using Pkg
    Pkg.add("CSV")
    Pkg.add("DataFrames")
    

2. Set the relative path to the DailyDelhiClimateTest.csv and DailyDelhiClimateTrain.csv files in the path_test and path_train variables. Assume that the path to your files is juliaforhpda/data and you are currently in the juliaforhpda/ directory in the Julia REPL. The data is available here: ENCCS/julia-for-hpda and ENCCS/julia-for-hpda

This climate data set contains daily mean temperature, humidity, wind speed and mean pressure at a location in Dehli India over a period of several years. The data set was downloaded from here.

  1. Read the data from the CSV files into DataFrames named df_test and df_train using the CSV.read function.

  2. Use the functions provided by the DataFrames package to manipulate the data in the DataFrames. For example, you can select columns, filter rows, group data, compute summary statistics, and compute aggregate functions.

  3. Install a plotting package such as Plots or Gadfly by running the following command in the Julia REPL:

    using Pkg
    Pkg.add("Plots")
    
  4. Use the plotting package to create a line plot of the mean of the meantemp column for each group in a grouped DataFrame. Customize the appearance of the plot by changing its properties such as color, line style, marker style, etc.

Todo

Working with the Fourier Transform in Julia

In this exercise, you will practice computing the Fourier transform of climate data using the FFTW package in Julia.

  1. Install the FFTW package by running the following command in the Julia REPL:

    using Pkg
    Pkg.add("FFTW")
    

2. Read the data from the DailyDelhiClimateTest.csv and DailyDelhiClimateTrain.csv files into DataFrames named df_test and df_train using the CSV.read function.

  1. Compute the Fourier transform of the meantemp column in the df_test DataFrame using the fft function from the FFTW package.

  2. Compute the frequencies corresponding to each element of the Fourier transform using the fftfreq function.

  3. Plot the magnitude of the Fourier transform against the frequencies to visualize the frequency spectrum of the signal.

I hope this exercise helps you practice working with the Fourier transform in Julia!

See also#

You can create interactive 3D scatter plots in Julia using the PlotlyJS package.