Data Formats and Dataframes¶
Questions
How can I manipulate and wrangle dataframes in Julia?
How can I handle missing data in a DataFrame in Julia?
How can I merge data in Julia?
How can I use the Fourier transform to analyze climate data in Julia?
Instructor-note
35 min teaching
30 min exercises
Working with data¶
We will now explore a Julian approach to a use case common to many scientific disciplines: data manipulation and visualization. Julia is well suited to data science problems: it performs well and removes the need to translate computationally demanding parts into another language.
Here we will learn how to work with data using the DataFrames package and how to visualize it with the Plots and StatsPlots packages.
DataFrame in Julia¶
In Julia, a DataFrame is a two-dimensional table-like data structure, similar to an Excel spreadsheet or a SQL table. It is part of the DataFrames.jl package, which provides a powerful and flexible way to manipulate and analyze data in Julia.
A DataFrame consists of columns and rows.
The rows usually represent independent observations, while the columns represent the features (variables) for each observation. You can perform various operations on a DataFrame, such as filtering, sorting, grouping, joining, and aggregating data.
The DataFrames.jl package offers functionality similar to the pandas library in Python and the data.frame type in R.
It provides a rich set of functions for data cleaning and transformation, making it a popular choice for data science and machine learning tasks in Julia. Visualization is handled by separate plotting packages that work with DataFrames.
Download a dataset¶
We start by downloading a dataset containing measurements of characteristic features of different penguin species.
Artwork by @allison_horst¶
The dataset is bundled within the PalmerPenguins package, so we need to add that:
using Pkg
Pkg.add("PalmerPenguins")
using PalmerPenguins
We will use DataFrames here to analyze the penguins dataset, but first we need to install it:
Pkg.add("DataFrames")
using DataFrames
Here’s how you can create a new dataframe:
using DataFrames
names = ["Ali", "Clara", "Jingfei", "Stefan"]
age = ["25", "39", "14", "45"]
df = DataFrame(name=names, age=age)
4×2 DataFrame
 Row │ name     age
     │ String   String
─────┼─────────────────
   1 │ Ali      25
   2 │ Clara    39
   3 │ Jingfei  14
   4 │ Stefan   45
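Note that the ages above are stored as strings. The following is a minimal sketch of a few basic operations on such a table (type conversion, filtering, sorting); the variable names `people` and `adults` and the age threshold are illustrative choices, not part of the lesson.

```julia
using DataFrames

# Same toy data as above; `people` is an illustrative name.
people = DataFrame(name = ["Ali", "Clara", "Jingfei", "Stefan"],
                   age  = ["25", "39", "14", "45"])

# Convert the age column from String to Int so it can be compared numerically.
people.age = parse.(Int, people.age)

# Keep rows where age is at least 18, then sort by age in descending order.
adults = sort(filter(:age => >=(18), people), :age, rev = true)

# Access a single column as a vector.
adults.name   # ["Stefan", "Clara", "Ali"]
```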
Inspect dataset¶
So, in summary, this code is filling in missing values in the bill_length_mm column by estimating their value based on a linear interpolation of the non-missing values. This can be a useful way to handle missing data when you don’t want to or can’t simply ignore those missing values. 😊
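A minimal sketch of what such a fill could look like, assuming values are interpolated by row position between the nearest non-missing neighbours. The helper `interpolate_missing` and the sample values are hypothetical, and this is not necessarily the code the summary above refers to.

```julia
# Fill `missing` entries by linear interpolation between the nearest
# non-missing neighbours. Leading and trailing gaps are left as `missing`.
function interpolate_missing(v::AbstractVector)
    out = Vector{Union{Missing, Float64}}(v)
    known = findall(!ismissing, v)            # indices with observed values
    for (lo, hi) in zip(known[1:end-1], known[2:end])
        for i in (lo + 1):(hi - 1)            # indices inside a gap
            t = (i - lo) / (hi - lo)          # fractional position in the gap
            out[i] = (1 - t) * v[lo] + t * v[hi]
        end
    end
    return out
end

# Illustrative values only.
bill_length = [39.1, missing, 40.3, missing, missing, 43.3]
filled = interpolate_missing(bill_length)
# filled is approximately [39.1, 39.7, 40.3, 41.3, 42.3, 43.3]
```

On a DataFrame, the same helper could be applied to a column, for example `df.bill_length_mm = interpolate_missing(df.bill_length_mm)`. Interpolating by row position only makes sense when neighbouring rows are related, so simpler alternatives such as `dropmissing(df, :bill_length_mm)` are often preferable.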
(Optional) Long vs Wide Data Format¶
The data is in a so-called wide format.
In data analysis, we often encounter two types of data formats: long format and wide format. https://www.statology.org/long-vs-wide-data/
Long format: In this format, each row is a single observation, and each column is a variable. This format is also known as “tidy” data.
Wide format: In this format, each row is a subject, and each column is an observation. This format is also known as “spread” data.
The DataFrames.jl package provides functions to reshape data between long and wide formats: stack converts from wide to long, and unstack converts from long to wide. (The older melt function has been removed in favour of stack.)
Further examples can be found in the official documentation.
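using Statistics # provides mean
# To convert from wide to long format
# First we create an ID column
df.id = 1:nrow(df)
df_long = stack(df, Not([:species, :id]))
# To convert from long to wide format
df_wide = unstack(df_long, :variable, :value)
# or, when several rows share the same key, combine them with a
# custom combine function
function custom_combine(x)
    vals = collect(skipmissing(x))
    # Nothing to combine if every value in the group is missing
    isempty(vals) && return missing
    # Average numeric values; otherwise keep the first value
    return all(v -> v isa Number, vals) ? mean(vals) : first(vals)
end
# Unstack DataFrame with custom combine function
unstack(df_long, :species, :variable, :value, combine = custom_combine)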
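To make the reshaping easier to follow, here is a minimal round trip on a small hand-made table. The values are illustrative and the names `wide`, `long`, and `wide_again` are hypothetical.

```julia
using DataFrames

# A small wide table: one row per penguin, one column per measurement.
wide = DataFrame(id = 1:3,
                 species = ["Adelie", "Gentoo", "Chinstrap"],
                 bill_length_mm = [39.1, 46.1, 46.5],
                 bill_depth_mm  = [18.7, 13.2, 17.9])

# Wide -> long: the two measurement columns become (variable, value) pairs,
# so each penguin now occupies two rows.
long = stack(wide, [:bill_length_mm, :bill_depth_mm])

# Long -> wide: spread the `variable` column back into separate columns,
# using the remaining columns (id, species) as row keys.
wide_again = unstack(long, :variable, :value)
```

The long table has six rows (three penguins times two measurements) and the columns `id`, `species`, `variable`, and `value`.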
Split-apply-combine workflows¶
Oftentimes, data analysis workflows include three steps:
- Splitting/stratifying a dataset into different groups;
- Applying some function/modification to each group;
- Combining the results.
This is commonly referred to as a “split-apply-combine” workflow. In Julia it can be
achieved with the groupby function to stratify the data and
the combine function to aggregate each group with some reduction operator.
An example of this is provided below:
using Statistics
# Split-apply-combine
df_grouped = groupby(df, [:species, :island])
df_combined = combine(df_grouped, :body_mass_g => mean)
In this example, groupby(df, [:species, :island]) groups the DataFrame by the species and island columns.
Then, combine(df_grouped, :body_mass_g => mean) applies mean as the aggregation function to the body_mass_g column of each group.
The result is a new DataFrame with one row per unique species-island combination
and a column named body_mass_g_mean holding the mean body mass of that group.
If a group contains missing values, mean returns missing for that group.
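A minimal sketch of the same workflow on a small hand-made table, showing one way to skip missing values during aggregation. The values are illustrative and the names `penguins`, `grouped`, and `result` are hypothetical.

```julia
using DataFrames, Statistics

# Illustrative values only, with one missing measurement.
penguins = DataFrame(species = ["Adelie", "Adelie", "Gentoo", "Gentoo"],
                     island  = ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
                     body_mass_g = [3750, missing, 5000, 5200])

# Split: one group per species-island combination.
grouped = groupby(penguins, [:species, :island])

# Apply and combine: `mean ∘ skipmissing` ignores missing entries, whereas a
# plain `mean` would return `missing` for any group that contains one.
# `nrow` counts the rows in each group, including those with missing values.
result = combine(grouped,
                 :body_mass_g => mean ∘ skipmissing => :mean_body_mass_g,
                 nrow => :n)
```

Here the Adelie group averages over its single non-missing value, and the Gentoo group averages 5000 and 5200.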
(Optional) Creating and merging DataFrames¶
Creating DataFrames¶
In Julia, you can create a DataFrame from scratch using the DataFrame constructor from the DataFrames package.
This constructor allows you to create a DataFrame by passing column vectors as keyword arguments or pairs.
For example, to create a DataFrame with two columns named :A and :B, the following works:
DataFrame(A = 1:3, B = ["x", "y", "z"])
A DataFrame can also be created from other data structures such as dictionaries, named tuples, vectors of vectors, matrices, and more. You can find more information about creating DataFrames in Julia in the official documentation.
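A few of these constructors are sketched below, assuming DataFrames.jl 1.x; the variable names are illustrative.

```julia
using DataFrames

# From a vector of named tuples (one tuple per row).
df_rows = DataFrame([(A = 1, B = "x"), (A = 2, B = "y")])

# From a dictionary mapping column names to column vectors.
# Columns built from a plain Dict are ordered alphabetically by name.
df_dict = DataFrame(Dict("B" => ["x", "y", "z"], "A" => 1:3))

# From a matrix; `:auto` generates the column names x1, x2, ...
df_mat = DataFrame([1 2; 3 4], :auto)

# From pairs, which preserves the given column order.
df_pairs = DataFrame("B" => ["x", "y"], "A" => [1, 2])
```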
Merging DataFrames¶
You can also merge two or more DataFrames using the join functions from the DataFrames package.
There is one function per join type: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin, and crossjoin. (Older releases used a single join function with a kind argument, which has since been removed.)
You specify the columns used to determine which rows should be combined during a join by passing them as the on argument.
For example, to perform an inner join on two DataFrames df1 and df2 using the :ID column as the key, you can use the following code:
innerjoin(df1, df2, on = :ID)
You can find more information about joining DataFrames in Julia in the official documentation.
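A minimal sketch of several join types on two toy tables, assuming DataFrames.jl 1.x; the table names, column names, and values are illustrative.

```julia
using DataFrames

people = DataFrame(ID = [1, 2, 3], name = ["Ali", "Clara", "Jingfei"])
jobs   = DataFrame(ID = [2, 3, 4], job  = ["Doctor", "Student", "Engineer"])

# Rows whose ID appears in both tables (IDs 2 and 3).
inner = innerjoin(people, jobs, on = :ID)

# All rows of `people`; `job` is missing where there is no match (ID 1).
left = leftjoin(people, jobs, on = :ID)

# Rows from either table (IDs 1 to 4), with missing values where needed.
outer = outerjoin(people, jobs, on = :ID)

# Rows of `people` that have no match in `jobs` (ID 1).
anti = antijoin(people, jobs, on = :ID)
```

The row order of a join result is not necessarily the order of the input tables, so sort the result explicitly if the order matters.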
Plotting¶
Let us now look at different ways to visualize this data. Many different plotting libraries exist for Julia and which one to use will depend on the specific use case as well as personal preference.
We will be using Plots.jl and StatsPlots.jl, but we encourage you to explore other packages to find the one that best fits your use case.
First we install Plots.jl and StatsPlots.jl:
Pkg.add("Plots")
Pkg.add("StatsPlots")
Here’s how a simple line plot works:
using Plots
gr() # set the backend to GR
x = 1:10; y = rand(10, 2)
plot(x, y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
In VSCode, the plot should appear in a new plot pane. We can add labels:
xlabel!("x label")
ylabel!("y label")
To add a line to an existing plot, we mutate it with plot!:
z = rand(10)
plot!(x, z)
Finally, we can save the plot to a file:
savefig("myplot.png")
myplot.png¶
Multiple subplots can be created by:
y = rand(10, 4)
p1 = plot(x, y); # Make a line plot
p2 = scatter(x, y); # Make a scatter plot
p3 = plot(x, y, xlabel = "This one is labelled", lw = 3, title = "Subtitle");
p4 = histogram(x, y); # Four histograms each with 10 points? Why not!
plot(p1, p2, p3, p4, layout = (2, 2), legend = false)
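StatsPlots.jl adds the `@df` macro, which lets plotting calls refer to DataFrame columns by name. A minimal sketch on a small hand-made table follows; the values are illustrative and the column names are assumed to match the penguins dataset.

```julia
using DataFrames, StatsPlots

# Illustrative values only; column names follow the penguins dataset.
penguins = DataFrame(species = ["Adelie", "Adelie", "Gentoo", "Gentoo"],
                     bill_length_mm = [39.1, 38.2, 46.1, 47.3],
                     bill_depth_mm  = [18.7, 18.1, 13.2, 14.0])

# Inside `@df`, symbols refer to columns of the DataFrame.
# `group = :species` draws one series (with its own colour) per species.
p = @df penguins scatter(:bill_length_mm, :bill_depth_mm,
                         group = :species,
                         xlabel = "Bill length (mm)",
                         ylabel = "Bill depth (mm)")
```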
Exercises¶
See also¶
- You can create interactive 3D scatter plots in Julia using the PlotlyJS package.