Tabular data (aka Dataframes)

Questions

  • What are series and dataframes?

  • What do we mean by tidy and untidy data?

  • What packages are available in Python to handle dataframes?

Objectives

  • Learn how to manipulate dataframes in Pandas

  • Understand lazy and eager dataframes in Polars

This episode will give an introduction to the concepts of Series and DataFrame and how they can be manipulated using different Python packages.

Series and dataframes

A collection of observations (e.g. a time series or simply a set of observations of a feature of a phenomenon) can be represented by a homogeneous vector, i.e. an array where all the elements are of the same type. This is known as a Series in many frameworks. Several series (of different types) can be used as columns of a tabular structure called a Dataframe, as depicted in the figure below.

Figure: structure of a dataframe built from several Series.
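
As a minimal sketch (using Pandas, which is introduced later in this episode), a Series and a DataFrame could be built as follows; the variable names are just chosen for the example:

import pandas as pd

# A Series: a homogeneous, labelled 1-D array
times = pd.Series([64, 80, 96], name="time")

# A DataFrame: several Series (possibly of different types) used as columns
runners = pd.DataFrame({
    "Runner": ["Runner 1", "Runner 2", "Runner 3"],  # string column
    "time": times,                                   # integer column
})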

Tidy vs untidy dataframes

Let us look at the following two dataframes, which record the time (e.g. in seconds) taken by three runners to cover four distances (e.g. 400, 800, 1200 and 1500 metres):

Untidy (wide) format:

     Runner  400  800  1200  1500
0  Runner 1   64  128   192   240
1  Runner 2   80  160   240   300
2  Runner 3   96  192   288   360

Tidy (long) format:

      Runner  distance  time
0   Runner 1       400    64
1   Runner 2       400    80
2   Runner 3       400    96
3   Runner 1       800   128
4   Runner 2       800   160
5   Runner 3       800   192
6   Runner 1      1200   192
7   Runner 2      1200   240
8   Runner 3      1200   288
9   Runner 1      1500   240
10  Runner 2      1500   300
11  Runner 3      1500   360

Most tabular data is in either a tidy or an untidy format (some people refer to these as the long and the wide format). The main differences are summarised below:

  • In untidy (wide) format, each row represents an observation consisting of multiple variables, and each variable has its own column. This is intuitive and makes it easy for us to understand the data, compare different variables, calculate statistics, etc.

  • In tidy (long) format, i.e. a column-oriented format, each row represents only one variable of the observation, and can be considered “computer readable”. When it comes to data analysis with Pandas, the tidy format is recommended because:

      • Each column can be stored as a vector, which not only saves memory but also allows for vectorised calculations, which are much faster.

      • It is easier to filter, group, join and aggregate the data.

Imagine, for example, that you would like to compute the speed as speed=distance/time. The untidy format would make this much clunkier, as:

  • The distances are encoded as column names, not as data points (rows)

  • The speed would have to be stored in a new dataframe since it would not fit in that data structure.

In comparison, in a tidy dataframe this computation is a simple operation between two columns, as sketched below.
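
To make this concrete, here is a minimal sketch in Pandas: the wide runner table is rebuilt by hand, melted into tidy form, and the speed becomes a one-line column operation (the names wide and tidy are just chosen for the example):

import pandas as pd

# The untidy (wide) runner table: one column per distance
wide = pd.DataFrame({
    "Runner": ["Runner 1", "Runner 2", "Runner 3"],
    400: [64, 80, 96],
    800: [128, 160, 192],
    1200: [192, 240, 288],
    1500: [240, 300, 360],
})

# Melt to tidy (long) format: one row per (runner, distance) observation
tidy = wide.melt(id_vars="Runner", var_name="distance", value_name="time")
tidy["distance"] = tidy["distance"].astype(int)

# Speed is now a simple operation between two columns
tidy["speed"] = tidy["distance"] / tidy["time"]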

Tip

Recovering a wide dataframe from a tidy one is commonly referred to as pivoting. Most dataframe libraries provide a pivot() or pivot_table() function.
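
Continuing the sketch above (and reusing its hypothetical tidy dataframe), the wide table can be recovered with a single pivot call:

# Rebuild one column per distance from the tidy dataframe created earlier
wide_again = tidy.pivot(index="Runner", columns="distance", values="time")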

Pandas & Polars

Historically, Pandas has been the go-to package for handling dataframes in Python. It is based on NumPy (each column is a NumPy array) and has been the traditional workhorse for tabular data, with a stable API and a large ecosystem built around it, including the Seaborn statistical plotting framework.

More recently, Polars was introduced as a more modern and faster alternative for handling dataframes. It is written in Rust and supports out-of-core evaluation out of the box (i.e. it does not need to load the whole dataset into memory), lazy evaluation of queries, and it automatically uses multiple threads. Moreover, experimental GPU support is available through cuDF.

In the remainder of this episode, the NYC taxi dataset will be used to showcase how datasets can be accessed, summarised and manipulated in both Pandas and Polars. The dataset can be downloaded in Parquet format from the link above (the file for the month of January was used in this case). It contains information about taxi trips performed in New York, such as the ID of the vendor, the total fare, pickup and drop-off time and location (expressed as an ID), type of payment, whether additional fees were charged, and more.

Opening a dataset

Assuming the file is called yellow_tripdata_2025-01.parquet, the dataset can be opened as:

import pandas as pd 
df = pd.read_parquet("yellow_tripdata_2025-01.parquet")
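
For comparison, a minimal sketch of the same step in Polars (the name df_pl is only used here to distinguish it from the Pandas dataframe):

import polars as pl

# Eagerly read the same Parquet file into a Polars DataFrame
df_pl = pl.read_parquet("yellow_tripdata_2025-01.parquet")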

Description and summarisation

We can get a first understanding of the contents of a dataframe by printing the first few lines, the “schema” (i.e. the name and data type of each column) and summary statistics as follows:

df.head()
df.info()
df.describe()
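
In Polars, the rough equivalents would look like this (a sketch, assuming the df_pl dataframe from the Polars snippet above):

df_pl.head()       # first few rows
df_pl.schema       # column names and data types
df_pl.describe()   # summary statistics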

Indexing

We can index data in the dataframe as follows:

# With this we can select a column
df['VendorID'] # Could also be df.VendorID 
# Get a row 
df.iloc[1000,:]

In both cases, a similar syntax can be used for in-place modification (e.g. df.loc[row, column] = ... in Pandas). Please note that this kind of element-wise replacement carries a big performance penalty, especially in Polars, which is designed to perform column-wide operations with minimal overhead. In Polars, such operations are commonly expressed through the expression API, as detailed later in this episode.
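
For reference, a sketch of similar element access in Polars (assuming the df_pl dataframe from above; note that Polars has no .iloc accessor):

df_pl["VendorID"]   # select a column (returns a Polars Series)
df_pl.row(1000)     # get a single row as a Python tuple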

Common workflows

It is quite common to stratify the data (i.e. divide the samples into a number of groups based on categorical variables) in order to produce descriptive statistics (i.e. statistics that summarise the samples and do not aim to predict anything about the population they come from). This is commonly achieved through a group-by workflow, where the following happens:

  • Splitting: data is partitioned into different groups based on some criterion

  • Applying: applying a function/performing a calculation to each group

  • Combining: assembling a dataframe (of potentially any size) with the results.

This type of workflow is represented below.

Figure: the split-apply-combine workflow (source: Earth and environmental data science).

As an example, let us try to compute the total fare for each hour, split by payment type.

# First, let us extract the hour from the tpep_pickup_datetime column
df["hour"] = df["tpep_pickup_datetime"].dt.hour

hourly_fare = (
    df.groupby(["hour", "payment_type"], observed=False)["fare_amount"]
    .sum()
    .reset_index()
    .sort_values(["hour", "payment_type"])
)

The groupby statement is used to stratify the fare_amount column by hour and payment type. Then the amounts per hour and type get summed and sorted according to time and payment type.
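
For comparison, a sketch of the same computation in Polars, assuming the df_pl dataframe read with pl.read_parquet as in the earlier snippet:

import polars as pl

hourly_fare_pl = (
    df_pl.with_columns(pl.col("tpep_pickup_datetime").dt.hour().alias("hour"))
         .group_by(["hour", "payment_type"])
         .agg(pl.col("fare_amount").sum())
         .sort(["hour", "payment_type"])
)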

Idiomatic Polars

Polars introduces a few variations to dataset operations compared to the traditional Pandas approach. In particular, a domain-specific language (DSL) was developed, where expressions are written to represent dataset operations and contexts provide the environment where they produce a result.

Expressions

Let’s say that we created a trip_duration_sec column in our NYC cab dataframe and, given the trip_distance column, we want to compute the average speed. In Polars, this can be achieved with:

pl.col('trip_distance') / pl.col('trip_duration_sec')

This is a lazy representation of an operation we want to perform, which can be further manipulated or just printed. For it to actually produce data, a context is needed.
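
For instance, a small sketch of building and inspecting such an expression (no dataframe is involved yet; the name speed_expr is just chosen for the example):

import polars as pl

# Build the expression: a lazy recipe for a computation
speed_expr = pl.col("trip_distance") / pl.col("trip_duration_sec")

print(speed_expr)   # prints a representation of the expression, not data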

Contexts

The same Polars expression can produce different results depending on the context where it is used. Four common contexts include:

  • select

  • with_columns

  • filter

  • group_by

Both select and with_columns can produce new columns, which may be aggregations, combinations of other columns, or literals. The difference between the two is that select returns only the columns produced by its input expressions, whereas with_columns returns a new dataframe containing all the columns of the original dataframe plus the new ones created by the expressions. To exemplify, using our earlier example of computing the average speed during a trip, select yields a single column (which inherits the name trip_distance unless it is renamed), whereas with_columns returns the original dataframe with an additional column, here aliased avg_speed_mph:

df.select(pl.col('trip_distance')/pl.col('trip_duration_sec')*3600)
shape: (3_475_226, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 11.497006     │
│ 11.764706     │
│ 18.461538     │
│ 5.60479       │
│ 11.207547     │
│ …             │
│ 13.68899      │
│ 19.42398      │
│ 9.879418      │
│ 9.339901      │
│ 12.781395     │
└───────────────┘
df.with_columns((pl.col('trip_distance')/pl.col('trip_duration_sec')*3600).alias("avg_speed_mph"))
shape: (3_475_226, 22)
┌──────────┬──────────────────────┬───────────────────────┬───┬───────────────────┬───────────────┐
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime ┆ … ┆ trip_duration_sec ┆ avg_speed_mph │
│ ---      ┆ ---                  ┆ ---                   ┆   ┆ ---               ┆ ---           │
│ i32      ┆ datetime[μs]         ┆ datetime[μs]          ┆   ┆ i64               ┆ f64           │
╞══════════╪══════════════════════╪═══════════════════════╪═══╪═══════════════════╪═══════════════╡
│ 1        ┆ 2025-01-01 00:18:38  ┆ 2025-01-01 00:26:59   ┆ … ┆ 501               ┆ 11.497006     │
│ 1        ┆ 2025-01-01 00:32:40  ┆ 2025-01-01 00:35:13   ┆ … ┆ 153               ┆ 11.764706     │
│ …        ┆ …                    ┆ …                     ┆ … ┆ …                 ┆ …             │
│ 2        ┆ 2025-01-31 23:56:42  ┆ 2025-02-01 00:07:27   ┆ … ┆ 645               ┆ 12.781395     │
└──────────┴──────────────────────┴───────────────────────┴───┴───────────────────┴───────────────┘

The filter context filters the rows of a dataframe based on one or more expressions which evaluate to a Boolean. For example, assuming the avg_speed_mph column computed above has been assigned back to df:

df.filter(pl.col('avg_speed_mph') < 1)
shape: (104_410, 22)
┌──────────┬──────────────────────┬───────────────────────┬───┬───────────────────┬───────────────┐
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime ┆ … ┆ trip_duration_sec ┆ avg_speed_mph │
│ ---      ┆ ---                  ┆ ---                   ┆   ┆ ---               ┆ ---           │
│ i32      ┆ datetime[μs]         ┆ datetime[μs]          ┆   ┆ i64               ┆ f64           │
╞══════════╪══════════════════════╪═══════════════════════╪═══╪═══════════════════╪═══════════════╡
│ 2        ┆ 2025-01-01 00:37:43  ┆ 2025-01-01 00:37:53   ┆ … ┆ 10                ┆ 0.0           │
│ 2        ┆ 2025-01-01 00:57:08  ┆ 2025-01-01 00:57:16   ┆ … ┆ 8                 ┆ 0.0           │
│ 1        ┆ 2025-01-01 00:27:40  ┆ 2025-01-01 00:59:30   ┆ … ┆ 1910              ┆ 0.0           │
│ …        ┆ …                    ┆ …                     ┆ … ┆ …                 ┆ …             │
│ 1        ┆ 2025-01-31 23:10:25  ┆ 2025-01-31 23:36:21   ┆ … ┆ 1556              ┆ 0.0           │
└──────────┴──────────────────────┴───────────────────────┴───┴───────────────────┴───────────────┘

The group_by context behaves like its Pandas counterpart.
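
For example, a small sketch stratifying the fare by payment type (assuming df here is the taxi dataframe loaded with Polars):

# One row per payment type, with the mean fare amount for that group
df.group_by("payment_type").agg(pl.col("fare_amount").mean().alias("avg_fare"))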

Transformations

A join operation combines columns from two (or more) dataframes into a new dataframe. There are different joining strategies, which influence how columns are combined and which rows are included in the final result. A common type is the equi join, where rows are matched on a key expression. Let us clarify this with an example. The df dataframe does not include specific coordinates for each pickup and drop-off, but only a PULocationID and a DOLocationID. There is a taxy_zones_xy.csv file that contains, for each LocationID, the longitude (X) and latitude (Y) of the location, as well as the name of the zone and the borough:


lookup_df = pl.read_csv('taxy_zones_xy.csv', has_header=True)
lookup_df.head()
┌────────────┬────────────┬───────────┬─────────────────────────┬───────────────┐
│ LocationID ┆ X          ┆ Y         ┆ zone                    ┆ borough       │
│ ---        ┆ ---        ┆ ---       ┆ ---                     ┆ ---           │
│ i64        ┆ f64        ┆ f64       ┆ str                     ┆ str           │
╞════════════╪════════════╪═══════════╪═════════════════════════╪═══════════════╡
│ 1          ┆ -74.176786 ┆ 40.689516 ┆ Newark Airport          ┆ EWR           │
│ 2          ┆ -73.826126 ┆ 40.625724 ┆ Jamaica Bay             ┆ Queens        │
│ 3          ┆ -73.849479 ┆ 40.865888 ┆ Allerton/Pelham Gardens ┆ Bronx         │
│ 4          ┆ -73.977023 ┆ 40.724152 ┆ Alphabet City           ┆ Manhattan     │
│ 5          ┆ -74.18993  ┆ 40.55034  ┆ Arden Heights           ┆ Staten Island │
└────────────┴────────────┴───────────┴─────────────────────────┴───────────────┘

This can be used to append these columns to the original df to have some form of geographical data as follows (e.g. for the PULocationID):

df = df.join(lookup_df, left_on='PULocationID', right_on='LocationID',
             how='left', suffix='_pickup')

In the line above, left_on indicates the key in the original dataframe, right_on specifies the key in the lookup_df dataframe, how='left' means that the columns from the second dataframe are added to the first (and not the other way around), and suffix is appended to joined column names that clash with names already present in df (which becomes relevant when the lookup table is joined a second time, e.g. for the drop-off location). More information on join operations can be found here.

Exercises

Joining geographical data

We have already seen how to add actual latitude and longitude for the pickups. Now do the same for the drop-offs!

Feature engineering: enriching the dataset

We want to understand a bit more of the traffic in the city by creating new features (i.e. columns), in particular:

  • Split the pickup datetime into hour, minute, day of the week and month to identify daily, weekly and monthly trends

  • Compute the average speed as an indicator of congestion (low speed -> traffic jam)

  • Stratify the trip distance and fare by zone to identify how expensive different zones are.

Below is a skeleton of the code, where some lines have been blanked out for you to fill in (marked with TODO: ...).

import polars as pl

raw_df = pl.read_parquet('yellow_tripdata_2025-01.parquet')

df = raw_df.with_columns([
    pl.col("tpep_pickup_datetime").dt.hour().alias("pickup_hour"),
    # TODO: do this for the minute
    pl.col("tpep_pickup_datetime").dt.weekday().alias("pickup_dow"),   # ISO weekday: Mon=1 … Sun=7
    pl.col("tpep_pickup_datetime").dt.month().alias("pickup_month"),
    # Trip duration in seconds
    (pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime"))
        .dt.total_seconds()
        .alias("trip_duration_sec"),
])

df = df.with_columns(
    # TODO: add expression for average speed here
    .replace_nan(None)                        # turn NaN from 0/0 divisions into nulls
    .alias("avg_speed_mph")
)

# Compute per-pickup-zone statistics once
zone_stats = (
    df.group_by("PULocationID")
      .agg([
          pl.mean("fare_amount").alias("zone_avg_fare"),
          # TODO: do the same for the trip distance here
          pl.len().alias("zone_trip_cnt"),
      ])
      .rename({"PULocationID": "pickup_zone_id"})   # avoid name clash later
)

# Join those stats back onto the original rows
df = df.join(zone_stats, left_on="PULocationID", right_on="pickup_zone_id", how="left")

As we saw in the Transformations section, the main role of join here is to “spread” the zone_stats over all the rides in the original dataframe (i.e. write the zone_avg_fare on each ride in df). join has its roots in relational databases, where different tables can be merged based on a common column.

More feature engineering!

Similarly to the exercise above, define the following features in the data:

  • pickup_hour extracted from tpep_pickup_datetime

  • is_weekend, a Boolean value for each trip

  • avg_speed_mph, exactly as before

  • tip_to_fare_ratio, dividing the tip amount by the total fare. Be careful with division by 0

  • fare_per_mile, dividing the total fare by the distance

  • dist_per_passenger, the average distance travelled by each passenger (sum of all trip distances divided by number of trips)

  • speed_per_pickup_area, the average velocity stratified by pickup location

  • dropoff_trip_count, count of trips stratified per dropoff location

Summary

We have seen how to deal with common workflows in both Pandas and Polars, from basic tasks like opening and inspecting a dataset to split-apply-combine pipelines. We have also seen how to use the Polars expression API to manipulate datasets and perform some basic feature engineering.

Keypoints

  • Dataframes are combinations of series

  • Both Pandas and Polars can be used to manipulate them

  • The expression API in Polars makes it possible to perform advanced operations with a simple DSL.

See also

There is a lot more to Polars than what we covered in this short introduction. For example, queries like the ones we introduced can be performed lazily, i.e. just declared and then run all together, giving the backend a chance to optimise them. This can dramatically improve performance in the case of complex queries. For this and a lot more, we refer you to the official documentation.
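
As an illustrative sketch (reusing the hourly-fare example and assuming the same Parquet file), a lazy version of the query could be declared with scan_parquet and executed only when collect() is called:

import polars as pl

# Nothing is read or computed until collect() is called
lazy_query = (
    pl.scan_parquet("yellow_tripdata_2025-01.parquet")   # LazyFrame instead of DataFrame
      .with_columns(pl.col("tpep_pickup_datetime").dt.hour().alias("hour"))
      .group_by(["hour", "payment_type"])
      .agg(pl.col("fare_amount").sum())
      .sort(["hour", "payment_type"])
)

hourly_fare = lazy_query.collect()   # the optimised query runs here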