Optional: more on Pandas

Objectives

  • Learn about other data wrangling operations in pandas

Begin by defining a new dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(
         {
              "foo": ["one", "one", "one", "two", "two", "two"] ,
              "bar": ["A", "B", "C"] * 2,
              "baz": np.linspace(1,6,6).astype(int),
              "zoo": ["x","y","z","q","w","t"]
         }
         )
df

Suppose we would like to represent the table in such a way that the columns are the unique variables from “bar” and the index from “foo”. To reshape the data into this form, we use the DataFrame.pivot() method (also implemented as a top level function pivot()):

pivoted = df.pivot(index="foo", columns="bar", values="baz")
pivoted
../_images/reshaping_pivot.png

Note

pivot() will error with a ValueError: Index contains duplicate entries, cannot reshape if the index/column pair is not unique. In this case, consider using pivot_table() which is a generalization of pivot that can handle duplicate values for one index/column pair.

Stacking and unstacking

Closely related to the pivot() method are the related stack() and unstack() methods available on Series and DataFrame. These methods are designed to work together with MultiIndex objects.

The stack() function “compresses” a level in the DataFrame columns to produce either:

  • A Series, in the case of a simple column Index.

  • A DataFrame, in the case of a MultiIndex in the columns.

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:

tuples = list(zip(*[
                       ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
                       ["one", "two", "one", "two", "one", "two", "one", "two"],
                        ]))
columns = pd.MultiIndex.from_tuples([
             ("bar", "one"),
             ("bar", "two"),
             ("baz", "one"),
             ("baz", "two"),
             ("foo", "one"),
             ("foo", "two"),
             ("qux", "one"),
             ("qux", "two"),
         ],
         names=["first", "second"])
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

Note: there are other ways to generate MultiIndex, e.g.

index = pd.MultiIndex.from_product(
[("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
)

df = pd.DataFrame(np.linspace(1,16,16).astype(int).reshape(8,2), index=index, columns=["A", "B"])
df
df2 = df[:4]
df2
stacked=df2.stack()
../_images/reshaping_stack.png

The unstack() method performs the inverse operation of stack(), and by default unstacks the last level. If the indexes have names, you can use the level names instead of specifying the level numbers.

stacked.unstack()

../_images/reshaping_unstack.png

stacked.unstack(1) or stacked.unstack(“second”)

../_images/reshaping_unstack_1.png ../_images/reshaping_unstack_0.png

Aggregation

Here we will go through the following example

import urllib.request
import pandas as pd

header_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/HEADERS.txt'
with urllib.request.urlopen(header_url) as response:
    data = response.read().decode('utf-8')
lines = data.split('\n')
headers = lines[1].split(' ')

ftp_base = 'ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/'
dframes = []
for year in range(2016, 2019):
    data_url = f'{year}/CRND0103-{year}-NY_Millbrook_3_W.txt'
    df = pd.read_csv(ftp_base + data_url, parse_dates=[1],
                     names=headers,header=None, sep='\s+',
                     na_values=[-9999.0, -99.0])
    dframes.append(df)

df = pd.concat(dframes)
df = df.set_index('LST_DATE')
df.head()
df['T_DAILY_MEAN'] # or df.T_DAILY_MEAN
df['T_DAILY_MEAN'].aggregate([np.max,np.min,np.mean])
df.index   # df.index is a pandas DateTimeIndex object.
gbyear=df.groupby(df.index.year)
gbyear.T_DAILY_MEAN.head()
gbyear.T_DAILY_MEAN.max()
gbyear.T_DAILY_MEAN.aggregate(np.max)
gbyear.T_DAILY_MEAN.aggregate([np.min, np.max, np.mean, np.std])

now let us calculate the monthly mean values

gb=df.groupby(df.index.month)
df.groupby('T_DAILY_MEAN')  # or  df.groupby(df.T_DAILY_MEAN)
monthly_climatology = df.groupby(df.index.month).mean()
monthly_climatology

Each row in this new dataframe respresents the average values for the months (1=January, 2=February, etc.)

monthly_T_climatology = df.groupby(df.index.month).aggregate({'T_DAILY_MEAN': 'mean',
                                                      'T_DAILY_MAX': 'max',
                                                      'T_DAILY_MIN': 'min'})
monthly_T_climatology.head()
daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'T_DAILY_MEAN': 'mean',
                                                    'T_DAILY_MAX': 'max',
                                                    'T_DAILY_MIN': 'min'})
def standardize(x):
    return (x - x.mean())/x.std()
anomaly = df.groupby(df.index.month).transform(standardize)

Transformation

The key difference between aggregation and transformation is that aggregation returns a smaller object than the original, indexed by the group keys, while transformation returns an object with the same index (and same size) as the original object.

In this example, we standardize the temperature so that the distribution has zero mean and unit variance. We do this by first defining a function called standardize and then passing it to the transform method.

transformed = df.groupby(lambda x: x.year).transform(
    lambda x: (x - x.mean()) / x.std()
)
grouped = df.groupby(lambda x: x.year)
grouped_trans = transformed.groupby(lambda x: x.year)