Chicago Fall Colors in Context: Using GHCNd Data!#

Tonight, Ben Blaiszik (@BenBlaiszik) mentioned the following:

Ben fall picture

So, I sat down to see if I could answer this question, using the Global Historical Climatology Network daily (GHCNd) dataset.

The Data#

This dataset is commonly used for calculating climatological normals, thanks to its long records and worldwide spatial coverage. It is provided by NOAA, with a version hosted in the cloud on Amazon Web Services (AWS).

Imports#

We use the calendar module for month names, core components of the scientific Python stack (NumPy and pandas), hvPlot for visualization, Dask for parallel and distributed computing, and Cartopy for map projections.

import calendar

import numpy as np
import pandas as pd
import holoviews as hv
import hvplot.pandas
from distributed import LocalCluster
import cartopy.crs as ccrs
hv.extension('bokeh')

Spin up a Cluster#

This spins up a local Dask cluster to use for the analysis.

cluster = LocalCluster()
cluster
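To actually route Dask computations through this cluster, you would normally attach a Client as well; this is a small addition not shown in the original notebook.

from distributed import Client

# Connect a client so that Dask computations are scheduled
# on the local cluster we just created
client = Client(cluster)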

Read in the GHCN Data#

Let’s get to accessing the data. The main documentation can be found here:

https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily

We are interested in:

  • the daily maximum temperature (TMAX)

  • the daily minimum temperature (TMIN)

  • the daily precipitation accumulation (PRCP)

The Stations Table#

The first step to retrieving this dataset is determining which location to use. There is a station table for this, but it is a fixed-width file, so it requires custom column specifications to load into our dataframe.

This code was provided by Ned Haughton, via his GitLab snippet.

"""
File: read_ghcn.py
Author: Ned Haughton
Description: Code for reading GHCN dly file data
https://gitlab.com/-/snippets/1838910
"""

import pandas as pd


# Metadata specs #

metadata_col_specs = [
    (0,  12),
    (12, 21),
    (21, 31),
    (31, 38),
    (38, 41),
    (41, 72),
    (72, 76),
    (76, 80),
    (80, 86)
]

metadata_names = [
    "ID",
    "LATITUDE",
    "LONGITUDE",
    "ELEVATION",
    "STATE",
    "NAME",
    "GSN FLAG",
    "HCN/CRN FLAG",
    "WMO ID"]

metadata_dtype = {
    "ID": str,
    "STATE": str,
    "NAME": str,
    "GSN FLAG": str,
    "HCN/CRN FLAG": str,
    "WMO ID": str
    }


# Data specs #

data_header_names = [
    "ID",
    "YEAR",
    "MONTH",
    "ELEMENT"]

data_header_col_specs = [
    (0,  11),
    (11, 15),
    (15, 17),
    (17, 21)]

data_header_dtypes = {
    "ID": str,
    "YEAR": int,
    "MONTH": int,
    "ELEMENT": str}

data_col_names = [[
    "VALUE" + str(i + 1),
    "MFLAG" + str(i + 1),
    "QFLAG" + str(i + 1),
    "SFLAG" + str(i + 1)]
    for i in range(31)]
# Join sub-lists
data_col_names = sum(data_col_names, [])

data_replacement_col_names = [[
    ("VALUE", i + 1),
    ("MFLAG", i + 1),
    ("QFLAG", i + 1),
    ("SFLAG", i + 1)]
    for i in range(31)]
# Join sub-lists
data_replacement_col_names = sum(data_replacement_col_names, [])
data_replacement_col_names = pd.MultiIndex.from_tuples(
    data_replacement_col_names,
    names=['VAR_TYPE', 'DAY'])

data_col_specs = [[
    (21 + i * 8, 26 + i * 8),
    (26 + i * 8, 27 + i * 8),
    (27 + i * 8, 28 + i * 8),
    (28 + i * 8, 29 + i * 8)]
    for i in range(31)]
data_col_specs = sum(data_col_specs, [])

data_col_dtypes = [{
    "VALUE" + str(i + 1): int,
    "MFLAG" + str(i + 1): str,
    "QFLAG" + str(i + 1): str,
    "SFLAG" + str(i + 1): str}
    for i in range(31)]
data_header_dtypes.update({k: v for d in data_col_dtypes for k, v in d.items()})


# Reading functions #

def read_station_metadata(filename):
    """Reads in station metadata

    :filename: ghcnd station metadata file.
    :returns: station metadata as a pandas Dataframe

    """
    df = pd.read_fwf(filename, colspecs=metadata_col_specs, names=metadata_names,
                     index_col='ID', dtype=metadata_dtype)

    return df
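Only read_station_metadata is used below, but the data specs above could also parse a raw per-station .dly file. Here is a minimal, hypothetical sketch of such a reader (the function name and details are my own; we read the cloud-hosted Parquet instead, so this is just for context):

def read_ghcn_data_file(filename):
    """Hypothetical sketch: parse a raw GHCN-Daily .dly file
    using the column specs defined above (not used in this post)."""
    df = pd.read_fwf(filename,
                     colspecs=data_header_col_specs + data_col_specs,
                     names=data_header_names + data_col_names,
                     index_col=data_header_names,
                     dtype=data_header_dtypes)

    # Swap the flat VALUE1/MFLAG1/... names for a (VAR_TYPE, DAY)
    # MultiIndex, then stack the days into long format
    df.columns = data_replacement_col_names
    df = df.stack(level='DAY')

    # GHCN-Daily uses -9999 to mark missing values
    df['VALUE'] = df['VALUE'].replace(-9999, np.nan)
    return df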

Load the station data using the function#

We can remotely read in the dataset using the following - no download required!

station_data = read_station_metadata("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")
station_data
| ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|----|----------|-----------|-----------|-------|------|----------|--------------|--------|
| ACW00011604 | 17.1167 | -61.7833 | 10.1 | NaN | ST JOHNS COOLIDGE FLD | NaN | NaN | NaN |
| ACW00011647 | 17.1333 | -61.7833 | 19.2 | NaN | ST JOHNS | NaN | NaN | NaN |
| AE000041196 | 25.3330 | 55.5170 | 34.0 | NaN | SHARJAH INTER. AIRP | GSN | NaN | 41196 |
| AEM00041194 | 25.2550 | 55.3640 | 10.4 | NaN | DUBAI INTL | NaN | NaN | 41194 |
| AEM00041217 | 24.4330 | 54.6510 | 26.8 | NaN | ABU DHABI INTL | NaN | NaN | 41217 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZI000067969 | -21.0500 | 29.3670 | 861.0 | NaN | WEST NICHOLSON | NaN | NaN | 67969 |
| ZI000067975 | -20.0670 | 30.8670 | 1095.0 | NaN | MASVINGO | NaN | NaN | 67975 |
| ZI000067977 | -21.0170 | 31.5830 | 430.0 | NaN | BUFFALO RANGE | NaN | NaN | 67977 |
| ZI000067983 | -20.2000 | 32.6160 | 1132.0 | NaN | CHIPINGE | GSN | NaN | 67983 |
| ZI000067991 | -22.2170 | 30.0000 | 457.0 | NaN | BEITBRIDGE | NaN | NaN | 67991 |

123183 rows × 8 columns

Visualize the GHCNd Dataset#

Let’s visualize the spatial extent of the dataset. We can use a spatial density map by setting rasterize=True, adding map features for reference.

station_data.hvplot.points(x='LONGITUDE',
                           y='LATITUDE',
                           rasterize=True,
                           geo=True,
                           features=['ocean', 'land', 'borders', 'lakes'],
                           projection=ccrs.PlateCarree(),
                           title='GHCNd Station Locations')

Zoom in Around Chicago#

Let’s zoom in around Chicago! This lets us investigate the specific stations in this area.

lat_min = 41
lat_max = 43
lon_min = -89
lon_max = -87

stations_around_chicago = station_data[(station_data.LATITUDE > lat_min) & 
                                       (station_data.LATITUDE < lat_max) & 
                                       (station_data.LONGITUDE < lon_max) &
                                       (station_data.LONGITUDE > lon_min)]
stations_around_chicago
| ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|----|----------|-----------|-----------|-------|------|----------|--------------|--------|
| US1ILBN0004 | 42.3363 | -88.8701 | 270.1 | IL | POPLAR GROVE 3.2 SW | NaN | NaN | NaN |
| US1ILBN0005 | 42.3251 | -88.9391 | 277.1 | IL | LOVES PARK 3.7 ESE | NaN | NaN | NaN |
| US1ILBN0008 | 42.3600 | -88.8470 | 278.9 | IL | POPLAR GROVE 1.3 WSW | NaN | NaN | NaN |
| US1ILBN0009 | 42.2699 | -88.8284 | 239.0 | IL | BELVIDERE 1.6 NE | NaN | NaN | NaN |
| US1ILBN0010 | 42.4356 | -88.8027 | 292.9 | IL | POPLAR GROVE 4.8 NNE | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| USW00014892 | 41.7833 | -87.6000 | 181.1 | IL | CHICAGO UNIV | NaN | NaN | NaN |
| USW00054811 | 41.8431 | -88.8514 | 262.4 | IL | SHABBONA 5 NNE | NaN | CRN | 72497 |
| USW00094818 | 42.7611 | -87.8136 | 205.4 | WI | RACINE BATTEN AP | NaN | NaN | NaN |
| USW00094846 | 41.9950 | -87.9336 | 200.6 | IL | CHICAGO OHARE INTL AP | NaN | NaN | 72530 |
| USW00094892 | 41.9144 | -88.2464 | 229.8 | IL | WEST CHICAGO DUPAGE AP | NaN | NaN | NaN |

1146 rows × 8 columns

stations_around_chicago.hvplot.points(x='LONGITUDE',
                                      y='LATITUDE',
                                      geo=True,
                                      features=['ocean', 'land', 'borders', 'lakes'],
                                      projection=ccrs.PlateCarree(),
                                      hover_cols=['ID', 'NAME'],
                                      title='GHCNd Stations Around Chicago')

We will use the O’Hare International Airport site here, USW00094846, which is the second-to-last row in the table above.
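To confirm the station’s metadata, we can look the row up directly by its ID (the dataframe is indexed by ID); this is a quick check, not in the original post.

# Pull the metadata row for Chicago O'Hare by station ID
stations_around_chicago.loc['USW00094846']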


Access Station Data from AWS#

Now that we have decided on our station, we can load our data from AWS.

Here is the landing page for the dataset documentation:

https://registry.opendata.aws/noaa-ghcn/

Finding our Data#

The directories in the bucket are organized as follows:

s3://noaa-ghcn-pds/parquet/by_station/STATION={INSERT_STATION_ID}/

for example,

s3://noaa-ghcn-pds/parquet/by_station/STATION=USW00094846/

is the path to the file for Chicago O’Hare.
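If you would like to explore the bucket layout yourself, one option is to list it with s3fs (an assumption here: the s3fs library is installed, which pandas also relies on under the hood for s3:// paths):

import s3fs

# Anonymous (no-credential) access to the public NOAA bucket
fs = s3fs.S3FileSystem(anon=True)

# Peek at the first few per-station partitions
fs.ls('noaa-ghcn-pds/parquet/by_station/')[:5]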

Read our Data using Pandas#

Now that we have found our data, we can access it using the following, setting our station ID to USW00094846.

station = 'USW00094846'
ohare = pd.read_parquet(f's3://noaa-ghcn-pds/parquet/by_station/STATION={station}/',
                        storage_options={'anon':True})
ohare
|  | ID | DATE | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME | ELEMENT |
|--|----|------|------------|--------|--------|--------|----------|---------|
| 0 | USW00094846 | 19650101 | 100 | None | None | X | None | ACMH |
| 1 | USW00094846 | 19650102 | 70 | None | None | X | None | ACMH |
| 2 | USW00094846 | 19650103 | 60 | None | None | X | None | ACMH |
| 3 | USW00094846 | 19650104 | 70 | None | None | X | None | ACMH |
| 4 | USW00094846 | 19650105 | 90 | None | None | X | None | ACMH |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 367498 | USW00094846 | 20050907 | 1 | None | None | X | None | WV03 |
| 367499 | USW00094846 | 20050919 | 1 | None | None | X | None | WV03 |
| 367500 | USW00094846 | 20050922 | 1 | None | None | X | None | WV03 |
| 367501 | USW00094846 | 20051017 | 1 | None | None | X | None | WV03 |
| 367502 | USW00094846 | 20010324 | 1 | None | None | X | None | WV20 |

367503 rows × 8 columns

Clean up the Data#

Now, we can clean up the data. We:

  • Convert the date column to a datetime column

  • Subset for our variables, and assign each to its own dataframe

  • Rename the columns

  • Convert temperatures from tenths of degrees Celsius to degrees Celsius, and precipitation from tenths of a millimeter to millimeters (divide by 10)

# Clean up the dates
ohare['DATE'] = pd.to_datetime(ohare.DATE, format='%Y%m%d')  # dates are stored as YYYYMMDD

# Subset for tmax, tmin, and precipitation
tmax = ohare.loc[ohare.ELEMENT == 'TMAX']
tmin = ohare.loc[ohare.ELEMENT == 'TMIN']
precipitation = ohare.loc[ohare.ELEMENT == 'PRCP']

# Rename the columns, set the index to the date
tmax_df = tmax[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMAX'}).set_index('DATE')
tmin_df = tmin[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMIN'}).set_index('DATE')
precip_df = precipitation[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'PRCP'}).set_index('DATE')

# Convert temperatures to degrees Celsius and precipitation to
# millimeters (GHCNd stores both in tenths of a unit)
tmax_df['TMAX'] = tmax_df.TMAX/10
tmin_df['TMIN'] = tmin_df.TMIN/10
precip_df['PRCP'] = precip_df.PRCP/10
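One optional step not taken in the original post: GHCNd sets Q_FLAG whenever a value fails a quality-assurance check, so you could keep only unflagged rows right after reading the data (before the subsetting above):

# Keep only observations that passed quality assurance
# (Q_FLAG is null when no check failed)
ohare = ohare[ohare.Q_FLAG.isnull()]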

Calculate and Visualize the Temperature and Precipitation Fields#

Now that we have a clean dataset, we set up the following function, which calculates the monthly normals (using 1990-2020 as the default reference period) and compares a given year against them.

This gives us a frame of reference for how “normal” a given year is.

def plot_monthly_stats(df, year=2022, ylabel=None, first_year=1990, last_year=2020):
    """
    Helper function to calculate the monthly averages and plot our data
    """

    if first_year is None:
        first_year = df.index.year.min()

    if last_year is None:
        last_year = df.index.year.max()

    # Grab the month names from the calendar module
    months = list(calendar.month_name)[1:]

    # Subset for the averaging (reference) period
    df_normal = df[(df.index.year >= first_year) & (df.index.year <= last_year)]

    # Calculate the average monthly values over the reference period
    monthly_value = df_normal.groupby(df_normal.index.month).mean()
    monthly_value.index = months

    # Calculate the monthly averages for the given year; the year may be
    # incomplete, so only label the months present (this assumes the
    # months run contiguously from January)
    df_year = df[df.index.year == year]
    monthly_average = df_year.groupby(df_year.index.month).mean()
    monthly_average.index = months[:monthly_average.index[-1]]

    # Pick a line color (and a default ylabel) based on the variable
    if 'TMAX' in df.columns:
        color = 'red'
        ylabel = ylabel or 'Daily Maximum Temperature \n (degC)'

    elif 'TMIN' in df.columns:
        color = 'blue'
        ylabel = ylabel or 'Daily Minimum Temperature \n (degC)'

    elif 'PRCP' in df.columns:
        color = 'green'
        ylabel = ylabel or 'Daily Accumulated Precipitation \n (mm)'

    else:
        color = 'grey'

    return monthly_value.hvplot.line(xlabel='Month',
                                     ylabel=ylabel,
                                     label=f'Monthly Average {first_year} - {last_year}',
                                     color='k') * monthly_average.hvplot.line(label=f'Monthly Average {year}',
                                                                              color=color,
                                                                              line_width=3)

Plot the 2022 Comparison#

Let’s start with 2022 - notice how relatively dry the last few months have been, with near-normal daily maximum temperatures and above-normal daily minimum temperatures.

Warm and dry are two ingredients for nice fall colors in the Midwest!

(plot_monthly_stats(precip_df) + plot_monthly_stats(tmax_df) + plot_monthly_stats(tmin_df)).cols(1)

Plot 2015-2022 for Comparison#

For comparison, here are the years 2015 through 2022!

for year in np.arange(2015, 2023, 1):
    display((plot_monthly_stats(precip_df, year) + plot_monthly_stats(tmax_df, year) + plot_monthly_stats(tmin_df, year)).cols(1))

Conclusions#

It has been an absolutely beautiful fall in the Chicago area (and the Upper Midwest in general). Here is one of my pictures from mid-October, taken at Matthiessen State Park in Oglesby, Illinois.

fall-colors

The availability of these cloud-hosted datasets is incredible - their analysis-ready nature lowers the barrier to doing science. Instead of parsing through thousands of text files, or merging tough-to-align CSVs, people can access these analysis-ready, cloud-optimized datasets (stored in Parquet in this case) and jump right into their analysis.

Thank you to NOAA and AWS for making these datasets available, to Ned Haughton for the station-table parsing code, and to Ben for the motivation.