Chicago Fall Colors in Context: Using GHCNd Data!#
Tonight, Ben Blaiszik (@BenBlaiszik) posed a question about how this fall's colors around Chicago stack up against previous years.
So, I sat down to see if I could answer that question using the Global Historical Climatology Network daily (GHCNd) dataset.
The Data#
This dataset is commonly used for calculating climatological normals thanks to its long records and worldwide spatial coverage. It is provided by NOAA, with a copy hosted in the cloud on Amazon Web Services (AWS).
Imports#
We use calendar for month names, core components of the scientific Python stack (NumPy and pandas), hvplot for visualization, Dask for parallel and distributed computing, and Cartopy for map projections.
import calendar
import numpy as np
import pandas as pd
import holoviews as hv
import hvplot.pandas
from distributed import LocalCluster
import cartopy.crs as ccrs
hv.extension('bokeh')
Spin up a Cluster#
This spins up a local Dask cluster to use for the analysis.
cluster = LocalCluster()
cluster
UserWarning: Port 8787 is already in use. Perhaps you already have a cluster running? Hosting the HTTP server on port 52868 instead

LocalCluster a199549e: 5 workers, 10 total threads, 32.00 GiB total memory, status: running
Dashboard: http://127.0.0.1:52868/status
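If you want later Dask-backed computations (for example, dask.dataframe operations) to run on these workers, you can attach a Client to the cluster. Below is a minimal sketch of that step; the analysis in this post happens to read the data with pandas directly, so it is optional.
from distributed import Client

# Optional: connect a Dask Client to the LocalCluster so that Dask collections
# (dask.dataframe, dask.array, ...) are scheduled on these workers
client = Client(cluster)
client  # displays a cluster summary, including the dashboard link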
Read in the GHCN Data#
Let’s get to accessing the data. The main documentation can be found here:
https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily
We are interested in:
the daily maximum temperature (TMAX)
the daily minimum temperature (TMIN)
the daily precipitation accumulation (PRCP)
The Stations Table#
The first step in retrieving this dataset is determining which location to use. There is a station table for this, but it is a fixed-width text file that requires some custom column specifications to load into our dataframe.
This code was provided by Ned Haughton, via his GitLab snippet.
"""
File: read_ghcn.py
Author: Ned Haughton
Description: Code for reading GHCN dly file data
https://gitlab.com/-/snippets/1838910
"""
import pandas as pd
# Metadata specs #
metadata_col_specs = [
(0, 12),
(12, 21),
(21, 31),
(31, 38),
(38, 41),
(41, 72),
(72, 76),
(76, 80),
(80, 86)
]
metadata_names = [
"ID",
"LATITUDE",
"LONGITUDE",
"ELEVATION",
"STATE",
"NAME",
"GSN FLAG",
"HCN/CRN FLAG",
"WMO ID"]
metadata_dtype = {
"ID": str,
"STATE": str,
"NAME": str,
"GSN FLAG": str,
"HCN/CRN FLAG": str,
"WMO ID": str
}
# Data specs #
data_header_names = [
"ID",
"YEAR",
"MONTH",
"ELEMENT"]
data_header_col_specs = [
(0, 11),
(11, 15),
(15, 17),
(17, 21)]
data_header_dtypes = {
"ID": str,
"YEAR": int,
"MONTH": int,
"ELEMENT": str}
data_col_names = [[
"VALUE" + str(i + 1),
"MFLAG" + str(i + 1),
"QFLAG" + str(i + 1),
"SFLAG" + str(i + 1)]
for i in range(31)]
# Join sub-lists
data_col_names = sum(data_col_names, [])
data_replacement_col_names = [[
("VALUE", i + 1),
("MFLAG", i + 1),
("QFLAG", i + 1),
("SFLAG", i + 1)]
for i in range(31)]
# Join sub-lists
data_replacement_col_names = sum(data_replacement_col_names, [])
data_replacement_col_names = pd.MultiIndex.from_tuples(
data_replacement_col_names,
names=['VAR_TYPE', 'DAY'])
data_col_specs = [[
(21 + i * 8, 26 + i * 8),
(26 + i * 8, 27 + i * 8),
(27 + i * 8, 28 + i * 8),
(28 + i * 8, 29 + i * 8)]
for i in range(31)]
data_col_specs = sum(data_col_specs, [])
data_col_dtypes = [{
"VALUE" + str(i + 1): int,
"MFLAG" + str(i + 1): str,
"QFLAG" + str(i + 1): str,
"SFLAG" + str(i + 1): str}
for i in range(31)]
data_header_dtypes.update({k: v for d in data_col_dtypes for k, v in d.items()})
# Reading functions #
def read_station_metadata(filename):
"""Reads in station metadata
:filename: ghcnd station metadata file.
:returns: station metadata as a pandas Dataframe
"""
df = pd.read_fwf(filename, colspecs=metadata_col_specs, names=metadata_names,
index_col='ID', dtype=metadata_dtype)
return df
Load the station data using the function#
We can read the station table remotely using the following - no download required!
station_data = read_station_metadata("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")
station_data
| ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|---|---|---|---|---|---|---|---|---|
| ACW00011604 | 17.1167 | -61.7833 | 10.1 | NaN | ST JOHNS COOLIDGE FLD | NaN | NaN | NaN |
| ACW00011647 | 17.1333 | -61.7833 | 19.2 | NaN | ST JOHNS | NaN | NaN | NaN |
| AE000041196 | 25.3330 | 55.5170 | 34.0 | NaN | SHARJAH INTER. AIRP | GSN | NaN | 41196 |
| AEM00041194 | 25.2550 | 55.3640 | 10.4 | NaN | DUBAI INTL | NaN | NaN | 41194 |
| AEM00041217 | 24.4330 | 54.6510 | 26.8 | NaN | ABU DHABI INTL | NaN | NaN | 41217 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZI000067969 | -21.0500 | 29.3670 | 861.0 | NaN | WEST NICHOLSON | NaN | NaN | 67969 |
| ZI000067975 | -20.0670 | 30.8670 | 1095.0 | NaN | MASVINGO | NaN | NaN | 67975 |
| ZI000067977 | -21.0170 | 31.5830 | 430.0 | NaN | BUFFALO RANGE | NaN | NaN | 67977 |
| ZI000067983 | -20.2000 | 32.6160 | 1132.0 | NaN | CHIPINGE | GSN | NaN | 67983 |
| ZI000067991 | -22.2170 | 30.0000 | 457.0 | NaN | BEITBRIDGE | NaN | NaN | 67991 |

123183 rows × 8 columns
Visualize the GHCNd Dataset#
Let’s visualize the spatial extent. We can use a spatial density map by setting rasterize=True and adding a few geographic features for reference.
station_data.hvplot.points(x='LONGITUDE',
y='LATITUDE',
rasterize=True,
geo=True,
features=['ocean', 'land', 'borders', 'lakes'],
projection=ccrs.PlateCarree(),
title='GHCNd Station Locations')
Zoom in Around Chicago#
Let’s zoom in around Chicago! This allows us to investigate the specific stations in this area.
lat_min = 41
lat_max = 43
lon_min = -89
lon_max = -87
stations_around_chicago = station_data[(station_data.LATITUDE > lat_min) &
(station_data.LATITUDE < lat_max) &
(station_data.LONGITUDE < lon_max) &
(station_data.LONGITUDE > lon_min)]
stations_around_chicago
| ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|---|---|---|---|---|---|---|---|---|
| US1ILBN0004 | 42.3363 | -88.8701 | 270.1 | IL | POPLAR GROVE 3.2 SW | NaN | NaN | NaN |
| US1ILBN0005 | 42.3251 | -88.9391 | 277.1 | IL | LOVES PARK 3.7 ESE | NaN | NaN | NaN |
| US1ILBN0008 | 42.3600 | -88.8470 | 278.9 | IL | POPLAR GROVE 1.3 WSW | NaN | NaN | NaN |
| US1ILBN0009 | 42.2699 | -88.8284 | 239.0 | IL | BELVIDERE 1.6 NE | NaN | NaN | NaN |
| US1ILBN0010 | 42.4356 | -88.8027 | 292.9 | IL | POPLAR GROVE 4.8 NNE | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| USW00014892 | 41.7833 | -87.6000 | 181.1 | IL | CHICAGO UNIV | NaN | NaN | NaN |
| USW00054811 | 41.8431 | -88.8514 | 262.4 | IL | SHABBONA 5 NNE | NaN | CRN | 72497 |
| USW00094818 | 42.7611 | -87.8136 | 205.4 | WI | RACINE BATTEN AP | NaN | NaN | NaN |
| USW00094846 | 41.9950 | -87.9336 | 200.6 | IL | CHICAGO OHARE INTL AP | NaN | NaN | 72530 |
| USW00094892 | 41.9144 | -88.2464 | 229.8 | IL | WEST CHICAGO DUPAGE AP | NaN | NaN | NaN |

1146 rows × 8 columns
stations_around_chicago.hvplot.points(x='LONGITUDE',
y='LATITUDE',
geo=True,
features=['ocean', 'land', 'borders', 'lakes'],
projection=ccrs.PlateCarree(),
hover_cols=['ID', 'NAME'],
title='GHCNd Stations Around Chicago')
We will use the O’Hare International Airport site, USW00094846, which is the second-to-last row in the table above.
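If you prefer to confirm that ID programmatically rather than by eye, a simple name filter on the subset above works as well (a quick sketch; the 'OHARE' substring comes from the NAME value shown in the table).
# Filter the Chicago-area stations by name to confirm the O'Hare station ID
ohare_station = stations_around_chicago[
    stations_around_chicago.NAME.str.contains('OHARE', na=False)
]
ohare_station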
Access Station Data from AWS#
Now that we have decided on a station, we can load our data from AWS.
Here is the landing page for the dataset on the Registry of Open Data on AWS:
https://registry.opendata.aws/noaa-ghcn/
Finding our Data#
The directories in the bucket are configured as follows:
s3://noaa-ghcn-pds/parquet/by_station/STATION={INSERT_STATION_ID}/
for example,
s3://noaa-ghcn-pds/parquet/by_station/STATION=USW00094846/
is the path to the data for Chicago O’Hare.
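If you would like to peek at what is stored under that prefix before reading it, here is a minimal sketch using s3fs (the library pandas relies on under the hood for s3:// paths; you may need to install it, along with pyarrow, if it is not already in your environment).
import s3fs

# Anonymous access to the public bucket, matching storage_options={'anon': True} below
fs = s3fs.S3FileSystem(anon=True)
fs.ls('noaa-ghcn-pds/parquet/by_station/STATION=USW00094846/')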
Read our Data using Pandas#
Now that we have found our data, we can access it with the following, setting our station ID to USW00094846.
station = 'USW00094846'
ohare = pd.read_parquet(f's3://noaa-ghcn-pds/parquet/by_station/STATION={station}/',
storage_options={'anon':True})
ohare
| | ID | DATE | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME | ELEMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | USW00094846 | 19650101 | 100 | None | None | X | None | ACMH |
| 1 | USW00094846 | 19650102 | 70 | None | None | X | None | ACMH |
| 2 | USW00094846 | 19650103 | 60 | None | None | X | None | ACMH |
| 3 | USW00094846 | 19650104 | 70 | None | None | X | None | ACMH |
| 4 | USW00094846 | 19650105 | 90 | None | None | X | None | ACMH |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 367498 | USW00094846 | 20050907 | 1 | None | None | X | None | WV03 |
| 367499 | USW00094846 | 20050919 | 1 | None | None | X | None | WV03 |
| 367500 | USW00094846 | 20050922 | 1 | None | None | X | None | WV03 |
| 367501 | USW00094846 | 20051017 | 1 | None | None | X | None | WV03 |
| 367502 | USW00094846 | 20010324 | 1 | None | None | X | None | WV20 |

367503 rows × 8 columns
Clean up the Data#
Now, we can clean up the data. We:
Convert the date column to a datetime column
Subset for our variables and assign each to its own dataframe
Rename the columns
Convert the temperatures from tenths of degrees Celsius to degrees Celsius (divide by 10)
# Clean up the dates
ohare['DATE'] = pd.to_datetime(ohare.DATE)
# Subset for tmax, tmin, and precipitation
tmax = ohare.loc[ohare.ELEMENT == 'TMAX']
tmin = ohare.loc[ohare.ELEMENT == 'TMIN']
precipitation = ohare.loc[ohare.ELEMENT == 'PRCP']
# Rename the columns, set the index to the date
tmax_df = tmax[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMAX'}).set_index('DATE')
tmin_df = tmin[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMIN'}).set_index('DATE')
precip_df = precipitation[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'PRCP'}).set_index('DATE')
# Convert to degrees Celsius
tmax_df['TMAX'] = tmax_df.TMAX/10
tmin_df['TMIN'] = tmin_df.TMIN/10
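One caveat worth noting: the GHCNd documentation reports PRCP in tenths of a millimeter, so if you want the precipitation values in millimeters, an analogous conversion can be applied. This extra step is a sketch and is not part of the original workflow above.
# GHCNd stores PRCP in tenths of mm; divide by 10 to work in mm,
# mirroring the temperature conversion above
precip_df['PRCP'] = precip_df.PRCP / 10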
Calculate and Visualize the Temperature and Precipitation Fields#
Now that we have a clean dataset, we set up the following function, which calculates the monthly normal (using 1990-2020 as the default reference period) and compares a given year against it.
This gives us a frame of reference for how “normal” each field is in a given year.
def plot_monthly_stats(df, year=2022, ylabel=None, first_year=1990, last_year=2020):
"""
Helper function to calculate the averages and plot our data
"""
if first_year is None:
first_year = df.index.year.min()
if last_year is None:
last_year = df.index.year.max()
# Grab the month names from the calendar module
months = list(calendar.month_name)[1:]
# Subset for the averaging period
df_normal = df[(df.index.year >= first_year) & (df.index.year <= last_year)]
# Calculate the average monthly values over the entire dataset
monthly_value = df_normal.groupby(df_normal.index.month).mean()
monthly_value.index = months
# Calculate the given year average
df_year = df[df.index.year == year]
monthly_average = df_year.groupby(df_year.index.month).mean()
monthly_average.index = months[:monthly_average.index[-1]]
if 'TMAX' in df.columns:
color = 'red'
ylabel = 'Daily Maximum Temperature \n (degC)'
elif 'TMIN' in df.columns:
color = 'blue'
ylabel = 'Daily Minimum Temperature \n (degC)'
elif 'PRCP' in df.columns:
color = 'green'
ylabel = 'Daily Accumulated Precipitation \n (mm)'
else:
color='grey'
return monthly_value.hvplot.line(xlabel='Monthly',
ylabel=ylabel,
label=f'Monthly Average {first_year} - {last_year}',
color='k') * monthly_average.hvplot.line(label=f'Monthly Average {year}',
color=color,
line_width=3)
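As a usage note, the first_year/last_year defaults can be overridden; passing None for both compares the chosen year against the full station record instead of the 1990-2020 window.
# Compare 2021 daily maximum temperatures against the entire O'Hare record
plot_monthly_stats(tmax_df, year=2021, first_year=None, last_year=None)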
Plot the 2022 Comparison#
Let’s start with 2022. Notice how relatively dry the last few months have been, with near-normal daily maximum temperatures and above-normal daily minimum temperatures.
Warm and dry are two ingredients for nice fall colors in the Midwest!
(plot_monthly_stats(precip_df) + plot_monthly_stats(tmax_df) + plot_monthly_stats(tmin_df)).cols(1)
Plot 2015-2022 for Comparison#
For comparison, here are the last eight years, 2015 through 2022!
for year in np.arange(2015, 2023, 1):
display((plot_monthly_stats(precip_df, year) + plot_monthly_stats(tmax_df, year) + plot_monthly_stats(tmin_df, year)).cols(1))
Conclusions#
It has been an absolutely beautiful fall in the Chicago area (and the Upper Midwest in general). Here is one of my pictures from mid-October, taken at Matthiessen State Park in Oglesby, Illinois.
The availability of these cloud-hosted datasets is incredible; their analysis-ready nature lowers the barrier to doing science. Instead of parsing thousands of text files or merging tough-to-align CSVs, people can access analysis-ready, cloud-optimized datasets (stored in Parquet in this case) and jump right into their analysis.
Thank you NOAA and AWS for making these datasets available, Ned Haughton for the station table parsing code, and Ben for the motivation.