Data Preparation
The scripts in data_prep/ regenerate EWB's source datasets: case bounding
boxes, observation archives, and model stores. Most users never need to run
them. Run them when you need to extend the case set, update an observation
archive, or rebuild a data store from scratch.
All scripts must be run from the repository root.
Plot Temperature Events
File: data_prep/plot_temperature_events.py
Plots the maximum number of consecutive heat wave or cold snap days for a
single case from events.yaml. Auto-detects event type from the case record.
Also exports max_consecutive_days and plot_consecutive_map, which
heat_cold_bounds_global.py and heat_cold_bounds_case.py import directly.
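The "maximum number of consecutive days" statistic can be illustrated with a minimal sketch. This is not the script's max_consecutive_days function, whose signature is not documented here; longest_run is a hypothetical helper that works on a plain sequence of daily flags for a single grid point.

```python
from typing import Iterable


def longest_run(flags: Iterable[bool]) -> int:
    """Return the length of the longest run of consecutive True values."""
    best = 0
    current = 0
    for flag in flags:
        # Extend the current run on True, reset it on False.
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Hypothetical daily threshold-exceedance flags for one grid point.
daily_flags = [False, True, True, True, False, True, True]
print(longest_run(daily_flags))  # prints 3
```

Applied to every grid point in a case's bounding box, a count like this yields the kind of map the script plots.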
Usage
python data_prep/plot_temperature_events.py \
--case-id-number 2 \
--output case_2_consecutive_heatwave_days.png
Output
One PNG at the path given by --output. Default filename when --output is
omitted: case_N_consecutive_{heatwave|cold_snap}_days.png in the current
directory.
Heat / Cold Bounds — Global Detection
File: data_prep/heat_cold_bounds_global.py
Scans ERA5 2 m temperature over a date range and detects heat wave and cold
snap events globally over land. A heat wave requires the daily maximum to
exceed the 85th percentile for 3+ consecutive days; a cold snap requires the
daily minimum to fall below the 15th percentile. Spatiotemporal blobs are
tracked and terminated when their area drops below 50 % of peak. Produces a
CSV of bounding boxes and two PNG maps.
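The termination rule can be sketched as follows. This is an illustration under two assumptions: "peak" means the largest area seen so far for that blob, and the blob is cut at the first time step that falls below the cutoff. The function name termination_index and the sample areas are hypothetical, not taken from the script.

```python
from typing import Sequence


def termination_index(areas: Sequence[float], fraction: float = 0.5) -> int:
    """Return the index of the first step whose area is below fraction * running peak.

    Returns len(areas) when the area never drops below the cutoff.
    """
    peak = 0.0
    for i, area in enumerate(areas):
        peak = max(peak, area)  # running peak up to and including this step
        if area < fraction * peak:
            return i
    return len(areas)


# Hypothetical blob areas per day (arbitrary units).
areas = [100.0, 400.0, 900.0, 700.0, 400.0, 300.0]
cut = termination_index(areas)
kept = areas[:cut]
print(cut, kept)  # prints 4 [100.0, 400.0, 900.0, 700.0]
```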
Usage
python data_prep/heat_cold_bounds_global.py \
--start-date 2023-06-01 \
--end-date 2023-09-01 \
--output heat_cold_global.csv \
--n-workers 4
Output
CSV at --output with columns label, event_type, start_date,
end_date, latitude_min/max, longitude_min/max. Two PNG maps saved
alongside it: <stem>_heatwave.png and <stem>_cold_snap.png.
Heat / Cold Bounds — Case Validation
File: data_prep/heat_cold_bounds_case.py
For each heat wave or cold snap case in events.yaml, iteratively expands the
existing bounding box by 2° per side until fewer than 50 % of edge grid points
exceed the climatological threshold or until 10 iterations have run. Processes
all cases in parallel via joblib.
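The expansion loop can be sketched as below. The names Box, expand_box, and edge_fraction are hypothetical; the real script evaluates edge grid points against ERA5 climatology, which is replaced here by a caller-supplied function. Checking the edge before each expansion and clamping latitude to the poles are assumptions of this sketch, and longitude wrapping is ignored.

```python
from typing import Callable, NamedTuple


class Box(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def expand_box(
    box: Box,
    edge_fraction: Callable[[Box], float],
    step_deg: float = 2.0,
    stop_below: float = 0.5,
    max_iterations: int = 10,
) -> tuple[Box, int]:
    """Grow box by step_deg per side until the edge fraction drops below stop_below."""
    iterations = 0
    while iterations < max_iterations and edge_fraction(box) >= stop_below:
        box = Box(
            max(box.lat_min - step_deg, -90.0),  # clamp at the poles (assumption)
            min(box.lat_max + step_deg, 90.0),
            box.lon_min - step_deg,
            box.lon_max + step_deg,
        )
        iterations += 1
    return box, iterations


def toy_edge_fraction(box: Box) -> float:
    """Toy stand-in: the event 'fades' once the box is at least 20 degrees tall."""
    return 0.9 if (box.lat_max - box.lat_min) < 20.0 else 0.1


final_box, n_iter = expand_box(Box(30.0, 40.0, -100.0, -90.0), toy_edge_fraction)
print(final_box, n_iter)
```

With the toy edge function the box grows three times, from 10 to 22 degrees of latitude, and then stops.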
Usage
python data_prep/heat_cold_bounds_case.py \
--output heat_cold_yaml.csv \
--n-workers 4
Output
CSV at --output with final bounding boxes. One PNG per case saved to the
same directory as --output, named
case_<id>_consecutive_{heatwave|cold_snap}_days.png.
Generate GHCNh
File: data_prep/generate_ghcnh.py
Downloads GHCNh station data for 2020–2024 from NCEI, aggregates to hourly
resolution, applies QC filtering, and appends to a single parquet file.
Already-processed station-year combinations are skipped on re-run. Downloads
run concurrently via asyncio, up to 1000 at a time.
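The skip-and-throttle pattern described above can be sketched with the standard library alone. This is not the script's code: fetch_station_year and download_missing are hypothetical names, the station IDs are made up, and the network request is replaced by a placeholder so the example runs offline. Using asyncio.Semaphore to cap concurrency is an assumption about one reasonable way to do it.

```python
import asyncio


async def fetch_station_year(
    key: tuple[str, int], limit: asyncio.Semaphore
) -> tuple[str, int]:
    """Stand-in for one download; a real version would issue an HTTP request here."""
    async with limit:  # at most max_concurrent coroutines pass this point at once
        await asyncio.sleep(0)  # placeholder for network I/O
        return key


async def download_missing(
    wanted: list[tuple[str, int]],
    already_done: set[tuple[str, int]],
    max_concurrent: int = 1000,
) -> list[tuple[str, int]]:
    """Fetch only the station-year pairs that have not been processed yet."""
    limit = asyncio.Semaphore(max_concurrent)
    pending = [key for key in wanted if key not in already_done]
    return list(await asyncio.gather(*(fetch_station_year(k, limit) for k in pending)))


wanted = [("STATION_A", 2020), ("STATION_A", 2021), ("STATION_B", 2020)]
done = {("STATION_A", 2020)}
fetched = asyncio.run(download_missing(wanted, done))
print(fetched)  # prints [('STATION_A', 2021), ('STATION_B', 2020)]
```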
Dependencies
pip install aiohttp nest_asyncio
Usage
python data_prep/generate_ghcnh.py
Output
ghcnh_all_2020_2024.parq in the current directory.
AR Bounds
File: data_prep/ar_bounds.py
Calculates bounding boxes for atmospheric river cases from events.yaml.
Runs IVT-based AR detection on ERA5, identifies the largest AR object per
case using connected-component labelling, and adds a spatial buffer. Processes
cases in parallel (8 workers by default) on a Dask cluster; a local cluster
is started automatically.
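To illustrate the "largest object" step, here is a dependency-free sketch of connected-component labelling on a small boolean mask. It is not the script's implementation: the function name is hypothetical, 4-connectivity is an assumption, and the bounds are returned as grid indices rather than coordinates. In practice a library routine such as scipy.ndimage.label is the usual tool for this; whether the script uses it is not stated here.

```python
from collections import deque
from typing import Optional


def largest_object_bounds(
    mask: list[list[bool]],
) -> Optional[tuple[int, int, int, int]]:
    """Return (row_min, row_max, col_min, col_max) of the largest 4-connected True region."""
    n_rows = len(mask)
    n_cols = len(mask[0]) if mask else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    best_cells: list[tuple[int, int]] = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected object.
            cells = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if inside and mask[rr][cc] and not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            if len(cells) > len(best_cells):
                best_cells = cells
    if not best_cells:
        return None
    rows = [r for r, _ in best_cells]
    cols = [c for _, c in best_cells]
    return min(rows), max(rows), min(cols), max(cols)


# Toy exceedance mask containing a 1-cell object and a 4-cell object.
mask = [
    [True, False, False, False],
    [False, False, True, True],
    [False, False, True, True],
]
print(largest_object_bounds(mask))  # prints (1, 2, 2, 3)
```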
Usage
python data_prep/ar_bounds.py
Output
ar_bounds_results_enhanced.pkl in the current directory. Load with
pickle.load — each element is a dict with keys case_id, title,
ar_largest_object_bounds, buffered_bounds, and diagnostics.
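A minimal loading sketch follows. It assumes the pickle holds an iterable of the dicts described above; load_ar_bounds and summarize are hypothetical helpers, not part of the script. As with any pickle, only load files you generated or trust.

```python
import pickle
from pathlib import Path
from typing import Any


def load_ar_bounds(path: str = "ar_bounds_results_enhanced.pkl") -> list[dict[str, Any]]:
    """Load the pickled AR results as a list of per-case dicts."""
    with Path(path).open("rb") as handle:  # pickle files must be opened in binary mode
        return list(pickle.load(handle))


def summarize(records: list[dict[str, Any]]) -> dict[Any, Any]:
    """Map each case_id to its buffered bounds."""
    return {rec["case_id"]: rec["buffered_bounds"] for rec in records}
```

After a run of ar_bounds.py, summarize(load_ar_bounds()) gives a quick per-case overview of the buffered boxes.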
IBTrACS Bounds
File: data_prep/ibtracs_bounds.py
Downloads IBTrACS CSV from NCEI, computes a track-based bounding box for each
tropical cyclone case in events.yaml, and writes the updated bounds back to
the installed package's events.yaml in place. Logs which cases were changed.
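A track-based bounding box reduces to the extremes of the track coordinates. The sketch below is illustrative only: track_bounds, its pad_deg parameter, the key names, and the sample track are hypothetical, and tracks that cross the antimeridian are not handled.

```python
def track_bounds(
    track: list[tuple[float, float]], pad_deg: float = 0.0
) -> dict[str, float]:
    """Bounding box around (latitude, longitude) track points, padded by pad_deg."""
    if not track:
        raise ValueError("track must contain at least one point")
    lats = [lat for lat, _ in track]
    lons = [lon for _, lon in track]
    return {
        "latitude_min": max(min(lats) - pad_deg, -90.0),
        "latitude_max": min(max(lats) + pad_deg, 90.0),
        "longitude_min": min(lons) - pad_deg,
        "longitude_max": max(lons) + pad_deg,
    }


# Hypothetical three-point track as (latitude, longitude) pairs.
track = [(14.5, -45.0), (17.0, -52.5), (21.0, -60.0)]
box = track_bounds(track, pad_deg=5.0)
print(box)
```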
Usage
python data_prep/ibtracs_bounds.py
Output
Modifies src/extremeweatherbench/data/events.yaml in place. No separate
output file is created.
Severe Convection Bounds
File: data_prep/severe_convection_bounds.py
Creates bounding boxes around PPH non-zero regions for severe convection cases
from events.yaml. Applies a 250 km buffer by default. Requires a precomputed
PPH DataArray as input (produced by
practically_perfect_hindcast_from_lsr.py).
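Applying a buffer given in kilometres to a latitude/longitude box requires a conversion to degrees. The sketch below shows one simple approximation and is not the script's method: buffer_box is a hypothetical helper, 111 km per degree of latitude is a rounded constant, and the longitude conversion uses the box latitude farthest from the equator so the buffer is at least buffer_km wide everywhere in the box. It is not meant for boxes that touch the poles.

```python
import math

KM_PER_DEG_LAT = 111.0  # rounded length of one degree of latitude


def buffer_box(
    lat_min: float,
    lat_max: float,
    lon_min: float,
    lon_max: float,
    buffer_km: float = 250.0,
) -> tuple[float, float, float, float]:
    """Pad a lat/lon box by roughly buffer_km on every side."""
    dlat = buffer_km / KM_PER_DEG_LAT
    # A degree of longitude shrinks with cos(latitude), so more degrees are
    # needed at high latitudes to cover the same distance.
    widest_lat = max(abs(lat_min), abs(lat_max))
    dlon = buffer_km / (KM_PER_DEG_LAT * math.cos(math.radians(widest_lat)))
    return (
        max(lat_min - dlat, -90.0),
        min(lat_max + dlat, 90.0),
        lon_min - dlon,
        lon_max + dlon,
    )


padded = buffer_box(30.0, 40.0, -100.0, -90.0, buffer_km=250.0)
print(padded)
```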
Usage
from data_prep.severe_convection_bounds import main
bounding_boxes, df = main(
    pph_data="practically_perfect_hindcast_20200104_20250927.zarr",
    events_yaml_path="src/extremeweatherbench/data/events.yaml",
    output_path="data_prep/pph_severe_convection_bounding_boxes",
    buffer_km=250,
)
Or pass a PPH path directly from the command line:
python data_prep/severe_convection_bounds.py \
practically_perfect_hindcast_20200104_20250927.zarr
Output
<output_path>.csv and <output_path>.yaml with bounding box records.
Practically Perfect Hindcast from LSR
File: data_prep/practically_perfect_hindcast_from_lsr.py
Computes the Practically Perfect Hindcast (PPH) from Local Storm Report (LSR)
data using a Gaussian smoothing method (Hitchens et al. 2013). Reads from the
EWB public LSR target at gs://extremeweatherbench. Runs all valid times in
parallel via joblib. Stores results as a dense zarr archive.
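The Gaussian smoothing step can be illustrated on a tiny grid. This is a generic kernel sum written for readability, not the script's implementation: gaussian_smooth is a hypothetical name, sigma is expressed in grid cells, and the value 1.5 used below is purely illustrative rather than taken from the script.

```python
import math


def gaussian_smooth(grid: list[list[float]], sigma: float) -> list[list[float]]:
    """Spread each non-zero cell over its neighbours with an isotropic Gaussian kernel."""
    radius = int(math.ceil(3 * sigma))  # truncate the kernel at 3 sigma
    n_rows, n_cols = len(grid), len(grid[0])
    norm = 1.0 / (2.0 * math.pi * sigma**2)
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] == 0:
                continue
            for rr in range(max(0, r - radius), min(n_rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(n_cols, c + radius + 1)):
                    dist_sq = (rr - r) ** 2 + (cc - c) ** 2
                    out[rr][cc] += grid[r][c] * norm * math.exp(-dist_sq / (2.0 * sigma**2))
    return out


# One storm report in the centre of a 7x7 grid.
reports = [[0.0] * 7 for _ in range(7)]
reports[3][3] = 1.0
smoothed = gaussian_smooth(reports, sigma=1.5)
print(round(smoothed[3][3], 4), round(smoothed[3][4], 4))
```

The smoothed field peaks at the report location and decays symmetrically with distance from it.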
Usage
python data_prep/practically_perfect_hindcast_from_lsr.py
Output
practically_perfect_hindcast_20200104_20250927.zarr in the current directory.
Combined LSR Processing
File: data_prep/combined_lsr_processing.py
Downloads US Local Storm Report data from the NOAA Storm Prediction Center
(SPC) for 2020–2025, adds Canadian and Australian storm reports, and writes
the combined dataset to parquet. Runs verification checks on the output row
count and date coverage.
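The kind of verification described above might look like the sketch below. The script's actual checks are not documented here, so coverage_problems, its parameters, and the sample dates are all hypothetical; the function only demonstrates checking a row count and the date span of a list of report dates.

```python
from datetime import date


def coverage_problems(
    report_dates: list[date],
    expected_start: date,
    expected_end: date,
    min_rows: int = 1,
) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(report_dates) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(report_dates)}")
    if report_dates:
        first, last = min(report_dates), max(report_dates)
        if first > expected_start:
            problems.append(f"first report {first} is after {expected_start}")
        if last < expected_end:
            problems.append(f"last report {last} is before {expected_end}")
    return problems


# Hypothetical report dates spanning the expected window.
dates = [date(2020, 1, 1), date(2022, 6, 15), date(2025, 9, 27)]
print(coverage_problems(dates, date(2020, 1, 1), date(2025, 9, 27)))  # prints []
```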
Dependencies
pip install aiohttp
Usage
python data_prep/combined_lsr_processing.py
Output
combined_canada_australia_us_lsr_01012020_09272025.parq in the current
directory.
CIRA Icechunk Generation
File: data_prep/cira_icechunk_generation.py
Builds an icechunk store for CIRA MLWP model data. Reads model files from
s3://noaa-oar-mlwp-data via VirtualiZarr and writes the resulting
DataTree to the EWB GCS bucket. Requires GCS write credentials for
gs://extremeweatherbench.
Usage
# Set application credentials in the script before running:
# storage = icechunk.gcs_storage(..., application_credentials="/path/to/creds.json")
python data_prep/cira_icechunk_generation.py
Output
Writes to the cira-icechunk prefix in the extremeweatherbench GCS bucket.
Convert to Kerchunk
File: data_prep/convert_to_kerchunk.py
Provides two functions for converting CIRA MLWP NetCDF files from S3 to
kerchunk virtual references. generate_json_from_nc scans a single file and
writes a JSON; xarray_dataset_from_json_list combines a list of JSONs into
a single virtual zarr Dataset. No command-line entry point.
Usage
import fsspec
from data_prep.convert_to_kerchunk import (
    generate_json_from_nc,
    xarray_dataset_from_json_list,
)
fs_read = fsspec.filesystem("s3", anon=True)
fs_out = fsspec.filesystem("file")
so = {"anon": True}
json_list = generate_json_from_nc(
    file_url="s3://noaa-oar-mlwp-data/FourCastNetv2/...",
    fs_read=fs_read,
    fs_out=fs_out,
    so=so,
    json_dir="/tmp/cira_jsons/",
)
ds = xarray_dataset_from_json_list(
    json_list=json_list,
    combined_json_directory="/tmp/cira_jsons/",
    fs_out=fs_out,
)
Output
Per-file JSON references in json_dir and a combined.json in
combined_json_directory. xarray_dataset_from_json_list returns an
xr.Dataset backed by the combined reference.
Generate CAPE Reference Data
File: data_prep/generate_cape_reference_data.py
Fetches ERA5 atmospheric profiles from ARCO-ERA5, computes CAPE and CIN with
MetPy, and saves representative profiles as .npz files for unit testing.
Also generates synthetic pathological profiles covering edge cases. Requires
GCP application-default credentials and uv.
Dependencies
pip install metpy
Dependencies are also declared inline for uv, so uv run resolves them without
a separate install.
Usage
gcloud auth application-default login
uv run data_prep/generate_cape_reference_data.py
Output
tests/data/era5_reference.npz and tests/data/pathological_profiles.npz.