Some tips with xarray and pandas

Some tips with xarray and pandas#

We have massively different levels here
Try to make some aims for technical skills you can learn!
If you are beginning with python → learn the basics
If you are good at basic python → learn new packages and tricks
If you know all the packages → improve your skills with producing your own software, organising your code etc.
If you don’t know git and github → get better at this!
Learn from each other!

Please feel free to come with suggestions and extra input as we go! Questions:

How many have used jupyter lab?
How many have used pandas?
How many have used xarray?

What are pandas and xarray?#

Pandas → like a spreadsheet 2D data with columns and rows
xarray → like pandas, but in N dimensions
Use the functionality these packages gives you! Will help you avoid mistakes. Try to get as good as possible :)

Pandas#

(Source: https://www.geeksforgeeks.org/python-pandas-dataframe/)

Xarray#

(Source: https://docs.xarray.dev/)

1. Read in CMIP6 data: We will skip this next part, but you can check it later to read data:#

Import python packages#

import xarray as xr
xr.set_options(display_style='html')
import intake
import cftime
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import numpy as np
import pandas as pd
import datetime
import seaborn as sns

#cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
#cat_url = "/mnt/craas1-ns9989k-geo4992/data/cmip6.json"
cat_url = '/mnt/craas1-ns9989k-geo4992/data/catalogs/cmip6.json'
col = intake.open_esm_datastore(cat_url)
col

cmip6 catalog with 155 dataset(s) from 536945 asset(s):

	unique
variable_id	583
table_id	24
source_id	75
experiment_id	94
member_id	190
grid_label	11
time_range	9100
activity_id	18
institution_id	35
version	577
path	536945
dcpp_init_year	0
derived_variable_id	0

Search corresponding data#

Please check here for info about CMIP and variables :)

Particularly useful is maybe the variable search which you find here: https://clipc-services.ceda.ac.uk/dreq/mipVars.html

cat = col.search(source_id = ['CESM2'], 
                 experiment_id=['historical'], 
                 table_id=['Amon','fx','AERmon'], 
                 variable_id=['tas','hurs', 'areacella','mmrso4' ], 
                 member_id=['r1i1p1f1'],
                )
cat.df

	variable_id	table_id	source_id	experiment_id	member_id	grid_label	time_range	activity_id	institution_id	version	path	dcpp_init_year
0	mmrso4	AERmon	CESM2	historical	r1i1p1f1	gn	185001-201412	CMIP	NCAR	v20190308	/mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/CMIP/NC...	NaN
1	hurs	Amon	CESM2	historical	r1i1p1f1	gn	185001-201412	CMIP	NCAR	v20190308	/mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/CMIP/NC...	NaN
2	tas	Amon	CESM2	historical	r1i1p1f1	gn	185001-201412	CMIP	NCAR	v20190308	/mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/CMIP/NC...	NaN
3	areacella	fx	CESM2	historical	r1i1p1f1	gn	NaN	CMIP	NCAR	v20190308	/mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/CMIP/NC...	NaN

cat.esmcat.aggregation_control.groupby_attrs = ['activity_id','experiment_id', 'source_id','table_id','grid_label']
cat.esmcat.aggregation_control.groupby_attrs

['activity_id', 'experiment_id', 'source_id', 'table_id', 'grid_label']

Create dictionary from the list of datasets we found#

This step may take several minutes so be patient!

dset_dict = cat.to_dataset_dict(zarr_kwargs={'use_cftime':True})

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.experiment_id.source_id.table_id.grid_label'

100.00% [3/3 00:05<00:00]

/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'areacella' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'hurs' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'mmrso4' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'tas' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)

list(dset_dict.keys())

['CMIP.historical.CESM2.fx.gn',
 'CMIP.historical.CESM2.AERmon.gn',
 'CMIP.historical.CESM2.Amon.gn']

Since I have already checked that these datasets are on the same grid, we can merge them:

ds_list =[]
for k in dset_dict.keys():
    ds = dset_dict[k]
    for v in ['lon_bnds', 'lat_bnds', 'time_bnds']:
        if v in ds:
            ds= ds.drop(v)
    ds_list.append(ds)     
ds = xr.merge(ds_list,compat='override')

/tmp/ipykernel_14081/179335234.py:6: DeprecationWarning: dropping variables using `drop` is deprecated; use drop_vars.
  ds= ds.drop(v)

1.1 Reading in the data from file:#

Open dataset#

Use xarray python package to analyze netCDF dataset
open_dataset allows to get all the metadata without loading data into memory.
with xarray, we only load into memory what is needed.

path='filename.nc'
ds = xr.open_dataset(path)

Opening multiple files:#

list_of_files = [
    'file1.nc',
    'file2.nc'
]
xr.open_mfdataset(list_of_files, concat_dim='time',combine='by_coords')

2. Check how your dataset looks#

NetCDF + xarray = <3

NetCDF (Network Common Data Form) is a machine-independent data format (and software) that support the creation, access, and sharing of array-oriented scientific data. It was originally developed by UCAR (University Corporation for Atmospheric Research), and it’s widely used in the atmospheric and oceanographic sciences, as well as in other fields such as Earth sciences, geophysics, and climatology.

What is really great is that it keeps a lot of metadata (see below)

Xarray is a Python library designed to work with multi-dimensional arrays and datasets, particularly those used in earth sciences, climate science, and atmospheric science. It builds upon and extends the functionality of NumPy, Pandas, and NetCDF, providing a high-level interface for working with labeled, multi-dimensional data.

Different types of information/data:#

Coordinates
Data variables
Global attributes
Variable attributes
Other?

ds

<xarray.Dataset> Size: 15GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 1980, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 14GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 438MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 438MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 438MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

3. Sometimes we want to do some nice tweaks before we start:#

Selecting data and super quick plotting:#

xarray loads data only when it needs to (it’s lazy, someone else can explain), and you might want to early on define the subset of data you want to look at so that you don’t end up loading a lot of extra data.

See here for nice overview#

In order to reduce unecessary calculations and loading of data, think about which part of the data you want, and slice early on.

Slice in time: the sel method#

dss = ds.sel(time = slice('1990-01-01','2010-01-01'))

dss

<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

NB: Always check that the function you are using is doing what you expect!

You might even select only the arctic e.g.:

dss_arctic = dss.sel(lat = slice(60,None))
dss_arctic

<xarray.Dataset> Size: 310MB
Dimensions:    (member_id: 1, lat: 32, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 256B 60.79 61.73 62.67 63.61 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 37kB dask.array<chunksize=(1, 32, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 283MB dask.array<chunksize=(1, 1, 16, 32, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 9MB dask.array<chunksize=(1, 32, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 9MB dask.array<chunksize=(1, 1, 32, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 9MB dask.array<chunksize=(1, 1, 32, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

isel, sel: index selecting#

Select the surface (which in this case is the last index :)

dss

<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

dss_s = dss.isel(lev=-1)

dss_s

<xarray.Dataset> Size: 213MB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
    lev        float64 8B -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          float64 8B dask.array<chunksize=(), meta=np.ndarray>
    b          float64 8B dask.array<chunksize=(), meta=np.ndarray>
    b_bnds     (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    lev_bnds   (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    a_bnds     (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

3.2 Calculates variables and assign attributes!#

Nice for plotting and to keep track of what is in your dataset (especially ‘units’ and ‘standard_name’/’long_name’ will be looked for by xarray.

dss['T_C'] = dss['tas']-273.15

dss['T_C'] = dss['T_C'].assign_attrs({'units': '$^\circ$C'})

dss['T_C']

<xarray.DataArray 'T_C' (member_id: 1, time: 240, lat: 192, lon: 288)> Size: 53MB
dask.array<sub, shape=(1, 240, 192, 288), dtype=float32, chunksize=(1, 1, 192, 288), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Attributes:
    units:    $^\circ$C

May always be small things you need to adjust:

dss['time']

<xarray.DataArray 'time' (time: 240)> Size: 2kB
array([cftime.DatetimeNoLeap(1990, 1, 15, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(1990, 2, 14, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(1990, 3, 15, 12, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2009, 10, 15, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2009, 11, 15, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2009, 12, 15, 12, 0, 0, 0, has_year_zero=True)],
      dtype=object)
Coordinates:
  * time     (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Attributes:
    axis:           T
    bounds:         time_bnds
    standard_name:  time
    title:          time
    type:           double

This calendar is in cftime and noLeap. Sometimes this causes issues when plotting timeseries, so just for fun we will convert to a standard datetime64 index calendar because it’s anyway monthly.

dss['time'] = dss['time'].to_dataframe().index.to_datetimeindex()

/tmp/ipykernel_14081/4195133999.py:1: RuntimeWarning: Converting a CFTimeIndex with dates from a non-standard calendar, 'noleap', to a pandas.DatetimeIndex, which uses dates from the standard calendar.  This may lead to subtle errors in operations that depend on the length of time between dates.
  dss['time'] = dss['time'].to_dataframe().index.to_datetimeindex()

dss['time']

<xarray.DataArray 'time' (time: 240)> Size: 2kB
array(['1990-01-15T12:00:00.000000000', '1990-02-14T00:00:00.000000000',
       '1990-03-15T12:00:00.000000000', ..., '2009-10-15T12:00:00.000000000',
       '2009-11-15T00:00:00.000000000', '2009-12-15T12:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T12:...

We get a warning, but it's ok as long as we know what we are doing.

3.3 Convert longitude:#

this data comes in 0–360 degrees, but often -180 to 180 is more convenient. So we can convert:

NOTE: Maybe you want to put this in a module? Or a package..

dss

<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    T_C        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

def convert_lon_to_180(ds):
    ds['lon'] =(ds['lon']+ 180) % 360 - 180
    ds = ds.sortby('lon')
    return ds

def convert_lon_to_360(ds):
    ds['lon'] =ds['lon']% 360
    ds = ds.sortby('lon')
    return lon % 360

(migth want to move this to a module!)

dss = convert_lon_to_180(dss)

dss['lon'].attrs['units'] = '$^\circ$ East'

Notice how the labels use both the attribute “standard_name” and “units” from the dataset.

4. The easiest interpolation: select with ‘nearest’ neighboor#

Example: let’s select zeppelin station: 78.906661, 11.889203

lat_zep = 78.906661
lon_zep = 11.889203

dss['T_C'].sel(lon=lon_zep, lat=lat_zep, method='nearest').plot()

[<matplotlib.lines.Line2D at 0x7f344060d090>]

../../../_images/71d4a69eb7ff1affe22d2964504823c8882372ee3cbfc72abb3daa32ff57b0f8.png

Super quick averaging etc#

da_so4 = dss['mmrso4']
da_so4

<xarray.DataArray 'mmrso4' (member_id: 1, time: 240, lev: 32, lat: 192, lon: 288)> Size: 2GB
dask.array<getitem, shape=(1, 240, 32, 192, 288), dtype=float32, chunksize=(1, 1, 16, 96, 144), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4

Mean:

da_so4.mean(['time','lon'], keep_attrs=True).plot()#ylim=[1000,100], yscale='log')

<matplotlib.collections.QuadMesh at 0x7f34404d8750>

../../../_images/e6fcd83cfec84529417b8e3848bf15bd92b6d4adb7cda47fb82889de5887ac35.png

da_so4['lev'] = np.abs(da_so4['lev'].values)

da_so4.mean(['time','lon'], keep_attrs=True).plot(
    ylim=[1000,100], 
    yscale='log'
)

<matplotlib.collections.QuadMesh at 0x7f34400b9490>

../../../_images/039c36f23c77a222dd931148063701e86fe9cacc305210f55e3be9c5d6043049.png

Standard deviation

dss['T_C'].std(['time']).plot()

<matplotlib.collections.QuadMesh at 0x7f34142cfa10>

../../../_images/ef2b08413f971f9d652d14783f979df3c0ad0b04d5a86ad2656f5b2fa985924e.png

Temperature change much stronger over land than ocean and higher at high latitudes…

5. Mask data and groupby: pick out seasons#

month = ds['time.month']

month

<xarray.DataArray 'month' (time: 1980)> Size: 16kB
array([ 1,  2,  3, ..., 10, 11, 12])
Coordinates:
  * time     (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00

dss_JA = dss.where(month.isin([7,8])).mean('time')

dss_JA

<xarray.Dataset> Size: 8MB
Dimensions:    (member_id: 1, lat: 192, lon: 288, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, lev, lat, lon) float32 7MB dask.array<chunksize=(1, 16, 96, 144), meta=np.ndarray>
    ps         (lat, lon) float32 221kB dask.array<chunksize=(192, 288), meta=np.ndarray>
    p0         float32 4B nan
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    tas        (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    T_C        (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>

dss_season = dss.groupby('time.season').mean(keep_attrs=True)

dss_season

<xarray.Dataset> Size: 33MB
Dimensions:    (season: 4, member_id: 1, lev: 32, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * season     (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Dimensions without coordinates: nbnd
Data variables:
    mmrso4     (season, member_id, lev, lat, lon) float32 28MB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (season, lat, lon) float32 885kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    hurs       (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    T_C        (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    areacella  (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(4, 1, 192, 288), meta=np.ndarray>
    p0         (season) float32 16B 1e+05 1e+05 1e+05 1e+05
    a          (season, lev) float64 1kB dask.array<chunksize=(4, 32), meta=np.ndarray>
    b          (season, lev) float64 1kB dask.array<chunksize=(4, 32), meta=np.ndarray>
    b_bnds     (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
    lev_bnds   (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
    a_bnds     (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

da_TC = dss_season['T_C']

%%time
da_TC.plot(col='season')

CPU times: user 21min 5s, sys: 3min 15s, total: 24min 20s
Wall time: 3min 13s

<xarray.plot.facetgrid.FacetGrid at 0x7f344060fa50>

../../../_images/8c90e16a44b0835926ff28b6c6ded8ff67326625dd7fe720322319dd52a2bb79.png

Tip: Might want to load or compute.#

from dask.diagnostics import ProgressBar

%%time
with ProgressBar():
    da_TC.load()

[########################################] | 100% Completed | 51.84 s
CPU times: user 10min 8s, sys: 3min 37s, total: 13min 46s
Wall time: 51.9 s

6. Controle the plot visuals:#

%%time
da_TC.plot(
    col = 'season',
    cmap = 'coolwarm',
    xlim = [-100,100],
    cbar_kwargs = {'label':'Temperature [$^\circ$C]'}
    )

CPU times: user 644 ms, sys: 55.8 ms, total: 700 ms
Wall time: 702 ms

<xarray.plot.facetgrid.FacetGrid at 0x7f341c7678d0>

../../../_images/6933b9c9ae339cf90e2dd779ad54e7bec328291319d198c640d75fd5ec51cf65.png

da_TC = da_TC.assign_attrs({'long_name': 'Temperature near surface'})

da_TC.mean('season', keep_attrs=True).plot()

<matplotlib.collections.QuadMesh at 0x7f33b044f9d0>

../../../_images/72ee3e980f8d1b03bbf299321453e47a73218620a84549509315245c4770b45e.png

7. Plotting with cartopy#

See more here:

import cartopy as cy
import cartopy.crs as ccrs

da_plt = dss['mmrso4'].isel(lev=-1).mean('time', keep_attrs=True).squeeze()#('member_id')

da_plt

<xarray.DataArray 'mmrso4' (lat: 192, lon: 288)> Size: 221kB
dask.array<getitem, shape=(192, 288), dtype=float32, chunksize=(96, 144), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
    member_id  <U8 32B 'r1i1p1f1'
    lev        float64 8B -992.6
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4

from matplotlib.colors import LogNorm

with ProgressBar():
    da_plt.load()

[########################################] | 100% Completed | 3.53 sms

My headline#

my comments

comment number two#

f, ax = plt.subplots(1, 1, dpi=150, 
                    subplot_kw={'projection':ccrs.Robinson()}
                   )

da_plt.plot.pcolormesh(
    cmap = 'Reds',
    ax=ax,
    transform = ccrs.PlateCarree(), 
    robust=True,
    norm = LogNorm(), # to add log scale colorbar
    #cbar_kwargs={
    #    #'label':'Wind Speed [m/s]', 
    #    'orientation':'horizontal',
    #    'shrink':.8
    #},
    #levels = 6
)
ax.set_title('Bottom level: Mean over Time')


ax.coastlines(linewidth=.1)
gl = ax.gridlines(draw_labels=True,#{'bottom':True, 'left':True},
                  #xlocs=[-180,-140,0,140, 180],
                  #ylocs=[-90,-45,0,45,90],
                  #x_inline=False,
                  #y_inline=False,
                  color='k',
                  linestyle='dotted'
                 )

../../../_images/53ac1fb8f7ed3d202e4a06cad9e5dbbb833bd59d5240d8499df35e6444a60014.png

8. Global average: weighted means!#

dss['T_C'].mean().compute()

<xarray.DataArray 'T_C' ()> Size: 4B
array(5.4478564, dtype=float32)

Why is this wrong?

dss['T_C'].weighted(dss['areacella']).mean().compute()

<xarray.DataArray 'T_C' ()> Size: 4B
array(14.715008, dtype=float32)

dss['areacella'].plot()

<matplotlib.collections.QuadMesh at 0x7f337c6b7cd0>

../../../_images/6fa4c8115a7b458a32bd156148558805a4ef84fb138baebd954a914b468d8213.png

9. Convert to pandas & do some random fun stuff:#

Maybe we e.g. want to compare with a station, or just use some of the considerable functionalities available from pandas. It’s easy to convert back and forth between xarray and pandas:

A lot of these functions also exist in xarray!

10 Other stuff#

Pick out station:#

lat_kristineberg = 58.24
lon_kristineberg = 11.44
# pick out surface
ds_surf =dss.isel(lev=-1)
ds_kristineberg = ds_surf.sel(lat=lat_kristineberg, lon = lon_kristineberg, method ='nearest')

ds_kristineberg['mmrso4'].plot()

[<matplotlib.lines.Line2D at 0x7f337c547590>]

../../../_images/4aec046e565cedf0f8f98f00c3cc0d635341400ab263eed2841e988ce4d641b7.png

Pick out the vairables we need.

vl = ['mmrso4','hurs','tas','T_C']

ds_kristineberg[vl]

<xarray.Dataset> Size: 6kB
Dimensions:    (member_id: 1, time: 240)
Coordinates:
    lat        float64 8B 57.96
    lon        float64 8B 11.25
  * member_id  (member_id) object 8B 'r1i1p1f1'
    lev        float64 8B -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Data variables:
    mmrso4     (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    hurs       (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    tas        (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    T_C        (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas1-ns9989k-ns9560k/ESGF/CMIP6/...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

Resample:#

ds_yearly = ds_kristineberg.resample(time='YE').mean('time')

ds_yearly['T_C'].plot()
ds_kristineberg['T_C'].plot(linewidth=0, marker='.')

[<matplotlib.lines.Line2D at 0x7f33d07a79d0>]

../../../_images/257f69075a89a3f0cadc79b88c823024cee1a30132bbd89d20d00c4e51c79d5e.png

ds_kristineberg['season'] = ds_kristineberg['time.season']

for s in ['MAM','JJA','SON','DJF']:
    ds_kristineberg.where(ds_kristineberg['season']==s)['T_C'].plot.hist(alpha=0.5, bins=20, label=s)
    
plt.legend()

<matplotlib.legend.Legend at 0x7f337c24c490>

../../../_images/f4fb4d117fb2f3c2327723d33d4bcd0a2681d4ad3758d9fed7ca81627d6c211f.png

Obs: might want to convert to pandas#

At this point we basically have a time series and the dataset is not to big, so might want to convert to pandas:

subset_of_variables = ['mmrso4','hurs','tas','T_C','season']
df_kristineberg = ds_kristineberg[subset_of_variables].squeeze().to_dataframe()
df_kristineberg

	mmrso4	hurs	tas	T_C	season	lat	lon	member_id	lev
time
1990-01-15 12:00:00	9.319389e-10	85.651199	277.036011	3.886017	DJF	57.958115	11.25	r1i1p1f1	-992.556095
1990-02-14 00:00:00	9.139207e-10	79.779701	273.601807	0.451813	DJF	57.958115	11.25	r1i1p1f1	-992.556095
1990-03-15 12:00:00	1.282672e-09	82.473785	277.405701	4.255707	MAM	57.958115	11.25	r1i1p1f1	-992.556095
1990-04-15 00:00:00	7.225737e-10	79.588730	280.137329	6.987335	MAM	57.958115	11.25	r1i1p1f1	-992.556095
1990-05-15 12:00:00	2.263646e-09	80.421875	283.043152	9.893158	MAM	57.958115	11.25	r1i1p1f1	-992.556095
...	...	...	...	...	...	...	...	...	...
2009-08-15 12:00:00	6.469142e-10	82.338539	290.292358	17.142365	JJA	57.958115	11.25	r1i1p1f1	-992.556095
2009-09-15 00:00:00	6.174767e-10	75.105225	288.502136	15.352142	SON	57.958115	11.25	r1i1p1f1	-992.556095
2009-10-15 12:00:00	5.226203e-10	81.029533	284.518219	11.368225	SON	57.958115	11.25	r1i1p1f1	-992.556095
2009-11-15 00:00:00	2.411485e-10	83.641380	278.707977	5.557983	SON	57.958115	11.25	r1i1p1f1	-992.556095
2009-12-15 12:00:00	4.485378e-10	89.487213	277.537415	4.387421	DJF	57.958115	11.25	r1i1p1f1	-992.556095

240 rows × 9 columns

lets do something unnecesarily complicated :D#

Convert to pandas dataframe#

Because more functionality

df = dss.isel(lev=-1)[vl].to_dataframe()

df_ri = df.reset_index()
df_ri.head()

	member_id	time	lat	lon	mmrso4	hurs	tas	T_C	lev
0	r1i1p1f1	1990-01-15 12:00:00	-90.0	-180.00	2.049096e-11	98.689842	242.897797	-30.252197	-992.556095
1	r1i1p1f1	1990-01-15 12:00:00	-90.0	-178.75	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095
2	r1i1p1f1	1990-01-15 12:00:00	-90.0	-177.50	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095
3	r1i1p1f1	1990-01-15 12:00:00	-90.0	-176.25	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095
4	r1i1p1f1	1990-01-15 12:00:00	-90.0	-175.00	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095

qcut, cut#

qcut splits the data into quantile ranges

df_ri['hurs_cat'] = pd.qcut(df_ri['hurs'],
                            q=[0.05,0.17, 0.34,0.66, 0.83,0.95],
                            labels=['very low','low','med','high','very high'])

Cut cuts into categories

df_ri['lat_cat'] = pd.cut(df_ri['lat'], [-90,-60,-30,0,30,60,90], 
                          labels=['S polar','S mid','S tropics', 'N tropic', 'N mid','N polar'])

df_ri

	member_id	time	lat	lon	mmrso4	hurs	tas	T_C	lev	hurs_cat	lat_cat
0	r1i1p1f1	1990-01-15 12:00:00	-90.0	-180.00	2.049096e-11	98.689842	242.897797	-30.252197	-992.556095	very high	NaN
1	r1i1p1f1	1990-01-15 12:00:00	-90.0	-178.75	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095	very high	NaN
2	r1i1p1f1	1990-01-15 12:00:00	-90.0	-177.50	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095	very high	NaN
3	r1i1p1f1	1990-01-15 12:00:00	-90.0	-176.25	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095	very high	NaN
4	r1i1p1f1	1990-01-15 12:00:00	-90.0	-175.00	2.049095e-11	98.675537	242.903198	-30.246796	-992.556095	very high	NaN
...	...	...	...	...	...	...	...	...	...	...	...
13271035	r1i1p1f1	2009-12-15 12:00:00	90.0	173.75	5.152184e-11	106.941933	250.874847	-22.275146	-992.556095	NaN	N polar
13271036	r1i1p1f1	2009-12-15 12:00:00	90.0	175.00	5.151738e-11	106.943703	250.874573	-22.275421	-992.556095	NaN	N polar
13271037	r1i1p1f1	2009-12-15 12:00:00	90.0	176.25	5.151281e-11	106.945244	250.874329	-22.275665	-992.556095	NaN	N polar
13271038	r1i1p1f1	2009-12-15 12:00:00	90.0	177.50	5.150912e-11	106.946594	250.874100	-22.275894	-992.556095	NaN	N polar
13271039	r1i1p1f1	2009-12-15 12:00:00	90.0	178.75	5.150579e-11	106.947731	250.873901	-22.276093	-992.556095	NaN	N polar

13271040 rows × 11 columns

df_ri.sample(n=20000).groupby('lat_cat').mean(numeric_only=True)

/tmp/ipykernel_14081/465231907.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df_ri.sample(n=20000).groupby('lat_cat').mean(numeric_only=True)

	lat	lon	mmrso4	hurs	tas	T_C	lev
lat_cat
S polar	-74.938019	-2.015824	2.571253e-11	95.192390	247.504166	-25.645840	-992.556095
S mid	-45.171888	-1.185364	1.620153e-10	79.848122	282.876770	9.726768	-992.556095
S tropics	-15.054866	0.087900	3.591108e-10	76.485733	297.266754	24.116770	-992.556095
N tropic	15.064793	-0.882442	1.276509e-09	72.494232	298.815186	25.665192	-992.556095
N mid	45.191990	-0.853557	1.281671e-09	72.892700	282.798279	9.648300	-992.556095
N polar	75.383530	0.788571	1.325593e-10	95.514252	263.269287	-9.880700	-992.556095

sns.boxenplot(x="lat_cat", y="hurs",
              color="b",
              #scale="linear", 
              width_method='linear',
              data=df_ri)#.sample(n=20000))

<Axes: xlabel='lat_cat', ylabel='hurs'>

../../../_images/3a201b8fcd8b23131384f115b087688bd15ba9b264527ce1d2dcadf4f36dfc06.png

sns.boxenplot(x="hurs_cat", 
              y="mmrso4",
              color="b",
              width_method='linear',
              data=df_ri.sample(n=20000),
             )

<Axes: xlabel='hurs_cat', ylabel='mmrso4'>

../../../_images/c513bca432bae27279e5e0f84672654fe7073c83d719053ed28195e60a1dd29f.png

sns.displot(x="mmrso4",
            hue='lat_cat',
            log_scale=True,
            kind='kde',
            data=df_ri.sample(n=20000), 
            multiple="stack"
           )

<seaborn.axisgrid.FacetGrid at 0x7f3072d3e8d0>

../../../_images/7fe543c280ed4379c976db8d3dd44a8653b73863039f1503cf5665e77d76194c.png

mask by category#

ds_new.where(ds_new['hurs_cat']=='very low').mean(['time','lon'])['mmrso4'].plot(label='very low')#vmin = 0, vmax = 1.5e-8)
ds_new.where(ds_new['hurs_cat']=='low').mean(['time','lon'])['mmrso4'].plot(label='low')#vmin = 0, vmax = 1.5e-8)

ds_new.where(ds_new['hurs_cat']=='high').mean(['time','lon'])['mmrso4'].plot(label='high')#vmin = 0, vmax = 1.5e-8)
ds_new.where(ds_new['hurs_cat']=='very high').mean(['time','lon'], keep_attrs=True)['mmrso4'].plot(label='very high')#vmin = 0, vmax = 1.5e-8)
plt.legend()

<matplotlib.legend.Legend at 0x7f335b37c490>

../../../_images/76f08dda63a025b448306d6998563dda751ea38300c6bc228f66e9b3ee6b1092.png

Improve and reduce risk of typo:#

for cat in ['very low','low','high','very high']:
    ds_new.where(ds_new['hurs_cat']==cat).mean(['time','lon'])['mmrso4'].plot(label=cat) 
plt.legend()

<matplotlib.legend.Legend at 0x7f335b784d50>