Some tips with xarray and pandas#

  • Skill levels here vary massively

  • Set some goals for the technical skills you want to learn!

    • If you are beginning with Python → learn the basics

    • If you are comfortable with basic Python → learn new packages and tricks

    • If you know all the packages → improve your skills at producing your own software, organising your code, etc.

    • If you don’t know git and GitHub → get better at this!

  • Learn from each other!

  • Every time I have taught this course, I have learned something!

Please feel free to come with suggestions and extra input as we go!

Etherpad for the course

Jupyter notebooks:#

Jupyter is an interactive computing environment that allows users to write, run, and document code in a single interface, typically using notebooks that combine live code, text, equations, and visualizations. It supports multiple programming languages (like Python, R, and Julia) and is widely used in data science, research, and education for reproducible analysis and exploration.

a=1
import numpy as np
  • Cells can be Markdown or code

  • Run a cell with Shift+Enter, Cmd+Enter, or Ctrl+Enter

  • Kernel: you can restart, interrupt, etc.

  • Tricks:

    • Mark multiple places at the same time

    • Cmd/Ctrl + D: select the next occurrence of the same snippet

    • Escape to move out of edit mode

      • Once out of edit mode, you can copy (c), paste (v), insert a cell above (a), insert a cell below (b), etc.

  • Contextual help


What are pandas and xarray?#

  • pandas → like a spreadsheet: 2D data with rows and columns

  • xarray → like pandas, but in N dimensions

  • Use the functionality these packages give you! It will help you avoid mistakes. Try to get as good as possible :)

Pandas#

[Illustration of a pandas DataFrame] (Source: https://www.geeksforgeeks.org/python-pandas-dataframe/)

Xarray#

[Illustration of an xarray Dataset] (Source: https://docs.xarray.dev/)


1. Import python packages#

import xarray as xr
xr.set_options(display_style='html')
import intake
import cftime
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import numpy as np
import pandas as pd
import datetime
import seaborn as sns
%matplotlib inline

2. Pandas basics:#

Can read a range of data formats, e.g.

pd.read_csv('filename.csv')
pd.read_excel('filename.xlsx')
pd.read_sql(query, connection)

See overview: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

fn = '~/shared-craas2-ns9988k-ns9252k/escience2025/data/group3/eScience2024/zeppelin-ebc-2015-2019-1.csv'

Read in as a DataFrame#

import pandas as pd
df = pd.read_csv(fn, index_col=0, parse_dates=True )
df
CVI_status BC_cvi z BC_tot flag visibility
time
2015-11-13 00:00:00 off 0.00 0.30583 0.04 bad 34596.2
2015-11-13 00:01:00 off 0.00 0.43400 0.02 bad 34481.7
2015-11-13 00:02:00 off 0.04 0.53317 0.01 bad 34572.3
2015-11-13 00:03:00 off 0.05 0.60250 0.02 bad 34061.8
2015-11-13 00:04:00 off -0.01 1.16433 0.02 bad 34461.8
... ... ... ... ... ... ...
2019-11-15 10:30:00 off 0.02 1.14300 0.02 good 36983.4
2019-11-15 10:31:00 off 0.00 1.34233 0.01 good 38964.4
2019-11-15 10:32:00 off 0.05 0.68667 -0.00 good 45981.7
2019-11-15 10:33:00 off -0.02 1.00450 -0.00 good 48805.2
2019-11-15 10:34:00 off 0.04 0.86300 0.01 good 47112.9

2107355 rows × 6 columns

Select a column:#

One column is called a Series

df['BC_cvi']
time
2015-11-13 00:00:00    0.00
2015-11-13 00:01:00    0.00
2015-11-13 00:02:00    0.04
2015-11-13 00:03:00    0.05
2015-11-13 00:04:00   -0.01
                       ... 
2019-11-15 10:30:00    0.02
2019-11-15 10:31:00    0.00
2019-11-15 10:32:00    0.05
2019-11-15 10:33:00   -0.02
2019-11-15 10:34:00    0.04
Name: BC_cvi, Length: 2107355, dtype: float64

Select data in general:#

Several ways to do this:

df.loc[<from_index>: <to_index>, <from_column>:<to_column>]
df.loc['2016-01-01':'2016-01-02', 'CVI_status':'BC_cvi']
CVI_status BC_cvi
time
2016-01-01 00:00:00 off 0.13
2016-01-01 00:01:00 off 0.11
2016-01-01 00:02:00 off 0.03
2016-01-01 00:03:00 off 0.03
2016-01-01 00:04:00 off 0.00
... ... ...
2016-01-02 23:55:00 off 0.02
2016-01-02 23:56:00 off 0.00
2016-01-02 23:57:00 off 0.12
2016-01-02 23:58:00 off 0.00
2016-01-02 23:59:00 off 0.07

2880 rows × 2 columns

The same, but using indexes:

df.iloc[0:5, 0:2]
CVI_status BC_cvi
time
2015-11-13 00:00:00 off 0.00
2015-11-13 00:01:00 off 0.00
2015-11-13 00:02:00 off 0.04
2015-11-13 00:03:00 off 0.05
2015-11-13 00:04:00 off -0.01
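Besides slicing by labels or positions, you can select rows with a boolean mask, e.g. to keep only rows with a given flag. A minimal sketch on a small synthetic frame (the column names simply mirror the data above):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the structure of df above
df_demo = pd.DataFrame(
    {'flag': ['bad', 'good', 'good', 'bad'],
     'BC_cvi': [0.00, 0.02, 0.05, -0.01]},
    index=pd.date_range('2016-01-01', periods=4, freq='min'),
)

# Keep only rows flagged 'good'
good = df_demo[df_demo['flag'] == 'good']
print(good['BC_cvi'].mean())
```

On the real data, `df[df['flag'] == 'good']` works the same way.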

Plot something simple:#

df.loc['2016-01-01':'2016-02-01','BC_cvi'].plot()
<Axes: xlabel='time'>
[Line plot of df['BC_cvi'] for 2016-01-01 to 2016-02-01]
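Minute-resolution data makes for noisy line plots, so it often helps to resample first. A sketch of computing daily means with the `'D'` rule, on a synthetic series (calling `.resample` on `df` itself works the same way):

```python
import numpy as np
import pandas as pd

# Synthetic minute-resolution series standing in for df['BC_cvi']
idx = pd.date_range('2016-01-01', '2016-01-03', freq='min')
s = pd.Series(np.random.default_rng(0).normal(0.05, 0.02, len(idx)), index=idx)

# One mean value per calendar day
daily = s.resample('D').mean()
print(daily)
```

For the plot above, `df.loc['2016-01-01':'2016-02-01', 'BC_cvi'].resample('D').mean().plot()` would give the daily-mean version.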

Save your dataframe:#

df.to_csv('my_new_dataset.csv')

3. Xarray and netcdf#

3.1 Reading in the data from a netcdf file:#

Netcdf format:#

  • NetCDF (Network Common Data Form) is a widely used, platform-independent file format designed for array-oriented scientific data.

  • It is self-describing, meaning metadata (e.g., units, descriptions) is stored alongside the data.

  • Supports multidimensional data structures, such as time-varying grids (e.g., time, latitude, longitude, level).

  • Commonly used in earth and atmospheric sciences for storing model outputs, reanalyses, and satellite data. Developed at UCAR.

  • Organizes content into:

    • Dimensions (e.g., time, lat, lon)

    • Variables (e.g., temperature, pressure)

    • Attributes (e.g., units, standard names, conventions)

  • Enables efficient storage and retrieval, even for very large datasets.

  • Supported by many scientific tools and libraries including xarray, netCDF4, NCO, and CDO.

  • Ideal for interoperability, reproducibility, and long-term data archiving.

NetCDF + xarray = <3

Xarray is a Python library designed to work with multi-dimensional arrays and datasets, particularly those used in earth sciences, climate science, and atmospheric science. It builds upon and extends the functionality of NumPy, Pandas, and NetCDF, providing a high-level interface for working with labeled, multi-dimensional data.
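The labeled-dimension idea is easiest to see by building a tiny Dataset by hand (all names and values below are made up for illustration):

```python
import numpy as np
import xarray as xr

# A small 3-D variable with named dimensions and labeled coordinates
temp = xr.DataArray(
    np.arange(8, dtype=float).reshape(2, 2, 2),
    dims=('time', 'lat', 'lon'),
    coords={'time': [0, 1], 'lat': [60.0, 70.0], 'lon': [0.0, 10.0]},
    attrs={'units': 'K', 'long_name': 'air temperature'},
)
ds_demo = xr.Dataset({'temperature': temp})

# Operations use dimension names instead of axis numbers
print(ds_demo['temperature'].mean(dim='time'))
```

Compared to plain NumPy, you never have to remember which axis number is time: you say `dim='time'` and xarray keeps the labels consistent.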


Open dataset#

  • Use the xarray Python package to analyse netCDF datasets

  • open_dataset reads all the metadata without loading the data values into memory.

  • With xarray, we only load into memory what is actually needed.

path='filename.nc'
ds = xr.open_dataset(path)
Opening multiple files:#

list_of_files = [
    'file1.nc',
    'file2.nc'
]
xr.open_mfdataset(list_of_files, combine='by_coords')  # or combine='nested', concat_dim='time'

Example:#

(I’ve gone hunting in your data folders)

f1 = '~/shared-craas2-ns9988k-ns9252k/escience2025/data/group6/CMIP6/abrupt-4xCO2/rlut_Amon_CAMS-CSM1-0_abrupt-4xCO2_r1i1p1f1_gn.nc'
f2 = '~/shared-craas2-ns9988k-ns9252k/escience2025/data/group6/CMIP6/abrupt-4xCO2/rsut_Amon_CAMS-CSM1-0_abrupt-4xCO2_r1i1p1f1_gn.nc'
Open one file:#
xr.open_dataset(f1)
<xarray.Dataset> Size: 369MB
Dimensions:  (time: 1800, lon: 320, lat: 160)
Coordinates:
  * time     (time) object 14kB 0001-01-01 00:00:00 ... 0150-12-01 00:00:00
  * lon      (lon) float64 3kB 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat      (lat) float64 1kB -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
Data variables:
    rlut     (time, lat, lon) float32 369MB ...
Attributes: (12/50)
    CDI:                    Climate Data Interface version 2.0.5 (https://mpi...
    Conventions:            CF-1.7 CMIP-6.2
    source:                 CAMS_CSM 1.0 (2016): \naerosol: none\natmos: ECHA...
    institution:            Chinese Academy of Meteorological Sciences, Beiji...
    activity_id:            CMIP
    branch_method:          Standard
    ...                     ...
    tracking_id:            hdl:21.14100/73320c27-d385-493c-9674-1470acacb8ed
    variable_id:            rlut
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by Lawrence Livermore P...
    cmor_version:           3.4.0
    CDO:                    Climate Data Operators version 2.0.5 (https://mpi...
Open multiple files:#
ds1 = xr.open_mfdataset([f1, f2])

3.2 Check how your dataset looks#

Notice:

  • Coordinates

  • Data variables (each with assigned coordinates)

  • Attributes

ds1
<xarray.Dataset> Size: 737MB
Dimensions:  (time: 1800, lon: 320, lat: 160)
Coordinates:
  * time     (time) object 14kB 0001-01-01 00:00:00 ... 0150-12-01 00:00:00
  * lon      (lon) float64 3kB 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat      (lat) float64 1kB -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
Data variables:
    rlut     (time, lat, lon) float32 369MB dask.array<chunksize=(1, 160, 320), meta=np.ndarray>
    rsut     (time, lat, lon) float32 369MB dask.array<chunksize=(1, 160, 320), meta=np.ndarray>
Attributes: (12/50)
    CDI:                    Climate Data Interface version 2.0.5 (https://mpi...
    Conventions:            CF-1.7 CMIP-6.2
    source:                 CAMS_CSM 1.0 (2016): \naerosol: none\natmos: ECHA...
    institution:            Chinese Academy of Meteorological Sciences, Beiji...
    activity_id:            CMIP
    branch_method:          Standard
    ...                     ...
    tracking_id:            hdl:21.14100/73320c27-d385-493c-9674-1470acacb8ed
    variable_id:            rlut
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by Lawrence Livermore P...
    cmor_version:           3.4.0
    CDO:                    Climate Data Operators version 2.0.5 (https://mpi...


3.3 Read in CMIP6 data (we will skip this part now, but you can come back to it later to read data):#

#cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
cat_url = '/mnt/craas2-ns9988k/data/catalogs/cmip6.json'
col = intake.open_esm_datastore(cat_url)
col

cmip6 catalog with 155 dataset(s) from 536945 asset(s):

unique
variable_id 583
table_id 24
source_id 75
experiment_id 94
member_id 190
grid_label 11
time_range 9100
activity_id 18
institution_id 35
version 577
path 536945
dcpp_init_year 0
derived_variable_id 0

Search corresponding data#

Please check here for info about CMIP and variables :)

Particularly useful is maybe the variable search which you find here: https://clipc-services.ceda.ac.uk/dreq/mipVars.html

cat = col.search(source_id = ['CESM2'], 
                 experiment_id=['historical'], 
                 table_id=['Amon','fx','AERmon'], 
                 variable_id=['tas','hurs', 'areacella','mmrso4' ], 
                 member_id=['r1i1p1f1'],
                )
cat.df
variable_id table_id source_id experiment_id member_id grid_label time_range activity_id institution_id version path dcpp_init_year
0 mmrso4 AERmon CESM2 historical r1i1p1f1 gn 185001-201412 CMIP NCAR v20190308 /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMIP6/CMIP... NaN
1 hurs Amon CESM2 historical r1i1p1f1 gn 185001-201412 CMIP NCAR v20190308 /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMIP6/CMIP... NaN
2 tas Amon CESM2 historical r1i1p1f1 gn 185001-201412 CMIP NCAR v20190308 /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMIP6/CMIP... NaN
3 areacella fx CESM2 historical r1i1p1f1 gn NaN CMIP NCAR v20190308 /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMIP6/CMIP... NaN
cat.esmcat.aggregation_control.groupby_attrs = ['activity_id','experiment_id', 'source_id','table_id','grid_label']
cat.esmcat.aggregation_control.groupby_attrs
['activity_id', 'experiment_id', 'source_id', 'table_id', 'grid_label']

Create dictionary from the list of datasets we found#

  • This step may take several minutes so be patient!

dset_dict = cat.to_dataset_dict(zarr_kwargs={'use_cftime':True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.experiment_id.source_id.table_id.grid_label'
100.00% [3/3 00:03<00:00]
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'areacella' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'mmrso4' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'hurs' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/opt/conda/envs/pangeo-notebook/lib/python3.11/site-packages/xarray/conventions.py:286: SerializationWarning: variable 'tas' has multiple fill values {1e+20, 1e+20} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
list(dset_dict.keys())
['CMIP.historical.CESM2.fx.gn',
 'CMIP.historical.CESM2.AERmon.gn',
 'CMIP.historical.CESM2.Amon.gn']
Since I have already checked that these datasets are on the same grid, we can merge them:
ds_list = []
for k in dset_dict.keys():
    ds = dset_dict[k]
    for v in ['lon_bnds', 'lat_bnds', 'time_bnds']:
        if v in ds:
            ds = ds.drop_vars(v)
    ds_list.append(ds)
ds = xr.merge(ds_list, compat='override')


4. Check how your dataset looks#

Notice:

  • Coordinates

  • Data variables (each with assigned coordinates)

  • Attributes

ds
<xarray.Dataset> Size: 15GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 1980, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 14GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 438MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 438MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 438MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn


5. Some useful tweaks before you start:#

5.1 Indexing/selecting data:#

  • xarray loads data lazily, i.e. only when it is needed for a computation – pandas loads everything into memory right away

  • Define the subset of data you want to look at early on, so that you don’t end up loading a lot of extra data.

See here for nice overview#

To avoid unnecessary calculations and data loading, think about which part of the data you want, and slice early on.

Slice in time: the sel method#

dss = ds.sel(time = slice('1990-01-01','2010-01-01'))
dss
<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn
NB: Always check that the function you are using is doing what you expect!
  • A lot of issues arise from people not doing sanity checks.

Example 2: You might want to select only the Arctic, e.g.:

dss_arctic = dss.sel(lat = slice(60,None))
dss_arctic
<xarray.Dataset> Size: 310MB
Dimensions:    (member_id: 1, lat: 32, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 256B 60.79 61.73 62.67 63.61 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 37kB dask.array<chunksize=(1, 32, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 283MB dask.array<chunksize=(1, 1, 16, 32, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 9MB dask.array<chunksize=(1, 32, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 9MB dask.array<chunksize=(1, 1, 32, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 9MB dask.array<chunksize=(1, 1, 32, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

isel: selecting by integer index#

  • Select the surface level (which in this case is the last index):

dss
<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn
dss_s = dss.isel(lev=-1)
dss_s
<xarray.Dataset> Size: 213MB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
    lev        float64 8B -992.6
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          float64 8B dask.array<chunksize=(), meta=np.ndarray>
    b          float64 8B dask.array<chunksize=(), meta=np.ndarray>
    b_bnds     (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    lev_bnds   (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    a_bnds     (nbnd) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn


5.2 Calculate variables and assign attributes!#

This is nice for plotting and for keeping track of what is in your dataset (xarray looks in particular for ‘units’ and ‘standard_name’/‘long_name’).

dss['T_C'] = dss['tas']-273.15
dss['T_C']
<xarray.DataArray 'T_C' (member_id: 1, time: 240, lat: 192, lon: 288)> Size: 53MB
dask.array<sub, shape=(1, 240, 192, 288), dtype=float32, chunksize=(1, 1, 192, 288), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
#dss['T_C'] = dss['T_C'].assign_attrs({'units': 'deg C'})
dss['T_C'] = dss['T_C'].assign_attrs({'units': '$^\circ$C'})
dss['T_C'] = dss['T_C'].assign_attrs({'long_name': 'Temperature at surface'})
dss['T_C']
<xarray.DataArray 'T_C' (member_id: 1, time: 240, lat: 192, lon: 288)> Size: 53MB
dask.array<sub, shape=(1, 240, 192, 288), dtype=float32, chunksize=(1, 1, 192, 288), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * time       (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Attributes:
    units:      $^\circ$C
    long_name:  Temperature at surface
There may always be small things you need to adjust:
dss['time']
<xarray.DataArray 'time' (time: 240)> Size: 2kB
array([cftime.DatetimeNoLeap(1990, 1, 15, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(1990, 2, 14, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(1990, 3, 15, 12, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2009, 10, 15, 12, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2009, 11, 15, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2009, 12, 15, 12, 0, 0, 0, has_year_zero=True)],
      dtype=object)
Coordinates:
  * time     (time) object 2kB 1990-01-15 12:00:00 ... 2009-12-15 12:00:00
Attributes:
    axis:           T
    bounds:         time_bnds
    standard_name:  time
    title:          time
    type:           double
  • The calendar here is cftime with a noLeap calendar. This sometimes causes issues when plotting timeseries, so we will convert it to a standard datetime64 index, which is unproblematic here since the data is monthly anyway.

dss['time'] = dss['time'].to_dataframe().index.to_datetimeindex()
/tmp/ipykernel_547/4195133999.py:1: RuntimeWarning: Converting a CFTimeIndex with dates from a non-standard calendar, 'noleap', to a pandas.DatetimeIndex, which uses dates from the standard calendar.  This may lead to subtle errors in operations that depend on the length of time between dates.
  dss['time'] = dss['time'].to_dataframe().index.to_datetimeindex()
dss['time']
<xarray.DataArray 'time' (time: 240)> Size: 2kB
array(['1990-01-15T12:00:00.000000000', '1990-02-14T00:00:00.000000000',
       '1990-03-15T12:00:00.000000000', ..., '2009-10-15T12:00:00.000000000',
       '2009-11-15T00:00:00.000000000', '2009-12-15T12:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T12:...
We get a warning, but it's ok as long as we know what we are doing.

5.3 Convert longitude:#

This data comes with longitude in 0–360 degrees, but -180 to 180 is often more convenient, so we convert:

NOTE: Maybe you want to put this in a module? Or a package..

dss
<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    T_C        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn
def convert_lon_to_180(ds):
    ds['lon'] = (ds['lon'] + 180) % 360 - 180
    ds = ds.sortby('lon')
    return ds

def convert_lon_to_360(ds):
    ds['lon'] = ds['lon'] % 360
    ds = ds.sortby('lon')
    return ds

(You might want to move these to a module!)
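In the spirit of the sanity-check advice above: before applying a helper like this to a big dataset, try it on a toy one where you can verify the answer by hand (the function is repeated here so the sketch is self-contained):

```python
import numpy as np
import xarray as xr

def convert_lon_to_180(ds):
    ds['lon'] = (ds['lon'] + 180) % 360 - 180
    return ds.sortby('lon')

# Toy dataset with four longitudes in 0-360 convention
toy = xr.Dataset({'v': ('lon', np.arange(4.0))},
                 coords={'lon': [0.0, 90.0, 180.0, 270.0]})
out = convert_lon_to_180(toy)
print(out['lon'].values)  # lon is now [-180, -90, 0, 90]
```

Note that `sortby('lon')` also reorders the data variables, so the values stay attached to the right longitudes.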

dss = convert_lon_to_180(dss)
dss
<xarray.Dataset> Size: 2GB
Dimensions:    (member_id: 1, lat: 192, lon: 288, time: 240, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, time, lev, lat, lon) float32 2GB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (time, lat, lon) float32 53MB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    p0         float32 4B ...
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    T_C        (member_id, time, lat, lon) float32 53MB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn
  • Notice that the lon coordinate lost its units attribute in the conversion, so we add it back:

dss['lon'].attrs['units'] = r'$^\circ$ East'  # raw string so the backslash reaches matplotlib intact

Notice how the plotting labels use both the attribute “standard_name” and “units” from the dataset.

.

.

.

.

.

6. The easiest interpolation: select with ‘nearest’ neighbour#

Example: let’s select the Zeppelin station (78.906661°N, 11.889203°E):

lat_zep = 78.90
lon_zep = 11.89
dss['T_C'].sel(lon=lon_zep, lat=lat_zep, method='nearest')#.plot()
<xarray.DataArray 'T_C' (member_id: 1, time: 240)> Size: 960B
dask.array<getitem, shape=(1, 240), dtype=float32, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
    lat        float64 8B 78.69
    lon        float64 8B 12.5
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Attributes:
    units:      $^\circ$C
    long_name:  Temperature at surface
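With `method='nearest'` you can also pass a `tolerance`, which guards against silently picking a far-away grid point. A small sketch on synthetic data:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0),
                  coords={'lat': [0.0, 10.0, 20.0, 30.0]}, dims='lat')

# nearest grid point to lat=11 is lat=10, holding the value 1.0
nearest = da.sel(lat=11.0, method='nearest')
print(nearest.item())  # 1.0

# with a tolerance, a selection with no grid point close enough raises KeyError
too_far = False
try:
    da.sel(lat=11.0, method='nearest', tolerance=0.5)
except KeyError:
    too_far = True
print(too_far)  # True
```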

7. Super quick averaging etc with xarray#

da_so4 = dss['mmrso4']
da_so4
<xarray.DataArray 'mmrso4' (member_id: 1, time: 240, lev: 32, lat: 192, lon: 288)> Size: 2GB
dask.array<getitem, shape=(1, 240, 32, 192, 288), dtype=float32, chunksize=(1, 1, 16, 96, 144), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4

Mean:

da_so4_lmean = da_so4.mean(['time','lon'], keep_attrs=True)#.plot()#ylim=[1000,100], yscale='log')
da_so4_lmean
<xarray.DataArray 'mmrso4' (member_id: 1, lev: 32, lat: 192)> Size: 25kB
dask.array<mean_agg-aggregate, shape=(1, 32, 192), dtype=float32, chunksize=(1, 16, 96), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4
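On a toy array you can see what the call above does: averaging over a list of dimensions drops those dimensions, and `keep_attrs=True` preserves the metadata:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones((2, 3, 4)), dims=('time', 'lev', 'lat'),
                  attrs={'units': 'kg kg-1'})

# collapse 'time' and 'lat', keeping the attributes
m = da.mean(['time', 'lat'], keep_attrs=True)
print(m.dims)            # ('lev',)
print(m.attrs['units'])  # kg kg-1
```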

8. Plotting with xarray#

  • Notice it’s slow now, why?

da_so4_lmean.plot()
<matplotlib.collections.QuadMesh at 0x7f272e06e550>
../../../../_images/e6fcd83cfec84529417b8e3848bf15bd92b6d4adb7cda47fb82889de5887ac35.png
da_so4_lmean['lev'] = np.abs(da_so4_lmean['lev'].values)
da_so4_lmean.plot(
    ylim=[1000,100], 
    yscale='log'
)
<matplotlib.collections.QuadMesh at 0x7f263eed7e10>
../../../../_images/039c36f23c77a222dd931148063701e86fe9cacc305210f55e3be9c5d6043049.png

Standard deviation

dss['T_C'].std(['time']).plot()
<matplotlib.collections.QuadMesh at 0x7f264cc67190>
../../../../_images/ef2b08413f971f9d652d14783f979df3c0ad0b04d5a86ad2656f5b2fa985924e.png

Temperature variability is much stronger over land than over ocean, and larger at high latitudes…

9. Mask data and groupby: pick out seasons#

month = ds['time.month']  # note: 'ds' is the full dataset here; xarray aligns on the time coordinate when masking 'dss' below
month
<xarray.DataArray 'month' (time: 1980)> Size: 16kB
array([ 1,  2,  3, ..., 10, 11, 12])
Coordinates:
  * time     (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
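The same `where` + `isin` pattern on a toy monthly series (a sketch; `mean` skips the NaNs introduced by the mask by default):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', periods=12, freq='MS')
da = xr.DataArray(np.arange(12.0), coords={'time': times}, dims='time')

# keep only July and August; everything else becomes NaN
jja = da.where(da['time.month'].isin([7, 8]))
print(float(jja.mean()))  # 6.5, the mean of the July (6.0) and August (7.0) values
```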
dss_JA = dss.where(month.isin([7,8])).mean('time')
dss_JA
<xarray.Dataset> Size: 8MB
Dimensions:    (member_id: 1, lat: 192, lon: 288, lev: 32, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
Dimensions without coordinates: nbnd
Data variables:
    areacella  (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    mmrso4     (member_id, lev, lat, lon) float32 7MB dask.array<chunksize=(1, 16, 96, 144), meta=np.ndarray>
    ps         (lat, lon) float32 221kB dask.array<chunksize=(192, 288), meta=np.ndarray>
    p0         float32 4B nan
    a          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b          (lev) float64 256B dask.array<chunksize=(32,), meta=np.ndarray>
    b_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    lev_bnds   (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    a_bnds     (lev, nbnd) float64 512B dask.array<chunksize=(32, 2), meta=np.ndarray>
    hurs       (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    tas        (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    T_C        (member_id, lat, lon) float32 221kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
dss_season = dss.groupby('time.season').mean(keep_attrs=True)
dss_season
<xarray.Dataset> Size: 33MB
Dimensions:    (season: 4, member_id: 1, lev: 32, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
  * member_id  (member_id) object 8B 'r1i1p1f1'
  * lev        (lev) float64 256B -3.643 -7.595 -14.36 ... -957.5 -976.3 -992.6
  * season     (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Dimensions without coordinates: nbnd
Data variables:
    mmrso4     (season, member_id, lev, lat, lon) float32 28MB dask.array<chunksize=(1, 1, 16, 96, 144), meta=np.ndarray>
    ps         (season, lat, lon) float32 885kB dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    hurs       (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    tas        (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    T_C        (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(1, 1, 192, 288), meta=np.ndarray>
    areacella  (season, member_id, lat, lon) float32 885kB dask.array<chunksize=(4, 1, 192, 288), meta=np.ndarray>
    p0         (season) float32 16B 1e+05 1e+05 1e+05 1e+05
    a          (season, lev) float64 1kB dask.array<chunksize=(4, 32), meta=np.ndarray>
    b          (season, lev) float64 1kB dask.array<chunksize=(4, 32), meta=np.ndarray>
    b_bnds     (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
    lev_bnds   (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
    a_bnds     (season, lev, nbnd) float64 2kB dask.array<chunksize=(4, 32, 2), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn
da_TC = dss_season['T_C']
%%time
da_TC.plot(col='season')
CPU times: user 7min 48s, sys: 6min 20s, total: 14min 8s
Wall time: 53.3 s
<xarray.plot.facetgrid.FacetGrid at 0x7f264e7676d0>
../../../../_images/7fd401a3f3d30a5ad40e03dd40df75fd7f6fc8d2fee9dd1e8772f8366c085d5d.png
Note to Sara: Do quiz...

Tip: might want to load or compute#

from dask.diagnostics import ProgressBar
%%time
with ProgressBar():
    da_TC.load()
[########################################] | 100% Completed | 74.43 s
CPU times: user 11min 24s, sys: 8min 22s, total: 19min 47s
Wall time: 1min 14s
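The difference between the two (a sketch, assuming dask is installed): `.compute()` returns a new in-memory object and leaves the original lazy, while `.load()` loads the object itself in place.

```python
import dask.array as darr
import numpy as np
import xarray as xr

lazy = xr.DataArray(darr.ones((4, 4), chunks=2), dims=('x', 'y'))

eager = lazy.compute()   # new object backed by a plain numpy array
print(type(eager.data))  # <class 'numpy.ndarray'>

lazy.load()              # loads `lazy` itself into memory, in place
print(type(lazy.data))   # <class 'numpy.ndarray'>
```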

10. Control the plot visuals#

%%time
da_TC.plot(
    col = 'season',
    #cmap = 'coolwarm',
    #xlim = [-100,100],
    #ylim = [-0,90],
    #cbar_kwargs = {'label':'Temperature [$^\circ$C]'}
    )
CPU times: user 135 ms, sys: 11.6 ms, total: 147 ms
Wall time: 142 ms
<xarray.plot.facetgrid.FacetGrid at 0x7f263f1fbed0>
../../../../_images/7fd401a3f3d30a5ad40e03dd40df75fd7f6fc8d2fee9dd1e8772f8366c085d5d.png
  • Xarray uses certain attributes as labels when plotting:

da_TC = da_TC.assign_attrs({'long_name': 'Temperature near surface'})
da_TC.mean('season', keep_attrs=True).plot()
<matplotlib.collections.QuadMesh at 0x7f268f549410>
../../../../_images/72ee3e980f8d1b03bbf299321453e47a73218620a84549509315245c4770b45e.png

11. Plotting on maps with cartopy#

See more here:

import cartopy as cy
import cartopy.crs as ccrs
da_plt = dss['mmrso4'].isel(lev=-1).mean('time', keep_attrs=True).squeeze()#('member_id')
da_plt
<xarray.DataArray 'mmrso4' (lat: 192, lon: 288)> Size: 221kB
dask.array<getitem, shape=(192, 288), dtype=float32, chunksize=(96, 144), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
    member_id  <U8 32B 'r1i1p1f1'
    lev        float64 8B -992.6
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4
from matplotlib.colors import LogNorm
with ProgressBar():
    da_plt.load()
[########################################] | 100% Completed | 3.53 s
da_plt
<xarray.DataArray 'mmrso4' (lat: 192, lon: 288)> Size: 221kB
array([[7.0655864e-12, 7.0656254e-12, 7.0656757e-12, ..., 7.0654580e-12,
        7.0654359e-12, 7.0655357e-12],
       [7.4523243e-12, 7.4701799e-12, 7.4873961e-12, ..., 7.3945355e-12,
        7.4097612e-12, 7.4321209e-12],
       [8.0703352e-12, 8.0987569e-12, 8.1331625e-12, ..., 7.9838211e-12,
        8.0185901e-12, 8.0491083e-12],
       ...,
       [3.2195205e-11, 3.2224824e-11, 3.2185157e-11, ..., 3.2204971e-11,
        3.2198522e-11, 3.2200648e-11],
       [3.1399085e-11, 3.1360515e-11, 3.1390588e-11, ..., 3.1466201e-11,
        3.1436742e-11, 3.1405465e-11],
       [3.0563073e-11, 3.0563111e-11, 3.0562844e-11, ..., 3.0563923e-11,
        3.0565092e-11, 3.0564863e-11]], dtype=float32)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
    member_id  <U8 32B 'r1i1p1f1'
    lev        float64 8B -992.6
Attributes: (12/19)
    cell_measures:  area: areacella
    cell_methods:   area: time: mean
    comment:        Dry mass of sulfate (SO4) in aerosol particles as a fract...
    description:    Dry mass of sulfate (SO4) in aerosol particles as a fract...
    frequency:      mon
    id:             mmrso4
    ...             ...
    time_label:     time-mean
    time_title:     Temporal mean
    title:          Aerosol Sulfate Mass Mixing Ratio
    type:           real
    units:          kg kg-1
    variable_id:    mmrso4
  • projection tells cartopy how we want the data visualized

  • transform tells it what kind of grid the data is on:

    • PlateCarree is a regular lat, lon grid.

f, ax = plt.subplots(1, 1, dpi=150, figsize = [6,3],
                    subplot_kw={'projection':ccrs.Robinson()},
                    #subplot_kw={'projection':ccrs.Orthographic()},
                   )

da_plt.plot.pcolormesh(
    cmap = 'Reds',
    ax=ax,
    transform = ccrs.PlateCarree(), 
    #robust=True,
    #norm = LogNorm(), # to add log scale colorbar
    #cbar_kwargs={
    #    #'label':'Wind Speed [m/s]', 
    #    'orientation':'horizontal',
    #    'shrink':.8
    #},
    #levels = 6
)
#ax.set_title('Bottom level: Mean over Time')


ax.coastlines(linewidth=.1)
gl = ax.gridlines(draw_labels=True,#{'bottom':True, 'left':True},
                  #xlocs=[-180,-140,0,140, 180],
                  #ylocs=[-90,-45,0,45,90],
                  #x_inline=False,
                  #y_inline=False,
                  #color='k',
                  #linestyle='dotted'
                 )
../../../../_images/b9b851c1142209fd77e9b71ce25e941f0997783b73b13c780fd56eba16404d02.png

12. Global average: weighted means!#

dss['T_C'].mean().compute()
<xarray.DataArray 'T_C' ()> Size: 4B
array(5.4478564, dtype=float32)

Why is this wrong? (Hint: every grid cell counts equally here, but cells near the poles cover far less area.)

dss['T_C'].weighted(dss['areacella']).mean().compute()
<xarray.DataArray 'T_C' ()> Size: 4B
array(14.715008, dtype=float32)
dss['areacella'].plot()
<matplotlib.collections.QuadMesh at 0x7f268e6b2b50>
../../../../_images/6fa4c8115a7b458a32bd156148558805a4ef84fb138baebd954a914b468d8213.png
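If no cell-area variable like `areacella` is available, cos(lat) is a common approximation of relative grid-cell area on a regular lat/lon grid. A sketch on three latitudes (not the model's actual cell areas):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([0.0, 30.0, 0.0],
                  coords={'lat': [-60.0, 0.0, 60.0]}, dims='lat')

# cos(lat) weights: the equator cell is twice as large as the 60-degree cells
weights = np.cos(np.deg2rad(da['lat']))
weights.name = 'weights'

unweighted = da.mean().item()                    # 10.0: every latitude counts equally
weighted = da.weighted(weights).mean().item()    # ~15: the equator cell counts double
print(unweighted, weighted)
```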

13. Filtering, grouping etc.#

Pick out station:#

lat_kristineberg = 58.24
lon_kristineberg = 11.44
# pick out surface
ds_surf =dss.isel(lev=-1)
ds_kristineberg = ds_surf.sel(lat=lat_kristineberg, lon = lon_kristineberg, method ='nearest')
ds_kristineberg['mmrso4'].plot()
[<matplotlib.lines.Line2D at 0x7f268ea93190>]
../../../../_images/4aec046e565cedf0f8f98f00c3cc0d635341400ab263eed2841e988ce4d641b7.png

Pick out the variables we need:

vl = ['mmrso4','hurs','tas','T_C']
ds_kristineberg[vl]
<xarray.Dataset> Size: 6kB
Dimensions:    (member_id: 1, time: 240)
Coordinates:
    lat        float64 8B 57.96
    lon        float64 8B 11.25
  * member_id  (member_id) object 8B 'r1i1p1f1'
    lev        float64 8B -992.6
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
Data variables:
    mmrso4     (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    hurs       (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    tas        (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
    T_C        (member_id, time) float32 960B dask.array<chunksize=(1, 1), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                      CF-1.7 CMIP-6.2
    activity_id:                      CMIP
    case_id:                          15
    cesm_casename:                    b.e21.BHIST.f09_g17.CMIP6-historical.001
    contact:                          cesm_cmip6@ucar.edu
    creation_date:                    2019-01-16T21:43:39Z
    ...                               ...
    intake_esm_attrs:activity_id:     CMIP
    intake_esm_attrs:institution_id:  NCAR
    intake_esm_attrs:version:         v20190308
    intake_esm_attrs:path:            /mnt/craas2-ns9988k-dl-ns9560k/ESGF/CMI...
    intake_esm_attrs:_data_format_:   netcdf
    intake_esm_dataset_key:           CMIP.historical.CESM2.fx.gn

13.1 Resample#

ds_yearly = ds_kristineberg.resample(time='YE').mean('time')
ds_yearly['T_C'].plot()
ds_kristineberg['T_C'].plot(linewidth=0, marker='.')
[<matplotlib.lines.Line2D at 0x7f268eaa6110>]
../../../../_images/4d0ce9d1e4efe8e53437b2811bfd7e9c42e85c8fd04d5d96a96ec229857488a6.png
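`resample` works like its pandas counterpart. On a toy two-year monthly series, `time='YE'` (year-end frequency; this spelling needs a recent pandas, older versions used 'Y') gives one value per year:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', periods=24, freq='MS')
da = xr.DataArray(np.arange(24.0), coords={'time': times}, dims='time')

yearly = da.resample(time='YE').mean()
print(yearly.values)  # [ 5.5 17.5], the means of months 0-11 and 12-23
```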

13.2 Filter data using where:#

ds_kristineberg['season'] = ds_kristineberg['time.season']
for s in ['MAM','JJA','SON','DJF']:
    ds_kristineberg.where(ds_kristineberg['season']==s)['T_C'].plot.hist(alpha=0.5, bins=20, label=s)
    
plt.legend()
<matplotlib.legend.Legend at 0x7f268e053a50>
../../../../_images/0a6365755212d233080a1087e198c852df882c42f9887d81c5d8dd11397def8f.png

13.3 Group by#

ds_kristineberg['month'] = ds_kristineberg['time.month']
ds_kristineberg['T_C'].groupby(ds_kristineberg['month']).median().plot()
[<matplotlib.lines.Line2D at 0x7f263f78fad0>]
../../../../_images/75773fd6641bf8f7337c214468c3f9feb2389970b332ab75ecbb4e133ad77722.png
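`groupby('time.month')` collapses a series into a 12-value climatology. On a toy two-year series, each month averages its two occurrences:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', periods=24, freq='MS')
da = xr.DataArray(np.arange(24.0), coords={'time': times}, dims='time')

clim = da.groupby('time.month').mean()
print(clim['month'].values)  # [ 1  2  3 ... 12]
print(clim.values)           # [ 6.  7. ... 17.], e.g. January = (0 + 12) / 2
```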

14. Convert to pandas & do some random fun stuff:#

Maybe we want to compare with a station, or just use some of the considerable functionality pandas offers. It’s easy to convert back and forth between xarray and pandas:

A lot of these functions also exist in xarray!

NB: might want to convert to pandas#

At this point we basically have a time series and the dataset is not too big, so we might want to convert to pandas:

subset_of_variables = ['mmrso4','hurs','tas','T_C','season']
df_kristineberg = ds_kristineberg[subset_of_variables].squeeze().to_dataframe()
df_kristineberg
mmrso4 hurs tas T_C season lat lon member_id lev
time
1990-01-15 12:00:00 9.319389e-10 85.651199 277.036011 3.886017 DJF 57.958115 11.25 r1i1p1f1 -992.556095
1990-02-14 00:00:00 9.139207e-10 79.779701 273.601807 0.451813 DJF 57.958115 11.25 r1i1p1f1 -992.556095
1990-03-15 12:00:00 1.282672e-09 82.473785 277.405701 4.255707 MAM 57.958115 11.25 r1i1p1f1 -992.556095
1990-04-15 00:00:00 7.225737e-10 79.588730 280.137329 6.987335 MAM 57.958115 11.25 r1i1p1f1 -992.556095
1990-05-15 12:00:00 2.263646e-09 80.421875 283.043152 9.893158 MAM 57.958115 11.25 r1i1p1f1 -992.556095
... ... ... ... ... ... ... ... ... ...
2009-08-15 12:00:00 6.469142e-10 82.338539 290.292358 17.142365 JJA 57.958115 11.25 r1i1p1f1 -992.556095
2009-09-15 00:00:00 6.174767e-10 75.105225 288.502136 15.352142 SON 57.958115 11.25 r1i1p1f1 -992.556095
2009-10-15 12:00:00 5.226203e-10 81.029533 284.518219 11.368225 SON 57.958115 11.25 r1i1p1f1 -992.556095
2009-11-15 00:00:00 2.411485e-10 83.641380 278.707977 5.557983 SON 57.958115 11.25 r1i1p1f1 -992.556095
2009-12-15 12:00:00 4.485378e-10 89.487213 277.537415 4.387421 DJF 57.958115 11.25 r1i1p1f1 -992.556095

240 rows × 9 columns

Let’s do something unnecessarily complicated :D#

Convert to pandas dataframe#

Because pandas gives us more functionality for this kind of tabular analysis.

df = dss.isel(lev=-1)[vl].to_dataframe()
df_ri = df.reset_index()
df_ri.head()
member_id time lat lon mmrso4 hurs tas T_C lev
0 r1i1p1f1 1990-01-15 12:00:00 -90.0 -180.00 2.049096e-11 98.689842 242.897797 -30.252197 -992.556095
1 r1i1p1f1 1990-01-15 12:00:00 -90.0 -178.75 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095
2 r1i1p1f1 1990-01-15 12:00:00 -90.0 -177.50 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095
3 r1i1p1f1 1990-01-15 12:00:00 -90.0 -176.25 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095
4 r1i1p1f1 1990-01-15 12:00:00 -90.0 -175.00 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095

qcut, cut#

qcut splits the data into quantile ranges

df_ri['hurs_cat'] = pd.qcut(df_ri['hurs'],
                            q=[0.05,0.17, 0.34,0.66, 0.83,0.95],
                            labels=['very low','low','med','high','very high'])

cut splits the data into fixed value ranges (categories)

df_ri['lat_cat'] = pd.cut(df_ri['lat'], [-90,-60,-30,0,30,60,90], 
                          labels=['S polar','S mid','S tropics', 'N tropic', 'N mid','N polar'])
df_ri
member_id time lat lon mmrso4 hurs tas T_C lev hurs_cat lat_cat
0 r1i1p1f1 1990-01-15 12:00:00 -90.0 -180.00 2.049096e-11 98.689842 242.897797 -30.252197 -992.556095 very high NaN
1 r1i1p1f1 1990-01-15 12:00:00 -90.0 -178.75 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095 very high NaN
2 r1i1p1f1 1990-01-15 12:00:00 -90.0 -177.50 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095 very high NaN
3 r1i1p1f1 1990-01-15 12:00:00 -90.0 -176.25 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095 very high NaN
4 r1i1p1f1 1990-01-15 12:00:00 -90.0 -175.00 2.049095e-11 98.675537 242.903198 -30.246796 -992.556095 very high NaN
... ... ... ... ... ... ... ... ... ... ... ...
13271035 r1i1p1f1 2009-12-15 12:00:00 90.0 173.75 5.152184e-11 106.941933 250.874847 -22.275146 -992.556095 NaN N polar
13271036 r1i1p1f1 2009-12-15 12:00:00 90.0 175.00 5.151738e-11 106.943703 250.874573 -22.275421 -992.556095 NaN N polar
13271037 r1i1p1f1 2009-12-15 12:00:00 90.0 176.25 5.151281e-11 106.945244 250.874329 -22.275665 -992.556095 NaN N polar
13271038 r1i1p1f1 2009-12-15 12:00:00 90.0 177.50 5.150912e-11 106.946594 250.874100 -22.275894 -992.556095 NaN N polar
13271039 r1i1p1f1 2009-12-15 12:00:00 90.0 178.75 5.150579e-11 106.947731 250.873901 -22.276093 -992.556095 NaN N polar

13271040 rows × 11 columns
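On a tiny series the difference between the two is easy to see: `cut` uses fixed bin edges, `qcut` uses quantiles of the data:

```python
import pandas as pd

s = pd.Series([1, 5, 10, 20, 50])

# fixed edges: values up to 10 are 'low', the rest 'high'
by_value = pd.cut(s, bins=[0, 10, 100], labels=['low', 'high'])
print(by_value.tolist())     # ['low', 'low', 'low', 'high', 'high']

# quantiles: the median (10) splits the data into two equal-sized halves
by_quantile = pd.qcut(s, q=2, labels=['lower half', 'upper half'])
print(by_quantile.tolist())  # ['lower half', 'lower half', 'lower half', 'upper half', 'upper half']
```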

df_ri.sample(n=20000).groupby('lat_cat').mean(numeric_only=True)
/tmp/ipykernel_547/465231907.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df_ri.sample(n=20000).groupby('lat_cat').mean(numeric_only=True)
lat lon mmrso4 hurs tas T_C lev
lat_cat
S polar -74.839108 -1.731820 2.497121e-11 95.258957 247.787003 -25.362999 -992.556095
S mid -45.477205 -1.624213 1.547163e-10 79.717712 282.582733 9.432746 -992.556095
S tropics -15.143979 -0.178895 3.527229e-10 76.358116 297.262939 24.112949 -992.556095
N tropic 15.357704 -1.251104 1.189028e-09 73.107628 298.691284 25.541285 -992.556095
N mid 45.164037 -0.407609 1.326074e-09 72.727493 282.961426 9.811419 -992.556095
N polar 75.385127 1.800438 1.342937e-10 95.597260 263.293518 -9.856478 -992.556095
sns.boxenplot(x="lat_cat", y="hurs",
              color="b",
              #scale="linear", 
              width_method='linear',
              data=df_ri)#.sample(n=20000))
<Axes: xlabel='lat_cat', ylabel='hurs'>
../../../../_images/3a201b8fcd8b23131384f115b087688bd15ba9b264527ce1d2dcadf4f36dfc06.png
sns.boxenplot(x="hurs_cat", 
              y="mmrso4",
              color="b",
              width_method='linear',
              data=df_ri.sample(n=20000),
             )
<Axes: xlabel='hurs_cat', ylabel='mmrso4'>
../../../../_images/0d0a9d013e444bcf05c4d785cde4505873d2e3afda4881baad48587e0eb38d64.png
sns.displot(x="mmrso4",
            hue='lat_cat',
            log_scale=True,
            kind='kde',
            data=df_ri.sample(n=20000), 
            multiple="stack"
           )
<seaborn.axisgrid.FacetGrid at 0x7f268e4f38d0>
../../../../_images/6155b7dffe376e184e58227fef60a082a6cda07807a736414f7c776b65fd65ba.png

Convert back to xarray if we need:#

ds_new = df_ri.set_index(['time','lat','lon']).to_xarray()
ds_new
<xarray.Dataset> Size: 637MB
Dimensions:    (time: 240, lat: 192, lon: 288)
Coordinates:
  * time       (time) datetime64[ns] 2kB 1990-01-15T12:00:00 ... 2009-12-15T1...
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB -180.0 -178.8 -177.5 ... 176.2 177.5 178.8
Data variables:
    member_id  (time, lat, lon) object 106MB 'r1i1p1f1' ... 'r1i1p1f1'
    mmrso4     (time, lat, lon) float32 53MB 2.049e-11 2.049e-11 ... 5.151e-11
    hurs       (time, lat, lon) float32 53MB 98.69 98.68 98.68 ... 106.9 106.9
    tas        (time, lat, lon) float32 53MB 242.9 242.9 242.9 ... 250.9 250.9
    T_C        (time, lat, lon) float32 53MB -30.25 -30.25 ... -22.28 -22.28
    lev        (time, lat, lon) float64 106MB -992.6 -992.6 ... -992.6 -992.6
    hurs_cat   (time, lat, lon) object 106MB 'very high' 'very high' ... nan nan
    lat_cat    (time, lat, lon) object 106MB nan nan nan ... 'N polar' 'N polar'
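The round trip is lossless for the data values as long as the index covers the full grid; a minimal check on a tiny array:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(2, 3),
                  coords={'x': [0, 1], 'y': [10, 20, 30]},
                  dims=('x', 'y'), name='v')

# xarray -> pandas -> xarray
df = da.to_dataframe().reset_index()
da_back = df.set_index(['x', 'y']).to_xarray()['v']
print(bool((da_back == da).all()))  # True
```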

mask by category#

ds_new.where(ds_new['hurs_cat']=='very low').mean(['time','lon'])['mmrso4'].plot(label='very low')#vmin = 0, vmax = 1.5e-8)
ds_new.where(ds_new['hurs_cat']=='low').mean(['time','lon'])['mmrso4'].plot(label='low')#vmin = 0, vmax = 1.5e-8)

ds_new.where(ds_new['hurs_cat']=='high').mean(['time','lon'])['mmrso4'].plot(label='high')#vmin = 0, vmax = 1.5e-8)
ds_new.where(ds_new['hurs_cat']=='very high').mean(['time','lon'], keep_attrs=True)['mmrso4'].plot(label='very high')#vmin = 0, vmax = 1.5e-8)
plt.legend()
<matplotlib.legend.Legend at 0x7f264c1ee690>
../../../../_images/76f08dda63a025b448306d6998563dda751ea38300c6bc228f66e9b3ee6b1092.png

Improve and reduce the risk of typos:#

for cat in ['very low','low','high','very high']:
    ds_new.where(ds_new['hurs_cat']==cat).mean(['time','lon'])['mmrso4'].plot(label=cat) 
plt.legend()
<matplotlib.legend.Legend at 0x7f268f585a90>
../../../../_images/76f08dda63a025b448306d6998563dda751ea38300c6bc228f66e9b3ee6b1092.png