This page was generated from /home/docs/checkouts/readthedocs.org/user_builds/oceanspy/checkouts/latest/docs/Statistics.ipynb.
Use Case 3: Hydrographic Statistics
In this example, we compute some useful hydrographic statistics. We show how to properly account for different grid cell volumes using weighted statistics. The computation is accelerated using dask.
[1]:
# Setup environment:
import matplotlib.pyplot as plt
# Additional packages
import numpy as np
import xarray as xr
import oceanspy as ospy
The following cell starts a dask client (see the Dask Client section in the tutorial).
[2]:
import warnings
import dask
# Start client for multiprocessing
from dask.distributed import Client
client = Client()
client
[2]:
Client
Client-26ae3825-d8e2-11ed-8cff-0242ac110006
Connection method: Cluster object | Cluster type: distributed.LocalCluster |
Dashboard: http://127.0.0.1:8787/status |
Cluster Info
LocalCluster
81cdac5e
Dashboard: http://127.0.0.1:8787/status | Workers: 4 |
Total threads: 4 | Total memory: 100.00 GiB |
Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-4d7a9c82-5cdf-4ce7-9c42-615ef58e770e
Comm: tcp://127.0.0.1:44090 | Workers: 4 |
Dashboard: http://127.0.0.1:8787/status | Total threads: 4 |
Started: Just now | Total memory: 100.00 GiB |
Workers
Worker: 0
Comm: tcp://127.0.0.1:35621 | Total threads: 1 |
Dashboard: http://127.0.0.1:40576/status | Memory: 25.00 GiB |
Nanny: tcp://127.0.0.1:34369 | |
Local directory: /tmp/dask-worker-space/worker-8_d4bcgx |
Worker: 1
Comm: tcp://127.0.0.1:37592 | Total threads: 1 |
Dashboard: http://127.0.0.1:44162/status | Memory: 25.00 GiB |
Nanny: tcp://127.0.0.1:40318 | |
Local directory: /tmp/dask-worker-space/worker-2pud10zh |
Worker: 2
Comm: tcp://127.0.0.1:34588 | Total threads: 1 |
Dashboard: http://127.0.0.1:35118/status | Memory: 25.00 GiB |
Nanny: tcp://127.0.0.1:42551 | |
Local directory: /tmp/dask-worker-space/worker-ccmr530j |
Worker: 3
Comm: tcp://127.0.0.1:40062 | Total threads: 1 |
Dashboard: http://127.0.0.1:36471/status | Memory: 25.00 GiB |
Nanny: tcp://127.0.0.1:34549 | |
Local directory: /tmp/dask-worker-space/worker-emp51ee1 |
Open the oceandataset and compute the potential density anomaly:
[3]:
SciServer = True # True: SciServer - False: any computer
if SciServer:
od_snapshot = ospy.open_oceandataset.from_catalog("get_started")
else:
import os
if not os.path.isdir("oceanspy_get_started"):
# Download get_started
import subprocess
print("Downloading and uncompressing get_started data...")
print("...it might take a couple of minutes.")
commands = [
"wget -v -O oceanspy_get_started.tar.gz -L "
"https://livejohnshopkins-my.sharepoint.com/"
":u:/g/personal/malmans2_jh_edu/"
"EXjiMbANEHBZhy62oUDjzT4BtoJSW2W0tYtS2qO8_SM5mQ?"
"download=1",
"tar xvzf oceanspy_get_started.tar.gz",
"rm -f oceanspy_get_started.tar.gz",
]
subprocess.call("&&".join(commands), shell=True)
od_snapshot = ospy.open_oceandataset.from_zarr("oceanspy_get_started")
od_snapshot = od_snapshot.compute.potential_density_anomaly()
Opening get_started.
Small cutout from EGshelfIIseas2km_ASR_crop.
Citation:
* Almansi et al., 2020 - GRL.
See also:
* EGshelfIIseas2km_ASR_full: Full domain without variables to close budgets.
* EGshelfIIseas2km_ASR_crop: Cropped domain with variables to close budgets.
Computing potential density anomaly using the following parameters: {'eq_state': 'jmd95'}.
Make a temperature/salinity diagram.
[4]:
# The statistics are computed using these data.
fig = plt.figure(figsize=(10, 10))
od_snapshot.plot.TS_diagram(markersize=1, alpha=0.5)
# Print number of samples
n_samples = od_snapshot.dataset["Temp"].count().values
print("\nNumber of samples: {:,}\n".format(n_samples))
Isopycnals: Computing potential density anomaly using the following parameters: {'eq_state': 'jmd95'}.
Number of samples: 5,927,388

The temperature/salinity diagram shows the distribution of watermass types in terms of their hydrographic properties. Each dot represents a single grid cell in the dataset. We will compute statistics using this data to interpret the plot above.
Binning and Parallelization
We define a function bin_snapshot
to perform the binning we need to compute the volume-weighted statistics, or the mode of a variable. The use of the xarray stack method accelerates the groupby_bins task.
[5]:
# Function for binning a single snapshot in time
# dat = data array to bin (e.g., temperature distributed over 3D space)
# weights = data array of the relative volumes of the cells in 3D space
# bin_edges = array of the edges of the bins (e.g. temperature bins)
# bin_mid = array of the mid-points of the bins
def bin_snapshot(dat, weight, bin_edges, bin_mid):
# Speed up by stacking over space
dat = dat.stack(alldims=["X", "Y", "Z"])
weight = weight.stack(alldims=["X", "Y", "Z"])
# Bin the weight data into these variable bins
# using the xarray groupby_bins method
return weight.groupby_bins(
dat.rename("groups"),
bins=bin_edges,
labels=bin_mid,
include_lowest=True,
).sum()
Pick True
or False
for the parallel variable below to run bin_snapshot
with/without dask.
[6]:
# Parallel switch
parallel = True
Raw statistics
First compute and report some raw statistics using xarray, for comparison below. Because we ignore the different volumes of model grid cells, these statistics are not representative of the underlying fields.
[7]:
# Compute and print raw statistics
# Define the percentiles required:
qiles = [0.01, 0.50, 0.99]
# Define the variables required:
var_list = ["Temp", "S", "Sigma0"]
# Pretty print the results:
print("Raw statistics (not volume-weighted):")
print(
"{:>25s}: {:>10s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s} {:s}".format(
"Variable", "mean", "std", "mode", "1", "50", "99", "percentiles"
)
)
# Loop over variables computing statistics using xarray methods.
# Variables need to be load because dask chunks are not compatible
# with xarray quantiles function.
for var in var_list:
print("{:25s}:".format(od_snapshot.dataset[var].long_name), end=" ")
# Data
dats = getattr(od_snapshot.dataset, var).load()
# Mean
var_mean = dats.mean(keep_attrs=True)
print("{:10.4g}".format(var_mean.values), end=" ")
# Std
var_std = dats.std(keep_attrs=True)
print("{:10.4g}".format(var_std.values), end=" ")
# Mode
bin_edges = np.linspace(dats.min(), dats.max(), 512)
bin_mid = (bin_edges[:-1] + bin_edges[1:]) / 2
# Parallel switch
if parallel:
bin_snapshot_ = dask.delayed(bin_snapshot)
# Needs to load first
dat1 = [
bin_snapshot_(
dats.sel(time=time).load(),
xr.ones_like(dats).sel(time=time).load(),
bin_edges,
bin_mid,
)
for time in dats["time"]
]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
dat1 = dask.compute(dat1)[0]
else:
dat1 = [
bin_snapshot(
dats.sel(time=time),
xr.ones_like(dats).sel(time=time),
bin_edges,
bin_mid,
)
for time in dats["time"]
]
dat1 = xr.concat(dat1, "time").sum("time")
var_mode = dat1["groups_bins"].where(dat1 == dat1.max(), drop=True)
print("{:10.4g}".format(var_mode.values.squeeze()), end=" ")
# Qiles
var_qiles = dats.quantile(qiles, keep_attrs=True)
print(
"{:10.4g} {:10.4g} {:10.4g} {:s}".format(
var_qiles.sel(quantile=qiles[0]).values,
var_qiles.sel(quantile=qiles[1]).values,
var_qiles.sel(quantile=qiles[2]).values,
od_snapshot.dataset[var].units,
)
)
del dats
Raw statistics (not volume-weighted):
Variable: mean std mode 1 50 99 percentiles
potential_temperature : 0.8009 1.366 0.1019 -1.421 0.4803 5.716 degC
salinity : 34.44 0.9865 34.9 30.68 34.89 34.93 g kg-1
potential density anomaly: 27.59 0.7954 28.01 24.55 27.97 28.05 kg/m^3
Weighted statistics
Compute the weighted mean (OceanSpy provides a method to do so).
[8]:
od_snapshot = od_snapshot.compute.weighted_mean(varNameList=var_list)
# Print computed variables
print("\nVariables added:")
for var in od_snapshot.dataset.data_vars:
if any(
[(var == v or var == f"weight_{v}" or var == f"w_mean_{v}") for v in var_list]
):
print(
"{:>15}: {} [{}]".format(
var,
od_snapshot.dataset[var].long_name,
od_snapshot.dataset[var].units,
)
)
Computing weighted_mean.
Variables added:
Temp: potential_temperature [degC]
S: salinity [g kg-1]
Sigma0: potential density anomaly [kg/m^3]
w_mean_Temp: potential_temperature [degC]
weight_Temp: Weights for average [s*m^2*m]
w_mean_S: salinity [g kg-1]
weight_S: Weights for average [s*m^2*m]
w_mean_Sigma0: potential density anomaly [kg/m^3]
weight_Sigma0: Weights for average [s*m^2*m]
The weights used by OceanSpy have been stored in od_snapshot
. The weights correspond to space-time volume elements \(dx \; dy\; dz\; dt\) and hence have units of m\(^3\)s. The variations in the weights are dominated by variations in grid cell volume, which comes from \(dz\) (see below), although \(dx\) and \(dy\) are also important. As the timestep \(dt\) is constant it isn’t responsible for any variation in the weights. We can use the weights, for example, to
color a TS-diagrams and better understand which temperature and salinity values are misrepresented using raw statistics.
[9]:
# Scatter plots are not optimized for large amount of samples,
# so the following TS-diagram only shows sea surface water.
fig = plt.figure(figsize=(10, 10))
ax = od_snapshot.plot.TS_diagram(
s=1,
alpha=0.5,
colorName="weight_Temp",
cutout_kwargs={"ZRange": 0, "dropAxes": True},
)
Cutting out the oceandataset.
Isopycnals: Computing potential density anomaly using the following parameters: {'eq_state': 'jmd95'}.

The vertical resolution of the model linearly increases from 1 to 15m in the upper 120m and is 15m thereafter. The following plots show that the weights of warmer and fresher water masses are overestimated in the raw statistics.
[10]:
fig = plt.figure(figsize=(10, 10))
for i, (var, pos) in enumerate(zip(["Temp", "S"], [221, 223])):
ax = plt.subplot(pos)
data = od_snapshot.dataset[var]
weight = od_snapshot.dataset[f"weight_{var}"]
ax.plot(
data.values.flatten(),
weight.values.flatten(),
".",
color=f"C{i}",
alpha=0.5,
markersize=0.5,
)
ax.set_xlabel(f"{data.long_name} [{data.units}]")
ax.set_ylabel(f"{weight.long_name} [{weight.units}]")
ax = plt.subplot(122)
drF = od_snapshot.dataset["drF"]
Z = od_snapshot.dataset["Z"]
line = ax.plot(drF, Z, ".-", color="k", linewidth=3, markersize=15)
xlab = ax.set_xlabel(f"{drF.long_name} [{drF.units}]")
ylab = ax.set_ylabel(f"{Z.long_name} [{Z.units}]")
tit = ax.set_title("Model vertical resolution")
ax.grid(True)

Next compute and report the same statistics as before but taking into account the different volumes of different grid cells. We also plot probability distribution and cumulative distribution functions.
[11]:
# Setup plotting
fig, axes = plt.subplots(1, len(var_list), figsize=(30, 10))
hist_color = "tab:red"
qiles_color = "tab:blue"
stat_color = "tab:green"
stat_color2 = "tab:cyan"
stat_color3 = "tab:pink"
stat_color4 = "tab:brown"
bin_edges2 = np.linspace(0, 1, 512)
qiles2 = (bin_edges2[:-1] + bin_edges2[1:]) / 2
# Pretty print the results:
print("Raw statistics (not volume-weighted):")
print(
"{:>25s}: {:>10s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s} {:s}".format(
"Variable", "mean", "std", "mode", "1", "50", "99", "percentiles"
)
)
# Loop over variables of interest to compute std and percentiles (quantiles),
# pretty print, and plot
for i, (var, ax) in enumerate(zip(var_list, axes.flat)):
# Print on the way to monitor progress
var_name = od_snapshot.dataset[var].long_name
units = od_snapshot.dataset[var].attrs["units"]
print("{:25s}:".format(var_name), end=" ")
# Extract variables
dats = getattr(od_snapshot.dataset, var)
weights = getattr(od_snapshot.dataset, "weight_" + var)
# Normalize weights
weights = weights.where(dats.notnull())
weightsum = weights.sum()
weights = weights / weightsum
# Compute the arithmetic mean
var_mean = getattr(od_snapshot.dataset, "w_mean_" + var)
print("{:10.4g}".format(var_mean.values), end=" ")
# There is no OceanSpy method to compute
# the weighted standard deviation or percentiles.
# Compute the standard deviation
dat2 = ((od_snapshot.dataset[var] - var_mean) ** 2) * weights
var_std = np.sqrt(dat2.sum().values)
print("{:10.4g}".format(var_std), end=" ")
# Define bin edges and mid points for the variable in question.
# We use 512 uniformly-spaced bins:
bin_edges = np.linspace(dats.min(), dats.max(), 512)
bin_mid = (bin_edges[:-1] + bin_edges[1:]) / 2
# Compute percentiles:
# Parallel switch
if parallel:
bin_snapshot_ = dask.delayed(bin_snapshot)
# Needs to load first
dat3 = [
bin_snapshot_(
dats.sel(time=time).load(),
weights.sel(time=time).load(),
bin_edges,
bin_mid,
)
for time in dats["time"]
]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
dat3 = dask.compute(dat3)[0]
else:
dat3 = [
bin_snapshot(
dats.sel(time=time), weights.sel(time=time), bin_edges, bin_mid
)
for time in dats["time"]
]
# Concatenate dat
dat3 = xr.concat(dat3, "time").sum("time")
var_hist = dat3 / weights.sum()
dat4 = dat3.cumsum()
dat5 = dat4 / weights.sum()
# Mode
var_mode = dat3["groups_bins"].where(dat3 == dat3.max(), drop=True)
print("{:10.4g}".format(var_mode.values.squeeze()), end=" ")
# Use numpy interpolation to find the percentiles
# - effectively evaluating the inverse CDF
var_qiles = np.interp(qiles, dat5, bin_mid)
var_qiles2 = np.interp(qiles2, dat5, bin_mid)
var_median = np.interp(0.50, dat5, bin_mid)
var_limits = np.interp([0.01, 0.99], dat5, bin_mid)
for var_qile in var_qiles:
print("{:10.4g}".format(var_qile), end=" ")
print("{:s}".format(units))
# Plot the PDF
p0 = ax.plot(bin_mid, var_hist, "-", color=hist_color, label="PDF", linewidth=4)
ax.set_xlabel(var_name + " [" + units + "]", fontsize=15)
ax.set_ylabel("Probability distribution function", fontsize=15, color=hist_color)
ax.tick_params(axis="y", labelcolor=hist_color)
ax.grid(True)
ax.set_xlim(var_limits)
# Compute and plot the CDF
ax2 = ax.twinx()
p1 = ax2.plot(var_qiles2, qiles2, "-", color=qiles_color, label="CDF", linewidth=4)
ax2.set_ylabel("Cumulative distribution function", fontsize=15, color=qiles_color)
ax2.tick_params(axis="y", labelcolor=qiles_color)
ax2.set_xlim(var_limits)
# Add statistics to plot
p2 = ax.axvspan(var_mean - var_std, var_mean + var_std, color=stat_color)
l1 = ax.axvline(var_mean, linestyle="-", color="w", linewidth=2)
l2 = ax.axvline(var_median, linestyle="--", color=stat_color3, linewidth=2)
l3 = ax.axvline(var_mode, linestyle="--", color=stat_color4, linewidth=2)
ax.legend((l2, l3, p2), ("median", "mode", "mean +/- std"), loc="upper left")
Raw statistics (not volume-weighted):
Variable: mean std mode 1 50 99 percentiles
potential_temperature : 0.5422 1.033 0.1019 -1.435 0.3659 4.811 degC
salinity : 34.6 0.7318 34.9 31.5 34.89 34.92 g kg-1
potential density anomaly: 27.75 0.5858 28.01 25.24 27.98 28.05 kg/m^3

Notice the difference between the weighted mean statistics (in the table of numbers immediately above) and the raw statistics (further above). For example, the weighted-mean temperatures are significantly cooler. The reason is that the grid cell volumes vary, which is ignored in the computation of the raw mean.
Notice how the salinity distribution (probability density function) resembles the potential density anomaly distribution, indicating that density is controlled by salinity, not temperature. This fact is reflected in the T/S diagrams above by the nearly vertical contours of density anomaly. This is a beta-ocean in the parlance of Carmack (2007) and Stewart & Haine (2016).
Notice how the computation is accelerated using dask (try running with/without dask by setting the parallel flag at the top of this notebook).
Volumetric census
A natural extension of these concepts is to compute and plot the volumetric census over temperature/salinity classes. The plots above of probability distribution and cumulative distribution functions focus only on the variable of interest. In other words, the temperature binning to construct the histogram (var_hist
in the code) ignores the salinity. But we’re also interested in binning simultaneously into temperature and salinity classes. This quantity is (essentially) the joint distribution
function of seawater volume over temperature and salinity.
To compute the joint distribution of volume over temperature and salinity requires two-dimensional binning. This functionality is achieved using the xhistogram package.
[12]:
from xhistogram.xarray import histogram
First we construct a histogram of the number of data points in each temperature and salinity class. The temperature and salinity classes are defined using tbins
and sbins
. All timesteps are considered in the histogram. The colorscale is logarithmic.
[13]:
# Histogram of number of data points. From xhistogram tutorial:
sbins = np.arange(25.5, 35.5, 0.0125)
tbins = np.arange(-2, 10, 0.05)
fig = plt.figure(figsize=(10, 6))
hTS = histogram(
od_snapshot.dataset["S"].mean("time"),
od_snapshot.dataset["Temp"].mean("time"),
bins=[sbins, tbins],
)
np.log10(hTS.T).plot(levels=31)
/home/idies/mambaforge/envs/Oceanography/lib/python3.9/site-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
return func(*(_execute_task(a, cache) for a in args))
[13]:
<matplotlib.collections.QuadMesh at 0x7f82773b34f0>

Next we construct the histogram taking into account the different cell volumes. The weight_S
variable from compute.weighted_mean
supplies these cell volumes (we could also use the weight_T
variable). This histogram shows the volume (\(m^3\)) in each temperature and salinity class.
[14]:
# Histogram using volume weights.
dVol = ospy.compute.weighted_mean(
od_snapshot, varNameList="S", axesList=["X", "Y", "Z"]
)["weight_S"]
fig = plt.figure(figsize=(10, 6))
hTSw = histogram(
od_snapshot.dataset["S"],
od_snapshot.dataset["Temp"],
bins=[sbins, tbins],
weights=dVol,
)
np.log10(hTSw.T).plot(levels=31)
Computing weighted_mean.
/home/idies/mambaforge/envs/Oceanography/lib/python3.9/site-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
return func(*(_execute_task(a, cache) for a in args))
[14]:
<matplotlib.collections.QuadMesh at 0x7f819f416d30>

This is the volumetric census over water temperature and salinity classes.