Normality of Inverse Gaussian

The Inverse Gaussian distribution \(\mathsf{IG}(\mu,\lambda)\) is a probability law with density

$$f(x;\mu,\lambda)=\sqrt{\frac{\lambda}{2 \pi x^{3}}} e^{- \frac{\lambda \left(- \mu + x\right)^{2}}{2 \mu^{2} x}}.$$

This distribution arises as passage times of random walks ​1​. For a survey of its basic properties, see ​2​. In this note we explore in detail its approximate normality, which holds when \( \lambda/\mu \to \infty\).

Consider the random variable \(X\sim \mathsf{IG}(\mu,\lambda)\) and its standardised version $$Y = \frac{X-\mathbf{E}[X]}{\sqrt{\mathbf{Var}[X]}}.$$
Below we establish normal approximation in various metrics.

Convergence in Distribution

Since the characteristic function of \(X\) is \(\phi_X(t) = e^{\frac{\lambda \left(1 – \sqrt{1 – \frac{2 i \mu^{2} t}{\lambda}}\right)}{\mu}}\), and \(\mathbf{E}[X] = \mu, \mathbf{Var}[X]=\mu^3/\lambda\), the characteristic function of the standardized distribution equals

$$ \phi_Y(t) = e^{k^{2} \cdot \left(1 – \sqrt{1 – \frac{2 i t}{k}}\right)} e^{- i k t}, $$

where \( k \triangleq \sqrt{\lambda/\mu}\). The limit under \(k\to\infty\) equals \(e^{-\frac{t^2}{2}}\), the characteristic function of the standard normal distribution, which proves convergence in distribution:

$$Y \overset{d}{\rightarrow} \mathsf{N}(0,1),\quad \lambda / \mu \to \infty$$

Convergence in Total Variation

We can also look at the pdf of the normalized distribution, with \(k\) as before,

$$ f_Y(y) = \frac{\sqrt{2} k^{\frac{3}{2}} e^{- \frac{k y^{2}}{2 \left(k + y\right)}}}{2 \sqrt{\pi} \left(k + y\right)^{\frac{3}{2}}},\quad y>-k$$

that can be calculated with SymPy

import sympy as sm

x, y, t, lmbd, mu, k = sm.symbols('x y t lambda mu k', positive=True)

pdf = sm.Mul( sm.sqrt(lmbd/(2*sm.pi*x**3), evaluate=False), sm.exp(-lmbd*(x-mu)**2/(2*mu**2*x)), evaluate=False)

mean = mu
var = mu**3/lmbd

norm_pdf = sm.sqrt(var) * pdf.subs({x:y*sm.sqrt(var)+mu})
norm_pdf.subs({lmbd:k**2*mu}).simplify()

and compare it to that of the standard normal distribution. Namely, we have

$$ f_Y(y) \rightarrow \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}, \quad k=\lambda / \mu \to \infty, $$

that is the convergence of probability densities to the standard normal distribution. By Scheffe’s Theorem ​3​ this proves the convergence in total variation (stronger than convergence in distribution)

$$Y\overset{TV}{\rightarrow} \mathsf{N}(0,1), \quad \lambda/\mu \to \infty.$$

Kolmogorov Distance

We now attempt to derive concrete numerical estimates. The ratio of the characteristic function \(\phi_k\) of \(Y\) and \(\phi_{\infty}\) of the standard normal equals

$$\phi_k(t)/\phi_{\infty}(t) = e^{ k^2 h(t /k ) }, \quad h(x) \triangleq 1-\sqrt{1-2 i x}- ix + \frac{x^2}{2}.$$

Observe that \(h(x)\) expands into a convergent power series around \(x=0\) when \(|x|<\frac{1}{2}\). Thus, under the condition that \(|t| \leq \sqrt{k}\) and \(k\geq 4\), we obtain that \(|k^2 h(t /k )|\leq O(1)\), and by the bound \(|e^{z}-1| = \Theta(z)\) for \(|z|\leq O(1)\) that

$$ \left| e^{ k^2 h(t /k )} -1 \right| = \Theta(1)\cdot \frac{t^3}{k} $$

This gives the following integral estimate

$$ \int_{|t|\leq \sqrt{k}} \left| \frac{\phi_k(t)-\phi_{\infty}(t)}{t} \right| \mbox{d} t = \Theta(1/k). $$

We know handle the regime \(\sqrt{k} < |t| \leq k/2\). Observe that \(\Re h(x) =\Theta(1) x^4\) by the power series expansion. It follows that \(|e^{k^2 h(t /k )}| = e^{\Theta(1)}\) and therefore

$$\int_{\sqrt{k} < |t| \leq k/2} \left| \frac{\phi_k(t)-\phi_{\infty}(t)}{t} \right| \mbox{d} t \leq O(1) \int_{\sqrt{k} < |t| \leq k/2} \frac{\phi_{\infty}(t)}{t} \mbox{d}t \leq o(1/k). $$

The integral Berry-Essen inequality ​4​ implies now the convergence rate in Kolmogorov–Smirnov distance

$$d_{KS}(Y,\mathsf{N}(0,1))=O(1/k) = O(\sqrt{\mu/\lambda}).$$

Total Variation Distance

Observe that the ratio of the densities equals

$$\frac{f_Y(y)}{f_{\infty}(y)} = \frac{k^{\frac{3}{2}} e^{\frac{y^{3}}{2 \left(k + y\right)}}}{\left(k + y\right)^{\frac{3}{2}}},\quad y>-k.$$

By power series expansions we have, uniformly in \(k\)
$$\frac{f_Y(y)}{f_{\infty}(y)} = 1-\Theta(y/k),\quad |y|\leq k^{1/3},$$
and because the expression increases in \(y\) and decreases in \(k\) when \(|y| > k^{1/3}, k>8\) (we have \(\frac{\partial}{\partial y} \frac{f_Y(y)}{f_{\infty}(y)} = \frac{k^{\frac{3}{2}} \left(- 3 k + y^{2} \cdot \left(3 k + 2 y\right) – 3 y\right) e^{\frac{y^{3}}{2 \left(k + y\right)}}}{2 \left(k + y\right)^{\frac{7}{2}}}\) and \(\frac{\partial}{\partial k} \frac{f_Y(y)}{f_{\infty}(y)} = – \frac{\sqrt{k} y \left(k y^{2} – 3 k – 3 y\right) e^{\frac{y^{3}}{2 k + 2 y}}}{2 \left(k + y\right)^{\frac{7}{2}}}\)) also
$$ \frac{f_Y(y)}{f_{\infty}(y)} > 1,\quad y \geq k^{1/3},k\geq 8$$ and
$$ \frac{f_Y(y)}{f_{\infty}(y)} < 1,\quad y \leq -k^{1/3},k\geq 8.$$ Therefore, we obtain
$$ \int \left| f_Y(y)-f_{\infty}(y) \right| \mbox{d}y \leq O(1/k)\int_{|y|<k^{1/3}} |y| f_{\infty}(y)\mbox{d} y + \int_{t<-k^{1/3}} 2 f_{\infty}(y) = O(1/k). $$

Which proves that the convergence rate in total variation is \(O(1/k)\):

$$d_{TV}(Y;\mathsf{N}(0,1)) \leq O(1/k) = O(\sqrt{\mu/\lambda})$$ (in fact, here we obtain the matching lower bound too).

Numerical Evaluation

Numerical results match the theoretical analysis, as shown below:

For completeness, here is the script:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss, norm

fig, axs = plt.subplots(1,2, figsize=(12,4))

# Part 1: evaluate total variation
mus = np.linspace(0.5,0.00001,100)
out = []

for mu in mus:

  # construct Inverse Gaussian
  Z = invgauss(mu=mu)
  mean = Z.mean()
  std = Z.std()
  k = (mean/std)**2

  # standardise and compare with normal
  Z_std = invgauss(mu=mu,loc=-mean/std,scale=1/std)
  Norm = norm(0,1)

  xs = np.linspace(-10,10,1000)
  metric = np.abs(Z_std.pdf(xs) - Norm.pdf(xs)).mean()*20
  result = (k,metric)
  out.append(result)


ax = axs[1]
ax.plot(*zip(*out))
ax.set_xscale('log')
ax.set_xlabel('$\lambda / \mu$')
ax.set_yscale('log')
ax.set_ylabel('total-variation to normal')

# Part 2: plot pdf

ax = axs[0]
for mu in [0.2,0.1,0.05,0.01,0.0001]:

  # construct Inverse Gaussian
  Z = invgauss(mu=mu)
  mean = Z.mean()
  std = Z.std()
  k = (mean/std)**2

  # standardise and plot
  Z_std = invgauss(mu=mu,loc=-mean/std,scale=1/std)
  
  xs = np.linspace(-5,5,10000)

  ax.plot(xs,Z_std.pdf(xs), label=f'$\lambda/\mu$={k:.2f}')

ax.legend(title="standardised distribution")
ax.set_xlabel('$x$')
ax.set_ylabel('pdf')

plt.tight_layout()
plt.show()

References

  1. 1.
    Hartzell FZ. Sequential Analysis. Agronomy Journal. Published online July 1947:640-641. doi:10.2134/agronj1947.00021962003900070011x
  2. 2.
    Folks JL, Chhikara RS. The Inverse Gaussian Distribution and its Statistical Application—A Review. Journal of the Royal Statistical Society: Series B (Methodological). Published online July 1978:263-275. doi:10.1111/j.2517-6161.1978.tb01039.x
  3. 3.
    Scheffe H. A Useful Convergence Theorem for Probability Distributions. Ann Math Statist. Published online September 1947:434-438. doi:10.1214/aoms/1177730390
  4. 4.
    Bobkov SG. Proximity of probability distributions in terms of Fourier-Stieltjes transforms. Russ Math Surv. Published online December 31, 2016:1021-1079. doi:10.1070/rm9749

Optimizing Docker Images – A Case Study

Docker is a modern technology for containerizing software. While packaging to a working version is relatively easy, creating a lightweight Docker image presents a challenge. In this note, I will briefly discuss an example from my university class on software engineering.

The task was to containerize a card game web application, written by students as a group project. The application was created with React.js and used to be run in a development mode (which supports tracing, debugging and so on) via npm start. The page can be then optimized for deployment via npm build. The job of a Docker container is to serve this optimized version with a simple http server.

Of course, the easiest way was to both page build and deployment within the Create React App framework. But the optimal way is to stay minimalistic and leverage multi-stage Docker builds with task-tailored sub-images: one to build and another to serve. The image size indeed gets reduced from 500Mb to about 20MB! To build and serve we used, respectively, a Node.js image and a Static Web Server image. The build product of the first step gets copied to the second image which provides a lightweight server, leaving unessential stuff (cached files, development tools ) behind. The implementation is shown below:

# docker build -t brydz:latest .
# docker run -it -p 3000:3000 brydz:latest sh

# stage 1: install dependencies and build the app
FROM node:18-alpine AS pre_build

WORKDIR /brydz

COPY client/package.json ./client/package.json
COPY client/package-lock.json ./client/package-lock.json
RUN npm install --prefix client
COPY client/ ./client/
ENV PUBLIC_URL=“.” 
RUN npm run build --prefix client

# stage 2: move to a clean prod environment
FROM joseluisq/static-web-server:2-alpine AS final
WORKDIR /brydz
COPY --from=pre_build /brydz/client/build ./build

EXPOSE 3000

CMD ["static-web-server", "--port=3000", "--root=build", "--log-level=trace"]

Approximate Nash Equilibrium by Regret Matching

Two-player games can be solved by following a very intuitive algorithm called Regret Matching ​1​. Players modify their action probabilities according to the so-called regrets or advantages, which can be thought as consequences of alternative choices. For a good overview of the topic, see the friendly yet detailed introduction by Neller and Lanctot ​2​.

The following code snippet demonstrates an efficient implementation for two-player zero-sum finite games in PyTorch, and tests on the game called Rock-Paper-Scissors with Crash. In this variant, the scissors crashes against rock, and the loss is doubled. This shifts the equilibrium choices in favour of “paper”, at the exact proportion of 25%/50%/25%. Remarkably, regrets can be used to estimate the approximation accuracy, which approaches the perfect solution under mild conditions ​3​. This is also demonstrated in the snippet.

import torch

## Game: Rock-Paper-Scissors with Crash, the equilibrium achieved with 25%/50%/25%

N_ACTIONS = 3 # 0: rock, 1: paper, 2: scissors
PAYOFF = torch.tensor([[0,-1,2], [1,0,-1], [-2,1,0]], dtype=torch.float).cuda() # gain of row player / loss of col player

## Utils

def getStrategy(cfr):
  """Return a strategy corresponding to observed regrets
  Args:
    cfr (array): counter-factual regret, shape (N_ACTIONS,)
  Returns:
    weights (array): strategy, shape (N_ACTIONS,)
  """
  weights = torch.clip(cfr,0,torch.inf)
  weights = weights/torch.sum(weights)
  N_ACTIONS = weights.shape[0]
  weights = torch.nan_to_num(weights,nan=1/N_ACTIONS)
  return weights

#@torch.jit.script
@torch.compile(mode='max-autotune')
def getEquilibrium(PAYOFF, N_ITER:int = 500, stochastic_advantage:bool=False):
  # auxiliary variables
  N_ACTIONS = PAYOFF.shape[0]
  cumCFR1 = torch.zeros(N_ACTIONS,).cuda()
  cumCFR2 = torch.zeros(N_ACTIONS,).cuda()
  cumStrategy1 = torch.zeros(N_ACTIONS,).cuda()
  cumStrategy2 = torch.zeros(N_ACTIONS,).cuda()
  strategy1 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS
  strategy2 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS
  # training loop
  for _ in range(N_ITER):
    # sample actions and observe regrets
    if stochastic_advantage:
      # a) stochastic variant, often implemented in tutorials
      action1 = torch.multinomial(strategy1,num_samples=1).squeeze()
      action2 = torch.multinomial(strategy2,num_samples=1).squeeze()
      cfr1 = PAYOFF[:,action2]-PAYOFF[action1,action2]
      cfr2 = - (PAYOFF[action1,:]-PAYOFF[action1,action2])
    else:
      # b) averaged variant
      PAYOFF_avg = strategy1.view(1,-1).mm(PAYOFF).mm(strategy2.view(-1,1))
      cfr1 = (PAYOFF.mm(strategy2.view(-1,1))-PAYOFF_avg).squeeze()
      cfr2 = (strategy1.view(1,-1).mm(PAYOFF)-PAYOFF_avg).squeeze() * (-1)
    # update strategies proportionally to regrets
    strategy1 = getStrategy(cumCFR1)
    strategy2 = getStrategy(cumCFR2)
    # track cumulated regrets and strategies
    cumCFR1 += cfr1
    cumCFR2 += cfr2
    cumStrategy1 += strategy1
    cumStrategy2 += strategy2

  # averaged strategies converge to Nash Equilibrium
  avgStrategy1 = cumStrategy1/cumStrategy1.sum()
  avgStrategy2 = cumStrategy2/cumStrategy2.sum()
  # estimate approximation error (upper bound)
  eps = 2*torch.max(cumCFR1.max(),cumCFR2.max())/N_ITER
  torch.cuda.synchronize()
  return (avgStrategy1,avgStrategy2,eps)

getEquilibrium(PAYOFF) # eps < 0.03

Regrets are expensive to compute in large games, but can be approximated using modern machine-learning techniques. This approach has recently found many applications, including solvers for Poker and even larger card games ​4,5​ .

  1. 1.
    Hart S, Mas-Colell A. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica. Published online September 2000:1127-1150. doi:10.1111/1468-0262.00153
  2. 2.
    Neller TW, Lanctot M. An Introduction to Counterfactual Regret Minimization. Teaching Materials. Published 2013. Accessed 2023. https://www.ma.imperial.ac.uk/~dturaev/neller-lanctot.pdf
  3. 3.
    Waugh K. Abstraction in Large Extensive Games. University of Alberta Libraries. Published online 2009. doi:10.7939/R3CM74
  4. 4.
    Brown N, Lerer A, Gross S, Sandholm T. Deep Counterfactual Regret Minimization. In: ; 2019.
  5. 5.
    Adams D. The Feasibility of Deep Counterfactual Regret Minimisation for Trading Card Games. AI 2022: Advances in Artificial Intelligence. Published online 2022:145-160. doi:10.1007/978-3-031-22695-3_11

Tracing GPU Resources

NVIDIA Monitoring Tools

When evaluating computing performance we look at various KPIs: memory consumption, utilisation of compute power, occupation of hardware accelerators, and – more recently – at the energy consumption and energy efficiency ​1,2​. For popular NVIDIA cards this can be solved with the help of the NVIDIA Management Library, which allows developer to query details of the device state​3​.

The library is easier to use through Python bindings available as pyNVML​4​. Note that Python overheads may be problematic if higher-frequency querying is needed, plus the API likely comes with its own overheads. So the readings should be understood as estimations.

Here is a simple script, which can be adjusted to query more details, if needed:

# see the NVIDIA docs: https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries
# to monitor GPU-1 and dump to a log file, run: python gpu_trace.py 1 log.csv

import sys
import time
import pynvml

pynvml.nvmlInit()

if __name__ == "__main__":
    gpu_index = int(sys.argv[1]) # device
    fname = sys.argv[2] # log file
    with open(fname,'w') as f:
        # select device
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # prepare headers
        f.write('Timestamp;Temperature [C];Power [% max];GPU Util [% time];Mem Util [% time];Mem Cons [% max];Energy [kJ]\n')
        # get some metadata
        power_max = pynvml.nvmlDeviceGetPowerManagementLimit(device_handle)
        energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)
        while True:
            # timestamp
            timestamp = time.time()
            # temperature
            temp = pynvml.nvmlDeviceGetTemperature(device_handle,0) # TODO: set sensor if many?
            # power [% of max]
            power = pynvml.nvmlDeviceGetPowerUsage(device_handle) / power_max * 100.0
            # memory and gpu utilisation [%]
            util = pynvml.nvmlDeviceGetUtilizationRates(device_handle)
            # memory consumption [%]
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
            mem_cons = mem_info.used / mem_info.total * 100.0
            # eneregy delta in kJ (API uses in mJ)
            eneregy = (pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)-energy_start)/10**6
            # output result
            result = (timestamp,temp,power,util.gpu,util.memory,mem_cons,eneregy)
            f.write(';'.join(map(str, result))+'\n')
            time.sleep(0.1)

And here is how to post-process and present results:

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

trace_df = pd.read_csv('/home/log.csv',sep=';',header=0)
trace_df['Timestamp'] = trace_df['Timestamp'].map(datetime.utcfromtimestamp)
trace_df.set_index('Timestamp',inplace=True)

fig,ax1 = plt.subplots(1,1,figsize=(12,6))
cols = ['Power [% max]','GPU Util [% time]','Mem Util [% time]','Mem Cons [% max]']
trace_df[cols].plot(ax=ax1)
ax1.set_ylabel('%')
ax1.legend(loc='upper left')

cols = ['Energy [kJ]']
ax2 = ax1.twinx()
trace_df[cols].plot(ax=ax2, linestyle='dashed',color='black')
ax2.set_ylabel('kJ')
ax2.legend(loc='upper right')

fig.tight_layout() 
plt.show()

Case Study 1: Profiling ETL

The example shown below comes from an ETL processes which utilizes a GPU.

Note that, in this case, monitoring identified likely bottlenecks: the GPU gets idle on a periodic basis (likely, device-to-host transfers) plus is overall underutilised. Estimation of energy consumed is a nice feature, as it would be hard to measure it accurately from power traces (due to high variation and subsampling).

Note that utilisation should be understood as time-occupation, in case of both memory and computing. From the documentation:

unsigned int gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
unsigned int memory : Percent of time over the past sample period during which global (device) memory was being read or written.

https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t

Case Study 2: Scientific Computing and Power Management

The example below shows a trace from a matrix computation task (see the script below)

import torch

x = torch.randn( (256,270725) ).float().cuda('cuda:0')

MATRIX_BATCH = 1000

@torch.compile(mode='reduce-overhead', backend='inductor')
def similarity_op(x,y):
    xy = x[:,:,None] - y[:,None,:]
    xy = xy.abs() < 1
    xy = xy.all(axis=0)
    return xy

_ = similarity_op(torch.randn(1,MATRIX_BATCH),torch.randn(1,MATRIX_BATCH))


def similarity(x):
    x_slices = torch.split(x, MATRIX_BATCH, -1)
    result = []
    for x_i in x_slices:
        result_i = []
        for x_j in x_slices:
            result_i.append(similarity_op(x_i,x_j))
        result_i = torch.cat(result_i,-1)
        result_i = result_i.to(device='cpu', non_blocking=True)
        result.append(result_i)
    result = torch.cat(result,-2)
    torch.cuda.synchronize()
    return result

# start profiling here
_= similarity(x)

In this example, we see different power management strategies on two similar devices:

Case Study 3: Energy Efficiency of Deep Learning

Here we reproduce some results from Tang et al.​1​ to illustrate how adjusting frequency can be used to minimise energy spent per computational task (in their case: image prediction). Higher performance comes at a price of excessive energy used, so that energy curves assumes a typical parabolic shape. Note that, in general, the energy-efficient configuration may be optimised over both clock and memory frequencies ​5​.

And here is the code to reproduce:

import pandas as pd
import seaborn as sns
import numpy as np

# source: Fig 4d, data for resnet-b32 from "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study" 
freq =  [544, 683, 810, 936, 1063, 1202, 1328]
power = [57, 62, 65, 70, 78, 88, 115] # W = J/s
requests = [60, 75, 85, 95, 105, 115, 120] # requests/s

data = pd.DataFrame(data=zip(freq,power,requests),columns=['Frequency','Power','Performance'])
data['Energy'] = data['Power'] / data['Performance'] # [J/s] / [Images/s] = [J/Image]

import matplotlib.pyplot as plt

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

sns.lineplot(data=data,x='Frequency',y='Performance',ax=ax1,color='orange', label='Performance',marker='o')
ax1.set_xticks(data['Frequency'])
ax1.set_ylabel('Image / s')
ax1.set_xlabel('Frequency [MHz]')
ax1.legend(loc=0)

ax12 = ax1.twinx()
sns.lineplot(data=data,x='Frequency',y='Power',ax=ax12,color='steelblue',label='Power',marker='D')
ax12.set_ylabel('W')
ax12.legend(loc=1)

sns.lineplot(data,x='Frequency',y='Energy',ax=ax2,label='Energy')
ax2.set_xticks(data['Frequency'])
ax2.set_ylabel('J / Image')
ax2.set_xlabel('Frequency [MHz]')
ax2.legend(loc=0)
plt.title('Performance, power, and energy for training of resnet-b32 network on P100.\n Reproduced from: "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study"')
plt.tight_layout()
plt.show()

References

  1. 1.
    Tang Z, Wang Y, Wang Q, Chu X. The Impact of GPU DVFS on the Energy and Performance of Deep Learning. Proceedings of the Tenth ACM International Conference on Future Energy Systems. Published online June 15, 2019. doi:10.1145/3307772.3328315
  2. 2.
    Tang K, He X, Gupta S, Vazhkudai SS, Tiwari D. Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows. 2018 27th International Conference on Computer Communication and Networks (ICCCN). Published online July 2018. doi:10.1109/icccn.2018.8487322
  3. 3.
    NVIDIA. NVIDIA Management Library Documentation. NVML-API. Accessed August 2023. https://docs.nvidia.com/deploy/nvml-api/index.html
  4. 4.
    Hirschfeld A. Python bindings to the NVIDIA Management Library. pyNVML. Accessed August 2023. https://pypi.org/project/nvidia-ml-py/#description
  5. 5.
    Fan K, Cosenza B, Juurlink B. Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels. Computation. Published online April 27, 2020:37. doi:10.3390/computation8020037

Compiling Torch Code – Example on Memory Reduction

Staring from the version 2.x PyTorch, a popular deep-learning framework, introduces a JIT compiler torch.compile. In this post, I am sharing a non-trivial example demonstrating how this tool can reduce memory footprint on GPU. The point of departure is a sub-routine which computes similarity, similar to covariance but not as friendly to compute.

def similiarity(x,y):
    xy = x[:,:,None]-y[:,None,:]
    xy = xy.abs().lt(1).sum(axis=0)
    xy = xy.to('cpu')
    return xy

For two tensors of shape \( (n_{samples},n_{dim})\) it produces a similarity tensor of shape \( (n_{dim},n_{dim})\). However, the logic uses broadcasting when constructing and reducing an intermediate tensor of shape \( (n_{samples},n_{dim},n_{dim})\). Thus, the naive implementation takes \( O(n_{samples}\cdot n_{dim}^2)\) of memory which is seen from the profiler. After compilation, this bottleneck is removed 💪

This is how the profiling code looks like:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn( (256,2000) ).float().cuda()
torch.cuda.synchronize()

#@torch.compile(mode='max-autotune') # compare the effect with and without !
def similiarity(x,y):
    xy = x[:,:,None]-y[:,None,:]
    xy = xy.abs().lt(1).sum(axis=0)
    xy = xy.to('cpu')
    return xy

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True) as prof:
    with record_function("memory profile"):
        similiarity(x,x)
        torch.cuda.synchronize()

profiler_summary = prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10)

And this is a useful utility to convert the profiling output to table

# torch profiler output to pandas

import pandas as pd
import io
import re

total_width = re.search('\n',profiler_summary).start()
widths = [t.end()-t.start()+2 for t in re.finditer('-{1,}',profiler_summary[:total_width])]

df = pd.read_fwf(io.StringIO(profiler_summary),widths=widths)
df.columns = df.loc[0]
df.drop([0,1],axis=0,inplace=True)
df.set_index(df.columns[0], inplace=True)
df.head(10)

This is the output without compiler, note huge memory excess in tensor operations while broadcasting:

NameSelf CPUSelf CUDASelf CUDA Mem# of Calls
aten::empty_strided32.000us0.000us7.63 Gb2
aten::sub69.000us4.624ms3.81 Gb1
aten::resize_10.000us0.000us3.81 Gb1
aten::lt51.000us5.857ms976.56 Mb1
aten::slice32.000us0.000us0 b4
aten::as_strided7.000us0.000us0 b7
aten::unsqueeze10.000us0.000us0 b2
cudaLaunchKernel185.000us0.000us0 b14
void at::native::elementwise_kernel<128, 2, at::nati…0.000us4.624ms0 b2
aten::abs43.000us9.711ms0 b2
Profiling without torch.compiler

Dealing with geo coordinates

Location on maps are often provided in a Coordinate Reference System (CRS). In computer vision projects, however, we need to translate them to pixels. Plotting and transformations can be accomplished with the rasterio Python module.

Below I share a useful snippet showing how to convert CRS locations to pixel coordinates:

import matplotlib.pyplot as plt
import geopandas
import rasterio as rs
from rasterio.plot import show
import pandas as pd
from shapely.geometry import Point
from PIL import Image, ImageDraw

fig, axs = plt.subplots(1, 2)
# open the image and its annotations
img_path = "trees_counting_mount/polish_ortophoto/1_2000/images/66579_623608_8.135.12.19_cropped_2000.tif"
label_path = "trees_counting_mount/polish_ortophoto/1_2000/annotations/66579_623608_8.135.12.19_cropped_2000.csv"
raster = rs.open(img_path, crs="EPSG:2180")
label_df = pd.read_csv(label_path)

# extract selected points
geometry = list(map(Point, label_df[["x", "y"]].values))
idxs = [0, 100, 200]
Ps = [geometry[idx] for idx in idxs]

# plot points in two alternative ways!
ax = axs[0]
show(raster, ax=ax)
# variant 1: plot geo-coordinates with geopandas.GeoDataFrame
geo_df = geopandas.GeoDataFrame(None, geometry=Ps, crs="EPSG:2180")
geo_df.plot(ax=ax)
ax = axs[1]
# variant 2: convert geo-coordinates to pixel locations
ys, xs = rs.transform.rowcol(raster.transform, [P.x for P in Ps], [P.y for P in Ps])
img = Image.open(img_path)
img_draw = ImageDraw.Draw(img)

for x, y in zip(xs, ys):
  img_draw.ellipse( (x-150,y-150,x+150,y+150), fill="yellow")

plt.imshow(img)

plt.title("Points in geo-coordinates (left) and image pixel coordinates (right).")
plt.tight_layout()
plt.show()

Setting up CUDA tools properly

CUDA is a computing platform for graphical processing units (GPUs) developed by NVIDIA, widely used to accelerate machine-learning. Existing frameworks, such as Tensorflow or PyTorch, utilize it under the hood not asking user for any specific coding. However, it is still necessary to set its dependencies, particularly the compiler nvcc, properly to benefit of acceleration. In this short note, I share an interesting use-case that occurred when prototyping on Kaggle Docker image and NVIDIA Docker image.

Compatibility of CUDA tools and targeted libraries

It turns out that one of Kaggle images was released with incompatible CUDA dependencies: compilation tools were not aligned with PyTorch, as revealed when attempting to compile detectron2, an object detection library by Facebook.

(base) maciej.skorski@shared-notebooks:~$ docker images
REPOSITORY                        TAG                        IMAGE ID       CREATED        SIZE
gcr.io/kaggle-gpu-images/python   latest                     87983e20c290   4 weeks ago    48.1GB
nvidia/cuda                       11.6.2-devel-ubuntu20.04   e1687ea9fbf2   7 weeks ago    5.75GB
gcr.io/kaggle-gpu-images/python   <none>                     2b12fe42f372   2 months ago   50.2GB

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
  -it \
  --name kaggle-test \
  --runtime=nvidia \
  --mount type=bind,source=/home/maciej.skorski,target=/home \
  2b12fe42f372

(base) maciej.skorski@shared-notebooks:~$ docker exec -it kaggle-test python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
...
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.

In order to compile detectron2, it was necessary to align the CUDA toolkit version. Rather than trying to install it manually – which is known to be an error-prone task – a working solution was to change the Kaggle image. It turns out that the gap was bridged in a subsequent release:

(base) maciej.skorski@shared-notebooks:~$ docker run 87983e20c290 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(base) maciej.skorski@shared-notebooks:~$ docker run 2b12fe42f372 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

And indeed, the Facebook library installed smoothly under the new image 👍

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
   -it \
   --name kaggle-test \
   --runtime=nvidia \
   --mount type=bind,source=/home/maciej.skorski,target=/home \
   87983e20c290
bf60d0e3f3bdb42c5c08b24598bb3502b96ba2c461963d11b31c1fda85f9c26b
(base) maciej.skorski@shared-notebooks:~$ docker exec -it kaggle-test python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Collecting git+https://github.com/facebookresearch/detectron2.git
...
Successfully built detectron2 fvcore antlr4-python3-runtime pycocotools

Compatibility of CUDA tools and GPU drivers

The compiler version should not be significantly newer than that that of the driver, as presented by nvidia-smi:

(base) maciej.skorski@shared-notebooks:~$ nvidia-smi
Thu Aug 10 14:56:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    30W /  70W |  12262MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8635      C   ...detectron_venv/bin/python    12260MiB |
+-----------------------------------------------------------------------------+

Consider the simple CUDA script querying the GPU device properties:

// query_GPU.cu

#include <stdio.h> 

int main() {
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Name: %s\n", prop.name);
    printf("  Integrated: %d\n", prop.integrated);
    printf("  Compute capability: %d.%d\n", prop.major, prop.minor );
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    printf( "  Total global mem: %ld\n", prop.totalGlobalMem );
    printf( "  Multiprocessor count: %d\n", prop.multiProcessorCount );
  }
}

This code compiles and presents GPU properties only under the image equipped with the matching major compiler version (select the appropriate image here):

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
  -it \
  --name nvidia-cuda \
  --runtime=nvidia \
  --mount type=bind,source=$(pwd),target=/home \
  --privileged \
  nvidia/cuda:11.6.2-devel-ubuntu20.04

docker exec -it nvidia-cuda sh -c "nvcc /home/query_GPU.cu -o /home/query_GPU && /home/query_GPU"
Device Number: 0
  Name: Tesla T4
  Integrated: 0
  Compute capability: 7.5
  Peak Memory Bandwidth (GB/s): 320.064000

  Total global mem: 15634661376
  Multiprocessor count: 40

However, the container doesn’t even start with a mismatching major version:

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
>   -it \
>   --name nvidia-cuda \
>   --runtime=nvidia \
>   --mount type=bind,source=$(pwd),target=/home \
>   --privileged \
>   nvidia/cuda:12.2.0-devel-ubuntu20.04
d14d07b8b04bc7e6e27ce8312452850946d98d82611cb24c3e662ceb27d708c5
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.2, please update your driver to a newer version, or use an earlier cuda container: unknown.

Fixing Reproducibility of Scientific Repos

As the last example, consider the recent cuZK project which implements some state-of-the-art cryptographic protocols on GPU. The original code was missing dependencies and compilation instructions, therefore I shared a working fork version.

To work with the code, let’s use the NVIDIA Docker image with the appropriate version, here I selected the tag 11.6.2-devel-ubuntu20.04. Checkout the code and start a container mounting the working directory with the GitHub code, like below:

docker run -d \
   -it \
   --name nvidia-cuda \
   --runtime=nvidia \
   --mount type=bind,source=$(pwd),target=/home \
   --privileged \
   nvidia/cuda:11.6.2-devel-ubuntu20.04

To work with the code, we need few more dependencies within the container:

apt-get update
apt-get install -y git libgmp3-dev

After adjusting the headers in Makefile, the CUDA code can be compiled and run

root@7816e1643c2a:/home/cuZK/test# make
...
root@7816e1643c2a:/home/cuZK/test# ls
BLS377         MSMtestbn.cu   Makefile      core          msmtesta  testb       testbn.cu
MSMtestbls.cu  MSMtestmnt.cu  libgmp.a     msmtestb  testbls.cu  testmnt.cu
root@7816e1643c2a:/home/cuZK/test# ./msmtestb
Please enter the MSM scales (e.g. 20 represents 2^20) 

Lego Bricks in LaTeX

Who does not enjoy lego bricks, raise a hand! In this post, I am sharing an elegant and efficient way of plotting bricks under 3d view in TikZ. Briefly speaking, it utilizes canvas transforms to plot facets, and describes boundaries of studs in a simple way with cylindrical coordinates based on the azimuth angle (localizing extreme edges might be a challenge on its own).
While there are other packages, like TikZbricks, this method seems simpler in terms of complexity and brings some educational value in terms of cylinders geometry.

And here is the code (see also this online note)

\documentclass[12pt]{standalone}

\usepackage{pgfplots}
\usepackage{tikz-3dplot}

\begin{document}

\pgfmathsetmacro{\pinradius}{0.25}

%  elevation and azimuth for 3D-view
\def\rotx{60}
\def\rotz{120}

\newcommand{\brick}[8]{
    \pgfmathsetmacro{\posx}{#1}
    \pgfmathsetmacro{\posy}{#2}
    \pgfmathsetmacro{\posz}{#3}
    \pgfmathsetmacro{\cubex}{#4}
    \pgfmathsetmacro{\cubey}{#5}
    \pgfmathsetmacro{\cubez}{#6}

    % cube by rectangle facets
    \begin{scope}
    \begin{scope}[canvas is yx plane at z=\posz,transform shape]
    \draw[fill=#8] (\posy,\posx) rectangle ++(\cubey,\cubex);
    \end{scope}
    \begin{scope}[canvas is yx plane at z=\posz+\cubez,transform shape]
    \draw[fill=#8] (\posy,\posx) rectangle ++(\cubey,\cubex);
    \end{scope}
    \begin{scope}[canvas is yz plane at x=\posx+\cubex,transform shape]
    \draw[fill=#8] (\posy,\posz) rectangle ++(\cubey,\cubez) node[pos=.5] {#7};
    \end{scope}
    \begin{scope}[canvas is xz plane at y=\posy+\cubey,transform shape]
    \draw[fill=#8] (\posx,\posz) rectangle ++(\cubex,\cubez);
    \end{scope}
    \end{scope}

    % studs by arcs and extreme edges
    \foreach \i in {1,...,\cubey}{
        \foreach \j in {1,...,\cubex}{
            % upper part - full circle
            \draw [thin] (\posx-0.5+\j,\posy-0.5+\i,\posz+\cubez+0.15) circle (\pinradius);
            % lower part - arc
            \begin{scope}[canvas is xy plane at z=\posz+\cubez]
            \draw[thin] ([shift=(\rotz:\pinradius)] \posx-0.5+\j,\posy-0.5+\i) arc (\rotz:\rotz-180:\pinradius);
            \end{scope}
            \begin{scope}[shift={(\posx-0.5+\j,\posy-0.5+\i)}]
                % edges easily identified in cylindrical coordinates! 
                \pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz}{\pinradius}{\posz+\cubez+0.15} };
                \pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz}{\pinradius}{\posz+\cubez} };
                \draw[] (edge1_top) -- (edge1_bottom);
                \pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz+180}{\pinradius}{\posz+\cubez+0.15} };
                \pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz+180}{\pinradius}{\posz+\cubez} };
                \draw[] (edge1_top) -- (edge1_bottom);
           \end{scope}
        }
    }
}

\tdplotsetmaincoords{\rotx}{\rotz}
\begin{tikzpicture}[tdplot_main_coords,]
    % draw axes
    \coordinate (O) at (0,0,0);
    \coordinate (A) at (5,0,0);
    \coordinate (B) at (0,5,0);
    \coordinate (C) at (0,0,5);
    \draw[-latex] (O) -- (A) node[below] {$x$};
    \draw[-latex] (O) -- (B) node[above] {$y$};
    \draw[-latex] (O) -- (C) node[left] {$z$};
    % draw brick
    \brick{0}{1}{0}{3}{3}{1}{Lego}{blue!50};
    \brick{0}{1}{2}{2}{3}{1}{Enjoys}{green!50};
    \brick{0}{1}{4}{1}{3}{1}{Everybody}{red!50};
\end{tikzpicture}


\end{document}

Cylinders in LaTeX the Easy and Correct Way

Drawing cylinders in vector graphic is a common task. It is less trivial as it looks at first glance, due to the challenge of finding a proper projection. In this post, I share a simple and robust recipe using the tikz-3dplot package of LaTeX. As opposed to many examples shared online, this approach automatically identifies the boundary of a cylinder, under a given perspective. The trick is to identify edges using the azimuth angle in cylindrical coordinates 💪.

\documentclass{standalone}
\usepackage{tikz,tikz-3dplot}
\begin{document}

\def\rotx{70}
\def\rotz{110}
\tdplotsetmaincoords{\rotx}{\rotz}
\begin{tikzpicture}[tdplot_main_coords]
\tikzstyle{every node}=[font=\small]
\draw[ultra thin,-latex] (0,0,0) -- (6,0,0) node[anchor=north east]{$x$};
\draw[ultra thin,-latex] (0,0,0) -- (0,6,0) node[anchor=north west]{$y$};
\draw[ultra thin,-latex] (0,0,0) -- (0,0,6) node[anchor=south]{$z$};
\draw [thick](0,0,4) circle (3);
\begin{scope}[canvas is xy plane at z=0]
\draw[thick, dashed] ([shift=(\rotz:3)] 0,0,0) arc (\rotz:\rotz+180:3);
\draw[thick] ([shift=(\rotz:3)] 0,0,0) arc (\rotz:\rotz-180:3);
\end{scope}
% manual edges
\draw [dotted, red](1.9,-2.34,0) -- (1.9,-2.34,4) node[midway, left]{};
\draw [dotted, red](-1.9,2.35,0) -- (-1.9,2.35,4);
% automatic edges !
\pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz}{3}{4} };
\pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz}{3}{0} };
\draw[thick] (edge1_top) -- (edge1_bottom);
\pgfcoordinate{edge2_top}{ \pgfpointcylindrical{\rotz+180}{3}{4} };
\pgfcoordinate{edge2_bottom}{ \pgfpointcylindrical{\rotz+180}{3}{0} };
\draw[thick] (edge2_top) -- (edge2_bottom);
\end{tikzpicture}

\end{document}

And we can enjoy this output, comparing with manual edges adapted from this post:

Properly drawn cylinder

Automating Accessors and Mutators Tests

In object-oriented programming, there are plenty of accessors and mutators to test. This post demonstrates that this effort can be automated with reflection 🚀. The inspiration came from discussions I had with my students during our software-engineering class: how to increase code coverage without lots of manual effort? 🤔

Roughly speaking, the reflection mechanism allows the code to analyse itself. At runtime, we are able to construct calls based on extracted class properties. The idea is not novel, see for instance this gist. To add the value and improve the presentation, I modernized and completed the code to a fully featured project on GitHub with CI/CD on GitHub Actions and Code Coverage connected 😎.

Here is how the testing class looks like. Java reflection accesses classes, extracts fields and their types and constructs calls with type-matching values accordingly:

// tester class
class AutoTests {
  
    private static final Class[] classToTest = new Class[]{ 
        // the list of classes to test
        PersonClass.class, AnimalClass.class
    };

   @Test 
   public void correctGettersSetters() {
      for (Class aClass : classToTest) {
        Object instance;
        try {      
           instance = aClass.getDeclaredConstructor().newInstance();
           Field[] declaredFields = aClass.getDeclaredFields();
           for(Field f: declaredFields) {
              // get the field getter and setter, following the Java naming convention (!)
              // www.theserverside.com/feature/Java-naming-conventions-explained
              String name = f.getName();
              name = name.substring(0,1).toUpperCase() + name.substring(1);
              String getterName = "get" + name;
              String setterName = "set" + name;
              Method getterMethod = aClass.getMethod(getterName);
              Method setterMethod = aClass.getMethod(setterName, getterMethod.getReturnType());
              // prepare a test value based on the filed type 
              Object testVal = null;
              Class<?> fType = f.getType();
              if (fType.isAssignableFrom(Integer.class)) {
                  testVal = 1234;
              } else if (fType.isAssignableFrom(String.class)) {
                  testVal = "abcd";
              }
              // test by composing the setter and getter
              setterMethod.invoke(instance, testVal);
              Object result = getterMethod.invoke(instance);
              System.out.printf("Testing class=%s field=%s...\n", aClass.getName(), f.getName());
              assertThat(result).as("in class %s fields %s", aClass.getName(), f.getName()).isEqualTo(testVal);
           }
        } catch(Exception e) {
           System.out.println(e.toString());
        }
      }
   }
}

And here is one more demo available online.

Finally, a disclaimer: accessors and mutators may deserve smarter tests than what stated here – depending on a use-case.