Tracing GPU Resources

NVIDIA Monitoring Tools

When evaluating computing performance, we look at various KPIs: memory consumption, utilisation of compute power, occupancy of hardware accelerators, and, more recently, energy consumption and energy efficiency [1,2]. For popular NVIDIA cards this can be done with the NVIDIA Management Library (NVML), which allows developers to query details of the device state [3].

The library is easier to use through the Python bindings available as pyNVML [4]. Note that Python overhead may be problematic if high-frequency querying is needed, and the API likely comes with its own overhead, so the readings should be understood as estimates.
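
To get a feel for how much the querying itself costs, one can time the calls directly. The following is a minimal sketch, not part of the monitoring script: `time_query` is a hypothetical helper, and the stand-in workload is an assumption chosen so that the snippet runs without a GPU.

```python
import statistics
import time


def time_query(query, repeats=1000):
    """Return (median, maximum) wall-clock latency of `query` in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        query()
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies), max(latencies)


if __name__ == "__main__":
    # Stand-in workload (an assumption) so the sketch runs anywhere; with pyNVML
    # one would pass e.g. lambda: pynvml.nvmlDeviceGetPowerUsage(device_handle).
    median_s, worst_s = time_query(lambda: sum(range(1000)))
    print(f"median {median_s * 1e6:.1f} us, worst {worst_s * 1e6:.1f} us")
```

The median gives a rough lower bound on the achievable sampling period, while the maximum hints at how irregular the spacing of samples can get.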

Here is a simple script, which can be adjusted to query more details, if needed:

# see the NVIDIA docs:
# to monitor GPU-1 and dump to a log file, run: python 1 log.csv

import sys
import time
import pynvml


if __name__ == "__main__":
    gpu_index = int(sys.argv[1]) # device
    fname = sys.argv[2] # log file
    # NVML must be initialised before any other call
    pynvml.nvmlInit()
    with open(fname,'w') as f:
        # select device
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # prepare headers
        f.write('Timestamp;Temperature [C];Power [% max];GPU Util [% time];Mem Util [% time];Mem Cons [% max];Energy [kJ]\n')
        # get some metadata
        power_max = pynvml.nvmlDeviceGetPowerManagementLimit(device_handle) # in mW
        energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle) # in mJ
        while True:
            # timestamp
            timestamp = time.time()
            # temperature of the GPU die sensor
            temp = pynvml.nvmlDeviceGetTemperature(device_handle,pynvml.NVML_TEMPERATURE_GPU)
            # power [% of max]
            power = pynvml.nvmlDeviceGetPowerUsage(device_handle) / power_max * 100.0
            # memory and gpu utilisation [%]
            util = pynvml.nvmlDeviceGetUtilizationRates(device_handle)
            # memory consumption [%]
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
            mem_cons = mem_info.used / mem_info.total * 100.0
            # energy delta in kJ (API reports mJ)
            energy = (pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)-energy_start)/10**6
            # output result
            result = (timestamp,temp,power,util.gpu,util.memory,mem_cons,energy)
            f.write(';'.join(map(str, result))+'\n')

And here is how to post-process and present results:

import pandas as pd
import matplotlib.pyplot as plt

trace_df = pd.read_csv('/home/log.csv',sep=';',header=0)
# timestamps are logged as seconds since the epoch (UTC)
trace_df['Timestamp'] = pd.to_datetime(trace_df['Timestamp'],unit='s')
# use time as the x-axis of both plots
trace_df = trace_df.set_index('Timestamp')

fig,ax1 = plt.subplots(1,1,figsize=(12,6))
cols = ['Power [% max]','GPU Util [% time]','Mem Util [% time]','Mem Cons [% max]']
trace_df[cols].plot(ax=ax1)
ax1.legend(loc='upper left')

# cumulative energy on a secondary y-axis
cols = ['Energy [kJ]']
ax2 = ax1.twinx()
trace_df[cols].plot(ax=ax2, linestyle='dashed',color='black')
ax2.legend(loc='upper right')


Case Study 1: Profiling ETL

The example shown below comes from an ETL process that utilises a GPU.

Note that, in this case, monitoring identified likely bottlenecks: the GPU goes idle periodically (likely due to device-to-host transfers) and is underutilised overall. The energy counter is a nice feature, as it would be hard to measure energy accurately from power traces (due to high variation and subsampling).
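
To illustrate why integrating a subsampled power trace can be misleading, here is a minimal sketch. The bursty load profile is synthetic and purely an assumption for illustration, and `energy_from_power` and `bursty_power` are hypothetical helpers, not part of NVML.

```python
def energy_from_power(times_s, power_w):
    """Trapezoidal estimate of energy [J] from power samples [W] taken at times [s]."""
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


def bursty_power(t_ms):
    """Synthetic load (an assumption): 200 W for the first 100 ms of each second, else 50 W."""
    return 200.0 if t_ms % 1000 < 100 else 50.0


def estimate(step_ms, duration_ms=10_000):
    """Sample the synthetic load every `step_ms` milliseconds and integrate."""
    ticks = range(0, duration_ms + 1, step_ms)
    times_s = [t / 1000 for t in ticks]
    power_w = [bursty_power(t) for t in ticks]
    return energy_from_power(times_s, power_w)


dense = estimate(step_ms=1)     # 1 kHz sampling resolves the bursts
sparse = estimate(step_ms=500)  # 2 Hz sampling hits the peak on every other sample
print(f"dense: {dense:.0f} J, sparse: {sparse:.0f} J")
```

In this toy setting the sparse trace roughly doubles the estimate, because every other sample happens to land on a short burst. A cumulative counter such as the one returned by `nvmlDeviceGetTotalEnergyConsumption` does not depend on where the samples fall.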

Note that utilisation should be understood as time occupancy, for both memory and compute. From the documentation:

unsigned int gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
unsigned int memory : Percent of time over the past sample period during which global (device) memory was being read or written.
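
The definition can be made concrete with a small sketch. It is only a model of the quoted wording, not of how the driver actually computes the counter; `utilisation` and the kernel intervals below are hypothetical.

```python
def utilisation(busy_intervals, period_start, period_end):
    """Percent of the sample period covered by the union of busy intervals."""
    # Clip every interval to the sample period and process them in start order.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in busy_intervals
    )
    covered = 0
    cursor = period_start  # everything before the cursor is already counted
    for start, end in clipped:
        start = max(start, cursor)
        if end > start:
            covered += end - start
            cursor = end
    return 100.0 * covered / (period_end - period_start)


# Hypothetical kernel execution intervals in milliseconds; the first two overlap.
kernels_ms = [(0, 200), (100, 300), (600, 700)]
print(utilisation(kernels_ms, 0, 1000))
```

Overlapping kernels are not counted twice, and the figure says nothing about how much of the device each kernel occupies: a single small kernel that runs for the whole period would report 100%.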

Case Study 2: Scientific Computing and Power Management

The example below shows a trace from a matrix computation task, generated by the following script:

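import torch

# NOTE: the tile size is not given in the original listing; 1024 is an assumed placeholder
MATRIX_BATCH = 1024

x = torch.randn( (256,270725) ).float().cuda('cuda:0')


@torch.compile(mode='reduce-overhead', backend='inductor')
def similarity_op(x,y):
    # pairwise differences between columns of x and columns of y
    xy = x[:,:,None] - y[:,None,:]
    xy = xy.abs() < 1
    # two columns are similar if all their coordinates differ by less than 1
    xy = xy.all(axis=0)
    return xy

# warm-up call to trigger compilation
_ = similarity_op(torch.randn(1,MATRIX_BATCH),torch.randn(1,MATRIX_BATCH))

def similarity(x):
    # process the columns tile by tile to bound the size of intermediates
    x_slices = torch.split(x, MATRIX_BATCH, -1)
    result = []
    for x_i in x_slices:
        result_i = []
        for x_j in x_slices:
            result_i.append(similarity_op(x_i,x_j))
        # assemble one block-row and move it off the device
        result_i = torch.cat(result_i,-1)
        result_i = result_i.to('cpu', non_blocking=True)
        result.append(result_i)
    result = torch.cat(result,-2)
    return result

# start profiling here
_= similarity(x)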
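
To make clear what this workload computes, here is a small CPU-only sketch in NumPy (used instead of PyTorch so that it runs without a GPU). It checks on a toy input that assembling the result from tiles matches the all-at-once computation; the function names and the toy sizes are assumptions for illustration.

```python
import numpy as np


def similar_direct(x):
    """All-pairs result in one shot; needs a (d, n, n) intermediate array."""
    diff = np.abs(x[:, :, None] - x[:, None, :])
    return (diff < 1).all(axis=0)


def similar_tiled(x, batch):
    """Same result, built from tiles of at most (batch x batch) entries."""
    n = x.shape[1]
    rows = []
    for i in range(0, n, batch):
        xi = x[:, i:i + batch]
        tiles = []
        for j in range(0, n, batch):
            xj = x[:, j:j + batch]
            diff = np.abs(xi[:, :, None] - xj[:, None, :])
            tiles.append((diff < 1).all(axis=0))
        rows.append(np.concatenate(tiles, axis=-1))  # one block-row
    return np.concatenate(rows, axis=-2)             # stack block-rows


rng = np.random.default_rng(0)
x_small = rng.standard_normal((4, 10))  # 10 points with 4 coordinates each
tiled = similar_tiled(x_small, batch=3)
print(np.array_equal(tiled, similar_direct(x_small)))
```

Entry (i, j) of the result is true when columns i and j differ by less than 1 in every coordinate. Tiling keeps the largest intermediate at d x batch x batch elements instead of d x n x n, which is what makes the full-size problem feasible on a single device.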

In this example, we see different power management strategies on two similar devices:

Case Study 3: Energy Efficiency of Deep Learning

Here we reproduce some results from Tang et al. [1] to illustrate how adjusting frequency can be used to minimise the energy spent per computational task (in their case: image prediction). Higher performance comes at the price of excessive energy use, so the energy curve assumes a typical parabolic shape. Note that, in general, the energy-efficient configuration may be optimised over both core clock and memory frequencies [5].

And here is the code to reproduce:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# source: Fig 4d, data for resnet-b32 from "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study" 
freq =  [544, 683, 810, 936, 1063, 1202, 1328]
power = [57, 62, 65, 70, 78, 88, 115] # W = J/s
requests = [60, 75, 85, 95, 105, 115, 120] # requests/s

data = pd.DataFrame(data=zip(freq,power,requests),columns=['Frequency','Power','Performance'])
data['Energy'] = data['Power'] / data['Performance'] # [J/s] / [Images/s] = [J/Image]

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

# left panel: performance, with power on a secondary axis
sns.lineplot(data=data,x='Frequency',y='Performance',ax=ax1,color='orange', label='Performance',marker='o')
ax1.set_ylabel('Image / s')
ax1.set_xlabel('Frequency [MHz]')
ax1.legend(loc='upper left')

# the power and energy plotting calls are missing from the listing;
# the two lineplot calls below are a reconstruction with assumed colours
ax12 = ax1.twinx()
sns.lineplot(data=data,x='Frequency',y='Power',ax=ax12,color='steelblue', label='Power',marker='o')
ax12.set_ylabel('W')
ax12.legend(loc='lower right')

# right panel: energy per image
sns.lineplot(data=data,x='Frequency',y='Energy',ax=ax2,color='black', label='Energy',marker='o')
ax2.set_ylabel('J / Image')
ax2.set_xlabel('Frequency [MHz]')
fig.suptitle('Performance, power, and energy for training of resnet-b32 network on P100.\n Reproduced from: "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study"')
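
The most energy-efficient setting can also be read off numerically. A minimal sketch using the same figures as the listing above (plain Python, no plotting):

```python
freq = [544, 683, 810, 936, 1063, 1202, 1328]    # MHz
power = [57, 62, 65, 70, 78, 88, 115]            # W = J/s
requests = [60, 75, 85, 95, 105, 115, 120]       # images/s

# Energy per image [J] = power [J/s] / throughput [images/s]
energy = [p / r for p, r in zip(power, requests)]

# Index of the frequency with the lowest energy per image
best = min(range(len(freq)), key=energy.__getitem__)

# Relative saving and slowdown compared with the highest frequency
saving = 1 - energy[best] / energy[-1]
slowdown = 1 - requests[best] / requests[-1]

print(f"best frequency: {freq[best]} MHz at {energy[best]:.2f} J/image")
print(f"energy saved vs {freq[-1]} MHz: {saving:.1%}, throughput lost: {slowdown:.1%}")
```

With these numbers the minimum falls at 936 MHz (about 0.74 J per image), roughly 23% less energy per image than at 1328 MHz, in exchange for about 21% lower throughput.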


  1. Tang Z, Wang Y, Wang Q, Chu X. The Impact of GPU DVFS on the Energy and Performance of Deep Learning. Proceedings of the Tenth ACM International Conference on Future Energy Systems. Published online June 15, 2019. doi:10.1145/3307772.3328315
  2. Tang K, He X, Gupta S, Vazhkudai SS, Tiwari D. Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows. 2018 27th International Conference on Computer Communication and Networks (ICCCN). Published online July 2018. doi:10.1109/icccn.2018.8487322
  3. NVIDIA. NVIDIA Management Library Documentation. NVML-API. Accessed August 2023.
  4. Hirschfeld A. Python bindings to the NVIDIA Management Library. pyNVML. Accessed August 2023.
  5. Fan K, Cosenza B, Juurlink B. Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels. Computation. Published online April 27, 2020:37. doi:10.3390/computation8020037

Published by mskorski

Scientist, Consultant, Learning Enthusiast
