Starting from version 2.x, PyTorch, a popular deep-learning framework, ships with a JIT compiler, torch.compile. In this post, I share a non-trivial example demonstrating how this tool can reduce the memory footprint on the GPU. The point of departure is a subroutine which computes a similarity, similar to a covariance but not as friendly to compute.
def similiarity(x, y):
    xy = x[:, :, None] - y[:, None, :]   # broadcast to shape (n_samples, n_dim, n_dim)
    xy = xy.abs().lt(1).sum(axis=0)      # reduce over samples -> (n_dim, n_dim)
    xy = xy.to('cpu')                    # move the small result back to the host
    return xy
For two tensors of shape \( (n_{samples},n_{dim})\) it produces a similarity tensor of shape \( (n_{dim},n_{dim})\). However, the logic relies on broadcasting to construct and then reduce an intermediate tensor of shape \( (n_{samples},n_{dim},n_{dim})\). Thus, the naive implementation takes \( O(n_{samples}\cdot n_{dim}^2)\) memory, which is clearly visible in the profiler output below. After compilation, this bottleneck is removed 💪
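To put a number on it: for the shapes used below, \( n_{samples}=256\) and \( n_{dim}=2000\), the float32 intermediate alone occupies \( 256\cdot 2000\cdot 2000\cdot 4\) bytes \( \approx 3.81\) GB (the 3.81 Gb reported for aten::sub in the table further down), and the boolean tensor produced by lt adds another \( \approx 976.56\) MB.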
This is what the profiling code looks like:
import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn((256, 2000)).float().cuda()
torch.cuda.synchronize()

# @torch.compile(mode='max-autotune')  # compare the effect with and without!
def similiarity(x, y):
    xy = x[:, :, None] - y[:, None, :]
    xy = xy.abs().lt(1).sum(axis=0)
    xy = xy.to('cpu')
    return xy

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True) as prof:
    with record_function("memory profile"):
        similiarity(x, x)
        torch.cuda.synchronize()

profiler_summary = prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10)
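To get the compiled variant without editing the definition, one can also wrap the function explicitly instead of uncommenting the decorator; here is a minimal sketch (the name compiled_similiarity and the second profiler run are just illustrative):

# Wrap the function instead of uncommenting the decorator above.
compiled_similiarity = torch.compile(similiarity, mode='max-autotune')

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True) as prof_compiled:
    with record_function("memory profile (compiled)"):
        compiled_similiarity(x, x)
        torch.cuda.synchronize()

print(prof_compiled.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))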
And here is a useful utility to convert the profiler output into a pandas table:
# torch profiler output to pandas
import pandas as pd
import io
import re

# The first line of the profiler table is a row of dash runs separated by spaces;
# its length and the dash runs give the fixed column widths.
total_width = re.search('\n', profiler_summary).start()
widths = [t.end() - t.start() + 2 for t in re.finditer('-{1,}', profiler_summary[:total_width])]

# Parse the fixed-width table, promote the first parsed row (the real header)
# to column names, drop it together with the separator row, and index by operator name.
df = pd.read_fwf(io.StringIO(profiler_summary), widths=widths)
df.columns = df.loc[0]
df.drop([0, 1], axis=0, inplace=True)
df.set_index(df.columns[0], inplace=True)
df.head(10)
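As a quick follow-up (assuming the columns parse under the names shown in the table below), one can then filter the frame, for example keeping only the operators that actually allocate CUDA memory:

# Hypothetical follow-up: keep only operators with a non-zero CUDA allocation
# (column names as printed by the profiler: 'Self CUDA Mem', '# of Calls').
nonzero = df[df['Self CUDA Mem'] != '0 b']
print(nonzero[['Self CUDA Mem', '# of Calls']])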
This is the output without the compiler; note the huge excess memory allocated by tensor operations during broadcasting:
Name | Self CPU | Self CUDA | Self CUDA Mem | # of Calls |
---|---|---|---|---|
aten::empty_strided | 32.000us | 0.000us | 7.63 Gb | 2 |
aten::sub | 69.000us | 4.624ms | 3.81 Gb | 1 |
aten::resize_ | 10.000us | 0.000us | 3.81 Gb | 1 |
aten::lt | 51.000us | 5.857ms | 976.56 Mb | 1 |
aten::slice | 32.000us | 0.000us | 0 b | 4 |
aten::as_strided | 7.000us | 0.000us | 0 b | 7 |
aten::unsqueeze | 10.000us | 0.000us | 0 b | 2 |
cudaLaunchKernel | 185.000us | 0.000us | 0 b | 14 |
void at::native::elementwise_kernel<128, 2, at::nati… | 0.000us | 4.624ms | 0 b | 2 |
aten::abs | 43.000us | 9.711ms | 0 b | 2 |