Applied Machine Learning
Writing Custom Functions in PyTorch
Using C++ and CUDA extensions to write high-performance kernels in PyTorch
This article describes how to use PyTorch’s C++ and CUDA extension mechanism to write high-performance kernels for PyTorch modules.
Background
Occasionally, you may need to apply a transformation or kernel to a tensor that isn’t available in PyTorch’s standard library. You could move the tensor to the CPU, perform the transformation there, and move the result back to the GPU, but that wastes valuable time, especially inside a complex training loop. In this case, it makes sense to apply the transformation to the tensor in place, on the GPU.
In my example, I needed to write a “rounding” map that rounds to the nearest decimal place rather than the nearest integer. Torch doesn’t have an out-of-the-box method for this, and the only other option was a costly transfer of a large tensor back to the CPU. This also gave me an opportunity to document the process in case I ever need to perform a more complex operation.
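The arithmetic behind the map is just scale, round, and unscale. Here is a pure-Python sketch of the element-wise logic (the function name `round_decimals` is my own; this scalar version is only for illustration, since the whole point of the extension is to apply it on the GPU without leaving it):

```python
def round_decimals(x, decimals):
    """Round x to the given number of decimal places.

    This is the element-wise logic the custom kernel applies
    to every entry of the tensor: scale up, round to the
    nearest integer, then scale back down.
    """
    scale = 10 ** decimals
    return round(x * scale) / scale
```

For example, `round_decimals(3.14159, 2)` returns `3.14`.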
The process
For reference, the documentation is quite good; you can find the official guide from PyTorch here.
- Write your kernel in C++. Use `<torch/extension.h>`
- Use pybind11 to bind your custom functions into Python
- Write a `setup.py` and use `torch.utils.cpp_extension` to build your module
- Build your module, and use it in your application
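As a sketch of steps one and two, the whole extension for my decimal-rounding example can fit in one C++ file: a function operating on `torch::Tensor` (which runs on whatever device the tensor lives on), plus the pybind11 binding. The file and function names here (`round_decimals.cpp`, `round_decimals`) are my own, not from the official guide:

```cpp
// round_decimals.cpp -- a sketch of a minimal extension source file.
// <torch/extension.h> pulls in the ATen tensor library and pybind11.
#include <torch/extension.h>
#include <cmath>

// Round every element of `input` to `decimals` decimal places.
// Composed from existing tensor ops, this executes on the device
// the tensor already lives on -- no CPU round trip required.
torch::Tensor round_decimals(torch::Tensor input, int64_t decimals) {
  const double scale = std::pow(10.0, static_cast<double>(decimals));
  return (input * scale).round() / scale;
}

// Bind the function into the Python module. TORCH_EXTENSION_NAME is
// defined at build time by torch.utils.cpp_extension to match the
// module name declared in setup.py.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("round_decimals", &round_decimals,
        "Round a tensor to a given number of decimal places");
}
```

For a hand-written CUDA kernel rather than a composition of existing ops, the device code would go in a separate `.cu` file, with this file forwarding to it.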
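For steps three and four, the build script is short: `torch.utils.cpp_extension` provides a `CppExtension` helper (or `CUDAExtension`, for sources that include `.cu` kernels) and a `BuildExtension` command that supply the PyTorch include paths and compiler flags. The module and file names below are placeholders matching my rounding example:

```python
# setup.py -- build script for the hypothetical round_decimals extension.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="round_decimals",
    ext_modules=[
        # Swap in CUDAExtension here if your sources include .cu files.
        CppExtension("round_decimals", ["round_decimals.cpp"]),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Build in place with `python setup.py build_ext --inplace` (or install with `pip install .`), then use it like any other module: `import round_decimals` and call `round_decimals.round_decimals(x, 2)` on a tensor.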
It’s really that easy.
Thanks for reading!