Photo by John Thomas on Unsplash. This bear has absolutely nothing to do with the content of the article. I just like bears. Bears, beets, Battlestar Galactica.

Applied Machine Learning

Writing Custom Functions in PyTorch

Using C++ and CUDA Extensions to write high-performance kernels in PyTorch

This article describes how to use Torch’s CUDA extension library to write high-performance kernels for PyTorch modules.


Occasionally, you may need to process a tensor (apply a transform or a kernel) with an operation that isn’t in PyTorch’s standard library. You could detach the tensor, move it to the CPU, perform the transformation there, then move it back to the GPU, but that wastes valuable time, especially inside a complex training loop. In that case, it makes more sense to apply the transformation to the tensor in place on the GPU.

In my case, I needed to write a “rounding” map that rounds to a given decimal place rather than to the nearest integer. At the time of writing, Torch didn’t have any out-of-the-box method to do this, and the only other option was a costly transfer of a large tensor back to the CPU. This also gave me an opportunity to document the process in case I ever need to do a more complex operation.
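To make the rounding map concrete, here is a minimal sketch of the arithmetic in plain C++. It is an illustration under assumptions, not necessarily the kernel used in this article: the names round_to_decimals and round_in_place are hypothetical, and it loops over a std::vector on the CPU instead of a tensor on the GPU.

```cpp
#include <cmath>
#include <vector>

// Round one value to `decimals` decimal places: scale up by 10^decimals,
// round to the nearest integer, then scale back down.
// Note: std::round rounds halfway cases away from zero.
inline double round_to_decimals(double value, int decimals) {
    const double scale = std::pow(10.0, decimals);
    return std::round(value * scale) / scale;
}

// Apply the map to every element in place, the same way a kernel would
// visit each element of a tensor (hypothetical helper for illustration).
inline void round_in_place(std::vector<double>& values, int decimals) {
    for (double& v : values) {
        v = round_to_decimals(v, decimals);
    }
}
```

The same scale, round, unscale pattern is what a real kernel would apply per element; only the container and the device change.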

The process

For reference, the documentation is quite good; you can find the official guide from PyTorch here.

  1. Write your kernel in C++, using <torch/extension.h>.
  2. Use pybind11 to bind your custom functions into Python.
  3. Write a setup.py that uses torch.utils.cpp_extension to build your module.
  4. Build your module, and use it in your application.
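The first two steps above can be sketched roughly as follows. This is a hedged example, not the article's actual source: the function names decimal_scale and round_decimals are hypothetical, and the wrapper composes existing tensor operations rather than launching a hand-written CUDA kernel. The __has_include guard is only there so the file still compiles as plain C++ on a machine without the PyTorch headers.

```cpp
#include <cmath>

// Scale factor for rounding to `decimals` decimal places, e.g. 2 -> 100.
// Hypothetical helper, kept free of PyTorch so it can be checked on its own.
inline double decimal_scale(int decimals) {
    return std::pow(10.0, decimals);
}

#if __has_include(<torch/extension.h>)
#include <torch/extension.h>

// Step 1: the operation itself. It is built from tensor ops, so it runs
// on whatever device `input` already lives on (no copy back to the CPU).
torch::Tensor round_decimals(const torch::Tensor& input, int64_t decimals) {
    const double scale = decimal_scale(static_cast<int>(decimals));
    return torch::round(input * scale) / scale;
}

// Step 2: expose the function to Python through pybind11.
// TORCH_EXTENSION_NAME is supplied by the torch.utils.cpp_extension build.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("round_decimals", &round_decimals,
          "Round a tensor to a number of decimal places");
}
#endif
```

For steps 3 and 4, a setup.py would typically pass this file to CppExtension (or CUDAExtension for .cu sources) from torch.utils.cpp_extension, with BuildExtension as the build command. After building, the module is imported under whatever name you gave it there and called like any other Python function.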

It’s really that easy.

Thanks for reading!
