Applied Machine Learning
Writing Custom Functions in PyTorch
Using C++ and CUDA extensions to write high-performance kernels in PyTorch
This article describes how to use PyTorch’s C++ and CUDA extension mechanism to write high-performance kernels for PyTorch modules.
Background
Occasionally, you may need to apply a transformation or kernel to a tensor that isn’t available in PyTorch’s standard library. You could move the tensor to the CPU, perform the transformation there, and move the result back to the GPU, but that wastes valuable time, especially inside a complex training loop. In this case, it makes sense to apply the transformation to the tensor in place, on the GPU.
In my example, I needed to write a “rounding” map that rounds to the nearest decimal place rather than the nearest integer. Torch doesn’t have an out-of-the-box method for this, and the only other option was a costly transfer of a large tensor back to the CPU. This also gave me an opportunity to document the process in case I ever need to perform a more complex operation.
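The arithmetic behind the map is just scale, round, and unscale. Here is a pure-Python sketch of the element-wise logic (the function name `round_decimals` is my own; this scalar version is only for illustration, since the whole point of the extension is to apply it on the GPU without leaving it):

```python
def round_decimals(x, decimals):
    """Round x to the given number of decimal places.

    This is the element-wise logic the custom kernel applies
    to every entry of the tensor: scale up, round to the
    nearest integer, then scale back down.
    """
    scale = 10 ** decimals
    return round(x * scale) / scale
```

For example, `round_decimals(3.14159, 2)` returns `3.14`.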
The process
For reference, the documentation is quite good; you can find the official guide from PyTorch here.
- Write your kernel in C++. Use `<torch/extension.h>`
- Use pybind11 to bind your custom functions into Python
- Write a `setup.py` and use `torch.utils.cpp_extension` to build your module
- Build your module, and use it in your application
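As a sketch of steps one and two, the whole extension for my decimal-rounding example can fit in one C++ file: a function operating on `torch::Tensor` (which runs on whatever device the tensor lives on), plus the pybind11 binding. The file and function names here (`round_decimals.cpp`, `round_decimals`) are my own, not from the official guide:

```cpp
// round_decimals.cpp -- a sketch of a minimal extension source file.
// <torch/extension.h> pulls in the ATen tensor library and pybind11.
#include <torch/extension.h>
#include <cmath>

// Round every element of `input` to `decimals` decimal places.
// Composed from existing tensor ops, this executes on the device
// the tensor already lives on -- no CPU round trip required.
torch::Tensor round_decimals(torch::Tensor input, int64_t decimals) {
  const double scale = std::pow(10.0, static_cast<double>(decimals));
  return (input * scale).round() / scale;
}

// Bind the function into the Python module. TORCH_EXTENSION_NAME is
// defined at build time by torch.utils.cpp_extension to match the
// module name declared in setup.py.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("round_decimals", &round_decimals,
        "Round a tensor to a given number of decimal places");
}
```

For a hand-written CUDA kernel rather than a composition of existing ops, the device code would go in a separate `.cu` file, with this file forwarding to it.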
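For steps three and four, the build script is short: `torch.utils.cpp_extension` provides a `CppExtension` helper (or `CUDAExtension`, for sources that include `.cu` kernels) and a `BuildExtension` command that supply the PyTorch include paths and compiler flags. The module and file names below are placeholders matching my rounding example:

```python
# setup.py -- build script for the hypothetical round_decimals extension.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="round_decimals",
    ext_modules=[
        # Swap in CUDAExtension here if your sources include .cu files.
        CppExtension("round_decimals", ["round_decimals.cpp"]),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Build in place with `python setup.py build_ext --inplace` (or install with `pip install .`), then use it like any other module: `import round_decimals` and call `round_decimals.round_decimals(x, 2)` on a tensor.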
It’s really that easy.
Thanks for reading!