5 gradient/derivative-related PyTorch functions

Attyuttam Saha
6 min read · Jun 16, 2020

In this article, I will be talking about 5 PyTorch functions that I have been studying. Examples are provided, along with scenarios in which the functions might break. This article is a good head start for exploring PyTorch and the plethora of functionality it provides.

The 5 functions that I will be discussing are:

  1. detach()
  2. no_grad()
  3. clone()
  4. backward()
  5. register_hook()
First, we import torch:
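
```python
import torch
```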

1. tensor.detach()

tensor.detach() creates a tensor that shares storage with the original tensor but does not require grad. You should use detach() when you want to remove a tensor from a computation graph. In order to enable automatic differentiation, PyTorch keeps track of all operations involving tensors for which the gradient may need to be computed (i.e., requires_grad is True). The operations are recorded as a directed graph. The detach() method constructs a new view on a tensor which is declared not to need gradients, i.e., it is to be excluded from further tracking of operations, and therefore the sub-graph involving this view is not recorded.
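
A minimal sketch of the two cases discussed below, assuming a size-10 vector of ones (the variable names x, y, z, r follow the discussion):

```python
import torch

# Case 1: no detach() is used
x = torch.ones(10, requires_grad=True)
y = x ** 2
z = x ** 3
r = (y + z).sum()
r.backward()
print(x.grad)  # every element is 5.0: 2*1 + 3*1**2

# Case 2: z is computed from a detached view of x
x = torch.ones(10, requires_grad=True)
y = x ** 2
z = x.detach() ** 3  # z is cut out of the computation graph
r = (y + z).sum()
r.backward()
print(x.grad)  # every element is 2.0: only y contributes
```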

In the above example, we can see that using detach() changes the gradient of the variable. This can be explained as follows: in both cases, a vector x of size 10 is defined, with each element equal to 1 and requires_grad set to True.

  1. Case 1 — no detach() is used:

y is x² and z is x³, so r is x² + x³ and the derivative of r is 2x + 3x². The gradient of x is therefore 2·1 + 3·1² = 5, so x.grad produces a vector of 10 elements, each having the value 5.

  2. Case 2 — detach() is used:

again y is x² and z is x³, so r is x² + x³ and the derivative of r would be 2x + 3x². But because z is calculated from the detached tensor (x.detach()), z is not included when the gradient of x is computed. The gradient of x is therefore only 2·1 = 2, so x.grad produces a vector of 10 elements, each having the value 2.
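
There is a subtlety, though: because detach() shares storage, in-place modification of the detached tensor can break autograd. A sketch of the scenario discussed below; the names a and c and the expected gradient 2*a come from the discussion, while the intermediate b = a ** 2 is an assumption consistent with that gradient:

```python
import torch

a = torch.arange(5.0, requires_grad=True)
b = a ** 2          # backward needs a's original values (db/da = 2a)
c = a.detach()      # c shares storage with a; no copy is made
c.zero_()           # in-place: a's underlying data is now zero as well
b.sum().backward()  # RuntimeError: one of the variables needed for gradient
                    # computation has been modified by an inplace operation
print(a.grad)       # never reached; without c.zero_() it would be 2*a
```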

In the above scenario, autograd raises an error because it detects that a has been changed in place, and this would corrupt the gradient calculation. If you comment out the c.zero_() or use .clone().detach() instead of .detach(), you will see that a.grad is 2*a, just as it should be.

This is because .detach() doesn't implicitly create a copy of the tensor, so when the detached tensor is modified later, the tensor on the upstream side of .detach() is updated too. By cloning first, this issue doesn't arise, and all is fine.

In short, detach() doesn't create a copy; it only prevents the gradients from being computed while still sharing the underlying data.

Thus, we should use detach() when we don’t want to include a tensor in the resulting computational graph.

2. torch.no_grad()

Context-manager that disables gradient calculation.

Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True.

In this mode, the result of every computation will have requires_grad=False, even when the inputs have requires_grad=True.
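
A minimal sketch of the example, assuming the same size-10 vector of ones as before:

```python
import torch

x = torch.ones(10, requires_grad=True)
with torch.no_grad():
    r = (x ** 2).sum()   # computed without tracking operations

print(r.requires_grad)   # False
r.backward()             # RuntimeError: element 0 of tensors does not
                         # require grad and does not have a grad_fn
```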

In the above example, the "with torch.no_grad():" block makes all the operations inside it untracked, so their results carry no gradient information. Hence, we cannot call backward() on the computed variable.

The above example breaks because we are trying to call backward() on r. Using no_grad() ensures that the operations performed inside the block are not tracked for gradients, which is why r.requires_grad produces False as the output. Since r has no gradient function associated with it, the code breaks when we call r.backward().

3. tensor.clone(memory_format=torch.preserve_format) → Tensor

tensor.clone() creates a copy of the tensor that imitates the original tensor's requires_grad field. We should use clone as a way to copy a tensor while still keeping the copy as part of the computation graph it came from. Gradients propagating to the cloned tensor will propagate to the original tensor.

tensor.clone() maintains the connection with the computation graph. That means, if you use the new cloned tensor, and derive the loss from the new one, the gradients of that loss can be computed all the way back even beyond the point where the new tensor was created.

If you want to copy a tensor and detach from the computation graph you should be using

tensor.clone().detach()
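
A sketch of the clone example described below; the variable names follow the text, and the size-10 vector of ones is an assumption carried over from the earlier examples:

```python
import torch

x = torch.ones(10, requires_grad=True)
x_clone = x.clone()  # copy of x that stays in the computation graph
y = x_clone ** 2
z = x_clone ** 3
r = (y + z).sum()
r.backward()
print(x.grad)        # every element is 5.0: gradients flow back to x
print(x_clone.grad)  # None (with a warning): x_clone is not a leaf tensor
```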

In the above example, we can observe that x_clone is created by calling x.clone(), so a copy of x is assigned to x_clone. It should be noted that although the variables y and z are computed from x_clone, when r.backward() is called, the gradients are propagated to the original tensor.

We can observe that we get no value when we try to obtain the gradient of x_clone. This is because x_clone is computed by the clone operation on x, so it is not a leaf variable and does not accumulate its own grad. Backpropagation passes through it, however, so x gets the grad.

Thus, whenever we want to make a copy of a tensor while ensuring that any operations done with the cloned tensor still propagate gradients to the original tensor, we must use clone().

4. tensor.backward(gradient=None, retain_graph=None, create_graph=False)

Computes the gradient of current tensor w.r.t. graph leaves.

The graph is differentiated using the chain rule. If the tensor is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying gradient. It should be a tensor of matching type and location, that contains the gradient of the differentiated function w.r.t. self.

This function accumulates gradients in the leaves — you might need to zero them before calling it.

Parameters:

gradient (Tensor or None) — Gradient w.r.t. the tensor. If it is a tensor, it will be automatically converted to a Tensor that does not require grad unless create_graph is True. None values can be specified for scalar Tensors or ones that don't require grad. If a None value would be acceptable then this argument is optional.

retain_graph (bool, optional) — If False, the graph used to compute the grads will be freed. Note that in nearly all cases setting this option to True is not needed, and often it can be worked around in a much more efficient way. Defaults to the value of create_graph.

create_graph (bool, optional) — If True, the graph of the derivative will be constructed, allowing higher-order derivative products to be computed. Defaults to False.
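
A minimal sketch of the working case, assuming a small vector of ones with requires_grad=True, in the pattern of the earlier examples:

```python
import torch

x = torch.ones(10, requires_grad=True)
r = (x ** 2).sum()  # scalar output, so no gradient argument is needed
r.backward()
print(x.grad)       # every element is 2.0: d(x**2)/dx = 2x at x = 1
```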

We can observe that upon calling backward() on r, the gradient of the leaf node x is calculated.
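
And the failing case, assuming the only change is that requires_grad is left at its default of False:

```python
import torch

x = torch.ones(10)  # requires_grad defaults to False
r = (x ** 2).sum()
r.backward()        # RuntimeError: element 0 of tensors does not
                    # require grad and does not have a grad_fn
```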

We can observe that because the leaf variable, i.e., x, does not have requires_grad set to True, the backward() function throws an error. Thus, to use the backward() function on a variable, we need to ensure that the leaf nodes involved have requires_grad set to True.

By default, PyTorch expects backward() to be called on the last output of the network, i.e., the loss. The loss function always outputs a scalar, and therefore the gradients of the scalar loss w.r.t. all other variables/parameters are well defined (using the chain rule).
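
For completeness, a sketch of the non-scalar case, where the gradient argument from the signature above must be supplied (the values here are illustrative):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x ** 2                                 # non-scalar output
y.backward(torch.tensor([1.0, 1.0, 1.0]))  # gradient w.r.t. y must be given
print(x.grad)                              # tensor([2., 2., 2.])
```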

5. tensor.register_hook(hook)

Registers a backward hook.

The hook will be called every time a gradient with respect to the Tensor is computed. The hook should have the following signature:

hook(grad) -> Tensor or None

The hook should not modify its argument, but it can optionally return a new gradient which will be used in place of grad.

This function returns a handle with a method handle.remove() that removes the hook from the module. Use torch.Tensor.register_hook() directly on a specific input or output to get the required gradients.
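
A sketch of the example; it mirrors the one in the PyTorch documentation, where the hook doubles the incoming gradient:

```python
import torch

v = torch.tensor([0.0, 0.0, 0.0], requires_grad=True)
handle = v.register_hook(lambda grad: grad * 2)  # double the gradient
v.backward(torch.tensor([1.0, 2.0, 3.0]))        # seed gradient [1, 2, 3]
print(v.grad)    # tensor([2., 4., 6.])
handle.remove()  # removes the hook
```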

In the above example, we have registered a lambda function that doubles the gradient passed to it as a hook on the tensor variable v. Here, v is a leaf variable (because we have defined it directly; it is not the result of a computation), so when we do v.backward() with the seed gradient [1, 2, 3], the hook doubles that gradient, which results in 2, 4, 6 as you can see above.
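
The failing case, assuming a tensor created without requires_grad:

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0])      # requires_grad defaults to False
w.register_hook(lambda grad: grad * 2)
# RuntimeError: cannot register a hook on a tensor that
# doesn't require gradient
```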

In the above example we get a pretty verbose error: "cannot register a hook on a tensor that doesn't require gradient". This means that we won't be able to register a hook on a tensor whose requires_grad parameter is not set to True.

You could pass a function as the hook to register_hook, which will be called every time the gradient is calculated. This might be useful for debugging purposes, e.g. just printing the gradient or its statistics, or you could of course manipulate the gradient in a custom way, e.g. normalizing it somehow etc.

Conclusion

In this article, I have tried to cover five functions that are related to playing with gradients. Using these functions, we can effectively calculate the gradients of leaf nodes and use them in various aspects of development with PyTorch.

Well, this is the first story I have ever written. I hope it is of some use to you. Please do throw a clap if you enjoyed it!

