Understanding PyTorch Autograd

deep-learning
pytorch
A hands-on introduction to automatic differentiation in PyTorch.
Author

Anindya Saha

Published

March 31, 2026

PyTorch’s autograd engine is the backbone of neural network training. It automatically computes gradients — the derivatives you need for backpropagation — so you never have to derive them by hand.

In this post, we’ll build intuition for how autograd works through short, runnable examples.

1. Tensors and Gradients

Setting requires_grad=True tells PyTorch to track every operation on a tensor so it can compute gradients later.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1  # y = x² + 2x + 1

y.backward()  # dy/dx = 2x + 2
print(f"x = {x.item()}, y = {y.item()}, dy/dx = {x.grad.item()}")

At x = 3, the analytical derivative is 2(3) + 2 = 8 — exactly what autograd gives us.
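As a sanity check, we can compare autograd’s answer against a central finite difference. This is a sketch, not part of the original example — the helper f and the step size eps are illustrative choices:

```python
import torch

def f(x):
    return x ** 2 + 2 * x + 1

x = torch.tensor(3.0, requires_grad=True)
f(x).backward()

# Central difference in float64 to keep cancellation error negligible
eps = 1e-4
numeric = (f(torch.tensor(3.0 + eps, dtype=torch.float64))
           - f(torch.tensor(3.0 - eps, dtype=torch.float64))) / (2 * eps)

print(x.grad.item(), numeric.item())  # both ≈ 8.0
```

For a quadratic, the central difference agrees with the true derivative up to floating-point error, so the two numbers should match closely.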

2. The Computational Graph

Autograd builds a directed acyclic graph (DAG) on the fly. Leaf tensors (the ones you create directly) are the leaves of the graph, each operation adds a node, and the final output is the root. When you call .backward(), PyTorch walks this graph from the root back to the leaves, applying the chain rule at each node.

Let’s see the grad_fn attribute that links each tensor back to the operation that created it:

a = torch.tensor(2.0, requires_grad=True)
b = a * 3
c = b + 1

print(f"a.grad_fn = {a.grad_fn}")   # None — leaf tensor
print(f"b.grad_fn = {b.grad_fn}")   # <MulBackward0 ...>
print(f"c.grad_fn = {c.grad_fn}")   # <AddBackward0 ...>
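You can even walk the graph by hand: each grad_fn links to the grad_fns of its inputs through the next_functions attribute (an internal but stable part of the autograd API). A small sketch, reusing the tensors above:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = a * 3
c = b + 1

# next_functions holds (parent_fn, input_index) pairs; entries for
# inputs that need no gradient (like the constant 1) may be None.
fn = c.grad_fn
parents = [p for p, _ in fn.next_functions if p is not None]
print(type(fn).__name__, [type(p).__name__ for p in parents])

# backward() traverses the same links: dc/db = 1, db/da = 3, so dc/da = 3
c.backward()
print(a.grad.item())  # 3.0
```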

3. Gradients with Vectors

When the output is a vector, .backward() needs a scalar to start from. The simplest option is to reduce the output with sum() or mean():

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 3          # [1, 8, 27]
y.sum().backward()   # dy/dx = 3x²

print(f"x     = {x.tolist()}")
print(f"dy/dx = {x.grad.tolist()}")  # [3, 12, 27]
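Reducing to a scalar isn’t the only option: .backward() also accepts an explicit gradient tensor, which computes a vector-Jacobian product. Passing all ones is equivalent to summing first:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 3

# v = [1, 1, 1] gives the same result as y.sum().backward()
y.backward(torch.ones_like(y))

print(x.grad.tolist())  # [3.0, 12.0, 27.0]
```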

4. Detaching from the Graph

Sometimes you need a tensor’s value without tracking gradients — for logging metrics, freezing part of a model, or avoiding the memory cost of storing the graph. Use .detach() or the torch.no_grad() context manager:

x = torch.tensor(5.0, requires_grad=True)
y = x ** 2

# .detach() returns a tensor that shares the same storage but has no grad history
y_detached = y.detach()
print(f"y_detached.requires_grad = {y_detached.requires_grad}")

# torch.no_grad() suppresses tracking for everything inside
with torch.no_grad():
    z = x * 2
    print(f"z.requires_grad = {z.requires_grad}")
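Because .detach() cuts the grad history, it also works as a “stop-gradient” inside a larger expression — a common trick for treating part of a computation as a constant. A small sketch (the specific expression is illustrative):

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

# The detached x**3 term contributes to the value but not the gradient
z = x ** 2 + (x ** 3).detach()
z.backward()

print(z.item())       # 150.0 (25 + 125)
print(x.grad.item())  # 10.0 — d/dx of x**2 only
```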

5. A Mini Training Loop

Let’s put it all together: use autograd to fit y = 2x + 1 with gradient descent.

torch.manual_seed(42)

# True relationship: y = 2x + 1
x_data = torch.linspace(0, 1, 50)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(50)

# Learnable parameters
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for epoch in range(200):
    y_pred = w * x_data + b
    loss = ((y_pred - y_data) ** 2).mean()

    loss.backward()

    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    w.grad.zero_()
    b.grad.zero_()

print(f"Learned: y = {w.item():.3f}x + {b.item():.3f}")
print(f"True:    y = 2.000x + 1.000")
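In practice you’d rarely write the update by hand — torch.optim wraps the no_grad update and the zeroing for you. Here is the same fit sketched with torch.optim.SGD (hyperparameters copied from the loop above):

```python
import torch

torch.manual_seed(42)
x_data = torch.linspace(0, 1, 50)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(50)

w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)

for epoch in range(200):
    opt.zero_grad()                                # replaces .grad.zero_()
    loss = ((w * x_data + b - y_data) ** 2).mean()
    loss.backward()
    opt.step()                                     # replaces the no_grad update

print(f"Learned: y = {w.item():.3f}x + {b.item():.3f}")
```

The learned parameters should land close to the true values of 2 and 1, just as with the manual loop.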

Key Takeaways

  • requires_grad=True tells PyTorch to record operations for differentiation.
  • .backward() traverses the computational graph in reverse, applying the chain rule.
  • Gradients accumulate — always call .zero_() before the next backward pass.
  • torch.no_grad() and .detach() let you opt out of tracking when you don’t need gradients.
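The accumulation behavior in the third takeaway is easy to demonstrate directly — two backward passes without zeroing add their gradients together:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
g1 = x.grad.item()    # 6.0

(x ** 2).backward()
g2 = x.grad.item()    # 12.0 — accumulated, not replaced

x.grad.zero_()
(x ** 2).backward()
g3 = x.grad.item()    # 6.0 — fresh after zero_()

print(g1, g2, g3)
```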