Understanding PyTorch Autograd

PyTorch’s autograd engine is the backbone of neural network training. It automatically computes gradients (the derivatives you need for backpropagation) so you never have to derive them by hand.

In this post, we’ll build intuition for how autograd works through short, runnable examples.

1. Tensors and Gradients

Setting requires_grad=True tells PyTorch to track every operation on a tensor so it can compute gradients later.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1  # y = x² + 2x + 1
y.backward()            # dy/dx = 2x + 2
print(f"x = {x.item()}, y = {y.item()}, dy/dx = {x.grad.item()}")

At x = 3, the analytical derivative is 2(3) + 2 = 8, exactly what autograd gives us.
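The same mechanism handles several inputs at once: one backward() call fills in the partial derivative for every leaf tensor involved. A quick sketch (the names u and v are our own, not from the example above):

```python
import torch

# z = u*v + v², so dz/du = v and dz/dv = u + 2v
u = torch.tensor(2.0, requires_grad=True)
v = torch.tensor(3.0, requires_grad=True)
z = u * v + v ** 2
z.backward()
print(u.grad.item())  # dz/du = v = 3.0
print(v.grad.item())  # dz/dv = u + 2v = 8.0
```

Each leaf gets its own .grad; autograd never needs to be told which variable to differentiate with respect to.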
2. The Computational Graph
Autograd builds a directed acyclic graph (DAG) on the fly. Leaf tensors (the ones you create directly) form the graph’s inputs, and each operation adds a node that records how its output was produced. When you call .backward(), PyTorch walks this graph in reverse, applying the chain rule at each node.
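To see the chain rule at work, here is a small sketch of our own: two composed operations, whose local derivatives multiply during the reverse pass.

```python
import torch

x = torch.tensor(4.0, requires_grad=True)
h = 3 * x     # local derivative: dh/dx = 3
y = h ** 2    # local derivative: dy/dh = 2h = 24
y.backward()  # chain rule: dy/dx = dy/dh * dh/dx = 24 * 3
print(x.grad.item())  # 72.0
```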
Let’s see the grad_fn attribute that links each tensor back to the operation that created it:
a = torch.tensor(2.0, requires_grad=True)
b = a * 3
c = b + 1
print(f"a.grad_fn = {a.grad_fn}") # None — leaf tensor
print(f"b.grad_fn = {b.grad_fn}") # MulBackward0
print(f"c.grad_fn = {c.grad_fn}") # AddBackward0

3. Gradients with Vectors
When the output is a vector, .backward() needs a scalar (or an explicit gradient argument), so you typically reduce the output first. A common approach is to sum or take the mean:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 3 # [1, 8, 27]
y.sum().backward() # dy/dx = 3x²
print(f"x = {x.tolist()}")
print(f"dy/dx = {x.grad.tolist()}") # [3, 12, 27]

4. Detaching from the Graph
Sometimes you need a tensor’s value without tracking gradients — for logging metrics, freezing part of a model, or avoiding memory leaks. Use .detach() or the torch.no_grad() context manager:
x = torch.tensor(5.0, requires_grad=True)
y = x ** 2
# .detach() returns a new tensor that shares the same data but has no grad history
y_detached = y.detach()
print(f"y_detached.requires_grad = {y_detached.requires_grad}")
# torch.no_grad() suppresses tracking for everything inside
with torch.no_grad():
    z = x * 2
print(f"z.requires_grad = {z.requires_grad}")

5. A Mini Training Loop
Let’s put it all together: use autograd to fit y = 2x + 1 with gradient descent.
torch.manual_seed(42)
# True relationship: y = 2x + 1
x_data = torch.linspace(0, 1, 50)
y_data = 2 * x_data + 1 + 0.1 * torch.randn(50)
# Learnable parameters
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
lr = 0.1
for epoch in range(200):
    y_pred = w * x_data + b
    loss = ((y_pred - y_data) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()
    b.grad.zero_()

print(f"Learned: y = {w.item():.3f}x + {b.item():.3f}")
print("True:    y = 2.000x + 1.000")

Key Takeaways
- requires_grad=True tells PyTorch to record operations for differentiation.
- .backward() traverses the computational graph in reverse, applying the chain rule.
- Gradients accumulate: always call .zero_() before the next backward pass.
- torch.no_grad() and .detach() let you opt out of tracking when you don’t need gradients.
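The accumulation pitfall in particular is easy to demonstrate. A minimal sketch of our own:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

# Two backward passes without zeroing: gradients add up.
(x * 2).backward()
print(x.grad.item())  # 2.0
(x * 2).backward()
print(x.grad.item())  # 4.0, because the old gradient was not cleared

x.grad.zero_()        # reset before the next pass
(x * 2).backward()
print(x.grad.item())  # 2.0 again
```

This is exactly the bug the training loop above avoids by calling w.grad.zero_() and b.grad.zero_() every epoch.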