Automatic Differentiation (Autograd)
Nicole supports PyTorch's automatic differentiation (autograd) for gradient-based optimization of tensor network states. This guide explains when and how to use autograd effectively with symmetric tensors.
Overview
By default, Nicole disables autograd for performance, as most tensor network algorithms don't require gradients. However, you can enable gradient tracking for:
- Variational optimization of tensor network states
- Optimizing parametric quantum states
- Gradient-based energy minimization
- Differentiable tensor network layers in neural networks
Enabling Gradient Tracking
PyTorch's autograd system has two components: per-tensor flags (requires_grad) and global gradient mode (context managers). Understanding both is essential for correct usage.
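Before looking at each mechanism in detail, the interplay between the two can be previewed with plain `torch.Tensor` objects; a minimal sketch:

```python
import torch

# Per-tensor flag: marks a tensor as a differentiable parameter
x = torch.randn(3, requires_grad=True)
y = torch.randn(3)  # requires_grad defaults to False

# Global mode: decides whether operations are recorded at all
torch.set_grad_enabled(True)   # PyTorch's own default
z = x * y
print(z.requires_grad)         # True: mode on AND an input requires grad

torch.set_grad_enabled(False)  # what Nicole sets on import
z = x * y
print(z.requires_grad)         # False: mode off overrides the flag

torch.set_grad_enabled(True)   # restore
```

Both conditions must hold for an operation to enter the computation graph: the global mode must be enabled and at least one input must carry `requires_grad=True`.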
The requires_grad Flag
The requires_grad attribute is a per-tensor property that marks whether a tensor is a variational parameter that should accumulate gradients. When set to True:
- The tensor becomes a "leaf" node in the computation graph
- Operations involving this tensor are recorded for backpropagation
- After calling `.backward()`, gradients accumulate in the tensor's `.grad` attribute
```python
# Set at creation
T1 = Tensor.random([idx, idx.flip()], itags=["i", "j"],
                   requires_grad=True, seed=42)

# Or set after creation
T2 = Tensor.random([idx, idx.flip()], itags=["k", "l"], seed=43)
T2.requires_grad = True

print(f"T1 requires grad: {T1.requires_grad}")  # True
print(f"T2 requires grad: {T2.requires_grad}")  # True
```
Storage Location
In Nicole, requires_grad is propagated to the underlying PyTorch blocks in Tensor.data. Each block inherits the gradient tracking status from the parent Tensor.
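Conceptually, this propagation is a loop over the block dictionary. The `BlockTensor` class below is a hypothetical stand-in for illustration, not Nicole's actual implementation:

```python
import torch

class BlockTensor:
    """Hypothetical sketch of a block-sparse tensor wrapping PyTorch blocks."""

    def __init__(self, blocks):
        self.data = blocks  # dict mapping quantum-number keys -> torch.Tensor

    @property
    def requires_grad(self):
        return any(b.requires_grad for b in self.data.values())

    @requires_grad.setter
    def requires_grad(self, flag):
        # Propagate the flag to every underlying PyTorch block
        for b in self.data.values():
            b.requires_grad_(flag)

T = BlockTensor({(0, 0): torch.randn(2, 2), (1, 1): torch.randn(1, 1)})
T.requires_grad = True
print(all(b.requires_grad for b in T.data.values()))  # True
```

Because the flag lives on the blocks, gradients after `.backward()` also land block by block in each block's `.grad`.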
Global Gradient Mode
The global gradient mode controls whether operations are recorded in the computation graph, regardless of tensor flags. This is controlled by context managers:
```python
# Disable gradient tracking globally (explicit, though already disabled in Nicole)
with torch.no_grad():
    # No operations are tracked, even if tensors have requires_grad=True
    T = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True)
    result = T + T  # result.requires_grad will be False
    print(f"In no_grad: {result.requires_grad}")  # False

# Enable gradient tracking explicitly (REQUIRED in Nicole)
with torch.enable_grad():
    # Operations are tracked when explicitly enabled
    T = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True)
    result = T + T  # result.requires_grad will be True
    print(f"In enable_grad: {result.requires_grad}")  # True

# Set mode programmatically
torch.set_grad_enabled(False)  # Disable (Nicole's default)
print(f"Grad disabled: {not torch.is_grad_enabled()}")
torch.set_grad_enabled(True)   # Enable for gradient tracking
print(f"Grad enabled: {torch.is_grad_enabled()}")
```
Gradient Propagation Rules
The requires_grad flag of output tensors depends on both input tensor flags and the global gradient mode:
| Global Mode | Input requires_grad | Output requires_grad |
|---|---|---|
| Disabled (default in Nicole) | Any value | False |
| Enabled (enable_grad) | Any input is True | True |
| Enabled (enable_grad) | All inputs are False | False |
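These propagation rules are inherited directly from PyTorch, so they can be verified with plain tensors before involving Nicole at all; a minimal sketch:

```python
import torch

a = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, 3)  # requires_grad=False

# Mode disabled: output never requires grad
with torch.no_grad():
    out = a @ b
print(out.requires_grad)  # False

# Mode enabled, one input requires grad: output requires grad
with torch.enable_grad():
    out = a @ b
print(out.requires_grad)  # True

# Mode enabled, no input requires grad: output does not
with torch.enable_grad():
    out = b @ b
print(out.requires_grad)  # False
```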
```python
from nicole import Tensor, contract

A = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True, seed=10)
B = Tensor.random([idx, idx.flip()], itags=["j", "k"], requires_grad=False, seed=11)

print("Tensor A (requires_grad=True):")
print(A)
print("\nTensor B (requires_grad=False):")
print(B)
```
```text
Tensor A (requires_grad=True):
info: 2x { 2 x 1 } having 'A', Tensor, { i*, j }
data: 2-D float64 (40 B) 3 x 3 => 3 x 3 @ norm = 1.95727
  1. 2x2 | 1x1 [ 0 ; 0 ]  32 B
  2. 1x1 | 1x1 [ 1 ; 1 ]  0.9198

Tensor B (requires_grad=False):
info: 2x { 2 x 1 } having 'A', Tensor, { j*, k }
data: 2-D float64 (40 B) 3 x 3 => 3 x 3 @ norm = 2.60375
  1. 2x2 | 1x1 [ 0 ; 0 ]  32 B
  2. 1x1 | 1x1 [ 1 ; 1 ]  -0.5133
```
```python
# Rule 1: Default mode in Nicole is DISABLED - gradients NOT tracked
C = contract(A, B)
print(f"C requires grad: {C.requires_grad}")  # False (grad mode disabled by default)

# Rule 2: Must explicitly enable grad mode for tracking
with torch.enable_grad():
    D = contract(A, B)  # A has requires_grad=True
print(f"D requires grad: {D.requires_grad}")  # True (A propagates in enabled mode)

# Rule 3: no_grad context also disables (redundant in Nicole, but explicit)
with torch.no_grad():
    E = contract(A, B)
print(f"E requires grad: {E.requires_grad}")  # False (explicitly disabled)

# Rule 4: Nested contexts (inner takes precedence)
with torch.no_grad():
    with torch.enable_grad():
        F = contract(A, B)
print(f"F requires grad: {F.requires_grad}")  # True (inner context wins)
```
Basic Gradient Computation
Since Nicole disables autograd by default, you must explicitly enable it for gradient tracking:
```python
# Enable gradient tracking (REQUIRED in Nicole)
with torch.enable_grad():
    # Forward pass: compute contraction
    loss = contract(A_grad, B_grad)

    # Backward pass: compute gradients
    loss.backward()

# Access gradients (stored in PyTorch blocks)
for key, block in A_grad.data.items():
    if block.grad is not None:
        print(f"Gradient for block {key}:\n{block.grad}")
```
Remember to Enable Grad Mode
Nicole sets `torch.set_grad_enabled(False)` on import. Always wrap gradient computations in `torch.enable_grad()` or call `torch.set_grad_enabled(True)` before variational optimization.
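The same forward/backward workflow can be exercised without Nicole, using `torch.einsum` as a stand-in for `contract` (the tensors and shapes here are purely illustrative):

```python
import torch

torch.set_grad_enabled(False)  # mimic Nicole's import-time default
A = torch.randn(3, 3, requires_grad=True)
B = torch.randn(3, 3)

with torch.enable_grad():
    # Forward pass: contract over the shared index and reduce to a scalar
    loss = torch.einsum("ij,jk->", A, B)
    # Backward pass: d(loss)/dA[i,j] = sum_k B[j,k]
    loss.backward()

print(A.grad.shape)            # torch.Size([3, 3])
torch.set_grad_enabled(True)   # restore PyTorch's default
```

Note that the forward pass must happen inside `enable_grad`; calling `backward()` on a tensor whose graph was never recorded raises an error.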
Performance Considerations
Enabling autograd comes with trade-offs. While it enables powerful gradient-based optimization, it adds both computational overhead and memory costs. Understanding these costs helps you make informed decisions about when to use gradients.
Computational Overhead
Autograd adds overhead in two ways:
- Graph Construction: Every operation must record its inputs and create nodes in the computation graph
- Bookkeeping: PyTorch tracks tensor metadata (shapes, strides, version counters) to enable backpropagation
The overhead scales with:
- Depth of computation: Longer chains of operations accumulate more graph nodes
- Number of operations: Each tensor operation adds bookkeeping cost
- Tensor complexity: Symmetric tensors with many blocks require more metadata tracking
For tensor networks, this overhead can be significant because contractions often form deep chains. Let's measure the impact on a realistic tensor network computation:
```python
import time

# Create a chain of 10 tensors with sequential contractions
num_tensors = 10
tensors = []
for i in range(num_tensors):
    itag_in = f"i{i}"
    itag_out = f"i{i+1}"
    T = Tensor.random([idx_large, idx_large.flip()],
                      itags=[itag_in, itag_out], seed=100 + i)
    tensors.append(T)
print(f"Created chain of {num_tensors} tensors")

# Without gradients - contract all sequentially
start = time.time()
for _ in range(20):
    result = tensors[0]
    for T in tensors[1:]:
        result = contract(result, T)
time_no_grad = time.time() - start

# With gradients - same chain
for T in tensors:
    T.requires_grad = True
torch.set_grad_enabled(True)

start = time.time()
for _ in range(20):
    result = tensors[0]
    for T in tensors[1:]:
        result = contract(result, T)
    if result.data:  # Ensure computation
        pass
time_with_grad = time.time() - start

print(f"\nSequential chain: T0 --[i1]-- T1 --[i2]-- ... --[i{num_tensors}]-- T{num_tensors-1}")
print(f"Without grad: {time_no_grad:.4f}s")
print(f"With grad:    {time_with_grad:.4f}s")
print(f"Overhead:     {(time_with_grad / time_no_grad - 1) * 100:.1f}%")
```
Key Observations:
- The overhead is multiplicative, not additive: longer chains show proportionally larger slowdowns
- For typical tensor network algorithms (DMRG, TEBD, etc.), this overhead is often not worth it since they use specialized update schemes rather than gradient descent
- For variational optimization of small networks, the overhead may be acceptable if gradients provide faster convergence
When to worry about overhead:
- ✅ Don't worry: Single-shot computations, prototyping, small-scale optimization
- ⚠️ Be cautious: Inner loops with thousands of iterations, production DMRG/TEBD
- ❌ Avoid: Time-critical code where gradients aren't needed
Memory Usage
Beyond computation time, autograd also increases memory consumption significantly. PyTorch must store:
- Intermediate tensors: All results from the forward pass needed for backprop
- Gradient buffers: Space to accumulate gradients for each parameter
- Computation graph metadata: Nodes tracking operation types and connections
For tensor networks, this can be problematic because:
- Large bond dimensions create big intermediate tensors
- Deep contraction trees accumulate many intermediates
- Iterative algorithms repeatedly build and destroy graphs
```python
# Memory-efficient: clear gradients regularly
optimizer = torch.optim.Adam([block for block in psi.data.values()])

# Enable gradient tracking for variational optimization
torch.set_grad_enabled(True)

for step in range(100):
    optimizer.zero_grad()  # Clear old gradients (REQUIRED!)

    # Compute energy/observable and gradients
    energy = compute_energy(psi)
    energy.backward()
    optimizer.step()

    # Free computation graph immediately
    del energy
```
Additional memory-saving techniques:
- Gradient checkpointing: Recompute intermediate values during backward instead of storing them (trade compute for memory)
- Detach when possible: Use `.detach()` on tensors you don't need gradients for to break the computation graph
- Context managers: Wrap non-differentiable computations in `torch.no_grad()` to avoid graph construction
- Chunking: Break large contractions into smaller pieces to limit peak memory usage
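The first three techniques can be sketched in a short, self-contained PyTorch example (checkpointing via `torch.utils.checkpoint`; the two-layer `layer` function here is purely illustrative, not a tensor network):

```python
import torch
from torch.utils.checkpoint import checkpoint

W1 = torch.randn(64, 64, requires_grad=True)
W2 = torch.randn(64, 64, requires_grad=True)
x = torch.randn(8, 64)

def layer(h, W):
    return torch.tanh(h @ W)

# Gradient checkpointing: intermediates of `layer` are recomputed in backward
h = checkpoint(layer, x, W1, use_reentrant=False)
out = layer(h, W2).sum()
out.backward()
print(W1.grad is not None)  # True

# Detach: cut the graph where gradients are not needed
frozen = layer(x, W1).detach()
print(frozen.requires_grad)  # False

# no_grad: skip graph construction entirely for evaluation-only code
with torch.no_grad():
    val = layer(x, W1)
print(val.requires_grad)  # False
```

For deep contraction chains, checkpointing every few contractions bounds the number of stored intermediates at the cost of one extra forward pass per checkpointed segment.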
When to Use Autograd
Good Use Cases:
- Variational ground state search
- Training parametric tensor networks
- Optimization problems with smooth objectives
- Differentiable physics simulations
Not Recommended:
- Standard DMRG (uses iterative eigensolver)
- Time evolution (uses analytical formulas)
- Sampling-based methods
- When gradients are not informative
See Also
- GPU Acceleration: Combine with GPU for faster optimization
- Performance Tips: General optimization strategies
- API: Tensor.requires_grad: Autograd property documentation