Automatic Differentiation (Autograd)

Nicole supports PyTorch's automatic differentiation (autograd) for gradient-based optimization of tensor network states. This guide explains when and how to use autograd effectively with symmetric tensors.

Overview

By default, Nicole disables autograd for performance, as most tensor network algorithms don't require gradients. However, you can enable gradient tracking for:

  • Variational optimization of tensor network states
  • Optimizing parametric quantum states
  • Gradient-based energy minimization
  • Differentiable tensor network layers in neural networks

Enabling Gradient Tracking

PyTorch's autograd system has two components: per-tensor flags (requires_grad) and global gradient mode (context managers). Understanding both is essential for correct usage.

The requires_grad Flag

The requires_grad attribute is a per-tensor property that marks whether a tensor is a variational parameter that should accumulate gradients. When set to True:

  • The tensor becomes a "leaf" node in the computation graph
  • Operations involving this tensor are recorded for backpropagation
  • After calling .backward(), gradients accumulate in the tensor's .grad attribute

# Set at creation
T1 = Tensor.random([idx, idx.flip()], itags=["i", "j"], 
                   requires_grad=True, seed=42)

# Or set after creation
T2 = Tensor.random([idx, idx.flip()], itags=["k", "l"], seed=43)
T2.requires_grad = True

print(f"T1 requires grad: {T1.requires_grad}")  # True
print(f"T2 requires grad: {T2.requires_grad}")  # True
T1 requires grad: True
T2 requires grad: True

Storage Location

In Nicole, requires_grad is propagated to the underlying PyTorch blocks in Tensor.data. Each block inherits the gradient tracking status from the parent Tensor.
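Conceptually, the layout resembles a dictionary of dense PyTorch blocks keyed by symmetry sector. A plain-torch sketch (the block keys and shapes below are illustrative, not Nicole's actual internals):

```python
import torch

# Illustrative stand-in for Tensor.data: one dense block per symmetry sector
blocks = {
    (0, 0): torch.randn(2, 2, dtype=torch.float64),
    (1, 1): torch.randn(1, 1, dtype=torch.float64),
}

# Setting requires_grad on the parent tensor propagates to every block
for block in blocks.values():
    block.requires_grad_(True)

print(all(b.requires_grad for b in blocks.values()))  # True
```

Because each block is an ordinary PyTorch tensor, gradients later land in each block's own `.grad` attribute, as shown in the backward-pass example below.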

Global Gradient Mode

The global gradient mode controls whether operations are recorded in the computation graph, regardless of tensor flags. This is controlled by context managers:

# Check current mode
print(torch.is_grad_enabled())  # False (Nicole disables by default)
False
# Disable gradient tracking globally (explicit, though already disabled in Nicole)
with torch.no_grad():
    # No operations are tracked, even if tensors have requires_grad=True
    T = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True)
    result = T + T  # result.requires_grad will be False
    print(f"In no_grad: {result.requires_grad}")  # False
In no_grad: False
# Enable gradient tracking explicitly (REQUIRED in Nicole)
with torch.enable_grad():
    # Operations are tracked when explicitly enabled
    T = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True)
    result = T + T  # result.requires_grad will be True
    print(f"In enable_grad: {result.requires_grad}")  # True
In enable_grad: True
# Set mode programmatically
torch.set_grad_enabled(False)  # Disable (Nicole's default)
print(f"Grad disabled: {not torch.is_grad_enabled()}")

torch.set_grad_enabled(True)   # Enable for gradient tracking
print(f"Grad enabled: {torch.is_grad_enabled()}")
Grad disabled: True
Grad enabled: True

Gradient Propagation Rules

The requires_grad flag of output tensors depends on both input tensor flags and the global gradient mode:

| Global Mode | Input requires_grad | Output requires_grad |
| --- | --- | --- |
| Disabled (default in Nicole) | Any value | False |
| Enabled (enable_grad) | Any input is True | True |
| Enabled (enable_grad) | All inputs are False | False |
from nicole import contract

A = Tensor.random([idx, idx.flip()], itags=["i", "j"], requires_grad=True, seed=10)
B = Tensor.random([idx, idx.flip()], itags=["j", "k"], requires_grad=False, seed=11)

print("Tensor A (requires_grad=True):")
print(A)
print(f"\nTensor B (requires_grad=False):")
print(B)
Tensor A (requires_grad=True):

  info:  2x { 2 x 1 }  having 'A',   Tensor,  { i*, j }
  data:  2-D float64 (40 B)    3 x 3 => 3 x 3  @ norm = 1.95727

     1.  2x2     |  1x1     [ 0 ; 0 ]    32 B      
     2.  1x1     |  1x1     [ 1 ; 1 ]   0.9198     

Tensor B (requires_grad=False):

  info:  2x { 2 x 1 }  having 'A',   Tensor,  { j*, k }
  data:  2-D float64 (40 B)    3 x 3 => 3 x 3  @ norm = 2.60375

     1.  2x2     |  1x1     [ 0 ; 0 ]    32 B      
     2.  1x1     |  1x1     [ 1 ; 1 ]  -0.5133     
# Rule 1: Default mode in Nicole is DISABLED - gradients NOT tracked
C = contract(A, B)
print(f"C requires grad: {C.requires_grad}")  # False (grad mode disabled by default)
C requires grad: False
# Rule 2: Must explicitly enable grad mode for tracking
with torch.enable_grad():
    D = contract(A, B)  # A has requires_grad=True
    print(f"D requires grad: {D.requires_grad}")  # True (A propagates in enabled mode)
D requires grad: True
# Rule 3: no_grad context also disables (redundant in Nicole, but explicit)
with torch.no_grad():
    E = contract(A, B)
    print(f"E requires grad: {E.requires_grad}")  # False (explicitly disabled)
E requires grad: False
# Rule 4: Nested contexts (inner takes precedence)
with torch.no_grad():
    with torch.enable_grad():
        F = contract(A, B)
        print(f"F requires grad: {F.requires_grad}")  # True (inner context wins)
F requires grad: True

Basic Gradient Computation

Since Nicole disables autograd by default, you must explicitly enable it for gradient tracking:

# Enable gradient tracking (REQUIRED in Nicole)
with torch.enable_grad():
    # Forward pass: contract A_grad and B_grad, tensors created with
    # requires_grad=True (as in the examples above)
    loss = contract(A_grad, B_grad)

    # Backward pass: compute gradients
    loss.backward()

# Access gradients (stored in PyTorch blocks)
for key, block in A_grad.data.items():
    if block.grad is not None:
        print(f"Gradient for block {key}:\n{block.grad}")
Gradient for block (0, 0):
tensor([[ 0.3923, -0.3195],
        [-0.2236, -1.2050]], dtype=torch.float64)
Gradient for block (1, 1):
tensor([[1.0445]], dtype=torch.float64)

Remember to Enable Grad Mode

Nicole sets torch.set_grad_enabled(False) on import. Always wrap gradient computations in torch.enable_grad() or call torch.set_grad_enabled(True) before variational optimization.
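The same pattern can be verified in plain PyTorch, mimicking Nicole's import-time default by hand:

```python
import torch

torch.set_grad_enabled(False)  # mimic Nicole's default on import

x = torch.ones(2, 2, dtype=torch.float64, requires_grad=True)

y = (x * x).sum()              # NOT tracked: global mode is off
print(y.requires_grad)         # False

with torch.enable_grad():      # wrap the differentiable computation
    y = (x * x).sum()
print(y.requires_grad)         # True

y.backward()                   # backward itself works outside the context
print(x.grad)                  # d(sum x^2)/dx = 2x -> all entries 2.0

torch.set_grad_enabled(True)   # restore if later code needs gradients
```

Note that `.backward()` succeeds even when called outside `enable_grad`: the global mode only controls whether new graph nodes are recorded, and the graph for `y` was already built inside the context.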

Performance Considerations

Enabling autograd involves trade-offs: it makes gradient-based optimization possible, but adds both computational overhead and memory cost. Understanding these costs helps you decide when gradients are worth using.

Computational Overhead

Autograd adds overhead in two ways:

  1. Graph Construction: Every operation must record its inputs and create nodes in the computation graph
  2. Bookkeeping: PyTorch tracks tensor metadata (shapes, strides, version counters) to enable backpropagation

The overhead scales with:

  • Depth of computation: Longer chains of operations accumulate more graph nodes
  • Number of operations: Each tensor operation adds bookkeeping cost
  • Tensor complexity: Symmetric tensors with many blocks require more metadata tracking

For tensor networks, this overhead can be significant because contractions often form deep chains. Let's measure the impact on a realistic tensor network computation:

# Create a chain of 10 tensors with sequential contractions
num_tensors = 10
tensors = []
for i in range(num_tensors):
    itag_in = f"i{i}"
    itag_out = f"i{i+1}"
    T = Tensor.random([idx_large, idx_large.flip()], 
                      itags=[itag_in, itag_out], seed=100+i)
    tensors.append(T)

print(f"Created chain of {num_tensors} tensors")

# Without gradients - contract all sequentially
start = time.time()
for _ in range(20):
    result = tensors[0]
    for T in tensors[1:]:
        result = contract(result, T)
time_no_grad = time.time() - start

# With gradients - same chain
for T in tensors:
    T.requires_grad = True
torch.set_grad_enabled(True)
start = time.time()
for _ in range(20):
    result = tensors[0]
    for T in tensors[1:]:
        result = contract(result, T)
    if result.data:  # Ensure computation
        pass
time_with_grad = time.time() - start

print(f"\nSequential chain: T0 --[i1]-- T1 --[i2]-- ... --[i{num_tensors}]-- T{num_tensors-1}")
print(f"Without grad: {time_no_grad:.4f}s")
print(f"With grad: {time_with_grad:.4f}s")
print(f"Overhead: {(time_with_grad / time_no_grad - 1) * 100:.1f}%")
Created chain of 10 tensors

Sequential chain: T0 --[i1]-- T1 --[i2]-- ... --[i10]-- T9
Without grad: 0.0100s
With grad: 0.0106s
Overhead: 6.6%

Key Observations:

  • The overhead is multiplicative, not additive: longer chains show proportionally larger slowdowns
  • For typical tensor network algorithms (DMRG, TEBD, etc.), this overhead is often not worth it since they use specialized update schemes rather than gradient descent
  • For variational optimization of small networks, the overhead may be acceptable if gradients provide faster convergence

When to worry about overhead:

  • Don't worry: Single-shot computations, prototyping, small-scale optimization
  • ⚠️ Be cautious: Inner loops with thousands of iterations, production DMRG/TEBD
  • Avoid: Time-critical code where gradients aren't needed

Memory Usage

Beyond computation time, autograd also increases memory consumption significantly. PyTorch must store:

  1. Intermediate tensors: All results from the forward pass needed for backprop
  2. Gradient buffers: Space to accumulate gradients for each parameter
  3. Computation graph metadata: Nodes tracking operation types and connections

For tensor networks, this can be problematic because:

  • Large bond dimensions create big intermediate tensors
  • Deep contraction trees accumulate many intermediates
  • Iterative algorithms repeatedly build and destroy graphs
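The first cost, stored intermediates, is easy to see in plain PyTorch: the computation graph keeps forward-pass results alive even after your own references to them are gone (shapes below are illustrative).

```python
import torch

torch.set_grad_enabled(True)
x = torch.randn(50, 50, dtype=torch.float64, requires_grad=True)

y = x @ x              # intermediate saved by the graph for backward
z = (y * y).sum()
del y                  # our handle is gone, but the graph still holds it

z.backward()           # succeeds: the saved intermediate was retained
print(x.grad.shape)    # torch.Size([50, 50])
```

In an iterative algorithm, freeing the loss (or calling `.backward()`, which releases the graph by default) is what actually returns this memory.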
# Memory-efficient: clear gradients regularly
optimizer = torch.optim.Adam([block for block in psi.data.values()])

# Enable gradient tracking for variational optimization
torch.set_grad_enabled(True)

for step in range(100):
    optimizer.zero_grad()  # Clear old gradients (REQUIRED!)

    # Compute energy/observable and gradients
    energy = compute_energy(psi)
    energy.backward()
    optimizer.step()

    # Free computation graph immediately
    del energy

Additional memory-saving techniques:

  • Gradient checkpointing: Recompute intermediate values during backward instead of storing them (trade compute for memory)
  • Detach when possible: Use .detach() on tensors you don't need gradients for to break the computation graph
  • Context managers: Wrap non-differentiable computations in torch.no_grad() to avoid graph construction
  • Chunking: Break large contractions into smaller pieces to limit peak memory usage
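Three of these techniques in plain-torch form, a sketch with illustrative tensor shapes (checkpointing uses `torch.utils.checkpoint`, available in recent PyTorch versions):

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.set_grad_enabled(True)
A = torch.randn(16, 16, dtype=torch.float64, requires_grad=True)

# detach: treat an environment tensor as a constant (no graph through it)
env = (A @ A).detach()
assert not env.requires_grad

# no_grad: observables that never need gradients build no graph
with torch.no_grad():
    norm = A.norm()
assert not norm.requires_grad

# checkpointing: recompute the block in backward instead of storing it
def block(x):
    return x @ x

loss = (checkpoint(block, A, use_reentrant=False) * env).sum()
loss.backward()
assert A.grad is not None
```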

When to Use Autograd

Good Use Cases:

  • Variational ground state search
  • Training parametric tensor networks
  • Optimization problems with smooth objectives
  • Differentiable physics simulations

Poor Use Cases (gradients add cost without benefit):

  • Standard DMRG (uses an iterative eigensolver, not gradient descent)
  • Time evolution (uses analytical update formulas)
  • Sampling-based methods
  • Situations where gradients are not informative

See Also