Performance Tips¶
Nicole's symmetry-aware architecture provides automatic efficiency gains, but understanding a few optimization strategies can dramatically improve performance for large-scale tensor network calculations.
Key strategies:
- Memory efficiency: Use appropriate symmetries and aggressive truncation
- Computational speed: Leverage block sparsity and optimize contraction order
- GPU acceleration: Use CUDA or MPS for large tensors
- Device management: Minimize CPU↔GPU transfers
- Profiling: Identify bottlenecks in your specific application
While symmetry already eliminates most unnecessary computation, careful attention to tensor structure, contraction patterns, device placement, and truncation thresholds can yield order-of-magnitude speedups for demanding algorithms such as DMRG, XTRG, or PEPS calculations.
Memory Efficiency¶
Use Appropriate Symmetries¶
# More symmetries = fewer blocks = less memory
from nicole import ProductGroup, U1Group, Z2Group
# Without symmetry: full dense tensor
# With U(1): ~N blocks
# With U(1) × Z(2): ~N/2 blocks
# Choose symmetries that match your problem
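As a rough illustration (plain Python, not the Nicole API; the sector counts are hypothetical), compare the storage of a dense matrix with its block-diagonal counterpart:

```python
# Hypothetical 1000x1000 float64 operator that is block-diagonal
# under a U(1)-like symmetry with 10 equally sized charge sectors.
N, n_sectors = 1000, 10
block = N // n_sectors  # 100x100 per sector

dense_bytes = N * N * 8                   # full dense storage
sparse_bytes = n_sectors * block**2 * 8   # only the symmetry-allowed blocks

print(f"dense:  {dense_bytes / 1e6:.1f} MB")   # 8.0 MB
print(f"blocks: {sparse_bytes / 1e6:.1f} MB")  # 0.8 MB
```

Here block sparsity alone cuts storage tenfold; refining the symmetry further shrinks the allowed blocks again.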
Truncate Aggressively¶
from nicole import decomp
# Truncate small singular values
U, S, Vh = decomp(T, axes=0, mode="SVD", trunc={"thresh": 1e-12})
# Or keep fixed number
U, S, Vh = decomp(T, axes=0, mode="SVD", trunc={"nkeep": 50})
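For intuition about what a threshold truncation does, here is a hedged NumPy sketch (illustrative only; the relative-threshold convention shown may differ from what `trunc={"thresh": ...}` actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix that is numerically rank 5: a low-rank part plus tiny noise
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))
A += 1e-14 * rng.standard_normal((50, 50))

U, S, Vh = np.linalg.svd(A, full_matrices=False)
keep = S > 1e-12 * S[0]  # drop singular values below a relative threshold
U_t, S_t, Vh_t = U[:, keep], S[keep], Vh[keep]

# Reconstruction error is controlled by the discarded singular values
err = np.linalg.norm(A - (U_t * S_t) @ Vh_t) / np.linalg.norm(A)
print(int(keep.sum()))  # keeps 5 of 50 singular values
```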
Computation Efficiency¶
Contraction Order Matters¶
from nicole import contract
# BAD: Contract large tensors first
# A: (100×10), B: (10×100), C: (100×5)
# result = contract(contract(A, B), C) # Creates 100×100 intermediate
# GOOD: Contract to reduce size early
# result = contract(A, contract(B, C)) # Creates 10×5 intermediate
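The two orders above can be compared directly by counting multiply-add operations (a generic cost model, not a Nicole feature):

```python
def matmul_flops(m, k, n):
    """Multiply-adds for an (m x k) @ (k x n) product."""
    return m * k * n

# A: 100x10, B: 10x100, C: 100x5
# BAD: (A @ B) @ C goes through a 100x100 intermediate
bad = matmul_flops(100, 10, 100) + matmul_flops(100, 100, 5)
# GOOD: A @ (B @ C) goes through a 10x5 intermediate
good = matmul_flops(10, 100, 5) + matmul_flops(100, 10, 5)

print(bad, good)  # 150000 vs 10000: 15x fewer operations
```

The same reasoning applies block-wise to symmetric tensors; the intermediate's total size is what matters.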
Reuse Index Objects¶
# GOOD: Create once, reuse
idx = Index(Direction.OUT, group, sectors=sectors)
tensors = [Tensor.random([idx, idx.flip()], itags=[f"a{i}", f"b{i}"])
           for i in range(10)]
# BAD: Create new index each time
tensors = [Tensor.random([Index(...), Index(...)], ...)
           for i in range(10)]  # Wastes time and memory
Use Appropriate Data Types¶
import torch
# Use float32 if precision allows (required for MPS on Apple Silicon)
T_32 = Tensor.random([idx, idx.flip()], dtype=torch.float32) # 4 bytes per element
# Use float64 when needed (default for CPU/CUDA)
T_64 = Tensor.random([idx, idx.flip()], dtype=torch.float64) # 8 bytes per element
# Complex types available
T_complex = Tensor.random([idx, idx.flip()], dtype=torch.complex128)
# Memory difference can be significant for large tensors
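To put numbers on that, a quick NumPy sketch (illustrative only; Nicole stores its blocks as PyTorch tensors, but the per-element sizes are the same):

```python
import numpy as np

n = 1_000_000
for dt in (np.float32, np.float64, np.complex128):
    buf = np.zeros(n, dtype=dt)
    print(f"{np.dtype(dt).name}: {buf.nbytes / 1e6:.0f} MB")
# float32: 4 MB, float64: 8 MB, complex128: 16 MB
```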
Device Management¶
GPU Acceleration¶
Nicole supports GPU acceleration through PyTorch:
import torch
from nicole import Tensor
# Check device availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
# Create tensor on GPU (if available)
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
# Create tensor directly on device
T_gpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], device=device)
# Or transfer existing tensor
T_cpu = Tensor.random([idx, idx.flip()], itags=["i", "j"])
T_gpu = T_cpu.to(device)
Minimize Device Transfers¶
# BAD: Unnecessary CPU→GPU round-trip each iteration
for i in range(100):
    T = Tensor.random([idx, idx.flip()], device='cpu')
    T_gpu = T.cuda()  # Redundant transfer; the tensor was never needed on CPU
    result = contract(T_gpu, B_gpu)

# GOOD: Create tensors directly on the target device
for i in range(100):
    T_gpu = Tensor.random([idx, idx.flip()], device='cuda')
    result = contract(T_gpu, B_gpu)
If tensors must originate from CPU (e.g. loaded from disk), transfer them all before the compute loop to avoid interleaving transfers with GPU kernels:
# BAD: Transfer interleaved with GPU compute
for T in tensors_cpu:
    result = contract(T.cuda(), B_gpu)

# GOOD: Batch-transfer first, then compute
tensors_gpu = [T.cuda() for T in tensors_cpu]
for T_gpu in tensors_gpu:
    result = contract(T_gpu, B_gpu)
Profiling¶
Check Memory Usage¶
import torch
# Get memory per block
for key, block in tensor.data.items():
    mem_mb = block.element_size() * block.numel() / (1024 ** 2)
    print(f"Block {key}: {mem_mb:.2f} MB on {block.device}")
# Total memory
total_mb = sum(b.element_size() * b.numel() for b in tensor.data.values()) / (1024 ** 2)
print(f"Total: {total_mb:.2f} MB")
# GPU memory usage (if using CUDA)
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
Time Operations¶
import time
import torch
# For CPU timing
start = time.perf_counter()  # perf_counter is monotonic and high-resolution
result = contract(A, B)
elapsed = time.perf_counter() - start
print(f"Contraction took {elapsed:.3f} seconds")

# For GPU timing (synchronization needed; CUDA launches are asynchronous)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # Ensure all pending GPU ops are done
    start = time.perf_counter()
    result = contract(A_gpu, B_gpu)
    torch.cuda.synchronize()  # Wait for the contraction itself to finish
    elapsed = time.perf_counter() - start
    print(f"GPU contraction took {elapsed:.3f} seconds")
Common Pitfalls¶
Don't Create Unnecessary Copies¶
# BAD: Creates a fresh copy every iteration
for i in range(100):
    T_clone = tensor.clone()  # Expensive!
    # ... use T_clone

# GOOD: Clone once outside the loop and reuse
T_working = tensor.clone()
for i in range(100):
    ...  # modify T_working in place
Avoid Block Iteration When Possible¶
import torch
# BAD: Manual iteration
result = 0
for block in tensor.data.values():
    result += torch.sum(block ** 2)
# GOOD: Use built-in methods
norm_squared = tensor.norm() ** 2
Performance Recommendations¶
General Recommendation: Default to CPU¶
In tensor network algorithms, tensors often grow very large and can exceed GPU memory.
- CPU is the default and most reliable choice for tensor network computations
- Use `torch.float64` (the default) for numerical precision in iterative algorithms
- Focus on contraction order optimization and aggressive truncation
- CPU memory is typically much larger (32-1024 GB) than GPU memory (8-24 GB)
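One way to act on this is a rough upfront size check before offloading anything. The helper below is a sketch, not Nicole API; its name, the 8 GB budget, and the 4x workspace headroom for contraction intermediates are all assumptions:

```python
def fits_on_gpu(shape, dtype_bytes=8, budget_bytes=8 * 1024**3, headroom=4):
    """Rough check: a tensor of `shape`, plus workspace for contraction
    intermediates (headroom factor), must fit within the GPU budget."""
    n = 1
    for d in shape:
        n *= d
    return headroom * n * dtype_bytes <= budget_bytes

print(fits_on_gpu((512, 512, 64)))    # True: ~0.5 GB including headroom
print(fits_on_gpu((4096, 4096, 64)))  # False: ~34 GB including headroom
```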
GPU Acceleration: Use Selectively When Memory Permits¶
- Only delegate to GPU when tensor sizes fit comfortably in GPU memory
- Monitor GPU memory usage carefully (use `torch.cuda.memory_allocated()`)
- Consider `torch.float32` to reduce memory footprint and improve GPU performance
- Best for:
- Small to medium tensors (< 4-8 GB depending on GPU)
- Specific bottleneck operations (large contractions, SVD)
- Batched operations where you can control memory usage
- Transfer tensors back to CPU after GPU operations to free GPU memory
For Apple Silicon (MPS)¶
- Must use `torch.float32` (float64 is not fully supported)
- Good for moderate-sized problems
- Automatic conversion happens via `.to('mps')`
For NVIDIA GPUs (CUDA)¶
- Best performance for large-scale calculations
- Supports both float32 and float64
- Use mixed precision when appropriate
See Also¶
- GPU Acceleration Guide: Detailed GPU usage
- API Reference: decomp
- API Reference: contract