Performance Tips

Nicole's symmetry-aware architecture provides automatic efficiency gains, but understanding a few optimization strategies can dramatically improve performance for large-scale tensor network calculations.

Key strategies:

  • Memory efficiency: Use appropriate symmetries and aggressive truncation
  • Computational speed: Leverage block sparsity and optimize contraction order
  • GPU acceleration: Use CUDA or MPS for large tensors
  • Device management: Minimize CPU↔GPU transfers
  • Profiling: Identify bottlenecks in your specific application

While symmetry already eliminates most unnecessary computation, careful attention to tensor structure, contraction patterns, device placement, and truncation thresholds can achieve orders of magnitude speedup for demanding algorithms like DMRG, XTRG, or PEPS calculations.

Memory Efficiency

Use Appropriate Symmetries

# More symmetries = fewer blocks = less memory
from nicole import ProductGroup, U1Group, Z2Group

# Without symmetry: full dense tensor
# With U(1): ~N blocks
# With U(1) × Z(2): ~N/2 blocks

# Choose symmetries that match your problem

Truncate Aggressively

from nicole import decomp

# Truncate small singular values
U, S, Vh = decomp(T, axes=0, mode="SVD", trunc={"thresh": 1e-12})

# Or keep fixed number
U, S, Vh = decomp(T, axes=0, mode="SVD", trunc={"nkeep": 50})
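The two truncation modes have direct analogues in plain NumPy, which can help clarify what `thresh` and `nkeep` do (this sketch does not use Nicole's API):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a matrix with known singular values: three O(1) values, three tiny ones.
s_true = np.array([3.0, 2.0, 1.0, 1e-13, 1e-14, 1e-15])
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q2, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q1 @ np.diag(s_true) @ Q2

U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Threshold truncation (analogue of trunc={"thresh": 1e-12})
keep = S > 1e-12
U_t, S_t, Vh_t = U[:, keep], S[keep], Vh[keep, :]

# Fixed-rank truncation (analogue of trunc={"nkeep": 2})
k = 2
U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]

# The truncated factors reproduce A up to the discarded singular weight
err = np.linalg.norm(A - U_t @ np.diag(S_t) @ Vh_t)
print(S_t.size)       # 3 singular values survive the threshold
print(err < 1e-12)    # True
```

Thresholds bound the truncation error directly; a fixed `nkeep` bounds memory and cost instead, which is often what matters in DMRG-style sweeps.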

Computation Efficiency

Contraction Order Matters

from nicole import contract

# BAD: Contract large tensors first
# A: (100×10), B: (10×100), C: (100×5)
# result = contract(contract(A, B), C)  # Creates 100×100 intermediate

# GOOD: Contract to reduce size early
# result = contract(A, contract(B, C))  # Creates 10×5 intermediate
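A rough FLOP count (2·m·k·n per matrix product) makes the asymmetry between the two orders above concrete:

```python
# Operation count for the two contraction orders above,
# using the standard 2*m*k*n cost of an (m x k)(k x n) product.

def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

# A: 100x10, B: 10x100, C: 100x5
bad = matmul_flops(100, 10, 100) + matmul_flops(100, 100, 5)   # (A·B)·C
good = matmul_flops(10, 100, 5) + matmul_flops(100, 10, 5)     # A·(B·C)

print(bad)    # 300000 FLOPs
print(good)   # 20000 FLOPs, 15x fewer operations
```

For longer chains of tensors the gap grows quickly, so it is worth computing (or at least estimating) costs like these before committing to a contraction order.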

Reuse Index Objects

# GOOD: Create once, reuse
idx = Index(Direction.OUT, group, sectors=sectors)
tensors = [Tensor.random([idx, idx.flip()], itags=[f"a{i}", f"b{i}"]) 
           for i in range(10)]

# BAD: Create new index each time
tensors = [Tensor.random([Index(...), Index(...)], ...) 
           for i in range(10)]  # Wastes time and memory

Use Appropriate Data Types

import torch

# Use float32 if precision allows (required for MPS on Apple Silicon)
T_32 = Tensor.random([idx, idx.flip()], dtype=torch.float32)  # 4 bytes per element

# Use float64 when needed (default for CPU/CUDA)
T_64 = Tensor.random([idx, idx.flip()], dtype=torch.float64)  # 8 bytes per element

# Complex types available
T_complex = Tensor.random([idx, idx.flip()], dtype=torch.complex128)

# Memory difference can be significant for large tensors
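For a sense of scale, the per-element sizes translate into the following footprints for a single large block (illustrative numbers computed with NumPy dtypes, not tied to Nicole):

```python
import numpy as np

# Bytes per element for the dtypes above, and the resulting size
# of one 10000 x 10000 block.

for dt in (np.float32, np.float64, np.complex128):
    itemsize = np.dtype(dt).itemsize
    gib = itemsize * 10_000 * 10_000 / 1024**3
    print(f"{np.dtype(dt).name}: {itemsize} B/element, "
          f"{gib:.2f} GiB per 10000 x 10000 block")
```

Halving the element size with float32 halves every block, every intermediate, and every transfer, which is why it matters most on memory-constrained GPUs.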

Device Management

GPU Acceleration

Nicole supports GPU acceleration through PyTorch:

import torch
from nicole import Tensor

# Check device availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# Create tensor on GPU (if available)
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

# Create tensor directly on device
T_gpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], device=device)

# Or transfer existing tensor
T_cpu = Tensor.random([idx, idx.flip()], itags=["i", "j"])
T_gpu = T_cpu.to(device)

Minimize Device Transfers

# BAD: Unnecessary CPU→GPU round-trip each iteration
for i in range(100):
    T = Tensor.random([idx, idx.flip()], device='cpu')
    T_gpu = T.cuda()  # Redundant transfer — tensor was never needed on CPU
    result = contract(T_gpu, B_gpu)

# GOOD: Create tensors directly on the target device
for i in range(100):
    T_gpu = Tensor.random([idx, idx.flip()], device='cuda')
    result = contract(T_gpu, B_gpu)

If tensors must originate from CPU (e.g. loaded from disk), transfer them all before the compute loop to avoid interleaving transfers with GPU kernels:

# BAD: Transfer interleaved with GPU compute
for T in tensors_cpu:
    result = contract(T.cuda(), B_gpu)

# GOOD: Batch-transfer first, then compute
tensors_gpu = [T.cuda() for T in tensors_cpu]
for T_gpu in tensors_gpu:
    result = contract(T_gpu, B_gpu)

Profiling

Check Memory Usage

import torch

# Get memory per block
for key, block in tensor.data.items():
    mem_mb = block.element_size() * block.numel() / (1024 ** 2)
    print(f"Block {key}: {mem_mb:.2f} MB on {block.device}")

# Total memory
total_mb = sum(b.element_size() * b.numel() for b in tensor.data.values()) / (1024 ** 2)
print(f"Total: {total_mb:.2f} MB")

# GPU memory usage (if using CUDA)
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**2:.2f} MB")

Time Operations

import time
import torch

# For CPU timing (perf_counter is monotonic and higher-resolution than time.time)
start = time.perf_counter()
result = contract(A, B)
elapsed = time.perf_counter() - start
print(f"Contraction took {elapsed:.3f} seconds")

# For GPU timing (synchronization needed: CUDA kernels launch asynchronously)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # Ensure all pending GPU ops are done
    start = time.perf_counter()
    result = contract(A_gpu, B_gpu)
    torch.cuda.synchronize()  # Wait for the timed contraction to finish
    elapsed = time.perf_counter() - start
    print(f"GPU contraction took {elapsed:.3f} seconds")
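If you time many operations, the pattern above can be wrapped in a small reusable context manager (plain Python, perf_counter based; the optional `sync` callable is a hypothetical hook where you would pass `torch.cuda.synchronize` when timing GPU work):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, sync=None):
    """Print the wall-clock time of the enclosed block.

    Pass sync=torch.cuda.synchronize to time GPU work correctly.
    """
    if sync is not None:
        sync()                      # flush pending GPU kernels first
    start = time.perf_counter()
    yield
    if sync is not None:
        sync()                      # wait for the timed kernels to finish
    print(f"{label}: {time.perf_counter() - start:.3f} s")

# Usage (CPU example)
with timed("sum"):
    total = sum(range(1_000_000))
```

Keeping the synchronization inside the helper avoids the common mistake of timing only the kernel launch rather than the kernel itself.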

Common Pitfalls

Don't Create Unnecessary Copies

# BAD: Creates copies
for i in range(100):
    T_clone = tensor.clone()  # Expensive!
    # ... use T_clone

# GOOD: Use original or clone once
T_working = tensor.clone()
for i in range(100):
    # ... modify T_working in place

Avoid Block Iteration When Possible

import torch

# BAD: Manual iteration
result = 0
for block in tensor.data.values():
    result += torch.sum(block ** 2)

# GOOD: Use built-in methods
norm_squared = tensor.norm() ** 2

Performance Recommendations

General Recommendation: Default to CPU

In tensor network algorithms, tensors often grow very large and can exceed GPU memory.

  • CPU is the default and most reliable choice for tensor network computations
  • Use torch.float64 (default) for numerical precision in iterative algorithms
  • Focus on contraction order optimization and aggressive truncation
  • CPU memory is typically much larger (32-1024 GB) than GPU memory (8-24 GB)

GPU Acceleration: Use Selectively When Memory Permits

  • Only delegate to GPU when tensor sizes fit comfortably in GPU memory
  • Monitor GPU memory usage carefully (use torch.cuda.memory_allocated())
  • Consider torch.float32 to reduce memory footprint and improve GPU performance
  • Best for:
    • Small to medium tensors (< 4-8 GB depending on GPU)
    • Specific bottleneck operations (large contractions, SVD)
    • Batched operations where you can control memory usage
  • Transfer tensors back to CPU after GPU operations to free GPU memory
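The sizing advice above can be turned into a simple pre-flight check. The helper below is hypothetical (not part of Nicole): it estimates a tensor's footprint from its shape and element size, and leaves headroom for the intermediates that contractions and SVDs create.

```python
# Hypothetical helper: decide whether a tensor plausibly fits on a GPU,
# reserving a safety fraction of the budget for intermediates.

def fits_on_device(shape, bytes_per_element, budget_gib, safety=0.5):
    """True if the tensor uses at most safety * budget_gib GiB."""
    n_bytes = bytes_per_element
    for d in shape:
        n_bytes *= d
    return n_bytes <= safety * budget_gib * 1024**3

# float64 tensors (8 B/element) vs. an 8 GiB GPU with 50% headroom
print(fits_on_device((10_000, 10_000), 8, budget_gib=8))   # True  (~0.75 GiB)
print(fits_on_device((30_000, 30_000), 8, budget_gib=8))   # False (~6.7 GiB)
```

A check like this, run before `.to(device)`, is far cheaper than recovering from an out-of-memory error mid-sweep.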

For Apple Silicon (MPS)

  • Must use torch.float32 (float64 not fully supported)
  • Good for moderate-sized problems
  • Automatic conversion happens via .to('mps')

For NVIDIA GPUs (CUDA)

  • Best performance for large-scale calculations
  • Supports both float32 and float64
  • Use mixed precision when appropriate

See Also