GPU Acceleration

Nicole leverages PyTorch's device management to enable GPU acceleration for large-scale tensor network calculations. This guide covers how to use GPUs effectively with Nicole.

Overview

Nicole supports three device types:

  • CPU: Default device, supports all dtypes (float32, float64, complex64, complex128)
  • CUDA (NVIDIA GPUs): Full support for all dtypes; generally the best performance for large-scale calculations
  • MPS (Apple Silicon): Supports float32/complex64 natively; float64/complex128 are automatically downgraded
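The cuda → mps → cpu preference can be made explicit with a small helper. This is a sketch, not part of Nicole's API; the `best_device` name is ours:

```python
import torch

def best_device() -> str:
    """Pick the fastest available backend: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(f"Selected device: {best_device()}")
```

The result can be passed as the `device=` argument wherever tensors are created, so the rest of the code stays device-agnostic.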

Checking Device Availability

Before using GPU acceleration, check what devices are available:

import torch

# Check NVIDIA GPU (CUDA)
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

# Check Apple Silicon GPU (MPS)
print(f"MPS available: {torch.backends.mps.is_available()}")

# Current default device
print(f"Default device: {torch.get_default_device()}")
CUDA available: False
MPS available: True
Default device: cpu

Creating Tensors on GPU

Specify Device at Creation

# Create directly on GPU
if torch.cuda.is_available():
    T_gpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], device='cuda', seed=42)
elif torch.backends.mps.is_available():
    T_gpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], device='mps', seed=42)
else:
    T_gpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], device='cpu', seed=42)

print(f"Tensor device: {T_gpu.device}")
Tensor device: mps:0

Transfer Existing Tensor

# Create on CPU
T_cpu = Tensor.random([idx, idx.flip()], itags=["i", "j"], seed=10)

# Transfer to GPU
if torch.cuda.is_available():
    T_gpu_transfer = T_cpu.to('cuda')
elif torch.backends.mps.is_available():
    T_gpu_transfer = T_cpu.to('mps')
else:
    T_gpu_transfer = T_cpu  # Stay on CPU

# Transfer back to CPU
T_back = T_gpu_transfer.cpu()

print(f"Original: {T_cpu.device}")
print(f"On GPU: {T_gpu_transfer.device}")
print(f"Back to CPU: {T_back.device}")
Original: cpu
On GPU: mps:0
Back to CPU: cpu

Convenience Methods

T_cpu2 = Tensor.random([idx, idx.flip()], itags=["k", "l"], seed=20)

# Use convenience methods
if torch.cuda.is_available():
    T_gpu_conv = T_cpu2.cuda()  # Equivalent to .to('cuda')
    print(f"After .cuda(): {T_gpu_conv.device}")
elif torch.backends.mps.is_available():
    T_gpu_conv = T_cpu2.to('mps')  # MPS doesn't have .mps() method
    print(f"After .to('mps'): {T_gpu_conv.device}")
else:
    T_gpu_conv = T_cpu2
    print(f"No GPU available, staying on: {T_gpu_conv.device}")

# Transfer back
T_back2 = T_gpu_conv.cpu()  # Equivalent to .to('cpu')
print(f"After .cpu(): {T_back2.device}")
After .to('mps'): mps:0
After .cpu(): cpu

Device Consistency in Operations

All tensors in an operation must be on the same device:

from nicole import contract

# Create tensors on the same device
device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
A = Tensor.random([idx, idx.flip()], itags=["i", "mid"], device=device)
B = Tensor.random([idx, idx.flip()], itags=["mid", "j"], device=device)

# Contract (both on same device)
C = contract(A, B)
print(f"Result device: {C.device}")

# This would fail:
# A_cpu = Tensor.random([idx, idx.flip()], itags=["i", "mid"], device='cpu')
# B_gpu = Tensor.random([idx, idx.flip()], itags=["mid", "j"], device='cuda')
# C = contract(A_cpu, B_gpu)  # Error: tensors on different devices
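One way to avoid the mixed-device error is to normalize devices before the operation. The sketch below uses plain torch tensors as stand-ins; `to_common_device` is a hypothetical helper, but it works with any object exposing `.device` and `.to(...)`, which Nicole tensors do:

```python
import torch

def to_common_device(*tensors, device=None):
    """Move all tensors to one device (default: the first tensor's)."""
    target = torch.device(device) if device is not None else tensors[0].device
    # .to() is a no-op for tensors already on the target device
    return tuple(t.to(target) for t in tensors)

# Plain torch tensors stand in for Nicole tensors here
a = torch.randn(3, 3)   # cpu
b = torch.randn(3, 3)   # cpu (would be 'cuda' in the failing case above)
a2, b2 = to_common_device(a, b)
print(a2.device, b2.device)
```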

Data Types and GPU Compatibility

CUDA (NVIDIA)

import torch

# CUDA supports all dtypes
T_f32 = Tensor.random([idx, idx.flip()], dtype=torch.float32, device='cuda')
T_f64 = Tensor.random([idx, idx.flip()], dtype=torch.float64, device='cuda')
T_c64 = Tensor.random([idx, idx.flip()], dtype=torch.complex64, device='cuda')
T_c128 = Tensor.random([idx, idx.flip()], dtype=torch.complex128, device='cuda')

MPS (Apple Silicon)

MPS devices don't support float64 or complex128 dtypes. Nicole automatically handles this by converting to compatible dtypes:

  • float64 → float32
  • complex128 → complex64

This conversion happens automatically in two scenarios:

1. Creating tensors directly on MPS:

# Request float64 on MPS - automatically uses float32
if torch.backends.mps.is_available():
    T = Tensor.random([idx, idx.flip()], dtype=torch.float64, device='mps')
    print(f"Requested: float64, Actual: {T.dtype}")  # float32

    # Same for complex types
    T_complex = Tensor.random([idx, idx.flip()], dtype=torch.complex128, device='mps')
    print(f"Requested: complex128, Actual: {T_complex.dtype}")  # complex64
else:
    print("MPS not available")
Requested: float64, Actual: torch.float32
Requested: complex128, Actual: torch.complex64

2. Moving tensors to MPS:

# Create on CPU with float64
T_cpu = Tensor.random([idx, idx.flip()], dtype=torch.float64, device='cpu')
print(f"On CPU: {T_cpu.dtype}")

# Transfer to MPS - auto-converts to float32
if torch.backends.mps.is_available():
    T_mps = T_cpu.to('mps')
    print(f"On MPS: {T_mps.dtype}")  # float32
else:
    print("MPS not available")
On CPU: torch.float64
On MPS: torch.float32
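Both scenarios apply the same dtype mapping. Expressed as plain Python for illustration (Nicole performs this internally; `mps_safe_dtype` is our name):

```python
import torch

# The downgrades applied for MPS, as a plain mapping
MPS_DTYPE_FALLBACK = {
    torch.float64: torch.float32,
    torch.complex128: torch.complex64,
}

def mps_safe_dtype(dtype):
    """Return an MPS-compatible dtype, downgrading 64-bit types."""
    return MPS_DTYPE_FALLBACK.get(dtype, dtype)

print(mps_safe_dtype(torch.float64))     # torch.float32
print(mps_safe_dtype(torch.complex128))  # torch.complex64
print(mps_safe_dtype(torch.float32))     # torch.float32 (unchanged)
```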

Native float32 Works Directly

If you already use float32 or complex64, no conversion is needed:

if torch.backends.mps.is_available():
    T = Tensor.random([idx, idx.flip()], dtype=torch.float32, device='mps')
    # Uses float32 directly, no conversion

Performance Best Practices

1. Minimize Device Transfers

Device transfers are expensive. Do them once at the start:

# Setup: a tensor that lives on the GPU for the whole loop
B_gpu = Tensor.random([idx, idx.flip()], device='cuda')

# BAD: Transfer every iteration
for i in range(100):
    T_cpu = Tensor.random([idx, idx.flip()], device='cpu')
    T_gpu = T_cpu.cuda()           # transfer in, every iteration
    result = contract(T_gpu, B_gpu)
    result_cpu = result.cpu()      # transfer out, every iteration

# GOOD: Work entirely on GPU
results = []
for i in range(100):
    T_gpu = Tensor.random([idx, idx.flip()], device='cuda')
    result = contract(T_gpu, B_gpu)
    results.append(result)

# Transfer back once at the end if needed
results_cpu = [r.cpu() for r in results]

2. Batch Transfer Operations

Each host-to-device copy carries fixed overhead, so do all transfers up front rather than interleaving them with computation:

# Transfer multiple tensors together
tensors_cpu = [Tensor.random([idx, idx.flip()], itags=[f"a{i}", f"b{i}"]) 
               for i in range(10)]

# Each .to() call still copies one tensor at a time, but grouping the
# transfers here keeps the GPU busy once computation starts
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tensors_gpu = [t.to(device) for t in tensors_cpu]
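When targeting CUDA, transfers from pinned (page-locked) host memory can additionally overlap with computation via `non_blocking=True`. A sketch with plain torch tensors; `transfer_all` is our helper name, and it falls back to ordinary copies when CUDA is absent:

```python
import torch

def transfer_all(tensors, device):
    """Move a list of tensors to `device`; pin host memory and copy
    asynchronously when targeting CUDA."""
    dev = torch.device(device)
    if dev.type == "cuda":
        return [t.pin_memory().to(dev, non_blocking=True) for t in tensors]
    return [t.to(dev) for t in tensors]

tensors = [torch.randn(64, 64) for _ in range(10)]
device = "cuda" if torch.cuda.is_available() else "cpu"
moved = transfer_all(tensors, device)
print(moved[0].device)
```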

3. Choose Data Type Wisely

# For MPS, use float32
if torch.backends.mps.is_available():
    T = Tensor.random([idx, idx.flip()], dtype=torch.float32, device='mps')

# For CUDA, choose based on precision needs
if torch.cuda.is_available():
    # float32: faster, less memory, good for most applications
    T_fast = Tensor.random([idx, idx.flip()], dtype=torch.float32, device='cuda')

    # float64: slower, more memory, higher precision
    T_precise = Tensor.random([idx, idx.flip()], dtype=torch.float64, device='cuda')
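The memory side of this choice is easy to quantify: float64 and complex128 take twice the bytes per element of their 32-bit counterparts, independent of device:

```python
import torch

# Bytes per element for each dtype Nicole supports
# float32: 4, float64: 8, complex64: 8, complex128: 16
for dtype in (torch.float32, torch.float64, torch.complex64, torch.complex128):
    size = torch.empty(1, dtype=dtype).element_size()
    print(f"{dtype}: {size} bytes/element")
```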

Monitoring GPU Memory

if torch.cuda.is_available():
    # Check memory usage
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"Reserved: {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
    print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")

    # Reset peak stats
    torch.cuda.reset_peak_memory_stats()

    # Clear cache if needed
    torch.cuda.empty_cache()
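For logging during long runs, the calls above can be wrapped in a small snapshot helper (the name `cuda_memory_mb` is ours; it returns zeros when CUDA is absent so the same code runs everywhere):

```python
import torch

def cuda_memory_mb():
    """Snapshot of CUDA memory usage in MB (all zeros without CUDA)."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0, "max_allocated": 0.0}
    mb = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated() / mb,
        "reserved": torch.cuda.memory_reserved() / mb,
        "max_allocated": torch.cuda.max_memory_allocated() / mb,
    }

print(cuda_memory_mb())
```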

Benchmarking CPU vs GPU

import time

# Create test tensors (Index, Direction, Sector and `group` from earlier setup)
idx_large = Index(Direction.OUT, group,
                  sectors=tuple(Sector(i, 10) for i in range(10)))

# CPU benchmark
T_cpu = Tensor.random([idx_large, idx_large.flip()], itags=["i", "j"], device='cpu')
U_cpu = Tensor.random([idx_large, idx_large.flip()], itags=["j", "k"], device='cpu')

start = time.time()
result_cpu = contract(T_cpu, U_cpu)
cpu_time = time.time() - start
print(f"CPU time: {cpu_time:.4f} seconds")

# GPU benchmark
if torch.cuda.is_available():
    T_gpu = T_cpu.cuda()
    U_gpu = U_cpu.cuda()

    # Warmup
    _ = contract(T_gpu, U_gpu)
    torch.cuda.synchronize()

    # Actual benchmark
    start = time.time()
    result_gpu = contract(T_gpu, U_gpu)
    torch.cuda.synchronize()  # Wait for GPU to finish
    gpu_time = time.time() - start
    print(f"GPU time: {gpu_time:.4f} seconds")
    print(f"Speedup: {cpu_time / gpu_time:.2f}x")
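The warmup-synchronize-time pattern can be factored into a reusable helper (ours, not Nicole's). It averages several runs with `time.perf_counter`, which has better resolution than `time.time`, and degrades to plain timing on CPU:

```python
import time
import torch

def timed(fn, *args, repeats=5):
    """Average wall time of fn(*args), with one warmup call and
    GPU synchronization when CUDA is in use."""
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    fn(*args)                      # warmup: kernel launch / caching effects
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    sync()                         # wait for all queued GPU work
    return (time.perf_counter() - start) / repeats

# torch.matmul stands in for a Nicole contraction here
a = torch.randn(128, 128)
b = torch.randn(128, 128)
print(f"matmul: {timed(torch.matmul, a, b):.6f} s")
```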

Common Pitfalls

1. Timing Without Synchronization

# BAD: Timing without synchronization
if torch.cuda.is_available():
    start = time.time()
    result = contract(A_gpu, B_gpu)
    elapsed = time.time() - start  # Wrong! GPU might still be running

# GOOD: Synchronize before timing
if torch.cuda.is_available():
    start = time.time()
    result = contract(A_gpu, B_gpu)
    torch.cuda.synchronize()
    elapsed = time.time() - start  # Correct

2. Mixed Devices in Operations

# This will error (assumes CUDA is available)
A_cpu = Tensor.random([idx, idx.flip()], device='cpu')
B_gpu = Tensor.random([idx, idx.flip()], device='cuda')
# result = contract(A_cpu, B_gpu)  # Error: tensors on different devices

# Solution: Move to same device first
B_cpu = B_gpu.cpu()
result = contract(A_cpu, B_cpu)  # OK

When to Use GPU

GPU is Beneficial When:

  • Tensor blocks are large (> 1000 elements per block)
  • Many contractions with the same tensors
  • Decompositions (SVD, QR) on large tensors
  • Iterative algorithms (DMRG, TEBD)

CPU is Better When:

  • Tensor blocks are small (< 100 elements per block)
  • Few operations (transfer overhead dominates)
  • Memory-constrained problems where CPU RAM is abundant
  • Prototyping and debugging
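The crossover point between these regimes is hardware-dependent, so it is worth measuring. A sketch using dense matmul as a proxy for a contraction (sizes and repeat counts are arbitrary; the GPU column appears only when CUDA is available):

```python
import time
import torch

def matmul_time(n, device, repeats=3):
    """Average time of an n x n matmul on `device`."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                      # warmup
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

for n in (50, 200, 1000):
    line = f"n={n:5d}  cpu={matmul_time(n, 'cpu'):.5f}s"
    if torch.cuda.is_available():
        line += f"  gpu={matmul_time(n, 'cuda'):.5f}s"
    print(line)
```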

See Also