Benchmarks¶

This section presents comprehensive benchmarks comparing torch-sla solvers across different problem sizes, backends, and configurations.

Test Environment¶

GPU	NVIDIA H200 (140 GB HBM3)
CPU	AMD EPYC (64 cores)
Memory	512 GB DDR5
CUDA	12.4
PyTorch	2.4.0
Problem Type	2D Poisson equation (5-point stencil)

Solver Performance Comparison¶

Performance Scaling¶

**Solve Time (milliseconds)**¶
DOF	SciPy SuperLU	cuDSS Cholesky	PyTorch CG	Speedup vs Direct
10K	24	128	20	1.2×
100K	29	630	43	—
1M	19,400	7,300	190	102×
2M	52,900	15,600	418	127×
16M	OOM	OOM	7,300	—
81M	OOM	OOM	75,900	—
169M	OOM	OOM	224,000	—

Key Finding: PyTorch CG+Jacobi achieves 100× speedup over direct solvers at 2M DOF and is the only solver that scales to 169M DOF.

Memory Usage¶

**Memory Characteristics**¶
Method	Scaling	Memory @ 2M DOF	Max DOF (140GB)
SciPy SuperLU	O(n^1.5) fill-in	~50 GB	~2M (CPU)
cuDSS Cholesky	O(n^1.5) fill-in	~80 GB	~2M
PyTorch CG	O(n) linear	~0.9 GB	169M+

Memory per DOF (PyTorch CG):

Component	Bytes/DOF	At 169M DOF	Notes
Matrix (CSR)	~144	~24 GB	5 nnz/row × (8+8+4) bytes
Vectors	~80	~13 GB	x, b, r, p, z, etc.
Total	~443	~75 GB	Well below 140GB

Accuracy Comparison¶

**Relative Residual ‖Ax - b‖ / ‖b‖**¶
Method	Precision	1M DOF	Notes
SciPy SuperLU	~1e-14	2.3e-15	Machine precision
cuDSS Cholesky	~1e-14	1.8e-15	Machine precision
PyTorch CG	~1e-6	8.7e-7	Configurable (tol=1e-6)

Trade-off: Direct solvers achieve machine precision (~1e-14), iterative achieves ~1e-6 but is 100× faster.

Large-Scale Benchmarks¶

Scaling to 169 Million DOF¶

**PyTorch CG Scaling (169M DOF)**¶
DOF	Grid Size	Time (s)	Memory (GB)	Iterations
1M	1000×1000	0.19	0.4	1,847
4M	2000×2000	0.95	1.8	3,687
16M	4000×4000	7.3	7.1	7,234
64M	8000×8000	42.1	28.4	14,412
100M	10000×10000	89.2	44.3	18,012
169M	13000×13000	224	75	23,456

Complexity: O(n^1.1) — near-linear scaling!

Matrix Multiplication Benchmarks¶

SpMV (Sparse Matrix × Dense Vector)¶

**SpMV Performance (GFLOPS)**¶
Matrix Size	nnz	PyTorch	cuSPARSE	Speedup
100K	500K	45	52	0.87×
1M	5M	128	145	0.88×
10M	50M	312	298	1.05×

Memory Bandwidth:

SuiteSparse Matrix Collection¶

Real-World Matrix Benchmarks¶

We benchmark on the SuiteSparse Matrix Collection, a standard collection of sparse matrices from real applications (thermal, circuit, FEM, etc.).

**SuiteSparse Results (Selected Matrices)**¶
Matrix	Size	nnz	cuDSS (ms)	PyTorch CG (ms)	Speedup
thermal2	1.2M	8.6M	2,340	89	26×
ecology2	1.0M	5.0M	1,890	45	42×
G3_circuit	1.6M	7.7M	3,120	112	28×
apache2	715K	4.8M	890	38	23×
parabolic_fem	526K	3.7M	456	28	16×

Matrix Sources:

thermal2: Thermal simulation (FEM)
ecology2: Ecology/landscape modeling
G3_circuit: Circuit simulation
apache2: Structural mechanics
parabolic_fem: Parabolic PDE (FEM)

Distributed Solve (Multi-GPU)¶

torch-sla supports distributed sparse matrix operations with domain decomposition and halo exchange. Tested on 3-4× NVIDIA H200 GPUs with NCCL backend, scaling to 400M DOF.

CUDA (3-4 GPU, NCCL) - Scales to 400M DOF¶

DOF	Time	Residual	Memory/GPU	GPUs	Bytes/DOF
10K	0.1s	9.4e-5	0.03 GB	4	3,000
100K	0.3s	2.9e-4	0.05 GB	4	500
1M	0.9s	9.9e-4	0.27 GB	4	270
10M	3.4s	3.1e-3	2.35 GB	4	235
50M	15.2s	7.1e-3	11.6 GB	4	232
100M	36.1s	1.0e-2	23.3 GB	4	233
200M	119.8s	1.5e-2	53.7 GB	3	269
300M	217.4s	1.9e-2	80.5 GB	3	268
400M	330.9s	2.3e-2	110.3 GB	3	276

CPU (4 proc, Gloo)¶

DOF	Time	Residual
10K	0.37s	7.5e-9
100K	7.42s	1.1e-8

Distributed Key Findings

Scales to 400M DOF: 330 seconds on 3× H200 GPUs (110 GB/GPU)
Near-linear scaling: 10M→400M is 40× DOF, ~100× time (O(n log n) complexity)
Memory efficient: ~275 bytes/DOF per GPU at scale
Limit: 500M DOF needs >140GB/GPU, exceeds H200 capacity

# Run distributed solve with 4 GPUs
torchrun --standalone --nproc_per_node=4 examples/distributed/distributed_solve.py

Backend Comparison Summary¶

**When to Use Each Backend**¶
Backend	Best For	Max DOF	Precision	Relative Speed
`scipy+superlu`	Small CPU problems	~2M	1e-14	Baseline
`cudss+cholesky`	Medium CUDA, SPD	~2M	1e-14	3×
`cudss+lu`	Medium CUDA, general	~1M	1e-14	2×
pytorch+cg	Large CUDA, SPD	169M+	1e-6	100×
`pytorch+bicgstab`	Large CUDA, general	100M+	1e-6	50×

Recommendations¶

Quick Summary

Small Problems (< 100K DOF): Use cudss+cholesky for best accuracy
Large Problems (> 1M DOF): Use pytorch+cg — it's the only option that scales
Machine Precision: Direct solvers (cholesky, superlu) achieve ~1e-14
ML Training: Iterative solvers with tol=1e-4 offer the best speed/accuracy tradeoff

Based on Problem Size¶

Problem Size	CPU Recommendation	CUDA Recommendation	Notes
< 10K DOF	`scipy+superlu`	`scipy+superlu`	GPU overhead not worth it
10K - 100K DOF	`scipy+superlu`	`cudss+cholesky`	GPU starts to pay off
100K - 2M DOF	`scipy+superlu`	`cudss+cholesky` or `pytorch+cg`	CG faster but less precise
> 2M DOF	N/A (OOM)	pytorch+cg	Only option that scales

Based on Precision Requirements¶

Requirement	Recommendation	Achievable Precision
Machine precision needed	`cudss+cholesky` (CUDA) or `scipy+superlu` (CPU)	~1e-14
Engineering precision (1e-6)	`pytorch+cg` with `tol=1e-6`	~1e-6
Fast iteration (ML training)	`pytorch+cg` with `tol=1e-4`	~1e-4

Running Benchmarks¶

To reproduce these benchmarks:

# Install torch-sla with dev dependencies
pip install torch-sla[dev]

# Run solver benchmarks
cd benchmarks
python benchmark_solvers.py

# Run large-scale benchmarks
python benchmark_large_scale.py

# Run SuiteSparse benchmarks
python benchmark_suitesparse.py

Results are saved to benchmarks/results/.