Benchmarks

This section presents comprehensive benchmarks comparing torch-sla solvers across different problem sizes, backends, and configurations.


Test Environment

| Component | Specification |
|---|---|
| GPU | NVIDIA H200 (140 GB HBM3) |
| CPU | AMD EPYC (64 cores) |
| Memory | 512 GB DDR5 |
| CUDA | 12.4 |
| PyTorch | 2.4.0 |
| Problem Type | 2D Poisson equation (5-point stencil) |
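
The benchmark problem is the standard 5-point discretization of the 2D Poisson equation. It can be reproduced with SciPy in a few lines; the sketch below assembles it as a Kronecker sum (a minimal illustration, not necessarily the exact generator used by the benchmark scripts):

```python
import numpy as np
import scipy.sparse as sp

def poisson_2d(n):
    """5-point Laplacian on an n x n grid: n*n DOF, SPD, ~5 nonzeros per row."""
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    I = sp.identity(n, format="csr")
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()   # Kronecker-sum assembly

A = poisson_2d(1000)              # 1M DOF (the 1000x1000 grid row in the tables)
b = np.ones(A.shape[0])
```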


Solver Performance Comparison

Performance Scaling

Solve Time (milliseconds)

| DOF | SciPy SuperLU | cuDSS Cholesky | PyTorch CG | Speedup vs Direct |
|---|---|---|---|---|
| 10K | 24 | 128 | 20 | 1.2× |
| 100K | 29 | 630 | 43 | |
| 1M | 19,400 | 7,300 | 190 | 102× |
| 2M | 52,900 | 15,600 | 418 | 127× |
| 16M | OOM | OOM | 7,300 | N/A |
| 81M | OOM | OOM | 75,900 | N/A |
| 169M | OOM | OOM | 224,000 | N/A |

Key Finding: PyTorch CG+Jacobi achieves 100× speedup over direct solvers at 2M DOF and is the only solver that scales to 169M DOF.
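
The iterative column above is conjugate gradient with a Jacobi (diagonal) preconditioner. The sketch below shows that algorithm written against a generic matvec so it works with any sparse layout; it is illustrative only and is not the torch-sla implementation:

```python
import torch

def cg_jacobi(matvec, diag, b, tol=1e-6, max_iter=50_000):
    """Jacobi-preconditioned conjugate gradient (illustrative sketch).

    matvec : callable returning A @ v for an SPD matrix A
    diag   : 1-D tensor holding diag(A), used as the preconditioner
    """
    x = torch.zeros_like(b)
    r = b - matvec(x)              # initial residual (x0 = 0)
    z = r / diag                   # apply M^{-1} = diag(A)^{-1}
    p = z.clone()
    rz = torch.dot(r, z)
    b_norm = torch.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / torch.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if torch.linalg.norm(r) / b_norm < tol:
            break
        z = r / diag
        rz_new = torch.dot(r, z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, it
```

With a CSR tensor `A`, `matvec` can be as simple as `lambda v: A @ v`; on PyTorch builds where sparse-dense matmul requires a 2-D operand, wrap it as `lambda v: (A @ v.unsqueeze(1)).squeeze(1)`.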


Memory Usage

Memory Characteristics

| Method | Scaling | Memory @ 2M DOF | Max DOF (140 GB) |
|---|---|---|---|
| SciPy SuperLU | O(n^1.5) fill-in | ~50 GB | ~2M (CPU) |
| cuDSS Cholesky | O(n^1.5) fill-in | ~80 GB | ~2M |
| PyTorch CG | O(n) linear | ~0.9 GB | 169M+ |

Memory per DOF (PyTorch CG):

| Component | Bytes/DOF | At 169M DOF | Notes |
|---|---|---|---|
| Matrix (CSR) | ~144 | ~24 GB | 5 nnz/row × (8+8+4) bytes |
| Vectors | ~80 | ~13 GB | x, b, r, p, z, etc. |
| Total | ~443 | ~75 GB | Well below 140 GB |


Accuracy Comparison

Relative Residual ‖Ax - b‖ / ‖b‖

| Method | Typical Precision | Residual @ 1M DOF | Notes |
|---|---|---|---|
| SciPy SuperLU | ~1e-14 | 2.3e-15 | Machine precision |
| cuDSS Cholesky | ~1e-14 | 1.8e-15 | Machine precision |
| PyTorch CG | ~1e-6 | 8.7e-7 | Configurable (tol=1e-6) |

Trade-off: direct solvers reach machine precision (~1e-14); the iterative solver reaches ~1e-6 (configurable via tol) but is roughly 100× faster at large problem sizes.
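
The reported residual can be checked directly for any backend. A minimal sketch, assuming `A` is a sparse tensor whose layout supports matvec and `x`, `b` are dense 1-D tensors:

```python
import torch

def relative_residual(A, x, b):
    """||Ax - b|| / ||b||, the quantity reported in the table above."""
    r = A @ x - b          # use a matvec wrapper if your sparse layout needs one
    return (torch.linalg.norm(r) / torch.linalg.norm(b)).item()
```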


Large-Scale Benchmarks

Scaling to 169 Million DOF

PyTorch CG Scaling (up to 169M DOF)

| DOF | Grid Size | Time (s) | Memory (GB) | Iterations |
|---|---|---|---|---|
| 1M | 1000×1000 | 0.19 | 0.4 | 1,847 |
| 4M | 2000×2000 | 0.95 | 1.8 | 3,687 |
| 16M | 4000×4000 | 7.3 | 7.1 | 7,234 |
| 64M | 8000×8000 | 42.1 | 28.4 | 14,412 |
| 100M | 10000×10000 | 89.2 | 44.3 | 18,012 |
| 169M | 13000×13000 | 224 | 75 | 23,456 |

Complexity: a log-log fit of the table gives time ≈ O(n^1.4) over this range, with iteration counts growing as ~√n (as expected for Jacobi-preconditioned CG on a 2D Poisson problem) while memory stays O(n).
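
The exponent can be recovered from the table itself with a least-squares fit in log-log space (the arrays below are copied from the DOF and time columns above):

```python
import numpy as np

# DOF and solve-time columns from the table above
dof  = np.array([1e6, 4e6, 16e6, 64e6, 100e6, 169e6])
secs = np.array([0.19, 0.95, 7.3, 42.1, 89.2, 224.0])

# Slope of the log-log fit is the empirical complexity exponent (~1.4 here)
slope, _ = np.polyfit(np.log(dof), np.log(secs), 1)
print(f"time ~ O(n^{slope:.2f})")
```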


Matrix Multiplication Benchmarks

SpMV (Sparse Matrix × Dense Vector)

SpMV Performance (GFLOPS)

| Matrix Size | nnz | PyTorch | cuSPARSE | Speedup |
|---|---|---|---|---|
| 100K | 500K | 45 | 52 | 0.87× |
| 1M | 5M | 128 | 145 | 0.88× |
| 10M | 50M | 312 | 298 | 1.05× |
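
Throughput here is computed as 2·nnz / time. A minimal timing sketch with CUDA events, assuming `A` is a CUDA CSR tensor and `x` a dense right-hand side of shape (n, 1); the benchmark scripts may time things differently:

```python
import torch

def spmv_gflops(A, x, iters=100):
    """Time y = A @ x on the GPU and report 2 * nnz / time in GFLOPS."""
    nnz = A.values().numel()
    for _ in range(10):                      # warm-up (kernel selection, caches)
        y = A @ x
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        y = A @ x
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters   # elapsed_time() is in ms
    return 2 * nnz / seconds / 1e9
```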


SuiteSparse Matrix Collection

Real-World Matrix Benchmarks

We benchmark on the SuiteSparse Matrix Collection, a standard set of sparse matrices drawn from real applications (thermal analysis, circuit simulation, FEM, etc.).
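
The matrices can be downloaded as MatrixMarket (.mtx) files from https://sparse.tamu.edu. A sketch of loading one into a CUDA CSR tensor (the file path is illustrative):

```python
import scipy.io
import scipy.sparse as sp
import torch

# e.g. thermal2.mtx downloaded from https://sparse.tamu.edu
A_sp = sp.csr_matrix(scipy.io.mmread("thermal2.mtx"))

A = torch.sparse_csr_tensor(
    torch.from_numpy(A_sp.indptr).to(torch.int64),
    torch.from_numpy(A_sp.indices).to(torch.int64),
    torch.from_numpy(A_sp.data),
    size=A_sp.shape,
).cuda()
```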

SuiteSparse Results (Selected Matrices)

| Matrix | Size | nnz | cuDSS (ms) | PyTorch CG (ms) | Speedup |
|---|---|---|---|---|---|
| thermal2 | 1.2M | 8.6M | 2,340 | 89 | 26× |
| ecology2 | 1.0M | 5.0M | 1,890 | 45 | 42× |
| G3_circuit | 1.6M | 7.7M | 3,120 | 112 | 28× |
| apache2 | 715K | 4.8M | 890 | 38 | 23× |
| parabolic_fem | 526K | 3.7M | 456 | 28 | 16× |

Matrix sources: SuiteSparse Matrix Collection, https://sparse.tamu.edu


Distributed Solve (Multi-GPU)

torch-sla supports distributed sparse matrix operations with domain decomposition and halo exchange. Tested on 3-4× NVIDIA H200 GPUs with NCCL backend, scaling to 400M DOF.
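
Conceptually, each rank owns a contiguous block of rows, halo values are exchanged between neighbouring ranks for the local SpMV, and the CG inner products become all-reduces. A minimal sketch of that inner-product pattern with the NCCL backend (illustrative only, not the torch-sla implementation):

```python
import torch
import torch.distributed as dist

def distributed_dot(a_local, b_local):
    """Global dot product of row-partitioned vectors (one block per rank)."""
    partial = torch.dot(a_local, b_local).reshape(1)
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # sum the per-rank partial dots
    return partial[0]

# Typical setup when launched with torchrun (one process per GPU):
#   dist.init_process_group("nccl")
#   torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```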

CUDA (3-4 GPU, NCCL) - Scales to 400M DOF

| DOF | Time | Residual | Memory/GPU | GPUs | Bytes/DOF |
|---|---|---|---|---|---|
| 10K | 0.1s | 9.4e-5 | 0.03 GB | 4 | 3,000 |
| 100K | 0.3s | 2.9e-4 | 0.05 GB | 4 | 500 |
| 1M | 0.9s | 9.9e-4 | 0.27 GB | 4 | 270 |
| 10M | 3.4s | 3.1e-3 | 2.35 GB | 4 | 235 |
| 50M | 15.2s | 7.1e-3 | 11.6 GB | 4 | 232 |
| 100M | 36.1s | 1.0e-2 | 23.3 GB | 4 | 233 |
| 200M | 119.8s | 1.5e-2 | 53.7 GB | 3 | 269 |
| 300M | 217.4s | 1.9e-2 | 80.5 GB | 3 | 268 |
| 400M | 330.9s | 2.3e-2 | 110.3 GB | 3 | 276 |

CPU (4 proc, Gloo)

| DOF | Time | Residual |
|---|---|---|
| 10K | 0.37s | 7.5e-9 |
| 100K | 7.42s | 1.1e-8 |

Distributed Key Findings

  • Scales to 400M DOF: 330 seconds on 3× H200 GPUs (110 GB/GPU)
  • Good scaling: 10M→400M is 40× the DOF for ~100× the time (roughly O(n^1.25) observed; the three largest runs used only 3 GPUs)
  • Memory efficient: ~275 bytes/DOF per GPU at scale
  • Limit: 500M DOF would need >140 GB/GPU, which exceeds H200 capacity

# Run distributed solve with 4 GPUs
torchrun --standalone --nproc_per_node=4 examples/distributed/distributed_solve.py

Backend Comparison Summary

When to Use Each Backend

| Backend | Best For | Max DOF | Precision | Relative Speed |
|---|---|---|---|---|
| scipy+superlu | Small CPU problems | ~2M | 1e-14 | Baseline |
| cudss+cholesky | Medium CUDA, SPD | ~2M | 1e-14 | |
| cudss+lu | Medium CUDA, general | ~1M | 1e-14 | |
| pytorch+cg | Large CUDA, SPD | 169M+ | 1e-6 | 100× |
| pytorch+bicgstab | Large CUDA, general | 100M+ | 1e-6 | 50× |


Recommendations

Quick Summary

  • Small Problems (< 100K DOF): Use cudss+cholesky for best accuracy
  • Large Problems (> 1M DOF): Use pytorch+cg — it's the only option that scales
  • Machine Precision: Direct solvers (cholesky, superlu) achieve ~1e-14
  • ML Training: Iterative solvers with tol=1e-4 offer the best speed/accuracy tradeoff
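
These rules of thumb can be encoded as a small helper. The function below is a hypothetical sketch that only returns the backend names used in the tables on this page; check the torch-sla documentation for how a backend string is actually passed to the solver:

```python
def pick_backend(n_dof, device="cuda", spd=True):
    """Heuristic backend choice following the recommendation tables in this section."""
    if device == "cpu" or n_dof < 10_000:
        return "scipy+superlu"            # small problems: GPU overhead not worth it
    if n_dof <= 2_000_000:
        return "cudss+cholesky" if spd else "cudss+lu"
    return "pytorch+cg" if spd else "pytorch+bicgstab"
```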

Based on Problem Size

| Problem Size | CPU Recommendation | CUDA Recommendation | Notes |
|---|---|---|---|
| < 10K DOF | scipy+superlu | scipy+superlu | GPU overhead not worth it |
| 10K - 100K DOF | scipy+superlu | cudss+cholesky | GPU starts to pay off |
| 100K - 2M DOF | scipy+superlu | cudss+cholesky or pytorch+cg | CG faster but less precise |
| > 2M DOF | N/A (OOM) | pytorch+cg | Only option that scales |

Based on Precision Requirements

| Requirement | Recommendation | Achievable Precision |
|---|---|---|
| Machine precision needed | cudss+cholesky (CUDA) or scipy+superlu (CPU) | ~1e-14 |
| Engineering precision (1e-6) | pytorch+cg with tol=1e-6 | ~1e-6 |
| Fast iteration (ML training) | pytorch+cg with tol=1e-4 | ~1e-4 |


Running Benchmarks

To reproduce these benchmarks:

# Install torch-sla with dev dependencies
pip install torch-sla[dev]

# Run solver benchmarks
cd benchmarks
python benchmark_solvers.py

# Run large-scale benchmarks
python benchmark_large_scale.py

# Run SuiteSparse benchmarks
python benchmark_suitesparse.py

Results are saved to benchmarks/results/.