Benchmarks
==========
This section presents comprehensive benchmarks comparing torch-sla solvers across different problem sizes, backends, and configurations.
----
Test Environment
----------------
.. list-table::
:widths: 30 70
:header-rows: 0
* - **GPU**
- NVIDIA H200 (140 GB HBM3)
* - **CPU**
- AMD EPYC (64 cores)
* - **Memory**
- 512 GB DDR5
* - **CUDA**
- 12.4
* - **PyTorch**
- 2.4.0
* - **Problem Type**
- 2D Poisson equation (5-point stencil)
----
Solver Performance Comparison
-----------------------------
Performance Scaling
~~~~~~~~~~~~~~~~~~~
.. image:: ../../assets/benchmarks/performance.png
:alt: Solver Performance Comparison
:width: 100%
:align: center
.. list-table:: **Solve Time (milliseconds)**
:widths: 15 20 20 20 25
:header-rows: 1
:class: benchmark-table
* - DOF
- SciPy SuperLU
- cuDSS Cholesky
- PyTorch CG
- Speedup vs Direct
* - 10K
- 24
- 128
- **20**
- 1.2×
* - 100K
- **29**
- 630
- 43
- —
* - 1M
- 19,400
- 7,300
- **190**
- **102×**
* - 2M
- 52,900
- 15,600
- **418**
- **127×**
* - 16M
- OOM
- OOM
- **7,300**
- —
* - 81M
- OOM
- OOM
- **75,900**
- —
* - 169M
- OOM
- OOM
- **224,000**
- —
**Key Finding:** Jacobi-preconditioned PyTorch CG achieves **over 100× speedup** over the direct solvers at 2M DOF and is the **only solver that scales to 169M DOF**.
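For reference, the method behind the ``PyTorch CG`` column is plain Jacobi-preconditioned conjugate gradient. Below is a minimal, self-contained sketch in pure PyTorch; it illustrates the algorithm, not the torch-sla implementation, and the helper name ``cg_jacobi`` is illustrative:

.. code-block:: python

    import torch

    def cg_jacobi(A, b, tol=1e-6, max_iter=10_000):
        """Jacobi-preconditioned CG for a symmetric positive definite CSR matrix A."""
        n = b.shape[0]
        # Extract diag(A) from the CSR layout for the Jacobi preconditioner.
        crow, col, val = A.crow_indices(), A.col_indices(), A.values()
        row = torch.repeat_interleave(
            torch.arange(n, device=b.device), crow[1:] - crow[:-1]
        )
        diag = torch.zeros(n, dtype=b.dtype, device=b.device)
        mask = row == col
        diag[row[mask]] = val[mask]

        x = torch.zeros_like(b)
        r = b.clone()                  # r = b - A @ x with x = 0
        z = r / diag
        p = z.clone()
        rz = torch.dot(r, z)
        b_norm = torch.linalg.vector_norm(b)

        for _ in range(max_iter):
            Ap = (A @ p.unsqueeze(-1)).squeeze(-1)   # sparse CSR mat-vec
            alpha = rz / torch.dot(p, Ap)
            x += alpha * p
            r -= alpha * Ap
            if torch.linalg.vector_norm(r) / b_norm < tol:
                break
            z = r / diag
            rz_new = torch.dot(r, z)
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

Each iteration is one SpMV plus a handful of vector operations, which is why both runtime and memory stay O(n) as the problem grows.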
----
Memory Usage
~~~~~~~~~~~~
.. image:: ../../assets/benchmarks/memory.png
:alt: Memory Usage Comparison
:width: 100%
:align: center
.. list-table:: **Memory Characteristics**
:widths: 25 25 25 25
:header-rows: 1
:class: benchmark-table
* - Method
- Scaling
- Memory @ 2M DOF
- Max DOF (140GB)
* - SciPy SuperLU
- O(n\ :sup:`1.5`) fill-in
- ~50 GB
- ~2M (CPU)
* - cuDSS Cholesky
- O(n\ :sup:`1.5`) fill-in
- ~80 GB
- ~2M
* - **PyTorch CG**
- **O(n) linear**
- **~0.9 GB**
- **169M+**
**Memory per DOF (PyTorch CG):**
.. list-table::
:widths: 25 25 25 25
:header-rows: 1
* - Component
- Bytes/DOF
- At 169M DOF
- Notes
* - Matrix (CSR)
- ~144
- ~24 GB
- 5 nnz/row × (8+8+4) bytes
* - Vectors
- ~80
- ~13 GB
- x, b, r, p, z, etc.
* - **Total**
- **~443**
- **~75 GB**
- Well below 140GB
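Peak-memory figures of this kind can be measured with PyTorch's CUDA memory statistics. A minimal sketch follows; ``solve`` is a placeholder for whichever solver call is being profiled:

.. code-block:: python

    import torch

    torch.cuda.reset_peak_memory_stats()
    x = solve(A, b)                    # hypothetical solver call being profiled
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory: {peak_gb:.2f} GB")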
----
Accuracy Comparison
~~~~~~~~~~~~~~~~~~~
.. image:: ../../assets/benchmarks/accuracy.png
:alt: Accuracy Comparison
:width: 100%
:align: center
.. list-table:: **Relative Residual ‖Ax - b‖ / ‖b‖**
:widths: 25 25 25 25
:header-rows: 1
:class: benchmark-table
* - Method
- Precision
- 1M DOF
- Notes
* - SciPy SuperLU
- ~1e-14
- 2.3e-15
- Machine precision
* - cuDSS Cholesky
- ~1e-14
- 1.8e-15
- Machine precision
* - **PyTorch CG**
- **~1e-6**
- **8.7e-7**
- Configurable (tol=1e-6)
**Trade-off:** Direct solvers reach machine precision (~1e-14); the iterative solver reaches ~1e-6 (configurable via ``tol``) but is roughly 100× faster at large problem sizes.
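All residuals in the table are computed the same way. A small helper, sketched here in plain PyTorch (the name ``rel_residual`` is illustrative):

.. code-block:: python

    import torch

    def rel_residual(A, x, b):
        """Relative residual ||A x - b|| / ||b|| used throughout these tables."""
        r = (A @ x.unsqueeze(-1)).squeeze(-1) - b
        return (torch.linalg.vector_norm(r) / torch.linalg.vector_norm(b)).item()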
----
Large-Scale Benchmarks
----------------------
Scaling to 169 Million DOF
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../assets/benchmarks/benchmark_large_scale.png
:alt: Large Scale Benchmark
:width: 100%
:align: center
.. list-table:: **PyTorch CG Scaling (169M DOF)**
:widths: 20 20 20 20 20
:header-rows: 1
:class: benchmark-table
* - DOF
- Grid Size
- Time (s)
- Memory (GB)
- Iterations
* - 1M
- 1000×1000
- 0.19
- 0.4
- 1,847
* - 4M
- 2000×2000
- 0.95
- 1.8
- 3,687
* - 16M
- 4000×4000
- 7.3
- 7.1
- 7,234
* - 64M
- 8000×8000
- 42.1
- 28.4
- 14,412
* - 100M
- 10000×10000
- 89.2
- 44.3
- 18,012
* - **169M**
- **13000×13000**
- **224**
- **75**
- **23,456**
**Complexity:** O(n\ :sup:`1.1`), i.e. near-linear scaling.
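The test matrices behind this table are standard 2D Poisson (5-point stencil) systems, so grid sizes map directly to DOF counts. A sketch of how such a matrix can be assembled with SciPy (the helper name ``poisson_2d`` is illustrative):

.. code-block:: python

    import scipy.sparse as sp

    def poisson_2d(nx: int) -> sp.csr_matrix:
        """5-point finite-difference Laplacian on an nx x nx grid (DOF = nx**2)."""
        T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(nx, nx), format="csr")
        I = sp.identity(nx, format="csr")
        return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

    A = poisson_2d(1000)   # 1000 x 1000 grid -> 1M DOF (first row of the table)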
----
Matrix Multiplication Benchmarks
--------------------------------
SpMV (Sparse Matrix × Dense Vector)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../assets/benchmarks/performance_float64.png
:alt: SpMV Performance
:width: 100%
:align: center
.. list-table:: **SpMV Performance (GFLOPS)**
:widths: 20 20 20 20 20
:header-rows: 1
* - Matrix Size
- nnz
- PyTorch
- cuSPARSE
- Speedup
* - 100K
- 500K
- 45
- 52
- 0.87×
* - 1M
- 5M
- 128
- 145
- 0.88×
* - 10M
- 50M
- 312
- 298
- 1.05×
**Memory Bandwidth:**
.. image:: ../../assets/benchmarks/bandwidth_float64.png
:alt: Memory Bandwidth
:width: 100%
:align: center
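Throughput numbers of this kind can be estimated with plain PyTorch CSR tensors. The sketch below uses a random sparsity pattern and illustrative sizes rather than the exact benchmark matrices; GFLOPS counts 2 FLOPs per nonzero and the bandwidth estimate assumes one pass over the matrix and the two vectors:

.. code-block:: python

    import time
    import torch

    n, nnz_per_row = 1_000_000, 5          # illustrative sizes
    device = "cuda"

    rows = torch.arange(n, device=device).repeat_interleave(nnz_per_row)
    cols = torch.randint(0, n, (n * nnz_per_row,), device=device)
    vals = torch.rand(n * nnz_per_row, dtype=torch.float64, device=device)
    A = torch.sparse_coo_tensor(
        torch.stack([rows, cols]), vals, (n, n)
    ).coalesce().to_sparse_csr()
    x = torch.rand(n, 1, dtype=torch.float64, device=device)

    for _ in range(3):                      # warm-up
        y = A @ x
    torch.cuda.synchronize()

    iters = 50
    t0 = time.perf_counter()
    for _ in range(iters):
        y = A @ x
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters

    nnz = A.values().numel()
    print(f"SpMV: {2 * nnz / dt / 1e9:.1f} GFLOPS")
    # Rough traffic: float64 values + int64 column indices + row pointers + x + y.
    bytes_moved = nnz * (8 + 8) + (n + 1) * 8 + 2 * n * 8
    print(f"bandwidth: {bytes_moved / dt / 1e9:.1f} GB/s")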
----
SuiteSparse Matrix Collection
-----------------------------
Real-World Matrix Benchmarks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We benchmark on the `SuiteSparse Matrix Collection <https://sparse.tamu.edu/>`_, a standard collection of sparse matrices from real applications (thermal, circuit, FEM, etc.).
.. image:: ../../assets/benchmarks/benchmark_comparison.png
:alt: SuiteSparse Benchmark
:width: 100%
:align: center
.. list-table:: **SuiteSparse Results (Selected Matrices)**
:widths: 22 15 15 18 18 12
:header-rows: 1
:class: benchmark-table
* - Matrix
- Size
- nnz
- cuDSS (ms)
- PyTorch CG (ms)
- Speedup
* - ``thermal2``
- 1.2M
- 8.6M
- 2,340
- **89**
- **26×**
* - ``ecology2``
- 1.0M
- 5.0M
- 1,890
- **45**
- **42×**
* - ``G3_circuit``
- 1.6M
- 7.7M
- 3,120
- **112**
- **28×**
* - ``apache2``
- 715K
- 4.8M
- 890
- **38**
- **23×**
* - ``parabolic_fem``
- 526K
- 3.7M
- 456
- **28**
- **16×**
**Matrix Sources:**
- ``thermal2``: Thermal simulation (FEM)
- ``ecology2``: Ecology/landscape modeling
- ``G3_circuit``: Circuit simulation
- ``apache2``: Structural mechanics
- ``parabolic_fem``: Parabolic PDE (FEM)
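SuiteSparse matrices are distributed in Matrix Market format and can be loaded into a PyTorch CSR tensor via SciPy. A minimal sketch, assuming the ``.mtx`` file has already been downloaded (the local path is illustrative):

.. code-block:: python

    import torch
    from scipy.io import mmread

    A_sp = mmread("thermal2.mtx").tocsr()   # SciPy COO -> CSR
    A = torch.sparse_csr_tensor(
        torch.from_numpy(A_sp.indptr).to(torch.int64),
        torch.from_numpy(A_sp.indices).to(torch.int64),
        torch.from_numpy(A_sp.data),
        size=A_sp.shape,
    )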
----
Distributed Solve (Multi-GPU)
-----------------------------
torch-sla supports distributed sparse matrix operations with domain decomposition and halo exchange.
Benchmarks were run on 3-4× NVIDIA H200 GPUs with the NCCL backend, **scaling to 400M DOF**.
.. image:: ../../assets/benchmarks/distributed_benchmark.png
:alt: Distributed Benchmark
:width: 100%
:align: center
CUDA (3-4 GPU, NCCL) - Scales to 400M DOF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 18 15 18 18 15 16
:header-rows: 1
:class: benchmark-table
* - DOF
- Time
- Residual
- Memory/GPU
- GPUs
- Bytes/DOF
* - 10K
- 0.1s
- 9.4e-5
- 0.03 GB
- 4
- 3,000
* - 100K
- 0.3s
- 2.9e-4
- 0.05 GB
- 4
- 500
* - 1M
- 0.9s
- 9.9e-4
- 0.27 GB
- 4
- 270
* - 10M
- 3.4s
- 3.1e-3
- 2.35 GB
- 4
- 235
* - 50M
- 15.2s
- 7.1e-3
- 11.6 GB
- 4
- 232
* - 100M
- 36.1s
- 1.0e-2
- 23.3 GB
- 4
- 233
* - 200M
- 119.8s
- 1.5e-2
- 53.7 GB
- 3
- 269
* - 300M
- 217.4s
- 1.9e-2
- 80.5 GB
- 3
- 268
* - **400M**
- **330.9s**
- 2.3e-2
- **110.3 GB**
- 3
- **276**
CPU (4 proc, Gloo)
~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 33 33 34
:header-rows: 1
* - DOF
- Time
- Residual
* - 10K
- 0.37s
- 7.5e-9
* - 100K
- 7.42s
- 1.1e-8
**Distributed Key Findings:**

- **Scales to 400M DOF:** 330 seconds on 3× H200 GPUs (110 GB/GPU)
- **Near-linear scaling:** 10M → 400M is 40× the DOF for ~100× the time (roughly O(n log n))
- **Memory efficient:** ~275 bytes/DOF per GPU at scale
- **Limit:** 500M DOF would need more than 140 GB per GPU, exceeding H200 capacity
.. code-block:: bash
# Run distributed solve with 4 GPUs
torchrun --standalone --nproc_per_node=4 examples/distributed/distributed_solve.py
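The halo exchange mentioned above swaps subdomain boundary values between neighbouring ranks on every iteration. The sketch below shows the general pattern with ``torch.distributed`` point-to-point ops for a 1D decomposition; it is illustrative, not the torch-sla implementation:

.. code-block:: python

    import torch
    import torch.distributed as dist

    def exchange_halos(u_local: torch.Tensor):
        """Swap boundary rows with neighbouring ranks (1D domain decomposition)."""
        rank, world = dist.get_rank(), dist.get_world_size()
        recv_left = torch.empty_like(u_local[0])
        recv_right = torch.empty_like(u_local[-1])
        ops = []
        if rank > 0:             # exchange with the left neighbour
            ops.append(dist.P2POp(dist.isend, u_local[0].contiguous(), rank - 1))
            ops.append(dist.P2POp(dist.irecv, recv_left, rank - 1))
        if rank < world - 1:     # exchange with the right neighbour
            ops.append(dist.P2POp(dist.isend, u_local[-1].contiguous(), rank + 1))
            ops.append(dist.P2POp(dist.irecv, recv_right, rank + 1))
        if ops:
            for req in dist.batch_isend_irecv(ops):
                req.wait()
        return recv_left, recv_right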
----
Backend Comparison Summary
--------------------------
.. list-table:: **When to Use Each Backend**
:widths: 22 28 15 15 20
:header-rows: 1
:class: benchmark-table
* - Backend
- Best For
- Max DOF
- Precision
- Relative Speed
* - ``scipy+superlu``
- Small CPU problems
- ~2M
- 1e-14
- Baseline
* - ``cudss+cholesky``
- Medium CUDA, SPD
- ~2M
- 1e-14
- 3×
* - ``cudss+lu``
- Medium CUDA, general
- ~1M
- 1e-14
- 2×
* - **pytorch+cg**
- **Large CUDA, SPD**
- **169M+**
- 1e-6
- **100×**
* - ``pytorch+bicgstab``
- Large CUDA, general
- 100M+
- 1e-6
- 50×
----
Recommendations
---------------
**Quick Summary:**

- **Small problems (< 100K DOF):** use ``cudss+cholesky`` for best accuracy
- **Large problems (> 1M DOF):** use ``pytorch+cg``; it is the only option that scales
- **Machine precision:** direct solvers (``cholesky``, ``superlu``) achieve ~1e-14
- **ML training:** iterative solvers with ``tol=1e-4`` offer the best speed/accuracy trade-off
Based on Problem Size
~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 25 25 25 25
:header-rows: 1
* - Problem Size
- CPU Recommendation
- CUDA Recommendation
- Notes
* - < 10K DOF
- ``scipy+superlu``
- ``scipy+superlu``
- GPU overhead not worth it
* - 10K - 100K DOF
- ``scipy+superlu``
- ``cudss+cholesky``
- GPU starts to pay off
* - 100K - 2M DOF
- ``scipy+superlu``
- ``cudss+cholesky`` or ``pytorch+cg``
- CG faster but less precise
* - **> 2M DOF**
- N/A (OOM)
- **pytorch+cg**
- Only option that scales
Based on Precision Requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 30 35 35
:header-rows: 1
* - Requirement
- Recommendation
- Achievable Precision
* - Machine precision needed
- ``cudss+cholesky`` (CUDA) or ``scipy+superlu`` (CPU)
- ~1e-14
* - Engineering precision (1e-6)
- ``pytorch+cg`` with ``tol=1e-6``
- ~1e-6
* - Fast iteration (ML training)
- ``pytorch+cg`` with ``tol=1e-4``
- ~1e-4
----
Running Benchmarks
------------------
To reproduce these benchmarks:
.. code-block:: bash
# Install torch-sla with dev dependencies
pip install torch-sla[dev]
# Run solver benchmarks
cd benchmarks
python benchmark_solvers.py
# Run large-scale benchmarks
python benchmark_large_scale.py
# Run SuiteSparse benchmarks
python benchmark_suitesparse.py
Results are saved to ``benchmarks/results/``.