Examples

This section provides practical examples for using torch-sla.

Quick Navigation

  • Visualization: spy() for sparsity pattern analysis
  • I/O Operations: Matrix Market & SafeTensors format support
  • Linear Solve: Direct & iterative solvers with gradients
  • Matrix Decompositions: SVD, Eigenvalue, LU factorization
  • Advanced: Nonlinear solve, distributed computing

Visualization

Spy Plot (Sparsity Pattern)

Visualize the sparsity pattern of a sparse matrix using the .spy() method.

Code:

import torch
from torch_sla import SparseTensor

# Create a 2D Poisson matrix (5-point stencil)
n = 50
val, row, col = [], [], []
for i in range(n):
    for j in range(n):
        idx = i * n + j
        val.append(4.0); row.append(idx); col.append(idx)
        if j > 0: val.append(-1.0); row.append(idx); col.append(idx-1)
        if j < n-1: val.append(-1.0); row.append(idx); col.append(idx+1)
        if i > 0: val.append(-1.0); row.append(idx); col.append(idx-n)
        if i < n-1: val.append(-1.0); row.append(idx); col.append(idx+n)

A = SparseTensor(torch.tensor(val), torch.tensor(row), torch.tensor(col), (n*n, n*n))

# Visualize sparsity pattern
A.spy(title="2D Poisson (5-point stencil)")

Output Examples:

  • _images/spy_poisson_10x10.png: 2D Poisson (10×10), 100 DOF, 5-point stencil with grid lines

  • _images/spy_poisson_50x50.png: 2D Poisson (50×50), 2,500 DOF, band structure visible

  • _images/spy_tridiag_30x30.png: Tridiagonal (30×30), classic 1D Poisson pattern

  • _images/spy_random_100x100.png: Random Sparse (100×100), 800 random non-zeros

Each non-zero element is rendered as a colored pixel with intensity proportional to its absolute value. Zero elements are white.


I/O Operations

Matrix Market Format

Save and load sparse matrices in the standard Matrix Market (.mtx) format.

Code:

from torch_sla import SparseTensor, save_matrix_market, load_matrix_market

# Create a sparse matrix
A = SparseTensor(val, row, col, (100, 100))

# Save to Matrix Market format
save_matrix_market(A, "matrix.mtx", comment="My sparse matrix")

# Load from Matrix Market format
B = load_matrix_market("matrix.mtx", device="cuda")

# Verify
assert torch.allclose(A.to_dense(), B.to_dense())

File format (.mtx):

%%MatrixMarket matrix coordinate real general
% My sparse matrix
100 100 500
1 1 4.0
1 2 -1.0
...

SafeTensors Format

Save and load using the efficient safetensors format.

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, shape)

# Save
A.save("matrix.safetensors")

# Load
B = SparseTensor.load("matrix.safetensors", device="cuda")

# Save distributed (for multi-GPU)
A.save_distributed("matrix_dist/", num_partitions=4)

Basic Usage

Basic Sparse Linear Solve

Solve a sparse linear system \(Ax = b\) using SparseTensor.

Linear System:

Given a sparse matrix \(A \in \mathbb{R}^{n \times n}\) and right-hand side \(b \in \mathbb{R}^n\), find \(x \in \mathbb{R}^n\) such that:

\[Ax = b \quad \Leftrightarrow \quad x = A^{-1} b\]

Solver Methods:

  • Direct solvers (LU, Cholesky): Exact solution, \(O(n^{1.5})\) for sparse

  • Iterative solvers (CG, BiCGStab): Approximate solution, \(O(k \cdot \text{nnz})\) where \(k\) is iterations

Problem:

\[\begin{split}A = \begin{pmatrix} 4 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 4 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}\end{split}\]

This is a 3×3 symmetric positive definite (SPD) tridiagonal matrix from 1D Poisson discretization.

COO Format:

  Index   Row   Col   Value
  -----   ---   ---   -----
  0       0     0      4.0
  1       0     1     -1.0
  2       1     0     -1.0
  3       1     1      4.0
  4       1     2     -1.0
  5       2     1     -1.0
  6       2     2      4.0

Solution:

\[\begin{split}x = A^{-1}b = \begin{pmatrix} 0.4643 \\ 0.8571 \\ 0.9643 \end{pmatrix}\end{split}\]

Code:

import torch
from torch_sla import SparseTensor

# Create sparse matrix from dense (easier to read for small matrices)
dense = torch.tensor([[4.0, -1.0,  0.0],
                      [-1.0, 4.0, -1.0],
                      [ 0.0, -1.0, 4.0]], dtype=torch.float64)

A = SparseTensor.from_dense(dense)
b = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)

x = A.solve(b)
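As a standalone cross-check that does not require torch-sla, the same 3×3 system can be solved with dense PyTorch and compared against the solution quoted above:

```python
import torch

# Dense reference solve for the SPD tridiagonal system above
A = torch.tensor([[4.0, -1.0,  0.0],
                  [-1.0, 4.0, -1.0],
                  [ 0.0, -1.0, 4.0]], dtype=torch.float64)
b = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)

x = torch.linalg.solve(A, b)
print(x)  # approximately [0.4643, 0.8571, 0.9643]
```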

Property Detection

Detect matrix properties for optimal solver selection.

Symmetry: \(A = A^T\)

Positive Definiteness: All eigenvalues \(\lambda_i > 0\)

For the tridiagonal matrix: \(\lambda_1 \approx 2.59, \lambda_2 = 4.0, \lambda_3 \approx 5.41\) (all positive → SPD)

Code:

from torch_sla import SparseTensor

# Using the tridiagonal matrix from above
dense = torch.tensor([[4.0, -1.0,  0.0],
                      [-1.0, 4.0, -1.0],
                      [ 0.0, -1.0, 4.0]], dtype=torch.float64)
A = SparseTensor.from_dense(dense)

is_sym = A.is_symmetric()              # tensor(True)
is_pd = A.is_positive_definite()       # tensor(True)
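The eigenvalues quoted above (\(4 \mp \sqrt{2}\) and \(4\)) can be cross-checked with dense PyTorch, a standalone sketch:

```python
import torch

# Dense check of the eigenvalues: 4 - sqrt(2), 4, 4 + sqrt(2)
A = torch.tensor([[4.0, -1.0,  0.0],
                  [-1.0, 4.0, -1.0],
                  [ 0.0, -1.0, 4.0]], dtype=torch.float64)

eigvals = torch.linalg.eigvalsh(A)  # sorted ascending
print(eigvals)                      # ~[2.5858, 4.0000, 5.4142]
assert (eigvals > 0).all()          # all positive -> SPD
```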

Gradient Computation

Compute gradients through sparse solve using implicit differentiation.

Implicit Differentiation:

Given \(x = A^{-1} b\), for a loss function \(L(x)\), we want \(\frac{\partial L}{\partial A}\) and \(\frac{\partial L}{\partial b}\).

From \(Ax = b\), differentiate both sides:

\[dA \cdot x + A \cdot dx = db\]

Solving for \(dx\):

\[dx = A^{-1}(db - dA \cdot x)\]

Adjoint Method:

Define adjoint variable \(\lambda = A^{-T} \frac{\partial L}{\partial x}\), then:

\[\frac{\partial L}{\partial A_{ij}} = -\lambda_i \cdot x_j, \quad \frac{\partial L}{\partial b} = \lambda\]

Gradient formulas (summary):

\[\frac{\partial L}{\partial A} = -\lambda x^T, \quad \frac{\partial L}{\partial b} = A^{-T} \frac{\partial L}{\partial x}\]

Code:

import torch
from torch_sla import spsolve

val = torch.tensor([4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0],
                   dtype=torch.float64, requires_grad=True)
row = torch.tensor([0, 0, 1, 1, 1, 2, 2])
col = torch.tensor([0, 1, 0, 1, 2, 1, 2])
b = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)

x = spsolve(val, row, col, (3, 3), b)
loss = x.sum()
loss.backward()

# val.grad, b.grad now contain gradients
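The adjoint formulas above can also be verified against dense autograd (torch.linalg.solve is itself differentiable); this sketch does not use torch-sla:

```python
import torch

# Verify dL/db = A^{-T} dL/dx and dL/dA = -lambda x^T against autograd
torch.manual_seed(0)
A = torch.randn(4, 4, dtype=torch.float64) + 4 * torch.eye(4, dtype=torch.float64)
A.requires_grad_(True)
b = torch.randn(4, dtype=torch.float64, requires_grad=True)

x = torch.linalg.solve(A, b)
loss = x.sum()
loss.backward()

with torch.no_grad():
    dL_dx = torch.ones(4, dtype=torch.float64)           # dL/dx for L = sum(x)
    lam = torch.linalg.solve(A.T, dL_dx)                 # adjoint solve: A^T lambda = dL/dx
    assert torch.allclose(b.grad, lam)                   # dL/db = lambda
    assert torch.allclose(A.grad, -torch.outer(lam, x))  # dL/dA = -lambda x^T
```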

Specify Backend and Method

Choose solver backend and method explicitly.

Available Options:

  Backend    Device   Methods
  --------   ------   -------------------------------------
  scipy      CPU      superlu, umfpack, cg, bicgstab, gmres
  eigen      CPU      cg, bicgstab
  cusolver   CUDA     qr, cholesky, lu
  cudss      CUDA     lu, cholesky, ldlt

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, (n, n))
b = torch.randn(n, dtype=torch.float64)

x1 = A.solve(b, backend='scipy', method='superlu')    # Direct
x2 = A.solve(b, backend='scipy', method='cg')         # Iterative (SPD)
x3 = A.solve(b, backend='scipy', method='bicgstab')   # Iterative (general)

Matrix Operations

Compute norms, determinants, and eigenvalues.

Frobenius Norm:

\[\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2} = \sqrt{52} \approx 7.21\]

Determinant:

\[\det(A) = \text{product of eigenvalues}\]

For the tridiagonal matrix: \(\det(A) = 56\)

Gradient Formula:

\[\frac{\partial \det(A)}{\partial A_{ij}} = \det(A) \cdot (A^{-1})_{ji}\]

Code:

from torch_sla import SparseTensor

# Using the tridiagonal matrix from above
dense = torch.tensor([[4.0, -1.0,  0.0],
                      [-1.0, 4.0, -1.0],
                      [ 0.0, -1.0, 4.0]], dtype=torch.float64)
A = SparseTensor.from_dense(dense)

norm = A.norm('fro')                              # ≈ 7.21
det = A.det()                                     # 56.0 (with gradient support)
eigenvalues, eigenvectors = A.eigsh(k=2, which='LM')  # Top-2 eigenvalues

Batched Solve

Batched SparseTensor

Solve multiple systems with same sparsity pattern but different values.

Problem: Solve 4 systems with scaled matrices

\[A^{(0)} = A, \quad A^{(1)} = 1.1A, \quad A^{(2)} = 1.2A, \quad A^{(3)} = 1.3A\]

Code:

import torch
from torch_sla import SparseTensor

val = torch.tensor([4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0], dtype=torch.float64)
row = torch.tensor([0, 0, 1, 1, 1, 2, 2])
col = torch.tensor([0, 1, 0, 1, 2, 1, 2])

batch_size = 4
val_batch = val.unsqueeze(0).expand(batch_size, -1).clone()
for i in range(batch_size):
    val_batch[i] = val * (1.0 + 0.1 * i)

A = SparseTensor(val_batch, row, col, (batch_size, 3, 3))
b = torch.randn(batch_size, 3, dtype=torch.float64)

x = A.solve(b)  # x.shape: [4, 3]

Multi-Dimensional Batch

Handle shapes like [B1, B2, M, N].

Example: 2 materials × 3 temperatures = 6 systems

Code:

import torch
from torch_sla import SparseTensor

B1, B2, n = 2, 3, 8
# val, row, col: COO data of an n×n pattern, defined as in the earlier examples
val_batch = val.unsqueeze(0).unsqueeze(0).expand(B1, B2, -1).clone()

A = SparseTensor(val_batch, row, col, (B1, B2, n, n))
b = torch.randn(B1, B2, n, dtype=torch.float64)

x = A.solve(b)  # x.shape: [2, 3, 8]

solve_batch for Repeated Solves

Efficient batch solve with same structure but different values.

Use Case: Time-stepping with fixed sparsity pattern

LU Decomposition: \(A = LU\), then solve \(Ly = b\), \(Ux = y\)

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, shape)

val_batch = torch.stack([val * (1.0 + 0.01 * t) for t in range(100)])  # [100, nnz]
b_batch = torch.randn(100, n, dtype=torch.float64)

x_batch = A.solve_batch(val_batch, b_batch)  # [100, n]

SparseTensorList

Handle matrices with different sparsity patterns.

Use Case: FEM meshes with different element counts

Code:

from torch_sla import SparseTensor, SparseTensorList

A1 = SparseTensor(val1, row1, col1, (5, 5))
A2 = SparseTensor(val2, row2, col2, (10, 10))
A3 = SparseTensor(val3, row3, col3, (15, 15))

matrices = SparseTensorList([A1, A2, A3])

b_list = [torch.randn(5), torch.randn(10), torch.randn(15)]
x_list = matrices.solve(b_list)

Distributed Solve

Basic DSparseTensor

Create distributed sparse tensor with domain decomposition.

Domain Decomposition: Split 16-node grid into 2 domains

\[\text{Domain 0: } \{0,...,7\}, \quad \text{Domain 1: } \{8,...,15\}\]

Each domain has owned nodes and halo/ghost nodes from neighbors.

Code:

from torch_sla import DSparseTensor

D = DSparseTensor(val, row, col, (16, 16), num_partitions=2)

for i in range(D.num_partitions):
    p = D[i]
    # p.num_owned, p.num_halo, p.num_local

2D Poisson Example

Create 2D Poisson matrix with 5-point stencil.

Stencil:

\[\begin{split}\frac{1}{h^2} \begin{pmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{pmatrix}\end{split}\]

Code:

import torch
from torch_sla import DSparseTensor

def create_2d_poisson(nx, ny):
    N = nx * ny
    rows, cols, vals = [], [], []
    for i in range(ny):
        for j in range(nx):
            idx = i * nx + j
            rows.append(idx); cols.append(idx); vals.append(4.0)
            if j > 0:
                rows.append(idx); cols.append(idx - 1); vals.append(-1.0)
            if j < nx - 1:
                rows.append(idx); cols.append(idx + 1); vals.append(-1.0)
            if i > 0:
                rows.append(idx); cols.append(idx - nx); vals.append(-1.0)
            if i < ny - 1:
                rows.append(idx); cols.append(idx + nx); vals.append(-1.0)
    return (torch.tensor(vals), torch.tensor(rows), torch.tensor(cols), (N, N))

val, row, col, shape = create_2d_poisson(4, 4)
D = DSparseTensor(val, row, col, shape, num_partitions=2)

Scatter and Gather

Distribute global vectors to partitions and gather back.

Diagram:

Global:  [x0, x1, x2, x3, x4, x5, x6, x7]
                  ↓ scatter
Local:   [x0, x1, x2, x3, x6]  (P0 + halo)
         [x4, x5, x6, x7, x3]  (P1 + halo)
                  ↓ gather
Global:  [x0, x1, x2, x3, x4, x5, x6, x7]

Code:

from torch_sla import DSparseTensor

D = DSparseTensor(val, row, col, shape, num_partitions=2)
x_global = torch.arange(16, dtype=torch.float64)

x_local = D.scatter_local(x_global)
x_gathered = D.gather_global(x_local)

Halo Exchange

Exchange ghost node values between neighboring partitions.

1D Decomposition Diagram:

Partition 0: Owned [0,1,2,3], Halo [4] ← from P1
Partition 1: Owned [4,5,6,7], Halo [3] ← from P0

Exchange Process:

Before: P0=[x0,x1,x2,x3,?], P1=[x4,x5,x6,x7,?]
                     ↓ halo_exchange_local()
After:  P0=[x0,x1,x2,x3,x4], P1=[x4,x5,x6,x7,x3]

Why needed: For \(y_3 = \sum_j A_{3,j} x_j\), node 3 needs \(x_4\) from P1.

Code:

from torch_sla import DSparseTensor

D = DSparseTensor(val, row, col, shape, num_partitions=4)
x_list = [torch.randn(D[i].num_local) for i in range(D.num_partitions)]

D.halo_exchange_local(x_list)

Iterative Solvers

PyTorch CG Solver

For large-scale problems (> 100K DOF), iterative methods are much faster than direct solvers.

Conjugate Gradient (CG) Algorithm:

For symmetric positive definite (SPD) matrix \(A\), CG minimizes:

\[\phi(x) = \frac{1}{2} x^T A x - b^T x\]

The minimum is achieved at \(x^* = A^{-1} b\).

CG Iteration:

Starting from \(x_0\), with residual \(r_0 = b - Ax_0\) and search direction \(p_0 = r_0\):

\[\begin{split}\alpha_k &= \frac{r_k^T r_k}{p_k^T A p_k} \\ x_{k+1} &= x_k + \alpha_k p_k \\ r_{k+1} &= r_k - \alpha_k A p_k \\ \beta_k &= \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \\ p_{k+1} &= r_{k+1} + \beta_k p_k\end{split}\]

Convergence:

CG converges in at most \(n\) iterations (exact arithmetic). With condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\):

\[\|x_k - x^*\|_A \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x^*\|_A\]

Preconditioning:

Preconditioned CG (PCG) solves \(M^{-1} A x = M^{-1} b\) where \(M \approx A\):

  • Jacobi: \(M = \text{diag}(A)\) — simple, effective

  • Incomplete Cholesky: \(M = \tilde{L} \tilde{L}^T\) — better for ill-conditioned

Convergence Example:

_images/cg_convergence.png

CG convergence for 2D Poisson problems of various sizes. Larger problems require more iterations due to worse conditioning.

Performance Comparison (1M DOF, NVIDIA H200):

  Method           Time   Memory    Best For
  --------------   ----   -------   --------------------------
  pytorch+cg       0.5s   ~500 MB   > 100K DOF, SPD
  cudss+cholesky   7.8s   ~300 MB   < 100K DOF, high precision

Code:

from torch_sla import spsolve

# For large SPD systems, use PyTorch CG
x = spsolve(val, row, col, shape, b,
            backend='pytorch',
            method='cg',
            preconditioner='jacobi')
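The CG iteration shown earlier maps directly onto a few lines of plain PyTorch. This is a minimal, unpreconditioned sketch (not the library's implementation); the `matvec` callback can be any SPD operator, dense or sparse:

```python
import torch

def cg(matvec, b, x0=None, tol=1e-10, max_iter=1000):
    """Minimal conjugate gradient; `matvec` applies the SPD operator A."""
    x = torch.zeros_like(b) if x0 is None else x0.clone()
    r = b - matvec(x)              # r0 = b - A x0
    p = r.clone()                  # p0 = r0
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)      # alpha_k = r^T r / p^T A p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p  # beta_k = rs_new / rs
        rs = rs_new
    return x

A = torch.tensor([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]],
                 dtype=torch.float64)
b = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
x = cg(lambda v: A @ v, b)
assert torch.allclose(A @ x, b, atol=1e-8)
```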

Preconditioners

Available preconditioners for iterative solvers:

  Name         Description                  Best For
  ----------   --------------------------   -------------------------
  jacobi       Diagonal scaling (default)   General use, fastest
  ssor         Symmetric SOR (ω=1.5)        Slow convergence problems
  polynomial   Neumann series (degree=2)    GPU-parallel
  ic0          Incomplete Cholesky          Very ill-conditioned
  amg          Algebraic Multigrid          Float32, Poisson-like

Recommendation:

  • Float64: Use jacobi (simplest, fastest due to fewer iterations)

  • Float32: Use amg (reduces iterations, compensates for precision loss)

Code:

# Jacobi (default, recommended for float64)
x = spsolve(val, row, col, shape, b,
            backend='pytorch', preconditioner='jacobi')

# AMG (recommended for float32)
x = spsolve(val.float(), row, col, shape, b.float(),
            backend='pytorch', preconditioner='amg')

Mixed Precision

For memory-constrained scenarios, use mixed precision:

  • Matrix stored in Float32 (memory efficient)

  • Accumulation in Float64 (high precision)

Code:

x = spsolve(val_f32, row, col, shape, b_f32,
            backend='pytorch',
            method='cg',
            mixed_precision=True)  # Returns float64 solution

CUDA Usage

Move to CUDA

Transfer to GPU for CUDA-accelerated solving.

Performance: cuDSS/cuSOLVER can be 10-100× faster for large systems.

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, shape)
A_cuda = A.cuda()

x = A_cuda.solve(b.cuda())

Backend Selection on CUDA

Auto Selection: cuDSS (preferred) → cuSOLVER (fallback)

Code:

x = A_cuda.solve(b_cuda, backend='cudss', method='lu')
x = A_cuda.solve(b_cuda, backend='cudss', method='cholesky')  # For SPD
x = A_cuda.solve(b_cuda, backend='cusolver', method='qr')

Advanced Examples

Nonlinear Solve with Adjoint Gradients

Solve nonlinear equations \(F(u, \theta) = 0\) with automatic differentiation using the adjoint method.

Problem Formulation:

Given a nonlinear residual function \(F: \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^n\), find \(u^*\) such that:

\[F(u^*, \theta) = 0\]

where \(\theta\) are parameters (e.g., forcing term, material properties).

Newton-Raphson Method:

Starting from initial guess \(u_0\), iterate:

\[u_{k+1} = u_k - J_F^{-1}(u_k) F(u_k)\]

where \(J_F = \frac{\partial F}{\partial u}\) is the Jacobian matrix.

Adjoint Method for Gradients:

For a loss function \(L(u^*)\), the gradient w.r.t. parameters is:

\[\frac{\partial L}{\partial \theta} = -\lambda^T \frac{\partial F}{\partial \theta}\]

where the adjoint variable \(\lambda\) satisfies:

\[J_F^T \lambda = \frac{\partial L}{\partial u}\]

This is memory-efficient: O(1) instead of O(iterations) graph nodes.

Example: Nonlinear diffusion \(Au + u^2 = f\)

import torch
from torch_sla import SparseTensor

# Create stiffness matrix
A = SparseTensor(val, row, col, (n, n))

# Define nonlinear residual: F(u) = Au + u² - f
def residual(u, A, f):
    return A @ u + u**2 - f

# Parameters with gradients
f = torch.randn(n, requires_grad=True)
u0 = torch.zeros(n)

# Solve with Newton-Raphson
u = A.nonlinear_solve(residual, u0, f, method='newton')

# Gradients via adjoint method (memory efficient)
loss = u.sum()
loss.backward()
print(f.grad)  # ∂L/∂f
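The Newton-Raphson update above can be sketched with dense PyTorch as a standalone stand-in for `A.nonlinear_solve` (the matrix size and forcing term here are illustrative):

```python
import torch

# Dense Newton-Raphson sketch for the residual F(u) = A u + u**2 - f,
# whose Jacobian is J(u) = A + 2 diag(u)
torch.manual_seed(0)
n = 5
A = (4 * torch.eye(n, dtype=torch.float64)
     - torch.diag(torch.ones(n - 1, dtype=torch.float64), 1)
     - torch.diag(torch.ones(n - 1, dtype=torch.float64), -1))
f = 0.5 * torch.randn(n, dtype=torch.float64)

u = torch.zeros(n, dtype=torch.float64)
for _ in range(20):
    F = A @ u + u**2 - f
    if F.norm() < 1e-12:
        break
    J = A + 2 * torch.diag(u)         # Jacobian of F at u
    u = u - torch.linalg.solve(J, F)  # Newton update

assert (A @ u + u**2 - f).norm() < 1e-10
```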

Methods:

  Method     Update Rule                            Convergence        Best For
  --------   ------------------------------------   ----------------   ------------------
  newton     \(u_{k+1} = u_k - J^{-1} F(u_k)\)      Quadratic (fast)   General nonlinear
  picard     \(u_{k+1} = G(u_k)\) (fixed-point)     Linear (slow)      Mildly nonlinear
  anderson   Accelerated fixed-point with history   Superlinear        Memory-constrained


Determinant with Gradient Support

Compute determinants of sparse matrices with automatic differentiation.

Determinant Definition:

For a square matrix \(A \in \mathbb{R}^{n \times n}\), the determinant is a scalar value that encodes important matrix properties:

\[\det(A) = \sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}\]

Properties:

  • \(\det(AB) = \det(A) \det(B)\)

  • \(\det(A^T) = \det(A)\)

  • \(\det(A^{-1}) = 1/\det(A)\)

  • Matrix is singular ⟺ \(\det(A) = 0\)

Gradient Formula (Jacobi’s Formula):

For a differentiable loss \(L(\det(A))\):

\[\frac{\partial \det(A)}{\partial A_{ij}} = \det(A) \cdot (A^{-1})_{ji}\]

This is computed efficiently using the adjoint method with \(O(1)\) graph nodes.

Implementation:

  • CPU: LU decomposition via SciPy SuperLU

  • CUDA: Dense conversion + torch.linalg.det

  • Gradient: Adjoint method (solve \(A \mathbf{x} = \mathbf{e}_i\) for needed columns of \(A^{-1}\))

Example 1: Basic Determinant

import torch
from torch_sla import SparseTensor

# 3x3 tridiagonal matrix from dense
dense = torch.tensor([[4.0, -1.0,  0.0],
                      [-1.0, 4.0, -1.0],
                      [ 0.0, -1.0, 4.0]], dtype=torch.float64)

A = SparseTensor.from_dense(dense)
det = A.det()  # 56.0

Example 2: Gradient Computation

# Matrix with gradient tracking
dense = torch.tensor([[2.0, 1.0],
                      [1.0, 3.0]], dtype=torch.float64, requires_grad=True)

A = SparseTensor.from_dense(dense)
det = A.det()  # 5.0

# Compute gradient
det.backward()
print(dense.grad)  # [[3.0, -1.0], [-1.0, 2.0]]

Example 3: CUDA Support

# Move to CUDA
A_cuda = A.cuda()
det_cuda = A_cuda.det()  # Automatically uses CUDA backend

Example 4: Batched Determinants

# Multiple matrices with same structure
val_batch = torch.tensor([
    [2.0, 0.0, 0.0, 3.0],  # det = 6
    [1.0, 0.5, 0.5, 1.0],  # det = 0.75
], dtype=torch.float64)

A_batch = SparseTensor(val_batch, row, col, (2, 2, 2))
det_batch = A_batch.det()  # [6.0, 0.75]

Example 5: Optimization with Determinant Constraint

# Optimize matrix to achieve target determinant
val = torch.tensor([1.0, 0.5, 0.5, 1.0], requires_grad=True)
target_det = torch.tensor(2.0)
optimizer = torch.optim.Adam([val], lr=0.1)

for _ in range(50):
    optimizer.zero_grad()
    A = SparseTensor(val, row, col, (2, 2))
    loss = (A.det() - target_det) ** 2
    loss.backward()
    optimizer.step()

Example 6: Distributed Matrices

from torch_sla import DSparseTensor

# Create distributed sparse tensor
D = DSparseTensor(val, row, col, (n, n), num_partitions=4)

# Compute determinant (gathers all partitions)
det = D.det()  # Warning: requires data gather

Numerical Considerations:

  • Determinants can overflow/underflow for large matrices

  • For numerical stability, consider using log-determinant

  • Singular matrices (det ≈ 0) may cause LU decomposition to fail

  • Use torch.float64 for better numerical precision
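On the log-determinant point: dense PyTorch provides torch.linalg.slogdet, which returns the sign and log magnitude separately and sidesteps the underflow that plain det hits. A minimal sketch:

```python
import torch

# det(0.01 * I) for a 500x500 matrix is 1e-1000, which underflows
# float64, but slogdet reports sign and log|det| stably
A = 0.01 * torch.eye(500, dtype=torch.float64)

print(torch.linalg.det(A))           # 0.0 due to underflow
sign, logabsdet = torch.linalg.slogdet(A)
print(sign, logabsdet)               # 1.0, 500 * log(0.01) ≈ -2302.6
```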

Performance:

  Matrix Size   CPU (Sparse)   CUDA (Dense)   CPU-for-CUDA   Notes
  -----------   ------------   ------------   ------------   ------------------------------------
  10×10         0.3 ms         1.0 ms         0.5 ms         CUDA 3× slower (dense overhead)
  100×100       0.3 ms         0.3 ms         0.5 ms         Similar performance
  1000×1000     0.7 ms         2.5 ms         1.2 ms         CUDA 3.6× slower (O(n³) vs O(n^1.5))

⚠️ Important Performance Note:

CUDA is slower than CPU for sparse determinants! This is because:

  • CPU uses sparse LU decomposition: O(nnz^1.5) time, O(nnz) memory

  • CUDA requires dense conversion: O(n³) time, O(n²) memory

  • cuSOLVER/cuDSS don’t expose sparse determinant computation

Recommendation: For CUDA tensors, use .cpu().det() instead of .det()


Eigenvalue Decomposition

Compute eigenvalues and eigenvectors of sparse matrices.

Eigenvalue Problem:

For a matrix \(A \in \mathbb{R}^{n \times n}\), find eigenvalues \(\lambda_i\) and eigenvectors \(v_i\) such that:

\[A v_i = \lambda_i v_i\]

Symmetric Case (eigsh):

For symmetric matrices \(A = A^T\), eigenvalues are real and eigenvectors are orthonormal:

\[A = V \Lambda V^T, \quad V^T V = I\]

where \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)\).

Algorithms:

  • ARPACK/LOBPCG: Iterative methods for sparse matrices, compute top-k eigenvalues

  • Shift-invert: For interior eigenvalues

Gradient Formula:

For a simple eigenvalue \(\lambda_i\) with eigenvector \(v_i\):

\[\frac{\partial \lambda_i}{\partial A_{jk}} = v_i[j] \cdot v_i[k]\]

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, (n, n))

# Largest eigenvalues (ARPACK/LOBPCG)
eigenvalues, eigenvectors = A.eigsh(k=6, which='LM')

# Smallest eigenvalues
eigenvalues, eigenvectors = A.eigsh(k=6, which='SM')

# For non-symmetric matrices
eigenvalues, eigenvectors = A.eigs(k=6)

Example Output:

_images/eigenvalue_spectrum.png

Eigenvalue spectrum of 1D Laplacian (n=50). Red points show the 6 smallest eigenvalues computed by eigsh().

Gradient support: Eigenvalue decomposition is differentiable!

val = val.requires_grad_(True)
A = SparseTensor(val, row, col, shape)
eigenvalues, _ = A.eigsh(k=3)
loss = eigenvalues.sum()
loss.backward()  # Gradients flow to val
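The per-eigenvalue gradient formula \(\partial \lambda_i / \partial A = v_i v_i^T\) can likewise be checked against dense autograd, as a sketch:

```python
import torch

# Check d(lambda_max)/dA = v v^T for a symmetric matrix (simple eigenvalue)
A = torch.tensor([[4.0, -1.0,  0.0],
                  [-1.0, 4.0, -1.0],
                  [ 0.0, -1.0, 4.0]], dtype=torch.float64, requires_grad=True)

eigvals, eigvecs = torch.linalg.eigh(A)
eigvals[-1].backward()           # differentiate the largest eigenvalue
v = eigvecs[:, -1].detach()      # its eigenvector
assert torch.allclose(A.grad, torch.outer(v, v), atol=1e-8)
```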

SVD (Singular Value Decomposition)

Compute truncated SVD for sparse matrices.

SVD Definition:

For a matrix \(A \in \mathbb{R}^{m \times n}\), the SVD is:

\[A = U \Sigma V^T\]

where:

  • \(U \in \mathbb{R}^{m \times r}\): Left singular vectors (orthonormal columns)

  • \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)\): Singular values (\(\sigma_1 \geq \sigma_2 \geq \ldots \geq 0\))

  • \(V \in \mathbb{R}^{n \times r}\): Right singular vectors (orthonormal columns)

Truncated SVD (rank-k approximation):

\[A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T\]

This is the best rank-k approximation in Frobenius norm (Eckart-Young theorem):

\[\|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}\]
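The Eckart-Young identity can be verified numerically with dense PyTorch; a standalone sketch on a random matrix:

```python
import torch

# Verify ||A - A_k||_F equals the root-sum-square of the discarded singular values
torch.manual_seed(0)
A = torch.randn(8, 6, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]  # best rank-k approximation
err = torch.linalg.norm(A - A_k)                # Frobenius norm
assert torch.allclose(err, S[k:].pow(2).sum().sqrt())
```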

Relation to Eigenvalues:

  • \(\sigma_i^2\) are eigenvalues of \(A^T A\) (or \(A A^T\))

  • \(v_i\) are eigenvectors of \(A^T A\)

  • \(u_i\) are eigenvectors of \(A A^T\)

Applications:

  • Dimensionality reduction: PCA via SVD

  • Low-rank approximation: Matrix compression

  • Pseudoinverse: \(A^+ = V \Sigma^{-1} U^T\)

Example Output:

_images/svd_lowrank.png

Left: Singular value spectrum showing rapid decay after true rank. Right: Approximation error decreases as rank increases.

Code:

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, (m, n))

# Compute top-k singular values/vectors
U, S, Vt = A.svd(k=10)

# Low-rank approximation
A_approx = U @ torch.diag(S) @ Vt

# Relative approximation error
error = (A.to_dense() - A_approx).norm() / A.norm('fro')
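The pseudoinverse formula \(A^+ = V \Sigma^{-1} U^T\) listed under Applications can be sketched directly from a dense SVD (full column rank assumed) and checked against torch.linalg.pinv:

```python
import torch

# Pseudoinverse from the SVD: A^+ = V Sigma^{-1} U^T
torch.manual_seed(0)
A = torch.randn(6, 4, dtype=torch.float64)      # full rank almost surely
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

A_pinv = Vh.T @ torch.diag(1.0 / S) @ U.T
assert torch.allclose(A_pinv, torch.linalg.pinv(A))
```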

LU Factorization for Repeated Solves

Cache LU factorization for efficient repeated solves with the same matrix.

LU Decomposition:

For a matrix \(A\), find lower triangular \(L\) and upper triangular \(U\) such that:

\[PA = LU\]

where \(P\) is a permutation matrix (for numerical stability).

Solving with LU:

To solve \(Ax = b\):

  1. Factorize once: \(PA = LU\) — Cost: \(O(n^3)\) or \(O(\text{nnz}^{1.5})\) for sparse

  2. Forward substitution: \(Ly = Pb\) — Cost: \(O(n^2)\) or \(O(\text{nnz})\) for sparse

  3. Back substitution: \(Ux = y\) — Cost: \(O(n^2)\) or \(O(\text{nnz})\) for sparse

Complexity Savings:

For \(k\) solves with same matrix:

  • Without caching: \(O(k \cdot n^{1.5})\) (sparse direct)

  • With LU caching: \(O(n^{1.5} + k \cdot n)\) — up to \(\sqrt{n}\) faster!

Use Case: Time-stepping with fixed stiffness matrix

from torch_sla import SparseTensor

A = SparseTensor(val, row, col, shape)

# Factorize once (expensive)
lu = A.lu()

# Solve multiple RHS efficiently (cheap)
for t in range(100):
    b_t = compute_rhs(t)
    x_t = lu.solve(b_t)  # Fast solve using cached LU
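The same factor-once / solve-many pattern can be sketched with dense PyTorch (torch.linalg.lu_factor / lu_solve), as a standalone stand-in for the cached sparse LU:

```python
import torch

# Factor once, then reuse the factorization across right-hand sides
torch.manual_seed(0)
n = 50
A = torch.randn(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)

LU, pivots = torch.linalg.lu_factor(A)               # expensive: done once

for t in range(5):
    b_t = torch.randn(n, 1, dtype=torch.float64)
    x_t = torch.linalg.lu_solve(LU, pivots, b_t)     # cheap triangular solves
    assert torch.allclose(A @ x_t, b_t, atol=1e-8)
```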

Graph Neural Network Example

Use torch-sla for graph Laplacian operations in GNNs.

Code:

import torch
from torch_sla import SparseTensor

# Create adjacency matrix from edge list
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
edge_weight = torch.ones(4)

A = SparseTensor(edge_weight, edge_index[0], edge_index[1], (3, 3))

# Compute degree vector (row sums of A)
D = A.sum(dim=1)

# Normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
D_inv_sqrt = D.pow(-0.5)
A_norm = A * D_inv_sqrt.unsqueeze(1) * D_inv_sqrt.unsqueeze(0)
L = SparseTensor.eye(3) - A_norm

# Solve a Laplacian system (note: L is singular for a connected graph,
# so in practice solve a shifted or pinned system)
b = torch.randn(3)
x = L.solve(b)
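As a dense cross-check of the normalized Laplacian built above: for the 3-node path graph its eigenvalues lie in [0, 2], with eigenvalue 0 for the connected component.

```python
import torch

# Dense normalized Laplacian of the 3-node path graph (edges 0-1, 1-2)
A = torch.tensor([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]], dtype=torch.float64)
d_inv_sqrt = A.sum(dim=1).pow(-0.5)
L = torch.eye(3, dtype=torch.float64) - d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

eigvals = torch.linalg.eigvalsh(L)
print(eigvals)   # path graph on 3 nodes: eigenvalues 0, 1, 2
```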

Jupyter Notebook Examples

Interactive examples are available in the examples/ directory:

  Notebook                 Description
  ----------------------   -----------------------------------------------------------
  basic_usage.ipynb        Basic solve, property detection, visualization
  batched_solve.ipynb      Batched operations and SparseTensorList
  determinant.py           Determinant computation with gradient support (CPU & CUDA)
  gcn_example.ipynb        Graph neural network with sparse Laplacian
  nonlinear_solve.ipynb    Nonlinear equations with adjoint gradients
  visualization.ipynb      Spy plots and sparsity visualization
  persistence.ipynb        Save/load with safetensors and Matrix Market
  suitesparse_demo.ipynb   Loading matrices from SuiteSparse Collection
  distributed/             Distributed computing examples (matvec, solve, eigsh)