Robert Dyro

I implement a projection-based quadratic program solver (QP) in CUDA and on the CPU. The solver is based on the Alternating Direction Method of Multipliers (ADMM) algorithm and is implemented in Julia and is non-allocating (does not allocate memory dynamically, which helps it run on the GPU without modifications). The purpose of this project is (1) to show how the simplest QP program solver can be implemented (whose runtime is competitive with existing QP solvers) and (2) to learn about CUDA program optimizations when applied to a real problem. The solver is implemented in Julia because the language is higher-level and faster to write than C++, while its runtime is comparable to C++ (often within 0.5x-1.5x). However, the implementation is very low-level to work with both in CUDA and on the CPU.

The solver is a heterogeneous batch solver, meaning that it can solve a batch of QPs even if every single one of the problems has a different data, dimension, and sparsity structure. Thus, I intentionally implement the solver from start to finish within a single routine without making use of batching individual steps (e.g., matrix factorization, linear system solve, etc.). I use this deliberate generality to differentiate myself from other implementations and make the CUDA code optimization more challenging.

I test the solver on a Model Predictive Control (MPC) problem, an optimal control problem of finding the best (trading off tracking and energy expenditure) controls to a simple dynamical system to steer it towards a specified goal. I compare the CPU and CUDA performance. I find that NVIDIA RTX 3090 can solve at most about 4.5 as many problems as a single core of Ryzen 7 5800X processor. This is a rather disappointing result, either because (i) heterogenous QP solvers are not a good fit for the CUDA architecture (because of sequential nature of the linear system solve-operation on which the solver relies repeatedly) or (ii) because my CUDA program optimizations could be improved.

Quadratic Programming (QP) via Alternating Direction Method of Multipliers (ADMM)

\begin{aligned} \text{minimize}& ~~ \frac{1}{2} z^T P z + q^T z \\ \text{subject to}& ~~ C z \leq b \end{aligned}

Because of the type of a QP solver I implement here, I further require that (1) there are both upper and lower bound constraints (which is a generalization, since one can choose positive or negative infinity for the respective bound) and that (2) the matrices

P

A

are sparse (otherwise a dense solver is a more efficient software solution).

\begin{aligned} \text{minimize}& ~~ \frac{1}{2} z^T P z + q^T z \\ \text{subject to}& ~~ l \leq C z \leq u \\ \end{aligned}

The Model Predictive Control (MPC) Test Problem

The test problem I consider in this work is the Model Predictive Control (MPC) on a dynamical system with linear dynamics.

where

x

is the vector state,

u

is the vector control and

A

and

B

are arbitrary dense dynamical matrices.

Alternating Direction Method of Multipliers (ADMM) Algorithm for Quadratic Programs

There are currently two popular methods to solve quadratic optimization programs (with inequalities):

The second method is often easier to implement and recently has been enjoying success with an extremely high-quality implementation in the form of the OSQP solver. In fact, the implementation here is very heavily inspired by the OSQP implementation (and, in the case of the CPU version, it reaches the same competitive computational time as OSQP).

The method used by OSQP is a type of projection-based optimization method called Alternating Direction Method of Multipliers (ADMM). An excellent references include these lecture slides and the OSQP paper.

\begin{aligned} \text{minimize}& ~~ f(z) + g(w) \\ \text{subject to}& ~~ C z = w ~~~~~~~ \end{aligned}

\begin{aligned} z^{(k+1)} &= \text{argmin} f(z) + \frac{\rho}{2} \| C z - w^{(k)} + \rho^{-1} y^{(k)} \|_2^2 \\ w^{(k+1)} &= \text{argmin} g(w) + \frac{\rho}{2} \| C z^{(k+1)} - w + \rho^{-1} y^{(k)} \|_2^2 \\ y^{(k+1)} &= y^{(k)} + \rho \left( C z^{(k+1)} - w^{(k+1)} \right) \end{aligned}

\begin{aligned} f(z) &= \frac{1}{2} z^T P z + q^T z \\ g(w) &= \begin{cases} 0 ~~~ \text{if} ~ l \leq w \leq u \\ \infty ~~ \text{otherwise} \end{cases} \end{aligned}

Finding the argmin of

g(w)

is equivalent to projecting

w

onto the set where the inequalities are satisfied (because

w

is scaled by identity in

\frac{\rho}{2} \| C z - w + y \|_2^2

). This is extremely simple (which is why ADMM is so well-suited here).

\begin{aligned} z^{(k+1)} &= \text{argmin} \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \| C z - w^{(k)} + \rho^{-1} y^{(k)} \|_2^2 & = \left( P + \rho C^T C\right)^{-1} \left( C^T (w - y) - q\right) \\ w^{(k+1)} &= \text{argmin}_{l \leq w \leq u} \frac{\rho}{2} \| C z^{(k+1)} - w + \rho^{-1} y^{(k)} \|_2^2 &= \text{min}(u, ~ \text{max}(l, ~ C z^{(k+1)} + \rho^{-1} y^{(k)})) \\ y^{(k+1)} & &= y^{(k)} + \rho (C z^{(k+1)} - w^{(k+1)}) \end{aligned}

Instead, OSQP suggests using the split version of the problem, by introducing extra variables, which can eliminate both of these problematic operations.

\begin{aligned} \text{minimize} & ~~ \frac{1}{2} z^T P z + q^T z \\ \text{subject to} & ~~ C z - v = 0 \\ & ~~ v = w \\ & ~~ l \leq w \leq u \end{aligned}

where I introduce the extra variable

v

. ADMM requires us to split the problem into two stages, which I do so:

\begin{aligned} z^{(k+1)}, v^{(k+1)} &= \text{argmin}_{C z = v} ~~ \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \| v - w^{(k+1)} + \rho^{-1} y^{(k)} \|_2^2 & \\ w^{(k+1)} &= \text{argmin}_{l \leq w \leq u} ~~ \frac{\rho}{2} \| v^{(k+1)} - w + \rho^{-1} y^{(k)} \|_2^2 &= \text{min}(u, ~ \text{max}(l, ~ v^{(k+1)} + \rho^{-1} y^{(k)})) \\ y^{(k+1)} & &= y^{(k)} + \rho (v^{(k+1)} - w^{(k+1)}) \end{aligned}

where the first minimization problem is not trivial as it involves two vector variables

z

and

v

and a constraint

C z = v

. However, it can be solved by introducing a Lagrange multiplier

\lambda

\min_{z, v} L(z, v, \lambda) = \min_{z, v} \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \| v - w^{(k+1)} + \rho^{-1} y^{(k)} \|_2^2 + \lambda^T \left( C z - v \right)

this produces a set of linear equations to solve it (by differentiating the Lagrangian with respect to

z

and then

v

, and then adding another equation for constraint satisfaction).

\begin{cases} P z + q + C^T \lambda = 0 \\ -\lambda + \rho (v - w^{(k)} - \rho^{-1} y^{(k)}) = 0 \\ C z - v = 0 \end{cases}

the trick to solving these equations efficiently is to eliminate

v

algebraically by solving the second equation for

v = w^{(k)} - \rho^{-1} y^{(k)} + \rho^{-1} \lambda

and substituting it into the third equation to get two equations to solve

\begin{cases} P z + C^T \lambda = -q \\ C z - \rho^{-1} \lambda = w^{(k)} - \rho^{-1} y^{(k)} \end{cases}

\begin{bmatrix} P & C^T \\ C & -\rho^{-1} I \end{bmatrix} \begin{bmatrix} z^{(k+1)} \\ \lambda^{(k+1)} \end{bmatrix} = \begin{bmatrix} -q \\ w^{(k)} - \rho^{-1} y^{(k)} \end{bmatrix} ~~ \text{then} ~~ v^{(k+1)} = w^{(k)} - \rho^{-1} y^{(k)} + \rho^{-1} \lambda^{(k+1)}

\begin{aligned} w^{(k+1)} &= \text{argmin}_{l \leq w \leq u} ~~ \frac{\rho}{2} \| v^{(k+1)} - w + \rho^{-1} y^{(k)} \|_2^2 &= \text{min}(u, ~ \text{max}(l, ~ v^{(k+1)} + \rho^{-1} y^{(k)})) \\ y^{(k+1)} & &= y^{(k)} + \rho (v^{(k+1)} - w^{(k+1)}) \end{aligned}

Note I have not quite reached the linear system form from OSQP, I am missing $\sigma$ regualization of $z$ . In the paper, the authors introduce yet another ADMM variable, but its effect is equivalent to adding a $\sigma$ damping to the $z$ updates in the ADMM. They likely do so, to guarantee that $P + \sigma I$ is positive definite (and thus invertible). This detail is important but would have added unnecessary complication to the explanation here.

Approximate Minimum Degree (AMD) Reordering

The full QP ADMM Algorithm

Implementation of Necessary Tools

Memory Allocation Routine

The first important set of programming tools is the memory allocation routines. While dynamic memory allocation is not possible in the CUDA kernel (at least not easily, especially when you need to use global memory for some intermediary computations), I can emulate it by creating routines that allow us to, on the fly, slice a block of memory into a desired array, and later reclaim the area of that slice when it is no longer needed.

The solver benefits from using shared CUDA memory, a very fast memory as opposed to slow global memory. An RTX 3090 has a lot of global memory, 24 GB, but a limited amount of shared (fast) memory per block, about 48 KB. Not all of the fast memory is necessarily available and I tend to work with Float32 or Int32 datatypes, both of which are 4 bytes long. Finally, I avoid reinterpreting the type of the memory on the fly because CUDA.jl compiler makes it a little difficult, so I end up with roughly 4 thousand element arrays per block for floating point and integer numbers each. In the code, I refer to those floating-point or integer arrays as fwork and iwork respectively.

Finally, because I only have limited fast memory, I want to be able to tune the program, selecting which "dynamically" allocated arrays should be fast and which slow. I use the following wrappers:

"""Equivalent to the in-built `length` function, but returns an `Int32` instead of an `Int64`."""
@inline function len32(work::AbstractVector{T})::Int32 where {T}
  return Int32(length(work))
end

mutable struct WorkSF{T,N1,N2}
  slow_whole_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true}
  slow_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true}
  fast_whole_buffer::CuDeviceVector{T,N2}
  fast_buffer::SubArray{T,1,CuDeviceVector{T,N2},Tuple{UnitRange{Int64}},true}
end

"""Make a memory block into a memory block tracking tuple."""
@inline function make_mem_sf(
  #slow_buffer::CuDeviceVector{T,N1},
  slow_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true},
  fast_buffer::CuDeviceVector{T,N2},
)::WorkSF{T,N1,N2} where {T,N1,N2}
  return WorkSF{T,N1,N2}(
    slow_buffer,
    view(slow_buffer, 1:length(slow_buffer)),
    fast_buffer,
    view(fast_buffer, 1:length(fast_buffer)),
  )
end

"""Allocate memory from a memory block tracking tuple."""
@inline function alloc_mem_sf!(
  worksf::WorkSF{T,N1,N2},
  size::Integer,
  fast::Integer,
)::Union{
  SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true},
  SubArray{T,1,CuDeviceVector{T,N2},Tuple{UnitRange{Int64}},true},
} where {T,N1,N2}
  buffer = fast == 1 ? worksf.fast_buffer : worksf.slow_buffer
  @cuassert size <= len32(buffer)
  alloc_mem = view(buffer, 1:size)
  if fast == 1
    worksf.fast_buffer = view(buffer, (size+1):len32(buffer))
  else
    worksf.slow_buffer = view(buffer, (size+1):len32(buffer))
  end
  return alloc_mem
end

"""Free memory from a memory block tracking tuple."""
@inline function free_mem_sf!(
  worksf::WorkSF{T,N1,N2},
  size::Integer,
  fast::Integer,
)::Nothing where {T,N1,N2}
  whole_buffer = fast == 1 ? worksf.fast_whole_buffer : worksf.slow_whole_buffer
  buffer = fast == 1 ? worksf.fast_buffer : worksf.slow_buffer
  @cuassert size <= len32(whole_buffer) - len32(buffer)
  si, ei = len32(whole_buffer) - len32(buffer) - size + 1, len32(whole_buffer)
  if fast == 1
    worksf.fast_buffer = view(whole_buffer, si:ei)
  else
    worksf.slow_buffer = view(whole_buffer, si:ei)
  end
  return
end

For people familiar with the Julia programming language and CUDA.jl, I note that the compiler needs to know the types ahead of time, so I dynamically define the QP solver function (using Julia's metaprogramming) given the fast tuning flags at compile time. Furthermore, I need to use a different type signature for the fast and slow CUDA memory: CuDeviceVector{T,N1} and CuDeviceVector{T,N1}. N1 and N2 refer not to the dimension (dimension of 1 is implied by the Vector in the type name), but the placement of the memory on the GPU, globally or in the shared memory.

Vector Utilities

CUDA.jl has fairly poor support for broadcasting operations in Julia, so the solver needs to use a lot of loops. To make the algorithm implementation a little cleaner, I abstracted a lot of loop operations on vectors into separate functions. CUDA.jl most likely inlines these functions, so the difference is only code readability.

Sparse Matrix Utilities

The sparse matrix utilities, primarily for building the linear system matrix, required a little more implementation effort. Because memory allocation is an issue, in-place or no-work-memory-requiring routines were preferred. Some of these, like vertical and horizontal concatenation of sparse matrices I implemented myself, and some, like sparse matrix transpose, I rewrote from Julia's core (CPU-based) library.

Finally, the most implementationally challenging routine was sparse matrix reordering (to aid in sparser factorization). This requires the argument sorting routine, but no such routine is available on the GPU in CUDA.jl. I implemented a modified version of mergesort, with a temporary work array of the same size as the sorted array size (in addition to the index array).

Linear System Solver

For linear system solution, the solver is working with a symmetric, non-positive definite matrix. Since the problem formulation ensures that no eigenvalues are zero (by construction), I can use the LDLT factorization, a close relative of the Cholesky factorization which adds a diagonal D matrix to allow for negative eigenvalues. The factorization takes the form

Approximate Minimum Degree (AMD) Reordering

The main numerical optimization available to us for speedingGkp the linear system solution (not the factorization) is matrix reordering. Since the algorithm requires that the solver perform the linear system solution operation every iteration, then optimization of this step results in good computational speedups.

The objective of matrix reordering is to find a permutation matrix P, such that matrix

P A P^T

, when factored into

\tilde{L} \tilde{D} \tilde{L}^T

has significantly fewer non-zeros than the direct factorization

L D L^T = A

The optimal reordering problem is in general NP-hard, but there are a number of heuristics that can compute approximations. The quality of this approximation varies and so does their computational cost. What matters is not just how much the heuristic reduces the number of non-zeros on average, but how consistent it is and does it ever produce a reordering that results in more non-zeros than the direct matrix factorization.

I followed the excellent approximate minimum ordering tutorial by Stephen Ingram http://sfingram.net/cs517_final.pdf to develop an intuition for the problem. I considered existing proposed solutions that are widely used and of certifiable quality, like Approximate Minimum Degree (AMD) Algorithm by Amestoy, P. et al. However, I found that transcribing those existing implementations to non-allocating code is extremely challenging, even using automatic tools or ChatGPT-4.

Finally, I implemented several AMD versions and chose one based on performance evaluation on the Matrix Market Dataset. The routine I chose was a limited depth (a fixed depth of 3) enode lookup based on the quotient graph technique.

Kernel Optimization Improvements

The three main optimizations that allow the kernel to run faster are (1) using much-faster shared CUDA memory for intermediate calculations, (2) using threads to speed up unordered loops and (3) reordering the matrix before factorization to reduce the number of nonzeros in the factorization matrix. (3) is especially important because solving a linear system using a triangular factorization is a very sequential operation and thus the number of nonzeros in the factorization matrix directly determines how many sequential operations the solver has to execute. The factorization, additionally, happens for every iteration of the ADMM loop, of which there can be thousands.

In this work, I move convergence speedups to future work, so I do not tune the hyperparameters

\rho

and

\sigma

and I do not balance the matrix

A

- these are straightforward to implement, but require a lot of additional experimentation.

Lastly, I do not tune the CUDA block/thread combination, because of the implementation decision I made: the algorithm requires one block per QP problem (as it relies on exclusive access to shared memory per block). Thus, I only tune the number of threads per block, but this is much more straightforward to do than the alternative: (i) parametrizing the algorithm to use an arbitrary number of blocks as well as threads and then tuning both of these parameters at the same time.

I can also quantify the level of improvement given by employing more threads. The gains saturate fairly quickly.

CUDA Block Saturation

I ran the tests on an NVIDIA RTX 3090 GPU which has 82 streaming multiprocessors. Interestingly, as I increase the number of blocks (which corresponds to the number of QP problems), we see the time required jumps at increments of 82. Moreover, the increase is only strong every 3 multiples, at 246. Using 1 or 2 wraps (32 and 64 threads respectively) does not appear to alter this trend.

Algorithm Setup vs Iteration Time

Finally, I quantify problem setup time versus iteration time. The algorithm I implemented is fundamentally an iteration-based algorithm, generally, the more the iteration number allowed, the higher the quality of the solution. However, linear system matrix building, approximate matrix reordering, and matrix factorization all take some time. I plot the runtime versus iterations and use a linear fit to distinguish setup from iteration time.

I observe that a solution with a quality of at least 1% solution error for the toy problem requires at least 200 iterations, this gives 17.9 ms for iterations and 9.44 ms for setup. Thus, somewhat significant gains could be achieved by speed either one - both are important.

Comparison against a CPU solution and original OSQP implementation

Somewhat encouragingly, I find that a CPU implementation of the algorithm, minimally modified from the CUDA version runs within a single standard deviation of the original OSQP implementation at around 0.50 ms versus 0.45 ms for OSQP. My implementation is in Julia while theirs is in C. For non-CUDA computational gains, I mostly use SIMD (single instruction multiple data) Julia-provided macro to significantly speed up loop operations on a consumer CPU: Ryzen 7 5800X.

In comparison to CUDA, looking at the block saturation CUDA performance versus a sequential CPU solution is mostly likely found at 246 blocks/QPs as the first two block saturation jumps are a relatively minor hit to performance. Assuming a CPU solution of about 0.5 ms per QP problem, the GPU speedup is about

This is not encouraging as a consumer GPU, costing likely at least 1/3rd of the price of an RTX 3090 easily achieves 5x multi-threaded performance, beating the CUDA implementation. Nevertheless, I note that the value of this project is most likely educational. The heterogenous QP solver is a very challenging problem to effectively parallelize.

CUDA QP Solver

Heterogenous Batch Quadratic Program Solver in CUDA

Introduction

Quadratic Programming (QP) via Alternating Direction Method of Multipliers (ADMM)

The Model Predictive Control (MPC) Test Problem

Alternating Direction Method of Multipliers (ADMM) Algorithm for Quadratic Programs

Approximate Minimum Degree (AMD) Reordering

The full QP ADMM Algorithm

Implementation of Necessary Tools

Memory Allocation Routine

Vector Utilities

Sparse Matrix Utilities

Linear System Solver

Approximate Minimum Degree (AMD) Reordering

Kernel Optimization Improvements

CUDA Block Saturation

Algorithm Setup vs Iteration Time

Comparison against a CPU solution and original OSQP implementation

Extensions

References