CUDA QP Solver

Robert Dyro

Heterogeneous Batch Quadratic Program Solver in CUDA
Introduction

I implement a projection-based quadratic program (QP) solver in CUDA and on the CPU. The solver is based on the Alternating Direction Method of Multipliers (ADMM) algorithm, is implemented in Julia, and is non-allocating (it does not allocate memory dynamically, which lets it run on the GPU without modifications). The purpose of this project is (1) to show how the simplest QP solver can be implemented (with a runtime competitive with existing QP solvers) and (2) to learn about CUDA program optimization applied to a real problem. The solver is implemented in Julia because the language is higher-level and faster to write than C++, while its runtime is comparable to C++ (often within 0.5x-1.5x). However, the implementation is very low-level so that it works both in CUDA and on the CPU. The solver is a heterogeneous batch solver, meaning that it can solve a batch of QPs even if every single one of the problems has different data, dimensions, and sparsity structure. Thus, I intentionally implement the solver from start to finish within a single routine, without batching individual steps (e.g., matrix factorization, linear system solves, etc.). I use this deliberate generality to differentiate this work from other implementations and to make the CUDA code optimization more challenging. I test the solver on a Model Predictive Control (MPC) problem: an optimal control problem of finding the best controls (trading off tracking and energy expenditure) for a simple dynamical system, to steer it towards a specified goal. I compare the CPU and CUDA performance. I find that an NVIDIA RTX 3090 can solve at most about 4.5x as many problems as a single core of a Ryzen 7 5800X processor.
This is a rather disappointing result, either because (i) heterogeneous QP solvers are not a good fit for the CUDA architecture (due to the sequential nature of the linear system solve operation on which the solver relies repeatedly) or (ii) because my CUDA program optimizations could be improved.
The code is open-sourced here: https://github.com/rdyro/CUDA_QP_Solver

Quadratic Programming (QP) via Alternating Direction Method of Multipliers (ADMM)

A general quadratic program (QP) is defined as

$\begin{array}{rl}\text{minimize} & \frac{1}{2} z^T P z + q^T z \\ \text{subject to} & C z \le b\end{array}$

where $P$ is a symmetric positive definite matrix. Because of the type of QP solver I implement here, I further require that (1) there are both upper and lower bound constraints (which is a generalization, since one can choose positive or negative infinity for the respective bound) and that (2) the matrices $P$ and $C$ are sparse (otherwise a dense solver is the more efficient software solution).

$\begin{array}{rl}\text{minimize} & \frac{1}{2} z^T P z + q^T z \\ \text{subject to} & l \le C z \le u\end{array}$

The Model Predictive Control (MPC) Test Problem

The test problem I consider in this work is Model Predictive Control (MPC) of a dynamical system with linear dynamics

$x^{(k+1)} = A x^{(k)} + B u^{(k)}$

where $x$ is the state vector, $u$ is the control vector, and $A$ and $B$ are arbitrary dense dynamics matrices.

Alternating Direction Method of Multipliers (ADMM) Algorithm for Quadratic Programs

There are currently two popular methods for solving quadratic programs (with inequality constraints):

1. interior-point methods, which fold the inequality constraints into the objective via a barrier function, and
2. projection-based (operator-splitting) methods, which alternate between solving an equality-constrained subproblem and projecting onto the constraint set.
The second method is often easier to implement and has recently been enjoying success with an extremely high-quality implementation in the form of the OSQP solver. In fact, the implementation here is very heavily inspired by the OSQP implementation (and, in the case of the CPU version, it reaches the same competitive computational time as OSQP). The method used by OSQP is a type of projection-based optimization method called the Alternating Direction Method of Multipliers (ADMM). Excellent references include these lecture slides and the OSQP paper.

For a split, optimization-and-projection problem

$\begin{array}{rl}\text{minimize} & f(z) + g(w) \\ \text{subject to} & C z = w\end{array}$

the general ADMM algorithm (page 17 of the lecture slides) is

$\begin{array}{rl} z^{(k+1)} &= \operatorname{argmin}_z \; f(z) + \frac{\rho}{2} \Vert C z - w^{(k)} + \rho^{-1} y^{(k)} \Vert_2^2 \\ w^{(k+1)} &= \operatorname{argmin}_w \; g(w) + \frac{\rho}{2} \Vert C z^{(k+1)} - w + \rho^{-1} y^{(k)} \Vert_2^2 \\ y^{(k+1)} &= y^{(k)} + \rho \left( C z^{(k+1)} - w^{(k+1)} \right) \end{array}$

In our case

$\begin{array}{rl} f(z) &= \frac{1}{2} z^T P z + q^T z \\ g(w) &= \begin{cases} 0 & \text{if } l \le w \le u \\ \infty & \text{otherwise} \end{cases} \end{array}$

Finding the argmin over $g(w)$ is equivalent to projecting $w$ onto the set where the inequalities are satisfied (because $w$ is scaled by the identity in $\frac{\rho}{2}\Vert C z - w + \rho^{-1} y \Vert_2^2$). This is extremely simple, which is why ADMM is so well-suited here.
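To make the projection step concrete, here is a minimal NumPy sketch of the $w$-update (illustrative only; the function name is my own, and the solver itself is written in Julia):

```python
import numpy as np

# The w-update, argmin over l <= w <= u of (rho/2) * ||Cz - w + y/rho||^2,
# reduces to clipping Cz + y/rho into the box [l, u], elementwise.
def project_box(Cz, y, rho, l, u):
    return np.clip(Cz + y / rho, l, u)
```

For example, `project_box(np.array([2.0, -3.0, 0.5]), np.zeros(3), 1.0, -1.0, 1.0)` returns `[1.0, -1.0, 0.5]`.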
The updates then take the form

$\begin{array}{rl} z^{(k+1)} &= \operatorname{argmin}_z \; \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \Vert C z - w^{(k)} + \rho^{-1} y^{(k)} \Vert_2^2 = \left( P + \rho C^T C \right)^{-1} \left( C^T (\rho w^{(k)} - y^{(k)}) - q \right) \\ w^{(k+1)} &= \operatorname{argmin}_{l \le w \le u} \; \frac{\rho}{2} \Vert C z^{(k+1)} - w + \rho^{-1} y^{(k)} \Vert_2^2 = \min\left(u, \max\left(l, C z^{(k+1)} + \rho^{-1} y^{(k)}\right)\right) \\ y^{(k+1)} &= y^{(k)} + \rho \left( C z^{(k+1)} - w^{(k+1)} \right) \end{array}$

However, this formulation is numerically problematic because

1. forming the product $C^T C$ can be much denser than $C$ itself, destroying sparsity, and
2. the matrix $P + \rho C^T C$ can become ill-conditioned for large values of $\rho$.
Instead, OSQP suggests using a split version of the problem: by introducing extra variables, we can eliminate both of these problematic operations.

Split version of the problem

$\begin{array}{rl}\text{minimize} & \frac{1}{2} z^T P z + q^T z \\ \text{subject to} & C z - v = 0 \\ & v = w \\ & l \le w \le u \end{array}$

where I introduce the extra variable $v$. ADMM requires us to split the problem into two stages, which I do as follows:

$\begin{array}{rl} f(z, v) &= \frac{1}{2} z^T P z + q^T z + \begin{cases} 0 & \text{if } C z = v \\ \infty & \text{otherwise} \end{cases} \\ g(w) &= \begin{cases} 0 & \text{if } l \le w \le u \\ \infty & \text{otherwise} \end{cases} \end{array}$
where the ADMM consensus constraint is now $v = w$ (instead of $C z = w$). The updates become

$\begin{array}{rl} z^{(k+1)}, v^{(k+1)} &= \operatorname{argmin}_{C z = v} \; \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \Vert v - w^{(k)} + \rho^{-1} y^{(k)} \Vert_2^2 \\ w^{(k+1)} &= \operatorname{argmin}_{l \le w \le u} \; \frac{\rho}{2} \Vert v^{(k+1)} - w + \rho^{-1} y^{(k)} \Vert_2^2 = \min\left(u, \max\left(l, v^{(k+1)} + \rho^{-1} y^{(k)}\right)\right) \\ y^{(k+1)} &= y^{(k)} + \rho \left( v^{(k+1)} - w^{(k+1)} \right) \end{array}$

where the first minimization problem is not trivial, as it involves two vector variables $z$ and $v$ and a constraint $C z = v$. However, it can be solved by introducing a Lagrange multiplier $\lambda$

$\min_{z, v} L(z, v, \lambda) = \min_{z, v} \frac{1}{2} z^T P z + q^T z + \frac{\rho}{2} \Vert v - w^{(k)} + \rho^{-1} y^{(k)} \Vert_2^2 + \lambda^T (C z - v)$

This produces a set of linear equations (obtained by differentiating the Lagrangian with respect to $z$ and then $v$, and adding another equation for constraint satisfaction).
$\begin{cases} P z + q + C^T \lambda = 0 \\ -\lambda + \rho \left( v - w^{(k)} + \rho^{-1} y^{(k)} \right) = 0 \\ C z - v = 0 \end{cases}$

The trick to solving these equations efficiently is to eliminate $v$ algebraically: solving the second equation for $v = w^{(k)} - \rho^{-1} y^{(k)} + \rho^{-1} \lambda$ and substituting it into the third equation leaves two equations to solve

$\begin{cases} P z + C^T \lambda = -q \\ C z - \rho^{-1} \lambda = w^{(k)} - \rho^{-1} y^{(k)} \end{cases}$

which is the linear system in matrix form

$\left[\begin{array}{cc} P & C^T \\ C & -\rho^{-1} I \end{array}\right] \left[\begin{array}{c} z^{(k+1)} \\ \lambda^{(k+1)} \end{array}\right] = \left[\begin{array}{c} -q \\ w^{(k)} - \rho^{-1} y^{(k)} \end{array}\right] \quad \text{then} \quad v^{(k+1)} = w^{(k)} - \rho^{-1} y^{(k)} + \rho^{-1} \lambda^{(k+1)}$

Finally, the remaining updates are easy

$\begin{array}{rl} w^{(k+1)} &= \operatorname{argmin}_{l \le w \le u} \; \frac{\rho}{2} \Vert v^{(k+1)} - w + \rho^{-1} y^{(k)} \Vert_2^2 = \min\left(u, \max\left(l, v^{(k+1)} + \rho^{-1} y^{(k)}\right)\right) \\ y^{(k+1)} &= y^{(k)} + \rho \left( v^{(k+1)} - w^{(k+1)} \right) \end{array}$

Note that I have not quite reached the linear system form from OSQP; I am missing the $\sigma$ regularization of $z$. In the paper, the authors introduce yet another ADMM variable, but its effect is equivalent to adding a $\sigma$ damping term to the $z$ updates in the ADMM. They likely do so to guarantee that $P + \sigma I$ is positive definite (and thus invertible). This detail is important but would have added unnecessary complication to the explanation here.

The Full QP ADMM Algorithm
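Putting the derivation together, the whole loop fits in a few lines of dense NumPy. This is an illustrative sketch only: the actual solver is sparse, non-allocating Julia, and it factors the KKT matrix with an LDL^T factorization instead of forming an inverse.

```python
import numpy as np

def admm_qp(P, q, C, l, u, rho=1.0, iters=500):
    """Dense ADMM sketch for: minimize 1/2 z'Pz + q'z  s.t.  l <= Cz <= u."""
    n, m = P.shape[0], C.shape[0]
    # KKT matrix from the derivation: [[P, C^T], [C, -I/rho]], built once.
    K = np.block([[P, C.T], [C, -np.eye(m) / rho]])
    K_inv = np.linalg.inv(K)  # stand-in for a one-time LDL^T factorization
    z, w, y = np.zeros(n), np.zeros(m), np.zeros(m)
    for _ in range(iters):
        rhs = np.concatenate([-q, w - y / rho])
        sol = K_inv @ rhs
        z, lam = sol[:n], sol[n:]
        v = w - y / rho + lam / rho      # recover v from lambda
        w = np.clip(v + y / rho, l, u)   # trivial box projection
        y = y + rho * (v - w)            # dual variable update
    return z
```

On a toy two-variable problem with one active upper bound, a few hundred iterations recover the constrained optimum to about three decimal places.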
Implementation of Necessary Tools

Memory Allocation Routine

The first important set of programming tools is the memory allocation routines. While dynamic memory allocation is not possible in a CUDA kernel (at least not easily, especially when you need to use global memory for some intermediate computations), I can emulate it by creating routines that, on the fly, slice a block of memory into a desired array and later reclaim the area of that slice when it is no longer needed. The solver benefits from using shared CUDA memory, which is very fast, as opposed to slow global memory. An RTX 3090 has a lot of global memory, 24 GB, but a limited amount of shared (fast) memory per block, about 48 KB. Not all of the fast memory is necessarily available, and I tend to work with Float32 or Int32 data types, both of which are 4 bytes long. I avoid reinterpreting the type of the memory on the fly because the CUDA.jl compiler makes it a little difficult, so I end up with roughly 4-thousand-element arrays per block for floating-point and integer numbers each. In the code, I refer to those floating-point and integer arrays as fwork and iwork, respectively. Finally, because I only have limited fast memory, I want to be able to tune the program, selecting which "dynamically" allocated arrays should be fast and which slow.
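As a language-agnostic illustration of the slice-and-reclaim idea, here is a toy stack (bump) allocator in Python; `StackArena` is a hypothetical name, and the real kernel manages two such buffers (fast shared memory and slow global memory), as the Julia code below shows:

```python
from array import array

class StackArena:
    """Toy analogue of the solver's fwork buffer: carve slices off a fixed
    block of memory and reclaim them in LIFO (stack) order."""

    def __init__(self, capacity):
        self._buf = memoryview(array("f", [0.0] * capacity))
        self._top = 0  # index of the first free element

    def alloc(self, size):
        # "Allocate" by handing out a view into the buffer and bumping top.
        assert self._top + size <= len(self._buf), "work buffer exhausted"
        view = self._buf[self._top:self._top + size]
        self._top += size
        return view

    def free(self, size):
        # Matching the kernel's free routine: the caller frees the most
        # recently allocated `size` elements.
        assert size <= self._top
        self._top -= size
```

The LIFO discipline is what makes this scheme so cheap: allocation and freeing are a single pointer bump, with no free lists or fragmentation.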
I use the following wrappers:

```julia
"""Equivalent to the inbuilt `length` function, but returns an `Int32` instead of an `Int64`."""
@inline function len32(work::AbstractVector{T})::Int32 where {T}
  return Int32(length(work))
end

mutable struct WorkSF{T,N1,N2}
  slow_whole_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true}
  slow_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true}
  fast_whole_buffer::CuDeviceVector{T,N2}
  fast_buffer::SubArray{T,1,CuDeviceVector{T,N2},Tuple{UnitRange{Int64}},true}
end

"""Make a memory block into a memory block tracking tuple."""
@inline function make_mem_sf(
  #slow_buffer::CuDeviceVector{T,N1},
  slow_buffer::SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true},
  fast_buffer::CuDeviceVector{T,N2},
)::WorkSF{T,N1,N2} where {T,N1,N2}
  return WorkSF{T,N1,N2}(
    slow_buffer,
    view(slow_buffer, 1:length(slow_buffer)),
    fast_buffer,
    view(fast_buffer, 1:length(fast_buffer)),
  )
end

"""Allocate memory from a memory block tracking tuple."""
@inline function alloc_mem_sf!(
  worksf::WorkSF{T,N1,N2},
  size::Integer,
  fast::Integer,
)::Union{
  SubArray{T,1,CuDeviceVector{T,N1},Tuple{UnitRange{Int64}},true},
  SubArray{T,1,CuDeviceVector{T,N2},Tuple{UnitRange{Int64}},true},
} where {T,N1,N2}
  buffer = fast == 1 ? worksf.fast_buffer : worksf.slow_buffer
  @cuassert size <= len32(buffer)
  alloc_mem = view(buffer, 1:size)
  if fast == 1
    worksf.fast_buffer = view(buffer, (size+1):len32(buffer))
  else
    worksf.slow_buffer = view(buffer, (size+1):len32(buffer))
  end
  return alloc_mem
end

"""Free memory from a memory block tracking tuple."""
@inline function free_mem_sf!(
  worksf::WorkSF{T,N1,N2},
  size::Integer,
  fast::Integer,
)::Nothing where {T,N1,N2}
  whole_buffer = fast == 1 ? worksf.fast_whole_buffer : worksf.slow_whole_buffer
  buffer = fast == 1 ? worksf.fast_buffer : worksf.slow_buffer
  @cuassert size <= len32(whole_buffer) - len32(buffer)
  si, ei = len32(whole_buffer) - len32(buffer) - size + 1, len32(whole_buffer)
  if fast == 1
    worksf.fast_buffer = view(whole_buffer, si:ei)
  else
    worksf.slow_buffer = view(whole_buffer, si:ei)
  end
  return
end
```

For people familiar with the Julia programming language and CUDA.jl, I note
that the compiler needs to know the types ahead of time, so I dynamically
define the QP solver function (using Julia's metaprogramming) given the input types.

Vector Utilities

CUDA.jl has fairly poor support for broadcasting operations in Julia, so the solver needs to use a lot of loops. To make the algorithm implementation a little cleaner, I abstracted many of the loop operations on vectors into separate functions. CUDA.jl most likely inlines these functions, so the difference is only code readability.

Sparse Matrix Utilities

The sparse matrix utilities, primarily for building the linear system matrix, required a little more implementation effort. Because memory allocation is an issue, in-place routines, or routines requiring no work memory, were preferred. Some of these, like vertical and horizontal concatenation of sparse matrices, I implemented myself; some, like the sparse matrix transpose, I rewrote from Julia's core (CPU-based) library. The most challenging routine to implement was sparse matrix reordering (to aid in sparser factorization). This requires an argument-sorting routine, but no such routine is available on the GPU in CUDA.jl. I implemented a modified version of mergesort, with a temporary work array of the same size as the sorted array (in addition to the index array).

Linear System Solver

For the linear system solution, the solver works with a symmetric but not positive definite matrix. Since the problem formulation ensures that no eigenvalues are zero (by construction), I can use the LDLT factorization, a close relative of the Cholesky factorization that adds a diagonal matrix $D$ to allow for negative eigenvalues. The factorization takes the form $A = L D L^T$ where $L$ is a lower triangular matrix and $D$ is a diagonal matrix. I manually transpiled the excellent C implementation of this factorization from https://github.com/osqp/qdldl/.
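For intuition, the LDL^T factorization is short when written for dense matrices. The sketch below is a simplified dense version without pivoting, not the sparse qdldl routine used in the solver:

```python
import numpy as np

def ldlt(A):
    """Dense LDL^T sketch (no pivoting): returns L (unit lower triangular)
    and d (the diagonal of D) such that A = L @ diag(d) @ L.T."""
    n = A.shape[0]
    L, d = np.eye(n), np.zeros(n)
    for j in range(n):
        # Diagonal entry: subtract contributions of the columns to the left.
        d[j] = A[j, j] - (L[j, :j] ** 2) @ d[:j]
        for i in range(j + 1, n):
            # Entries below the diagonal in column j.
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ d[:j]) / d[j]
    return L, d
```

Unlike Cholesky, this works for symmetric indefinite matrices (as long as no pivot $d_j$ vanishes): the negative eigenvalues simply show up as negative entries of $D$.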
Approximate Minimum Degree (AMD) Reordering

The main numerical optimization available to us for speeding up the linear system solution (not the factorization) is matrix reordering. Since the algorithm requires the solver to perform the linear system solve every iteration, optimizing this step yields good computational speedups. The objective of matrix reordering is to find a permutation matrix $P$ such that the matrix $P A P^T$, when factored into $\tilde{L} \tilde{D} \tilde{L}^T$, has significantly fewer nonzeros than the direct factorization $L D L^T = A$. The optimal reordering problem is in general NP-hard, but there are a number of heuristics that can compute approximations. The quality of these approximations varies, and so does their computational cost. What matters is not just how much a heuristic reduces the number of nonzeros on average, but how consistent it is and whether it ever produces a reordering that results in more nonzeros than the direct matrix factorization. I followed the excellent approximate minimum degree tutorial by Stephen Ingram (http://sfingram.net/cs517_final.pdf) to develop an intuition for the problem. I considered existing solutions that are widely used and of certifiable quality, like the Approximate Minimum Degree (AMD) algorithm by Amestoy et al. However, I found that transcribing those existing implementations to non-allocating code is extremely challenging, even using automatic tools or ChatGPT-4. Finally, I implemented several AMD versions and chose one based on performance evaluation on the Matrix Market dataset. The routine I chose was a limited-depth (fixed depth of 3) e-node lookup based on the quotient graph technique.

Kernel Optimization Improvements

Summary of optimizations
The three main optimizations that allow the kernel to run faster are (1) using the much faster shared CUDA memory for intermediate calculations, (2) using threads to speed up order-independent loops, and (3) reordering the matrix before factorization to reduce the number of nonzeros in the factorization matrix. Optimization (3) is especially important because solving a linear system using a triangular factorization is a very sequential operation, so the number of nonzeros in the factorization matrix directly determines how many sequential operations the solver has to execute. The factorization, additionally, happens in every iteration of the ADMM loop, of which there can be thousands. In this work, I defer convergence speedups to future work: I do not tune the hyperparameters $\rho$ and $\sigma$ and I do not balance the matrix $A$; these are straightforward to implement but require a lot of additional experimentation. Lastly, I do not tune the CUDA block/thread combination, because of an implementation decision I made: the algorithm uses one block per QP problem (as it relies on exclusive access to shared memory per block). Thus, I only tune the number of threads per block, which is much more straightforward than the alternative of parametrizing the algorithm to use an arbitrary number of blocks as well as threads and then tuning both parameters at the same time.
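The effect of optimization (3) can be demonstrated with a toy greedy minimum-degree elimination on the sparsity graph. This is a much-simplified relative of AMD (no quotient graph, no approximation), purely for illustration:

```python
def eliminate(adj, order):
    """Simulate sparse elimination in the given order and count fill-in edges.
    `adj` maps each vertex to the set of its neighbors (the sparsity graph)."""
    adj = {v: set(n) for v, n in adj.items()}
    fill = 0
    for v in order:
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        nbrs = sorted(u for u in nbrs if u in adj)
        # Eliminating v connects all of its remaining neighbors pairwise.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

def min_degree_order(adj):
    """Greedy minimum-degree ordering: always eliminate the vertex with the
    fewest neighbors, adding fill edges to the graph as we go."""
    adj = {v: set(n) for v, n in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))
        order.append(v)
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        nbrs = sorted(u for u in nbrs if u in adj)
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                adj[a].add(b)  # adding an existing edge is a harmless no-op
                adj[b].add(a)
    return order

# "Arrow" sparsity pattern: one dense row/column (vertex 0) plus a diagonal.
arrow = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
```

On the arrow matrix, eliminating the dense vertex first creates complete fill-in among its neighbors, while the minimum-degree order eliminates the leaves first and creates no fill at all; the same mechanism is what makes the reordered $LDL^T$ factor sparser, and the resulting triangular solves shorter.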
I can also quantify the level of improvement given by employing more threads. The gains saturate fairly quickly.
CUDA Block Saturation
I ran the tests on an NVIDIA RTX 3090 GPU, which has 82 streaming multiprocessors. Interestingly, as I increase the number of blocks (which corresponds to the number of QP problems), the time required jumps at increments of 82. Moreover, the increase is only pronounced at every third multiple, at 246. Using 1 or 2 warps (32 and 64 threads, respectively) does not appear to alter this trend.

Algorithm Setup vs. Iteration Time

Finally, I quantify problem setup time versus iteration time. The algorithm I implemented is fundamentally iteration-based: generally, the more iterations allowed, the higher the quality of the solution. However, building the linear system matrix, approximate matrix reordering, and matrix factorization all take some time. I plot the runtime versus the number of iterations and use a linear fit to distinguish setup time from iteration time.
I observe that a solution with at most 1% solution error for the toy problem requires at least 200 iterations; this gives 17.9 ms for iterations and 9.44 ms for setup. Thus, somewhat significant gains could be achieved by speeding up either one; both are important.

Comparison Against a CPU Solution and the Original OSQP Implementation

Somewhat encouragingly, I find that a CPU implementation of the algorithm, minimally modified from the CUDA version, runs within a single standard deviation of the original OSQP implementation, at around 0.50 ms versus 0.45 ms for OSQP. My implementation is in Julia, while theirs is in C. For non-CUDA computational gains, I mostly rely on Julia's SIMD (single instruction, multiple data) macro to significantly speed up loop operations on a consumer CPU, a Ryzen 7 5800X. For the CUDA comparison, looking at the block saturation plot, the best point at which to compare CUDA performance against a sequential CPU solution is 246 blocks/QPs, since the first two block saturation jumps are a relatively minor hit to performance. Assuming a CPU solution of about 0.5 ms per QP problem, the GPU speedup is about

$\frac{246 \times 0.5\,\text{ms}}{27.25\,\text{ms}} \approx 4.5$

This is not encouraging: a consumer CPU, likely costing a third of the price of an RTX 3090, easily achieves a 5x multithreaded speedup, beating the CUDA implementation. Nevertheless, I note that the value of this project is most likely educational. The heterogeneous QP solver is a very challenging problem to parallelize effectively.

Extensions

There are a number of possible extensions to this work, which include at least:
References
