LSQR-CUDA

Overview

LSQR-CUDA is written by Lawrence Ayers under the supervision of Stefan Guthe of the GRIS institute at the Technische Universität Darmstadt. It is a CUDA port of the LSQR algorithm of Chris Paige and Michael Saunders.

The goal of this work was to accelerate the computation time of the well-known LSQR algorithm using a CUDA-capable GPGPU.

LSQR was first authored by Chris Paige and Michael Saunders in their publication here.

Requirements

LSQR-CUDA has the following requirements:

Execution

To run the solver, enter the source directory and type the following command into your terminal:

make run

NOTE: Before compiling, please check your GPU's compute capability. It is currently set to sm_50 in the Makefile. For best results, change this to match the compute capability of your GPU hardware.

Inputs

You will then be asked whether you would like automatic test inputs generated for you and what the desired degree of sparsity of A is. The range of auto-generated inputs is set in the start, end, and iteration variables of main.cu, and can be changed manually if desired. While the current setup only builds square matrices, main.cu can easily be modified to generate rectangular matrices.

If you have your own inputs available, you will need to save them as files with .mat (dense and sparse matrices) and .vec (vectors) extensions in the input directory. These must use a whitespace delimiter (" ") and each have an appropriate number of values such that Ax=b can be satisfied.

Inputs should have the following notation:

As an example, a sparse matrix A with 1500 rows, 2000 columns, and a sparsity of 0.75% would have the following input files:

Outputs

The solution, x, will be written to the output directory in a subdirectory corresponding to the time of execution (Year-Month-DayTHourMinute), in the format:

For the above example, the x output file would look like this:

The five implementations created for this work will then run on each set of inputs located in the input directory, with the runtime of each saved to a CSV file called YYYY-MM-DDTHHMM/YYYY-MM-DDTHHMM_LSQR-CUDA.csv.

___

Table of Contents
* [1. Introduction](#Introduction)
* [2. Background](#Background)
* [3. Methods](#Methods)
  * [3.1. Cpp-DENSE](#Cpp-DENSE)
  * [3.2. CUDA-DENSE](#CUDA-DENSE)
  * [3.3. CUDA-SPARSE](#CUDA-SPARSE)
  * [3.4. cuBLAS-DENSE](#cuBLAS-DENSE)
  * [3.5. cuSPARSE-SPARSE](#cuSPARSE-SPARSE)
* [4. Results](#Results)
  * [4.1. Speedup](#Speedup)
  * [4.2. Accuracy](#Accuracy)
* [5. Conclusion](#Conclusion)

1. Introduction

The purpose of this work is to implement the LSQR algorithm on a CUDA-capable GPU to analyze any potential runtime speedups in comparison to standard, sequential CPU implementations. When run in CUDA, many matrix operations (e.g. multiplication, Euclidean norm, addition, subtraction) can be run in parallel, thereby decreasing computation time.

This work has both sequential and parallel implementations of LSQR that are intended for both sparse and dense inputs. The five implementations are listed as follows:

| Implementation | Matrix A | Vector b | Hardware |
| --- | --- | --- | --- |
| Cpp-DENSE | dense | dense | CPU |
| CUDA-DENSE | dense | dense | GPU |
| CUDA-SPARSE | sparse (CSR) | dense | GPU |
| cuBLAS-DENSE | dense | dense | GPU |
| cuSPARSE-SPARSE | sparse (CSR) | dense | GPU |

The implementations with DENSE in their name are those that keep A in a dense format, and, therefore, require more computation time. Those named with SPARSE first compress A into compressed sparse row (CSR) format before running. This saves execution time and memory, but requires different methods to run LSQR.

A sequential, sparse-input algorithm was not explicitly created for this work; instead, the robust scipy-lsqr Python solver is used as the baseline for verifying accuracy and comparing runtimes of the implementations made in this work.

2. Background

The LSQR algorithm is an iterative method used to find the solution x for either of the following problems:
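In the formulation of Paige and Saunders, these are the compatible linear system and the linear least-squares problem (restated here in standard notation):

$$Ax = b \qquad \text{or} \qquad \min_{x} \; \lVert Ax - b \rVert_2$$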

where A is a large, often sparse, square or rectangular matrix, and b is a vector of size A_ROWS.

LSQR was first authored by Chris Paige and Michael Saunders in their publication here, and has since been widely used for various applications.

3. Methods

The LSQR algorithm in this work is largely based on the scipy-lsqr source code as well as the C++ port provided by Luis Ibanez. The general LSQR algorithm is located in lsqr.hpp, and each implementation (listed above) is passed to the function as a combination of different class types (via a template).
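As a rough, hypothetical sketch of this template-dispatch pattern (the stub types and printouts below are illustrative stand-ins, not the repository's actual classes or the real lsqr.hpp signature):

```
// Rough sketch of the template-dispatch pattern described above. The stub types
// and the lsqr() signature here are illustrative only; the real driver lives in
// lsqr.hpp and operates on the VectorCPU/VectorCUDA/MatrixCUDA/... classes.
#include <cstdio>

struct DenseCpuStub  { static const char *where() { return "CPU"; } };
struct DenseCudaStub { static const char *where() { return "GPU"; } };

template <typename Mat, typename Vec>
Vec lsqr(const Mat & /*A*/, const Vec &b) {
  // The real driver runs the Golub-Kahan bidiagonalization loop, calling the
  // multiply/norm/scale/add operations provided by Mat and Vec.
  std::printf("LSQR iteration running on the %s\n", Vec::where());
  return b;
}

int main() {
  DenseCpuStub A_cpu, b_cpu;
  lsqr(A_cpu, b_cpu);   // Cpp-DENSE-style instantiation
  DenseCudaStub A_gpu, b_gpu;
  lsqr(A_gpu, b_gpu);   // CUDA-DENSE-style instantiation
  return 0;
}
```

In the repository, the template arguments are the real matrix/vector classes (e.g. MatrixCUDA together with VectorCUDA), so the same driver code runs on whichever backend those classes implement.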

CPU Implementations

3.1. Cpp-DENSE

The Cpp-DENSE implementation is written in C++ and runs sequentially on the CPU. It uses naive operations for add, subtract, multiply, Dnrm2, etc. It is the slowest of the implementations and is used as a baseline for the dense-input GPU implementations (i.e. CUDA-DENSE and cuBLAS-DENSE). The corresponding source files are vectorCPU.cpp and vectorCPU.hpp.

scipy-lsqr

Scipy’s LSQR solver is used in this work as the baseline for the sparse-input LSQR-CUDA implementations (i.e. CUDA-SPARSE and cuSPARSE-SPARSE). It first compresses A into CSR form and then runs a maximum of 2*A_COLUMNS iterations before returning a solution. Related information can be found on scipy’s website, and its use in this work can be found in lsqr.py.

GPU Implementations

All source files pertaining to the four GPU implementations created in this work can be found in the gpu directory.

For all CUDA kernels designed in this work, the blocksize (i.e. the number of threads in a block) is set to a constant value found in utils.cuh. This value can be changed if desired. For the GPU used in this work (NVIDIA GeForce 940M), a blocksize of 16*16 (256) threads provided the best results.

The kernels used for these implementations are where most of the development time for LSQR-CUDA was spent.

3.2. CUDA-DENSE

The CUDA-DENSE implementation is written with the standard CUDA library, and executes many of its own kernels for various vector operations. This implementation has two dense inputs of type VectorCUDA and runs them through LSQR with accelerated multiplication, addition/subtraction, euclidean norm, and transpose operations. All operations used for this implementation are defined within the VectorCUDA class.

The nvprof output for a test run of this implementation (input 2500_2500_A_0.mat) can be seen here:

nvprof output of CUDA-DENSE:

```
Type            Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  98.00%  66.9277s      9941  6.7325ms  6.6156ms  7.4214ms  multiplyTiled(double*, unsigned int*, unsigned int*, double*, unsigned int*, unsigned int*, double*)
                  0.51%  351.46ms     34792  10.101us  8.8320us  34.528us  scale(double*, double, double*, unsigned int*, unsigned int*, bool)
                  0.33%  225.62ms     19880  11.349us  8.8960us  32.864us  add(double*, double*, unsigned int*, unsigned int*, double*, bool)
                  0.33%  223.22ms     64615  3.4540us  2.7840us  3.6601ms  [CUDA memset]
                  0.29%  197.12ms    159068  1.2390us     864ns  30.217ms  [CUDA memcpy HtoD]
                  0.20%  137.70ms     14912  9.2340us  8.8000us  32.513us  dnrm2(double*, unsigned int, unsigned int, double*, double*)
                  0.14%  96.548ms     34802  2.7740us  1.7600us  23.009us  [CUDA memcpy DtoD]
                  0.13%  87.822ms     14912  5.8890us  5.0230us  26.016us  maxVal(double*, unsigned int, unsigned int, double*)
                  0.06%  37.655ms     29827  1.2620us  1.1200us  25.344us  [CUDA memcpy DtoH]
                  0.01%  7.9884ms         1  7.9884ms  7.9884ms  7.9884ms  transposeTiled(double*, double*, unsigned int*, unsigned int*)
```

Multiplication

From the nvprof output above it is clear that the most time-intensive operations of LSQR are the matrix-vector and vector-vector multiplications. Since CUDA-DENSE works only with dense inputs, this operation is treated the same for both matrix-vector and vector-vector multiplication.

A naive approach to parallel multiplication is to have each thread solve for one entry in the solution matrix, i.e. a thread accesses one row of the first input and one column of the second input from global memory and computes the dot product of these two arrays in a loop. Since the latency of global memory accesses can be quite high, a cached, “tiled” solution, multiplyTiled, is used instead. A multiplyNaive kernel is available for reference.

In the multiplyTiled approach to parallel matrix multiplication, inputs are first loaded into GPU-cached (“shared”) memory “tiles”, which iteratively “sweep” across the inputs, continuously summing the partial dot products into a running total. Each thread works in parallel towards calculating one value of the resultant matrix. An excellent visual representation of this can be found in Penny Xu’s work, Tiled Matrix Multiplication.

In multiplyTiled, the use of cached shared memory reduces the number of global memory accesses required by each thread in comparison to the naive approach. For a dense input matrix A of size 2500x2500, this implementation achieves a speedup of about 1.5x when switching from multiplyNaive to multiplyTiled.
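For reference, a minimal sketch of a tiled multiplication kernel is shown below. It is not the repository's multiplyTiled kernel (whose signature, visible in the nvprof output, also takes device pointers to the matrix dimensions), but it illustrates the tiling scheme described above.

```
// Illustrative tiled matrix multiply (C = A * B) with 16x16 shared-memory tiles.
#define TILE 16

__global__ void multiplyTiledSketch(const double *A, const double *B, double *C,
                                    unsigned rowsA, unsigned colsA, unsigned colsB) {
  __shared__ double tileA[TILE][TILE];
  __shared__ double tileB[TILE][TILE];

  unsigned row = blockIdx.y * TILE + threadIdx.y;
  unsigned col = blockIdx.x * TILE + threadIdx.x;
  double sum = 0.0;

  // Sweep the tiles across the shared dimension of A and B.
  for (unsigned t = 0; t < (colsA + TILE - 1) / TILE; ++t) {
    unsigned aCol = t * TILE + threadIdx.x;
    unsigned bRow = t * TILE + threadIdx.y;
    tileA[threadIdx.y][threadIdx.x] = (row < rowsA && aCol < colsA) ? A[row * colsA + aCol] : 0.0;
    tileB[threadIdx.y][threadIdx.x] = (bRow < colsA && col < colsB) ? B[bRow * colsB + col] : 0.0;
    __syncthreads();

    // Accumulate the partial dot product held in the two tiles.
    for (unsigned k = 0; k < TILE; ++k)
      sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();
  }

  if (row < rowsA && col < colsB)
    C[row * colsB + col] = sum;
}
```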

Scale, Addition, and Subtraction

Due to their already low computation time within the LSQR algorithm, the scale, add, and subtract operations use naive approaches. No further development for these operations was deemed necessary.
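For illustration, naive elementwise kernels of this kind can be as simple as the following sketch (the repository's actual scale/add kernels, per the nvprof output, take additional dimension pointers and a boolean flag):

```
// Minimal sketches of naive elementwise kernels; one thread per element.
__global__ void addSketch(const double *a, const double *b, double *out,
                          unsigned n, bool subtract) {
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = subtract ? a[i] - b[i] : a[i] + b[i];
}

__global__ void scaleSketch(const double *a, double alpha, double *out, unsigned n) {
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = alpha * a[i];
}
```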

Euclidean Norm

The Euclidean norm, or Dnrm2 operation, is split into two kernels. The first, maxVal, finds the maximum value within the matrix or vector, and the second, dnrm2, divides all values by this maximum whilst performing the necessary multiplication and addition operations, e.g. (a[0]/maxVal) ** 2 + (a[1]/maxVal) ** 2 + .... The norm is then found by taking the square root of the dnrm2 result and multiplying it by the maximum value found in the previous kernel. This is the same method used by Ibanez in LSQR-cpp and ensures numerical stability.

Standard parallel reduction techniques are used for both of these kernels, whereby the number of working threads in a block is halved in each iteration and memory accesses are coalesced. Much of the development here is inspired by Mark Harris’ webinar, “Optimizing Parallel Reduction in CUDA”, and the topics covered in the PMPP course at TUD. Both of these kernels also utilize cached shared memory to decrease memory access latency.
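A minimal sketch of such a reduction, in the spirit of the maxVal kernel (not its exact code), is shown below; the per-block results would then be reduced again, and dnrm2 applies the same pattern to the scaled sum of squares:

```
// Block-level max-magnitude reduction. Launch with gridDim.x = ceil(n / (2 * blockDim.x)),
// shared memory size = blockDim.x * sizeof(double); blockDim.x must be a power of two.
__global__ void maxValSketch(const double *in, unsigned n, double *blockMax) {
  extern __shared__ double sdata[];
  unsigned tid = threadIdx.x;
  unsigned i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

  // Each thread loads up to two elements (coalesced) and takes their magnitudes.
  double m = 0.0;
  if (i < n)              m = fabs(in[i]);
  if (i + blockDim.x < n) m = fmax(m, fabs(in[i + blockDim.x]));
  sdata[tid] = m;
  __syncthreads();

  // Tree reduction: the number of active threads is halved each iteration.
  for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] = fmax(sdata[tid], sdata[tid + s]);
    __syncthreads();
  }

  if (tid == 0) blockMax[blockIdx.x] = sdata[0];
}
```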

Matrix Transpose

Like the multiplication operation, the matrix transpose operation, transposeTiled, utilizes a “tiled” approach, where a cached “tile” is swept across the matrix, transposing it section by section. While the multiplyTiled kernel requires two separate tiles (one for each input), transposeTiled requires only one, which temporarily stores a section of the matrix before writing it back to global memory with swapped indices, e.g. output[3][2] = input[2][3]. This method is outlined in NVIDIA’s blog post, “An Efficient Matrix Transpose in CUDA C/C++”, authored by Mark Harris.
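A minimal sketch of a shared-memory, tiled transpose in the style of Harris' post (not the repository's exact transposeTiled kernel) looks roughly as follows:

```
// Illustrative tiled transpose of a rows x cols matrix into a cols x rows matrix.
#define TILE_DIM 16

__global__ void transposeTiledSketch(const double *in, double *out,
                                     unsigned rows, unsigned cols) {
  __shared__ double tile[TILE_DIM][TILE_DIM + 1]; // +1 padding avoids bank conflicts

  // Read a tile of the input with coalesced accesses.
  unsigned x = blockIdx.x * TILE_DIM + threadIdx.x;
  unsigned y = blockIdx.y * TILE_DIM + threadIdx.y;
  if (x < cols && y < rows)
    tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
  __syncthreads();

  // Write the tile back with swapped block indices, e.g. out[3][2] = in[2][3].
  x = blockIdx.y * TILE_DIM + threadIdx.x; // transposed block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  if (x < rows && y < cols)
    out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```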

3.3. CUDA-SPARSE

The CUDA-SPARSE implementation is written with the standard CUDA library, and has inputs of type MatrixCUDA (matrix A) and VectorCUDA (vector b). When A is loaded into MatrixCUDA, it is converted into compressed sparse row (CSR) form, reducing its size and, depending on its sparsity, the required memory and computational effort.

All operations used here are the same as the CUDA-DENSE implementation besides the matrix-vector multiplication and matrix transpose operations. These sparse matrix operations can all be found within the MatrixCUDA source code.

nvprof output of CUDA-SPARSE:

```
Type            Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  97.49%  56.2308s      9947  5.6530ms  5.6171ms  5.6886ms  spmvCSRVector(unsigned int*, int*, int*, double*, double*, double*)
                  0.60%  346.90ms     34813  9.9640us  8.8630us  31.328us  scale(double*, double, double*, unsigned int*, unsigned int*, bool)
                  0.39%  225.51ms     19892  11.336us  8.8000us  34.208us  add(double*, double*, unsigned int*, unsigned int*, double*, bool)
                  0.39%  225.08ms     64657  3.4810us  2.6880us  3.6540ms  [CUDA memset]
                  0.37%  211.50ms    159166  1.3280us     863ns  30.498ms  [CUDA memcpy HtoD]
                  0.24%  137.06ms     14921  9.1850us  8.7670us  30.495us  dnrm2(double*, unsigned int, unsigned int, double*, double*)
                  0.17%  95.903ms     34823  2.7540us  2.1440us  23.008us  [CUDA memcpy DtoD]
                  0.15%  87.760ms     14921  5.8810us  5.0230us  28.544us  maxVal(double*, unsigned int, unsigned int, double*)
                  0.07%  37.599ms     29846  1.2590us  1.1200us  14.752us  [CUDA memcpy DtoH]
                  0.04%  22.277ms         3  7.4258ms  7.3831ms  7.4588ms  void cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int, int>(cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int, int>*, bool=1 const *, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int, int>**, bool=0*, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int, int>**, int, int, cub::GridEvenShare<cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int, int>**>)
                  0.03%  18.866ms         1  18.866ms  18.866ms  18.866ms  void cusparse::gather_kernel<unsigned int=128, double, double>(double const *, int const *, int, double*)
                  0.03%  14.797ms         2  7.3985ms  7.3967ms  7.4004ms  void cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int, int>(cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int, int>*, bool=0 const *, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int, int>**, bool=0*, cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int, int>**, int, int, cub::GridEvenShare<cub::DeviceRadixSortDownsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int, int>**>)
                  0.02%  9.9703ms         1  9.9703ms  9.9703ms  9.9703ms  void cusparse::gather_kernel<unsigned int=128, int, int>(int const *, int const *, int, int*)
                  0.01%  5.6034ms         3  1.8678ms  1.8501ms  1.8894ms  void cub::DeviceRadixSortUpsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=1, bool=0, int, int>(cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, bool=1*, cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, int, int, cub::GridEvenShare<cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *>)
                  0.01%  3.7753ms         2  1.8876ms  1.8756ms  1.8996ms  void cub::DeviceRadixSortUpsweepKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, bool=0, bool=0, int, int>(cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, bool=0*, cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *, int, int, cub::GridEvenShare<cub::DeviceRadixSortPolicy<int, int, int>::Policy700 const *>)
                  0.00%  2.3064ms         1  2.3064ms  2.3064ms  2.3064ms  void cub::DeviceReduceByKeyKernel<cub::DispatchReduceByKey<int*, int*, cub::ConstantInputIterator<int, int>, int*, int*, cub::Equality, cub::Sum, int>::PtxReduceByKeyPolicy, int*, int*, cub::ConstantInputIterator<int, int>, int*, int*, cub::ReduceByKeyScanTileState<int, int, bool=1>, cub::Equality, cub::Sum, int>(int*, int, int, cub::ConstantInputIterator<int, int>, int*, int*, int, cub::Equality, cub::Sum, int)
                  0.00%  2.1231ms         1  2.1231ms  2.1231ms  2.1231ms  void cusparse::sequence_kernel<unsigned int=128, int>(int*, int, cusparse::sequence_kernel<unsigned int=128, int>)
                  0.00%  1.8211ms         1  1.8211ms  1.8211ms  1.8211ms  void cusparse::csr2csr_kernel_v1<unsigned int=128, unsigned int=896, int=0, int>(int const *, int const *, int, int const **)
                  0.00%  54.239us         5  10.847us  10.176us  11.456us  void cub::RadixSortScanBinsKernel<cub::DeviceRadixSortPolicy<int, int, int>::Policy700, int>(int*, int)
                  0.00%  13.664us         1  13.664us  13.664us  13.664us  void cusparse::merge_path_partition_kernel<unsigned int=128, unsigned int=896, int=0, int>(int const *, int, cusparse::merge_path_partition_kernel<unsigned int=128, unsigned int=896, int=0, int>, int, int*, int)
                  0.00%  10.880us         1  10.880us  10.880us  10.880us  void cub::DeviceScanKernel<cub::DispatchScan<int*, int*, cub::Sum, int, int>::PtxAgentScanPolicy, int*, int*, cub::ScanTileState<int, bool=1>, cub::Sum, int, int>(int*, cub::Sum, int, int, int, cub::DispatchScan<int*, int*, cub::Sum, int, int>::PtxAgentScanPolicy, int*)
                  0.00%  5.3120us         1  5.3120us  5.3120us  5.3120us  void cusparse::scatter_kernel<unsigned int=128, int=0, int, int>(int const *, int const *, int, int*)
                  0.00%  4.9280us         1  4.9280us  4.9280us  4.9280us  void cub::DeviceCompactInitKernel<cub::ReduceByKeyScanTileState<int, int, bool=1>, int*>(int, int, int)
                  0.00%  3.1360us         1  3.1360us  3.1360us  3.1360us  void cub::DeviceScanInitKernel<cub::ScanTileState<int, bool=1>>(int, int)
```

SpMV

The nvprof output above again shows that the most expensive operation in LSQR is multiplication. In this implementation, CUDA-SPARSE, it is the sparse matrix-vector multiplication operation, or SpMV, which computes the product of a sparse matrix in compressed format (CSR) and a vector in dense format. The result is a vector of size A_ROWS.

Much of the work done for the SpMV operation is based on Georgi Evushenko's Medium article, Sparse Matrix-Vector Multiplication with CUDA, as well as Pooja Hiranandani's report, Sparse Matrix Vector Multiplication on GPUs: Implementation and analysis of five algorithms, and the corresponding GitHub repository, spmv-cuda.

Three different SpMV kernels are available for this implementation: spmvNaive, spmvCSRVector, and spmvCSRVectorShared.

While the first, spmvNaive, uses one thread per row of matrix A to solve for a value in the solution vector, spmvCSRVector and spmvCSRVectorShared instead use one warp (i.e. 32 threads) per row of matrix A. This allows for better utilization of resources, since the naive approach can create bottlenecks if it encounters a row that is significantly denser than others.

The biggest difference between spmvCSRVector and spmvCSRVectorShared is their use of shared memory: spmvCSRVectorShared uses shared, cached memory, while spmvCSRVector does not. No significant speedup was found between these kernels when used in LSQR. spmvCSRVector is the default in this implementation, but this can easily be switched via the kern variable in matrixCUDA.cu.
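For reference, a warp-per-row CSR SpMV kernel of the kind described above can be sketched as follows (this is an illustrative version, not the exact spmvCSRVector kernel from matrixCUDA.cu):

```
// Illustrative CSR "vector" SpMV: one 32-thread warp per matrix row.
// Launch with at least rows * 32 threads in total.
__global__ void spmvCSRVectorSketch(unsigned rows, const int *rowPtr,
                                    const int *colIdx, const double *vals,
                                    const double *x, double *y) {
  unsigned warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  unsigned lane   = threadIdx.x % 32;
  if (warpId >= rows) return;

  // Each lane handles a strided portion of the row's nonzeros.
  double sum = 0.0;
  for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
    sum += vals[j] * x[colIdx[j]];

  // Warp-level reduction of the 32 partial sums.
  for (int offset = 16; offset > 0; offset >>= 1)
    sum += __shfl_down_sync(0xffffffff, sum, offset);

  if (lane == 0) y[warpId] = sum;
}
```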

The run time of LSQR when using each kernel can be seen in the table below (inputs 2500_2500_A_0.mat and 2500_1_b.vec):

| kernel used | calculation time (s) |
| --- | --- |
| spmvNaive | 198.099 |
| spmvCSRVector | 61.102 |
| spmvCSRVectorShared | 60.198 |

cuSPARSE Transpose

The transpose of a CSR matrix is its compressed sparse column (CSC) counterpart. A kernel for this operation was not explicitly developed for this work, as it is difficult to design or find a good parallel implementation for it.

Additionally, as can be seen from the nvprof output, the transpose operation is called only once within the entire LSQR algorithm. It was therefore not treated as a high priority, as it has little impact on the overall speedup.

Instead, the existing cusparseCsr2cscEx2 function from the cuSPARSE library is used. This can be found in matrixCUDA.cu. More information regarding the cuSPARSE library can be found in the CUDA toolkit documentation.
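The typical usage pattern for cusparseCsr2cscEx2 (a buffer-size query followed by the conversion) is sketched below; the exact call in matrixCUDA.cu may differ, and error checking is omitted:

```
#include <cuda_runtime.h>
#include <cusparse.h>

// Convert a rows x cols CSR matrix (nnz nonzeros) to CSC, which is equivalent to
// its transpose stored in CSR. All device arrays are assumed to be pre-allocated.
void csrTransposeSketch(cusparseHandle_t handle, int rows, int cols, int nnz,
                        const double *d_csrVal, const int *d_csrRowPtr, const int *d_csrColInd,
                        double *d_cscVal, int *d_cscColPtr, int *d_cscRowInd) {
  size_t bufSize = 0;
  cusparseCsr2cscEx2_bufferSize(handle, rows, cols, nnz,
                                d_csrVal, d_csrRowPtr, d_csrColInd,
                                d_cscVal, d_cscColPtr, d_cscRowInd,
                                CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                                CUSPARSE_INDEX_BASE_ZERO,
                                CUSPARSE_CSR2CSC_ALG1, &bufSize);

  void *d_buffer = nullptr;
  cudaMalloc(&d_buffer, bufSize);

  cusparseCsr2cscEx2(handle, rows, cols, nnz,
                     d_csrVal, d_csrRowPtr, d_csrColInd,
                     d_cscVal, d_cscColPtr, d_cscRowInd,
                     CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                     CUSPARSE_INDEX_BASE_ZERO,
                     CUSPARSE_CSR2CSC_ALG1, d_buffer);

  cudaFree(d_buffer);
}
```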

3.4. cuBLAS-DENSE

The cuBLAS-DENSE implementation is written using both the CUDA and cuBLAS libraries. cuBLAS is a library from NVIDIA that provides “basic linear algebra” operations on the GPU. For this implementation, two dense inputs of type VectorCUBLAS are used.

Information regarding cuBLAS and how to use it is documented extensively in the CUDA toolkit documentation, and will, therefore, not be further discussed here.

To see how these cuBLAS operations are used in this implementation, please refer to the VectorCUBLAS source files.
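As a rough illustration (not the repository's actual code), the kinds of cuBLAS calls such a dense LSQR step relies on look like this; variable names are placeholders and error checking is omitted:

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch of typical cuBLAS calls used by a dense LSQR step. Note that cuBLAS
// assumes column-major storage, so row-major code must account for the layout
// (e.g. via the transpose operation flag).
void cublasLsqrStepSketch(cublasHandle_t handle, int m, int n,
                          const double *dA, const double *dx, double *dy) {
  const double alpha = 1.0, beta = 0.0;

  // y = A * x  (dense matrix-vector product; A is m x n, column-major, lda = m)
  cublasDgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);

  // nrm = ||y||_2  (the cuBLAS counterpart of the Dnrm2 operation)
  double nrm = 0.0;
  cublasDnrm2(handle, m, dy, 1, &nrm);

  // y = y / nrm  (scaling, as done when normalizing the bidiagonalization vectors)
  if (nrm != 0.0) {
    const double inv = 1.0 / nrm;
    cublasDscal(handle, m, &inv, dy, 1);
  }
}
```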

3.5. cuSPARSE-SPARSE

The cuSPARSE-SPARSE implementation is written using the CUDA, cuBLAS, and cuSPARSE libraries. cuSPARSE is a library from NVIDIA that provides “a set of basic linear algebra subroutines used for handling sparse matrices”. For this implementation, one sparse input of type MatrixCUSPARSE (matrix A) and one dense input of type VectorCUBLAS (vector b) are used.

This implementation uses all the same operations as the cuBLAS-DENSE implementation, besides the SpMV and matrix transpose operations, which are both executed using the cuSPARSE library. Information regarding cuSPARSE and how to use it is documented extensively in the CUDA toolkit documentation, and will, therefore, not be further discussed here.

To see how these cuSPARSE operations are used in this implementation, please refer to the MatrixCUSPARSE source files.
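As a rough illustration of the generic cuSPARSE SpMV path (again, not the repository's actual code), the descriptor-based workflow looks like this; names are placeholders and error checking is omitted. On older toolkits the algorithm enum is CUSPARSE_MV_ALG_DEFAULT rather than CUSPARSE_SPMV_ALG_DEFAULT.

```
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch of y = A * x for a CSR matrix via the generic cuSPARSE API:
// wrap the raw arrays in descriptors, query the workspace, then multiply.
void cusparseSpmvSketch(cusparseHandle_t handle, int rows, int cols, int nnz,
                        int *d_rowPtr, int *d_colInd, double *d_vals,
                        double *d_x, double *d_y) {
  cusparseSpMatDescr_t matA;
  cusparseDnVecDescr_t vecX, vecY;
  cusparseCreateCsr(&matA, rows, cols, nnz, d_rowPtr, d_colInd, d_vals,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
  cusparseCreateDnVec(&vecX, cols, d_x, CUDA_R_64F);
  cusparseCreateDnVec(&vecY, rows, d_y, CUDA_R_64F);

  double alpha = 1.0, beta = 0.0;
  size_t bufSize = 0;
  cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                          vecX, &beta, vecY, CUDA_R_64F,
                          CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
  void *d_buf = nullptr;
  cudaMalloc(&d_buf, bufSize);

  cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
               vecX, &beta, vecY, CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT, d_buf);

  cudaFree(d_buf);
  cusparseDestroySpMat(matA);
  cusparseDestroyDnVec(vecX);
  cusparseDestroyDnVec(vecY);
}
```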

4. Results

To test LSQR-CUDA, randomly generated, square A-matrices of sparsity 0.8 are available, ranging from size 1000x1000 to 8000x8000, as well as their corresponding dense, b-vectors (sizes 1000 to 8000). These inputs can be found in the inputs.zip files in results (split into two zip files due to their size). Their corresponding x-vector solutions for each implementation can be found in outputs.zip. The runtime of each implementation of LSQR-CUDA is saved to a CSV file, RUNTIMES.csv, such that they can be compared to the runtime of the baseline, scipy-lsqr solver.

The accuracy of these implementations is also measured via root mean squared error values, which are calculated against the results of the scipy-lsqr solver. RMSE values can be found in RMSE.csv.
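For reference, the RMSE values are assumed to follow the standard definition, comparing each implementation's solution x against the scipy-lsqr solution:

$$\mathrm{RMSE}(x, x^{\text{scipy}}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - x^{\text{scipy}}_i\bigr)^2}$$

where n is the length of the solution vector x.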

The organization and analysis of the test data is done via the plotting.py python script, which is provided for reference.

4.1. Speedup

Below, the computation times of each implementation for different-sized inputs are displayed. For the following results, A_ROWS refers to the number of rows in each square A matrix, e.g. A_ROWS=1500 refers to a matrix of size A_ROWS * A_COLUMNS == 1500 * 1500 == 2250000 matrix elements.

| A_ROWS | scipy-lsqr (SPARSE) (s) | Cpp-DENSE (s) | CUDA-DENSE (s) | CUDA-SPARSE (s) | CUBLAS-DENSE (s) | CUSPARSE-SPARSE (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | 5.36911 | 9.03528 | 6.06791 | 2.25782 | 4.23762 | 2.57307 |
| 1500 | 13.91778 | 30.57563 | 16.30271 | 4.80371 | 11.05773 | 5.59088 |
| 2000 | 33.73029 | 72.58957 | 37.55610 | 9.31562 | 22.89881 | 10.27482 |
| 2500 | 75.45058 | 342.40522 | 71.61881 | 15.67865 | 42.03487 | 17.49148 |
| 3000 | 67.19402 | 602.61113 | 115.69176 | 25.07935 | 69.78010 | 27.93229 |
| 3500 | 86.04460 | 1031.17675 | 189.35206 | 37.92124 | 108.18631 | 42.33548 |
| 4000 | 115.40540 | 1524.24163 | 281.95625 | 55.00457 | 158.94691 | 61.39182 |
| 4500 | 152.97966 | 2117.63550 | 390.68038 | 76.19237 | 224.51022 | 85.30340 |
| 5000 | 221.17113 | 2979.51575 | 548.95388 | 102.28826 | 306.09594 | 115.30341 |
| 5500 | 347.02812 | 3941.48175 | 721.66800 | 132.16242 | 400.06641 | 150.34353 |
| 6000 | 416.14977 | 5702.59200 | 928.49088 | 172.34811 | 524.00603 | 197.15688 |
| 6500 | 508.41177 | 6652.28650 | 1209.02975 | 218.11569 | 663.90925 | 250.62997 |
| 7000 | 649.83381 | 7486.82650 | 1486.45062 | 269.35181 | 824.29088 | 310.20653 |
| 7500 | 844.21635 | 9084.99600 | 1818.77300 | 330.60550 | 1011.73656 | 380.85144 |
| 8000 | 1392.32401 | 11089.35200 | 2211.38200 | 398.40791 | 1226.46113 | 464.05978 |

Sparse input implementations

The results show that the sparse input GPU implementations, CUSPARSE-SPARSE and CUDA-SPARSE, required less computation time in comparison to the scipy-lsqr solver.

(plot: runtimes of the sparse-input implementations compared to scipy-lsqr)

Another useful metric for comparing the GPU implementations' speeds with the scipy-lsqr solver is the “speedup” value, which is calculated by dividing the CPU runtime by the GPU runtime, i.e. CPU-runtime / GPU-runtime:

| A_ROWS | CUDA-SPARSE | CUSPARSE-SPARSE |
| --- | --- | --- |
| 1000 | 2.37800490 | 2.08665417 |
| 1500 | 2.89730134 | 2.48937383 |
| 2000 | 3.62083180 | 3.28281194 |
| 2500 | 4.81231229 | 4.31356098 |
| 3000 | 2.67925710 | 2.40560376 |
| 3500 | 2.26903461 | 2.03244638 |
| 4000 | 2.09810545 | 1.87981716 |
| 4500 | 2.00780795 | 1.79335945 |
| 5000 | 2.16223383 | 1.91816650 |
| 5500 | 2.62576999 | 2.30823447 |
| 6000 | 2.41458853 | 2.11075453 |
| 6500 | 2.33092711 | 2.02853542 |
| 7000 | 2.41258377 | 2.09484245 |
| 7500 | 2.55354598 | 2.21665527 |
| 8000 | 3.49471983 | 3.00031174 |

These values vary, but show a speedup of at least 2x for the CUDA-SPARSE implementation and 1.5x for the CUSPARSE-SPARSE implementation for all tests.

Surprisingly, the CUDA-SPARSE implementation outperformed the CUSPARSE-SPARSE implementation on all executions of LSQR. This is likely due to an error in the MatrixCUSPARSE code, as the pre-defined NVIDIA libraries often outperform CUDA kernels written from scratch. Both, however, outperformed the scipy-lsqr CPU solver and display the benefit of running LSQR on a GPU.

Dense input implementations

The dense input GPU implementations, CUDA-DENSE and CUBLAS-DENSE, were also faster than their CPU counterpart, Cpp-DENSE.

(plot: runtimes of the dense-input implementations compared to Cpp-DENSE)

As expected, the CUBLAS-DENSE implementation outperformed the CUDA-DENSE implementation. This is likely due to the fact that CUBLAS-DENSE uses the well-optimized cuBLAS library instead of CUDA kernels written from scratch.

4.2. Accuracy

Root mean squared error (RMSE) values for both sparse and dense input GPU implementations are calculated against the results of the scipy-lsqr solver:

| A_ROWS | CUDA-DENSE | CUDA-SPARSE | CUBLAS-DENSE | CUSPARSE-SPARSE |
| --- | --- | --- | --- | --- |
| 1000 | 0.00344 | 0.00024 | 0.00417 | 0.00042 |
| 1500 | 0.04059 | 0.00145 | 0.02306 | 0.00983 |
| 2000 | 0.03062 | 0.05117 | 0.06697 | 0.04953 |
| 2500 | 0.00083 | 0.00134 | 0.00060 | 0.00086 |
| 3000 | 0.00041 | 0.00012 | 0.00034 | 0.00014 |
| 3500 | 0.00015 | 0.00040 | 0.00060 | 0.00010 |
| 4000 | 0.00034 | 0.00040 | 0.00013 | 0.00119 |
| 4500 | 0.00098 | 0.00853 | 0.00739 | 0.00575 |
| 5000 | 0.00018 | 0.00058 | 0.00006 | 0.00006 |
| 5500 | 0.00042 | 0.00033 | 0.00012 | 0.00033 |
| 6000 | 0.00237 | 0.00151 | 0.00173 | 0.00230 |
| 6500 | 0.00007 | 0.00007 | 0.00004 | 0.00020 |
| 7000 | 0.01310 | 0.01141 | 0.04927 | 0.06066 |
| 7500 | 0.02260 | 0.00758 | 0.02353 | 0.01435 |
| 8000 | 0.00230 | 0.06865 | 0.04395 | 0.02216 |

All of the implementations yielded very accurate results, with none having an RMSE above 0.07.

5. Conclusion

The implementations created in this work show the benefit of running LSQR on a GPU as opposed to a CPU. For the sparse-input implementations, a speedup of at least 1.5x is found in comparison to the scipy-lsqr Python solver, whilst maintaining accurate results.

This speedup, however, may not be as significant when compared to optimized C++ LSQR solvers, like those provided by Eigen, boost, or Ibanez (as opposed to a Python solver). Further research and comparison could be done here.

It should also be noted that NVIDIA provides an LSQR solver in its cuSOLVER library, cusolverSpDcsrlsqvqr, but it currently (seemingly) only works for small datasets and runs explicitly on the CPU. If and when NVIDIA implements this solver for the GPU, it would be useful to compare its speed and accuracy to the implementations presented here.

Lastly, further measures could be taken to speed up the GPU implementations presented in this work. These could include (but are not limited to) introducing CUDA streams or using different memory-allocation techniques (e.g. pinned memory) provided by CUDA. This could potentially be done in future work.