Nuno Faria, Roberto Ribeiro, University of Minho
Abstract - LU factorization is a very useful matrix decomposition that underlies many matrix operations. Given its properties, it can become a very computationally intensive algorithm, especially if no optimizations are introduced and the implementation does not take the system architecture and the chosen language into consideration. Throughout this analysis we explore three different sequential and parallel approaches to the decomposition, analyzing the algorithm both from the mathematical point of view and in C implementations. The tools used for parallelization were OpenMP, MPI and CUDA, and along with these, several structural aspects of the algorithm were studied: transposing the algorithm due to C's memory model, derived data types in MPI, and the structure of the algorithm in CUDA. We also implemented a version using a mathematical library, GotoBLAS2. This paper is also greatly concerned with the numerical stability of the algorithm, so we present some considerations regarding row permutations.
I.INTRODUCTION
LU factorization is a matrix decomposition used in numerous matrix computations. Probably its most advantageous use is in solving systems of linear equations. In most real use cases the technique must be applied to large matrices, resulting in heavy computational work, since the theoretical complexity of this decomposition is O(N³). We face a clear evolution in our computational resources, especially in processor chips, where multi-core shared memory systems have improved greatly, achieving high levels of performance; even so, we need to adjust our applications and algorithms to take advantage of these improved parallel processing capabilities. And shared memory systems are not the only source of performance gains: distributed memory systems have their fair share of computational improvements as our hardware resources become increasingly available.
So, with this in mind and with a close mathematical analysis of LU factorization, we may significantly improve the performance of this decomposition, changing the algorithm not only for the parallel approaches but also in the sequential implementation, so that it takes full profit of the system architecture and the parallel paradigm.
II.The LU decomposition
A.Gaussian Elimination
It is a well-known algorithm used to solve linear systems of the form A x = b by reducing them to an upper triangular system T x = c.
To achieve this we may perform three operations without changing the solution:
- Multiply every term of an equation by a nonzero constant
- Swap two equations
- Add a multiple of one equation to another
With these three operations we are able to reduce the system to an upper triangular one, in a process called forward elimination. Here is an example of the forward elimination of a linear system:
Now we just apply the back substitution algorithm to get the value of the solution.
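As a minimal sketch of the two phases just described, forward elimination followed by back substitution might look like this in C (no pivoting yet, and assuming a nonsingular system; pivoting is discussed later):

```c
#include <stddef.h>

/* Solve A x = b: forward elimination reduces A to upper triangular
 * form, then back substitution recovers x.  A is n x n, row-major,
 * and is overwritten during the elimination. */
void gauss_solve(double *A, double *b, double *x, size_t n)
{
    /* Forward elimination: for each pivot row k, drive column k
     * below the diagonal to zero. */
    for (size_t k = 0; k + 1 < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k]; /* multiplier */
            for (size_t j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the upper triangular system. */
    for (size_t ii = n; ii-- > 0; ) {
        double s = b[ii];
        for (size_t j = ii + 1; j < n; j++)
            s -= A[ii * n + j] * x[j];
        x[ii] = s / A[ii * n + ii];
    }
}
```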
B.The LU decomposition
Let A be a matrix with m rows and n columns, and let us assume there are two matrices, L and U, lower triangular and upper triangular respectively, such that A = LU; we call this an LU factorization of A.
This decomposition is achieved with an algorithm based upon Gaussian elimination. We simply apply it to matrix A, reducing it to an upper triangular matrix U and saving the coefficients of the reduction in a lower triangular matrix L. Here is an example of an iteration:
This would be iteration 2, where we calculate the coefficients l22 and l32 with the pivot row: according to the algorithm, l22 is trivially achieved (the diagonal of L holds 1s) and l32 is the quotient of the element to eliminate by the pivot element. We then apply the reduction step to the row using the coefficient l32 (this corresponds to the operation Y ← Y + αX, where Y and X are vectors and α is the scalar coefficient, here −l32, since we subtract the multiple of the pivot row):
Finally we get the state:
At the end of all iterations, the matrix A' is the upper triangular matrix U that we were looking for.
C.Partial pivoting and numerical stability
In the iterative Gaussian elimination algorithm there is a pivot row i in each iteration i, which is used to drive to zero all elements below the diagonal in column i (in the previous example, rows 0 and 1 were the pivot rows in iterations 1 and 2, respectively). This reduction might lead to numerical instability in the final result, i.e. the condition number of the original matrix A, which is given by κ(A) = ‖A‖ · ‖A⁻¹‖, can increase drastically throughout the Gaussian elimination process. If this happens, using the new matrix A can result in several numerical errors.
To prevent this we implemented what is known as partial pivoting, which consists in searching the rows for the ideal pivot (the element of largest absolute value in the current column) at each iteration, and using it to perform the rest of the algorithm.
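The pivot search and row swap just described can be sketched as follows (a minimal version; the function name and signature are illustrative, not the paper's actual code):

```c
#include <stddef.h>

static double dabs(double v) { return v < 0 ? -v : v; }

/* Partial pivoting: in iteration i, find the row (from i down) whose
 * element in column i has the largest absolute value, and swap it with
 * row i, element by element.  A is m x n, row-major.
 * Returns the index of the chosen row. */
size_t partial_pivot(double *A, size_t m, size_t n, size_t i)
{
    size_t best = i;
    for (size_t r = i + 1; r < m; r++)
        if (dabs(A[r * n + i]) > dabs(A[best * n + i]))
            best = r;
    if (best != i)
        for (size_t j = 0; j < n; j++) {   /* swap rows i and best */
            double t = A[i * n + j];
            A[i * n + j] = A[best * n + j];
            A[best * n + j] = t;
        }
    return best;
}
```

The same swap would also be applied to the permutation matrix P, as described in a later section.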
III.Implementation Decisions
A.Chosen language
The language chosen for this project was C, because it is a language we are very familiar with. Ultimately the LU decomposition will be performed in environments where we can take advantage of parallelism, such as clusters and GPUs. C is also a lower-level language than most others, which is suitable for this algorithm.
However, we must pay attention to the fact that C's memory model is row major. This means that when we store a matrix, C looks at it as a row-structured matrix. Most algorithms on matrices (including LU decomposition) are described column-wise, so we must be very careful about the algorithm we choose to implement (more about this aspect is explained further in this paper). A very good alternative to C would be Fortran, which gained its reputation on mathematical implementations and is column major.
B.Storage of L and U
To store the resulting L and U matrices we can reuse the memory space of the original matrix A, due to their form:
- U is an upper triangular matrix: the upper part of the matrix, including the diagonal, is filled with values and the lower part is filled with zeros
- L is a lower triangular matrix: all the positions below the main diagonal are filled with values; its main diagonal is filled with 1s, so we can omit those values, since we need the main diagonal to store the diagonal of U.
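Because L's unit diagonal is implicit, both factors fit in A's storage; reading them back out of the packed matrix might look like this (helper names are illustrative):

```c
#include <stddef.h>

/* Read U(i,j) and L(i,j) out of the packed n x n matrix A that holds
 * both factors: U occupies the diagonal and above, L (with an implicit
 * unit diagonal) occupies the part below the diagonal. */
double packed_U(const double *A, size_t n, size_t i, size_t j)
{
    return (j >= i) ? A[i * n + j] : 0.0;   /* zeros below the diagonal */
}

double packed_L(const double *A, size_t n, size_t i, size_t j)
{
    if (i == j) return 1.0;                 /* implicit unit diagonal */
    return (i > j) ? A[i * n + j] : 0.0;    /* zeros above the diagonal */
}
```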
C.Pivoting and permutations
The permutation matrix is initially an identity matrix (with the same dimensions as the matrix A, M×N), and the method used to build it is straightforward: when any permutation is performed on the matrix A, that permutation is also carried out on the P (permutations) matrix. When the algorithm ends, all the permutations of A are stored in P.
The reason why we want the permutation matrix is to validate the resulting L and U matrices: we multiply the factors L and U (where A is the original matrix) and, if the factorization is being done correctly, the product LU should be equal to, or a close approximation of, the matrix PA. This is due to the row permutations done along the factorization steps. More about result validation is in the Performance Analysis chapter. This concept is applied in all the LU algorithms implemented.
IV.BLAS1LU - Sequential Implementation
A.Mathematical perspective
This is the implementation approach closest to the original mathematical algorithm: for each iteration, and its inherent pivot row, another loop calculates the coefficients to perform the elimination one by one, and in the same loop the elimination itself and the update of that specific row are performed, using the mentioned operation Y ← Y + αX, known as the SAXPY operation.
B.C code
In this section we explain how we implemented the algorithm in C; it corresponds to the level 1 BLAS implementation.
The call to the row_permutation function performs the partial pivoting component of the algorithm, checking if the pivot element is the highest in column i from row i down; when the function finds a higher absolute value (in a given row j) than the pivot element, it swaps rows j and i.
The invoke_saxpy function is the key factor in this implementation: it calls the saxpy operation, and all the preparation needed to use this operation is done there.
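The structure described above can be sketched as follows (a minimal version without the pivoting step; the function names are illustrative, not the paper's actual row_permutation/invoke_saxpy code):

```c
#include <stddef.h>

/* saxpy: y <- y + alpha * x, on vectors of length len */
static void saxpy(double alpha, const double *x, double *y, size_t len)
{
    for (size_t i = 0; i < len; i++)
        y[i] += alpha * x[i];
}

/* BLAS1-style in-place LU (no pivoting): for each pivot row k, compute
 * one coefficient l_ik at a time and, in the same loop, eliminate row i
 * with a saxpy on the trailing part of the row.  A is n x n, row-major;
 * L's coefficients are stored below the diagonal, U on and above it. */
void lu_blas1(double *A, size_t n)
{
    for (size_t k = 0; k + 1 < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            double l = A[i * n + k] / A[k * n + k]; /* coefficient l_ik */
            A[i * n + k] = l;                       /* store L in place */
            /* row_i <- row_i - l * pivot_row, past the pivot column */
            saxpy(-l, &A[k * n + k + 1], &A[i * n + k + 1], n - k - 1);
        }
    }
}
```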
C.Potential parallelism
We can follow several ways to parallelize this algorithm:
- We can parallelize the row_permutation function: when a row is swapped with the pivot row (because the pivot element has a smaller absolute value than some other value in that column), the swap is done element by element, and those element swaps are independent.
- The cycle that runs through every row below the pivot, performing the Gaussian elimination with the function that invokes the saxpy operation, can have its iterations distributed among threads.
- The saxpy operation, due to its cyclic nature, can also be parallelized without data dependency problems: it is an extremely simple cycle that only runs through the positions of two vectors, multiplying the X vector (at position i) by a scalar α and adding it to the value of Y (also at position i).
V.BLAS2LU - Sequential Implementation
A.Mathematical perspective
In this second implementation we introduced an optimization by aggregating operations, performing a kind of loop unrolling of the inner loop that performs the eliminations and coefficient calculations. For each iteration we calculate all the coefficients of that pivot column in a loop, store them in the matrix, and then perform what is known as a Rank-1 update, which does the elimination and transformation of all the rows below the pivot row. The Rank-1 update operation multiplies a column vector by a row vector; the result is scaled, added to another matrix and stored over this last matrix: A ← A + αXYᵀ. In this case X is the column vector of all the calculated coefficients, Y is the pivot row without the pivot element, and α is −1 according to the algorithm.
Rank1-Update
As mentioned before, this implementation uses C, which is row major, so in order to achieve a significant improvement in performance, especially with non-square matrices, we can review the implementation considering the transposed matrix. The coefficients will then be calculated and stored in a different place, while the Rank-1 update remains essentially the same, because (A + αXYᵀ)ᵀ = Aᵀ + αYXᵀ; we then need to transpose the final matrix again to obtain the correct result.
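The rank-1 update at the heart of this approach can be sketched as follows (a minimal row-major version; the signature, with explicit sub-matrix offsets, is illustrative):

```c
#include <stddef.h>

/* Rank-1 update A <- A + alpha * x * y^T over the trailing sub-matrix
 * that starts at row r0 and column c0.  A is m x n, row-major; x holds
 * the coefficient column and y the pivot row (alpha is -1 in LU). */
void rank1_update(double *A, size_t m, size_t n,
                  size_t r0, size_t c0,
                  double alpha, const double *x, const double *y)
{
    for (size_t i = r0; i < m; i++)          /* rows of the sub-matrix */
        for (size_t j = c0; j < n; j++)      /* columns of the sub-matrix */
            A[i * n + j] += alpha * x[i - r0] * y[j - c0];
}
```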
B.C code
This code section corresponds to the level 2 BLAS operations.
Notice that we have a condition preventing the variable that controls the pivot row and column from going beyond the number of columns of the matrix; this prevents errors explained further in the level 3 BLAS section, since BLAS3 uses BLAS2.
The main differences between the algorithm shown previously and this one are:
- All the coefficients (for each pivot row) are calculated at once, before any transformation of the U matrix.
- The rank-1 update changes the sub-matrix from the pivot row and column down to M and N (the limits of rows and columns, respectively).
C.Potential parallelism
In this case we can also explore parallelism in performing the row permutations and in calculating the coefficients needed for the Gaussian elimination; beyond that, we can investigate the rank-1 update, which is one of the heaviest operations of the three algorithms presented in this paper:
- The parallelization method for the permutations is similar to the one presented in the previous section (BLAS1LU).
- The computation of the coefficients is now done independently, and we can take advantage of the fact that each coefficient can be calculated without worrying about data dependencies between them.
- The rank-1 update takes up an immense part of the factorization, since it is a column vector by row vector multiplication and subtraction, originating a matrix. It consists of two nested cycles that can be fully parallelized, once we verify there are no data dependencies. The parallelization of this function should greatly increase the overall performance.
D.Transposing the algorithm
For this second approach it was necessary to rewrite the algorithm due to the memory model of C. As was said before, C is row major, and the algorithm of this second approach is mostly used in the BLAS3 implementation to compute tall sub-matrix blocks. So we would be dealing with extremely inefficient accesses, because column-oriented matrices would be processed in a row-major way.
The solution went through reorganizing the BLAS2 algorithm to run through the matrix and perform the Gaussian elimination column-wise. By transposing the algorithm we get pivot columns instead of pivot rows, and instead of running through the rows we now run through the columns. However, we only transposed the second implementation, corresponding to the level 2 BLAS, because that is the only implementation that takes advantage of a transposed, column-wise algorithm. The next image shows a matrix that could be used by BLAS2LU: the fact that it is "tall and skinny" (TS), with long columns, could cause low performance, since C's memory model is row major; so we decided to transpose the matrix passed to BLAS2 and also the BLAS2 algorithm itself, reorganizing how the matrix is accessed (mostly by changing matrix indices).
Demonstration of passing from row-wise to column-wise
In the image shown above, the blue elements correspond to the U matrix and the white elements correspond to the L matrix.
E.Transpose code
The transposed implementation is extremely similar to the first one mentioned in this section, with the distinction of how it runs through the matrix. First the matrix in question is transposed; then the algorithm performs column-wise Gaussian elimination on it, including the rank-1 update operation (this function was also reconstructed for the transposed algorithm). In the end the resulting matrix At is transposed again and put back in the original matrix A.
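The transposition itself is straightforward; for a non-square M×N matrix it might look like this:

```c
#include <stddef.h>

/* Transpose the m x n row-major matrix A into the n x m matrix At.
 * After the column-wise elimination runs on At, the same routine
 * (with the dimensions swapped) puts the result back into A. */
void transpose(const double *A, double *At, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            At[j * m + i] = A[i * n + j];
}
```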
F.Potential parallelism
The logic behind the parallelization of this algorithm is analogous to the first implementation of the second approach presented earlier in this section.
- However, for high matrix dimensions, the process of transposing the initial matrix into an auxiliary matrix and transposing the resulting matrix back to its original form can be quite expensive. The transpose function, which consists of two nested cycles, is therefore also a very good candidate for parallelization.
VI.BLAS3LU - Sequential Implementation
A.Mathematical perspective
In this final approach we implement a blocked version of the LU decomposition. Consider the following state of the matrix A, where the algorithm has already performed part of the decomposition, as the following illustration shows:
We start by applying the previous approach (BLAS2 transposed) over the current block column, decomposing it as [A22; A32] = [L22; L32] · U22 and achieving the factors L22, L32 and U22.
So we have A23 = L22 · U23, and this lets us write U23 = L22⁻¹ · A23.
Thus, with L32 and U23, the trailing sub-matrix is updated as Ã33 = A33 − L32 · U23, and this way we are able to achieve the state needed for the next block iteration.
B.C code
The implementation, as was said earlier, uses the transposed implementation of BLAS2, as seen in (1).
There are a number of auxiliary functions we call to correctly perform the necessary operations:
- First we get the L22 matrix from A according to its form; this is done in the function tril_eye. The name derives from the Matlab functions tril and eye: tril returns the lower triangular part of a matrix, and eye sets the main diagonal of a matrix to 1.
- The inverse_lower function is a specific function to invert the L22 matrix.
- rankB_update, as the name indicates, multiplies two matrices and stores the result in A.
- mult_sub subtracts two matrices; although in this example we pass the same matrix as both arguments to read the values and store them, we also pass the indices referring to each sub-matrix inside A.
C.Invert matrix
For the LU decomposition we only need to invert the L22 matrix, which is a lower triangular matrix, so we can use a relatively simple solving method: we solve A x = b,
where A corresponds to the L22 matrix, x to a column vector of the inverted matrix and b to the corresponding column vector of the identity matrix.
So we solve this system separately for each column of L22⁻¹ and I.
Performing a forward substitution in this case yields one column of the inverse of L22. The number of operations is on the order of n² flops for each column, so we reach a total on the order of n³ flops for each inverted matrix. Still, this will not have much impact on the performance of the third implementation, because the dimension of L22 is set to the block size defined in the implementation.
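The column-by-column inversion just described can be sketched as follows (a minimal version; the name inverse_lower matches the paper's function, but the signature is an assumption):

```c
#include <stddef.h>

/* Invert the lower triangular n x n matrix L (row-major) into Inv by
 * solving L x = e_c with forward substitution, one column c at a time;
 * each solve yields one column of the inverse. */
void inverse_lower(const double *L, double *Inv, size_t n)
{
    for (size_t c = 0; c < n; c++) {          /* column e_c of identity */
        for (size_t i = 0; i < n; i++) {
            double s = (i == c) ? 1.0 : 0.0;  /* right-hand side b = e_c */
            for (size_t j = 0; j < i; j++)
                s -= L[i * n + j] * Inv[j * n + c];
            Inv[i * n + c] = s / L[i * n + i];
        }
    }
}
```

Each column is independent of the others, which is exactly why the inversion can be parallelized by columns, as noted in the next sub-section.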
D.Potential parallelism
The third implementation is the one that involves the most mathematical and auxiliary concepts and, as a consequence, is the implementation that can be partitioned into the most chunks of parallelizable code, as it uses several auxiliary functions that can themselves be partitioned into parallel pieces of execution:
- As it was mentioned earlier the second implementation (corresponding to level 2 BLAS operations) can be parallelized (check previous section).
- The auxiliary operation fill_A23 copies the computed values back into A (at the correct indices); it consists of two nested cycles that run through each row and column of the matrices.
- The tril_eye function can also be parallelized, since all it does is run through the original matrix and perform comparisons in order to correctly create the L22 matrix.
- The inverse matrix operation, although it consists of nested cycles designed to solve a system, cannot be fully parallelized, because there are data dependencies. However, the computation of each column of the inverted matrix (inversion is done column by column, as explained in the previous sub-section) can be parallelized.
- As numerous case studies have shown, matrix multiplication can be strongly parallelized.
- The mult_sub function can also be fully parallelized, since it does not suffer from data-dependency issues.
VII.Performance analysis
A.Measure method
The measurement unit chosen for this project was FLOPS (floating point operations per second), given by FLOPS = (number of floating point operations) / (execution time). To get this value we must first have the number of floating point operations in the algorithm. We will consider that the matrices are square; however, there is one feature we must not forget: BLAS3 uses BLAS2 to compute a part of the matrix (a sub-matrix that is not square), but since we are measuring the FLOPS of the whole BLAS3 algorithm we count those operations as part of the third implementation. So the FLOPS formula is the same for all three implementations.
The number of floating point operations is the same for each implementation; we only consider the operations present in the algorithm and ignore all other auxiliary operations, such as matrix index calculations. Thus we must identify the number of additions, subtractions, multiplications and divisions. The number of multiplications is given by
∑ₖ₌₁ⁿ⁻¹ (n − k)² = (2n³ − 3n² + n) / 6
The number of divisions is given by
∑ₖ₌₁ⁿ⁻¹ (n − k) = (n² − n) / 2
The number of additions and subtractions matches the number of multiplications, since each elimination update performs one multiplication and one subtraction:
(2n³ − 3n² + n) / 6
So we get the total number of floating point operations as
2 · (2n³ − 3n² + n) / 6 + (n² − n) / 2
We can round the total number of operations to 2n³/3 because the other terms can be overlooked for vast dimensions of the matrix A. Note that n means we have an n × n matrix, since we consider only square matrices.
All the tests were executed with the K-Best measurement scheme, with K = 3, the maximum number of tries limited to MAX_TRY = 5 and a tolerance factor set to TOL = 1.05 (5% tolerance). More about the K-Best measurement scheme can be found in Computer Systems: A Programmer's Perspective, chapter 9.
B.Time measuring
The technique used to measure the time spent in each implementation was to capture the time right before the start of each function corresponding to a BLAS implementation (not considering matrix allocation and initialization) and again right after that function completes. The function used was omp_get_wtime(), which returns the current system time in seconds; we then take the difference between the two measurements.
C.Result Validation
To check if the matrix is properly decomposed we simply multiply the two factors L and U, obtaining a new matrix A'. Since we use partial pivoting, we need the matrix P to transform the original matrix A, in order to compare it with the new matrix A', which suffered the row swaps from the pivoting. This comparison is achieved with the difference of the two matrices, i.e. PA − LU, which results in the element-wise difference of both matrices; we call this the error originated by the decomposition. In order to simplify the observation of this error, we calculate the Euclidean norm of the error matrix, reducing it to a single scalar.
D.Results
VIII.BLAS1LU – Parallel implementation
For the parallel implementation, one of the options we chose was OpenMP, since the BLAS operations being treated are tremendously loop-oriented. The OpenMP implementation is fairly simple to apply; it consists in placing pragmas along the code. However, one must pay attention to the data dependencies mentioned in the previous section and to some finer-grained details, such as reduction operations and critical code areas (code regions that all threads access).
A.Algorithm Data Dependency
As mentioned in chapter 2, the Gaussian elimination algorithm reduces a system of type A x = b to T x = c, where T is an upper triangular matrix. To perform this reduction we need to update the matrix in each step of the elimination using three different operations; therefore in each step the values are different, making the steps data-dependent. On the implementation side this means that the iterations of the main loop are dependent, i.e. the data needed in iteration i depends on the data updated in iteration i−1. This dictates that the main loop, which iterates the pivot along the decomposition, is not parallelizable.
B.Parallel implementation
Facing such data dependency, few changes to the algorithm can be made to increase performance with parallelization. The two main changes were:
- The use of #pragma omp parallel for to distribute the iterations of the loop that calculates the coefficients one by one and updates the rest of the matrix with the SAXPY operation.
- The parallelization of the SAXPY operation itself with the same pragma.
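The second change can be sketched as follows; each position i is written by exactly one iteration, so there are no data dependencies, and without -fopenmp the pragma is simply ignored and the code runs sequentially:

```c
#include <stddef.h>

/* SAXPY with the iterations of the cycle distributed among threads
 * via OpenMP.  Each element of y is updated independently. */
void saxpy_omp(double alpha, const double *x, double *y, size_t len)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)len; i++)
        y[i] += alpha * x[i];
}
```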
IX.BLAS2LU – Parallel implementation
A.Algorithm Data Dependency
The data dependency in this algorithm is analogous to the one previously explained for BLAS1LU: every step of the algorithm depends on the previous one, since the sub-matrix updated by the rank1update operation is needed by the following step to compute L and U correctly.
B.Parallel implementation
The main difference between this BLAS2 algorithm and the first one (BLAS1LU) is in computing the coefficients, which are now computed separately, so we can apply parallelism to this simple cycle. The thread organization in this case is: each thread computes one value of the coefficient column (so there will be M−1 threads).
The most obvious gains in performance were identified in parallelizing the heavier operations, like rank1update in this case. For rank1update, the parallelized algorithm creates M−1 threads (one per row), and each of them runs through every column of its row of the matrix.
The pragma is inserted so that the outermost loop is distributed among the execution threads, each of which runs an inner cycle through the matrix (almost all of our OpenMP code uses this strategy).
X.BLAS3LU – Parallel implementation
A.Algorithm Data Dependency
Despite the fact that this is a blocked algorithm, the main loop data dependency remains, since this is still a Gaussian-elimination-based algorithm. If we take a close look at the algorithm, it has three main operations in each step: the BLAS2LU over the block column, which yields L22, L32 and U22; then, using L22⁻¹, we calculate U23; and finally we use U23 to achieve the new A33. Hence these three operations are dependent and cannot be performed in parallel with each other.
B.Parallel implementation
As it was said earlier for BLAS2, the most obvious gains are in parallelizing the heavier operations, and in BLAS3 these operations are:
- Using the BLAS2 algorithm to compute the A22 and A32 blocks of the blocked matrix (see the images in chapter VI), already parallelized in the previous chapter
- The rankB-update operation (explained below)
- Multiplying the matrices L22 (inverted) and A23, with #pragma omp parallel for
- Auxiliary functions like tril_eye, with #pragma omp parallel for
The most important operation to consider in this parallelization context is rankB-update. This operation performs a series of dot products (between the rows of L32 and the columns of U23) and updates the matrix Ã33.
The rankB-update is executed so that each thread runs through rows of L32 and A33; each thread then runs a cycle through the columns of U23, with an inner cycle running through each element of the row of L32 and the column of U23.
Figure – rankB-update operation
The code used to perform the rankB-update operation is shown below.
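A minimal sketch of such a rankB-update, assuming row-major storage (the signature is illustrative; the actual implementation also receives the sub-matrix indices within A):

```c
#include <stddef.h>

/* Blocked rank-b update: A33 <- A33 - L32 * U23, where L32 is m x b,
 * U23 is b x n and A33 is m x n (all row-major).  Each (i, j) element
 * is a dot product of a row of L32 with a column of U23. */
void rankB_update(double *A33, const double *L32, const double *U23,
                  size_t m, size_t n, size_t b)
{
    for (size_t i = 0; i < m; i++)          /* rows of L32 and A33 */
        for (size_t j = 0; j < n; j++) {    /* columns of U23 */
            double dot = 0.0;
            for (size_t k = 0; k < b; k++)  /* inner dot-product cycle */
                dot += L32[i * b + k] * U23[k * n + j];
            A33[i * n + j] -= dot;
        }
}
```

Under the parallelization strategy described above, the pragma would be placed on the outermost loop, so each thread owns a set of rows.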
XI.Use of computing units external to CPU
Along the study and analysis of this algorithm, and being aware of the CUDA/GPU capabilities, we became curious to use this external unit to perform some of the operations of the algorithm and observe the enhancement in the final performance.
In the BLAS2LU case, a simple profiling revealed a bottleneck in the rank-1 update operation, so we found it suitable to use this processing unit for that operation. Let us remember that this operation corresponds to a column vector by row vector multiplication whose result updates the original matrix: A ← A + αXYᵀ.
According to the CUDA programming model, to perform this operation we assign one value of the resulting matrix to each CUDA thread and set its kernel as:
That reflects the calculation needed for each resulting value, for i, j ≠ 0.
We also need to specify the grid and block sizes according to the size of the matrix at the present step, and allocate and copy the matrices into the device.
In the BLAS3LU case we found two operations suitable for the GPU: U23 = L22⁻¹ · A23 and Ã33 = A33 − L32 · U23, both of which correspond to matrix-matrix multiplications. The first is an operation of this type:
This is the basic non-square matrix multiplication, in which each thread has as kernel:
The second operation is almost the same, with an additional subtraction from the original matrix, corresponding to the update of Ã33,
with almost the same kernel as the previous one, plus the update operation:
XII.Results
XIII.BLAS2LU – Parallel implementation fully on CUDA
Because of the disappointing results shown in the previous section (Use of computing units external to CPU), we decided to implement another version of BLAS2 running entirely on the GPU device, instead of a hybrid algorithm.
For that we needed to turn every operation on the matrix into a kernel to run in CUDA. We identified three main operations performed on the matrix in each BLAS2 iteration:
- Performing the row permutations, regarding partial pivoting
- Updating the current L column (coefficients for the computation in progress)
- Rank1-update
The auxiliary functions check_max_fullgpu and swap_fullgpu are declared to be recognized and executed only on the device (GPU), via the __device__ tag in the function declaration.
A.Why a full-CUDA implementation?
We noticed a major worsening of the first results with the naïve hybrid CUDA implementation, due to the constant communication cost of passing the matrix from CPU to GPU and vice-versa. So we decided to eliminate that cost by reducing the communications to two: one at the beginning of the algorithm (to send the raw matrix to the GPU) and another at the end (to get back the computed LU matrix).
B.Kernel to update the L current column
To update the L column we only need to perform a division on each element of the column, so each thread performs a single division and updates one element of the given column.
C.Rank1-update
The rank1-update operation is the same as in section XI (Use of computing units external to CPU).
D.Results for the full CUDA implementation
Here we show the results captured by running both of BLAS2 implementations on CUDA, and compare them to the OpenMP and sequential results.
When comparing the parallel results we quickly understand that the results for both CUDA implementations of BLAS2 are disappointing. For the full CUDA implementation we observe only a minor speedup, due to the lower communication overhead.
XIV.Blas2LU – Parallel Implementation with MPI
With MPI (Message Passing Interface) we take a different approach on how the work is distributed among the processes. The main thing we focus on with MPI is how we can assign different parts of the matrix to each process in the rank1-update operation. We cannot fully parallelize the algorithm, for the reasons given previously (data dependencies).
A.Parallelizing the rank1-update operation with MPI
The rank1-update operation consists in multiplying a column vector by a row vector, which originates a matrix. The objective is to partition the sub-matrix we want to update evenly among the processes. Let us consider an example: a 9×9 matrix whose sub-matrix to update is 8×8 (because we ignore the first row and column). If we have access to four processes, the matrix is divided evenly: each process gets an 8×2 matrix, and the (blue) row and (green) column to multiply are broadcast to every process. If the sub-matrix to update does not divide evenly, as a 9×9 one would, the dimensions of the matrices distributed to the processes should be 9×2, 9×2, 9×2 and 9×3, as shown below.
Matrix distribution in MPI
After distributing the matrices and the vectors to multiply, we simply perform a rank1-update operation on that distributed matrix.
B.Distributing data with MPI
The challenge here is to partition the sub-matrix to be updated among the processes. There are many ways to do this, whether through simple sends and receives (MPI_Send and MPI_Recv) or organized broadcasts. However, there is a primitive that spreads the data exactly how we need it: MPI_Scatter distributes equal-sized memory chunks to all available processes, and MPI_Scatterv does what we want, spreading variable-sized memory chunks. For this last one we need to specify two arrays: a count array (the number of elements to send to each process) and a displacement array (the initial position in memory for each process). In the case of a matrix whose dimension is not divisible by the number of processes, we assign the bigger chunk to the last process.
This is done by calculating what we call the chunk size and the remainder (referring to the columns): chunk = N / nprocs and remainder = N mod nprocs.
For dimensions that do not divide evenly, the last process receives a matrix of dimension M × (chunk + remainder); otherwise the dimension of the matrix sent to each process is M × chunk.
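The count and displacement arrays for MPI_Scatterv can be computed as in this sketch (in column units of the resized derived data type described below; the function name is illustrative):

```c
/* Fill the count and displacement arrays for MPI_Scatterv, measured in
 * column units: each process gets `chunk` columns, and the last one
 * also takes the remainder when the division is not even. */
void fill_scatterv_args(int ncols, int nprocs, int *counts, int *displs)
{
    int chunk = ncols / nprocs;      /* columns per process */
    int rem   = ncols % nprocs;      /* leftover columns */
    for (int p = 0; p < nprocs; p++) {
        counts[p] = chunk + (p == nprocs - 1 ? rem : 0);
        displs[p] = p * chunk;       /* starting column of process p */
    }
}
```

For the 9-column example above with four processes this yields counts {2, 2, 2, 3} and displacements {0, 2, 4, 6}.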
The vectors that correspond to the row and column containing the pivot element are distributed with a simple MPI broadcast.
C.Setting matrix indices to distribute
In the functions that carry out the data distribution we need to pass the correct initial memory pointers. Since we want to ignore the first row of the matrix, we add N elements to the matrix's memory pointer; following the same logic, since we also want to ignore the first column, we add 1 element to the matrix pointer.
The col_type and col_type2 types are explained later, as they play an extremely important role in the data distribution.
Rank1-Update initial indices
The init argument is a control variable indicating the initial index of the matrix to consider: if init = 0 we consider the entire matrix; if init = 3 we only perform the rank1-update from index 3 to N (these indices include the row and column with the pivot element). This comes in handy when we want to call the rank1-update operation for different sizes (as in the main steps of the BLAS2 algorithm).
D.Structuring data to send
In this section, we explain how we structure the data to distribute between all the available processors. We will send and receive matrix sections column-wise, i.e., we will need to define specific derived data types in MPI to send and receive groups of matrix columns. In such vector-shaped data types we define:
- the number of elements in a column, i.e., the number of rows
- the units to consider when sending and receiving the data
- the striding factor, i.e., the number of elements to skip in the matrix between consecutive blocks
- the MPI type of each element of the matrix
An illustration of the derived data type may make this easier to understand:
Note that the vector type is first saved in a temporary identifier. This is because we need to perform a resize operation on the type: since we may want to send and receive more than one type unit at a time, we set the type's extent to the right factor to avoid memory addressing problems. After doing so, we need to commit the newly created data type to the MPI system.
E.Gathering data back
The way we get the processed data back to the processor that originally scattered the chunks of the matrix is with the inverse operation of MPI_Scatterv, which is MPI_Gatherv. The arguments are applied in the same way, except in the inverse order relative to the scatter primitive:
F. Results for the MPI implementation of BLAS2
XV. Numerical libraries – GotoBLAS2
As a final analysis we will use some BLAS routines to perform the bottleneck operations of the factorization. The library used is an optimized implementation developed by Kazushige Goto at the University of Texas at Austin. Using this library caused some conflicts with the rest of the program: when compiling the library with multi-thread support from OpenMP, for some reason we could only use a single processor for the remaining program (omp_get_num_procs() returned 1). We therefore restricted the library, and the program as a whole, to a single thread and compared the results with the sequential execution.
Another problem was working with sub-matrices. All decomposition implementations are iterative, and with each iteration the matrix we deal with is a sub-matrix of the original, which means we invoke the routines on sub-matrices for which this library does not offer support (at least in our investigation). To solve this we need to re-allocate the sub-matrices in each iteration and copy them back to the original matrix. Obviously, and especially for big matrices, this results in a great amount of data-copying overhead, which is reflected in the final performance results.
A. BLAS1LU
In BLAS1LU the operation we will perform using BLAS is the vector update, with the routine xAXPY(N, ALPHA, X, INCX, Y, INCY), where x expresses the data precision and INCX and INCY are the increments used to access the vector positions.
In practice the changes are:
B.BLAS2LU
In BLAS2LU the heaviest operation is the rank1-update, which is performed with the routine xGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA), where LDA is the leading dimension of the array that stores A. As mentioned before, in practice we need to re-allocate the matrix passed as an argument to this routine:
C.BLAS3LU
In this implementation two operations can be performed with BLAS. The inversion of the lower triangular matrix and the multiplication can be performed in a single routine, DTRSM(SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA, B, LDB), where:
- SIDE specifies the side of the triangular matrix, Left or Right
- UPLO specifies upper or lower triangular
- DIAG specifies the characteristic of the diagonal, unit or non-unit
The second operation is the rankB-update, which can be achieved with the routine DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC), performing a general matrix multiplication.
As before, we need to re-allocate the sub-matrices L32 and A33 to invoke the routines, and copy only A33 back.
D.Results
XVI.Conclusions
After analyzing every result obtained, we noticed that there are several issues to consider when parallelizing an algorithm, starting with simple details such as studying the algorithm in a row-major or column-major fashion. The first CUDA implementation in this project proved not to be the best, so we had to think outside the box and run the whole algorithm outside of the CPU. An extremely important distinction made in this project is that, while it may not be hard to parallelize a single function, parallelizing a whole algorithm requires skill, experience and knowledge of the platform on which that algorithm is being developed.
Another conclusion drawn is that algorithms with data-dependent iterations do not perform well when run naïvely on platforms that require data transfers, like CUDA and MPI: doing so results in massive data transfers back and forth to keep the matrix in its correct state. The best results for the LU factorization were obtained with OpenMP, due to its shared memory model and threaded structure; the combination of this loop-oriented algorithm with this API allowed us to take advantage of the massive loops that operate on huge chunks of the matrix. One important aspect considered in every test was numerical stability: the differences between the initial matrix A and the resulting matrix. Guaranteeing it implied finding the maximum in a vector and swapping entire rows of the matrix, so it may have been one of the major bottlenecks in this project. There are, however, more efficient ways of ensuring numerical stability, namely probabilistic methods. The use of libraries proved troublesome, since each routine works as a black box to which we must send, and from which we must retrieve, data at each invocation; as explained, we had to copy and re-allocate data several times, which is painful in terms of performance.
As future work, a hybrid implementation of BLAS3 on the GPU and CPU remains to be developed, using CUDA for the GPU and OpenMP for the CPU, for example. We think it would be rather interesting to use such a heterogeneous platform and watch its evolution.
XVII. References
- J. W. Demmel, Applied Numerical Linear Algebra. SIAM, August 1, 1997.
- E. W. Weisstein, "Matrix Inverse", Wolfram MathWorld, Wolfram Research, Inc. Available at: http://mathworld.wolfram.com/MatrixInverse.html.


