Lessons Learned from GPU Experiments with Aparapi

BY MARKUS SPRUNCK

Most of my experiments with new technologies are successful and somehow encouraging. In the following you can read about some experiments with Aparapi for GPU programming. The results of these experiments were not satisfying, but they were nevertheless very insightful. In the lessons learned and recommendations paragraphs of this article you will find some conclusions.

GPU programming makes sense if you write solutions for dedicated high-performance graphics hardware and not for standard devices. Java is maybe not the best choice for this, because you lose the "write once, run everywhere" benefit of Java.

The key problem today is the missing FP64 (native double precision) support on standard graphics devices. Support for double precision is mandatory for most real-world scientific computing use cases.

Introduction

General-purpose Computing on Graphics Processing Units (GPGPU)

The use of a graphics processing unit (GPU) for general-purpose computations is called General-purpose Computing on Graphics Processing Units. Compared with computation on central processing units (CPUs), the number of cores and the calculation power of graphics cards can be significantly higher.

A graphics processing unit consists of hundreds of cores. These cores are restricted in the operations they can perform, but are designed to handle multiple tasks in parallel. Compared with the small number of cores in a classical central processing unit, the many cores in a graphics processing unit provide a large potential for parallelization in scientific computing.

At the moment there are two competing approaches to programming for GPU acceleration:

  • Open Computing Language by Khronos Group - "OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. [...]" (see http://www.khronos.org/opencl)
  • Parallel Computing Platform by NVIDIA - "CUDA™ is a parallel computing platform and programming model invented by NVIDIA. [...]"

Java Bindings for OpenCL

The following implementations of Java bindings are just a personal selection (there are more available):

CUDA related:

  • Rootbeer - "With Rootbeer a developer can run complex Java objects on a GPU. Rootbeer (de)serializes object fields and generates CUDA code for execution in parallel on a GPU. [...] An nvidia gpu is required and CUDA must be installed"
  • JCUDA - "The JCuda Runtime API is mainly intended for the interaction with the Java bindings of the CUDA Runtime libraries, like JCublas and JCufft. A Java programmer who only wants to use these libraries and does not want to create own CUDA kernels can use the JCuda Runtime API for the interaction with these libraries."

OpenCL related:

  • Aparapi - "Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java byte code to OpenCL at runtime and executing on the GPU, if for any reason Aparapi can't execute on the GPU it will execute in a Java thread pool." (see http://aparapi.github.io)
  • JOCL - "This library offers Java-Bindings for OpenCL that are very similar to the original OpenCL API. The functions are provided as static methods, and semantics and signatures of these methods have been kept consistent with the original library functions, except for the language-specific limitations of Java."

All these bindings have individual pros and cons. I decided to use Aparapi for my experiments, because my notebook PC has an AMD graphics card.

Aparapi Basics

This API has some interesting concepts and is in principle independent of the graphics card vendor (because it is based on OpenCL). It was initially developed by AMD and is now open source. Aparapi generates OpenCL from Java code and offers a higher-level abstraction of OpenCL (so that less boilerplate code is needed).

On the Aparapi GitHub site you may find the following minimal sample of how to refactor a simple loop:

final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result[] = new float[inA.length];

for (int i = 0; i < inA.length; i++) {
   result[i] = inA[i] + inB[i];
}

After refactoring, the for-loop may run on the GPU with Aparapi's support:

final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result[] = new float[inA.length];

Kernel kernel = new Kernel() {
   @Override
   public void run() {
      int i = getGlobalId();
      result[i] = inA[i] + inB[i];
   }
};

Range range = Range.create(result.length);
kernel.execute(range);
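
The listings further below also rely on Aparapi's explicit buffer management (setExplicit, put and get). A minimal sketch of this idiom, reusing the kernel and arrays from the sample above, could look like this (the loop over several passes is just an illustrative assumption):

  kernel.setExplicit(true);          // disable Aparapi's automatic host/device transfers
  kernel.put(inA);                   // copy the input arrays to the device once
  kernel.put(inB);
  kernel.put(result);                // transfer the result buffer as well (mirrors Listing 5)
  for (int pass = 0; pass < 10; pass++) {
     kernel.execute(Range.create(result.length));  // reuse the device buffers between passes
  }
  kernel.get(result);                // copy the result back only when it is really needed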

If you would like to learn more about the Aparapi basics, you should start by reading the Aparapi UsersGuide and the Aparapi FAQs.

Use Case

For the experiment described below a realistic use case has been selected. The Finite Element Method results in large systems of linear equations with a symmetric positive definite coefficient matrix (a bandwidth of about 1:10 is typical).

These systems of linear equations can be solved with the Conjugate Gradient Method (CG). For the Finite Element Method it makes no sense to be too precise, because the simulation error (model simplification, wrong material properties, etc.) will exceed the numerical deviations from an exact solution. However, CG is sensitive to numerical errors caused by loss of precision in the matrix multiplication. This is the reason why you need double precision.

For large systems, the matrix multiplication is the most time-consuming activity in the Conjugate Gradient Method and should be as fast as possible. So, the use case is the matrix multiplication in the Conjugate Gradient Method.
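
To make clear where the time is spent, here is a minimal sketch of the Conjugate Gradient Method on plain arrays (not the article's code; the helper names dot and multiply are illustrative). The only expensive step per iteration is the matrix-vector product, which corresponds to the times(...) methods of the band matrix classes below:

  static double[] conjugateGradient(double[][] A, double[] b, int maxIter, double eps) {
     final int n = b.length;
     final double[] x = new double[n];       // start with x = 0
     final double[] r = b.clone();           // residual r = b - A*x
     final double[] p = r.clone();           // search direction
     double rsOld = dot(r, r);
     for (int iter = 0; iter < maxIter && Math.sqrt(rsOld) > eps; iter++) {
        final double[] Ap = multiply(A, p);  // dominant cost: matrix-vector product
        final double alpha = rsOld / dot(p, Ap);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        final double rsNew = dot(r, r);
        for (int i = 0; i < n; i++) { p[i] = r[i] + (rsNew / rsOld) * p[i]; }
        rsOld = rsNew;
     }
     return x;
  }

  static double dot(double[] a, double[] b) {
     double sum = 0.0;
     for (int i = 0; i < a.length; i++) { sum += a[i] * b[i]; }
     return sum;
  }

  static double[] multiply(double[][] A, double[] p) {
     final double[] y = new double[p.length];
     for (int row = 0; row < A.length; row++) {
        for (int col = 0; col < p.length; col++) { y[row] += A[row][col] * p[col]; }
     }
     return y;
  }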

Band Matrix Multiplication (1st Version)

Last year I developed a simple FEM implementation for mobile devices (see here). In this implementation there is a class called BandMatrix.java which is able to perform a band matrix multiplication and to solve a linear equation system based on the Conjugate Gradient Method. This class was the first version of the experiment. It performs a general matrix multiplication and is not optimized in terms of performance. The multiplication in this version takes two matrices and returns the result as a new matrix, as you may see in Listing 1 (find the source code on GitHub).

Listing 1: Matrix multiplication in class v1.BandMatrix.java

[...]
 private final double[][] values;

 [...] 
 public Matrix times(Matrix b) {
    final Matrix C = new Matrix(rows, 1);
    for (int i = 0; i < rows; i++) {
       for (int k = i - cols; k - i < cols; k++) {
          if (k >= 0 && k < b.rows) {
             C.values[i][0] += getValue(i, k) * b.values[k][0];
          }
       }
    }
    return C;
 }

 [...] 

Band Matrix Multiplication with Optimizations (2nd Version)

Aparapi cannot manage two-dimensional arrays, so some changes are needed to refactor the code for GPU execution. The values are managed in the half band matrix format (just the upper part).

Listing 2: v2.BandMatrix.java works with one dimensional array:

 [...]

 /* The symmetric banded matrix ( '-' indicates zero values):
 * 
 * |  a0  a1  a2   -    -  |
 * |  a1  a3  a4  a5    -  |
 * |  a2  a4  a6  a7   a8  |   
 * |  -   a5  a7  a9   a10 |
 * |  -   -   a8  a10  a11 |
 * 
 * is managed in the half band matrix format (just upper part):
 * 
 * |  a0   a1  a2 |
 * |  a3   a4  a5 |
 * |  a6   a7  a8 |   
 * |  a9   a10  0 |
 * |  a11  0    0 |
 * 
 * is stored as array:
 * 
 * [ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, 0, a11, 0, 0 ] 
 * 
 */
 private final double[] values;
 
 [...]
 public Vector times(final Vector b) {
      final Vector C = new Vector(getMaxRows());
      for (int row = 0; row < getMaxRows(); row++) {
         final int start = Math.max(0, row - getMaxCols() + 1);
         final int end = Math.min(getMaxRows(), row + getMaxCols());

         for (int col = start; col < Math.min(end, row); col++) {
            C.addValue(row, getValue(col, row) * b.getValue(col));
         }

         for (int col = Math.max(start, row); col < end; col++) {
            C.addValue(row, getValue(row, col) * b.getValue(col));
         }
      }
      return C;
 }

 [...]
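
As a side note, the mapping from a matrix coordinate to the one-dimensional half band array could look like the following sketch (an assumption consistent with the layout in the comment above, not the original getValue implementation; getMaxCols() is the width of the upper band):

  /**
   * Expects row <= col < row + getMaxCols(), i.e. a coordinate of the upper band.
   * Example from the comment above: a7 sits at (row=2, col=3) -> values[2 * 3 + 1] = values[7].
   */
  private double getValueUpperBand(final int row, final int col) {
     return values[row * getMaxCols() + (col - row)];
  }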

Band Matrix Multiplication with Manual Fork & Join Implementation (3rd Version)

In this version some further optimizations have been done. The values are now managed internally in the full band matrix format. This simplified the inner loops and increased the performance.

Listing 3: BandMatrixFull.java

[...]

/* The symmetric banded matrix ( '-' indicates zero values):
 * 
 * |  a0  a1  a2   -    -  |
 * |  a1  a3  a4  a5    -  |
 * |  a2  a4  a6  a7   a8  |   
 * |  -   a5  a7  a9   a10 |
 * |  -   -   a8  a10  a11 |
 * 
 * is managed internal in the full band matrix format:
 * 
 * |   -   -    a0   a1    a2 |
 * |   -  a1    a3   a4    a5 |
 * |  a2  a4    a6   a7    a8 |   
 * |  a5  a7    a9   a10    - |
 * |  a8  a10  a11     -    - |
 * 
 * is stored as array:
 * 
 * [ -, -, a0, a1, a2,   -, a1, a3, a4, a5, a2, a4, a6, a7, a8,   a5, a7, a9, a10, -,  a8, a10,  a11, -, - ] 
 * 
 */
 protected final double[] values;

 [...]
 public void times(final Vector b, final Vector result) {

      // prepare input parameter
      final int rowStart = 0;
      final int rowEnd = rows;
      final int colMaximum = cols;
      final int rowMaximum = rows;
      final int bandwidthMid = colMaximum >> 1;

      // execute band matrix multiplication
      int index = 0;
      int rowOffset = 0;
      double sum = 0.0;
      for (int row = rowStart; row < rowEnd; row++) {
         rowOffset = row * colMaximum;
         sum = 0.0;
         for (int col = 0; col < colMaximum; col++) {
            index = row - bandwidthMid + col;
            if (index < rowMaximum && index >= 0) {
               sum += values[col + rowOffset] * b.values[index];
            }
         }
         result.values[row] = sum;
      }
 }

 [...]
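
The index arithmetic of the inner loop above can also be read the other way round: an element (row, col) of the original matrix sits at values[row * cols + (col - row + bandwidthMid)]. A hedged sketch of such an accessor (not part of the original class) could be:

  /**
   * Expects |col - row| <= cols >> 1, i.e. a coordinate inside the band.
   * Example from the comment above: a7 sits at (row=2, col=3) -> values[2 * 5 + 3] = values[13].
   */
  protected double getValueFullBand(final int row, final int col) {
     final int bandwidthMid = cols >> 1;   // cols is the (odd) bandwidth
     return values[row * cols + (col - row + bandwidthMid)];
  }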

The class BandMatrixMultiplicatonTask can be used to run the multiplication in multithreaded mode.


Listing 4: BandMatrixMultiplicatonTask.java

public class BandMatrixMultiplicatonTask extends RecursiveTask<Long> {

 [...]
 @Override
 public Long compute() {
 
    if (rowEnd - rowStart < b.getMaxRows() / Parameter.NUMBER_OF_POCESSORS) {
       calculateMatrixMultiplication();
    } else {

       final int mid = (rowEnd + rowStart) >> 1;
       final BandMatrixMultiplicatonTask firstWorker = 
                new BandMatrixMultiplicatonTask(rowStart, mid, 
                valuesMatrix, b, result);
       firstWorker.fork();
       final BandMatrixMultiplicatonTask secondWorker = 
                new BandMatrixMultiplicatonTask(mid, rowEnd, 
                valuesMatrix, b, result);
       secondWorker.compute();
       firstWorker.join();
    }
    return 0L;
 }

 private void calculateMatrixMultiplication() {
      // prepare input parameter
      final int bandwidthMid = colMaximum >> 1;
      final int rowMaximum = b.values.length;
      final double[] values = valuesMatrix.values;

      // execute band matrix multiplication
      int index;
      int rowOffset;
      double sum;
      for (int row = rowStart; row < rowEnd; row++) {
         rowOffset = row * colMaximum;
         sum = 0.0;
         for (int col = 0; col < colMaximum; col++) {
            index = row - bandwidthMid + col;
            if (index < rowMaximum && index >= 0) {
               sum += values[col + rowOffset] * b.values[index];
            }
         }
         result.values[row] = sum;
      }
   }

   [...]
}
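
A usage sketch could look like the following (the constructor arguments follow Listing 4; the field name matrix.rows and the pool handling are assumptions, not the article's driver code):

  // somewhere in the solver code (sketch); needs java.util.concurrent.ForkJoinPool
  final ForkJoinPool pool = new ForkJoinPool(Parameter.NUMBER_OF_POCESSORS);
  pool.invoke(new BandMatrixMultiplicatonTask(0, matrix.rows, matrix, b, result));
  // result is filled in place, as in calculateMatrixMultiplication() above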

Band Matrix Multiplication with GPU Implementation Based on Aparapi (4th Version)

This version of the matrix multiplication is a kind of weird attempt to simulate double precision floating point operations with long values. The motivation for emulating double precision is the missing FP64 support on many graphics devices.

Listing 5: BandMatrixMultiplicatonAparapi.java

[...]

public class BandMatrixMultiplicatonAparapi extends Kernel {

   long[] vectorB = null;
   long[] matrixA = null;
   long[] vectorX = null;

   int[] bandwidthMid = new int[1];
   int[] colMaximum = new int[1];
   int[] rowMaximum = new int[1];

   long[] POW_10_LONG = new long[100];

   public BandMatrixMultiplicatonAparapi() {
      setExplicit(true);
      POW_10_LONG[0] = 1;
      for (int i = 1; i < 100; i++) {
         if (i < 19) {
            POW_10_LONG[i] = 10 * POW_10_LONG[i - 1];
         } else {
            POW_10_LONG[i] = POW_10_LONG[18];
         }
      }
      this.put(POW_10_LONG);
   }

   @Override
   public void run() {

      // prepare input parameter
      final int row = getGlobalId();
      final int rowOffset = row * colMaximum[0];

      // execute band matrix multiplication (for one row)
      long sum = 0L;
      int index = 0;
      for (int col = 0; col < colMaximum[0]; col++) {
         index = row - bandwidthMid[0] + col;
         if (index < rowMaximum[0] && index >= 0) {
            sum = addPacked(sum, 
                            multiplyPacked(matrixA[col + rowOffset], vectorB[index]));
         }
      }
      vectorX[row] = sum;
   }

   public void setMatrixA(long[] values, final int rowNumber, final int bandwidth) {
      if (null == matrixA) {
         matrixA = values;
      }
      vectorB = new long[rowNumber];
      vectorX = new long[rowNumber];

      bandwidthMid[0] = bandwidth >> 1;
      colMaximum[0] = bandwidth;
      rowMaximum[0] = rowNumber;
   }

   public void setVectorB(long[] values) {
      vectorB = values;
      this.put(vectorB);
   }

   public void putVectorB() {
      this.put(vectorB);
   }

   public void setVectorX(long[] values) {
      vectorX = values;
      this.put(vectorX);
   }

   public void getVectorX() {
      this.get(vectorX);
   }

   /**
    * Two digits are reserved for exponent [0..99] 
    */
   private static final int MIN_EXP = -49;

   /**
    * Used to split the double value to mantissa and exponent. 
    * The mantissa scaled to use maximal 17 digits. These are 
    * MANTISSA_DIGITS plus 1 for overflows in add operation
    */
   private static final long SPLIT_EXP = 100000000000000000L;

   /**
    * Used to split the mantissa to a higher and lower integer. 
    */
   private static final long SPLIT_INT = 100000000L;

   public long multiplyPacked(long multiplicand, long multiplier) {

      final long md_mantissa = multiplicand - multiplicand / SPLIT_EXP * SPLIT_EXP;
      final long md_hi = md_mantissa / SPLIT_INT;
      final long md_lo = md_mantissa % SPLIT_INT;

      final long mr_mantissa = multiplier - multiplier / SPLIT_EXP * SPLIT_EXP;
      final long mr_hi = mr_mantissa / SPLIT_INT;
      final long mr_lo = mr_mantissa % SPLIT_INT;

      final long product_mantissa = md_hi * mr_hi 
                                  + md_lo * mr_hi / SPLIT_INT + md_hi * mr_lo / SPLIT_INT;
      final long product_exponent = (multiplicand >> 63 | -multiplicand >>> 63) *
                                    (multiplicand / SPLIT_EXP)
            + (multiplier >> 63 | -multiplier >>> 63) 
            * (multiplier / SPLIT_EXP) + 2 * MIN_EXP + 1;

      return (product_exponent - MIN_EXP) * SPLIT_EXP * (product_mantissa >> 63 
             | -product_mantissa >>> 63)
             + product_mantissa;
   }

   public long addPacked(long augend, long addend) {

      long augend_exponent = ((augend >> 63 
                              | -augend >>> 63) * (augend / SPLIT_EXP) + MIN_EXP);
      long addend_exponent = ((addend >> 63 
                              | -addend >>> 63) * (addend / SPLIT_EXP) + MIN_EXP);

      if (augend_exponent < addend_exponent) {

         // Swap values
         augend = augend ^ addend;
         addend = addend ^ augend;
         augend = augend ^ addend;

         final long value = augend / SPLIT_EXP;
         augend_exponent = (value * (value >> 63 | -value >>> 63)) + MIN_EXP;
         final long value1 = addend / SPLIT_EXP;
         addend_exponent = (value1 * (value1 >> 63 | -value1 >>> 63)) + MIN_EXP;

         final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
         final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
         final long sum_mantissa = augend_mantissa + addend_mantissa
                  / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
         return (augend_exponent - MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63 
                | - sum_mantissa >>> 63) 
                  + sum_mantissa;
      } else {
         final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
         final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
         final long sum_mantissa = augend_mantissa + addend_mantissa
                  / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
         return (augend_exponent - MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63 
                | - sum_mantissa >>> 63) 
                  + sum_mantissa;
      }
   }
}
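
A driver for this kernel could look roughly like the sketch below (an assumption based on the setters in Listing 5 and on Aparapi's explicit buffer management; the packing of the double values into longs happens elsewhere, e.g. in PackedDouble.java, and is not shown; the variable names packedMatrixValues, packedVectorB, rowNumber and bandwidth are illustrative):

  // sketch of one matrix-vector multiplication with the kernel above
  final BandMatrixMultiplicatonAparapi kernel = new BandMatrixMultiplicatonAparapi();
  kernel.setMatrixA(packedMatrixValues, rowNumber, bandwidth);  // also allocates vectorB and vectorX
  kernel.put(kernel.matrixA);                                   // explicit transfer of the matrix
  kernel.setVectorB(packedVectorB);                             // setter already calls put(...)
  kernel.setVectorX(new long[rowNumber]);                       // setter already calls put(...)
  kernel.execute(Range.create(rowNumber));                      // one work item per matrix row
  kernel.getVectorX();                                          // copy the packed result back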

Measurements

Running the test cases (just call run_x86_64.cmd) should create a file called run_x86_64_output.txt with the following content:

#rows  #cols    v1     v2    v3a   v3b     v4a    v4b    v4c

128    101      71     37     21    36     168   2010    323
256    101     249     47     35    52     687   1408    684
384    101     624    113     88   104    1701   1780   1313
512    101    1212    227    153   119    3187   2684   1898
640    101    1863    340    233   151    5022   3135   2518
768    101    2589    476    338   196   11714   6401   5719
896    101    4178    800    527   289   11993   5552   5331
1024   101    8189   1564   1065   556   14325   5323   6018
1152   101    6268   1141    802   415   20333   6387   9050
1280   101    7756   1406    971   479   21357   6589   8975
1408   101   13574   2494   1726   898   25982   8166  10712
1536   101   11509   2205   1482   677   31015   9111  12386
1664   101   13856   2543   1769   769   36230  17912  14983
1792   101   16863   3150   2107   943   43660  12440  16955
1920   101   18811   3366   2249   991   48567  15142  17700
2048   101   22025   4161   2933  1165   72349  21758  26145
2176   101   23896   4368   2860  1209   66038  22883  23522
2304   101   26992   5053   3292  1365   70734  36415  25187
2432   101   30172   5675   3818  1622   79120  26254  28384
2560   101   33600   6554   4248  1698  158546  81976  57219
2688   101   36507   6978   4629  1802   97684  32772  36593
2816   101   79837  15269  10169  4443  185933  89416  66535
2944   101   43993   8484   5506  2142  114659  32969  53044
3072   101   48391   9562   6246  2549  206562  75176  81600
3200   101   52567   9979   6730  2610  213407  63704  80113
3328   101   57454  11042   7440  2980  199818  61465  78953
3456   101   62078  11850   8369  3370  200175  58826  77425
3584   101   97655  18452  12163  4961  171394  51921  65358
3712   101  130234  24823  16679  7827  239991  69402  92992

where:

  • v1 - the initial, not optimized (single-thread) version
  • v2 - the optimized (single-thread) version with a vector class instead of a matrix class and the half band matrix format
  • v3a - the most optimized (single-thread) version with the full band matrix format
  • v3b - the most optimized (fork & join, multi-threaded) version with the full band matrix format
  • v4a - based on v3a, emulates floating point operations with PackedDouble.java
  • v4b - based on v3a, emulates floating point operations and runs with Aparapi in GPU mode
  • v4c - based on v3a, emulates floating point operations and runs with Aparapi in JTP mode

Figure 1: Conjugate Gradient with Random Numbers

The manually implemented Fork & Join version (v3b) outperforms all other implementations (> 500 rows). Aparapi's JTP fallback version was so slow that I decided to remove it from my further investigations. Feel free to implement it and experiment with the group size to change the number of used threads (the default number is much too high, and obviously there is no proper thread pooling, because the number of threads oscillates during repeated execution). Maybe I'm just too stupid to understand why Aparapi's JTP fallback has been implemented in this way.

The ratio between the CPU version (v4a) and the GPU version (v4b) is about 3:1. This was the speedup achieved by running on the GPU of a non-high-performance graphics device with Aparapi (on a high-performance device the possible speedup factor should be somewhere between 20 and 40).

The ratio between the CPU versions (v1 and v3b) is about 20:1. This was the speedup achieved by an optimized implementation without Aparapi & GPU. This test has been done on a PC with 4 available CPU cores; for more cores the expected speedup should be higher.

Lessons Learned

Most Graphics Devices Have No FP64 Support

This is the most serious issue with GPU programming. Only high-end devices have FP64 (native double) support. On standard devices (like my notebook PC) you will get a warning like:

Warning: Reverting to Java Thread Pool (JTP) for class [...]: FP64 required but not supported

and your GPU code will be executed in fallback mode (Java Thread Pool).

Aparapi Fallback to JTP Mode is Slow

In case you implement GPU code with Aparapi that needs FP64 and you don't have a graphics device with native support, the code will be executed in JTP mode. Unless you set the group size in the Range to a reasonable number of threads, the JTP mode is extremely slow (because too many threads are used). Compared with the fork & join implementation the JTP mode is very inefficient. A possible tuning is sketched below.
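
A hedged sketch of this tuning, assuming one work item per matrix row; the variable rowNumber and the concrete group size value are assumptions and have to be found by experiment:

  final int globalSize = rowNumber;                     // one work item per matrix row
  final int groupSize = 64;                             // candidate value, must divide globalSize
  kernel.execute(Range.create(globalSize, groupSize));  // the group size controls the number of Java threads in JTP mode (see text above)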

Optimize CPU Code First

Comparing the non-GPU versions of the band matrix implementation shows that some simple actions may result in a significant performance improvement on the CPU. Between v1 (initial version) and v3b (optimized, manual fork & join) there is a ratio of 20:1 on a standard CPU.

You Have to Decide the Target Platform

In some cases Aparapi code also works on NVIDIA and/or Intel graphics devices, but some of my tests with different hardware showed that this is by far not guaranteed. Less than 50% of the machines were able to execute the code without additional work (e.g. changing OpenCL.dll).

Not all Algorithms Result in Better Performance

Be aware that algorithms which need a lot of logic operations and/or data transfer to/from the GPU tend to be slower than expected. In general, GPUs are not made for complex logic, and quite often the data transfer is a real bottleneck.

Your Notebook PC May Overheat Under Heavy GPU Load

Running all the test cases caused the following screen on my notebook:

The system BIOS has detected your notebook PC was put into hibernation to protect it
from overheating while it was operating normally.

Overheating may occur if the cooling vents are blocked or the operating temperature
exceeds the specification. The notebook PC should return to normal operation once the
situation is resolved.

System Temperature (90D) 

ENTER - Continue Startup

For more information, please visit: www.hp.com\go\techcenter\startup


Obviously this hardware is not made for heavy GPU load (there is a reason why high-end graphics cards have all those fans and cooling devices).

Recommendations

  • First have a look at your algorithm. Maybe there is a lot to optimize before you need to start with GPU programming.
  • Java is maybe not the best choice for GPU programming, because you lose the "write once, run everywhere" benefit of Java, and with the typical Java bindings you don't have full control over your solution.
  • If you need more computational power, you may decide to use a high-end graphics device and think about using C/C++ and/or OpenCL directly.
  • Don't try to emulate double precision on a GPU - the needed overhead is so large that all possible performance benefits will be cancelled out.
  • Nevertheless, spend time with GPU programming - because massively parallel programming is the model of the future. Even in the case you are not successful, you will learn some important lessons.