
Lessons Learned from GPU Experiments with Aparapi

Most of my experiments with new technologies are successful and somehow encouraging. In the following you can read about some experiments with Aparapi for GPU programming. The results of these experiments have not been satisfying, but they were nevertheless very insightful. The lessons learned and recommendations sections of this article summarize the conclusions.

GPU programming makes sense if you write solutions for dedicated high-performance graphics hardware, not for standard devices. Java is maybe not the best choice for this, because you lose the "write once, run anywhere" benefit of Java.

The key problem today is the missing FP64 (native double precision) support on standard graphics devices. Double precision is mandatory for most real-world scientific computing use cases.

Introduction

General-Purpose Computing on Graphics Processing Units (GPGPU)

The use of a graphics processing unit (GPU) for general-purpose computations is called General-Purpose Computing on Graphics Processing Units. Compared with computation on central processing units (CPUs), graphics cards offer a significantly higher number of cores and more raw calculation power.

A graphics processing unit consists of hundreds of cores. These cores are restricted in the operations they can perform, but they are designed to handle many tasks in parallel. Compared with the small number of cores in a classical central processing unit, the many cores of a graphics processing unit provide a large potential for parallelization in scientific computing.

At the moment there are two competing approaches to GPU-accelerated programming:
  • Open Computing Language by Khronos Group - "OpenCL™ is the first open, royalty-free standard for
    cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. [...]"  

    (see http://www.khronos.org/opencl)

  • Parallel Computing Platform by NVIDIA - "CUDA™ is a parallel computing platform and programming
    model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the
    power of the graphics processing unit (GPU). [...]"  

    (see http://www.nvidia.com/object/cuda_home_new.html)

Java Bindings for CUDA and OpenCL

The following implementations of Java bindings are just a personal selection (there are more available):

CUDA related:
  • Rootbeer - "With Rootbeer a developer can run complex Java objects on a GPU. Rootbeer (de)serializes object fields and generates CUDA code for execution in parallel on a GPU. [...] An nvidia gpu is required and CUDA must be installed"

  • JCUDA - "The JCuda Runtime API is mainly intended for the interaction with the Java bindings of the CUDA Runtime libraries, like JCublas and JCufft. A Java programmer who only wants to use these libraries and does not want to create own CUDA kernels can use the JCuda Runtime API for the interaction with these libraries." 
OpenCL related:
  • Aparapi - "Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java byte code to OpenCL at runtime and executing on the GPU, if for any reason Aparapi can't execute on the GPU it will execute in a Java thread pool." (see http://aparapi.github.io)

  • JOCL - "This library offers Java-Bindings for OpenCL that are very similar to the original OpenCL API. The functions are provided as static methods, and semantics and signatures of these methods have been kept consistent with the original library functions, except for the language-specific limitations of Java." 
All these bindings have individual pros and cons. I decided to use Aparapi for my experiments, because my notebook PC has an AMD graphics card.

Aparapi Basics

This API has some interesting concepts and is in principle independent of the graphics card vendor (because it is based on OpenCL). It was initially developed by AMD and is now open source. Aparapi generates OpenCL from Java code and offers a higher-level abstraction of OpenCL (so that less boilerplate code is needed).

On the Aparapi GitHub site you may find the following minimal sample of how to refactor a simple loop:

 
 final float inA[] = .... // get a float array of data from somewhere
 final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
 final float result[] = new float[inA.length];

 for (int i = 0; i < result.length; i++) {
    result[i] = inA[i] + inB[i];
 }

After refactoring with Aparapi, the for-loop may run on the GPU:

 
 final float inA[] = .... // get a float array of data from somewhere
 final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
 final float result[] = new float[inA.length];

 Kernel kernel = new Kernel(){
    @Override public void run(){
       int i = getGlobalId();
       result[i] = inA[i] + inB[i];
    }
 };

 Range range = Range.create(result.length);
 kernel.execute(range);

If you would like to learn more about the Aparapi basics, you should start by reading the Aparapi UsersGuide and the Aparapi FAQs.

Use Case

For the experiment described below a realistic use case has been selected. The Finite Element Method results in large systems of linear equations with a symmetric positive definite coefficient matrix (a bandwidth of about 1:10 of the matrix size is typical).

These systems of linear equations can be solved with the Conjugate Gradient Method (CG). For the Finite Element Method it makes no sense to be too precise, because the simulation error (model simplification, wrong material properties, etc.) will exceed the numerical deviation from an exact solution. CG, however, is sensitive to numerical errors caused by loss of precision in the matrix multiplication. This is the reason why you need double precision.

For large systems the matrix multiplication is the most time consuming part of the Conjugate Gradient Method, so it should be as fast as possible. The use case is therefore the matrix multiplication within the Conjugate Gradient Method.
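
For reference, the CG iteration itself is simple. The following textbook-style sketch (an illustration only, not the implementation used in this article; the MatrixVectorProduct interface and all names are placeholders) shows that the band matrix-vector product is the only expensive operation per iteration:

 // Textbook Conjugate Gradient sketch for a symmetric positive definite matrix A.
 public class ConjugateGradientSketch {

    public interface MatrixVectorProduct {
       void times(double[] x, double[] result); // result = A * x
    }

    public static double[] solve(MatrixVectorProduct A, double[] b, int maxIterations, double tolerance) {
       final int n = b.length;
       final double[] x = new double[n];  // start with x = 0
       final double[] r = b.clone();      // residual r = b - A*x = b
       final double[] p = b.clone();      // initial search direction
       final double[] ap = new double[n];
       double rsOld = dot(r, r);

       for (int iter = 0; iter < maxIterations && rsOld > tolerance * tolerance; iter++) {
          A.times(p, ap);                 // the expensive band matrix multiplication
          final double alpha = rsOld / dot(p, ap);
          for (int i = 0; i < n; i++) {
             x[i] += alpha * p[i];
             r[i] -= alpha * ap[i];
          }
          final double rsNew = dot(r, r);
          final double beta = rsNew / rsOld;
          for (int i = 0; i < n; i++) {
             p[i] = r[i] + beta * p[i];
          }
          rsOld = rsNew;
       }
       return x;
    }

    private static double dot(double[] a, double[] b) {
       double sum = 0.0;
       for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
       }
       return sum;
    }
 }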

Band Matrix Multiplication (1st Version) 

Last year I developed a simple FEM implementation for mobile devices (see here). In this implementation there is a class called BandMatrix.java which performs a band matrix multiplication and solves a linear equation system with the Conjugate Gradient Method. This class was the first version of the experiment. It performs a general matrix multiplication and is not optimized in terms of performance. The multiplication in this version takes two matrices and returns the result as a new matrix, as you may see in Listing 1 (find the source code on GitHub).

Listing 1: Matrix multiplication in class v1.BandMatrix.java

 [...]
 private final double[][] values;

 [...] 
 public Matrix times(Matrix b) {
     final Matrix C = new Matrix(rows, 1);
     for (int i = 0; i < rows; i++) {
        for (int k = i - cols; k - i < cols; k++) {
           if (k >= 0 && k < b.rows) {
              C.values[i][0] += getValue(i, k) * b.values[k][0];
           }
        }
     }
     return C;
}

 [...] 

Band Matrix Multiplication with Optimizations (2nd Version) 

Aparapi cannot handle two-dimensional arrays, so some changes are needed to refactor the code for GPU execution. The values are managed in the half band matrix format (just the upper part).

Listing 2: v2.BandMatrix.java works with a one-dimensional array

 [...]

 /* The symmetric banded matrix ( '-' indicates zero values):
 * 
 * |  a0  a1  a2   -    -  |
 * |  a1  a3  a4  a5    -  |
 * |  a2  a4  a6  a7   a8  |   
 * |  -   a5  a7  a9   a10 |
 * |  -   -   a8  a10  a11 |
 * 
 * is managed in the half band matrix format (just upper part):
 * 
 * |  a0   a1  a2 |
 * |  a3   a4  a5 |
 * |  a6   a7  a8 |   
 * |  a9   a10  0 |
 * |  a11  0    0 |
 * 
 * is stored as array:
 * 
 * [ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, 0, a11, 0, 0 ] 
 * 
 */
 private final double[] values;
 
 [...]
 public Vector times(final Vector b) {
      final Vector C = new Vector(getMaxRows());
      for (int row = 0; row < getMaxRows(); row++) {
         final int start = Math.max(0, row - getMaxCols() + 1);
         final int end = Math.min(getMaxRows(), row + getMaxCols());

         for (int col = start; col < Math.min(end, row); col++) {
            C.addValue(row, getValue(col, row) * b.getValue(col));
         }

         for (int col = Math.max(start, row); col < end; col++) {
            C.addValue(row, getValue(row, col) * b.getValue(col));
         }
      }
      return C;
 }

 [...]
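
The mapping from matrix coordinates to the one-dimensional half band array is then presumably something like the following (inferred from the documented storage layout above, not copied from the repository; it assumes col >= row, which times() guarantees by swapping the arguments for the lower triangle):

 // Possible index mapping for the half band storage described above.
 // Example: getValue(1, 3) returns values[1 * 3 + 2], which is a5 in the layout comment.
 public double getValue(final int row, final int col) {
    return values[row * getMaxCols() + (col - row)];
 }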

Band Matrix Multiplication with Manual Fork & Join Implementation (3rd Version) 

In this version some further optimizations have been done. The values are now managed internally in the full band matrix format. This simplifies the inner loops and increases the performance.

Listing 3: BandMatrixFull.java

 [...]

/* The symmetric banded matrix ( '-' indicates zero values):
 * 
 * |  a0  a1  a2   -    -  |
 * |  a1  a3  a4  a5    -  |
 * |  a2  a4  a6  a7   a8  |   
 * |  -   a5  a7  a9   a10 |
 * |  -   -   a8  a10  a11 |
 * 
 * is managed internal in the full band matrix format:
 * 
 * |   -   -    a0   a1    a2 |
 * |   -  a1    a3   a4    a5 |
 * |  a2  a4    a6   a7    a8 |   
 * |  a5  a7    a9   a10    - |
 * |  a8  a10  a11     -    - |
 * 
 * is stored as array:
 * 
 * [ -, -, a0, a1, a2,   -, a1, a3, a4, a5, a2, a4, a6, a7, a8,   a5, a7, a9, a10, -,  a8, a10,  a11, -, - ] 
 * 
 */
 protected final double[] values;

 [...]
 public void times(final Vector b, final Vector result) {

      // prepare input parameter
      final int rowStart = 0;
      final int rowEnd = rows;
      final int colMaximum = cols;
      final int rowMaximum = rows;
      final int bandwidthMid = colMaximum >> 1;

      // execute band matrix multiplication
      int index = 0;
      int rowOffset = 0;
      double sum = 0.0;
      for (int row = rowStart; row < rowEnd; row++) {
         rowOffset = row * colMaximum;
         sum = 0.0;
         for (int col = 0; col < colMaximum; col++) {
            index = row - bandwidthMid + col;
            if (index < rowMaximum && index >= 0) {
               sum += values[col + rowOffset] * b.values[index];
            }
         }
         result.values[row] = sum;
      }
 }

 [...]

The class BandMatrixMultiplicatonTask can be used to run the multiplication in multi-threaded mode.

Listing 4: BandMatrixMultiplicatonTask.java

public class BandMatrixMultiplicatonTask extends RecursiveTask<Long> {

     [...]
 @Override
 public Long compute() {
 
    if (rowEnd - rowStart < b.getMaxRows() / Parameter.NUMBER_OF_POCESSORS) {
       calculateMatrixMultiplication();
    } else {

       final int mid = (rowEnd + rowStart) >> 1;
       final BandMatrixMultiplicatonTask firstWorker = new BandMatrixMultiplicatonTask(rowStart, mid, 
                valuesMatrix, b, result);
       firstWorker.fork();
       final BandMatrixMultiplicatonTask secondWorker = new BandMatrixMultiplicatonTask(mid, rowEnd, 
                valuesMatrix, b, result);
       secondWorker.compute();
       firstWorker.join();
    }
    return 0L;
 }

 private void calculateMatrixMultiplication() {
      // prepare input parameter
      final int bandwidthMid = colMaximum >> 1;
      final int rowMaximum = b.values.length;
      final double[] values = valuesMatrix.values;

      // execute band matrix multiplication
      int index;
      int rowOffset;
      double sum;
      for (int row = rowStart; row < rowEnd; row++) {
         rowOffset = row * colMaximum;
         sum = 0.0;
         for (int col = 0; col < colMaximum; col++) {
            index = row - bandwidthMid + col;
            if (index < rowMaximum && index >= 0) {
               sum += values[col + rowOffset] * b.values[index];
            }
         }
         result.values[row] = sum;
      }
}

   [...]
}
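
A possible invocation of this task (the constructor arguments are inferred from the recursive calls in compute(); this snippet is an illustration and not copied from the repository):

 // Split the row range recursively and run the partial multiplications in parallel.
 // ForkJoinPool comes from java.util.concurrent.
 final ForkJoinPool pool = new ForkJoinPool();
 pool.invoke(new BandMatrixMultiplicatonTask(0, b.getMaxRows(), valuesMatrix, b, result));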

Band Matrix Multiplication with GPU Implementation Based on Aparapi (4th Version)

This version of the matrix multiplication is an admittedly strange attempt to emulate double precision floating point operations with long values. The motivation to emulate double precision is the missing FP64 support on many graphics devices.

Listing 5: BandMatrixMultiplicatonAparapi.java

 
 [...]

public class BandMatrixMultiplicatonAparapi extends Kernel {

   long[] vectorB = null;
   long[] matrixA = null;
   long[] vectorX = null;

   int[] bandwidthMid = new int[1];
   int[] colMaximum = new int[1];
   int[] rowMaximum = new int[1];

   long[] POW_10_LONG = new long[100];

   public BandMatrixMultiplicatonAparapi() {
      setExplicit(true);
      POW_10_LONG[0] = 1;
      for (int i = 1; i < 100; i++) {
         if (i < 19) {
            POW_10_LONG[i] = 10 * POW_10_LONG[i - 1];
         } else {
            POW_10_LONG[i] = POW_10_LONG[18];
         }
      }
      this.put(POW_10_LONG);
   }

   @Override
   public void run() {

      // prepare input parameter
      final int row = getGlobalId();
      final int rowOffset = row * colMaximum[0];

      // execute band matrix multiplication (for one row)
      long sum = 0L;
      int index = 0;
      for (int col = 0; col < colMaximum[0]; col++) {
         index = row - bandwidthMid[0] + col;
         if (index < rowMaximum[0] && index >= 0) {
            sum = addPacked(sum, multiplyPacked(matrixA[col + rowOffset], vectorB[index]));
         }
      }
      vectorX[row] = sum;
   }

   public void setMatrixA(long[] values, final int rowNumber, final int bandwidth) {
      if (null == matrixA) {
         matrixA = values;
      }
      vectorB = new long[rowNumber];
      vectorX = new long[rowNumber];

      bandwidthMid[0] = bandwidth >> 1;
      colMaximum[0] = bandwidth;
      rowMaximum[0] = rowNumber;
   }

   public void setVectorB(long[] values) {
      vectorB = values;
      this.put(vectorB);
   }

   public void putVectorB() {
      this.put(vectorB);
   }

   public void setVectorX(long[] values) {
      vectorX = values;
      this.put(vectorX);
   }

   public void getVectorX() {
      this.get(vectorX);
   }

   /**
    * Two digits are reserved for exponent [0..99] 
    */
   private static final int MIN_EXP = -49;

   /**
    * Used to split the double value to mantissa and exponent. 
    * The mantissa scaled to use maximal 17 digits. These are 
    * MANTISSA_DIGITS plus 1 for overflows in add operation
    */
   private static final long SPLIT_EXP = 100000000000000000L;

   /**
    * Used to split the mantissa to a higher and lower integer. 
    */
   private static final long SPLIT_INT = 100000000L;

   public long multiplyPacked(long multiplicand, long multiplier) {

      final long md_mantissa = multiplicand - multiplicand / SPLIT_EXP * SPLIT_EXP;
      final long md_hi = md_mantissa / SPLIT_INT;
      final long md_lo = md_mantissa % SPLIT_INT;

      final long mr_mantissa = multiplier - multiplier / SPLIT_EXP * SPLIT_EXP;
      final long mr_hi = mr_mantissa / SPLIT_INT;
      final long mr_lo = mr_mantissa % SPLIT_INT;

      final long product_mantissa = md_hi * mr_hi + md_lo * mr_hi / SPLIT_INT + md_hi * mr_lo / SPLIT_INT;
      final long product_exponent = (multiplicand >> 63 | -multiplicand >>> 63) * (multiplicand / SPLIT_EXP)
            + (multiplier >> 63 | -multiplier >>> 63) * (multiplier / SPLIT_EXP) + 2 * MIN_EXP + 1;

      return (product_exponent - MIN_EXP) * SPLIT_EXP * (product_mantissa >> 63 | -product_mantissa >>> 63)
            + product_mantissa;
   }

   public long addPacked(long augend, long addend) {

      long augend_exponent = ((augend >> 63 | -augend >>> 63) * (augend / SPLIT_EXP) + MIN_EXP);
      long addend_exponent = ((addend >> 63 | -addend >>> 63) * (addend / SPLIT_EXP) + MIN_EXP);

      if (augend_exponent < addend_exponent) {

         // Swap values
         augend = augend ^ addend;
         addend = addend ^ augend;
         augend = augend ^ addend;

         final long value = augend / SPLIT_EXP;
         augend_exponent = (value * (value >> 63 | -value >>> 63)) + MIN_EXP;
         final long value1 = addend / SPLIT_EXP;
         addend_exponent = (value1 * (value1 >> 63 | -value1 >>> 63)) + MIN_EXP;

         final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
         final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
         final long sum_mantissa = augend_mantissa + addend_mantissa
                  / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
         return (augend_exponent - MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63 | -sum_mantissa >>> 63) 
                  + sum_mantissa;
      } else {
         final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
         final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
         final long sum_mantissa = augend_mantissa + addend_mantissa
                  / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
         return (augend_exponent - MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63 | -sum_mantissa >>> 63) 
                 + sum_mantissa;
      }
   }
}
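
A driver for this kernel might look like the following sketch. The method names are taken from the listing above; the variable names, the packed input data and the surrounding code are assumptions (the packing itself is assumed to be done by PackedDouble.java). Range is Aparapi's com.aparapi.Range (com.amd.aparapi.Range in older versions):

 // Hypothetical driver for one packed matrix-vector product with the kernel above.
 final BandMatrixMultiplicatonAparapi kernel = new BandMatrixMultiplicatonAparapi();
 final long[] packedResult = new long[rowNumber];

 kernel.setMatrixA(packedMatrixA, rowNumber, bandwidth);
 kernel.put(packedMatrixA);            // explicit transfer mode: copy the matrix to the device once
 kernel.setVectorB(packedVectorB);     // copies vector B to the device
 kernel.setVectorX(packedResult);      // copies the (empty) result buffer to the device

 kernel.execute(Range.create(rowNumber)); // one work item per matrix row
 kernel.getVectorX();                     // packedResult now holds the packed result values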

Measurements 

Running the test cases (just call run_x86_64.cmd) should create a file called run_x86_64_output.txt with the following content:

#rows	#cols	v1	v2	v3a	v3b	v4a	v4b	v4c

128	101	71	37	21	36	168	2010	323
256	101	249	47	35	52	687	1408	684
384	101	624	113	88	104	1701	1780	1313
512	101	1212	227	153	119	3187	2684	1898
640	101	1863	340	233	151	5022	3135	2518
768	101	2589	476	338	196	11714	6401	5719
896	101	4178	800	527	289	11993	5552	5331
1024	101	8189	1564	1065	556	14325	5323	6018
1152	101	6268	1141	802	415	20333	6387	9050
1280	101	7756	1406	971	479	21357	6589	8975
1408	101	13574	2494	1726	898	25982	8166	10712
1536	101	11509	2205	1482	677	31015	9111	12386
1664	101	13856	2543	1769	769	36230	17912	14983
1792	101	16863	3150	2107	943	43660	12440	16955
1920	101	18811	3366	2249	991	48567	15142	17700
2048	101	22025	4161	2933	1165	72349	21758	26145
2176	101	23896	4368	2860	1209	66038	22883	23522
2304	101	26992	5053	3292	1365	70734	36415	25187
2432	101	30172	5675	3818	1622	79120	26254	28384
2560	101	33600	6554	4248	1698	158546	81976	57219
2688	101	36507	6978	4629	1802	97684	32772	36593
2816	101	79837	15269	10169	4443	185933	89416	66535
2944	101	43993	8484	5506	2142	114659	32969	53044
3072	101	48391	9562	6246	2549	206562	75176	81600
3200	101	52567	9979	6730	2610	213407	63704	80113
3328	101	57454	11042	7440	2980	199818	61465	78953
3456	101	62078	11850	8369	3370	200175	58826	77425
3584	101	97655	18452	12163	4961	171394	51921	65358
3712	101	130234	24823	16679	7827	239991	69402	92992

where: 
  • v1 - is the initial not optimized (single thread) version 
  • v2 - is the optimized (single thread) version with vector class instead of matrix class and half band matrix format 
  • v3a - is the most optimized (single thread) version with full band matrix format 
  • v3b - is the most optimized (fork & join multi-threaded) version with full band matrix format 
  • v4a - is based on v3a, emulates floating point operations with PackedDouble.java 
  • v4b - is based on v3a, emulates floating point operations and runs with Aparapi in GPU mode 
  • v4c - is based on v3a, emulates floating point operations and runs with Aparapi in JTP mode 


Figure 1: Conjugate Gradient with Random Numbers


The manually implemented Fork & Join version (v3b) outperforms all other implementations (> 500 rows). Aparapi's JTP fallback version was so slow that I decided to remove it from my further investigations. Feel free to implement it and experiment with the group size to change the number of used threads (the default number is much too high, and obviously no proper thread pooling is used, because the number of threads oscillates during repeated execution). Maybe I am just not able to understand why Aparapi's JTP fallback has been implemented in this way.

The ratio between the CPU version (v4a) and the GPU version (v4b) is about 3:1. This is the speedup achieved by running on the GPU of a non-high-performance graphics device with Aparapi (on a high-performance device the possible speedup factor should be somewhere between 20 and 40).

The ratio between the CPU versions (v1 and v3b) is about 20:1. This is the speedup achieved by an optimized implementation without Aparapi and GPU. This test has been done on a PC with 4 available CPU cores; with more cores the expected speedup should be higher.

Lessons Learned

Most Graphic Devices Have No FP64 Support

This is the most serious issue with GPU programming. Only high-end devices have FP64 (native double) support. On standard devices (like my notebook PC) you will get a warning like:


Warning: Reverting to Java Thread Pool (JTP) for class [...]: FP64 required but not supported

and your GPU code will be executed in fallback mode (Java Thread Pool).

Aparapi Fallback to JTP Mode is Slow

If you implement GPU code with Aparapi that needs FP64 and you don't have a graphics device with native support, the code will be executed in JTP mode. Unless you set the group size in the Range to a reasonable number of threads, the JTP mode is extremely slow (because too many threads are used). Compared with the fork & join implementation the JTP mode is very inefficient.
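
If you have to live with the JTP fallback, limiting the group size is a simple countermeasure. Range.create(globalSize, localSize) is part of the Aparapi API, but the concrete numbers below are only an illustration (the global size must be a multiple of the group size):

 // Sketch: limit the group size so the JTP fallback does not spawn an excessive number of threads.
 final int rows = result.length;                    // one work item per matrix row
 final int groupSize = 4;                           // e.g. the number of available CPU cores
 final Range range = Range.create(rows, groupSize); // rows must be a multiple of groupSize
 kernel.execute(range);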

Optimize CPU Code First

Comparing the non-GPU versions of the band matrix implementation shows that some simple measures to improve CPU performance can result in a significant improvement. Between v1 (initial version) and v3b (optimized with manual fork & join) there is a ratio of 20:1 on a standard CPU.

You Have to Decide the Target Platform

In some cases Aparapi code also works on NVIDIA and/or Intel graphics devices - but some of my tests with different hardware showed that this is by far not guaranteed. Less than 50% of the machines were able to execute the code without additional work (e.g. a change of OpenCL.dll).

Not all Algorithms Result in Better Performance

Be aware that algorithms which need a lot of logic operations and/or data transfer to/from the GPU tend to be slower than expected. In general GPUs are not made for complex logic, and quite often the data transfer is the real bottleneck.
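
Aparapi's explicit buffer management (already used in Listing 5 via setExplicit, put and get) helps to reduce the transfer overhead: within an iterative solver the matrix can stay on the device while only the vectors are moved per iteration. A minimal sketch with placeholder variable names:

 // Keep the large matrix on the device across the CG iterations and transfer only the vectors.
 kernel.setExplicit(true);     // disable automatic buffer transfers
 kernel.put(matrixA);          // copy the (large) matrix to the device once

 for (int iteration = 0; iteration < maxIterations; iteration++) {
    kernel.put(vectorB);       // small per-iteration transfer to the device
    kernel.execute(range);
    kernel.get(vectorX);       // read the result back from the device
 }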

Your Notebook PC May Overheat Under Heavy GPU Load

Running all the test cases produced the following screen on my notebook:

The system BIOS has detected your notebook PC was put into hibernation to avoid overheating.

Overheating may occur if the cooling vents are blocked or the operating temperature exceeds the specifications. The notebook PC should return to normal operation once the situation is resolved.

System Temperature (90D) 

ENTER - Continue Startup

For more information, please visit: www.hp.com\go\techcenter\startup

Obviously this hardware is not made for heavy GPU load (there is a reason why high-end graphics cards have all those fans and cooling devices).

Recommendations

  • First have a look at your algorithm. Maybe there is a lot to optimize before you need to start with GPU programming
  • Java is maybe not the best choice for GPU programming, because you lose the "write once, run anywhere" benefit of Java, and with the typical Java bindings you don't have full control over your solution
  • If you need more computational power, you may decide to use a high-end graphics device and think about using C/C++ and/or OpenCL directly
  • Don't try to emulate double precision on a GPU - the overhead is so large that it cancels out all possible performance benefits
  • Nevertheless, spend time on GPU programming, because massively parallel programming is the model of the future. Even if you are not successful, you will learn some important lessons

Find Code on GitHub
