Lessons Learned from GPU Experiments with Aparapi

BY MARKUS SPRUNCK

Most of my experiments with new technologies are successful and somehow encouraging. In the following you can read about some experiments with Aparapi for GPU programming. The results of these experiments have not been satisfying, but they were very insightful after all. In the lessons learned and recommendations paragraphs of this article you may find some conclusions.

GPU programming makes sense if you write solutions for dedicated high-performance graphics hardware and not for standard devices. Java is maybe not the best choice for this, because you lose the "write once, run everywhere" benefit of Java.

The key problem today is the missing FP64 support (native double precision) on standard graphics devices. Support of double precision is mandatory for most real-world scientific computing use cases.

Introduction

General-Purpose Computing on Graphics Processing Units (GPGPU)

The use of a graphics processing unit (GPU) for general purpose computations is called General-Purpose Computing on Graphics Processing Units. Compared with the computation in central processing units (CPUs), the number of cores and the calculation power of graphics cards can be significantly higher.

A graphics processing unit consists of hundreds of cores. These cores are restricted in the operations they can perform, but they are designed to handle many tasks in parallel. Compared with the small number of cores in a classical central processing unit, the many cores in graphics processing units provide a large potential for parallelization in scientific computing.

At the moment there are two competitive approaches to program for GPU-acceleration:

  • Open Computing Language by Khronos Group - "OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. [...]" (see http://www.khronos.org/opencl)

  • Parallel Computing Platform by NVIDIA - "CUDA™ is a parallel computing platform and programming model invented by NVIDIA. [...]"

Java Bindings for OpenCL

The following implementations of Java bindings are just a personal selection (there are more available):

CUDA related:

  • Rootbeer - "With Rootbeer a developer can run complex Java objects on a GPU. Rootbeer (de)serializes object fields and generates CUDA code for execution in parallel on a GPU. [...] An nvidia gpu is required and CUDA must be installed"

  • JCUDA - "The JCuda Runtime API is mainly intended for the interaction with the Java bindings of the CUDA Runtime libraries, like JCublas and JCufft. A Java programmer who only wants to use these libraries and does not want to create own CUDA kernels can use the JCuda Runtime API for the interaction with these libraries."

OpenCL related:

  • Aparapi - "Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java byte code to OpenCL at runtime and executing on the GPU, if for any reason Aparapi can't execute on the GPU it will execute in a Java thread pool." (see http://aparapi.github.io)

  • JOCL - "This library offers Java-Bindings for OpenCL that are very similar to the original OpenCL API. The functions are provided as static methods, and semantics and signatures of these methods have been kept consistent with the original library functions, except for the language-specific limitations of Java."

All these bindings have individual pros and cons. I decided to use Aparapi for my experiments, because my notebook PC has an AMD graphics card.

Aparapi Basics

This API has some interesting concepts and is in principle independent from the graphics card vendor (because it is based on OpenCL). Initially it was developed by AMD and it is now open source. Aparapi generates OpenCL from Java code and offers a higher-level abstraction of OpenCL (so that less boilerplate code is needed).

On the Aparapi GitHub site you may find the following minimal sample of how to refactor a simple loop:

final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result[] = new float[inA.length];

for (int i = 0; i < result.length; i++) {
    result[i] = inA[i] + inB[i];
}

After the following refactoring, the for-loop may run on the GPU with the support of Aparapi:

final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result[] = new float[inA.length];

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int i = getGlobalId();
        result[i] = inA[i] + inB[i];
    }
};

Range range = Range.create(result.length);
kernel.execute(range);

If you would like to learn more about the Aparapi basics, you should start by reading the Aparapi UsersGuide and the Aparapi FAQs.

Use Case

For the experiment described below a realistic use case has been selected. The Finite Element Method results in large systems of linear equations with a symmetric positive definite coefficient matrix (a bandwidth of 1:10 is typical).

These systems of linear equations can be solved with the Conjugate Gradient Method (CG). For the Finite Element Method it makes no sense to be too precise, because the simulation error (model simplification, wrong material properties, etc.) will exceed the numerical deviations from an exact solution. However, CG is sensitive to numerical errors due to loss of precision in the matrix multiplication. This is the reason why you need double precision.

For large systems, the matrix multiplication is the most time-consuming activity in the Conjugate Gradient Method and should be as fast as possible. So, the use case is matrix multiplication in the Conjugate Gradient Method.
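To make this concrete, the following minimal sketch of the Conjugate Gradient loop (illustrative code, not part of the project; the dense helper methods are assumptions) shows that exactly one matrix-vector product is needed per iteration - everything else is cheap vector arithmetic:

// Illustrative Conjugate Gradient sketch for a symmetric positive definite matrix A.
// The helpers and the dense matrix layout are assumptions for readability; the real
// implementation works on band matrices as shown in the listings below.
static double[] conjugateGradient(final double[][] A, final double[] b,
        final int maxIterations, final double tolerance) {
    final int n = b.length;
    final double[] x = new double[n];     // start with x = 0
    final double[] r = b.clone();         // residual r = b - A*x equals b for x = 0
    final double[] p = r.clone();         // search direction
    double rsOld = dot(r, r);

    for (int iter = 0; iter < maxIterations && Math.sqrt(rsOld) > tolerance; iter++) {
        final double[] Ap = times(A, p);  // matrix * vector: the dominant cost per iteration
        final double alpha = rsOld / dot(p, Ap);
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        final double rsNew = dot(r, r);
        for (int i = 0; i < n; i++) {
            p[i] = r[i] + (rsNew / rsOld) * p[i];
        }
        rsOld = rsNew;
    }
    return x;
}

static double dot(final double[] a, final double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

static double[] times(final double[][] A, final double[] v) {
    final double[] result = new double[v.length];
    for (int row = 0; row < A.length; row++) {
        for (int col = 0; col < v.length; col++) {
            result[row] += A[row][col] * v[col];
        }
    }
    return result;
}

In the band matrix implementations below, times(...) is exactly this dominant operation.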

Band Matrix Multiplication (1st Version)

Last year I developed a simple FEM implementation for mobile devices (see here). In this implementation there is a class called BandMatrix.java which is able to perform a band matrix multiplication and to solve a linear equation based on the Conjugate Gradient Method. This class was the first version of the experiment. It performs a general matrix multiplication and is not optimized in terms of performance. The multiplication of this version takes two matrices and returns the result as a new matrix, as you may see in Listing 1 (find the source code on GitHub).

Listing 1: Matrix multiplication in class v1.BandMatrix.java

[...]

private final double[][] values;

[...]

public Matrix times(Matrix b) {
    final Matrix C = new Matrix(rows, 1);
    for (int i = 0; i < rows; i++) {
        for (int k = i - cols; k - i < cols; k++) {
            if (k >= 0 && k < b.rows) {
                C.values[i][0] += getValue(i, k) * b.values[k][0];
            }
        }
    }
    return C;
}

[...]

Band Matrix Multiplication with Optimizations (2nd Version)

Aparapi cannot manage two-dimensional arrays, so some changes are needed to refactor the code for GPU execution. The values are managed in the half band matrix format (just the upper part).
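The flattening itself is straightforward: the two-dimensional index is mapped to a one-dimensional, row-major array. A minimal sketch (illustrative names, not project code):

// Illustrative only: map a two-dimensional index to the flat, row-major array
// layout that an Aparapi kernel can work with.
static double get(final double[] values, final int cols, final int row, final int col) {
    return values[row * cols + col];   // same element as values2d[row][col]
}

The band matrix classes below follow the same idea, but additionally store only the values inside the band.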

Listing 2: v2.BandMatrix.java works with a one-dimensional array:

[...]

/* The symmetric banded matrix ( '-' indicates zero values):
 *
 *  | a0  a1  a2   0   0  |
 *  | a1  a3  a4  a5   0  |
 *  | a2  a4  a6  a7  a8  |
 *  |  0  a5  a7  a9  a10 |
 *  |  0   0  a8  a10 a11 |
 *
 * is managed in the half band matrix format (just upper part):
 *
 *  | a0   a1   a2 |
 *  | a3   a4   a5 |
 *  | a6   a7   a8 |
 *  | a9   a10   0 |
 *  | a11   0    0 |
 *
 * is stored as array:
 *
 *  [ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, 0, a11, 0, 0 ]
 */
private final double[] values;

[...]

public Vector times(final Vector b) {
    final Vector C = new Vector(getMaxRows());
    for (int row = 0; row < getMaxRows(); row++) {
        final int start = Math.max(0, row - getMaxCols() + 1);
        final int end = Math.min(getMaxRows(), row + getMaxCols());

        for (int col = start; col < Math.min(end, row); col++) {
            C.addValue(row, getValue(col, row) * b.getValue(col));
        }

        for (int col = Math.max(start, row); col < end; col++) {
            C.addValue(row, getValue(row, col) * b.getValue(col));
        }
    }
    return C;
}

[...]

Band Matrix Multiplication with Manual Fork & Join Implementation (3rd Version)

In this version some further optimizations have been done. Now the values are managed internally in the full band matrix format. This simplifies the inner loops and increases the performance.
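For reference, the index arithmetic that this layout implies can be written as a small accessor (an illustrative sketch derived from Listing 3 below, not part of the project): with bandwidthMid = cols >> 1, matrix element (row, col) is stored at values[row * cols + (col - row + bandwidthMid)], and everything outside the band is zero.

// Illustrative accessor for the full band matrix layout used in Listing 3
// (assumption, not project code).
static double getBandValue(final double[] values, final int rows, final int cols,
        final int row, final int col) {
    final int bandwidthMid = cols >> 1;
    final int offset = col - row + bandwidthMid;   // position inside the padded row
    if (col < 0 || col >= rows || offset < 0 || offset >= cols) {
        return 0.0;                                // outside the matrix or the band
    }
    return values[row * cols + offset];
}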

Listing 3: BandMatrixFull.java

[...]

/* The symmetric banded matrix ( '-' indicates zero values):
 *
 *  | a0  a1  a2   -   -  |
 *  | a1  a3  a4  a5   -  |
 *  | a2  a4  a6  a7  a8  |
 *  |  -  a5  a7  a9  a10 |
 *  |  -   -  a8  a10 a11 |
 *
 * is managed internally in the full band matrix format:
 *
 *  |  -    -   a0   a1   a2 |
 *  |  -   a1   a3   a4   a5 |
 *  | a2   a4   a6   a7   a8 |
 *  | a5   a7   a9   a10   - |
 *  | a8   a10  a11   -    - |
 *
 * is stored as array:
 *
 *  [ -, -, a0, a1, a2, -, a1, a3, a4, a5, a2, a4, a6, a7, a8, a5, a7, a9, a10, -, a8, a10, a11, -, - ]
 */
protected final double[] values;

[...]

public void times(final Vector b, final Vector result) {

    // prepare input parameter
    final int rowStart = 0;
    final int rowEnd = rows;
    final int colMaximum = cols;
    final int rowMaximum = rows;
    final int bandwidthMid = colMaximum >> 1;

    // execute band matrix multiplication
    int index = 0;
    int rowOffset = 0;
    double sum = 0.0;
    for (int row = rowStart; row < rowEnd; row++) {
        rowOffset = row * colMaximum;
        sum = 0.0;
        for (int col = 0; col < colMaximum; col++) {
            index = row - bandwidthMid + col;
            if (index < rowMaximum && index >= 0) {
                sum += values[col + rowOffset] * b.values[index];
            }
        }
        result.values[row] = sum;
    }
}

[...]

The BandMatrixMultiplicatonTask can be used to run the multiplication in multi-threaded mode; a usage sketch follows after Listing 4.


Listing 4: BandMatrixMultiplicatonTask.java

public class BandMatrixMultiplicatonTask extends RecursiveTask<Long> {

    [...]

    @Override
    public Long compute() {
        if (rowEnd - rowStart < b.getMaxRows() / Parameter.NUMBER_OF_POCESSORS) {
            calculateMatrixMultiplication();
        } else {
            final int mid = (rowEnd + rowStart) >> 1;
            final BandMatrixMultiplicatonTask firstWorker =
                    new BandMatrixMultiplicatonTask(rowStart, mid, valuesMatrix, b, result);
            firstWorker.fork();
            final BandMatrixMultiplicatonTask secondWorker =
                    new BandMatrixMultiplicatonTask(mid, rowEnd, valuesMatrix, b, result);
            secondWorker.compute();
            firstWorker.join();
        }
        return 0L;
    }

    private void calculateMatrixMultiplication() {
        // prepare input parameter
        final int bandwidthMid = colMaximum >> 1;
        final int rowMaximum = b.values.length;
        final double[] values = valuesMatrix.values;

        // execute band matrix multiplication
        int index;
        int rowOffset;
        double sum;
        for (int row = rowStart; row < rowEnd; row++) {
            rowOffset = row * colMaximum;
            sum = 0.0;
            for (int col = 0; col < colMaximum; col++) {
                index = row - bandwidthMid + col;
                if (index < rowMaximum && index >= 0) {
                    sum += values[col + rowOffset] * b.values[index];
                }
            }
            result.values[row] = sum;
        }
    }

    [...]
}
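A minimal sketch of how such a task might be driven (assuming the constructor shown in Listing 4 and the getMaxRows() accessor from Listing 2; ForkJoinPool is the standard JDK class):

import java.util.concurrent.ForkJoinPool;

[...]

// Illustrative invocation of the fork & join task; the variable names are assumptions.
final ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new BandMatrixMultiplicatonTask(0, b.getMaxRows(), matrix, b, result));

ForkJoinPool.invoke() returns only after the task and all forked sub-tasks have completed, so result holds the full product afterwards.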

Band Matrix Multiplication with GPU Implementation Based on Aparapi (4th Version)

This version of the matrix multiplication is a somewhat strange attempt to simulate double precision floating point operations with long values. The motivation to emulate double precision is the missing FP64 support on many graphics devices.
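To make the kernel below a bit easier to follow: judging from the constants SPLIT_EXP and MIN_EXP, a packed long carries a signed decimal mantissa of up to 17 digits in its lowest digits and a shifted exponent in the digits above. The tiny helper below is purely illustrative (an assumption about the format, not part of the project; the real conversion lives in PackedDouble.java):

// Illustrative packing helper (assumption): the sign of the packed long follows the
// sign of the mantissa, the exponent is stored shifted by MIN_EXP above SPLIT_EXP.
static final long SPLIT_EXP = 100000000000000000L;  // 10^17
static final int MIN_EXP = -49;

static long pack(final long scaledMantissa, final int exponent) {
    final long sign = scaledMantissa < 0 ? -1 : 1;
    return (exponent - MIN_EXP) * SPLIT_EXP * sign + scaledMantissa;
}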

Listing 5: BandMatrixMultiplicatonAparapi.java

[...]

public class BandMatrixMultiplicatonAparapi extends Kernel {

    long[] vectorB = null;
    long[] matrixA = null;
    long[] vectorX = null;

    int[] bandwidthMid = new int[1];
    int[] colMaximum = new int[1];
    int[] rowMaximum = new int[1];

    long[] POW_10_LONG = new long[100];

    public BandMatrixMultiplicatonAparapi() {
        setExplicit(true);
        POW_10_LONG[0] = 1;
        for (int i = 1; i < 100; i++) {
            if (i < 19) {
                POW_10_LONG[i] = 10 * POW_10_LONG[i - 1];
            } else {
                POW_10_LONG[i] = POW_10_LONG[18];
            }
        }
        this.put(POW_10_LONG);
    }

    @Override
    public void run() {

        // prepare input parameter
        final int row = getGlobalId();
        final int rowOffset = row * colMaximum[0];

        // execute band matrix multiplication (for one row)
        long sum = 0L;
        int index = 0;
        for (int col = 0; col < colMaximum[0]; col++) {
            index = row - bandwidthMid[0] + col;
            if (index < rowMaximum[0] && index >= 0) {
                sum = addPacked(sum, multiplyPacked(matrixA[col + rowOffset], vectorB[index]));
            }
        }
        vectorX[row] = sum;
    }

    public void setMatrixA(long[] values, final int rowNumber, final int bandwidth) {
        if (null == matrixA) {
            matrixA = values;
        }
        vectorB = new long[rowNumber];
        vectorX = new long[rowNumber];

        bandwidthMid[0] = bandwidth >> 1;
        colMaximum[0] = bandwidth;
        rowMaximum[0] = rowNumber;
    }

    public void setVectorB(long[] values) {
        vectorB = values;
        this.put(vectorB);
    }

    public void putVectorB() {
        this.put(vectorB);
    }

    public void setVectorX(long[] values) {
        vectorX = values;
        this.put(vectorX);
    }

    public void getVectorX() {
        this.get(vectorX);
    }

    /**
     * Two digits are reserved for the exponent [0..99]
     */
    private static final int MIN_EXP = -49;

    /**
     * Used to split the double value into mantissa and exponent.
     * The mantissa is scaled to use a maximum of 17 digits. These are
     * MANTISSA_DIGITS plus 1 for overflows in the add operation.
     */
    private static final long SPLIT_EXP = 100000000000000000L;

    /**
     * Used to split the mantissa into a higher and a lower integer.
     */
    private static final long SPLIT_INT = 100000000L;

    public long multiplyPacked(long multiplicand, long multiplier) {

        final long md_mantissa = multiplicand - multiplicand / SPLIT_EXP * SPLIT_EXP;
        final long md_hi = md_mantissa / SPLIT_INT;
        final long md_lo = md_mantissa % SPLIT_INT;

        final long mr_mantissa = multiplier - multiplier / SPLIT_EXP * SPLIT_EXP;
        final long mr_hi = mr_mantissa / SPLIT_INT;
        final long mr_lo = mr_mantissa % SPLIT_INT;

        final long product_mantissa = md_hi * mr_hi
                + md_lo * mr_hi / SPLIT_INT + md_hi * mr_lo / SPLIT_INT;
        final long product_exponent = (multiplicand >> 63 | -multiplicand >>> 63)
                * (multiplicand / SPLIT_EXP)
                + (multiplier >> 63 | -multiplier >>> 63)
                * (multiplier / SPLIT_EXP) + 2 * MIN_EXP + 1;

        return (product_exponent - MIN_EXP) * SPLIT_EXP
                * (product_mantissa >> 63 | -product_mantissa >>> 63)
                + product_mantissa;
    }

    public long addPacked(long augend, long addend) {

        long augend_exponent = (augend >> 63 | -augend >>> 63) * (augend / SPLIT_EXP) + MIN_EXP;
        long addend_exponent = (addend >> 63 | -addend >>> 63) * (addend / SPLIT_EXP) + MIN_EXP;

        if (augend_exponent < addend_exponent) {

            // Swap values
            augend = augend ^ addend;
            addend = addend ^ augend;
            augend = augend ^ addend;

            final long value = augend / SPLIT_EXP;
            augend_exponent = (value * (value >> 63 | -value >>> 63)) + MIN_EXP;
            final long value1 = addend / SPLIT_EXP;
            addend_exponent = (value1 * (value1 >> 63 | -value1 >>> 63)) + MIN_EXP;

            final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
            final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
            final long sum_mantissa = augend_mantissa + addend_mantissa
                    / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
            return (augend_exponent - MIN_EXP) * SPLIT_EXP
                    * (sum_mantissa >> 63 | -sum_mantissa >>> 63)
                    + sum_mantissa;
        } else {
            final long addend_mantissa = addend - addend / SPLIT_EXP * SPLIT_EXP;
            final long augend_mantissa = augend - augend / SPLIT_EXP * SPLIT_EXP;
            final long sum_mantissa = augend_mantissa + addend_mantissa
                    / POW_10_LONG[(int) (augend_exponent - addend_exponent)];
            return (augend_exponent - MIN_EXP) * SPLIT_EXP
                    * (sum_mantissa >> 63 | -sum_mantissa >>> 63)
                    + sum_mantissa;
        }
    }
}
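Based on the public methods of Listing 5, the kernel is driven roughly like this (a sketch, not the project code; it assumes that matrix and vector have already been converted to the packed long format and that rowNumber and bandwidth are known):

// Illustrative driver for the kernel of Listing 5. With setExplicit(true) in the
// constructor, buffers move between host and GPU only on explicit put()/get() calls.
final BandMatrixMultiplicatonAparapi kernel = new BandMatrixMultiplicatonAparapi();
kernel.setMatrixA(packedMatrixValues, rowNumber, bandwidth);  // also sizes vectorB and vectorX
// (depending on the surrounding code, the matrix may need one explicit put(...) as well)
kernel.setVectorB(packedVectorB);                             // setVectorB() already calls put()
kernel.execute(Range.create(rowNumber));                      // one work item per matrix row
kernel.getVectorX();                                          // copy the packed result back

Inside the CG iteration only the changing right-hand side has to be re-transferred, which is what putVectorB() is for.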

Measurements

Running the test cases (just call run_x86_64.cmd) should create a file called run_x86_64_output.txt with the following content:

#rows  #cols      v1     v2    v3a   v3b      v4a    v4b    v4c
  128    101      71     37     21    36      168   2010    323
  256    101     249     47     35    52      687   1408    684
  384    101     624    113     88   104     1701   1780   1313
  512    101    1212    227    153   119     3187   2684   1898
  640    101    1863    340    233   151     5022   3135   2518
  768    101    2589    476    338   196    11714   6401   5719
  896    101    4178    800    527   289    11993   5552   5331
 1024    101    8189   1564   1065   556    14325   5323   6018
 1152    101    6268   1141    802   415    20333   6387   9050
 1280    101    7756   1406    971   479    21357   6589   8975
 1408    101   13574   2494   1726   898    25982   8166  10712
 1536    101   11509   2205   1482   677    31015   9111  12386
 1664    101   13856   2543   1769   769    36230  17912  14983
 1792    101   16863   3150   2107   943    43660  12440  16955
 1920    101   18811   3366   2249   991    48567  15142  17700
 2048    101   22025   4161   2933  1165    72349  21758  26145
 2176    101   23896   4368   2860  1209    66038  22883  23522
 2304    101   26992   5053   3292  1365    70734  36415  25187
 2432    101   30172   5675   3818  1622    79120  26254  28384
 2560    101   33600   6554   4248  1698   158546  81976  57219
 2688    101   36507   6978   4629  1802    97684  32772  36593
 2816    101   79837  15269  10169  4443   185933  89416  66535
 2944    101   43993   8484   5506  2142   114659  32969  53044
 3072    101   48391   9562   6246  2549   206562  75176  81600
 3200    101   52567   9979   6730  2610   213407  63704  80113
 3328    101   57454  11042   7440  2980   199818  61465  78953
 3456    101   62078  11850   8369  3370   200175  58826  77425
 3584    101   97655  18452  12163  4961   171394  51921  65358
 3712    101  130234  24823  16679  7827   239991  69402  92992

where:

  • v1 - the initial, non-optimized (single-thread) version

  • v2 - the optimized (single-thread) version with a vector class instead of a matrix class and the half band matrix format

  • v3a - the most optimized (single-thread) version with the full band matrix format

  • v3b - the most optimized (fork & join, multi-threaded) version with the full band matrix format

  • v4a - based on v3a, emulates floating point operations with PackedDouble.java

  • v4b - based on v3a, emulates floating point operations and runs with Aparapi in GPU mode

  • v4c - based on v3a, emulates floating point operations and runs with Aparapi in JTP mode

Figure 1: Conjugate Gradient with Random Numbers

The manually implemented Fork & Join version (v3b) outperforms all other implementations (> 500 rows). Aparapi's JTP fallback version was so slow that I decided to remove it from my further investigations. Feel free to implement it and experiment with the group size to change the number of used threads (the default number is much too high, and obviously no proper thread pooling is used, because the number of threads oscillates during repeated execution). Maybe I'm just too stupid to understand why Aparapi's JTP fallback has been implemented in this way.

The ratio between the CPU version (v4a) and the GPU version (v4b) is about 3:1. This was the speedup gained by running on the GPU of a non-high-performance graphics device with Aparapi (on a high-performance device the possible speedup factor should be somewhere between 20 and 40).

The ratio between the CPU versions (v1 and v3b) is about 20:1. This was the speedup gained by an optimized implementation without Aparapi & GPU. This test has been done on a PC with 4 available CPU cores; with more cores the expected speedup should be higher.

Lessons Learned

Most Graphic Devices Have No FP64 Support

This is the most serious issue with GPU programming. Only the high-end devices have FP64 (native double) support. On standard devices (like my notebook PC) you will get a warning like:

Warning: Reverting to Java Thread Pool (JTP) for class [...]: FP64 required but not supported

and your GPU code will be executed in fallback mode (Java Thread Pool).

Aparapi Fallback to JTP Mode is Slow

In case you implement the GPU code with Aparapi, need FP64, and don't have a graphics device with native support, the code will be executed in JTP mode. Unless you set the group size in the Range to a reasonable number of threads, the JTP mode is extremely slow (because too many threads are used). Compared with the fork & join implementation, the JTP mode is very inefficient.
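If the fallback cannot be avoided, limiting the group size helps; a minimal sketch (Range.create(globalSize, localSize) is part of the Aparapi API, the concrete numbers are just assumptions and localSize must divide globalSize evenly):

// Sketch: restrict the local (group) size so the JTP fallback does not spawn an
// excessive number of threads; the sizes below are illustrative.
final int globalSize = rowNumber;   // one work item per matrix row
final int localSize = 64;           // assumption: a moderate group size
kernel.execute(Range.create(globalSize, localSize));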

Optimize CPU Code First

Comparing the non-GPU versions of the band matrix implementation shows that some simple actions for more performance on the CPU may result in a significant improvement. Between v1 (initial version) and v3b (optimized and manual fork & join) there is a ratio of 20:1 on a standard CPU.

You Have to Decide the Target Platform

In some cases Aparapi code also works on NVIDIA and/or Intel graphics devices - but some of my tests with different hardware showed that this is by far not guaranteed. Less than 50% of the machines were able to execute the code without additional work (e.g. a change of OpenCL.dll).

Not all Algorithms Result in Better Performance

Be aware that algorithms which need a lot of logic operations and/or data transfer to/from the GPU tend to be slower than expected. In general, GPUs are not made for complex logic, and quite often the data transfer is a real bottleneck.
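One way to keep the transfer overhead down - Listing 5 already uses it - is Aparapi's explicit buffer management: constant data is copied once and only the changing buffers are moved per call. A sketch with illustrative names:

// Sketch of explicit buffer management (setExplicit/put/get are part of the
// Aparapi Kernel API; the buffer and loop names are illustrative).
kernel.setExplicit(true);            // disable automatic transfers
kernel.put(matrixA);                 // copy the constant matrix to the GPU only once
for (int iteration = 0; iteration < maxIterations; iteration++) {
    kernel.put(vectorB);             // transfer only the data that changed
    kernel.execute(Range.create(rowNumber));
    kernel.get(vectorX);             // fetch only the result needed on the CPU
}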

Your Notebook PC May Overheat Under Heavy GPU Load

Running all the test cases caused the following screen on my notebook:

The system BIOS has detected your notebook PC was put into hibernation

operating normally.


Overheating may occur if the cooling vents are blocked or the operating temperature should return to normal operation once the situation is resolved.


System Temperature (90D)


ENTER - Continue Startup


For more information, please visit: www.hp.com\go\techcenter\startup


Obviously this hardware is not made for heavy GPU load (there is a reason why high-end graphics cards have all these fans and cooling devices).

Recommendations

  • First have a look at your algorithm. Maybe there is a lot to optimize before you need to start with GPU programming.

  • Java is maybe not the best choice for GPU programming, because you lose the "write once, run everywhere" benefit of Java, and with the typical Java bindings you don't have full control over your solution.

  • If you need more computational power, you may decide to use a high-end graphics device and think about using C/C++ and/or OpenCL directly.

  • Don't try to emulate double precision on a GPU - the needed overhead is so large that all possible performance benefits will be lost.

  • Nevertheless, spend time on GPU programming - because massively parallel programming is the model of the future. Even if you are not successful, you will learn some important lessons.