Most of my experiments with new technologies are successful and in somehow encouraging. In the following you can read about some experiments with Aparapi for GPU programming  the results of these experiments have not been satisfying, but after all very insightful. In the lessons learned and recommendations paragraph of this article you may find some conclusions.
GPU programming makes sense if you write solutions for dedicated high GPU performance graphic hardware and not for standard devices. Java is maybe not the best choice to do this, because you lose the "write once run everywhere" benefit of Java.
The key problem today is the missing FP64 (native double precision) on standard graphic devices. Support of double precision is mandatory for most real world scientific computing use cases.
Introduction
Generalpurpose Computing on Graphics Processing (GPGPU)
The use of a graphics processing unit (GPU) for general purpose computations is called Generalpurpose Computing on Graphics Processing. Compared with the computation in central processing units (CPUs) the number of cores and calculation power of graphic cards can be significantly increases.
A graphics processing unit consists of hundreds of cores. These cores are restricted in the possible operations, but designed to handle multiple tasks in parallel. Compared with the small number of cores in a classical central processing unit, the many cores in graphics processing units provide a large potential for parallelization in scientific computing.
At the moment there are two competitive approaches to program for GPUacceleration:
 Open Computing Language by Khronos Group  "OpenCL™ is the first open, royaltyfree standard for
crossplatform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. [...]"
(see http://www.khronos.org/opencl)
 Parallel Computing Platform by NVIDIA  "CUDA™ is a parallel computing platform and programming
model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the
power of the graphics processing unit (GPU). [...]"
(see http://www.nvidia.com/object/cuda_home_new.html)
Java Bindings for OpenCL
The following implementations of Java bindings are just a personal selection (there are more available):
CUDA related:
 Rootbeer  "With Rootbeer a developer can run complex Java objects on a GPU. Rootbeer (de)serializes object fields and generates CUDA code for execution in parallel on a GPU. [...] An nvidia gpu is required and CUDA must be installed"
 JCUDA  "The JCuda Runtime API is mainly intended for the interaction with the Java bindings of the CUDA Runtime libraries, like JCublas and JCufft. A Java programmer who only wants to use these libraries and does not want to create own CUDA kernels can use the JCuda Runtime API for the interaction with these libraries."
OpenCL related:
 Aparapi  "Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java byte code to OpenCL at runtime and executing on the GPU, if for any reason Aparapi can't execute on the GPU it will execute in a Java thread pool." (see http://aparapi.github.io)
 JOCL  "This library offers JavaBindings for OpenCL that are very similar to the original OpenCL API. The functions are provided as static methods, and semantics and signatures of these methods have been kept consistent with the original library functions, except for the languagespecific limitations of Java."
All these bindings have individual pros and cons. I decided to use Aparapi for my experiments, because my notebook PC has an AMD graphic card.
Aparapi Basics
This API has some interesting concepts and is in principle independent from the graphic card vendor (because it is OpenCL related). Initially it was developed by AMD and now open source. Aparapi generates OpenCL from Java code and it offers a more high level abstraction of OpenCL (so that less boiler plate code is needed).
On the site Aparapi GitHub you may find the following minimal sample how you refactor a simple loop:
 final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result = new float[inA.length];
for (int i=0; i<array.length; i++){
result[i]=intA[i]+inB[i];
}

the forloop may run on GPU supported by Aparapi after refactoring:
 final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result = new float[inA.length];
Kernel kernel = new Kernel(){
@Override public void run(){
int i= getGlobalId();
result[i]=intA[i]+inB[i];
}
};
Range range = Range.create(result.length);
kernel.execute(range);

Use Case
For the experiment described below a realistic use case has been selected. Finite Element Method results in large systems of linear equations with a symmetric positive definite coefficient matrix (typical is a bandwidth of 1:10).
These systems of linear equations can be solved by Conjugate Gradient Method (CG). For Finite Element Method it makes no sense to be to precise, because the simulation error (model simplification, wrong material properties, etc.) will exceed the numerical deviations from an exact solution. CG is sensitive to numerical computational errors due to loss of precision in the matrix multiplication. This is the reason why you need double precision.
The matrix multiplication is for large systems the most time consuming activity in Conjugate Gradient Method. This should be as fast as possible. So, the use case is Matrix Multiplication in Conjugate Gradients Method.
Band Matrix Multiplication (1st Version)
Last year developed a simple FEM implementation for mobile devices (see here). In this implementation there is a class called BandMatrix.java which is able to make a band matrix multiplication and to solve a linear equation based on Conjugate Gradient Method. This class was the first version of the experiment. It makes a general matrix multiplication and in not optimized in terms of performance. The multiplication of this version takes two matrices and returns a result a new matrix as you may see in Listing1 (find the source code on GitHub).
Listing 1: Matrix multiplication in class v1.BandMatrix.java
 [...]
private final double[][] values;
[...]
public Matrix times(Matrix b) {
final Matrix C = new Matrix(rows, 1);
for (int i = 0; i < rows; i++) {
for (int k = i  cols; k  i < cols; k++) {
if (k >= 0 && k < b.rows) {
C.values[i][0] += getValue(i, k) * b.values[k][0];
}
}
}
return C;
}
[...]

Band Matrix Multiplication with Optimizations (2nd Version)
Aparapi cannot manage two dimensional arrays, so some changes are needed to refactor the code for GPU execution. The values are managed in the half band matrix format (just upper part).
Listing 2: v2.BandMatrix.java works with one dimensional array:
 [...]
/* The symmetric banded matrix ( '' indicates zero values):
*
*  a0 a1 a2 0 0 
*  a1 a3 a4 a5 0 
*  a2 a4 a6 a7 a8 
*  0 a5 a7 a9 a10 
*  0 0 a8 a10 a11 
*
* is managed in the half band matrix format (just upper part):
*
*  a0 a1 a2 
*  a3 a4 a5 
*  a6 a7 a8 
*  a9 a10 0 
*  a11 0 0 
*
* is stored as array:
*
* [ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, 0, a11, 0, 0 ]
*
*/
private final double[] values;
[...]
public Vector times(final Vector b) {
final Vector C = new Vector(getMaxRows());
for (int row = 0; row < getMaxRows(); row++) {
final int start = Math.max(0, row  getMaxCols() + 1);
final int end = Math.min(getMaxRows(), row + getMaxCols());
for (int col = start; col < Math.min(end, row); col++) {
C.addValue(row, getValue(col, row) * b.getValue(col));
}
for (int col = Math.max(start, row); col < end; col++) {
C.addValue(row, getValue(row, col) * b.getValue(col));
}
}
return C;
}
[...]

Band Matrix Multiplication with Manual Fork & Join Implementation (3rd Version)
In this version some further optimizations have been done. Now the values are managed internal in the full band matrix format. This simplified the inner loops and increased the performance.
Listing 3: BandMatrixFull.java
 [...]
/* The symmetric banded matrix ( '' indicates zero values):
*
*  a0 a1 a2   
*  a1 a3 a4 a5  
*  a2 a4 a6 a7 a8 
*   a5 a7 a9 a10 
*    a8 a10 a11 
*
* is managed internal in the full band matrix format:
*
*    a0 a1 a2 
*   a1 a3 a4 a5 
*  a2 a4 a6 a7 a8 
*  a5 a7 a9 a10  
*  a8 a10 a11   
*
* is stored as array:
*
* [ , , a0, a1, a2, , a1, a3, a4, a5, a2, a4, a6, a7, a8, a5, a7, a9, a10, , a8, a10, a11, ,  ]
*
*/
protected final double[] values;
[...]
public void times(final Vector b, final Vector result) {
// prepare input parameter
final int rowStart = 0;
final int rowEnd = rows;
final int colMaximum = cols;
final int rowMaximum = rows;
final int bandwidthMid = colMaximum >> 1;
// execute band matrix multiplication
int index = 0;
int rowOffset = 0;
double sum = 0.0;
for (int row = rowStart; row < rowEnd; row++) {
rowOffset = row * colMaximum;
sum = 0.0;
for (int col = 0; col < colMaximum; col++) {
index = row  bandwidthMid + col;
if (index < rowMaximum && index >= 0) {
sum += values[col + rowOffset] * b.values[index];
}
}
result.values[row] = sum;
}
}
[...]

The BandMatrixMultiplicatonTask can be used to run it in multithreading mode.
Listing 4: BandMatrixMultiplicatonTask.java
 public class BandMatrixMultiplicatonTask extends RecursiveTask<Long> {
[...]
@Override
public Long compute() {
if (rowEnd  rowStart < b.getMaxRows() / Parameter.NUMBER_OF_POCESSORS) {
calculateMatrixMultiplication();
} else {
final int mid = (rowEnd + rowStart) >> 1;
final BandMatrixMultiplicatonTask firstWorker = new BandMatrixMultiplicatonTask(rowStart, mid,
valuesMatrix, b, result);
firstWorker.fork();
final BandMatrixMultiplicatonTask secondWorker = new BandMatrixMultiplicatonTask(mid, rowEnd,
valuesMatrix, b, result);
secondWorker.compute();
firstWorker.join();
}
return 0L;
}
private void calculateMatrixMultiplication() {
// prepare input parameter
final int bandwidthMid = colMaximum >> 1;
final int rowMaximum = b.values.length;
final double[] values = valuesMatrix.values;
// execute band matrix multiplication
int index;
int rowOffset;
double sum;
for (int row = rowStart; row < rowEnd; row++) {
rowOffset = row * colMaximum;
sum = 0.0;
for (int col = 0; col < colMaximum; col++) {
index = row  bandwidthMid + col;
if (index < rowMaximum && index >= 0) {
sum += values[col + rowOffset] * b.values[index];
}
}
result.values[row] = sum;
}
}
[...]
}

Band Matrix Multiplication with GPU Implementation Based on Aparpai (4^{th} Version)
This version of the matrix multiplication is a kind of weird trial to simulate double precision floating point operations with long values. The motivation to emulate double precision is the missing FP64 support on many graphic devices.
Listing 5: BandMatrixMultiplicatonAparapi.java
 [...]
public class BandMatrixMultiplicatonAparapi extends Kernel {
long[] vectorB = null;
long[] matrixA = null;
long[] vectorX = null;
int[] bandwidthMid = new int[1];
int[] colMaximum = new int[1];
int[] rowMaximum = new int[1];
long[] POW_10_LONG = new long[100];
public BandMatrixMultiplicatonAparapi() {
setExplicit(true);
POW_10_LONG[0] = 1;
for (int i = 1; i < 100; i++) {
if (i < 19) {
POW_10_LONG[i] = 10 * POW_10_LONG[i  1];
} else {
POW_10_LONG[i] = POW_10_LONG[18];
}
}
this.put(POW_10_LONG);
}
@Override
public void run() {
// prepare input parameter
final int row = getGlobalId();
final int rowOffset = row * colMaximum[0];
// execute band matrix multiplication (for one row)
long sum = 0L;
int index = 0;
for (int col = 0; col < colMaximum[0]; col++) {
index = row  bandwidthMid[0] + col;
if (index < rowMaximum[0] && index >= 0) {
sum = addPacked(sum, multiplyPacked(matrixA[col + rowOffset], vectorB[index]));
}
}
vectorX[row] = sum;
}
public void setMatrixA(long[] values, final int rowNumber, final int bandwidth) {
if (null == matrixA) {
matrixA = values;
}
vectorB = new long[rowNumber];
vectorX = new long[rowNumber];
bandwidthMid[0] = bandwidth >> 1;
colMaximum[0] = bandwidth;
rowMaximum[0] = rowNumber;
}
public void setVectorB(long[] values) {
vectorB = values;
this.put(vectorB);
}
public void putVectorB() {
this.put(vectorB);
}
public void setVectorX(long[] values) {
vectorX = values;
this.put(vectorX);
}
public void getVectorX() {
this.get(vectorX);
}
/**
* Two digits are reserved for exponent [0..99]
*/
private static final int MIN_EXP = 49;
/**
* Used to split the double value to mantissa and exponent.
* The mantissa scaled to use maximal 17 digits. These are
* MANTISSA_DIGITS plus 1 for overflows in add operation
*/
private static final long SPLIT_EXP = 100000000000000000L;
/**
* Used to split the mantissa to a higher and lower integer.
*/
private static final long SPLIT_INT = 100000000L;
public long multiplyPacked(long multiplicand, long multiplier) {
final long md_mantissa = multiplicand  multiplicand / SPLIT_EXP * SPLIT_EXP;
final long md_hi = md_mantissa / SPLIT_INT;
final long md_lo = md_mantissa % SPLIT_INT;
final long mr_mantissa = multiplier  multiplier / SPLIT_EXP * SPLIT_EXP;
final long mr_hi = mr_mantissa / SPLIT_INT;
final long mr_lo = mr_mantissa % SPLIT_INT;
final long product_mantissa = md_hi * mr_hi + md_lo * mr_hi / SPLIT_INT + md_hi * mr_lo / SPLIT_INT;
final long product_exponent = (multiplicand >> 63  multiplicand >>> 63) * (multiplicand / SPLIT_EXP)
+ (multiplier >> 63  multiplier >>> 63) * (multiplier / SPLIT_EXP) + 2 * MIN_EXP + 1;
return (product_exponent  MIN_EXP) * SPLIT_EXP * (product_mantissa >> 63  product_mantissa >>> 63)
+ product_mantissa;
}
public long addPacked(long augend, long addend) {
long augend_exponent = ((augend >> 63  augend >>> 63) * (augend / SPLIT_EXP) + MIN_EXP);
long addend_exponent = ((addend >> 63  addend >>> 63) * (addend / SPLIT_EXP) + MIN_EXP);
if (augend_exponent < addend_exponent) {
// Swap values
augend = augend ^ addend;
addend = addend ^ augend;
augend = augend ^ addend;
final long value = augend / SPLIT_EXP;
augend_exponent = (value * (value >> 63  value >>> 63)) + MIN_EXP;
final long value1 = addend / SPLIT_EXP;
addend_exponent = (value1 * (value1 >> 63  value1 >>> 63)) + MIN_EXP;
final long addend_mantissa = addend  addend / SPLIT_EXP * SPLIT_EXP;
final long augend_mantissa = augend  augend / SPLIT_EXP * SPLIT_EXP;
final long sum_mantissa = augend_mantissa + addend_mantissa
/ POW_10_LONG[(int) (augend_exponent  addend_exponent)];
return (augend_exponent  MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63  sum_mantissa >>> 63)
+ sum_mantissa;
} else {
final long addend_mantissa = addend  addend / SPLIT_EXP * SPLIT_EXP;
final long augend_mantissa = augend  augend / SPLIT_EXP * SPLIT_EXP;
final long sum_mantissa = augend_mantissa + addend_mantissa
/ POW_10_LONG[(int) (augend_exponent  addend_exponent)];
return (augend_exponent  MIN_EXP) * SPLIT_EXP * (sum_mantissa >> 63  sum_mantissa >>> 63)
+ sum_mantissa;
}
}
}

Measurements
Running the test cases (just call run_x86_64.cmd) should create a file called run_x86_64_output.txt with the following content:
 #rows #cols v1 v2 v3a v3b v4a v4b v4c
128 101 71 37 21 36 168 2010 323
256 101 249 47 35 52 687 1408 684
384 101 624 113 88 104 1701 1780 1313
512 101 1212 227 153 119 3187 2684 1898
640 101 1863 340 233 151 5022 3135 2518
768 101 2589 476 338 196 11714 6401 5719
896 101 4178 800 527 289 11993 5552 5331
1024 101 8189 1564 1065 556 14325 5323 6018
1152 101 6268 1141 802 415 20333 6387 9050
1280 101 7756 1406 971 479 21357 6589 8975
1408 101 13574 2494 1726 898 25982 8166 10712
1536 101 11509 2205 1482 677 31015 9111 12386
1664 101 13856 2543 1769 769 36230 17912 14983
1792 101 16863 3150 2107 943 43660 12440 16955
1920 101 18811 3366 2249 991 48567 15142 17700
2048 101 22025 4161 2933 1165 72349 21758 26145
2176 101 23896 4368 2860 1209 66038 22883 23522
2304 101 26992 5053 3292 1365 70734 36415 25187
2432 101 30172 5675 3818 1622 79120 26254 28384
2560 101 33600 6554 4248 1698 158546 81976 57219
2688 101 36507 6978 4629 1802 97684 32772 36593
2816 101 79837 15269 10169 4443 185933 89416 66535
2944 101 43993 8484 5506 2142 114659 32969 53044
3072 101 48391 9562 6246 2549 206562 75176 81600
3200 101 52567 9979 6730 2610 213407 63704 80113
3328 101 57454 11042 7440 2980 199818 61465 78953
3456 101 62078 11850 8369 3370 200175 58826 77425
3584 101 97655 18452 12163 4961 171394 51921 65358
3712 101 130234 24823 16679 7827 239991 69402 92992

where:
 v1  is the initial not optimized (single thread) version
 v2  is the optimized (single thread) version with vector class instead of matrix class and half band matrix format
 v3a  is the most optimized (single thread) version with full band matrix format
 v3b  is the most optimized (fork & join multithreaded) version with full band matrix format
 v4a  is based on v3a, emulates floating point operations with PackedDouble.java
 v4b  is based on v3a, emulates floating point operations and runs with Aparapi in GPU mode
 v4c  is based on v3a, emulates floating point operations and runs with Aparapi in JTP mode
Figure 1: Conjugate Gradient with Random Numbers
The manual implemented Fork & Join version (v3b) outperforms all other implementations (> 500 rows). Aparapi's JTP Fallback version was so slow, that I decided to remove it from my further investigations. Feel free to implement it and experiment with group size to change the number of uses threads (the default number is much to high and obviously there is no proper thread pooling used, because the number of threads is oscillating during repeated execution). Maybe I'm too stupid to understand why Aparapi's JTP Fallback have been implemented in this way.
The ratio between the CPU version (v4a) and CPU (v4b) is about 3:1. This was the speedup by running on GPU on a nonhighperformance graphic device with Aparapi (on a highperformance device the possible speedup factor should be something between 20 and 40).
The ratio between the CPU versions (v1 and v3b) is about 20:1. This was the speedup by an optimized implementation without Aparapi & GPU. This test has been done on a PC with 4 available CPU cores, for more cores the expected speedup should be higher.
Lessons Learned
Most Graphic Devices Have No FP64 Support
This is the most serious issue with GPU programming. Just the highend devices have FP64 (native double) support. On standard devises (like my notebook PC) you will get a warning like:

Warning: Reverting to Java Thread Pool (JTP) for class [...]: FP64 required but not supported

and you GPU code will be executed in fallback mode (Java Thread Pool).
Aparapi Fallback to JTP Mode is Slow
In the case you implement the GPU code with Aparapi and the need of FP64 and you don't have a graphic device with native support, the code will be executed in JTP mode. Unless you set the Group Size in the Range to a reasonable number of threads the JTP mode is extremely slow (because too much threads are used). Compared with the fork & join implementation the JTP mode is very inefficient.
Optimize CPU Code First
Comparing the non GPU versions of the band matrix implementation show that some simple actions for more performance on CPU may result in a significant improvement. Between v1 (initial version) and v3b (optimized and manual fork & join) is a ratio of 20:1 on a standard CPU.
You Have to Decide the Target Platform
In some cases Aparapi code works also on NVIDIA and/or Intel graphic devices  but some of my tests with different hardware showed that this is by far not guaranteed. Less than 50% of the machines were able to execute the code without additional work (e.g. change of OpenCl.dll).
Not all Algorithms Result in Better Performance
Be aware that algorithms which need a lot of logic operations and/or data transfer to/from GPU tend to be slower than expected. In general GPUs are not made for complex logic and quite often the data transfer is a real bottle neck.
Your Notebook PCs May Overheat
Under Heavy GPU Load
Running all the test cases caused on my notebook the following screen:
 The system BIOS has detected your notebook PC was put into hibernation
operating normally.
Overheating may occur if the cooling vents are blocked or the operating temperature should return to normal operation once the situation is resolved.
System Temperature (90D)
ENTER  Continue Startup
For more information, please visit: www.hp.com\go\techcenter\startup

Obviously this hardware is not made for heavy GPU load (there is a reason why the highend graphic cards have all these ventilators and cooling devices)
Recommendations
 First have a look at your algorithm. Maybe there is a lot to optimize before you need to start with GPU programming
 Java is maybe not the best choice to do GPU programming, because you lose the "write once run everywhere" benefit of Java and with the typical Java Bindings you don't have full control about your solution
 If you have the need for more computational power, you may decide to use a highend graphic device and think about using directly C/C++ and/or OpenCL
 Don't try to emulate double precision in a GPU  the needed overhead will be so slow that all possible performance benefits will be compensated
 Nevertheless spend time with GPU programming  because massive parallel programming is the model of the future. Even in the case you are not successful, you will learn some important lessons
Find Code on GitHub
