BY MARKUS SPRUNCK
One of the most interesting improvements in Java 7 is the better support for concurrency. With the JSR 166 Concurrency Utilities we get some very helpful additions.
From my point of view the fork/join library has very high potential for practical use in software engineering. Fork and join provides a very simple programming model for algorithms that can be implemented as recursive tasks, and a lot of algorithms can be expressed in this divide-and-conquer style.
Over the next years we will see an increasing number of cores in standard desktops, mobiles, notebooks and servers. The reason for this is simple: it is cheaper to add additional cores than to build a faster single processor. So we will have to write more software that supports concurrency in order to benefit from better hardware. My personal rule is: "You need a good reason to implement concurrency, and if you have to do it, be really careful."
In the last years I have seen more buggy implementations than working ones. This is the reason why I like the fork and join library: a clear programming model that implements the boilerplate code helps you avoid errors. If you intend to use fork and join, take some time to understand its behavior.
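To illustrate the programming model before the Fibonacci example, here is a minimal sketch of my own (not part of the original article) showing the typical shape of a RecursiveTask: a divide-and-conquer sum over an array with a sequential cut-off threshold, which is the usual way to keep task overhead in check.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: recursive array sum with a sequential cut-off.
public class SumTask extends RecursiveTask<Long> {

    private static final int THRESHOLD = 1000; // below this size, compute directly

    private final long[] data;
    private final int from;
    private final int to;

    public SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // small enough: sum sequentially in the current thread
            long sum = 0L;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // split the range in half; fork the left half, compute the right half here
        final int mid = (from + to) / 2;
        final SumTask left = new SumTask(data, from, mid);
        left.fork();
        final SumTask right = new SumTask(data, mid, to);
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 49995000
    }
}
```

The pattern "fork the first half, compute the second half in the current thread, then join" is the same one used in the Fibonacci example below.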
Java Example
The example in files #1 and #2 is very similar to the example code in the Java 7 documentation. In general, calculating Fibonacci numbers with a recursive algorithm is not a good idea, because there are better linear solutions (compare http://nayuki.eigenstate.org/page/fast-fibonacci-algorithms), but it is easier to implement and understand than other examples.
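For comparison, here is a sketch of my own (not part of the original listings) of the simple linear solution mentioned above — an iterative O(n) loop instead of recursion:

```java
// Iterative O(n) Fibonacci, for comparison with the recursive fork/join version.
public class FibonacciIterative {

    public static long fibonacci(long n) {
        if (n == 0L) {
            return 0L;
        }
        long previous = 0L; // Fibonacci(0)
        long current = 1L;  // Fibonacci(1)
        for (long i = 2L; i <= n; i++) {
            final long next = previous + current;
            previous = current;
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(fibonacci(20)); // prints 6765
    }
}
```

This version does each addition exactly once, while the recursive version below recomputes the same values many times — which is exactly what makes it a good demonstration object for fork and join.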
FibonacciTask.java [error handling, parameter validation and asserts removed]

package com.sprunck.sample;

import java.util.concurrent.RecursiveTask;

public class FibonacciTask extends RecursiveTask<Long> {

    private static final long serialVersionUID = 1L;

    private final long inputValue;

    public FibonacciTask(long inputValue) {
        this.inputValue = inputValue;
    }

    @Override
    public Long compute() {
        if (inputValue == 0L) {
            return 0L;
        } else if (inputValue <= 2L) {
            return 1L;
        } else {
            // calculate Fibonacci(n-1) asynchronously in another worker thread
            final FibonacciTask firstWorker = new FibonacciTask(inputValue - 1L);
            firstWorker.fork();
            // calculate Fibonacci(n-2) in the current thread, then wait for the fork
            final FibonacciTask secondWorker = new FibonacciTask(inputValue - 2L);
            return secondWorker.compute() + firstWorker.join();
        }
    }
}

FibonacciTaskTest.java

package com.sprunck.sample;

import java.util.concurrent.ForkJoinPool;

import org.junit.Assert;
import org.junit.Test;

public class FibonacciTaskTest {

    // it makes no sense to create more threads than available cores (no speed improvement)
    private static final int AVAILABLE_PROCESSORS = Runtime.getRuntime().availableProcessors();

    // create the thread pool
    private final ForkJoinPool pool = new ForkJoinPool(AVAILABLE_PROCESSORS);

    @Test
    public void testFibonacciArray() {
        // more test data:
        // http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html
        final long[] results = { 0L, 1L, 1L, 2L, 3L, 5L, 8L, 13L, 21L, 34L, 55L,
                89L, 144L, 233L, 377L, 610L, 987L, 1597L, 2584L, 4181L, 6765L };
        for (int inputValue = 0; inputValue < results.length; inputValue++) {
            final FibonacciTask task = new FibonacciTask(inputValue);
            System.out.print("Fibonacci(" + inputValue + ") = ");
            final long result = pool.invoke(task);
            System.out.println(result);
            Assert.assertEquals(results[inputValue], result);
        }
    }
}

Expected output of FibonacciTaskTest.java should look like this:

Fibonacci(0) = 0
Fibonacci(1) = 1
Fibonacci(2) = 1
Fibonacci(3) = 2
Fibonacci(4) = 3
Fibonacci(5) = 5
Fibonacci(6) = 8
Fibonacci(7) = 13
Fibonacci(8) = 21
Fibonacci(9) = 34
Fibonacci(10) = 55
Fibonacci(11) = 89
Fibonacci(12) = 144
Fibonacci(13) = 233
Fibonacci(14) = 377
Fibonacci(15) = 610
Fibonacci(16) = 987
Fibonacci(17) = 1597
Fibonacci(18) = 2584
Fibonacci(19) = 4181
Fibonacci(20) = 6765

So far this is a simple and clear solution: no boilerplate code for concurrency, e.g. thread synchronization.
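To make the saved boilerplate concrete, here is a hedged sketch of my own (not from the original article) of what even a single divide step looks like with a raw Thread instead of fork/join — you have to manage the worker's lifetime, the result hand-off and the checked InterruptedException yourself:

```java
// Hypothetical contrast: one divide step done with a raw Thread instead of fork/join.
public class ManualSplit {

    // plain sequential recursion with the same base cases as the article's task
    static long fibonacciSequential(long n) {
        if (n == 0L) {
            return 0L;
        } else if (n <= 2L) {
            return 1L;
        }
        return fibonacciSequential(n - 1L) + fibonacciSequential(n - 2L);
    }

    public static long fibonacci(final long n) throws InterruptedException {
        if (n <= 2L) {
            return fibonacciSequential(n);
        }
        final long[] firstResult = new long[1]; // hand-off slot for the worker thread
        Thread worker = new Thread(new Runnable() {
            @Override
            public void run() {
                firstResult[0] = fibonacciSequential(n - 1L);
            }
        });
        worker.start();
        final long second = fibonacciSequential(n - 2L);
        worker.join(); // join() establishes happens-before: firstResult[0] is visible here
        return firstResult[0] + second;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fibonacci(20L)); // prints 6765
    }
}
```

Fork/join replaces all of this hand-written plumbing with fork(), compute() and join(), and additionally reuses pooled worker threads instead of creating a new thread per split.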
Java Example with Traces
But I'd like to encourage you to take a deeper look at what happens inside the solution. In files #3 and #4 you find an enhanced version of the same program. The only difference between the first and second version is some code to trace what happens during execution and a small slowTask() method to simulate more realistic behavior.
FibonacciTaskTraces.java

package com.sprunck.sample;

import java.util.concurrent.RecursiveTask;

public class FibonacciTaskTraces extends RecursiveTask<Long> {

    private static final long serialVersionUID = 1L;

    // just needed to format the debug output
    public static final String OUTPUT_PREFIX = "  ";

    private final String prefix;

    private final long inputValue;

    public FibonacciTaskTraces(long inputValue, final String prefix) {
        this.inputValue = inputValue;
        this.prefix = prefix;
    }

    @Override
    public Long compute() {
        if (inputValue == 0L) {
            slowTask();
            return 0L;
        } else if (inputValue <= 2L) {
            slowTask();
            return 1L;
        } else {
            final long firstValue = inputValue - 1L;
            System.out.println(prefix + "  Fibonacci(" + firstValue + ") < "
                    + Thread.currentThread().getName() + " (fork)");
            final FibonacciTaskTraces firstWorker =
                    new FibonacciTaskTraces(firstValue, prefix + OUTPUT_PREFIX);
            firstWorker.fork();
            final long secondValue = inputValue - 2L;
            System.out.println(prefix + "  Fibonacci(" + secondValue + ") < "
                    + Thread.currentThread().getName());
            final FibonacciTaskTraces secondWorker =
                    new FibonacciTaskTraces(secondValue, prefix + OUTPUT_PREFIX);
            final long result = secondWorker.compute() + firstWorker.join();
            System.out.println(prefix + "  Fibonacci(" + inputValue + ") = " + result
                    + " < " + Thread.currentThread().getName() + " (join)");
            slowTask();
            return result;
        }
    }

    /** just simulates a longer running task (without disturbing the other threads) */
    private void slowTask() {
        long k = 0L;
        for (int i = 0; i < 1000 * 1000 * 100; i++) {
            k = k + i; // busy work; accumulate so the loop is not trivially empty
        }
    }
}

FibonacciTaskTracesTest.java

package com.sprunck.sample;

import java.util.concurrent.ForkJoinPool;

import org.junit.Assert;
import org.junit.Test;

public class FibonacciTaskTracesTest {

    // it makes no sense to create more threads than available cores (no speed improvement)
    private static final int AVAILABLE_PROCESSORS = Runtime.getRuntime().availableProcessors();

    // create the thread pool
    private final ForkJoinPool pool = new ForkJoinPool(AVAILABLE_PROCESSORS);

    @Test
    public void testFibonacciArrayTraces() {
        // more test data:
        // http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html
        final long[] results = { 0L, 1L, 1L, 2L, 3L, 5L, 8L, 13L };
        for (int inputValue = 0; inputValue < results.length; inputValue++) {
            final FibonacciTaskTraces task = new FibonacciTaskTraces(inputValue, "  ");
            System.out.println("invoke Fibonacci(" + inputValue + ") < "
                    + Thread.currentThread().getName());
            final long result = pool.invoke(task);
            System.out.println("result = " + result + "\n");
            Assert.assertEquals(results[inputValue], result);
        }
    }
}

Expected output of FibonacciTaskTracesTest.java should look like this:

invoke Fibonacci(0) < main
result = 0
invoke Fibonacci(1) < main
result = 1
invoke Fibonacci(2) < main
result = 1
invoke Fibonacci(3) < main
  Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
  Fibonacci(1) < ForkJoinPool-1-worker-1
  Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
result = 2
invoke Fibonacci(4) < main
  Fibonacci(3) < ForkJoinPool-1-worker-1 (fork)
  Fibonacci(2) < ForkJoinPool-1-worker-1
   Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
   Fibonacci(1) < ForkJoinPool-1-worker-2
   Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
  Fibonacci(4) = 3 < ForkJoinPool-1-worker-1 (join)
result = 3
invoke Fibonacci(5) < main
  Fibonacci(4) < ForkJoinPool-1-worker-1 (fork)
  Fibonacci(3) < ForkJoinPool-1-worker-1
   Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
   Fibonacci(1) < ForkJoinPool-1-worker-1
   Fibonacci(3) < ForkJoinPool-1-worker-2 (fork)
   Fibonacci(2) < ForkJoinPool-1-worker-2
    Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
    Fibonacci(1) < ForkJoinPool-1-worker-2
   Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
    Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
   Fibonacci(4) = 3 < ForkJoinPool-1-worker-2 (join)
  Fibonacci(5) = 5 < ForkJoinPool-1-worker-1 (join)
result = 5
invoke Fibonacci(6) < main
  Fibonacci(5) < ForkJoinPool-1-worker-1 (fork)
  Fibonacci(4) < ForkJoinPool-1-worker-1
   Fibonacci(3) < ForkJoinPool-1-worker-1 (fork)
   Fibonacci(2) < ForkJoinPool-1-worker-1
   Fibonacci(4) < ForkJoinPool-1-worker-2 (fork)
   Fibonacci(3) < ForkJoinPool-1-worker-2
    Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
    Fibonacci(1) < ForkJoinPool-1-worker-2
    Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
    Fibonacci(1) < ForkJoinPool-1-worker-1
    Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
    Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
    Fibonacci(3) < ForkJoinPool-1-worker-2 (fork)
    Fibonacci(2) < ForkJoinPool-1-worker-2
   Fibonacci(4) = 3 < ForkJoinPool-1-worker-1 (join)
     Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
     Fibonacci(1) < ForkJoinPool-1-worker-2
     Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
    Fibonacci(4) = 3 < ForkJoinPool-1-worker-2 (join)
   Fibonacci(5) = 5 < ForkJoinPool-1-worker-2 (join)
  Fibonacci(6) = 8 < ForkJoinPool-1-worker-1 (join)
result = 8
invoke Fibonacci(7) < main
  Fibonacci(6) < ForkJoinPool-1-worker-1 (fork)
  Fibonacci(5) < ForkJoinPool-1-worker-1
   Fibonacci(4) < ForkJoinPool-1-worker-1 (fork)
   Fibonacci(3) < ForkJoinPool-1-worker-1
    Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
    Fibonacci(1) < ForkJoinPool-1-worker-1
   Fibonacci(5) < ForkJoinPool-1-worker-2 (fork)
   Fibonacci(4) < ForkJoinPool-1-worker-2
    Fibonacci(3) < ForkJoinPool-1-worker-2 (fork)
    Fibonacci(2) < ForkJoinPool-1-worker-2
     Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
     Fibonacci(1) < ForkJoinPool-1-worker-2
    Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
    Fibonacci(3) < ForkJoinPool-1-worker-1 (fork)
    Fibonacci(2) < ForkJoinPool-1-worker-1
     Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
     Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
     Fibonacci(1) < ForkJoinPool-1-worker-1
    Fibonacci(4) = 3 < ForkJoinPool-1-worker-2 (join)
    Fibonacci(4) < ForkJoinPool-1-worker-2 (fork)
    Fibonacci(3) < ForkJoinPool-1-worker-2
     Fibonacci(2) < ForkJoinPool-1-worker-2 (fork)
     Fibonacci(1) < ForkJoinPool-1-worker-2
     Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
    Fibonacci(4) = 3 < ForkJoinPool-1-worker-1 (join)
     Fibonacci(3) = 2 < ForkJoinPool-1-worker-2 (join)
   Fibonacci(5) = 5 < ForkJoinPool-1-worker-1 (join)
     Fibonacci(3) < ForkJoinPool-1-worker-2 (fork)
     Fibonacci(2) < ForkJoinPool-1-worker-2
      Fibonacci(2) < ForkJoinPool-1-worker-1 (fork)
      Fibonacci(1) < ForkJoinPool-1-worker-1
      Fibonacci(3) = 2 < ForkJoinPool-1-worker-1 (join)
     Fibonacci(4) = 3 < ForkJoinPool-1-worker-2 (join)
    Fibonacci(5) = 5 < ForkJoinPool-1-worker-2 (join)
   Fibonacci(6) = 8 < ForkJoinPool-1-worker-2 (join)
  Fibonacci(7) = 13 < ForkJoinPool-1-worker-1 (join)
result = 13

The output now gives you a deeper look into the processing of the program. The following patterns of Fibonacci number calculation appear:
- the first three Fibonacci numbers are processed in the main thread
- the next Fibonacci number is processed in just one worker thread (ForkJoinPool-1-worker-1)
- starting with the fifth Fibonacci number, two threads (ForkJoinPool-1-worker-1 and ForkJoinPool-1-worker-2) are used.
The algorithm is inefficient, because it performs a lot of redundant operations (the same Fibonacci numbers are recalculated over and over). In a real-life application you should be careful with this kind of inefficient algorithm. Some traces help to understand what happens.
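One way to remove the redundant recalculations is memoization. The following is a sketch of my own (not part of the original article): each task first checks a ConcurrentHashMap shared by all tasks, so every Fibonacci number is computed roughly once instead of exponentially often.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical variant: fork/join Fibonacci with a shared result cache.
public class FibonacciMemoTask extends RecursiveTask<Long> {

    private static final long serialVersionUID = 1L;

    private final long inputValue;
    private final ConcurrentMap<Long, Long> cache;

    public FibonacciMemoTask(long inputValue, ConcurrentMap<Long, Long> cache) {
        this.inputValue = inputValue;
        this.cache = cache;
    }

    @Override
    protected Long compute() {
        final Long cached = cache.get(inputValue);
        if (cached != null) {
            return cached; // already calculated by some task
        }
        final long result;
        if (inputValue == 0L) {
            result = 0L;
        } else if (inputValue <= 2L) {
            result = 1L;
        } else {
            final FibonacciMemoTask first = new FibonacciMemoTask(inputValue - 1L, cache);
            first.fork();
            final FibonacciMemoTask second = new FibonacciMemoTask(inputValue - 2L, cache);
            result = second.compute() + first.join();
        }
        // two tasks may race to compute the same value; both arrive at the same
        // result, so losing the putIfAbsent race is harmless
        cache.putIfAbsent(inputValue, result);
        return result;
    }

    public static void main(String[] args) {
        ConcurrentMap<Long, Long> cache = new ConcurrentHashMap<Long, Long>();
        long result = new ForkJoinPool().invoke(new FibonacciMemoTask(30L, cache));
        System.out.println(result); // prints 832040
    }
}
```

Note that this trades the exponential recomputation for contention on the shared map, so it is meant as an illustration of the idea rather than a tuned solution.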
Recommendations
- The use of fork and join is easy and straightforward, but take some time to trace and understand your implementation
- Sometimes it is helpful to implement two versions of the same algorithm (one for analysis and a second for production)
- Spending some time on designing and understanding better concurrency algorithms is a good investment
