More‎ > ‎

Memory and Runtime Analysis for the removeAll() Method of Java TreeSet and ArrayList

BY MARKUS SPRUNCK


The following article describes some important differences between the removeAll() method in the classes TreeSet and ArrayList. Both implementations of removeAll() show quiet different behavior in respect to run time and memory consumption. It is important to know the behavior of these two and other collection classes of Java to select the right for a given task.

Requirements

Find all double entries in one data set which are also in a second data set and remove. This may be needed in the case the primary keys of two different data sources have to be compared - to find missing entries in one data set. In this case an easy solution is to use the removeAll() method, but what is the best data structure in this situation?

Differences Between TreeSet and ArrayList

The ArrayList has two nested for-loops, compares all elements until it finds the equal and removes it. Run time is expected to be something like O(n2) - so, a double size of n means quadruple n the run time.

The TreeSet has a different implementation - one for-loop for all elements, then a search in the tree and removal of the double element. This means that a TreeSet has a expected run time of about O(n*log(n)). From the run time point of view the TreeSet should be the clear winner.

Simplified source code of removeAll() method for ArrayList and TreeSet from JDK 7:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
// ArrayList - strongly simplyfied code
removeAll(Collection c)
{          
    // 1) for each element
    for (r = 0; r < size; r++) {            
        // 2) for each element         
        for (int i = 0; i < size; i++) {
                // 3) compare          
                if (o.equals(elementData[i])) {        
                    // 4) remove this entry
            }
        }
    }                 
}
 
 
// TreeSet - strongly simplyfied code
removeAll(Collection c) {
    // 1) for each element
    for (Iterator i = c.iterator(); i.hasNext(); ) {   
        // 2) find in tree
        while (p != null) {    
            // 3) compare                      
            int cmp = k.compareTo(p.key);              
            if (cmp < 0)
                p = p.left;     // 4a) remove this entry                           
            else if (cmp > 0)
                p = p.right;    // 4b) remove this entry
            else
                p;              // 4c) remove this entry
        }
    }
}

The TreeSet class has some overhead for managing nodes and should need more memory than the simple ArrayList class. The following measurements give a clearer picture what happens.

Runtime Comparison

Figure 1 depicts the measured results for a ArrayList (orange) and a TreeSet (blue) implementation - for larger n the TreeSet is the clear winner. 

  
Figure 1: Time with removeAll()

For small data sets (n < 200) the ArrayList would be the better choice, because creating the TreeSet has noticeable overhead for filling the tree from an array (see Figure 2). Filling the ArrayList is a simple and just a very fast copy operation.
 
  
Figure 2: Time without removeAll()

So, from the performance perspective the TreeSet is the clear winner at least for large data.

Memory Comparison

The TreeSet will need more memory. Dependent on the size of the content (i.e. items to be stored) - a lot of memory overhead is possible. In the sample code (see below) the TreeSet need for large n significantly more memory than the ArrayList (see Figure 3).
  
Figure 3: Memory with the removeAll() call

Source Code of Micro Benchmark

To reproduce the results run with the VM Option "-Xint". This option means no to use the Just-in-time compilation of the VM.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
package com.sw_engineering_candies.sample;
 
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
 
public class RemoveAllWithArrayList {
 
    public static void main(String[] args) {
 
        System.out.println("SIZE\tTreeSetBytes\tTreeSetTime\tArrayListBytes\tArrayListTime");
 
        for (int size = 10; size < 150000; size *= 1.1) {
            System.out.print(size);
 
            final String[] firstArray = new String[size];
            int firstIndex = 0;
            for (int i = 0; firstIndex < size; i++) {
                if (Math.random() < 0.9f) {
                    firstArray[firstIndex] = "ID" + i;
                    firstIndex++;
                }
            }
 
            final String[] secondArray = new String[size];
            int scondIndex = 0;
            for (int k = 0; scondIndex < size; k++) {
                if (Math.random() < 0.9f) {
                    secondArray[scondIndex] = "ID" + k;
                    scondIndex++;
                }
            }
 
            missingEntriesTreeSet(firstArray, secondArray);
            missingEntriesArrayList(firstArray, secondArray);
 
            System.out.println();
        }
 
    }
 
    private static void missingEntriesTreeSet(String[] array1, String[] array2) {
 
        Runtime runtime = Runtime.getRuntime();
        runtime.gc();
        final long memoryStart = runtime.totalMemory() - runtime.freeMemory();
        final long start = System.nanoTime();
 
        TreeSet< String > firstTemp = new TreeSet< String >(Arrays.asList(array1));
        TreeSet< String > secondTemp = new TreeSet< String >(Arrays.asList(array2));
        firstTemp.removeAll(secondTemp);
 
        final long end = System.nanoTime();
        final long memoryEnd = (runtime.totalMemory() - runtime.freeMemory());
        System.out.print("\t" + (memoryEnd - memoryStart));
        System.out.print("\t" + (end - start));
    }
 
    private static void missingEntriesArrayList(String[] array1, String[] array2) {
 
        Runtime runtime = Runtime.getRuntime();
        runtime.gc();
        final long memoryStart = runtime.totalMemory() - runtime.freeMemory();
        final long start = System.nanoTime();
 
        List< String > firstTemp = new ArrayList< String >(Arrays.asList(array1));
        List< String > secondTemp = new ArrayList< String >(Arrays.asList(array2));
        firstTemp.removeAll(secondTemp);
 
        final long end = System.nanoTime();
        final long memoryEnd = (runtime.totalMemory() - runtime.freeMemory());
        System.out.print("\t" + (memoryEnd - memoryStart));
        System.out.print("\t" + (end - start));
    }
}

The program should print the following to the standard output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SIZE    TreeSetBytes    TreeSetTime     ArrayListBytes  ArrayListTime
10      91776           608318          91776           82133
11      912             81784           91848           18787
12      976             90235           91856           18927
13      1040            99734           91856           17321
14      1104            109581          91864           19625
15      1168            120406          91864           17320
16      1232            126623          91872           18089
17      1296            134235          91872           16343
18      1360            141009          91880           22209
19      1424            145758          91880           19137
20      1488            164476          91888           19626
22      1616            188222          91896           19626
24      1744            207708          91904           17809
26      1872            224121          91912           22070
28      2000            246400          91920           20882
30      2128            269308          91928           21301
33      2320            299060          91936           21441
36      2512            321270          91952           15155
39      2704            348927          91960           15435
42      2896            384197          91976           13759
46      3152            430082          91992           12012
50      3408            471848          92008           14387
55      3728            522343          92024           11734
60      4048            587156          92048           11803
66      4432            652876          92072           16692
72      4816            729353          92096           20114
79      5264            825454          92120           17042
... 

Recommendations

  • Take enough time to think about the optimal data structure for your task. Sometimes the overhead for a transformation will be compensated by better run time and/or memory consumption. 

  • Verify the expected behavior with measurements. A small micro benchmark cost just some minutes and the result can help to avoid serious performance issues in production.

Sponsored Link