Inside Set in Java

You know what a "Set" is in java. Its an interface which extends collection interface. It contains no duplicate elements i.e., no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. All is fine and cool until here.

But the question is how set maintains unique elements? Here is how. Internally set maintains a Map when a Set (HashSet/Treeset/etc) is instantiated. See below code of HashSet constructor.

    
/**
 * Constructs a new, empty set; the backing HashMap instance has
 * default initial capacity (16) and load factor (0.75).
 */
public HashSet() {
	map = new HashMap();
}

So HashSet creates an instance of HashMap when HashSet is instantiated, TreeSet creates an instance of TreeMap when TreeSet is instantiated and so on. The keys in the map are the elements you add to Set. Then what are values? Values are dummy objects. When you add an element to the Set, the element will be added as key and "new Object()" (dummy Object instance) will be added as value which is a dummay value. What will happen if you add a duplicate object to Set? If the set already contains the element, the call to add method leaves the set unchanged and returns false. If the set does not contain the element, the call to add method adds the element and returns true.

All set implementations (HashSet, TreeSet, EnumSet) implements Serializable interface and so contains methods writeObject and readObject to save or reconstitute the state of Set (HashSet, TreeSet, EnumSet) instance to/from a stream. When you serialize this instance, writeObject and readObject methods will be called.

Again all set implementations are not synchronized. If multiple threads access a set concurrently, and at least one of the threads modifies the set, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the set. If no such object exists, the set should be "wrapped" using the Collections.synchronizedSet method like below.

 
Ex:
Set s = Collections.synchronizedSet(new HashSet(...));

Another point. Except EnumSet, the iterators returned by the implemented class's (HashSet/TreeSet) iterator method are fail-fast i.e., if the set is modified at any time after the iterator is created, in any way except through the iterator's own remove method, the Iterator throws a ConcurrentModificationException. But again it is not guaranteed to throw "ConcurrentModificationException" by iterator all the time but will be thrown on best-effort basis. So dont rely on this exception for the correctness of the program.


Now lets see each of different implementations of Set provided by Java and what additional features are provided by them other than ones defined above.

Note that the operations like adding elements to the set, traversing the set, removing elements from set, etc will use the implementations of the respective backing map.

 

HashSet:
HashSet is backed by a HashMap instance and hence it allows the null element. It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.

When do you use HashSet?
When you are looking for performance, use HashSet. Since this class uses the hash function when retrieving the elements, it allows fast retrieval. This class offers constant time performance for the basic operations add, remove, contains and size, assuming the hash function disperses the elements properly among the buckets. Iterating over this set requires time proportional to the sum of the HashSet instance's size (the number of elements) plus the "capacity" of the backing HashMap instance the number of buckets. Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.

 

LinkedHashSet:
LinkedHashSet is backed by LinkedHashMap. So all the elements in LinkedHashSet are actually the keys in the LinkedHashMap. It maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is the order in which elements were inserted into the set (insertion-order). This class provides all of the optional Set operations, and permits null elements.

When do you use LinkedHashSet?
When you are looking to produce a copy of a set that has the same order as the original, regardless of the original set's implementation without incurring the increased cost (associated with TreeSet).

    
void foo(Set s) {
	Set copy = new LinkedHashSet(s);
	...
}

This technique is particularly useful if a module takes a set on input, copies it, and later returns results whose order is determined by that of the copy. Like HashSet, it provides constant-time performance for the basic operations add, contains and remove), assuming the hash function disperses elements properly among the buckets. Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.


A linked hash set has two parameters that affect its performance: initial capacity and load factor. They are defined precisely as for HashSet. Note, however, that the penalty for choosing an excessively high value for initial capacity is less severe for this class than for HashSet, as iteration times for this class are unaffected by capacity.

 

TreeSet:
TreeSet is backed by TreeMap instance and TreeSet wont permit null elements. As opposing to HashSet, TreeSet provides a total ordering on its elements and especially when you need a sorted order. The elements are ordered either by using their plain Comparable natural ordering (if you dont pass any parameter for the TreeSet constructor) or by a Comparator typically provided at sorted set creation time as a parameter to the constructor. All elements inserted into this set must either implement the Comparable interface or atleast be accepted by the specified comparator. So all such elements must be mutually comparable i.e., e1.compareTo(e2) (or comparator.compare(e1, e2)) must not throw a ClassCastException.

When do you use TreeSet?
Its very well explained above that TreeSet will be used when the sorted order of inserted elements is required.

What about the performance? The performance obviously will be low compared to HashSet. A TreeSet may be accessed and traversed in either ascending or descending order. The descendingSet method returns a view of the set with the senses of all relational and directional methods inverted. The performance of ascending operations and views is likely to be faster than that of descending ones.

 

EnumSet:
EnumSet as the name itself says is a specialized set for use with enum types. All of the elements in an enum set must come from a single enum type that is specified, explicitly or implicitly, when the set is created. Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set. All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.

The iterator returned by the iterator method traverses the elements in their natural order (the order in which the enum constants are declared). The returned iterator is weakly consistent: it will never throw ConcurrentModificationException and it may or may not show the effects of any modifications to the set that occur while the iteration is in progress.


Null elements are not permitted in EnumSet. Attempts to insert a null element will throw NullPointerException. Attempts to test for the presence of a null element or to remove one will, however, function properly.

EnumSet internally uses RegularEnumSet class, a private implementation class for EnumSet, for "regular sized" enum types i.e., those with 64 or fewer enum constants while it used JumboEnumSet class if the enum constants are more than 64 which is another private implementation class for EnumSet.



blog comments powered by Disqus