program story

HashMap Java 8 implementation

inputbox 2020. 9. 23. 07:34



As per the following linked document: Java HashMap implementation

I am confused about the implementation of HashMap (or rather, the enhancements to HashMap). My queries are:

Firstly,

static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;

Why and how are these constants used? I need a clear example of this. How do they achieve a performance gain with these?

Secondly,

if you look at the source code of HashMap in the JDK, you will find the following static inner class:

static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;
    boolean red;

    TreeNode(int hash, K key, V value, HashMap.Node<K, V> next) {
        super(hash, key, value, next);
    }

    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode<K, V> node = this;

        while (true) {
            HashMap.TreeNode<K, V> parent = node.parent;
            if (parent == null) {
                return node;
            }

            node = parent;
        }
    }
    //...
}

How is it used? I just want an explanation of the algorithm.


HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put an entry into. For simplicity's sake, imagine it as a modulus.

If our hashcode is 123456 and we have 4 buckets, then 123456 % 4 = 0, so the entry goes in the first bucket.
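The simplified modulo mapping can be sketched like this (illustrative only; as discussed further down, the real HashMap uses a bit mask rather than %):

```java
public class BucketIndex {
    public static void main(String[] args) {
        int hashCode = 123456;
        int buckets = 4;
        // Simplified model of bucket selection; the real HashMap uses (n - 1) & hash
        int index = hashCode % buckets;
        System.out.println(index); // 123456 % 4 == 0 -> the first bucket
    }
}
```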

HashMap

If our hashcode function is good, it should provide an even distribution so that all the buckets are used somewhat equally. In this case, a bucket uses a linked list to store its values.

Linked buckets

But we can't rely on people to implement good hash functions. People often write poor hash functions which result in a non-even distribution. It's also possible that we just get unlucky with our inputs.

Bad hashmap

The less even this distribution is, the further we move from O(1) operations and the closer we move toward O(n) operations.

The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.

Treeified bucket

This tree is a red-black tree. It is sorted first by hash code. If the hash codes are the same, it uses the compareTo method of Comparable, if the objects implement that interface, and otherwise the identity hash code.
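That ordering can be sketched as follows. This is an illustrative sketch, not the actual JDK code (the real logic lives inside HashMap.TreeNode, e.g. in its tieBreakOrder helper, and differs in detail):

```java
// Sketch of how a tree bin might order two keys whose hash codes collide.
public class TreeOrder {
    @SuppressWarnings("unchecked")
    static int order(Object a, int hashA, Object b, int hashB) {
        if (hashA != hashB)
            return Integer.compare(hashA, hashB);          // primary: hash code
        if (a instanceof Comparable && a.getClass() == b.getClass()) {
            int c = ((Comparable<Object>) a).compareTo(b); // secondary: compareTo
            if (c != 0)
                return c;
        }
        // last resort: break the tie with identity hash codes
        return Integer.compare(System.identityHashCode(a), System.identityHashCode(b));
    }

    public static void main(String[] args) {
        System.out.println(order("apple", 1, "banana", 1)); // negative: "apple" < "banana"
        System.out.println(order("x", 1, "y", 2));          // negative: 1 < 2
    }
}
```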

When entries are removed from the map, the number of entries in the bucket might shrink such that this tree structure is no longer necessary. That's what UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops below six, we might as well go back to using a linked list.

Finally, there is MIN_TREEIFY_CAPACITY = 64.

When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small hash map, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger hash map, with more buckets that are less full. This constant basically says not to start making buckets into trees if our hash map is very small - it should resize to be larger first instead.


To answer your question about the performance gain, these optimisations were added to improve the worst case. I'm only speculating but you would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.


To put it more simply (as simply as I can) + some more details.

These properties depend on a lot of internal things that would be very cool to understand - before moving to them directly.

TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets exceeds MIN_TREEIFY_CAPACITY), it is transformed into a balanced red/black tree node. Why? Because of search speed. Think about it in a different way:

it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
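To see where the 32-step figure comes from: the depth of a perfectly balanced binary tree with n entries is about log2(n), so even with Integer.MAX_VALUE entries a lookup needs only around 31 comparisons. A quick check:

```java
public class TreeDepth {
    public static void main(String[] args) {
        int n = Integer.MAX_VALUE;                        // 2^31 - 1 entries
        int depth = 32 - Integer.numberOfLeadingZeros(n); // floor(log2(n)) + 1
        System.out.println(depth);                        // 31
    }
}
```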

Some intro for the next topic. Why is the number of bins/buckets always a power of two? There are at least two reasons: a bit mask is faster than the modulo operation, and modulo on negative numbers is negative. And you can't put an Entry into a "negative" bucket:

 int arrayIndex = hashCode % buckets; // will be negative

 buckets[arrayIndex] = Entry; // obviously will fail

Instead, a nice trick is used in place of modulo:

 (n - 1) & hash // n is the number of bins, hash - is the hash function of the key

That is semantically the same as modulo operation. It will keep the lower bits. This has an interesting consequence when you do:

Map<String, String> map = new HashMap<>();

In the case above (the default capacity is 16 buckets), the decision of where an entry goes is based only on the last 4 bits of your hashcode.
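For power-of-two capacities, (n - 1) & hash really does behave like a non-negative modulo, even for negative hash codes. A small demonstration:

```java
public class IndexTrick {
    public static void main(String[] args) {
        int n = 16; // default HashMap capacity, a power of two
        for (int h : new int[] { 123456, -123456, 7, -7 }) {
            int byMask = (n - 1) & h;         // what HashMap actually does
            int byMod  = Math.floorMod(h, n); // non-negative modulo for comparison
            System.out.println(h + " -> mask: " + byMask + ", floorMod: " + byMod);
        }
    }
}
```

Note that a plain `h % n` would give -9 for h = -7, while the mask gives the usable bucket index 9.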

This is where multiplying the buckets comes into play. Under certain conditions (would take a lot of time to explain in exact details), buckets are doubled in size. Why? When buckets are doubled in size, there is one more bit coming into play.

So you have 16 buckets - last 4 bits of the hashcode decide where an entry goes. You double the buckets: 32 buckets - 5 last bits decide where entry will go.
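That extra bit can be seen with a concrete hash value: a key whose hash has its 5th bit set sits in bucket 5 with 16 buckets, but moves to bucket 21 after the doubling:

```java
public class Doubling {
    public static void main(String[] args) {
        int hash = 0b10101;                  // 21
        System.out.println((16 - 1) & hash); // 16 buckets: last 4 bits -> bucket 5
        System.out.println((32 - 1) & hash); // 32 buckets: last 5 bits -> bucket 21
    }
}
```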

As such, this process is called re-hashing. This can be slow. That is why (for those who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...

Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from these bins to others (one more bit is added to the (n-1) & hash computation, so entries might move to other buckets), and a bin might drop to UNTREEIFY_THRESHOLD. At that point it does not pay off to keep the bin as a red-black tree; it is kept as a LinkedList instead, like

 entry.next.next....

MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a Tree.


TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passed a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.

From the implementation:

/*
 * Implementation notes.
 *
 * This map usually acts as a binned (bucketed) hash table, but
 * when bins get too large, they are transformed into bins of
 * TreeNodes, each structured similarly to those in
 * java.util.TreeMap. Most methods try to use normal bins, but
 * relay to TreeNode methods when applicable (simply by checking
 * instanceof a node).  Bins of TreeNodes may be traversed and
 * used like any others, but additionally support faster lookup
 * when overpopulated. However, since the vast majority of bins in
 * normal use are not overpopulated, checking for existence of
 * tree bins may be delayed in the course of table methods.
 */

You would need to visualize it: say there is a class Key with only the hashCode() function overridden, to always return the same value:

public class Key implements Comparable<Key>{

  private String name;

  public Key (String name){
    this.name = name;
  }

  @Override
  public int hashCode(){
    return 1;
  }

  public String keyName(){
    return this.name;
  }

  public int compareTo(Key key){
    // returns a +ve or -ve integer; e.g. order by name
    return this.name.compareTo(key.name);
  }

}

and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class. e.g.

Map<Key, String> map = new HashMap<>();

    Key key1 = new Key("key1");
    map.put(key1, "one");

    Key key2 = new Key("key2");
    map.put(key2, "two");
    Key key3 = new Key("key3");
    map.put(key3, "three");
    Key key4 = new Key("key4");
    map.put(key4, "four");
    Key key5 = new Key("key5");
    map.put(key5, "five");
    Key key6 = new Key("key6");
    map.put(key6, "six");
    Key key7 = new Key("key7");
    map.put(key7, "seven");
    Key key8 = new Key("key8");
    map.put(key8, "eight");

Since the hashcode is the same, all entries will land in the same bucket; let's call it bucket 1. Up to this point, all entries in bucket 1 are arranged in a linked-list structure, e.g. key1 -> key2 -> key3 -> ... and so on. But when I insert one more entry:

    Key key9 = new Key("key9");
    map.put(key9, "nine");

the threshold value of 8 will be reached and it will rearrange the bucket 1 entries into a tree (red-black) structure, replacing the old linked list (assuming the table capacity has reached MIN_TREEIFY_CAPACITY = 64; with a smaller table, HashMap resizes instead of treeifying). E.g.

                  key5
                 /    \
              key3     key7
             /   \    /    \
          key2  key4 key6   key8
          /                     \
       key1                     key9

Tree search is O(log n), faster than a linked list's O(n), and as n grows the difference becomes more significant.
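The gap between the two worst cases can be made concrete by counting the comparisons each structure needs for a bucket with n colliding entries:

```java
public class LookupCost {
    public static void main(String[] args) {
        for (int n : new int[] { 8, 64, 1_000_000 }) {
            int listWorst = n;                                    // scan every node
            int treeWorst = 32 - Integer.numberOfLeadingZeros(n); // ~ floor(log2(n)) + 1
            System.out.println(n + " entries: list=" + listWorst
                    + " comparisons, tree=" + treeWorst);
        }
    }
}
```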


The change in the HashMap implementation was added with JEP-180. The purpose was to:

Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class.

However, pure performance is not the only gain. It also prevents HashDoS attacks, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.


To understand the internal implementation of HashMap, you need to understand hashing. Hashing, in its simplest form, is a way of assigning a unique code to any variable/object after applying a formula/algorithm to its properties.

A true hash function must follow this rule:

"A hash function should return the same hash code each and every time the function is applied to the same or equal objects. In other words, two equal objects must consistently produce the same hash code."
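A minimal illustration of that rule (a hypothetical class, not from the JDK): hashCode is derived from exactly the fields that equals compares, so two equal objects always hash the same:

```java
import java.util.Objects;

public class PointKey {
    final int x, y;
    PointKey(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        return o instanceof PointKey && ((PointKey) o).x == x && ((PointKey) o).y == y;
    }
    @Override public int hashCode() {
        return Objects.hash(x, y); // depends only on the fields equals compares
    }

    public static void main(String[] args) {
        PointKey a = new PointKey(1, 2), b = new PointKey(1, 2);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
    }
}
```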

Reference URL: https://stackoverflow.com/questions/43911369/hashmap-java-8-implementation
