File: gc_hashcode.txt

package info (click to toggle)
sablevm 1.11.3-1
links: PTS
area: main
in suites: sarge
size: 6,980 kB
ctags: 7,922
sloc: ansic: 116,013; sh: 8,679; makefile: 489
file content (129 lines) | stat: -rw-r--r-- 5,842 bytes
parent folder | download | duplicates (2)
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * This source file is part of SableVM.                            *
 *                                                                 *
 * See the file "LICENSE" for the copyright information and for    *
 * the terms and conditions for copying, distribution and          *
 * modification of this source file.                               *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Q: How do the lockword bits denoting SVM_HASH_NONE, SVM_HASH_NOT_MOVED,
   SVM_HASH_MOVED work in conjunction with the garbage collector,
   specifically with env->vm->heap.hashcode_base?
   
A: First, you must note that the method Object.hashcode() can be called on
   any object (or array).  As the hashcode is an "int", a naive implementation
   of java.lang.Object could look as:

package java.lang;

public class Object
{
   private static nextHashcode = 0;

   private final int hashcode = nextHashcode++;
...

   public int hashcode()
   {
     return hashcode;
   }
}

Of course, this does not work; synchronization is missing (concurrent object
creation) and so on, but it illustrates the idea.

In this naive implementation, every object/array instance has a 32 bit field
for storing its hashcode.  If an application has many small objects on which
the hashcode() method is *never* called throughout its execution, then the
space overhead can be relatively big.

In VMs that use non-moving collectors, a simple strategy can be used to eliminate
the need to use an additional 32 bits per object/array instance: one can simply
use the object's address as hashcode.  This can have deficiencies in hashcode
value distribution, specially if the heap is restricted to a small subset of
memory, but it is usually good enough, and it can save much memory.

This does not work for moving collectors, as the address of objects/arrays change
throughout their lifetime (hashcode cannot change, of course).  A clever trick
(got it from IBM researchers) allows one to get away with a two-bit hashcode that
can be put in the header of an object, thus saving a word for every object
instance.  Note that I did not use this trick for arrays, as the overhead is
usually less important.  All arrays need at least: lockword, vtable pointer,
and length.  Adding hashcode, even to arrays of 0 (!!!) elements is at most 33%
overhead (instead of 50%), and is very likely to be less (there aren't too many
situations where arrays of 0 elements are useful).

The two bits indicate one of three states:

SVM_HASH_NONE:      hashcode() was never called on this object.  The object has
                     no hashcode.
SVM_HASH_NOT_MOVED: hashcode(0 has been called, and GC has not happened since the
                     first call.  The object has a hashcode that is not explicitly
                     stored; it has to be computed.  See below for details.
SVM_HASH_MOVED:     hashcode() has been called, and GC has happened since, so the
                     hashcode is explicitly saved at the end of the object.

It works like this:
1- All arrays are assigned a hashcode at allocation time.  The hashcode is saved in
    the array header.
2- Objects are allocated, and 2 bits of the header are reserved for hashcode.  These
    bits are initialized to SVM_HASH_NONE.  An object has no real hashcode until
    hashcode() is called on it.  Experiments show that many objects do not ever
    need a hashcode, so this is good.
3- When Object.hashcode() is called on an instance, the following algorithm is applied:

    if instance is array
      return stored hashcode

    // instance is an object

    else if instance.hashcode_state == SVM_HASH_NONE
      instance.hashcode_state = SVM_HASH_NOT_MOVED
      return compute_hashcode(instance)

    else if instance.hashcode_state == SVM_HASH_NOT_MOVED
      return compute_hashcode(instance)

    else // instance.hashcode_state == SVM_HASH_MOVED
      return stored hashcode (at end of object)

    as for compute_hashcode, it looks like this:

    int compute_hashcode(instance)
    {
       int offset = instance.address - start_of_current_heap;
       return offset + hashcode_base_value;
    }

     hashcode_base_value is a constant *between* GCs.  It is a virtual
     base address of a hypothetical contiguous heap.  So, on VM startup,
     this value is initialized to 0.  After first GC, it is incremented
     by the size of the discarded "from" space (copying collector).  Same
     after every GC.  So, it looks like:

     |---HEAP-1---|---HEAP-2---|---HEAP-3---|---
     ^            ^            ^            ^
     |            |            |            |
     |            |            |            \
     |            |            \             - Value after 3rd GC == size of HEAP-1 + HEAP-2 + HEAP-3
     |            \             - Value after 2nd GC == size of HEAP-1 + HEAP-2
     \             - Value after 1st GC == size of HEAP-1
      - Initial value (0)

     This approach has several advantages:  It makes hashcodes less volatile across
     executions of the same application (even if malloc() returns different
     addresses for the heap).  It also ensures a better distribution of hashcode values,
     as they are given increasing values until the 32-bit values space is exhausted
     at which point modulo 32-bits applies.

4- When GC happens, a little additional care has to be given to objects when
    their hashcode state is SVM_HASH_NOT_MOVED or SVM_HASH_MOVED.

    if object.hashcode_state == SVM_HASH_NOT_MOVED
      we must compute hashcode one last time and store it at the
      end of the object in the TO space.  We also set the hashcode
      state of the TO copy to SVM_HASH_MOVED.

    if object.hashcode_state == SVM_HASH_MOVED
      we must copy a longer object.