merge improvements from jac to apoa1.
several changes, better output.
new task-based parallel bonds, several fixes to potentials.
tweak tolerance.
updated distribution information files.
added double-precision test case.
several fixes to double-precision vectorization.
use multiple of warp size.
fix CUDA to run with the correct number of threads.
fix typo.
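
note on "use multiple of warp size": a minimal sketch of the idea, not the code from this change. It assumes a warp size of 32 (true of current NVIDIA GPUs; real code could read `warpSize` from `cudaDeviceProp`), and the names `kWarpSize` and `round_up_to_warp` are hypothetical.

```cpp
#include <cstddef>

// Assumption: 32 threads per warp. This is hard-coded for illustration only.
constexpr std::size_t kWarpSize = 32;

// Hypothetical helper: round a requested thread count up to the next
// multiple of the warp size, so no warp is launched partially filled.
constexpr std::size_t round_up_to_warp(std::size_t threads) {
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Compile-time checks of the rounding behaviour.
static_assert(round_up_to_warp(0) == 0, "zero stays zero");
static_assert(round_up_to_warp(1) == 32, "partial warp rounds up");
static_assert(round_up_to_warp(32) == 32, "exact multiple is unchanged");
static_assert(round_up_to_warp(100) == 128, "100 rounds up to 4 warps");
```

with this rounding, a request for 100 threads per block would launch 128, and the kernel would need a bounds check so the extra 28 threads do no work.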