several fixes to double-precision vectorization.
use multiple of warp size.
Fixed CUDA to run with correct number of threads
fix typo.
Working copy of CUDA code; METIS no longer required
fix minor bug.
fix a number of things, especially energy reporting.
trying to undo the damage from previous botched commit.
fix double-precision AVX code, not tested yet.
add type-agnostic fabs.