Fixed CUDA to run with correct number of threads
fix typo.
Working copy of CUDA code; METIS no longer required
fix minor bug.
fix a number of things, especially energy reporting.
trying to undo the damage from previous botched commit.
fix double-precision AVX code, not tested yet.
add type-agnostic fabs.
add switched LJ interactions.
judicious use of doubles when computing the residuals.