DOMDEC-GPU is a hybrid method that uses the GPU for non-bonded calculations (Coulomb and van der Waals) and CPU cores for bonded terms and integration. As noted in domdec.doc, each MPI task requires its own GPU; the code was not written to share a GPU with other processes.
Within-node parallelism uses OpenMP threads. If that is working properly, an 8-core machine should show a load average of about 8.0, with a single process at about 800% CPU in top. If that is not the case, try setting the environment variable OMP_NUM_THREADS to 8 before running CHARMM.
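As a rough illustration of the OMP_NUM_THREADS suggestion, here is a minimal Python sketch that prepares an environment with the thread count set and computes the CPU percentage one would expect to see in top. The helper names openmp_env and expected_top_percent are made up for this example and are not part of CHARMM; the CHARMM launch command itself is deliberately left out because it depends on your installation.

```python
import os


def openmp_env(threads=8, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set.

    Hypothetical helper for illustration only; not part of CHARMM.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def expected_top_percent(threads):
    """CPU usage top should report for one process keeping `threads` cores busy."""
    return 100 * threads


if __name__ == "__main__":
    env = openmp_env(8)
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    print("expected CPU in top: about", expected_top_percent(8), "%")
    # Pass `env` to subprocess.run(..., env=env) together with your own
    # CHARMM launch command; that command is site-specific and not shown here.
```

Setting the variable in the shell before launching CHARMM has the same effect; the point is only that the OpenMP runtime reads it at startup, so it must be set before the process begins.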
On a single machine, a run with the GPU ought to be 2-3 times faster than a CPU-only run with 8 MPI tasks.