You should see an almost equal load on all nodes. Certain tasks, however, are not parallelized (see parallel.doc), and for these you will see only one node in use. We typically run simulations (and sometimes long minimizations) in parallel, and perform setup and analysis on a single CPU.
Using more than one CPU carries a penalty (communications overhead), so you have to determine for yourself the optimum number of CPUs for your system. Remember that when benchmarking parallel jobs it is the elapsed (wall-clock) time, not the CPU time, that matters.
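
One way to compare benchmark runs is to turn the elapsed times into speedup and parallel efficiency. The following is a minimal sketch in Python and is not part of the program itself. The function names, the 0.7 efficiency cutoff, and the timings in the example are all hypothetical; substitute the elapsed times you measure for the same job on different CPU counts.

```python
def scaling_table(elapsed):
    """Return (cpus, elapsed, speedup, efficiency) rows.

    elapsed maps CPU count -> elapsed (wall-clock) seconds for the same job.
    The run with the fewest CPUs is used as the baseline.
    """
    base_n = min(elapsed)
    base_t = elapsed[base_n]
    rows = []
    for n in sorted(elapsed):
        speedup = base_t / elapsed[n]          # how much faster than baseline
        efficiency = speedup * base_n / n      # 1.0 means ideal scaling
        rows.append((n, elapsed[n], speedup, efficiency))
    return rows


def pick_cpu_count(elapsed, min_efficiency=0.7):
    """Largest CPU count whose efficiency is still >= min_efficiency.

    The cutoff is an arbitrary illustration, not a recommended value.
    """
    acceptable = [row for row in scaling_table(elapsed)
                  if row[3] >= min_efficiency]
    return max(row[0] for row in acceptable)


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, for illustration only.
    timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 180.0, 16: 150.0}
    for n, t, s, e in scaling_table(timings):
        print(f"{n:4d} CPUs  {t:8.1f} s  speedup {s:5.2f}  efficiency {e:4.2f}")
    print("suggested CPU count:", pick_cpu_count(timings))
```

Efficiency that falls well below 1.0 as CPUs are added indicates that communications overhead is starting to dominate. Where to draw the line depends on how you weigh turnaround time against total CPU usage.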