Re: serial and mpi versions of c33b2 installed on intel xeon
lennart #17446 03/18/08 12:00 PM
Thanks for your tips.

I remember I tested cbenchtest/mbcodyn.inp before, and the mpirun failed at that time.
I tried the serial run and the mpirun of the same test case again,
but the mpirun failed even with the command line you suggested.
The error messages look like this.

p0_18599: (0.000000) Specified multiple processes sharing memory without configuring for shared memory.
p0_18599: (0.000000) Check the users manual for more information.
p0_18599: p4_error: read_procgroup: 0

I also tested with my own script, but the mpirun was not successful either.

Judging from the error messages above, some step (or steps) seems to be missing from my setup.
Please let me know what is missing or wrong.

Thanks,
Seongeun

Re: serial and mpi versions of c33b2 installed on intel xeon
seongeun #17447 03/18/08 12:04 PM
I gave a couple of very specific suggestions in my answer to your previous post. Read them, and try them. Do not just repeat a non-working way of doing things on yet another input file. The performance benchmark obviously is only interesting once you can actually run a parallel job. It is very important that you make sure that you can run some parallel MPICH tests, without involving CHARMM.

I did not suggest an mpirun command line. Note that my suggestion assumed that you have four cores in a shared memory machine; this may of course be wrong, but you have not told us what your platform is.

It is always better to post the actual commands used, and the results they give, than your own interpretations.

The MPI error messages indicate that you do not have MPICH installed for shared memory. See your MPI documentation; you may need something like "--with-comm=shared --with-device=ch_p4" with the configure step in the MPICH installation.
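For example, a minimal rebuild sketch (the source directory and the install prefix below are only placeholders; adjust them to your system):

cd ~/mpich-1.2.7p1
./configure --with-device=ch_p4 --with-comm=shared --prefix=/usr/local/mpich-shared
make
make install
# then rebuild CHARMM against the MPICH installed under /usr/local/mpich-shared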


Lennart Nilsson
Karolinska Institutet
Stockholm, Sweden
Re: serial and mpi versions of c33b2 installed on intel xeon
lennart #17448 03/18/08 03:26 PM
Thanks a lot, and I feel like you're yelling at me through a nice pair of speakers.
I'm sorry that 'mpirun' was a poor choice of words on my part.
I'm using just a single node with 4 CPUs (2 dual-core processors) of an Intel Xeon cluster.

Actually, I did exactly as you suggested, using the command line
./charmm -p4wd . -p4pg hostfile < charmm.inp > charmm.out
after copying the CHARMM executable built with MPICH to the current directory.

I tried the same testcase again with the following steps.

MPICH-1.2.7p1 was installed with the additional options
--with-comm=shared and --with-device=ch_p4.
Then c33b2 was built from fresh source files against the new MPICH.
cbenchtest/mbcodyn.inp was copied to charmm.inp and tested with the command line below.
./charmm -p4wd . -p4pg hostfile < charmm.inp > charmm.out

Without executing lamboot, and with the hostfile contents below, CHARMM failed.
(Here 'hostname' is the output of running hostname at the shell prompt, of course.)
hostname 4
The number 4 is meant to use 4 CPUs, I guess, right?
The error messages look like this.
p0_371: (0.000000) Specified multiple processes sharing memory without configuring for shared memory.
p0_371: (0.000000) Check the users manual for more information.
p0_371: p4_error: read_procgroup: 0

With the hostfile contents below, CHARMM runs, but without a speed-up proportional to
the number of CPUs specified.
hostname cpu=4
Although the banner and the end of the output look like parallel (mpirun) output,
the dynamics results are exactly the same as those from the serial CHARMM.

Additionally, the same thing happens with or without lamboot.
Also, as far as I can remember, I have tried all the previous suggestions from Rick and you.

Please let me know some more tips to try.

Thanks,
Seongeun

Last edited by seongeun; 03/18/08 03:28 PM.
Re: serial and mpi versions of c33b2 installed on intel xeon
seongeun #17449 03/18/08 03:49 PM
"localhost 3" is what I have in my hostfile for a 4-core job; try exactly that. The 3 specifies the number processes in addition to the one you start explicitly.

For now you should forget about CHARMM and verify that you can actually run the MPICH test cases in parallel, with or without mpirun - the important thing is to verify your MPICH installation, and to find out how to execute a parallel job on your machine.


Lennart Nilsson
Karolinska Institutet
Stockholm, Sweden
Re: serial and mpi versions of c33b2 installed on intel xeon
lennart #17450 03/18/08 04:46 PM
rmv
Note that MPICH and LAM/MPI are different implementations of MPI; you should not be using 'lamboot' with an MPICH installation.

Find the test programs provided with your installed MPI version and verify that you can compile and run the simple Fortran test case.
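For example, a minimal check might look like the sketch below; the examples/basic directory and the cpi.c program are what a typical MPICH-1 source tree provides, so adjust the paths to your own installation (if a Fortran example such as fpi.f is present, mpif77 works the same way):

$ cd ~/mpich-1.2.7p1/examples/basic
$ ~/mpich-1.2.7p1/bin/mpicc -o cpi cpi.c
$ ~/mpich-1.2.7p1/bin/mpirun -np 4 ./cpi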

Re: serial and mpi versions of c33b2 installed on intel xeon
lennart #17451 03/19/08 04:41 AM
Thanks a lot for your tips again.
And yes, I did try 'hostname 3' in the hostfile, but it didn't work.

To run the MPICH tests in ~/mpich-1.2.7p1/examples/test, I did the following.
1) ./configure -mpichpath=~/mpich-1.2.7p1/bin
2) make TESTARGS=-small testing > testing.out
3) ./runtest in each of the directories coll, context, env, io, pt2pt, profile, and topol.

There were differences between the present test output and the reference output for io, pt2pt, and topol.

I'm trying to figure this out myself, and I'm not sure whether posting all of this here is appropriate,
but please let me know how to deal with the problem.

Thanks,
Seongeun


The io/contest.diff file looks like this.

Differences in atomicity.out
2c2
< **io No such file or directory No Errors
---
> No Errors
Differences in excl.out
2c2
< **io File exists No Errors
---
> No Errors
Differences in shared_fp.out
2c2
< **io No such file or directory**io No such file or directory**io
No such file or directory No Errors
---
> No Errors
Differences in error.out
2,3c2
< Unexpected error message Invalid argument
< Found 1 errors
---
> No Errors

In topol, the topol.diff file looks like this.

Differences in cart1f.out
2,7d1
< Timeout in waiting for processes to exit, 1 left. This may be due to a defective
< rsh program (Some versions of Kerberos rsh have been observed to have this
< problem).
< This is not a problem with P4 or MPICH but a problem with the operating
< environment. For many applications, this problem will only slow down
< process termination.

Finally in pt2pt/pt2pt.diff,

Differences in longmsgs.out
2,8c2
< p1_32142: (0.328125) xx_shmalloc: returning NULL; requested 4194352 bytes
< p1_32142: (0.328125) p4_shmalloc returning NULL; request = 4194352 bytes
< You can increase the amount of memory by setting the environment variable
< P4_GLOBMEMSIZE (in bytes); the current size is 4194304
< p1_32142: p4_error: alloc_p4_msg failed: 0
< p0_32137: p4_error: interrupt SIGx: 13
< p0_32137: (2.359375) net_send: could not write to fd=4, errno = 32
---
> No Errors
Differences in sendmany.out
2,12c2,16
< rm_l_4_4640: (0.050781) net_send: could not write to fd=5, errno = 32
< rm_l_3_4623: (0.074219) net_send: could not write to fd=5, errno = 32
< rm_4627: (0.050781) net_send: could not write to fd=4, errno = 32
< rm_4610: (0.074219) net_send: could not write to fd=4, errno = 32
< rm_4593: (0.101562) net_send: could not write to fd=4, errno = 32
< rm_l_2_4606: (0.101562) net_send: could not write to fd=5, errno = 32
< rm_4573: (0.125000) net_send: could not write to fd=4, errno = 32
< rm_l_1_4589: (0.125000) net_send: could not write to fd=5, errno = 32
< rm_4644: (0.027344) net_send: could not write to fd=4, errno = 32
< rm_l_5_4657: (0.027344) net_send: could not write to fd=5, errno = 32
< p0_4568: (14.160156) net_send: could not write to fd=4, errno = 32
---
> length = 1 ints
> length = 2 ints
> length = 4 ints
> length = 8 ints
> length = 16 ints
> length = 32 ints
> length = 64 ints
> length = 128 ints
> length = 256 ints
> length = 512 ints
> length = 1024 ints
> length = 2048 ints
> length = 4096 ints
> length = 8192 ints
> length = 16384 ints
Differences in structf.out
2,7c2
< 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
< [0] Aborting program !
< [0] Aborting program!
< p0_9285: p4_error: : 972
< Interrupt
< Interrupt
---
> Received hello 5.1234
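One hint from the longmsgs.out failure above: the p4 shared-memory pool ran out of space, and the message itself points to the P4_GLOBMEMSIZE environment variable. A possible workaround (the value below is only an example) is to enlarge it before launching:

$ export P4_GLOBMEMSIZE=33554432    # 32 MB instead of the 4 MB reported in the error
$ ./charmm -p4wd . -p4pg hostfile < charmm.inp > charmm.out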

Re: serial and mpi versions of c33b2 installed on intel xeon
seongeun #17452 03/19/08 12:23 PM
Most likely you have 2 or more installations of MPI libraries on your system. Errors like this come up regularly on the MPI-related mailing lists I read. Please try to find out which MPI libraries are installed and where they are installed, and then decide whether you want to keep the one(s) installed with the system or the one(s) that you have compiled.

It's possible to have several MPI libraries installed at once, or even several versions of the same library, but you should be prepared to deal with this situation - mainly this involves setting up the right environment variables (PATH, LD_LIBRARY_PATH, etc.).
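For example, a quick way to see which MPI the shell and the executable actually pick up (a rough sketch; adjust the paths to your system):

$ which mpicc mpif77 mpirun
$ echo $PATH
$ echo $LD_LIBRARY_PATH
$ ldd ./charmm | grep -i mpi    # only informative if charmm was linked dynamically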

Re: serial and mpi versions of c33b2 installed on intel xeon
bogdan #17453 03/20/08 11:06 AM
Thanks for your reply.

I checked PATH and LD_LIBRARY_PATH, and there is only one path to MPICH.
The mpich version is mpich-1.2.7p1.

$ echo $PATH
/usr/local/mpich-ic9/bin:/usr/local/amber9/exe:/usr/local/pbs/bin:/opt/intel/fce/9.1.037/bin:/opt/intel/idbe/9.1.043/bin:/opt/intel/cce/9.1.043/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/seongeun/bin

$ echo $LD_LIBRARY_PATH
/opt/intel/mkl/8.1/lib/em64t:/opt/intel/fce/9.1.037/lib:/opt/intel/cce/9.1.043/lib

Then I compiled c33b2 from scratch with fresh source files.
But the run with the command line below still failed.
./charmm -p4wd . -p4pg hostfile < charmm.inp > charmm.out

The hostfile format is like this.
hostname 3

Please let me know some more tips.
Thanks,

Seongeun

Re: serial and mpi versions of c33b2 installed on intel xeon
seongeun #17454 03/20/08 11:49 AM
1/ So you can run the MPICH test cases with a similar command line? You have to get the MPICH test cases to work before you move to CHARMM.
2/ It is not easy to help when you don't say in what way your test fails. Is it a CHARMM problem, or are we still dealing with MPI problems?
3/ The line in the hostfile is literally "localhost 3" - the word "localhost" is not just something that I made up. It may of course work with a real hostname as well, but that is less general (you would then need different files for different hosts).


Lennart Nilsson
Karolinska Institutet
Stockholm, Sweden
Re: serial and mpi versions of c33b2 installed on intel xeon
seongeun #17455 03/21/08 02:09 AM
rmv
The file format for a ch_p4 device process group file to run 4 processes is more like

hostname 0 /full/path/to/charmm
hostname 1 /full/path/to/charmm
hostname 1 /full/path/to/charmm
hostname 1 /full/path/to/charmm

where 'hostname' needs to be a valid internet host name (including localhost); you also need to have 'rsh' set up properly, or else set up 'ssh' as described in the MPICH documentation.
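If rsh is not an option, a common setup is passwordless ssh plus telling the ch_p4 startup to use ssh; the environment variable name below (P4_RSHCOMMAND) is the one I recall from the MPICH-1 documentation, so please verify it there:

$ ssh-keygen -t rsa                                  # accept the defaults
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # allow passwordless login to this host
$ export P4_RSHCOMMAND=ssh                           # use ssh instead of rsh for process startup
$ ./charmm -p4wd . -p4pg hostfile < charmm.inp > charmm.out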
