Which MPI library are you using?
On 26 Apr 2007, at 09:41, Niall Moran wrote:
Hi,
I have built and installed ALPS 1.3b3 and am trying to get the MPI version working. I am using a simple input file (shown below) and using the dirloop_sse_mpi tool.
When I run this on a single processor with dirloop_sse, each task takes around 62 seconds, and the tasks run sequentially on a single CPU.
I then tried running this in parallel on 8 processors. The system uses PBS for batch processing. I submitted an interactive job requesting a walltime of 20 minutes and 8 CPUs (4 nodes with 2 processors per node):
qsub -l walltime=00:20:00 -l nodes=4:ppn=2 -I
I then started the simulation with:
mpiexec -n 8 dirloop_sse_mpi job.in.xml
All the tasks seem to finish correctly; however, the processes remain until the walltime is up. The output is listed below.
I have tried building ALPS with both gcc 3.3.3 and the PathScale 2.5 compiler, along with mpich-1.2.7. The same thing happens with each.
Does anyone have any ideas what is going on?
Many Thanks,
Niall.
Parameter version of input file.
MODEL_LIBRARY="/alps/lib/xml/models.xml"
MODEL="spin";
LATICE_LIBRARY="/alps/lib/xml/lattices.xml"
LATTICE="chain lattice"
SWEEPS=1000;
THERMALIZATION=100;
WORK_FACTOR=L*SWEEPS;
{ L=10; T=0.1; }
{ L=15; T=0.15; }
{ L=20; T=0.2; }
{ L=25; T=0.05; }
{ L=30; T=0.05; }
{ L=35; T=0.3; }
{ L=40; T=0.05; }
{ L=45; T=0.05; }
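For reference, the WORK_FACTOR=L*SWEEPS expression in the parameter file can be evaluated per task by hand. The following is a minimal Python sketch using plain arithmetic on the values listed above; it is not the ALPS expression evaluator, and the `work_factor` helper is a hypothetical name introduced only for illustration.

```python
# Task parameters copied from the parameter file above.
SWEEPS = 1000
tasks = [
    {"L": 10, "T": 0.1},
    {"L": 15, "T": 0.15},
    {"L": 20, "T": 0.2},
    {"L": 25, "T": 0.05},
    {"L": 30, "T": 0.05},
    {"L": 35, "T": 0.3},
    {"L": 40, "T": 0.05},
    {"L": 45, "T": 0.05},
]


def work_factor(task, sweeps=SWEEPS):
    """Evaluate WORK_FACTOR = L*SWEEPS for one task (hypothetical helper)."""
    return task["L"] * sweeps


factors = [work_factor(t) for t in tasks]
total = sum(factors)

# Print each task with its work factor and its share of the summed factors.
for t, f in zip(tasks, factors):
    print(f"L={t['L']:2d} T={t['T']:<4} WORK_FACTOR={f:6d} share={f / total:.1%}")
```

The largest task (L=45) has 4.5 times the work factor of the smallest (L=10), so the eight tasks would not be expected to finish at the same moment even with one task per process.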
Output of simulation in parallel.
Quantum Monte simulations using the generalized directed loop algorithm v. 1.1
  available from http://alps.comp-phys.org/
  copyright (c) 2001-2005 by Fabien Alet alet@comp-phys.org,
  Synge Todo wistaria@comp-phys.org,
  and Matthias Troyer troyer@comp-phys.org
  see F. Alet, S. Wessel and M. Troyer, Phys. Rev. E 71, 036706 (2005) for details.

using the ALPS parallelizing scheduler
  copyright (c) 1994-2006 by Matthias Troyer troyer@comp-phys.org.
  see Lecture Notes in Computer Science, Vol. 1505, p. 191 (1998).

based on the ALPS libraries version 1.3b3
  available from http://alps.comp-phys.org/
  copyright (c) 1994-2006 by the ALPS collaboration.
  Consult the web page for license details.
  For details see the publication:
  F. Alet et al., J. Phys. Soc. Jpn. Suppl., Vol. 74, 30 (2005).
parsing task files ...
Created a new simulation: 2 on 1 processes
Created a new simulation: 3 on 1 processes
Created a new simulation: 4 on 1 processes
Created a new simulation: 5 on 1 processes
Created a new simulation: 6 on 1 processes
Created a new simulation: 7 on 1 processes
Created a new simulation: 8 on 1 processes
Creating a new simulation: 1 on 1 processes
Created run 1 locally
All processes have been assigned
Checking if Simulation 1 is finished: not yet, next check in 60 seconds ( 0% done).
Created run 1 locally
Checking if Simulation 2 is finished: not yet, next check in 60 seconds ( 0% done).
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Checking if Simulation 3 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 4 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 5 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 6 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 7 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 8 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 2 is finished: Finished
Halted Simulation 2
Checking if Simulation 1 is finished: Finished
Halted Simulation 1
Assigning 2 processes
No work for master process
No work for 1.
All processes have been assigned
Checking if Simulation 3 is finished: Finished
Halted Simulation 3
Checking if Simulation 4 is finished: Finished
Halted Simulation 4
Checking if Simulation 5 is finished: Finished
Halted Simulation 5
Checking if Simulation 6 is finished: Finished
Halted Simulation 6
Assigning 5 processes
No work for 5.
All processes have been assigned
Checking if Simulation 7 is finished: Finished
Halted Simulation 7
Checking if Simulation 8 is finished: Finished
Halted Simulation 8
Assigning 7 processes
All processes have been assigned
Checkpointing Simulation 1
Checkpointing Simulation 2
Checkpointing Simulation 3
Checkpointing Simulation 4
Checkpointing Simulation 5
Checkpointing Simulation 6
Checkpointing Simulation 7
Checkpointing Simulation 8
Finished with all tasks.
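The claim that every task finished can be checked mechanically against the scheduler messages above. The following is a minimal Python sketch, assuming the output has been saved with one message per line; the `summarize` helper and the abbreviated two-simulation sample are illustrative only, and the regular expressions simply mirror the message wording quoted in this post.

```python
import re

# Abbreviated sample of the scheduler messages quoted above, one per line.
log = """\
Created a new simulation: 2 on 1 processes
Creating a new simulation: 1 on 1 processes
Checking if Simulation 1 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 2 is finished: Finished
Halted Simulation 2
Checking if Simulation 1 is finished: Finished
Halted Simulation 1
Checkpointing Simulation 1
Checkpointing Simulation 2
Finished with all tasks.
"""


def summarize(text):
    """Return the sets of simulation ids that were created, finished, and halted."""
    created = set(map(int, re.findall(r"Creat(?:ed|ing) a new simulation: (\d+)", text)))
    finished = set(map(int, re.findall(r"Checking if Simulation (\d+) is finished: Finished", text)))
    halted = set(map(int, re.findall(r"Halted Simulation (\d+)", text)))
    return created, finished, halted


created, finished, halted = summarize(log)
unfinished = sorted(created - finished)

print("created:   ", sorted(created))
print("unfinished:", unfinished)
print("scheduler reported completion:", "Finished with all tasks." in log)
```

An empty `unfinished` list together with the final "Finished with all tasks." message would indicate that the scheduler itself considers the job complete, which points at process shutdown rather than at the simulations.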
WARNING : Hamiltonian has a sign problem...
p7_10096: p4_error: Timeout in establishing connection to remote process: 0
rm_l_7_10098: (365.230469) net_send: could not write to fd=5, errno = 32
p7_10096: (365.234375) net_send: could not write to fd=5, errno = 32
WARNING : Hamiltonian has a sign problem...
p5_12240: p4_error: Timeout in establishing connection to remote process: 0
rm_l_5_12242: (365.238281) net_send: could not write to fd=5, errno = 32
p0_29156: p4_error: net_recv read: probable EOF on socket: 1
WARNING : Hamiltonian has a sign problem...
p3_12614: p4_error: Timeout in establishing connection to remote process: 0
rm_l_3_12616: (365.238281) net_send: could not write to fd=5, errno = 32
p5_12240: (373.246094) net_send: could not write to fd=5, errno = 32
p0_29156: (374.425781) net_send: could not write to fd=4, errno = 32
p3_12614: (373.250000) net_send: could not write to fd=5, errno = 32
WARNING : Hamiltonian has a sign problem...
p1_29158: p4_error: net_recv read: probable EOF on socket: 1
p1_29158: (381.265625) net_send: could not write to fd=5, errno = 32
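The repeated "errno = 32" in the net_send messages can be decoded with the standard errno table. On Linux, 32 is EPIPE (broken pipe), meaning the process tried to write to a connection whose other end had already closed. The following is a minimal Python sketch; the `decode` helper and its regular expression are assumptions that mirror the message format quoted above, not part of MPICH.

```python
import errno
import os
import re

# Two lines copied from the error output above.
stderr_lines = [
    "p7_10096: (365.234375) net_send: could not write to fd=5, errno = 32",
    "p0_29156: (374.425781) net_send: could not write to fd=4, errno = 32",
]

# Process name, parenthesized number, file descriptor, errno value.
pattern = re.compile(r"^(\S+): \(([\d.]+)\) net_send: .* fd=(\d+), errno = (\d+)$")


def decode(line):
    """Split one net_send message and attach the symbolic errno name."""
    m = pattern.match(line)
    if m is None:
        return None
    proc = m.group(1)
    stamp = float(m.group(2))
    fd = int(m.group(3))
    code = int(m.group(4))
    name = errno.errorcode.get(code, "unknown")
    return proc, stamp, fd, code, name, os.strerror(code)


for line in stderr_lines:
    proc, stamp, fd, code, name, text = decode(line)
    print(f"{proc}: fd={fd} errno={code} ({name}: {text})")
```

Lines that do not match the net_send format, such as the p4_error messages, make `decode` return None and would need their own pattern.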