Hi,
I have built and installed ALPS 1.3b3 and am trying to get the MPI version working. I am using a simple input file (shown below) with the dirloop_sse_mpi tool.
When running this on a single processor with dirloop_sse, each task takes around 62 seconds, and the tasks run sequentially on a single CPU.
I then tried running this in parallel on 8 processors. The system uses PBS for batch processing. I submitted the job interactively, requesting a walltime of 20 minutes and 8 CPUs (4 nodes x 2 processors per node):
qsub -l walltime=00:20:00 -l nodes=4:ppn=2 -I
I then started the simulation with:
mpiexec -n 8 dirloop_sse_mpi job.in.xml
All the tasks seem to finish correctly; however, the processes remain until the walltime is up. The output is listed below.
I have tried building ALPS with both gcc 3.3.3 and the PathScale 2.5 compiler, in each case with mpich-1.2.7. The same thing happens with each.
Does anyone have any ideas what is going on?
Many Thanks,
Niall.
Parameter version of input file.
---------------------------------------------------------------------------
MODEL_LIBRARY="/alps/lib/xml/models.xml"
MODEL="spin";
LATICE_LIBRARY="/alps/lib/xml/lattices.xml"
LATTICE="chain lattice"
SWEEPS=1000;
THERMALIZATION=100;
WORK_FACTOR=L*SWEEPS;
{ L=10; T=0.1; }
{ L=15; T=0.15; }
{ L=20; T=0.2; }
{ L=25; T=0.05; }
{ L=30; T=0.05; }
{ L=35; T=0.3; }
{ L=40; T=0.05; }
{ L=45; T=0.05; }
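For reference, the relative cost of the eight tasks implied by WORK_FACTOR=L*SWEEPS can be tabulated with a short Python sketch (my own illustration, not part of ALPS; the L and T values are copied from the parameter file above):

```python
# Sketch: relative work implied by WORK_FACTOR = L * SWEEPS for the eight
# tasks in the parameter file above (values copied from that file).
SWEEPS = 1000
tasks = [(10, 0.10), (15, 0.15), (20, 0.20), (25, 0.05),
         (30, 0.05), (35, 0.30), (40, 0.05), (45, 0.05)]

work_factors = [(L, T, L * SWEEPS) for L, T in tasks]
for L, T, w in work_factors:
    print(f"L={L:2d}  T={T:4.2f}  WORK_FACTOR={w}")
```

The largest chain (L=45) carries 4.5x the work factor of the smallest (L=10), so with eight tasks on eight processes the small tasks finish well before the large ones.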
Output of simulation in parallel.
--------------------------------------------------------------------------
Quantum Monte Carlo simulations using the generalized directed loop algorithm v. 1.1 available from http://alps.comp-phys.org/ copyright (c) 2001-2005 by Fabien Alet alet@comp-phys.org, Synge Todo wistaria@comp-phys.org, and Matthias Troyer troyer@comp-phys.org see F. Alet, S. Wessel and M. Troyer, Phys. Rev. E 71, 036706 (2005) for details.
using the ALPS parallelizing scheduler copyright (c) 1994-2006 by Matthias Troyer troyer@comp-phys.org. see Lecture Notes in Computer Science, Vol. 1505, p. 191 (1998).
based on the ALPS libraries version 1.3b3 available from http://alps.comp-phys.org/ copyright (c) 1994-2006 by the ALPS collaboration. Consult the web page for license details. For details see the publication: F. Alet et al., J. Phys. Soc. Jpn. Suppl., Vol. 74, 30 (2005).
parsing task files ...
Created a new simulation: 2 on 1 processes
Created a new simulation: 3 on 1 processes
Created a new simulation: 4 on 1 processes
Created a new simulation: 5 on 1 processes
Created a new simulation: 6 on 1 processes
Created a new simulation: 7 on 1 processes
Created a new simulation: 8 on 1 processes
Creating a new simulation: 1 on 1 processes
Created run 1 locally
All processes have been assigned
Checking if Simulation 1 is finished: not yet, next check in 60 seconds ( 0% done).
Created run 1 locally
Checking if Simulation 2 is finished: not yet, next check in 60 seconds ( 0% done).
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Created run 1 locally
Checking if Simulation 3 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 4 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 5 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 6 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 7 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 8 is finished: not yet, next check in 60 seconds ( 0% done).
Checking if Simulation 2 is finished: Finished
Halted Simulation 2
Checking if Simulation 1 is finished: Finished
Halted Simulation 1
Assigning 2 processes
No work for master process
No work for 1.
All processes have been assigned
Checking if Simulation 3 is finished: Finished
Halted Simulation 3
Checking if Simulation 4 is finished: Finished
Halted Simulation 4
Checking if Simulation 5 is finished: Finished
Halted Simulation 5
Checking if Simulation 6 is finished: Finished
Halted Simulation 6
Assigning 5 processes
No work for 5.
All processes have been assigned
Checking if Simulation 7 is finished: Finished
Halted Simulation 7
Checking if Simulation 8 is finished: Finished
Halted Simulation 8
Assigning 7 processes
All processes have been assigned
Checkpointing Simulation 1
Checkpointing Simulation 2
Checkpointing Simulation 3
Checkpointing Simulation 4
Checkpointing Simulation 5
Checkpointing Simulation 6
Checkpointing Simulation 7
Checkpointing Simulation 8
Finished with all tasks.
WARNING : Hamiltonian has a sign problem...
p7_10096: p4_error: Timeout in establishing connection to remote process: 0
rm_l_7_10098: (365.230469) net_send: could not write to fd=5, errno = 32
p7_10096: (365.234375) net_send: could not write to fd=5, errno = 32
WARNING : Hamiltonian has a sign problem...
p5_12240: p4_error: Timeout in establishing connection to remote process: 0
rm_l_5_12242: (365.238281) net_send: could not write to fd=5, errno = 32
p0_29156: p4_error: net_recv read: probable EOF on socket: 1
WARNING : Hamiltonian has a sign problem...
p3_12614: p4_error: Timeout in establishing connection to remote process: 0
rm_l_3_12616: (365.238281) net_send: could not write to fd=5, errno = 32
p5_12240: (373.246094) net_send: could not write to fd=5, errno = 32
p0_29156: (374.425781) net_send: could not write to fd=4, errno = 32
p3_12614: (373.250000) net_send: could not write to fd=5, errno = 32
WARNING : Hamiltonian has a sign problem...
p1_29158: p4_error: net_recv read: probable EOF on socket: 1
p1_29158: (381.265625) net_send: could not write to fd=5, errno = 32
Which MPI library are you using?
On 26 Apr 2007, at 09:41, Niall Moran wrote:
It's MPICH 1.2.7-1, from http://www-unix.mcs.anl.gov/mpi/mpich1/download.html.
Matthias Troyer wrote:
Which MPI library are you using?
I just remembered that other users reported similar problems with a combination of gcc-3.3.3 and mpich. Using LAM MPI and gcc-4 made the problem go away then.
Matthias
On 26 Apr 2007, at 09:41, Niall Moran wrote:
Thanks for your reply. We don't have LAM MPI because it does not work well with PBS; however, we do have MPICH2, as well as gcc 4.1.2, PathScale 2.5 and 3.0, and Portland 6.2.2, so I will try different combinations of these.
Niall.
Matthias Troyer wrote:
I just remembered that other users reported similar problems with a combination of gcc-3.3.3 and mpich. Using LAM MPI and gcc-4 made the problem go away then.
Matthias
We've had no problems with LAM and PBS. Please let me know if one of the combinations works.
Matthias
On 26 Apr 2007, at 09:57, Niall Moran wrote:
Thanks for your reply. We don't have LAM MPI because it does not work well with PBS; however, we do have MPICH2, as well as gcc 4.1.2, PathScale 2.5 and 3.0, and Portland 6.2.2, so I will try different combinations of these.
Niall.
I have built a version with MPICH2 version 1.0.2p1 and the PathScale compiler version 2.5. The processes no longer hang after the tasks complete. However, the tasks seem to run a bit too fast, and I am not sure whether they are running properly. The parameter form of the input file I am using is:
MODEL_LIBRARY="/alps/lib/xml/models.xml"
MODEL="spin";
LATTICE_LIBRARY="/alps/lib/xml/lattices.xml"
LATTICE="chain lattice"
SWEEPS=1000;
THERMALIZATION=100;
WORK_FACTOR=L*SWEEPS;
{ L=10; T=0.1; }
{ L=15; T=0.15; }
{ L=20; T=0.2; }
{ L=25; T=0.05; }
{ L=30; T=0.05; }
{ L=35; T=0.3; }
{ L=40; T=0.05; }
{ L=45; T=0.05; }
I am using the dirloop_sse tool serially and dirloop_sse_mpi in parallel. When running serially, some of the tasks run in 1 or 2 seconds and some take 62 seconds; the full 8 tasks usually take around 4-5 minutes.
When run in parallel, however, all 8 tasks complete in under 25 seconds. Is there a way of verifying that the tasks are running correctly, and are there any standard/benchmark input files available to test with?
Many Thanks,
Niall.
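One generic way to verify a parallel run against a serial one is to compare each measured observable within the combined statistical error bars of the two runs — a standard Monte Carlo sanity check rather than an ALPS utility. A minimal sketch, with placeholder numbers that are not actual ALPS output:

```python
import math

def consistent(mean_a, err_a, mean_b, err_b, n_sigma=3.0):
    """True if two Monte Carlo estimates agree within n_sigma combined errors.

    Independent runs should give means that differ by no more than a few
    standard deviations of the combined error sqrt(err_a^2 + err_b^2).
    """
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return abs(mean_a - mean_b) <= n_sigma * combined

# Hypothetical energy estimates from a serial and a parallel run:
print(consistent(-4.515, 0.012, -4.498, 0.015))  # True: difference is ~0.9 sigma
print(consistent(1.0, 0.01, 2.0, 0.01))          # False: wildly inconsistent
```

Repeated disagreement beyond 3 sigma on several observables would indicate that the parallel tasks are not running correctly.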
Matthias Troyer wrote:
We've had no problems with LAM and PBS. Please let me know if one of the combinations works.
Matthias
On 26 Apr 2007, at 09:57, Niall Moran wrote:
Thanks for your reply. We don't have LAM MPI because it does not work well with PBS; however, we do have MPICH2, as well as gcc 4.1.2, PathScale 2.5 and 3.0, and Portland 6.2.2, so I will try different combinations of these.
Niall.
Matthias Troyer wrote:
I just remembered that other users reported similar problems with a combination of gcc-3.3.3 and mpich. Using LAM MPI and gcc-4 made the problem go away then.
Matthias
On 26 Apr 2007, at 09:41, Niall Moran wrote:
Hi,
I have built and installed ALPS 1.3b3 and am trying to get the MPI version working. I am using a simple input file (shown below) with the dirloop_sse_mpi tool.
When running this on a single processor with dirloop_sse, each task takes around 62 seconds, and the tasks run sequentially on a single CPU.
I then tried running this in parallel on 8 processors. The system uses PBS for batch processing. I submitted the job interactively, requesting a walltime of 20 minutes and 8 CPUs (4 nodes x 2 processors per node):
qsub -l walltime=00:20:00 -l nodes=4:ppn=2 -I
Then I started the simulation with:
mpiexec -n 8 dirloop_sse_mpi job.in.xml
All the tasks seem to finish correctly; however, the processes remain until the walltime is up. The output is listed below.
I have tried building ALPS with both the gcc 3.3.3 and PathScale 2.5 compilers, along with mpich-1.2.7. The same thing happens with each.
Does anyone have any ideas what is going on?
Many Thanks,
Niall.
The difference between 1-2 and 61-62 seconds is that by default the ALPS scheduler checks only once per minute whether a job is done. For such short jobs you get a better timing estimate by reducing that time using the --Tmin xxx command-line option, which changes the minimum time between checks from 60 to xxx seconds. Try --Tmin 5. The easiest way to test whether the runs are correct is to compare the results.
Best regards
Matthias
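The polling behavior described above can be illustrated with a small sketch (generic Python, not ALPS code): the time at which a task is reported finished is effectively its actual duration rounded up to the next scheduler check, which is why 1-2 second tasks appear to take a full minute.

```python
import math

def reported_time(actual_seconds, check_interval=60):
    """Time at which a polling scheduler first notices a finished task.

    The scheduler only checks for completion every `check_interval`
    seconds, so the reported duration is the actual duration rounded
    up to the next check.
    """
    if actual_seconds <= 0:
        return 0
    return math.ceil(actual_seconds / check_interval) * check_interval

# With the default 60 s interval, a 2 s task is only noticed at 60 s:
print(reported_time(2))                      # 60
# Reducing the interval (as with --Tmin 5) tightens the estimate:
print(reported_time(2, check_interval=5))    # 5
```

This also explains the apparent 62-second serial tasks: anything finishing just after a check is not noticed until the following one.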
OK, I am pretty sure it is all working correctly now. Thanks for the help.
Niall.
comp-phys-alps-users@lists.phys.ethz.ch