Hello,
I would be glad if ALPS users could help me with a problem I've been experiencing. I have ALPS 2.2.b3 installed on an iMac (i5, mid 2011) and on an i7 desktop running 64-bit Ubuntu 15.04, both with Open MPI 1.6.5. I have no problem running ALPS executables on either computer, nor running basic programs through Open MPI (for instance `mpirun -host <host> hostname`) from either machine to the other. I can also run ALPS executables on either machine using Open MPI locally.
However, when I try to run an ALPS executable (e.g. loop) via Open MPI on both machines simultaneously, the jobs start and keep running indefinitely, but no output is produced, not even on the screen. Running mpirun with the -v flag gives no additional information. All I get is the following:
dirac-2:qmc apvieira$ mpirun -v -np 7 loop --mpi heisloop20.in.xml --Tmin 5
ALPS/looper version 3.2b12-20100128 (2010/01/28)
  multi-cluster quantum Monte Carlo algorithms for spin systems
  available from http://wistaria.comp-phys.org/alps-looper/
  copyright (c) 1997-2010 by Synge Todo wistaria@comp-phys.org
using ALPS/parapack scheduler
  a Monte Carlo scheduler for multiple-level parallelization
  copyright (c) 1997-2013 by Synge Todo wistaria@comp-phys.org
based on the ALPS libraries version 2.2.b3
  available from http://alps.comp-phys.org/
  copyright (c) 1994-2013 by the ALPS collaboration.
  Consult the web page for license details.
  For details see the publication:
  B. Bauer et al., J. Stat. Mech. (2011) P05001.
Has anyone had this kind of problem before?
Thanks, Andre Vieira
Have you tried running other C++ programs using MPI with your heterogeneous setup?
On Sep 30, 2015, at 18:29, Andre Vieira apvieira@if.usp.br wrote:
[...]
Do you use the identical version of openmpi on both machines, configured in the same way?
Thanks for the quick reply.
The Open MPI version is 1.6.5 on both machines. On the iMac I use the default version and configuration provided with ALPS, while on Ubuntu I use the version and configuration from the Ubuntu repositories. What specific configuration should I look for?
Running an example C++ program also fails. From the iMac I get normal execution with `mpirun -H localhost a.out` as well as with `mpirun -H linuxhost a.out`, but when I try `mpirun -H localhost,linuxhost a.out` I get the following (dirac-2 is the hostname of the iMac):
[dirac-2.local:22207] *** An error occurred in MPI_Recv
[dirac-2.local:22207] *** on communicator MPI_COMM_WORLD
[dirac-2.local:22207] *** MPI_ERR_TRUNCATE: message truncated
[dirac-2.local:22207] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 22207 on node dirac-2 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
Running from the Linux host produces the same behavior, with the same error message, so I guess there is something on the iMac (dirac-2) preventing communication between hosts.
2015-09-30 13:44 GMT-03:00 Matthias Troyer troyer@phys.ethz.ch:
Do you use the identical version of openmpi on both machines, configured in the same way?
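One way to approach the configuration question is to dump the output of `ompi_info` on each machine and compare the two dumps field by field. The sketch below is only an illustration, not an official tool: the helper names (`parse_info`, `diff_info`, `compare_files`) are made up for this example, and it assumes the dumps consist mostly of `key: value` lines. If a line about heterogeneous support shows up among the differences, or reads "no" on either side, that may be worth a closer look.

```python
def parse_info(text):
    """Collect 'key: value' lines into {key: [values]}; other lines are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            # Some keys can repeat (one line per component), so keep a list.
            info.setdefault(key.strip(), []).append(value.strip())
    return info


def diff_info(a, b):
    """Return {key: (values_in_a, values_in_b)} for every key that differs."""
    return {
        key: (a.get(key, []), b.get(key, []))
        for key in sorted(set(a) | set(b))
        if a.get(key, []) != b.get(key, [])
    }


def compare_files(path_a, path_b):
    """Compare two saved dumps, e.g. one per host (paths are up to the caller)."""
    with open(path_a) as fa, open(path_b) as fb:
        return diff_info(parse_info(fa.read()), parse_info(fb.read()))
```

For example, after saving the `ompi_info` output of each host to `imac.txt` and `ubuntu.txt` (hypothetical file names), `compare_files("imac.txt", "ubuntu.txt")` returns only the fields that differ between the two installations.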
Hi,
You first need to test whether you can get MPI to run heterogeneously before trying ALPS.
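A minimal sketch of that kind of step-by-step check, assuming `mpirun` is on the PATH and accepts `-H` with a comma-separated host list as in the commands quoted in this thread. The function names are hypothetical, and the host and program names are placeholders to be replaced with real ones. It runs the same small MPI program on the local host only, on the remote host only, and then on both, and reports the first step that fails or hangs.

```python
import subprocess


def build_mpirun_cmd(hosts, program, extra_args=()):
    """Return an argv list such as ['mpirun', '-H', 'a,b', 'prog']."""
    if not hosts:
        raise ValueError("at least one host is required")
    return ["mpirun", "-H", ",".join(hosts), program, *extra_args]


def ladder_commands(local, remote, program):
    """Commands from simplest to hardest: local only, remote only, both hosts."""
    return [
        build_mpirun_cmd([local], program),
        build_mpirun_cmd([remote], program),
        build_mpirun_cmd([local, remote], program),
    ]


def run_ladder(local, remote, program, timeout=60):
    """Run each step in order; return (failing_command, reason) or (None, message)."""
    for cmd in ladder_commands(local, remote, program):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # A job that never returns is treated as a failure of this step.
            return cmd, "timed out (possible hang)"
        if result.returncode != 0:
            return cmd, result.stderr
    return None, "all steps passed"
```

Calling `run_ladder("localhost", "linuxhost", "a.out")` would reproduce the three runs described earlier in the thread; only once the last step passes is it worth moving on to an ALPS executable.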
On Sep 30, 2015, at 19:22, Andre Vieira apvieira@if.usp.br wrote:
[...]
As I described in my previous message, running simple MPI jobs heterogeneously is not working. I'll try to figure out what's truncating messages between hosts.
Thanks, Andre.
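MPI_ERR_TRUNCATE means that an incoming message was larger than the buffer posted to receive it. One thing that could be ruled out, as a guess rather than a diagnosis, is a difference in basic C data sizes or byte order between the two hosts. The sketch below (helper names are hypothetical) records those values as seen by the local Python interpreter, which is assumed to be a native build matching the system's C ABI. Both machines here are 64-bit Intel, so the fingerprints may well be identical, which would point back toward how the two Open MPI installations were built.

```python
import ctypes
import sys

# C types whose sizes are most likely to matter for message layout.
C_TYPES = {
    "char": ctypes.c_char,
    "short": ctypes.c_short,
    "int": ctypes.c_int,
    "long": ctypes.c_long,
    "long long": ctypes.c_longlong,
    "float": ctypes.c_float,
    "double": ctypes.c_double,
    "long double": ctypes.c_longdouble,
    "size_t": ctypes.c_size_t,
    "void*": ctypes.c_void_p,
}


def abi_fingerprint():
    """Return the size in bytes of each C type plus the host byte order."""
    fp = {name: ctypes.sizeof(ctype) for name, ctype in C_TYPES.items()}
    fp["byteorder"] = sys.byteorder
    return fp


def mismatches(fp_a, fp_b):
    """Return {field: (value_a, value_b)} for fields that differ."""
    return {
        key: (fp_a.get(key), fp_b.get(key))
        for key in sorted(set(fp_a) | set(fp_b))
        if fp_a.get(key) != fp_b.get(key)
    }
```

Running `print(abi_fingerprint())` on each host and passing the two resulting dictionaries to `mismatches` shows at a glance whether any field differs.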
2015-09-30 17:16 GMT-03:00 Matthias Troyer troyer@phys.ethz.ch:
Hi,
You first need to test whether you can get MPI to run heterogeneously before trying ALPS
comp-phys-alps-users@lists.phys.ethz.ch