Dear ALPS users,
I have a large list of tasks run from a given job file parm.in.xml, which was terminated after some tasks were completed; my parm.out.xml looks like this after the termination:
===========<SNIP-1>=================
......
......
  <TASK status="finished">
    <INPUT file="parm.task62.out.xml"/>
  </TASK>
  <TASK status="finished">
    <INPUT file="parm.task63.out.xml"/>
  </TASK>
  <TASK status="new">
    <INPUT file="parm.task64.out.xml"/>
  </TASK>
  <TASK status="new">
......
......
============<SNIP-1>================
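As an aside, a quick way to double-check which tasks the scheduler still considers "new" is to read the statuses directly out of parm.out.xml. Below is only a minimal sketch using the Python standard library; it assumes nothing beyond the TASK/INPUT structure visible in the snippet above:

import xml.etree.ElementTree as ET

# Walk every <TASK> element in the job output file and report its status
# together with the task file it points to.
root = ET.parse("parm.out.xml").getroot()
for task in root.findall(".//TASK"):
    inp = task.find("INPUT")
    print(task.get("status"), inp.get("file") if inp is not None else "?")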
To restart the job from task64, I execute the command (without changing anything in the output files)
worm --Tmin 10 parm.out.xml
Could you please confirm that this is correct? I ask because, a while after executing the above, nothing seems to proceed:
============<SNIP-2>===============
.........
.........
Loading information about run 1 from file parm.task62.out.run1
Loading information about run 1 from file parm.task63.out.run1
Task 1 finished.
........
........
Task 63 finished.
Created run 1 locally
Starting task 64.
Checking if it is finished: not yet, next check in 10 seconds ( 0% done).
Checking if it is finished: not yet, next check in 203 seconds ( 1% done).
Checking if it is finished: not yet, next check in 174 seconds ( 23% done).
=============<SNIP-2>==============
From then on the simulation just hangs for many hours (I've tried this procedure repeatedly), which it should not, because my SWEEPS and THERMALIZATION are not excessively large.
Any ideas on why the restarted simulations don't seem to proceed would be appreciated.
With regards, Vipin
Dear ALPS users,
More info on the previous problem (summary: a previously executed job file with many tasks hangs at a particular task upon restarting the job file), as follows:
1) If I pick out the parameter file of the hung task from the job-file and run it separately, the simulation of this particular task goes through. So there's no fundamental problem with the task(s) being run.
2) If I drastically reduce the SWEEPS and THERMALIZATION of just the hung task in the job-file and re-start the job-file, the hung task and all succeeding tasks go through. This is surprising because all the tasks of a job file are more or less the same (small temperature differences, different random seeds, the same SWEEPS and THERMALIZATION, etc.).
3) Parameters of an example hung task:
================<SNIP>======================
/LATTICE="inhomogeneous simple cubic lattice periodic";// //L=14;// // //MODEL="hardcore boson";// //V=0;// //t=1.0;// //NONLOCAL=0;// // //T=1.32;// //SWEEPS=50000;// //THERMALIZATION=100000;// // //{DISORDERSEED=69830; mu=8*2*(random()-0.5);}/
==================<SNIP>=====================
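For clarity on the last parameter line above: as far as I understand the expression, mu = 8*2*(random()-0.5) assigns on-site chemical potentials drawn uniformly from [-8, 8) for the given disorder seed. A tiny standalone Python sketch of the same arithmetic, purely to illustrate the range (random.seed here is only a stand-in for DISORDERSEED, not the ALPS disorder generator):

import random

# Illustration only: 8*2*(u - 0.5) maps a uniform variate u in [0, 1)
# to a value in [-8, 8).
random.seed(69830)  # stand-in for DISORDERSEED
samples = [8 * 2 * (random.random() - 0.5) for _ in range(5)]
print(samples)  # five example on-site chemical potentials in [-8, 8)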
I would appreciate any ideas on why, when I restart a job file, the first task to be run always hangs (blocking everything else), while the same task goes through when run separately.
Thanks, Vipin
Please use the new directed worm algorithm code instead. Tama Ma can help you get started. We will remove the old worm code soon.
Matthias
Hi Matthias,
Thanks for the suggestion. I tried the new directed worm code (dwa), but at present it does not support disordered systems (based on my simulations on disordered systems, and confirmed by Tama Ma).
So until support for disorder is included in dwa, any suggestions as to why I'm unable to restart simulations with the standard worm code would be greatly appreciated.
Thanks, Vipin
Please ask Tama to add disorder support if you need it. It will be easy to do. We no longer support the old worm code.
Matthias