Dear Mateusz,
I will look at the issue as soon as I find some time to investigate it. Right now I have been too busy with other important things, but I have not forgotten your problem.
Matthias
On 17 Jan 2012, at 08:31, Mateusz Łącki wrote:
Dear Matthias, I would like to add some new details (the "read" messages come from the very first patch for diagnosing the iostream fix; I was too lazy to unpack the source again). On one core, with no MPI, the problem apparently is less severe. At least I was able to obtain the following output:
Done with checkpoint.
Checking if it is finished: not yet, next check in 900 seconds ( 52% done).
Checking if it is finished: not yet, next check in 900 seconds ( 54% done).
Making regular checkpoint.
Checkpointing Simulation 1
read: 111
read: 98
read: 111
read: 98
Done with checkpoint.
Checking if it is finished: not yet, next check in 900 seconds ( 55% done).
Checking if it is finished: not yet, next check in 900 seconds ( 57% done).
Making regular checkpoint.
Checkpointing Simulation 1
read: 111
read: 98
read: 111
read: 98
Done with checkpoint.
Checking if it is finished: not yet, next check in 900 seconds ( 59% done).
Checking if it is finished: not yet, next check in 900 seconds ( 61% done).
Making regular checkpoint.
Checkpointing Simulation 1
read: 111
read: 98
read: 111
read: 98
Done with checkpoint.
Avoided problem
Avoided problem
Avoided problem
Avoided problem
Avoided problem
But the simulation has not output anything for the last 24 h (it has taken over 1904 minutes altogether, so reaching 60% took around 9 h). The problem looks substantially different from the MPI case.
The parameter file was:
LATTICE="inhomogeneous open chain lattice"; L=30;
MODEL="trapped boson Hubbard"; NONLOCAL=0; U = 1.0; mu = 0.5; Nmax = 5;
T=0.04; t=0.05; K=0.00
MEASURE[Correlations] = 'True'; MEASURE_LOCAL[Occupation] = "n"; MEASURE_LOCAL[SlonTrabalski] = "n2"; MEASURE_CORRELATION[Czeslaw] = "n:n" THERMALIZATION=100000; SWEEPS=2000000; dasdaRESTRICT_MEASUREMENTS[N]=30
{t=0.55; mu=-0.315; L=120; RESTRICT_MEASUREMENTS[N]=120}
On 14 January 2012 at 19:22, Mateusz Łącki mateusz.lacki@gmail.com wrote:
Dear Matthias, Did you have any success troubleshooting the issue? I have run into it again (different compilation, different machine). The only common factor seems to be the large number of worms involved (in the newest case it was 50). Of course the error is identical, and so are the *.xml files, but the parameters are a little different.
Regards, Mateusz
On Dec 31, 2011, at 10:35 PM, Mateusz Łącki wrote:
Dear Matthias, I attach new output files. Sorry for the delay - I had misapplied the patch and needed to redo the whole procedure. Nevertheless, the parameters are now printed.
<dalps2.zip>
Regards, Happy New Year Mateusz
On Dec 31, 2011, at 12:31 PM, Matthias Troyer wrote:
No, the output files are not needed. I just need to know which of the simulations caused the problem and the patch will help with that.
Matthias
On Dec 31, 2011, at 12:20 PM, Mateusz Łącki wrote:
OK, I have set up the calculation again with the new patch. I will also try to run it on one core only (but this will take time). Would any output files, such as the .h5 files, be of any help?
Regards, Mateusz
One thing that might help me find which of the hundreds of simulations that you started caused the issue is if you apply the following patch and try again:
--- applications/qmc/worms/Wcheck.C (revision 5899)
+++ applications/qmc/worms/Wcheck.C (working copy)
@@ -94,6 +94,7 @@
 void WRun::print_spins() {
+  std::cout << parms;
   std::cout << "Spin configuration:\n";
   std::cout << "Wormheads at " << worm_head[0].site() << " " << worm_head[0].time() << " and " << worm_head[1].site() << " " << worm_head[1].time() << std::endl;
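For clarity, the added line only streams the run's parameter set to standard output right before the spin-configuration dump, so every crash report starts with the parameters of the simulation that produced it. Here is a minimal, self-contained sketch of that effect; a plain std::map stands in for the actual parameter container, which is an assumption made for this example:

// Minimal sketch only (not ALPS code): std::map stands in here for the
// parms object that the patch streams. The point is just that every
// crash dump now begins with the parameters of the offending simulation,
// e.g. "L = 120; t = 0.55; ...", which identifies the run.
#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, std::string> parms;  // stand-in for the real parameter container
  parms["LATTICE"] = "inhomogeneous open chain lattice";
  parms["L"]       = "120";
  parms["t"]       = "0.55";
  parms["mu"]      = "-0.315";

  // Roughly what the extra "std::cout << parms;" line achieves:
  for (std::map<std::string, std::string>::const_iterator it = parms.begin();
       it != parms.end(); ++it)
    std::cout << it->first << " = " << it->second << ";\n";

  std::cout << "Spin configuration:\n";  // the existing dump follows as before
  return 0;
}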
On 30 Dec 2011, at 23:26, Matthias Troyer wrote:
> It's hard debugging this if you launch so many jobs by MPI. Have you tried to see whether the problem also occurs if you don't use MPI? And which version of ALPS do you use?
>
> Matthias
>
> On 30 Dec 2011, at 23:04, Mateusz Łącki wrote:
>
>> Dear Matthias,
>> I attach the input file (parm5c), the modified models.xml and lattices.xml, and stdout and stderr in separate files (out, out2).
>>
>> Regards,
>> Mateusz Łącki
>>
>> <dalps.zip>
>>
>> On Dec 30, 2011, at 5:36 PM, Matthias Troyer wrote:
>>
>>> Hi,
>>>
>>> I can only look into this if you send your input file.
>>>
>>> Matthias
>>>
>>> On 30 Dec 2011, at 12:05, Mateusz Łącki wrote:
>>>
>>>> Dear All,
>>>> I have set up some computation which failed:
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 27 in communicator MPI_COMM_WORLD
>>>> with errorcode -2.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpiexec has exited due to process rank 27 with PID 15245 on
>>>> node clone18 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpiexec (as reported here).
>>>> --------------------------------------------------------------------------
>>>>
>>>> Just before that (in the output file, not sure about the time):
>>>>
>>>> Avoided problem
>>>> q = -0 state1 = 0 state2 = 0 bond_type = 0 id 2047013814 2047013814
>>>> Spin configuration:
>>>> Wormheads at 78 0.999997 and 78 0.259301
>>>> Site: 0
>>>> Kink : [ 0.119336 : 0 ]
>>>> Kink : [ 0.124065 : 1 ]
>>>> Kink : [ 0.174815 : 0 ]
>>>> Kink : [ 0.17605 : 1 ]
>>>> Kink : [ 0.335094 : 2 ]
>>>> Kink : [ 0.368865 : 1 ]
>>>> (...)
>>>> Site: 299
>>>> Kink : [ 0.00590279 : 2 ]
>>>> Kink : [ 0.0326616 : 1 ]
>>>> Kink : [ 0.0697665 : 0 ]
>>>> Kink : [ 0.0977223 : 1 ]
>>>> Kink : [ 0.254292 : 2 ]
>>>> Kink : [ 0.256147 : 1 ]
>>>> Kink : [ 0.328286 : 2 ]
>>>> Kink : [ 0.329838 : 1 ]
>>>> Kink : [ 0.405038 : 0 ]
>>>> Kink : [ 0.438803 : 1 ]
>>>> Kink : [ 0.487034 : 2 ]
>>>> Kink : [ 0.503331 : 1 ]
>>>> Kink : [ 0.812159 : 2 ]
>>>> Kink : [ 0.827811 : 1 ]
>>>>
>>>> Is this related? I am not sure whether this output indicates an error.
>>>>
>>>> Regards,
>>>> Mateusz
>>>>
>>>> On Dec 29, 2011, at 9:59 AM, Matthias Troyer wrote:
>>>>
>>>>> Yes, indeed.
>>>>>
>>>>> On Dec 29, 2011, at 9:53 AM, Mateusz Łącki wrote:
>>>>>
>>>>>> Dear Matthias,
>>>>>>
>>>>>> Thank you for your answer. If I understand correctly, the problem is solved and the results now take this special case into account?
>>>>>>
>>>>>> Regards,
>>>>>> Mateusz
>>>>>>
>>>>>>> This is a debug message which I added a while ago when solving a problem that we had because of the finite resolution of floating-point numbers. There is a chance of 1e-16 per site and unit imaginary-time interval that two bosons hop away from two neighboring sites at exactly the same time. This case needs special consideration, and the notice was added to indicate that it had happened.
>>>>>>>
>>>>>>> Matthias
>>>>>>>
>>>>>>> On Dec 29, 2011, at 9:39 AM, Mateusz Łącki wrote:
>>>>>>>
>>>>>>>> Dear All,
>>>>>>>> While running some QMC (worm) computations by MPI over several nodes I noticed the "Avoided problem" message appearing from time to time. It does not seem particularly dangerous, but is it possible to find out what the problem was in the first place?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mateusz Łącki
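To make the quoted explanation of the "Avoided problem" notice concrete, here is a minimal sketch of the exact-time degeneracy it describes. This is an illustration only, not the actual ALPS worm code; the helper times_collide and the literal times are made up for the example:

// Minimal sketch, not the actual ALPS worm code: illustrates the kind of
// floating-point degeneracy behind the "Avoided problem" notice. Imaginary
// times are doubles with a relative resolution of roughly 1e-16, so two
// hops on neighbouring sites can occasionally land on exactly the same
// time; that tie must be treated explicitly instead of assuming a strict
// time ordering. The helper and the literal times below are hypothetical.
#include <iostream>

bool times_collide(double t1, double t2) {
  return t1 == t2;  // exact equality, deliberately not a tolerance check
}

int main() {
  double hop_a = 0.259301;  // hypothetical kink time on site i
  double hop_b = 0.259301;  // hypothetical kink time on neighbouring site i+1

  if (times_collide(hop_a, hop_b)) {
    // Degenerate case: the update treats it specially, and the debug
    // notice is printed so the event is visible in the output.
    std::cout << "Avoided problem" << std::endl;
  }
  return 0;
}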