Re: [ALPS-users] IXDRDump

10 May 2010


      Hi Synge,
I've found what the problem was (see https://lists.phys.ethz.ch/pipermail/comp-phys-alps-users/2010/000784.html ).
It was my "fault"... Anyway, I think it could be useful to have
some debug-time functionality to prevent this kind of problem (for
example, something that checks that the types dumped with << are
consistent with the types loaded with >>).
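
To illustrate the kind of debug-time check suggested above, here is a minimal sketch of a type-tagged dump. It is not the ALPS ODump/IDump API: TaggedWriter, TaggedReader and type_tag are hypothetical names, and the format (one tag byte followed by the raw bytes of the value) is an assumption made only for this example.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical one-character tags for the types we allow in a dump.
template <class T> struct type_tag;
template <> struct type_tag<std::int32_t> { static char value() { return 'i'; } };
template <> struct type_tag<double>       { static char value() { return 'd'; } };

// Writes a tag byte before every value (hypothetical, not alps::ODump).
class TaggedWriter {
public:
    explicit TaggedWriter(std::ostream& os) : os_(os) {}
    template <class T>
    TaggedWriter& operator<<(const T& x) {
        os_.put(type_tag<T>::value());
        os_.write(reinterpret_cast<const char*>(&x), sizeof(T));
        return *this;
    }
private:
    std::ostream& os_;
};

// Checks the tag byte against the requested type before reading the value.
class TaggedReader {
public:
    explicit TaggedReader(std::istream& is) : is_(is) {}
    template <class T>
    TaggedReader& operator>>(T& x) {
        const int tag = is_.get();
        if (tag == std::char_traits<char>::eof())
            throw std::runtime_error("unexpected end of dump");
        if (static_cast<char>(tag) != type_tag<T>::value())
            throw std::runtime_error(
                std::string("type mismatch: dumped '") + static_cast<char>(tag) +
                "', loading '" + type_tag<T>::value() + "'");
        is_.read(reinterpret_cast<char*>(&x), sizeof(T));
        if (!is_) throw std::runtime_error("truncated value in dump");
        return *this;
    }
private:
    std::istream& is_;
};
```

With such a wrapper, dumping a double and then loading it into a std::int32_t throws at the first mismatching value instead of silently misreading everything that follows. The extra byte per value costs space, so a check like this would presumably be enabled only in debug builds.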
...
Hi, Giuseppe,
Are you sure that your first job (which wrote the corrupted dump files)
really finished before being terminated by the job scheduler?
What does the scheduler log of the first job look like?  Did you see
the message "Reached time limit." at the end of the log?
Synge
On 2010/05/10, at 0:34, Giuseppe Carleo wrote:
...
Hi,
I have realized that there's a "problem" with the filesystem of the
cluster. The binary checkpoint files are corrupted because only
part of them is written to the hard disk before the job ends.
In fact, while on other machines the size of the checkpoint files
increases whenever a call to dump<<something is performed,
on this machine the files do not grow, and only rarely is
something (but NOT everything) written to them.
Probably, for efficiency reasons, the writing to the files is
postponed and the data are kept in memory until the next flush.
Now, I was wondering if I could circumvent this problem by calling a
flush()-like function of alps::ODump, but I think this
function is not implemented. Is that correct?
Thanks again,
Giuseppe
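
As a generic illustration of the buffering described above, here is a minimal sketch using plain std::ofstream rather than alps::ODump. The helper names write_checkpoint and file_size and the on-disk layout (element count followed by raw doubles) are hypothetical. Whether an explicit flush would have helped on this cluster is not established by this thread.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: writes a count followed by the raw doubles, then
// flushes the stream buffer and reports whether every step succeeded.
bool write_checkpoint(const std::string& path, const std::vector<double>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    const std::size_t n = data.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(n * sizeof(double)));
    out.flush();        // hand the buffered bytes to the operating system
    return out.good();  // false if any write or the flush failed
}

// Hypothetical helper: size of a file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long long>(in.tellg()) : -1;
}
```

Note that flush() only passes the data on to the operating system. On a network or parallel filesystem the OS may still delay the physical write, so checking the stream state after the flush is a necessary check but not a guarantee that the data are on disk.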
...
Hi Giuseppe,
It seems that something is written incorrectly on that machine.  
What happens if you compile and link your code (statically) on  
another Linux machine and then run it on the bug cluster?
Matthias
On May 5, 2010, at 1:38 PM, Giuseppe Carleo wrote:
...
It seems that the checkpoint files generated on the cluster have
some problems:
If I use checkpoint files generated on another machine, then the
simulation on the cluster is correctly resumed.
At the end of this resumed simulation, if I try to resume it
again I get the same error message as before.
If instead I use the checkpoint files from the cluster to resume a
simulation on another machine, then I get the same error message
as before, plus an extra error message "vector::_M_fill_insert".
Probably there's some issue with the binary format... I don't know.
Giuseppe
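
In libstdc++, "vector::_M_fill_insert" is the text of the std::length_error thrown when a vector is asked to grow beyond its maximum size. One way this can happen is when an element count is read from a corrupted or misinterpreted dump. The following is a minimal sketch of a defensive read, assuming a hypothetical format (a 64-bit count followed by raw doubles) that is not the XDR format used by ALPS. The function read_array and its max_count parameter are illustrative names only.

```cpp
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical reader: validates the stored count before allocating.
std::vector<double> read_array(std::istream& in, std::uint64_t max_count) {
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    if (!in) throw std::runtime_error("failed to read array length");
    // A garbage count is rejected here instead of reaching the allocator.
    if (n > max_count)
        throw std::runtime_error("implausible array length in dump");
    std::vector<double> v(static_cast<std::size_t>(n));
    in.read(reinterpret_cast<char*>(v.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
    if (!in) throw std::runtime_error("failed to read array of type double");
    return v;
}
```

The caller chooses max_count from what the simulation can plausibly store, so a corrupted header produces a clear message rather than an allocation failure deep inside the standard library.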
...
Can you read the checkpoint files on other machines, or can you
read checkpoint files from other machines on that one?
On May 5, 2010, at 11:09 AM, Giuseppe Carleo wrote:
...
Hi Matthias,
thank you for your quick answer.
The machine I'm talking about is this one: http://www.top500.org/system/9881 ,
so it is basically a Linux cluster with Intel processors.
A typical error message is:
parsing task files ...
Loading information about run 1 from file /scratch/cont003/carleo/beta40/N180.task1.out.run1
failed to read array of type double from an IXDRDump
Cannot open simulation file /scratch/cont003/carleo/beta40/N180.task1.out.xml.
This issue happens for all the checkpoints, and the checkpoint
files exist (i.e., in the previous case both
/scratch/cont003/carleo/beta40/N180.task1.out.xml
and /scratch/cont003/carleo/beta40/N180.task1.out.run1 exist)
and they are not truncated (i.e., at least the .out.xml files
correctly end with </MCRUN></SIMULATION> ).
Moreover, the checkpoint files are indeed accessible and have
the right permissions (-rw-r-----)... uhm, strange.
Giuseppe
> Hi Giuseppe
>
> What type of machine are you using ALPS on? I cannot  
> immediately tell you what the problem might be. Do all files  
> actually exist locally or might the checkpoints not be  
> accessible? Or maybe the file was truncated by the process  
> being killed? Does this happen to all checkpoints or just some?
>
> Matthias
>
> On May 5, 2010, at 10:38 AM, Giuseppe Carleo wrote:
>
>> Hello everybody,
>>
>>
>> I am currently using the ALPS (v. 1.35) scheduler in my QMC  
>> code, and everything works pretty well.
>>
>> Nonetheless, I've experienced an error when trying to restart
>> my simulations on an HPC machine:
>>
>> "failed to read array of type double from an IXDRDump"
>>
>> which doesn't allow me to restart any simulation...
>>
>> On the other hand, on other machines the simulations are  
>> correctly restarted.
>> I therefore assume that the way I restart simulations
>> is correct, i.e. I invoke something like mpirun ./myprogram.o
>> simulation_name.out.xml ,
>> and that the dumping of the internal variables is done
>> correctly in my code.
>>
>> I think the error message should be related to some
>> machine-specific issue.
>>
>> Do you have suggestions for this problem?
>>
>>
>> Thank you in advance,
>>
>>
>> Giuseppe
>>
>
