Hello everybody,
I am currently using the ALPS (v. 1.35) scheduler in my QMC code, and everything works pretty well.
Nonetheless, I've run into an error when trying to restart my simulations on an HPC machine:
"failed to read array of type double from an IXDRDump"
which doesn't allow me to restart any simulation...
On the other hand, on other machines the simulations are restarted correctly. I therefore assume that the way I restart simulations is correct, i.e. I invoke something like mpirun ./myprogram.o simulation_name.out.xml , and that the dumping of the internal variables is done correctly in my code.
I think the error is related to some machine-specific issue.
Do you have suggestions for this problem?
Thank you in advance,
Giuseppe
Hi Giuseppe
What type of machine are you using ALPS on? I cannot immediately tell you what the problem might be. Do all files actually exist locally or might the checkpoints not be accessible? Or maybe the file was truncated by the process being killed? Does this happen to all checkpoints or just some?
Matthias
Hi Matthias,
thank you for your quick answer.
The machine I'm talking about is this one: http://www.top500.org/system/9881 , so it is basically a Linux cluster with Intel processors.
A typical error message is :
parsing task files ...
Loading information about run 1 from file /scratch/cont003/carleo/beta40/N180.task1.out.run1
failed to read array of type double from an IXDRDump
Cannot open simulation file /scratch/cont003/carleo/beta40/N180.task1.out.xml.
This issue happens for all the checkpoints, and the checkpoint files exist (i.e., in the case above both /scratch/cont003/carleo/beta40/N180.task1.out.xml and /scratch/cont003/carleo/beta40/N180.task1.out.run1 exist) and they are not truncated (at least the .out.xml files correctly end with </MCRUN></SIMULATION>).
Moreover, the checkpoint files are indeed accessible and have the right permissions (-rw-r-----)... hmm, strange.
Giuseppe
Can you read the checkpoint files on other machines, or can you read checkpoint files from other machines on that one?
It seems that the checkpoint files generated on the cluster have some problems:
If I use checkpoint files generated on another machine, then the simulation on the cluster is correctly resumed. At the end of this resumed simulation, if I try to resume it again I get an error message like before.
If instead I use the checkpoint files from the cluster to resume a simulation on another machine, then I get the same error message as before, plus an extra error message "vector::_M_fill_insert".
Probably there's some issue with the binary format... I don't know.
Giuseppe
Hi Giuseppe,
It seems that something is written incorrectly on that machine. What happens if you compile and link your code (statically) on another Linux machine and then run it on the problematic cluster?
Matthias
Hi,
I have realized that there's a "problem" with the filesystem of the cluster. The binary checkpoint files are corrupted because only part of them is written to disk before the job ends. In fact, while on other machines the size of the checkpoint files increases whenever a call to dump << something is performed, on this machine the files do not grow, and only occasionally some (but NOT all) of the data is actually written to the files.
Probably, for efficiency reasons, writes to the files are deferred and the data are kept in memory until the next flush.
Now, I was wondering if I could work around this problem by calling a flush()-like function on the alps::ODump, but I think this function is not implemented. Is that correct?
Thanks again,
Giuseppe
Hi Giuseppe,
Can you please edit the file src/alps/osiris/xdrdump.C?
There, around line 344, you should find:
// destructor closes the stream and file
OXDRFileDump::~OXDRFileDump()
{
  xdr_destroy(&xdr_);
  if (file_)
    std::fclose(file_);
}
Can you please insert a call to flush() before the xdr_destroy? It should look like this:
// destructor closes the stream and file
OXDRFileDump::~OXDRFileDump()
{
  flush();
  xdr_destroy(&xdr_);
  if (file_)
    std::fclose(file_);
}
Could you please try this and tell me whether it helps?
Matthias
I have made the changes you suggested so that flush is called explicitly before exiting, but unfortunately it doesn't work.
I have also tried, after each dump << stuff, making a system call to the function sync... still without any success.
The binary files are incomplete, and when the program tries to load them it is able to load only a few of the dumped variables...
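(For the record, by "flush and sync" I mean something like the following sketch at the C/POSIX level; this is only an illustration with a generic FILE* handle, not the actual ALPS code:)

  #include <cstdio>    // std::fflush, fileno (POSIX)
  #include <unistd.h>  // fsync

  // Push a checkpoint file's buffered data to the kernel, then ask the kernel
  // to commit it to disk. 'fp' stands for whatever FILE* the dump writes to.
  void force_to_disk(std::FILE* fp)
  {
    std::fflush(fp);     // flush the stdio buffer
    fsync(fileno(fp));   // flush the kernel buffers for this file
  }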
I think I will contact the guys from the cluster and ask them about this problem.
Thank you anyway!
Giuseppe
Indeed, this seems not to be related to ALPS but to your system. There is one possible workaround: use the XDR library that comes with ALPS. Please edit the file src/alps/config.h and comment out the line defining the macro ALPS_HAVE_RPC_XDR_H.
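(The exact form of that line may differ from build to build; the idea is simply to disable the macro, e.g.:)

  /* #define ALPS_HAVE_RPC_XDR_H 1 */   /* commented out so the XDR code bundled with ALPS is used */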
Matthias
Hi everybody,
thank you for your answers. After Fabien Alet told me that he hadn't experienced problems on that cluster, I looked more carefully at my code... and I found a very SUBTLE bug:
I was dumping the size of a vector<Sometype> V_ like this:

  dump << V_.size();

and then loading it like this:

  int size;
  dump >> size;

This was causing all the problems, as I was wrongly assuming that V_.size() was dumped as an int, while it wasn't (it is a size_t). The correct way is of course to cast V_.size() to an integer... and then dump it.
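(A sketch of the fix, using the same names as above; on that cluster size_t is presumably 64-bit while int is 32-bit, so the mismatched read misaligns everything loaded afterwards:)

  // write: store the size explicitly as an int, the same type used on load
  dump << static_cast<int>(V_.size());

  // read: now the number of bytes read matches what was written
  int size;
  dump >> size;
  V_.resize(size);   // e.g. restore the vector to its dumped length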
Quite strangely, this problem manifested itself only on this machine.
Thank you again, and sorry for all these apparently useless emails...!
Giuseppe
Hi, Giuseppe,
Are you sure that your first job (the one that wrote the corrupted dump files) really finished before being terminated by the job scheduler? What does the scheduler log of the first job look like? Did you see the message "Reached time limit." at the end of the log?
Synge
Hi Synge,
I've found what the problem was ( https://lists.phys.ethz.ch/pipermail/comp-phys-alps-users/2010/000784.html ). It was my "fault"... Anyway, I think it could be useful to have some debug-time functionality to prevent this kind of problem (for example, something that checks that the types dumped with << are consistent with the ones loaded with >>).
Dear Giuseppe,
Unfortunately one cannot check for that unless one prepends type information to every variable, which would roughly double the file size and the I/O time.
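(Just to sketch what that would mean - a hypothetical debug-only wrapper around the dump operators, not something that exists in ALPS:)

  #include <stdexcept>

  // Hypothetical debug helpers: write a type tag before each value and check it on load.
  static const unsigned int TAG_DOUBLE = 2;

  template <class ODumpT>
  void dump_checked(ODumpT& dump, double x)
  {
    dump << TAG_DOUBLE << x;   // tag, then payload
  }

  template <class IDumpT>
  void load_checked(IDumpT& dump, double& x)
  {
    unsigned int tag;
    dump >> tag;
    if (tag != TAG_DOUBLE)
      throw std::runtime_error("checkpoint type mismatch: expected double");
    dump >> x;
  }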
Matthias