Hi all, Is it not possible to define a simple graph with a complex hopping parameter in the Hamiltonian when using full diagonalization?
For example, a 4-site chain is defined using a simple graph as
<LATTICES>
  <GRAPH name="Chain" vertices="4">
    <VERTEX id="1" type="0"></VERTEX>
    <VERTEX id="2" type="0"></VERTEX>
    <VERTEX id="3" type="0"></VERTEX>
    <VERTEX id="4" type="0"></VERTEX>
    <EDGE type="0" source="1" target="2"/>
    <EDGE type="0" source="2" target="3"/>
    <EDGE type="0" source="3" target="4"/>
  </GRAPH>
</LATTICES>
which I could also get from the lattices.xml file in the lattice library. I found that when a lattice from the library is used, complex hopping works fine, but with the simple graph above I get the error 'cannot convert complex number into real one'. Is there any reason for this error? The reason I want to use a simple graph is that I want to use a different lattice structure for my calculations.
Thanks
Akin
I have recently been running a lot of dmrg calculations on a Linux cluster. I realize that the dmrg program is not parallelized, but I run many separate programs on separate processors. However, the clusters are set up so that there are 2 processors per node. I'm using ALPS 1.3.3 with Boost 1.34.1 and lp_solve 4.0. I had to specify where the MPICH2 files were when I compiled, as well as specifying the compiler as GNU (maybe that's the problem?). I was told by the support staff of the Linux cluster to try that.
The most distressing error has been "St9bad_alloc" which crashes the program most of the time (sometimes it is able to keep going). I recently received that error followed by five lines of: *** ERROR in dsyev: info != 0 (failed to converge)
when I was running a 2D heisenberg model (4x4 lattice).
I also have received the error: *** glibc detected *** double free or corruption (!prev): 0x00000000007b34a0 ***
but it never seems to crash the program and the results seem fine, so I'm not as worried about that one. I don't receive these errors all the time, but they worry me all the same. Any ideas on what might be wrong? Is it because I used GNU as the compiler?
Thanks, Justin
This indicates a problem in LAPACK:
*** ERROR in dsyev: info != 0 (failed to converge)
I have had issues in the past with certain proprietary LAPACK libraries failing to converge where the slower Netlib reference version succeeds.
What library are you using? Is your machine 64-bit? Are you compiling ALPS for 64-bit integers? Is your BLAS/LAPACK for 64-bit integers?
When you're running multiple independent jobs on the cluster, are you running them in the same directory on a shared file system? The DMRG scratch files aren't named uniquely for each job, so if they are all written to one directory on the same filesystem, each job will overwrite the other jobs' files. This can cause all sorts of terrible things to occur.
You might want to set up your job submission script to create a temporary scratch directory for each job to run in, and have the script copy the final output files back to whatever directory you submitted the job from. For example, something along these lines:
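A minimal sketch of such a PBS submission script, assuming the cluster provides a per-user scratch area under /scratch and that the parameter file is called parms; the paths, resource limits, and the dmrg location are placeholders, not taken from this thread:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00

# unique scratch directory per job, so DMRG temporary files never collide
SCRATCH=/scratch/$USER/$PBS_JOBID
mkdir -p "$SCRATCH"
cd "$SCRATCH"

# copy the input from the submission directory, run the serial dmrg, copy results back
cp "$PBS_O_WORKDIR/parms" .
"$HOME/alps/bin/dmrg" parms > dmrg.out 2>&1
cp -r ./* "$PBS_O_WORKDIR/"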
And no, GCC is not the problem here.
Jeff
I don't know the answers to all of those things (I'm not very experienced with Linux), but I do know some. The jobs are not all running in the same shared folder; I am using a scratch folder (the Linux cluster people direct us to do so because it gives faster disk access). The machine is 64-bit. I don't know about the LAPACK library (I'm not sure how to check). I didn't have to specify a location for that one or for the BLAS library. I just followed the ALPS installation instructions, so I don't know if I really installed it for 64-bit or not. I've tried looking around for the LAPACK library but haven't found it. Maybe I'll have to ask the people who run the cluster about this.
Thanks for the reply, Justin
type "man dsyev" in the terminal. if Lapack is properly installed, you should be able to see some explanation about this LAPACK subroutine. Can you find the explicit value of the "info" variable after the occurance of the error. As you will see in the manual page of "dsyev" that the value of "info" gives clues about the nature of the error.
Dear Justin,
To properly diagnose your problem we need some help from you. Could you please send us the output file? Could you also try running it under gdb? You need to follow these steps:
$ gdb ./dmrg
(gdb) run parms_file
... crash!
(gdb) bt
and you send us the output.
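If it is awkward to run gdb interactively on the cluster, an equivalent non-interactive invocation (assuming GNU gdb and a parameter file called parms_file) would be:

$ gdb -batch -ex run -ex bt --args ./dmrg parms_file > gdb_output.txt 2>&1

which runs the program and, if it crashes, writes the backtrace into gdb_output.txt.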
Another thing: can you tell us a little bit about the problem you are studying? It may help.
Thank you, Saludos, <ADRIAN>
I think that I may have figured it out. I think it has to do with my entering one of the mpi commands incorrectly. This may not be all of the problems, but I'll let you know if I have any more of them. I was having trouble duplicating the errors in gdb, but that was probably because of my ignorance of the mpi commands. It could also be that these errors are very infrequent. I'll just have to see. Thanks so much for being willing to help and being patient with me.
Justin
Hi Justin,
I don't know why you are using MPI at all. As you have already noticed, the dmrg code is serial. You only need to use your queuing system to submit serial jobs on different nodes. Just make sure you are running different jobs in different directories. If you run in interactive mode or under gdb and everything works fine, the problem is elsewhere.
Saludos, <ADRIAN>
The reason I have been using mpi is because I have only seen examples for these clusters using mpi. I know that dmrg is serial. I'll see if I can figure out how to submit the jobs without mpi. I am running different jobs in different directories.
Thanks, Justin
What queue system are you using? Can you post your submission script?
You should be able to just replace the line that looks like this:
mpirun (mpi commands) inputfile > outputfile
with this:
executable inputfile > outputfile
If that doesn't work, trial-and-error should lead to a solution in less than 5 iterations, if my experience holds true.
You may need to recompile the executable with non-MPI compilers if you used CXX=mpicxx the first time.
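A quick way to check whether the existing binary actually pulls in MPI (and which BLAS/LAPACK it links against), assuming it is dynamically linked:

$ ldd ./dmrg | grep -iE 'mpi|lapack|blas'

If nothing MPI-related shows up, the binary itself is already serial and only the submission script needs to change.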
Jeff
I'm using PBS. I had already decided to try what you suggested; I just never thought to do that before because I was blindly following the only examples posted by the cluster admin. It seems to be working so far.
Thanks again, Justin