CUDA apps behave weirdly, also can I restart the task received if I get an init error?

Vesper
Joined: 3 Mar 11
Posts: 3
Credit: 9418291
RAC: 0
Topic 195721

I have a WinXP PC with 2 processors and 1 GPU, which is shared via user switching (an XP feature). When I boot the PC and log in under my profile, everything works nicely with all the applications. But some time after a profile switch occurs, the CUDA application (the 609 one, BRPcuda32) crashes, complaining "can't allocate XXX bytes of GPU memory". That task then waits 5 minutes, after which CUDA tasks start failing rapidly, within 5 s each, with "Unable to initialize CUDA something" and a general error of 1020. The very same scenario plays out if the other profile is loaded first after a reboot: as long as there's no profile switching, everything works OK, and the same crash sequence starts as soon as we switch places at the monitor. This is not only an Einstein@Home problem; SETI@Home sometimes sends a CUDA WU that behaves the same way, so there must be something wrong with CUDA.

I also wonder: is there a way to recover a "failed to init" slot and force BOINC to re-run the task after I reboot the system to reset the GPU and its supporting systems? I just don't want failing tasks.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Quote:
so there must be something wrong with CUDA.


Yes, and it's called Windows. When you do a fast user switch in Windows, you're also changing the video driver (just as you would when using Remote Desktop) from the one you installed to one supplied by Windows, which doesn't know anything about CUDA or OpenGL. This causes BOINC to lose its connection with the GPU, so all your GPU work will error out.

A fix for this will be in the next BOINC (6.12): when the connection to the GPU is lost for whatever reason, all subsequent work for that GPU will pause until BOINC has restarted and knows where the GPU is again.

Quote:
I also wonder: is there a way to recover a "failed to init" slot and force BOINC to re-run the task after I reboot the system to reset the GPU and its supporting systems? I just don't want failing tasks.


Most errors cannot be recovered from. But recovery also isn't necessary, because of the redundancy built into BOINC: work is normally sent to two independent computers, both of which must return the same outcome. If one doesn't return a result, or its result is different, the project sends the task out to a third, a fourth, a fifth computer, until one returns the same outcome as any of the others did.
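
To make the redundancy idea concrete, here is a toy model of the scheme described above. It is only a sketch, not BOINC's actual validator: the function name `find_agreed_outcome`, the `quorum` parameter, and the use of `None` for a failed or missing result are all made up for illustration, and real projects compare results with their own project-specific rules rather than simple equality.

```python
from collections import Counter

def find_agreed_outcome(results, quorum=2):
    """Return the first outcome reported by at least `quorum` hosts.

    `results` lists outcomes in the order hosts reported them.
    None stands for a host that errored out or never reported
    (hypothetical convention for this sketch).
    Returns None if no outcome has reached the quorum yet, which in
    this model means the task must go out to one more computer.
    """
    counts = Counter()
    for outcome in results:
        if outcome is None:
            continue  # a failed task contributes nothing to agreement
        counts[outcome] += 1
        if counts[outcome] >= quorum:
            return outcome
    return None

# One host failed (like a task hit by the user switch), two others agree.
print(find_agreed_outcome(["result-A", None, "result-A"]))
# Two hosts disagree: no agreement yet, so another copy would be sent.
print(find_agreed_outcome(["result-A", "result-B"]))
```

In this model a failed task simply doesn't count, so the work unit still completes once two other hosts agree, which is why re-running a failed task on your own machine isn't needed.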

mikey
Joined: 22 Jan 05
Posts: 12041
Credit: 1834323219
RAC: 37243

Quote:

I have a WinXP PC with 2 processors and 1 GPU, which is shared via user switching (an XP feature). When I boot the PC and log in under my profile, everything works nicely with all the applications. But some time after a profile switch occurs, the CUDA application (the 609 one, BRPcuda32) crashes, complaining "can't allocate XXX bytes of GPU memory". That task then waits 5 minutes, after which CUDA tasks start failing rapidly, within 5 s each, with "Unable to initialize CUDA something" and a general error of 1020. The very same scenario plays out if the other profile is loaded first after a reboot: as long as there's no profile switching, everything works OK, and the same crash sequence starts as soon as we switch places at the monitor. This is not only an Einstein@Home problem; SETI@Home sometimes sends a CUDA WU that behaves the same way, so there must be something wrong with CUDA.

I also wonder: is there a way to recover a "failed to init" slot and force BOINC to re-run the task after I reboot the system to reset the GPU and its supporting systems? I just don't want failing tasks.

Windows does not support switching users AND continuing to crunch on a GPU. It has to do with the drivers in the switched-to user session: they are generic, so the GPU tasks all fail. The only way to make this work is to suspend crunching when you switch users and not resume until you come back, or to get each person their own PC. It is a Windows thing, not a BOINC thing. The newer beta versions of BOINC MAY have this fixed (they are working around the problem), but I am not sure.

Vesper
Joined: 3 Mar 11
Posts: 3
Credit: 9418291
RAC: 0

Oh, thanks a lot. We don't need 2 PCs, and we sometimes consider leaving the one in question to someone else, but no way will we do that :) There's too much that a person can't access without one. Good thing that there will be a workaround.
