Task stalled with Linux 4.49 test app

rhb
Joined: 15 Aug 06
Posts: 6
Credit: 1287768
RAC: 0
Topic 193760

Task http://einsteinathome.org/task/101714739 froze up with 2 1/2 hours used, showed 100% complete, and was not using CPU time. I couldn't shake it loose with suspend-resume. When I rebooted and restarted it, it restarted ok and shows 7 hours to completion. I will wait for it to finish and report the status. I have no clue what might have caused it to freeze.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110644962386
RAC: 33780379

Task stalled with Linux 4.49 test app

Quote:
Task 101714739 froze up with 2 1/2 hours used, showed 100% complete, and was not using CPU time.

I saw a similar thing on one of my machines recently.

Quote:
I couldn't shake it loose with suspend-resume.

In my case it was sufficient to stop and restart BOINC.

Quote:
When I rebooted and restarted it, it restarted ok and shows 7 hours to completion. I will wait for it to finish and report the status.

Mine finished and the result was accepted successfully by the servers. It happened quite recently and a quick scan of the results list shows no invalid results. I assume it was validated correctly but I didn't specifically check. It was so recent that I'm fairly sure I would still see it if there were a problem with it.

Quote:
I have no clue what might have caused it to freeze.

My machine was moderately overclocked but has performed without other issue for quite a while now. There have been a couple of occasions where the room got a little on the warm side so I'm assuming the lockup was heat related. The machine is a dual core and the other core continued crunching normally. It would appear that the core that froze is a little more sensitive to heat than the other.

Cheers,
Gary.

KSMarksPsych
Moderator
Joined: 15 Oct 05
Posts: 2702
Credit: 4090227
RAC: 0

RE: Task

Quote:
Task http://einsteinathome.org/task/101714739 froze up with 2 1/2 hours used, showed 100% complete, and was not using CPU time. I couldn't shake it loose with suspend-resume. When I rebooted and restarted it, it restarted ok and shows 7 hours to completion. I will wait for it to finish and report the status. I have no clue what might have caused it to freeze.

I've seen similar reports over at Rosetta and have had the same experience running Rosetta on my Linux box. I've not seen it at E@H.

More rarely, you'll see similar behavior reported on Windows boxes (I think I've seen it most at Seti).

Kathryn :o)

Einstein@Home Moderator

rhb
Joined: 15 Aug 06
Posts: 6
Credit: 1287768
RAC: 0

Thanks for your comments.

Thanks for your comments. The task died due to an error:

389, 390, 391, 392, *** glibc detected *** corrupted double-linked list: 0x0bfe9218 ***

Why it then froze, I'm not sure -- it probably tried to provide debug info and wasn't set up correctly. In any case, it was prepared to be restarted, and ran to successful completion. Good fumble recovery!

There are two issues here: debugging the underlying problem of the corrupted list (see the stderr file for what info is available), and if possible, producing an immediate exit and restart instead of hanging -- but hanging up may be preferable to aborting.

I will attempt to set up a better debug environment if someone advises me what to do. As for the "stuck task" problem, that could happen and not be noticed, since it recovers when BOINC is restarted. That causes wasted CPU cycles, and might encourage a reduction of resource share, but in my opinion I should still run Einstein, even if it occurred quite often.
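To make a stuck task easier to notice, here is a rough sketch of the check I have in mind: a task counts as stalled if its CPU time has stopped growing over a long enough stretch of wall-clock time. The function name `is_stalled`, the sample format, and both thresholds are my own assumptions for illustration, not anything BOINC provides, and how the samples are collected is left open.

```python
def is_stalled(samples, min_wall_seconds=600.0, min_cpu_gain=1.0):
    """Decide whether a task looks stalled.

    samples: chronological list of (wall_seconds, cpu_seconds) readings
             for one task (hypothetical format, collected however you like).
    Returns True if CPU time grew by less than min_cpu_gain over the most
    recent stretch of at least min_wall_seconds of wall-clock time.
    """
    if len(samples) < 2:
        return False  # nothing to compare against yet

    latest_wall, latest_cpu = samples[-1]

    # Walk backwards to the most recent reading that is old enough.
    for wall, cpu in reversed(samples[:-1]):
        if latest_wall - wall >= min_wall_seconds:
            return (latest_cpu - cpu) < min_cpu_gain

    return False  # not enough history to judge
```

Note that a deliberately suspended task would also show no CPU growth, so a check like this only makes sense for tasks that are supposed to be running.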

If anyone has had this happen more than once, I think it would be interesting to know how often, and how many of those tasks restart and reach a successful result. Also, a look at the stderr files might show other bugs that can be fixed.
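For anyone who wants to check their own stderr output, a small script along these lines could pick out the glibc reports. This is only a sketch: the pattern is modeled on the single line quoted above, so other glibc messages may be worded differently, and the names `GLIBC_RE` and `find_glibc_errors` are made up for this example.

```python
import re

# Pattern modeled on the one stderr line quoted in this thread (an assumption;
# other glibc reports may not match this exact layout).
GLIBC_RE = re.compile(
    r"\*\*\* glibc detected \*\*\* (?P<message>.+?): "
    r"(?P<address>0x[0-9a-fA-F]+) \*\*\*"
)


def find_glibc_errors(stderr_text):
    """Return a list of (line_number, message, address) for each glibc report."""
    hits = []
    for lineno, line in enumerate(stderr_text.splitlines(), start=1):
        match = GLIBC_RE.search(line)
        if match:
            hits.append((lineno, match.group("message"), match.group("address")))
    return hits
```

Running it over the saved stderr text of each stalled task would give a quick count of how many stalls come with a heap-corruption report.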
