Pending Credit

BobMALCS
Joined: 13 Aug 10
Posts: 20
Credit: 54539336
RAC: 0
Topic 196283

A number of workunits I have successfully completed appear stuck as 'unsent' or just not being processed. There appears to be a user called 'Anonymous' who is performing no work at all, and the workunits are just timing out.

A typical workunit is 'Workunit 120166466'

Has anybody else seen this? Can it be fixed?

BobM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110718670660
RAC: 32161442

Pending Credit

Quote:
A number of workunits I have successfully completed appear stuck as 'unsent' or just not being processed.


'Unsent' doesn't mean "not being processed", as you describe it. It's actually an indication that the system recognises that your previous wingman has failed to return a satisfactorily completed task (for whatever reason) and that a new copy of the task has been prepared, ready to send to a new wingman in the not-too-distant future. It can take a little time for the new copy to be sent out (at which point 'unsent' will change to show the new details) but rest assured, it will happen shortly.
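
For anyone curious about the mechanics, here is a minimal sketch in Python of the lifecycle just described. It is purely illustrative (the class, state names and numbers are mine, not actual BOINC server code): a wingman's deadline miss triggers a fresh copy of the task, which sits as 'unsent' until a suitable host asks for work.

```python
# Illustrative sketch only: not actual BOINC server code.
from dataclasses import dataclass, field

@dataclass
class Workunit:
    wu_id: int
    quorum: int = 2                            # successful returns needed
    tasks: list = field(default_factory=list)

    def issue_copy(self):
        """Create a new task copy; it shows as 'unsent' until a host requests it."""
        task = {"state": "unsent", "host": None}
        self.tasks.append(task)
        return task

    def on_deadline_miss(self, task):
        """The wingman failed to report in time: mark it and queue a replacement."""
        task["state"] = "timed out"
        return self.issue_copy()               # this is the 'unsent' entry you see

    def is_complete(self):
        """True once enough copies have been successfully returned."""
        return sum(t["state"] == "completed" for t in self.tasks) >= self.quorum

# Your result is in, the wingman times out, and a new 'unsent' copy appears.
wu = Workunit(120166466)
mine = wu.issue_copy(); mine.update(state="completed", host="BobMALCS")
theirs = wu.issue_copy(); theirs.update(host="Anonymous")
replacement = wu.on_deadline_miss(theirs)
print(replacement["state"], wu.is_complete())  # -> unsent False (still one short)
```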

Quote:
There appears to be a user called 'Anonymous' who is performing no work at all, and the workunits are just timing out.


There are many users so tagged :-). They happen to be all of those who have decided (for whatever reason) to hide their computers on this site. I can assure you that lots of them are quite productive :-).

Quote:
A typical workunit is 'Workunit 120166466'


It's quite common for deadline misses to occur - as has happened here. The system is in the process of handling this common situation and there is nothing to be concerned about - yet :-).

Quote:
Has anybody else seen this? Can it be fixed?


If it ain't broke ... :-).

Cheers,
Gary.

Campion
Joined: 6 Mar 05
Posts: 2
Credit: 21598393
RAC: 11590

http://einstein.phys.uwm.edu/

http://einsteinathome.org/workunit/119275964

http://einsteinathome.org/workunit/119275969

http://einsteinathome.org/workunit/119275974

All of the above units timed out on April 10th and have not been resent.

After 10 days these units have yet to be resent to a 2nd wingman.

All of the above units had been sent to Anonymous users who were also newbies with no previous credits to their accounts.

http://einsteinathome.org/host/5020958

http://einsteinathome.org/host/5020956

http://einsteinathome.org/host/5020957

This unit timed out today and within 2 hours was resent:

http://einsteinathome.org/workunit/120528010

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7087754931
RAC: 1315932

RE: This unit timed out

Quote:
This unit timed out today and within 2 hours was resent:


One factor that probably modulates resend time delays is the locality scheduling which Einstein employs to reduce network traffic. When this works nicely, many GW WUs are sliced out of one set of biggish files transferred from the servers to a particular host.

I imagine resends go much more quickly if a suitable host which already has the requisite files aboard requests additional work of the right type soon after the resend window opens up. Contrariwise, they may be considerably delayed if no host that already has the files on board requests more work.

Generally this locality thing is good - for users who pay by the byte, it gives very substantial savings in download network traffic charges, lowers outgoing load from the servers, and probably even saves a few grams of carbon emissions somewhere.

Here is a writeup on locality scheduling by Oliver Bock of the Einstein staff, which may have more detail than you like, but might give more of an impression of how this works and how it can create delays (and also how it means your set of quorum partners at any given moment is not at all a random sample of the entire Einstein community).

Of course, it is also possible that something has gone wrong here, but I'd not bet on it.
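
To make the idea a little more concrete, here is a toy sketch in Python of the kind of matching involved. The function and file names are made up for illustration; this is not Einstein@Home's actual scheduler code.

```python
# Toy model of locality scheduling: prefer to hand a GW resend to a host that
# already holds the large data files it depends on. Illustrative only.

def pick_host_for_resend(resend_files, requesting_hosts):
    """Return the first requesting host that already has all the files, else None.

    resend_files:     set of data-file names the resend task needs
    requesting_hosts: iterable of (host_id, files_on_host) pairs currently
                      asking for work of the right type
    """
    for host_id, files_on_host in requesting_hosts:
        if resend_files <= files_on_host:      # host already has everything
            return host_id
    return None                                # no suitable host yet: the resend waits

# If no suitably equipped host asks for work soon after the resend is created,
# it can sit as 'unsent' for days, which is the sort of delay seen in this thread.
resend = {"h1_0412.30.dat", "l1_0412.30.dat"}  # hypothetical file names
hosts = [(101, {"h1_0987.65.dat"}),
         (102, {"h1_0412.30.dat", "l1_0412.30.dat"})]
print(pick_host_for_resend(resend, hosts))     # -> 102
```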

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

RE: After 10 days these

Quote:

After 10 days these units have yet to be resent to a 2nd wingman.

All of the above units had been sent to Anonymous users who were also newbies with no previous credits to their accounts.

There is something else at work here. There is a bug in BOINC which sometimes assigns a new ID to an existing host. (I've seen lots of users with more than 20 or 30 hosts that are all exactly the same, all but one of them inactive. The inactive ones have a couple of WUs assigned on the same day as the host's "creation", and their last contact is also the same as the creation date.)

When this happens, all the WUs sent to the previous ID are discarded on that host and it ends up requesting new tasks.
But the discarded WUs stay in the scheduler waiting for results, and the user can't do anything to abort them because they are no longer listed in their client, so the wingmen have to wait until those tasks hit the deadline and get resent.

When you add locality scheduling on top of this, you will understand why there are so many pending WUs...

This has happened before, but it has become more noticeable now because there are new apps for which there are no old pendings adding to the daily credits. After a couple of months you'll start to get the credit for older WUs, which will compensate for the credit of the new pending ones, and your daily credit and RAC will become steady again (and may even rise).
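
For what it's worth, the pattern is easy to spot with a simple heuristic. The sketch below is only illustrative; the field names are my own assumptions, not the real BOINC database schema.

```python
# Illustrative heuristic for the symptom described above: many identical host
# records under one account, all but one inactive, each stale record created
# and last contacted on the same day with only a couple of tasks ever sent.
from collections import defaultdict

def likely_duplicate_hosts(host_records):
    """Group records by a hardware signature and flag suspicious groups."""
    groups = defaultdict(list)
    for h in host_records:
        sig = (h["cpu"], h["os"], h["ram_mb"])   # looks like the same machine?
        groups[sig].append(h)

    suspicious = []
    for sig, records in groups.items():
        stale = [r for r in records
                 if r["last_contact"] == r["created"] and r["tasks_sent"] <= 2]
        if len(records) > 1 and len(stale) == len(records) - 1:
            suspicious.append((sig, [r["host_id"] for r in stale]))
    return suspicious

# Tasks sent to the stale IDs are orphaned: the client no longer lists them,
# so nobody can abort them and they only clear once the deadline passes.
hosts = [
    {"host_id": 1, "cpu": "i7", "os": "Linux", "ram_mb": 8192,
     "created": "2012-04-01", "last_contact": "2012-04-20", "tasks_sent": 50},
    {"host_id": 2, "cpu": "i7", "os": "Linux", "ram_mb": 8192,
     "created": "2012-04-05", "last_contact": "2012-04-05", "tasks_sent": 2},
]
print(likely_duplicate_hosts(hosts))             # -> [(('i7', 'Linux', 8192), [2])]
```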

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5851
Credit: 110718670660
RAC: 32161442

RE: RE: After 10 days

Quote:
Quote:

After 10 days these units have yet to be resent to a 2nd wingman.

All of the above units had been sent to Anonymous users who were also newbies with no previous credits to their accounts.

There is something else at work here. There is a bug in BOINC which sometimes assigns a new ID to an existing host. (I've seen lots of users with more than 20 or 30 hosts that are all exactly the same, all but one of them inactive....


Nothing like this has happened in this case. It was not caused by a BOINC bug with hostIDs. It's most likely simply due to the idiosyncrasies of the locality scheduling used for the current LV run. (Please note that the 4th example quoted by Campion, the one issued "within 2 hours", was an FGRP task and so was not subject to locality scheduling.)

It was just coincidence that these three LV tasks had three separate 1st wingmen, all of whom failed to return those tasks. The three hosts in question all had quite different physical hardware so it couldn't have been an existing host being issued with new IDs.

In each of these particular three cases, there was quite a delay between the issue of the first and second tasks - several days in fact. That in itself (whilst not common) is not all that unusual - I've seen it happen before. Once the three second tasks were issued (over a 6 second interval to 3 consecutive new hostIDs - 5020956, 5020957 and 5020958) all three lingered around for a further 14 days unreturned so that third tasks were needed in each case.

It's not unusual for tasks never to be returned. It happens in a surprisingly large number of cases so seeing three in a row like this doesn't necessarily mean that something is wrong. The unusual thing is that it took around a further 10 days after each deadline miss for a suitable host to be found and the third task to be issued. In fact, a single host eventually took all three resends and has since returned them all so the saga is now at an end.

As others have mentioned, a delay in issuing tasks can be caused by a shortage of hosts who already have the necessary group of large data files on board. The scheduler seems to prefer to wait until a suitably endowed host requests work. Perhaps it is waiting too long before deciding to issue new data files to a requesting host.

A natural way to get rid of resends without the extra data download penalty is to issue them to new hosts which need a data download anyway. In the past I've seen many newly added hosts get resend tasks as their very first work issue.
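
A sketch of that trade-off, again purely illustrative (the function, its parameters and the 240-hour cut-off are my own assumptions, not the project's actual rules):

```python
# Illustrative sketch of the scheduler's choice for a locality-bound resend:
# wait for a host that already holds the data files, hand it to a brand-new
# host that must download data anyway, or eventually stop waiting and accept
# the extra download. Not the actual Einstein@Home scheduler.

def should_get_resend(resend_files, host_files, host_is_new, hours_waiting,
                      max_wait_hours=240):
    if resend_files <= host_files:
        return True                          # locality match: no extra download
    if host_is_new and not host_files:
        return True                          # new host has to download data anyway
    return hours_waiting > max_wait_hours    # give up waiting after ~10 days

# A brand-new, empty host gets the resend straight away:
print(should_get_resend({"h1_0412.30.dat"}, set(), host_is_new=True, hours_waiting=1))
```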

Cheers,
Gary.

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

Did I say that this was

Did I say that this was caused by the BOINC bug?

What I said (or at least what I meant) is that there are a lot of "deadlined" WUs from new hosts coming from that bug. And I did not say that the delay in the resend is due to that bug (or any other).

The point was that there is nothing wrong with assigning resends (or any other work) to new hosts. Having several tasks pending because they were sent to new hosts is just more probable, since a lot of new users may choose to stop crunching or because, sometimes, they are not really new hosts. The new hosts that succeed are there too, but they get noticed less...

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2788118807
RAC: 722928

RE: ... Once the three

Quote:
... Once the three second tasks were issued (over a 6 second interval to 3 consecutive new hostIDs - 5020956, 5020957 and 5020958) all three lingered around for a further 14 days unreturned so that third tasks were needed in each case.


With different hardware in each case, but all with exactly the same operating system: Linux 3.0.0-15-virtual

Could this be characteristic of a cluster or hardware burn-in test starting up? Especially since each host fetched precisely one task for precisely one CPU on multicore hardware.

Whilst I know the Atlas and similar clusters contribute a great deal to Einstein's overall scientific research, maybe their interaction with the locality scheduler needs a second look if this is the way they interact with 'regular' users - and in this case, they didn't do any useful science, either.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4275
Credit: 245501227
RAC: 11435

RE: Could this be

Quote:

Could this be characteristic of a cluster or hardware burn-in test starting up? Especially since each host fetched precisely one task for precisely one CPU on multicore hardware.

Whilst I know the Atlas and similar clusters contribute a great deal to Einstein's overall scientific research, maybe their interaction with the locality scheduler needs a second look if this is the way they interact with 'regular' users - and in this case, they didn't do any useful science, either.

Well, running multiple clients on a multicore host under the same account, using only one core each, requires the "allow_multiple_clients" feature of BOINC (Client & Server) to work correctly, which it didn't until I debugged and fixed it about a week ago. I think a client that has this working reliably isn't even out yet. These tasks may result from failed experiments by us or by someone else.

Actually the reason for us to get this working was to reduce the number of tasks "trashed" / "abandoned" etc. by such clusters. Now it should be possible to complete a task that started on one node on another cluster node when the first one doesn't become idle again in time.

Anyway, I think the problem here lies deep down in the assumptions underlying the locality scheduling implementation. Basically it expects clients not to have any files at the beginning. Now that we have started S6LV1, which uses the same files as the previous run, far fewer clients need to be given a somewhat random set of "initial" files than what the system was originally tuned for.

There is no easy and fast way to fix this in the current implementation, though. A re-design of the locality scheduling is underway anyway; we will need to take this into account then.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2788118807
RAC: 722928

RE: RE: Could this be

Quote:
Quote:

Could this be characteristic of a cluster or hardware burn-in test starting up? Especially since each host fetched precisely one task for precisely one CPU on multicore hardware.

Whilst I know the Atlas and similar clusters contribute a great deal to Einstein's overall scientific research, maybe their interaction with the locality scheduler needs a second look if this is the way they interact with 'regular' users - and in this case, they didn't do any useful science, either.

Well, running multiple clients on a multicore host under the same account, using only one core each, requires the "allow_multiple_clients" feature of BOINC (Client & Server) to work correctly, which it didn't until I debugged and fixed it about a week ago. I think a client that has this working reliably isn't even out yet. These tasks may result from failed experiments by us or by someone else.

Actually the reason for us to get this working was to reduce the number of tasks "trashed" / "abandoned" etc. by such clusters. Now it should be possible to complete a task that started on one node on another cluster node when the first one doesn't become idle again in time.

Anyway, I think the problem here lies deep down in the assumptions underlying the locality scheduling implementation. Basically it expects clients not to have any files at the beginning. Now that we have started S6LV1, which uses the same files as the previous run, far fewer clients need to be given a somewhat random set of "initial" files than what the system was originally tuned for.

There is no easy and fast way to fix this in the current implementation, though. A re-design of the locality scheduling is underway anyway; we will need to take this into account then.

BM


OK, I'll leave it in your capable hands, then.

From what I read, the current clients handle 'allow_multiple_clients' OK, but there's an outstanding bug in the Manager that makes it hard to control them. I don't imagine the clusters have very much use for BOINC Manager, though! One workaround which has been found to work is to use a third-party application like BoincTasks for any management that might be necessary - I believe you're in private contact with somebody who can advise on that.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

All I can say is that I

All I can say is that I currently have 846 tasks that are "Completed, waiting for validation" with the oldest uploaded on Mar 16th (6 weeks ago).

As far as I can tell this is normal, and they eventually get completed, one way or the other.

I've had computers go offline and tasks time out, so I can't complain about others.

Joe
