O3ASE Questions - Issues - Advice

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5854
Credit: 111313045495
RAC: 34902165

earthbilly wrote:
I seem to remember several days ago when I selected this task the first time, the task timer or clock showed this task estimated time to complete at 4 minutes each.

This is a well-known and heavily discussed side effect of Einstein's continuing use of a single duration correction factor (DCF), combined with inaccurate crunch time estimates (in completely opposite directions) for the two main GPU apps - GRP and GW.

I'm sure cecht is fully aware of this and understands that it's not the cause of what he is describing.

If you want to understand why your initial GW tasks had a very low estimate, and why you received a whole bunch more than expected, just do a search for DCF (or duration correction factor).  There are probably plenty of hits for 2021 alone; try filtering on 2021 to see if you find a suitable explanation.  If not, you'll find one for certain in 2020.
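
To make the mechanism concrete, here's a rough illustration (all the numbers are invented for the example, not measured): the client multiplies the server's runtime estimate by the project's single DCF.  Suppose a GW task really takes about 40 minutes, but its uncorrected server estimate was already optimistic at, say, 20 minutes.  If fast-finishing GRP tasks have dragged the shared DCF down to around 0.2, the client then shows the GW task as 20 x 0.2 = 4 minutes - a tenth of its true runtime - and fetches something like ten times the GW work it should.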

Cheers,
Gary.

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

Gary Roberts wrote:

Here is a link to a comment that Richard Haselgrove posted about problems with <max_concurrent> a little while ago.  It was the first reference I found with a search just now and I think there were other comments from him with even more details.

Hopefully he'll see his name being used in vain and respond accordingly with more information :-).  Knowing Richard, he has probably continued to pursue this relentlessly :-).

Dang, I need to sharpen my search skills. Thanks for the tip. I have pulled <max_concurrent> from app_config and am waiting to see if magic happens. Both hosts have been using <max_concurrent>, like, forever, but it is only recently on the one host that Bloated Task Queue Syndrome popped up.  I'll report back tomorrow with results.

EDIT: Actually, I have pulled <project_max_concurrent>, because I had not been using <max_concurrent>. Well, I just checked, and I did have <max_concurrent> set for <name>einstein_O2MD1</name> and have pulled that as well.  That was a crumb left over from when I tried to run O2MD1 CPU tasks simultaneously with O3ASE GPU tasks; I have since un-ticked the O2MD1 app and all CPU tasks in my Project Preferences.
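
For the record, the two tags live at different levels of app_config.xml.  A minimal sketch (the limit values are just placeholders, not recommendations):

<app_config>
  <app>
    <name>einstein_O2MD1</name>
    <!-- per-app limit: at most this many einstein_O2MD1 tasks run at once -->
    <max_concurrent>1</max_concurrent>
  </app>
  <!-- project-wide limit: at most this many Einstein tasks run at once, across all apps -->
  <project_max_concurrent>4</project_max_concurrent>
</app_config>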

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

earthbilly wrote:

Cecht, do you remember what the original batch you Q'd said per task? And is it getting closer? Sounds like it is not. Guess I could try again. Now I am interested. Just in case, I'm only going to allow one computer to get them - one I know can run x2 per GPU. Tomorrow. Then I'll report.

My task times haven't much changed since this oddness began; just the usual fluctuation with different task sets. I don't recall exactly when the problem first arose (maybe 1-2 weeks ago), or what I might have done differently at the time (*sigh* old age is hell).  I do know that when it happened my queue quickly rose to 1002 tasks and has stayed at 1001-1002 ever since. I once ticked "No new tasks", which knocked it down a few hundred, naturally, but when I allowed new work again, it immediately shot back up to 1001-1002.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Tom M
Joined: 2 Feb 06
Posts: 5813
Credit: 7902361258
RAC: 6079269

cecht wrote:

I do know that when it happened my queue quickly rose to 1002 tasks and has stayed at 1001-1002 ever since. I once ticked "No new tasks", which knocked it down a few hundred, naturally, but when I allowed new work again, it immediately shot back up to 1001-1002.

Another possible "fix" for reducing your task count is to drop the task buffer to 0.1 or even 0.01 days (thank you, Gary).
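
For reference, those buffer values are the "Store at least N days of work" / "Store up to an additional N days" settings in the computing preferences.  If you'd rather edit files, the same two settings can go in global_prefs_override.xml in the BOINC data directory - a minimal sketch, values illustrative:

<global_preferences>
  <!-- keep at least this many days of work queued -->
  <work_buf_min_days>0.01</work_buf_min_days>
  <!-- extra days to fetch beyond the minimum -->
  <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>

The client picks it up after Options > Read local prefs file in the Manager, or a client restart.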

I am getting around 15-16 minutes per task on an RX 5700 under Windows.  I tried running 2 tasks at a time, but it slowed significantly, and the GPU loading apparently goes down if I try 2 or 3 GPU tasks.

So it crunches along at roughly 50% GPU loading on a single task.
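
For anyone who wants to repeat the 2x experiment, the usual knob is a <gpu_versions> block in app_config.xml.  A minimal sketch - the app name below is a placeholder, so check client_state.xml for the real one:

<app_config>
  <app>
    <name>einstein_O3AS</name>  <!-- placeholder; use the real app name -->
    <gpu_versions>
      <!-- 0.5 GPUs per task lets two tasks share one GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Whether 2x actually helps depends on the app and the card - here it clearly didn't.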

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

Tom M wrote:

Another possible "fix" for reducing your task # is to drop the task buffer to 0.1 or even 0.01 (Thank you, Gary).

Yes, I dropped the task buffer from 0.05 to 0.01 when the issue first appeared, but to no effect.

HOWEVER... since my last post, the task queue has dropped from its consistent 1001-1002 into the lower 900s. So scrubbing all occurrences of max_concurrent from app_config did something! Soon after that edit, the Event Log reported:

Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)

I'll know in a couple of days whether it hits a sane task-buffer equilibrium, but I'm guessing it will.

YIPPEE! Thank you all for your tips and suggestions.

Now back to our regularly scheduled O3ASE discussion....

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2801775948
RAC: 869719

cecht wrote:

Now back to our regularly scheduled O3ASE discussion....

Since there doesn't seem to be a pressing demand for O3ASE discussion at the moment, and I felt my ears burning, may I add a postscript to the conversation about over-long caches?

I had a major episode of this late last year, and wrote it up in detail in GitHub issue 4117. The problem appears to be that the BOINC client uses its internal 'Round Robin Simulation' to calculate how much work is currently cached, and hence find out whether (and if so, how much) additional work is needed.

It has been difficult to get the 'max_concurrent' feature of app_config.xml to work consistently under all conditions. The first attempt simply blocked all requests for work for any application with a max_concurrent in place; the second re-enabled work fetch but left some of the previous code in place, and the two clash.

The problem is that [rr_sim] is also used when deciding what task to run next, and in this mode, it's obviously important that max_concurrent is taken into account. So, if you look at the rr_simulation debug log for a machine with this problem, you'll see things like

02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 1950.17: wu_sf7_DS-15x10_Grp583685of1250000_0 finishes (1.00 CPU) (8404.10G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 2320.50: wu_sf7_DS-15x10_Grp583689of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 4270.67: wu_sf7_DS-15x10_Grp583694of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 4641.00: wu_sf7_DS-15x10_Grp583682of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics

All those max_concurrent lines refer to tasks which can't be run at this stage, but which should still be counted in the total cache load on the machine.
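
(If you want to check your own machine: those [rr_sim] messages only appear when the rr_simulation debug flag is enabled in cc_config.xml in the BOINC data directory - a minimal sketch:

<cc_config>
  <log_flags>
    <!-- log the round-robin simulation the scheduler runs -->
    <rr_simulation>1</rr_simulation>
    <!-- optional: log work-fetch decisions as well -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>

then use the Manager's "Read config files" command, or restart the client, to pick it up.)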

I've given all the data I can to David Anderson, but he hasn't followed it through: he says he can't reproduce the problem on his own machine, although I don't think he's tried very hard. I think he's given up on this part of the client, and he's now stopped talking to me completely.

So, if anyone else feels like taking up the cudgels, be my guest - I think all the clues you'll need are in the GitHub issue.

mikey
Joined: 22 Jan 05
Posts: 12106
Credit: 1834708619
RAC: 44129

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

GWGeorge007
Joined: 8 Jan 18
Posts: 2869
Credit: 4756911364
RAC: 3433345

mikey wrote:

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

I believe if you still have the 'Run Test Applications?' box checked, you will get the O3AS tasks, regardless of whether or not you have the actual O3AS box checked.  I do know that there are no O3AS tasks in the applications window.

I don't have the box checked for O3AS tasks, though I did before.  I stopped getting O3AS tasks when I unchecked that box.

As for whether or not O3AS will become a GPU-only task, this I do not know.

George

Proud member of the Old Farts Association

mikey
Joined: 22 Jan 05
Posts: 12106
Credit: 1834708619
RAC: 44129

George wrote:

mikey wrote:

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

I believe if you still have the 'Run Test Applications?' box checked, you will get the O3AS tasks, regardless of whether or not you have the actual O3AS box checked.

Yes, that's how I'm getting the GPU tasks now.

Quote:
I do know that there are no O3AS tasks in the applications window.


There are 5039 O3AS tasks available right now


Quote:

I don't have the box checked for O3AS tasks, though I did before.  I stopped getting O3AS tasks when I unchecked that box.

As for whether or not O3AS will become a GPU-only task, this I do not know.

There were 768 Tflops of O3AS crunching going on last week. What I'm wondering, though, is whether that includes CPU tasks or just GPU tasks, because if CPU tasks are in there too I will start crunching them, as I have a lot more CPU cores than GPUs available to me.

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1140229967
RAC: 0

I set up 3 identical dual-GPU hosts with only O3AS tasks selected to accept, and limited my work buffer fields to 0.1 days plus 0.0 extra days between communication. At first all 3 hosts downloaded a perfect number of tasks and I began crunching 2x per GPU. BTW, it seems these tasks only use GPUs and not CPUs, despite the label. Everything was going well, so I selected 'No new tasks' to finish off the 50-60 tasks in each host's window and went away for a nap. When I returned, one host had consumed all its tasks and finished, but the other 2 hosts somehow each had hundreds of tasks queued. Where in the heavens could they have come from? The transfer page was empty when I went away, and 'No new tasks' is still selected in the menu. I had O3AS tasks still selected in preferences, but I was not expecting more with 'No new tasks' selected and 'Won't get new tasks' showing.

Work runs fine on Bosons reacted into Fermions,

Sunny regards,

earthbilly
