O3ASE Questions - Issues - Advice

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5854
Credit: 111313045495
RAC: 34902165

earthbilly wrote:
I seem to remember several days ago when I selected this task the first time, the task timer or clock showed this task estimated time to complete at 4 minutes each.

This is a well-known and heavily discussed side effect of Einstein's continuing use of a single duration correction factor (DCF), combined with inaccurate crunch time estimates (in completely opposite directions) for the two main GPU apps - GRP and GW.

I'm sure cecht is fully aware of this and understands that it's not the cause of what he is describing.

If you want to understand why your initial GW tasks had a very low estimate, and why you received a whole bunch more than expected, just do a search for DCF (or duration correction factor).  There are probably plenty of hits for 2021 alone; try filtering on 2021 to see if you find a suitable explanation.  If not, you'll find one for certain in 2020.
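
To make the mechanism concrete, here's a rough illustration (all the numbers are invented for the example, not measured): the client multiplies the server's runtime estimate by the project's single DCF.  Suppose a GW task really takes about 40 minutes, but its uncorrected server estimate was already optimistic at, say, 20 minutes.  If fast-finishing GRP tasks have dragged the shared DCF down to around 0.2, the client then shows the GW task as 20 x 0.2 = 4 minutes - a tenth of its true runtime - and fetches something like ten times the GW work it should.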

Cheers,
Gary.

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

Gary Roberts wrote:

Here is a link to a comment that Richard Haselgrove posted about problems with <max_concurrent> a little while ago.  It was the first reference I found with a search just now and I think there were other comments from him with even more details.

Hopefully he'll see his name being used in vain and respond accordingly with more information :-).  Knowing Richard, he has probably continued to pursue this relentlessly :-).

Dang, I need to sharpen my search skills. Thanks for the tip. I have pulled <max_concurrent> from app_config and am waiting to see if magic happens. Both hosts have been using <max_concurrent>, like, forever, but it is only recently on the one host that Bloated Task Queue Syndrome popped up.  I'll report back tomorrow with results.

EDIT: Actually, I have pulled <project_max_concurrent>, because I had not been using <max_concurrent>. Well, I just checked, and I did have <max_concurrent> set for <name>einstein_O2MD1</name> and have pulled that as well.  That was a crumb left over from when I tried to run O2MD1 CPU tasks simultaneously with O3ASE GPU tasks; I have since un-ticked the O2MD1 app and all CPU tasks in my Project Preferences.
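
For the record, the two tags live at different levels of app_config.xml.  A minimal sketch (the limit values are just placeholders, not recommendations):

<app_config>
  <app>
    <name>einstein_O2MD1</name>
    <!-- per-app limit: at most this many einstein_O2MD1 tasks run at once -->
    <max_concurrent>1</max_concurrent>
  </app>
  <!-- project-wide limit: at most this many Einstein tasks run at once, across all apps -->
  <project_max_concurrent>4</project_max_concurrent>
</app_config>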

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

earthbilly wrote:

Cecht, do you remember what the original batch you Q'd said per task? And is it getting closer? Sounds like it is not. Guess I could try again. Now I am interested. Just in case, I'm only going to allow one computer to get them - one I know can run x2 per GPU. Tomorrow. Then I'll report.

My task times haven't much changed since this oddness began; just the usual fluctuation with different task sets. I don't recall exactly when the problem first arose (maybe 1-2 weeks ago), or what I might have done differently at the time (*sigh* old age is hell).  I do know that when it happened my queue quickly rose to 1002 tasks and has stayed at 1001-1002 ever since. I once ticked "No new tasks", which knocked it down a few hundred, naturally, but when I allowed new work again, it immediately shot back up to 1001-1002.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Tom M
Joined: 2 Feb 06
Posts: 5813
Credit: 7902361258
RAC: 6079269

cecht wrote:

I do know that when it happened my queue quickly rose to 1002 tasks and has stayed at 1001-1002 ever since. I once ticked "No new tasks", which knocked it down a few hundred, naturally, but when I allowed new work again, it immediately shot back up to 1001-1002.

Another possible "fix" for reducing your task count is to drop the task buffer to 0.1 or even 0.01 days (thank you, Gary).
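
For reference, those buffer values are the "Store at least N days of work" / "Store up to an additional N days" settings in the computing preferences.  If you'd rather edit files, the same two settings can go in global_prefs_override.xml in the BOINC data directory - a minimal sketch, values illustrative:

<global_preferences>
  <!-- keep at least this many days of work queued -->
  <work_buf_min_days>0.01</work_buf_min_days>
  <!-- extra days to fetch beyond the minimum -->
  <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>

The client picks it up after Options > Read local prefs file in the Manager, or a client restart.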

I am getting around 15-16 minutes per task on an RX 5700 under Windows.  I tried running 2 tasks at a time, but it slowed significantly, and the GPU loading apparently goes down if I try 2 or 3 GPU tasks.

So it crunches along at roughly 50% GPU loading on a single task.
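
For anyone who wants to repeat the 2x experiment, the usual knob is a <gpu_versions> block in app_config.xml.  A minimal sketch - the app name below is a placeholder, so check client_state.xml for the real one:

<app_config>
  <app>
    <name>einstein_O3AS</name>  <!-- placeholder; use the real app name -->
    <gpu_versions>
      <!-- 0.5 GPUs per task lets two tasks share one GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Whether 2x actually helps depends on the app and the card - here it clearly didn't.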

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

cecht
Joined: 7 Mar 18
Posts: 1453
Credit: 2535381336
RAC: 1946298

Tom M wrote:

Another possible "fix" for reducing your task # is to drop the task buffer to 0.1 or even 0.01 (Thank you, Gary).

Yes, I dropped the task buffer from 0.05 to 0.01 when the issue first appeared, but to no effect.

HOWEVER... since my last post, the task queue has dropped from its consistent 1001-1002 into the lower 900s. So scrubbing all occurrences of max_concurrent from app_config did something! Soon after that edit, the Event Log reported:

Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)

I'll know in a couple of days whether it hits a sane task-buffer equilibrium, but I'm guessing it will.

YIPPEE! Thank you all for your tips and suggestions.

Now back to our regularly scheduled O3ASE discussion....

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2801775948
RAC: 869719

cecht wrote:

Now back to our regularly scheduled O3ASE discussion....

Since there doesn't seem to be a pressing demand for O3ASE discussion at the moment, and I felt my ears burning, may I add a postscript to the conversation about over-long caches?

I had a major episode of this late last year, and wrote it up in detail in GitHub issue 4117. The problem appears to be that the BOINC client uses its internal 'Round Robin Simulation' to calculate how much work is currently cached, and hence find out whether (and if so, how much) additional work is needed.

It has been difficult to get the 'max_concurrent' feature of app_config.xml to work consistently under all conditions. The first attempt simply blocked all requests for work for any application with a max_concurrent in place; the second re-enabled work fetch but left some of the previous code in place, and the two clash.

The problem is that [rr_sim] is also used when deciding what task to run next, and in this mode, it's obviously important that max_concurrent is taken into account. So, if you look at the rr_simulation debug log for a machine with this problem, you'll see things like

02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 1950.17: wu_sf7_DS-15x10_Grp583685of1250000_0 finishes (1.00 CPU) (8404.10G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 2320.50: wu_sf7_DS-15x10_Grp583689of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 4270.67: wu_sf7_DS-15x10_Grp583694of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] 4641.00: wu_sf7_DS-15x10_Grp583682of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:51:41 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics

All those max_concurrent lines refer to tasks which can't be run at this stage, but which should still be counted in the total cache load on the machine.
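
(If you want to check your own machine: those [rr_sim] messages only appear when the rr_simulation debug flag is enabled in cc_config.xml in the BOINC data directory - a minimal sketch:

<cc_config>
  <log_flags>
    <!-- log the round-robin simulation the scheduler runs -->
    <rr_simulation>1</rr_simulation>
    <!-- optional: log work-fetch decisions as well -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>

then use the Manager's "Read config files" command, or restart the client, to pick it up.)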

I've given all the data I can to David Anderson, but he hasn't followed it through: he says he can't reproduce the problem on his own machine, although I don't think he's tried very hard. I think he's given up on this part of the client, and he's now stopped talking to me completely.

So, if anyone else feels like taking up the cudgels, be my guest - I think all the clues you'll need are in the GitHub issue.

mikey
Joined: 22 Jan 05
Posts: 12106
Credit: 1834708619
RAC: 44129

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

GWGeorge007
Joined: 8 Jan 18
Posts: 2869
Credit: 4756911364
RAC: 3433345

mikey wrote:

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

I believe if you still have the 'Run Test Applications?' box checked, you will get the O3AS tasks, regardless of whether or not you have the actual O3AS box checked.  I do know that there are no O3AS tasks in the applications window.

I don't have the box checked for O3AS tasks, though I did before.  I stopped getting O3AS tasks when I unchecked that box.

As for whether or not O3AS will become a GPU-only task, this I do not know.

George

Proud member of the Old Farts Association

mikey
Joined: 22 Jan 05
Posts: 12106
Credit: 1834708619
RAC: 44129

George wrote:

mikey wrote:

Okay, I may be a bit late here, but I just noticed that there is now a checkbox for the O3AS engineering tasks under Preferences > Project. It doesn't have the usual 'GPU' at the end saying they are for the GPU; does that mean they are CPU tasks? I ask because I am still crunching the O3AS GPU tasks and getting new ones as I return completed ones, but I don't have the O3AS box checked.

I believe if you still have the 'Run Test Applications?' box checked, you will get the O3AS tasks, regardless of whether or not you have the actual O3AS box checked.

Yes, that's how I'm getting the GPU tasks now.

Quote:
I do know that there are no O3AS tasks in the applications window.


There are 5039 O3AS tasks available right now


Quote:

I don't have the box checked for O3AS tasks, though I did before.  I stopped getting O3AS tasks when I unchecked that box.

As for whether or not O3AS will become a GPU-only task, this I do not know.

There were 768 Tflops of O3AS crunching going on last week. What I'm wondering, though, is whether that includes CPU tasks or just GPU tasks, because if CPU tasks are in there too I will start crunching them, as I have a lot more CPU cores than GPUs available to me.

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1140229967
RAC: 0

I set up 3 identical dual-GPU hosts with only O3AS tasks selected to accept, and limited my work buffer fields to 0.1 days plus 0.0 extra days between communication. At first all 3 hosts downloaded a perfect number of tasks and I began crunching 2x per GPU. BTW, it seems these tasks only use GPUs and not CPUs, despite the label. Everything was going well, so I selected 'No new tasks' to finish off the 50-60 tasks in each host's window and went away for a nap. When I returned, one host had consumed all its tasks and finished, but the other 2 hosts somehow each had hundreds of tasks queued. Where in the heavens could they have come from? The transfer page was empty when I went away, and 'No new tasks' is still selected in the menu. I had O3AS tasks still selected in preferences, but I was not expecting more with 'No new tasks' selected and 'Won't get new tasks' showing.

Work runs fine on Bosons reacted into Fermions,

Sunny regards,

earthbilly
