Progress stops

PappyGus
Joined: 8 Jun 19
Posts: 6
Credit: 25527594
RAC: 0
Topic 219587

I have been running BOINC on my current computer since June. Computer info: Intel i7-7700K (no overclocking), 16 GB RAM, GTX 1080 Founders Ed., ASUS Z270i motherboard. I am running BOINC Manager 7.14.2 (x64) with 2 projects: E@H and SETI@Home.

A few days ago, I started seeing a problem with progress in E@H, but not SETI@Home. E@H progress would get to some random value anywhere from 10% to 90% and not go any further. The tasks would show Running, but the progress never increased. The latest stoppage was 6 (CPU only) tasks all at 89.252%, 1 (CPU only) task at 10.365%, and 1 CPU/GPU task at 34.589%.

Every time I noticed this (which has been 5 or 6 times in the last 4 days), I tried each of the following several times: pausing BOINC and then resuming, pausing the individual tasks for a short while and then resuming them, rebooting the computer, and closing and restarting BOINC. Nothing got the tasks to start progressing. I even gave them a whole day to see if they would restart, but nothing worked. So, each time it happened, after I gave up trying to get them to restart, I aborted the problem tasks. In between these progress stoppages, E@H worked just fine; online stats showed tasks were getting done.

Any ideas on this?

Thanks,

Pappy

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5853
Credit: 111258169825
RAC: 34858759

Your computers are hidden so it's impossible to see exactly what searches you are running and any task information that might have been returned to the servers.  For example there are two different CPU only searches and two different GPU searches.  Which particular ones are you running?  If you're running all searches, which particular one(s) do the tasks that 'stall' come from?

If the behaviour you describe only started a 'few days ago' did you make any changes or perform any updates at about that time?  It's possible that the change in behaviour is correlated with some other change that occurred then.

Your CPU is a quad core with HT.  Using all 8 threads, as you do, puts quite a high load on the machine.  If you allowed BOINC to use, say, 75% of the threads (5 CPU tasks, 1 GPU task) as a test, you could see if the problem was in some way related to overloading.  You might find that tasks run a bit faster that way.

Cheers,
Gary.

PappyGus
Joined: 8 Jun 19
Posts: 6
Credit: 25527594
RAC: 0

Gary-> OK. I did not realize I had the privacy setting on for my computer. Fixed that. Computer is Sloop343. 

I have not made any changes to the computer. I did wonder if maybe my system was having real issues, so I downloaded and ran the Intel Processor Diagnostic Tool (64 bit), and it passed all tests. I also tested the system using OCCT version 5.3.5 and no errors showed up. So that eliminates the computer itself as a problem.

Whenever I had the BOINC Manager running, I had the CAM (from NZXT) application running, which shows my CPU and GPU loading, temperature and clock speeds. I have a liquid cooling loop on the CPU, so it doesn't get above around 65C and the GPU sits at around 75C.

I had the computing preferences set up for 100% CPUs and 50% time usage. I have not seen another progress stoppage since earlier today around noon.  I have aborted all the affected tasks. I did use your recommendation of dropping to 75% CPUs as of this evening. I do find it interesting that SETI@Home has not had the same problem.

Interestingly, with 75% CPU usage, my GPU average load has dropped to ~85% and temperature is at 72C.

I do have a question regarding CPU and GPU searches. I don't understand what you mean by searches. Is that the Project Applications selection under Project Preferences? If yes, I have all of them checked. At the top of the Project Preferences page, I have USE CPU and USE NVIDIA GPU as YES, and the AMD/INTEL GPU as NO.

Thanks for your help on this!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5853
Credit: 111258169825
RAC: 34858759

PappyGus wrote:
... I don't understand what you mean by searches. Is that the Project Applications selection under Project Preferences? If yes, I have all of them checked. At the top of the Project Preferences page, I have USE CPU and USE NVIDIA GPU as YES, and the AMD/INTEL GPU as NO.

Einstein@Home uses different applications to search for different things.  So, yes, each 'search' uses a particular application to perform the search.  As an example, there is a CPU search for pulsars that emit gamma rays using data from the Large Area Telescope (LAT) on the Fermi satellite. There is also a search for continuous gravitational wave (GW) emissions from massive rotating objects like neutron stars, some of which may be spinning at millisecond rates.  That search uses data from the LIGO observatories.

There are also two searches that use GPUs to look for similar things.  The GW GPU search is via a test app at the moment so you shouldn't be getting any of those unless you have changed the setting to allow test apps.  Now that you've 'unhidden' your computers (thank you) I can see that you aren't getting test tasks and that's a good thing because they are having validation problems at the moment.  Also, they have other issues when run simultaneously with the gamma ray pulsar GPU tasks.

Virtually all the tasks you aborted were for the FGRP5 CPU search (22 tasks).  There were no GPU tasks aborted and only 2 for the O2AS20-500 GW search.  I looked at all your tasks for the FGRP5 search and here is a link to what shows up.  Over time this will change as old tasks get removed and new tasks get added to the online database.

The first 5 tasks in that list show a progressive deterioration in the crunch time - from 18k secs to 31k secs.  That is not normal and tends to suggest that something else is going on.  Since both CPU time and run time are slowing down, it looks like CPU frequency is reducing enormously.  You would normally expect throttling like that to be due to overheating but you say your machine is water cooled - so it seems a bit of a mystery.  Perhaps you should give your machine a good check for virus/malware in case it's something like that?

FGRP5 tasks have a pretty standard run time.  The 18k secs figure for your machine seems reasonable.   Does your workload from Seti change very much?  There's got to be something that has changed that's causing the slowdown.  There is no sign that the tasks you aborted were in any sort of trouble.  You can click on the task ID link of any task to see what was returned to the project up to the point the task was aborted.  I looked at a few and there seems to be quite a lot of evidence for tasks being stopped and restarted from saved checkpoints.

You will likely lose quite a bit of progress every time that happens but all you should see when you restart is that the time and progress values should go back to what they were at the time of the saved checkpoint.  I saw one that was aborted when it was in the followup stage - probably not many minutes before it would have completed and been returned.  There is no progress beyond ~90% in the followup stage until the final jump to 100%.  Don't ever abort a task that seems stuck at ~90% without allowing 30-60 mins for it to finish normally.

Cheers,
Gary.

PappyGus
Joined: 8 Jun 19
Posts: 6
Credit: 25527594
RAC: 0

Gary,

I've been running at 75% CPU usage and 50% CPU time with GPU usage turned on. I have noticed that some of the FGRP5 tasks are taking much, much longer than others. Right now I have 3 of them that have completed about 10 hours of computation time and all have over 1 day of expected time left, while some have completed within the 'normal' time frames you mentioned. So I have no idea what is going on with my computer. I'm leaning towards some kind of problem with the CPU or memory, but none of the testing I have done finds a single issue.

I'm going to leave all these problem tasks alone to do what they will. Whatever the issue, if it's with the CPU or some other portion of the system, it should begin to get worse at some point, which is when I will decide what to do about it. (My computer happens to be under warranty, yay!) In the meantime, since no other programs I run (including games) have had any problems, I'm just going to let it ride.

I did look at the last month's worth of SETI@Home completed tasks, and the recent ones are consistent with those completed prior to the issues starting with E@H.

What do you think of me starting another Project, such as Milkyway@Home, to see if it has the same issues as E@H?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5853
Credit: 111258169825
RAC: 34858759

PappyGus wrote:
... What do you think of me starting another Project, such as Milkyway@Home, to see if it has the same issues as E@H?

I think you'd just end up with a more confusing situation than you already have.  Different projects have apps that use your resources differently.  Some make heavy use, some make lighter use.  If one project will run without issue, that doesn't really help you explain why another app has problems.

Let's look at some preference settings first - although I can't really see how this would cause what you see.  For computing preferences, you've already mentioned your 50% time setting.  There's a particular reason I dislike that setting.  I run CPUs 24/7/365.  After starting, they reach a stable operating temperature and stay there with no short term thermal cycling.  By using the 50% time setting you are essentially turning your CPU from idle to full on and back again every couple of seconds.  I worry that the expansion/contraction effects may have long term consequences for CPUs and motherboards.  I think a better mode of operation is to leave that setting at 100% and control the total heat generated by reducing the % of cores - or you could turn HT off and go back to the 4 'real' cores and run 3 CPU tasks plus 1 GPU task.  That will give a lower operating temperature without the internal fluctuations.

I have machines still running fine (Q6600 quad core CPUs) that I built in 2008.  The CPU temps are around 80-90C.  I've never had a CPU failure.  If your CPU temps never exceed 65C, you have no issue with heat.  I don't know what sort of damage might eventuate from very short term thermal cycling though.

Can you also advise of other processor usage settings that start, "Suspend when .... "?  If crunching is suspended on a regular basis, it's pretty important that the setting for keeping tasks in memory when suspended is set to yes.  That won't actually change the reported values for crunch times but it could change your perception of how long it's taking for a task to make headway if the values keep resetting to those of a former checkpoint.  If you keep tasks in memory, you can eliminate the wasted progress that gets lost every time a task has to restart from a checkpoint.

In the tasks list link that I gave you last time, there is a new FGRP5 completed task (returned 15 Sep 10:18:22 UTC) with an elapsed time of 25,733 secs.  There is also a new aborted task that was returned about an hour later.  Here is the task ID data that was returned for that aborted task. If you look down towards the bottom before the Windows debugger stuff, you can see that 7 out of 79 sky points had been completed and the 8th had just been started.  The line that says "% checkpoint read: skypoint 7 binarypoint 0" tells us that the app was restarting by reading in from a saved checkpoint.  Two lines later it tells us "% Sky point 8/79" which signifies that computing for the 8th sky point had just started when the task was aborted.

For comparison, here is the task ID data that was returned for the successful task.  This is exactly what crunching should look like - 79 completed loops (each with 56 'dots' - the nf1dots parameter was 56) and after each one, the new checkpoint was written.  These 79 loops are the main calculations and are followed by the followup stage where the top 10 candidate signals are assessed.  No sign of a single stoppage or restart from a checkpoint anywhere in the whole run.  The followup stage is the bit that happens when the progress seems to pause just below the ~90% mark.

So, the real description of your problem is that a normal task can proceed through all stages in a certain amount of time with no signs of stopping/restarting or any other sorts of hiccups.  On the other hand you have what should be very similar tasks that get around 10% of the way through in roughly the same total time with lots of evidence of stops/restarts and extremely slow rates of progress.  Something on your machine is likely doing this.  I would think it's very unlikely to be the hardware itself.  A hardware issue should affect all tasks.

I have no idea why there is such a difference between your 'good' and 'bad' results.  It's not the data - the same data file LATeah0060F.dat was being crunched in both those tasks.  It has to be something else.  If you can't work out what it is, I would suggest not running the FGRP5 CPU tasks (you've only got a couple left) and start working on the much larger number of O2AS GW tasks that you already have.  They seem to run more successfully.

Maybe someone else might have some ideas.  I've done lots of FGRP5 tasks over the years and never seen this sort of behaviour before.

Cheers,
Gary.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

Gary Roberts wrote:
Let's look at some preference settings first - although I can't really see how this would cause what you see.  For computing preferences, you've already mentioned your 50% time setting.  There's a particular reason I dislike that setting.  I run CPUs 24/7/365.  After starting, they reach a stable operating temperature and stay there with no short term thermal cycling.  By using the 50% time setting you are essentially turning your CPU from idle to full on and back again every couple of seconds.  I worry that the expansion/contraction effects may have long term consequences for CPUs and motherboards.  I think a better mode of operation is to leave that setting at 100% and control the total heat generated by reducing the % of cores - or you could turn HT off and go back to the 4 'real' cores and run 3 CPU tasks plus 1 GPU task.  That will give a lower operating temperature without the internal fluctuations.

I agree with Gary on this point. Thermal instability can cause problems, especially on devices which intermittently draw large amounts of power. This is why the power supply is the most likely point of failure in any electronic device.

I'm somewhat baffled by this particular issue, but there is one setting in Windows which might help. Bring up Control Panel (the old app under Windows in the Start menu) and click on the System icon. From there, select Advanced system settings, then click the Advanced tab. This brings up a menu with four choices. Under the Performance section, click the Settings box, then select the Advanced tab. This allows access to processor scheduling and virtual memory. Processor scheduling is normally set to Programs. You may change this setting to Background services. In a multi core Windows box, this allows BOINC tasks to run more efficiently at the expense of programs in the foreground. On a fast machine, you will probably not even notice. It is an easy setting to change and is benign to the system as a whole.

The other thing you may want to look into is BIOS settings. Many newer computers have hardware thermal regulation settings as well as dynamic processor settings which interact separately from any Throttling software installed at the OS level. There are other BIOS settings which can have an effect on the way a system runs. Descriptions of each BIOS setting can usually be found on the OEM website. A word of caution; changing the wrong settings can get one into deep water very quickly. If you make any changes, do so one at a time, making note of the original settings.

Clear skies,
Matt

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

PappyGus wrote:
I had the computing preferences set up for 100% CPUs and 50% time usage. I have not seen another progress stoppage since earlier today around noon.  I have aborted all the affected tasks. I did use your recommendation of dropping to 75% CPUs as of this evening. I do find it interesting that SETI@Home has not had the same problem.

Assuming you still have hyper threading enabled, I would reverse these settings: 100% time usage and 50% of CPUs. If the issue resolves itself, you may gradually increase the percentage of CPUs used until you start having errors. Ideally, I like to run one task per physical core, allowing at least 2 additional threads for GPU use. In my case, I have a DL360 server with 2 processors, six physical cores each. I have run it successfully between 50% and 65% of CPU utilization. I do know that when I first started the box, it was running at 100% and every task ran with errors, including the GPU. Since I was planning to reduce the utilization I didn't give it much thought, but reading your comment sheds new light on the issue.

I'm currently running at 60%, giving me 12 concurrent CPU tasks, and 2 concurrent GPU tasks. Unless I want to add the other power supply, 65% is about as much as I can push that box.

As Gary has mentioned, different applications will have different characteristics as to how they will perform. Some are more CPU intensive than others, the result being, higher power consumption and higher operating temperature. Since you seem to have a good setup for CPU cooling, I doubt temperature is an issue, but power consumption might be.

Clear skies,
Matt

PappyGus
Joined: 8 Jun 19
Posts: 6
Credit: 25527594
RAC: 0

Gary Roberts wrote:
PappyGus wrote:
... What do you think of me starting another Project, such as Milkyway@Home, to see if it has the same issues as E@H?

I think you'd just end up with a more confusing situation than you already have... 

By using the 50% time setting you are essentially turning your CPU from idle to full on and back again every couple of seconds.  I worry that the expansion/contraction effects may have long term consequences for CPUs and motherboards... 

Can you also advise of other processor usage settings that start, "Suspend when .... "?  If crunching is suspended on a regular basis, it's pretty important that the setting for keeping tasks in memory when suspended is set to yes...

This all makes sense, so I’m staying with just E@H and SETI running on equal shares. I have been running 100% time and 75% CPUs for the last two days, and I have checked the box for “Leave non-GPU tasks in memory while suspended.” I also have it set to “Run always” unless I'm gaming or watching videos.

Computing Options I have set up: Suspend when non-BOINC CPU usage is above 35%, and that's it really.

Since I made those changes 2 days ago, E@H seems to have been running fine.  I haven't seen any E@H tasks take longer than they used to. On the other hand, I have a SETI task that is currently at 0.594%, has been running for just over 22 minutes, and says it has over 2.5 days left, and that is going up fast… wow, ok... It just did something weird… that SETI task stopped (waiting to run), and no other task started. Then it restarted after about a minute with elapsed time starting from 0, and the remaining time is going up fast again (over 10 hours already). Also, I just realized that even though I'm at 75% CPUs, there are 7 tasks running instead of six (4 cores, HT on). That SETI task is the only GPU task running, and looks like the latest task to start based on run times.

Here’s that task info:

Application:  SETI@home v8 8.22 (opencl_nvidia_SoG)
Name:  10se08ab.23327.8252.5.32.139
State:  Running
Received:  9/16/2019 12:09:56 PM
Report deadline:  10/6/2019 11:19:16 PM
Resources:  0.43 CPUs + 1 NVIDIA GPU
Estimated computation size:  70,727 GFLOPs
CPU time:  00:00:09
CPU time since checkpoint:  00:00:09
Elapsed time:  00:08:12
Estimated time remaining:  22:51:53
Fraction done:  0.595%
Virtual memory size:  128.79 MB
Working set size:  101.81 MB
Directory:  slots/0
Process ID:  4516
Executable:  setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5853
Credit: 111258169825
RAC: 34858759

PappyGus wrote:
... I just realized that even though I'm at 75% CPUs, there are 7 tasks running instead of six (4 cores, HT on). That SETI task is the only GPU task running, and looks like the latest task to start based on run times.

Please realise that the % of cores BOINC is allowed to use refers to CPU tasks only and not GPU tasks.  If your "7 tasks running" consisted of 6 CPU tasks plus 1 GPU task, that is what is allowed for a setting of 75%.
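The arithmetic behind that can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the rounding rule (truncate, but never go below 1) is an assumption about how the percentage is turned into a whole number of CPU threads.

```python
def max_cpu_tasks(logical_cpus, percent_of_cpus):
    """Rough count of CPU threads BOINC may use for CPU tasks.

    Assumption: the product is truncated to a whole number and at
    least one thread is always allowed. GPU tasks are not counted here.
    """
    usable = int(logical_cpus * percent_of_cpus / 100)
    return max(1, usable)


threads = 8  # i7-7700K: 4 cores with HT on
for pct in (50, 75, 100):
    print(pct, "% ->", max_cpu_tasks(threads, pct), "CPU tasks")
```

With 8 threads and a 75% setting this gives 6 CPU tasks, so 6 CPU tasks plus 1 GPU task makes the 7 running tasks that were observed.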

For the case of a Seti task stopping and then restarting after about a minute, that could be due to a restriction like suspend when non-BOINC usage is above 35%, for example.  In other words, BOINC saw a limit being exceeded and so paused a task until whatever it was cleared.  The fact that everything went back to zero could have been due to no checkpoint being available to restart from.  You have to crunch for a while until the first checkpoint is written.  I don't run Seti and have no knowledge of how their tasks work.
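The checkpoint effect described above can be shown with a toy model in Python. It is not how BOINC or the Seti app actually schedule checkpoints: the function name, the fixed checkpoint interval and the 30 minute figure are all invented for the example.

```python
def resume_point(elapsed_s, checkpoint_interval_s, kept_in_memory):
    """Elapsed time a task shows after being suspended and resumed.

    Toy model with a hypothetical fixed checkpoint interval:
    - kept in memory: the task carries on from where it stopped;
    - not kept in memory: it falls back to the last checkpoint written,
      or to zero if no checkpoint exists yet.
    """
    if kept_in_memory:
        return elapsed_s
    checkpoints_written = int(elapsed_s // checkpoint_interval_s)
    return checkpoints_written * checkpoint_interval_s


# A task suspended after 22 minutes, with a made-up 30 minute interval.
print(resume_point(22 * 60, 30 * 60, kept_in_memory=False))  # no checkpoint yet
print(resume_point(22 * 60, 30 * 60, kept_in_memory=True))
```

In this model a task that is suspended before its first checkpoint and then removed from memory starts again from zero, which matches what was seen with the Seti task.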

 

Cheers,
Gary.

PappyGus
Joined: 8 Jun 19
Posts: 6
Credit: 25527594
RAC: 0

Matt White wrote:
This allows access to processor scheduling and virtual memory. Processor scheduling is normally set to programs. You may change this setting to Background services.

I went ahead and changed that setting. Since then, including going to 100% CPU time and 75% CPU usage, E@H hasn't had any issues. SETI, on the other hand, is having issues (see my last post).
