4 Second GPU crash for some, but not all WU's GTX 1060

Tom McFarland
Joined: 5 Aug 09
Posts: 3
Credit: 55014839
RAC: 345142
Topic 226937

I'm getting the following error on a number of GPU WUs. A bunch have also completed successfully, but I had a run of crashes yesterday that made me suspend GPU work. I'm new to Linux, so I was wondering if I was missing some special library or something. Any suggestions appreciated. Computer is:

CPU type: AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor [Family 21 Model 2 Stepping 0]

Coprocessors: NVIDIA NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 470.86

Operating system: Linux Ubuntu Ubuntu 21.10 [5.13.0-28-generic|libc 2.34 (Ubuntu GLIBC 2.34-0ubuntu3)]

BOINC client version: 7.16.17

Memory: 31910.08 MiB


<core_client_version>7.16.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
putenv 'LAL_DEBUG_LEVEL=3'
2022-02-07 13:24:35.5589 (65335) [normal]: This program is published under the GNU General Public License, version 2
2022-02-07 13:24:35.5590 (65335) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2022-02-07 13:24:35.5590 (65335) [normal]: This Einstein@home App was built at: Aug 5 2021 17:20:50

2022-02-07 13:24:35.5590 (65335) [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/einstein_O3AS_1.01_x86_64-pc-linux-gnu__GW-opencl-nvidia'.
[DEBUG} GPU type: 1
[ERROR] Couldn't get OpenCL device from BOINC (-1)!
2022-02-07 13:24:35.5882 (65335) [debug]: Flags: LAL_DEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, X64, SSE, SSE2, GNUC X86 GNUX86
2022-02-07 13:24:35.5882 (65335) [debug]: glibc version/release: 2.34/stable
2022-02-07 13:24:35.588268 - mytime()
2022-02-07 13:24:35.5884 (65335) [debug]: Set up communication with graphics process.

einstein_O3AS_1.01_x86_64-pc-linux-gnu__GW-opencl-nvidia: unrecognized option `--device'

Usage: einstein_O3AS_1.01_x86_64-pc-linux-gnu__GW-opencl-nvidia [-h|--help] [-v|--version] [@<config-file>] [--log] [--semiCohToplist] [--DataFiles1] [--IFOs] [--skyRegion] [--numSkyPartitions] [--partitionIndex] [--skyGridFile] [--dAlpha] [--dDelta] [-f|--Freq] [--dFreq] [-b|--FreqBand] [--f1dot] [--df1dot] [--f1dotBand] [--f2dot] [--df2dot] [--f2dotBand] [--f3dot] [--df3dot] [--f3dotBand] [--peakThrF] [-m|--mismatch1] [--gridType1] [--metricType1] [-g|--gammaRefine] [-G|--gamma2Refine] [-o|--fnameout] [--fnameChkPoint] [-n|--nCand1] [--printCand1] [--refTime] [--ephemEarth] [--ephemSun] [--minStartTime1] [--maxStartTime1] [--printFstat1] [--assumeSqrtSX] [--nStacksMax] [-T|--tStack] [--segmentList] [--recalcToplistStats] [--loudestSegOutput] [--writeLeanerOutput] [--tlCompartments] [--computeBSGL] [--Fstar0sc] [--oLGX] [--getMaxFperSeg] [--SortToplist] [--FstatMethod] [--FstatMethodRecalc] [--injectionSources] [--injectSqrtSX] [--timestampsFiles] [--Tsft] [--useGPUSemiCoh] [--GPUDevice]

2022-02-07 13:24:35.5891 (65335) [CRITICAL]: ERROR: MAIN() returned with error '1'

DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
Code-version: %% LAL: 6.21.0.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALPulsar: 1.18.2.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)
%% LALApps: 6.25.1.1 (CLEAN 8d0838c264f9ff9adc8c3cdbfa17b5154eaa2994)

FPU status flags:
2022-02-07 13:24:35.5894 (65335) [debug]: worker done. return(1) to caller
2022-02-07 13:24:35.5894 (65335) [normal]: done. calling boinc_finish(1).
13:24:35 (65335): called boinc_finish


Keith Myers
Joined: 11 Feb 11
Posts: 4788
Credit: 17849463910
RAC: 3575493

[ERROR] Couldn't get OpenCL device from BOINC (-1)!

einstein_O3AS_1.01_x86_64-pc-linux-gnu__GW-opencl-nvidia: unrecognized option `--device'

Looks like you lost the OpenCL compute portion of the drivers.

A quick check with clinfo will confirm whether they have gone missing. If clinfo is not installed yet, install it with: sudo apt install clinfo

Either reload the Nvidia drivers or reinstall the OpenCL portion with: sudo apt install ocl-icd-libopencl1
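
For anyone who would rather script the clinfo check, here is a minimal Python sketch. It assumes clinfo prints a line starting with "Number of platforms" followed by a count; the function names are made up for illustration and are not part of BOINC or clinfo.

```python
import re
import shutil
import subprocess
from typing import Optional


def count_opencl_platforms(clinfo_output: str) -> Optional[int]:
    """Return the platform count found in clinfo text output, or None if absent."""
    # Assumption: the count follows the label "Number of platforms",
    # optionally separated by ':' or '='.
    match = re.search(
        r"number of platforms\s*[:=]?\s*(\d+)", clinfo_output, re.IGNORECASE
    )
    return int(match.group(1)) if match else None


def run_clinfo() -> Optional[str]:
    """Run clinfo if it is on PATH and return its stdout; None if not installed."""
    if shutil.which("clinfo") is None:
        return None
    result = subprocess.run(["clinfo"], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    output = run_clinfo()
    if output is None:
        print("clinfo not found; install it with: sudo apt install clinfo")
    else:
        count = count_opencl_platforms(output)
        if count is None:
            print("Could not find a platform count in the clinfo output.")
        elif count == 0:
            print("No OpenCL platforms found; the driver's OpenCL part looks missing.")
        else:
            print(f"{count} OpenCL platform(s) found.")
```

A count of 0 matches the "Couldn't get OpenCL device from BOINC" error above; any non-zero count means the problem lies elsewhere.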

 

Olivier Chassé St-Laurent
Joined: 31 Dec 20
Posts: 1
Credit: 75304608
RAC: 43209

It is probably due to a recent (unattended) upgrade to the Nvidia drivers (you can check the dpkg or apt log to be sure); a simple reboot should fix it.
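
A rough Python sketch of that log check, assuming the usual dpkg log layout of "date time action package:arch old-version new-version" and the default location /var/log/dpkg.log. The function name is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

# Actions in dpkg.log that change which package version is on disk.
CHANGE_ACTIONS = ("install", "upgrade", "remove", "purge")


def nvidia_package_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (timestamp, action, package) for NVIDIA-related changes in dpkg log text."""
    changes = []
    for line in log_text.splitlines():
        fields = line.split()
        # Assumed layout: date time action package:arch old-version new-version
        if len(fields) < 4:
            continue
        date, time, action, package = fields[:4]
        if action in CHANGE_ACTIONS and "nvidia" in package.lower():
            changes.append((f"{date} {time}", action, package))
    return changes


if __name__ == "__main__":
    log_path = Path("/var/log/dpkg.log")  # assumed default location on Ubuntu
    try:
        text = log_path.read_text(errors="replace")
    except OSError as exc:
        print(f"Could not read {log_path}: {exc}")
    else:
        for timestamp, action, package in nvidia_package_changes(text):
            print(timestamp, action, package)
```

If an upgrade entry shows up shortly before the first failed task, the running kernel module and the newly installed user-space libraries are probably out of step, which is why a reboot would help.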

Tom McFarland
Joined: 5 Aug 09
Posts: 3
Credit: 55014839
RAC: 345142

Thanks! And yeah, I saw that in the error and figured it had something to do with it, I just wasn't sure how. Unfortunately, you were correct about clinfo: Number of Platforms = 0. Also unfortunately, no joy with sudo apt install ocl-icd-libopencl1. It said it was "already the newest version (2.2.14-2)". I'll have to search around the Nvidia drivers to see what's available.
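
One possible explanation for zero platforms with ocl-icd-libopencl1 already installed (a guess, not confirmed in this thread): that package is only the ICD loader, and the loader finds vendor drivers through .icd files, by default in /etc/OpenCL/vendors. If the Nvidia entry is missing there, the loader has nothing to load. A small Python sketch to list what is registered; the function name is hypothetical.

```python
from pathlib import Path
from typing import Dict


def registered_icds(vendors_dir: str = "/etc/OpenCL/vendors") -> Dict[str, str]:
    """Map each .icd file name in vendors_dir to the library name it contains."""
    directory = Path(vendors_dir)
    if not directory.is_dir():
        # No vendors directory at all means no OpenCL drivers are registered.
        return {}
    return {
        icd.name: icd.read_text(errors="replace").strip()
        for icd in sorted(directory.glob("*.icd"))
    }


if __name__ == "__main__":
    icds = registered_icds()
    if not icds:
        print("No .icd files found; no vendor OpenCL driver is registered.")
    for name, library in icds.items():
        print(f"{name} -> {library}")
```

An empty result would point at the driver packages rather than the loader, so reinstalling the Nvidia driver would be the next thing to try.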

Keith Myers
Joined: 11 Feb 11
Posts: 4788
Credit: 17849463910
RAC: 3575493

When the drivers go ka-bloooey for unknown reasons, it is often best and fastest to just do a purge of Nvidia drivers and reinstall.

 
