Thread Status: Active
Total posts in this thread: 162
Posts: 162   Pages: 17   [ Previous Page | 4 5 6 7 8 9 10 11 12 13 | Next Page ]
This topic has been viewed 120163 times and has 161 replies
bozz4science
Advanced Cruncher
Germany
Joined: May 3, 2020
Post Count: 104
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

Got roughly 3 pages worth of beta tasks, all but one in PV. Only error was this WU: https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=550367778 (Win 10, 1660 Super/750Ti)

Do we already have insights into how OC of a GPU might affect the stability of the results?

Anyone tried so far running multiple GPU WUs concurrently on the same GPU? I was wondering whether you can increase WU output by keeping the GPU load consistently high instead of these short bursts up to 100% and then back to 0%.
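For anyone who wants to experiment with this: the BOINC client supports running several tasks per GPU via an app_config.xml in the project directory. A minimal sketch, assuming the GPU app is named "opng" (that name is an assumption; check client_state.xml for the actual app name):

```xml
<app_config>
  <app>
    <name>opng</name>  <!-- assumed app name; verify in client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>  <!-- 0.5 = two tasks share one GPU -->
      <cpu_usage>1.0</cpu_usage>  <!-- one CPU core reserved per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

After saving it, use Options > Read config files in the BOINC Manager (or restart the client) for it to take effect.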

Very impressive speedups, seeing runtimes between 2 and 6 min depending on the WU size on my GTX 1660 Super, and 6-12 min on my 750 Ti. That's a huge efficiency gain vs. CPU-computed WUs.

However, due to the inherent nature of these WUs, the GPUs' VRMs are getting kicked hard. They continuously have to adjust the voltage of the GPU chip up and down to follow the short, intensive bursts of computation. On the 1660 Super, the voltage was all over the place.

And I am definitely here for the science and to help fight the pandemic from home, but is a base credit of 2.6 really adequate for computing >100 CPU jobs in one run - albeit in much shorter time?
----------------------------------------

AMD Ryzen 3700X @ 4.0 GHz / GTX1660S
Intel i5-4278U CPU @ 2.60GHz
----------------------------------------
[Edit 1 times, last edit by bozz4science at Mar 2, 2021 9:12:25 AM]
[Mar 2, 2021 9:11:48 AM]
widdershins
Veteran Cruncher
Scotland
Joined: Apr 30, 2007
Post Count: 673
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

@uplinger a possible shortcut in your line of research might be to contact NVIDIA. Whilst some random member of the public may not get a reply, I would expect an enquiry from a tech at IBM working on a GPU compute project would get better service. wink

I'd imagine that if there is a problem with OpenCL on some of their older cards they'd already know about it, and even better, be able to point you to the possible cause with a lot less effort on your part. Or at least confirm that it will never work correctly on certain cards to save you any further work.
[Mar 2, 2021 10:04:15 AM]
Vester
Senior Cruncher
USA
Joined: Nov 18, 2004
Post Count: 323
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

Bozz4science said, "Anyone tried so far running multiple GPU WUs concurrently on the same GPU?"

Yes. I am running three per GPU on an AMD Radeon HD 7990 rig with 8 GPUs. No failures.
----------------------------------------

[Mar 2, 2021 10:45:01 AM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

@uplinger a possible shortcut in your line of research might be to contact NVIDIA. Whilst some random member of the public may not get a reply I would expect an enquiry from a tech at IBM working on a gpu compute project would get better service. wink

I'd imagine that if there is a problem with OpenCL on some of their older cards they'd already know about it, and even better, be able to point you to the possible cause with a lot less effort on your part. Or at least confirm that it will never work correctly on certain cards to save you any further work.

IIRC this was also a problem with the HCC GPU app. Certain older cards were not capable of running the app and therefore were put on an ignore list so to speak.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


[Mar 2, 2021 11:29:06 AM]
goben_2003
Advanced Cruncher
Joined: Jun 16, 2006
Post Count: 145
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

Edit 2: Yup, with driver 425.31 the GTX 660M shows up as OpenCL 1.2. I doubt the card is really OpenCL 1.2 compliant, though. The driver supports OpenCL 1.2, but the card may not, even though it is marketed as OpenCL 1.2 capable.

Strange, though, that driver 306.14, which is from 2012, does not show the card as OpenCL 1.2. But then, the OpenCL 1.2 specification was only announced on November 15, 2011, so perhaps the 306.14 driver simply predates OpenCL 1.2 support, and the card is actually only OpenCL 1.1.

So, I'll forget about the GTX660M when it comes to GPU crunching here.

@uplinger
Hopefully this is helpful.

From looking at the specs for the GTX 600-800 mobile series, those cards support OpenCL 1.1. The driver, however, covers everything through the RTX 20 mobile series, so it reports a higher OpenCL version than the card actually supports.

Notes:
This is probably true for the non-mobile GTX 600-800 series as well.
The spec pages I looked at for the GTX 900 through RTX 20 series do not list the OpenCL version.

Sources:
https://www.nvidia.com/en-us/geforce/gaming-l...-gtx-660m/specifications/
https://www.nvidia.com/en-us/geforce/gaming-l...-gtx-680m/specifications/
https://www.nvidia.com/en-us/geforce/gaming-l...-gtx-760m/specifications/
https://www.nvidia.com/en-us/geforce/gaming-l...-gtx-860m/specifications/
https://www.nvidia.com/en-us/drivers/results/145874/ 425.31 drivers
https://www.nvidia.com/en-us/geforce/graphics...gtx-660ti/specifications/
https://www.nvidia.com/en-us/geforce/gaming-l...-gtx-960m/specifications/
https://www.nvidia.com/en-sg/geforce/products/10series/geforce-gtx-1060/
https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2060/

Edit 1:
@Grumpy_Swede:
I am curious what the output of NVIDIA's OpenCL Device Query is for your GTX 660M - specifically, whether it mentions OpenCL 1.1 anywhere. The NVIDIA card I have right now supports OpenCL 1.2, so I cannot tell whether it reports 1.2 only because of the driver. Here is an example of some of the output from my card:
OpenCL SW Info:  
CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.2 CUDA 11.1.96
OpenCL SDK Revision: 7027912
--------------------------------- Device GeForce RTX 2060 ---------------------------------
CL_DEVICE_NAME: GeForce RTX 2060
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 456.71
CL_DEVICE_VERSION: OpenCL 1.2 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.2 CUDA 11.1.96, SDK Revision = 7027912, NumDevs = 1, Device = GeForce RTX 2060

The Windows / Linux / Mac Nvidia OpenCL Device Query can be found on this page.
For the Windows 64-bit version, it is in the zip at NVIDIA GPU Computing SDK\OpenCL\bin\win64\Release. If you unzip it and run oclDeviceQuery.exe, it will generate oclDeviceQuery.txt with the output in the same folder.
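The version strings in that output follow the pattern "OpenCL &lt;major&gt;.&lt;minor&gt; ...", so checking a card against a required minimum can be done with a small string parser. A hypothetical C sketch (`parse_cl_version` and `cl_version_at_least` are illustrative helpers, not part of the NVIDIA SDK):

```c
#include <assert.h>
#include <stdio.h>

/* Parse a string like "OpenCL 1.2 CUDA 11.1.96" into major/minor.
 * Returns 0 on success, -1 if the string does not match the pattern. */
static int parse_cl_version(const char *s, int *major, int *minor)
{
    if (sscanf(s, "OpenCL %d.%d", major, minor) == 2)
        return 0;
    return -1;
}

/* True if the reported version is at least min_major.min_minor. */
static int cl_version_at_least(const char *s, int min_major, int min_minor)
{
    int maj, min;
    if (parse_cl_version(s, &maj, &min) != 0)
        return 0;
    return (maj > min_major) || (maj == min_major && min >= min_minor);
}
```

Note this only tells you what the driver claims; as discussed above, a driver can report OpenCL 1.2 for a card that cannot actually run 1.2 kernels.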
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by goben_2003 at Mar 2, 2021 1:52:00 PM]
[Mar 2, 2021 1:37:27 PM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

Do we already have insights into how OC of a GPU might affect the stability of the results?

Yes. It won't help it.

I would never EVER overclock on a critical scientific project like this one. You just jeopardize the results for everyone.
[Mar 2, 2021 2:53:30 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: OpenPandemics GPU Beta Test - Feb 27 2021 [ Issues Thread ]

@uplinger a possible shortcut in your line of research might be to contact NVIDIA. Whilst some random member of the public may not get a reply I would expect an enquiry from a tech at IBM working on a gpu compute project would get better service. wink

I'd imagine that if there is a problem with OpenCL on some of their older cards they'd already know about it, and even better, be able to point you to the possible cause with a lot less effort on your part. Or at least confirm that it will never work correctly on certain cards to save you any further work.

IIRC this was also a problem with the HCC GPU app. Certain older cards were not capable of running the app and therefore were put on an ignore list so to speak.


Wow, your memory is better than mine. I went back into the history and found that we used to exclude these cards specifically.

if ( !strcmp(c.prop.name, "ION") ||
!strcmp(c.prop.name, "GeForce 210") ||
!strcmp(c.prop.name, "GeForce 310") ||
!strcmp(c.prop.name, "GeForce 310M") ||
!strcmp(c.prop.name, "GeForce 315") ||
!strcmp(c.prop.name, "GeForce 315M") ||
!strcmp(c.prop.name, "GeForce 405") ||
!strcmp(c.prop.name, "GeForce 410M") ||
!strcmp(c.prop.name, "GeForce 610M") ||
!strcmp(c.prop.name, "GeForce 8200") ||
!strcmp(c.prop.name, "GeForce 8400") ||
!strcmp(c.prop.name, "GeForce 8400GS") ||
!strcmp(c.prop.name, "GeForce 8400 GS") ||
!strcmp(c.prop.name, "GeForce 8500 GT") ||
!strcmp(c.prop.name, "GeForce 8600 GS") ||
!strcmp(c.prop.name, "GeForce 8600 GT") ||
!strcmp(c.prop.name, "GeForce 8600M GS") ||
!strcmp(c.prop.name, "GeForce 8600M GT") ||
!strcmp(c.prop.name, "GeForce 8600 GTS") ||
!strcmp(c.prop.name, "GeForce 8700M GT") ||
!strcmp(c.prop.name, "GeForce 8800 GT") ||
!strcmp(c.prop.name, "GeForce 8800 GTS 512") ||
!strcmp(c.prop.name, "GeForce 8800M GTS") ||
!strcmp(c.prop.name, "GeForce 9200") ||
!strcmp(c.prop.name, "GeForce 9300 GE") ||
!strcmp(c.prop.name, "GeForce 9300M GS") ||
!strcmp(c.prop.name, "GeForce 9400 GT") ||
!strcmp(c.prop.name, "GeForce 9500 GS") ||
!strcmp(c.prop.name, "GeForce 9500 GT") ||
!strcmp(c.prop.name, "GeForce 9600 GS") ||
!strcmp(c.prop.name, "GeForce 9600 GSO") ||
!strcmp(c.prop.name, "GeForce 9600 GSO 512") ||
!strcmp(c.prop.name, "GeForce 9600 GT") ||
!strcmp(c.prop.name, "GeForce 9600M GT") ||
!strcmp(c.prop.name, "GeForce 9800 GT") ||
!strcmp(c.prop.name, "GeForce 9800 GTX+") ||
!strcmp(c.prop.name, "GeForce 9800 GTX/9800 GTX+") ||
!strcmp(c.prop.name, "GeForce 9800 S") ||
!strcmp(c.prop.name, "GeForce G102M") ||
!strcmp(c.prop.name, "GeForce G210") ||
!strcmp(c.prop.name, "GeForce GT 120") ||
!strcmp(c.prop.name, "GeForce GT 130") ||
!strcmp(c.prop.name, "GeForce GT 130M") ||
!strcmp(c.prop.name, "GeForce GT 220M") ||
!strcmp(c.prop.name, "GeForce GT 230") ||
!strcmp(c.prop.name, "GeForce GT 230M") ||
!strcmp(c.prop.name, "GeForce GT 325M") ||
!strcmp(c.prop.name, "GeForce GT 330") ||
!strcmp(c.prop.name, "GeForce GT 330M") ||
!strcmp(c.prop.name, "GeForce GT 420") ||
!strcmp(c.prop.name, "GeForce GT 510") ||
!strcmp(c.prop.name, "GeForce GT 520") ||
!strcmp(c.prop.name, "GeForce GT 520M") ||
!strcmp(c.prop.name, "GeForce GT 610") ||
!strcmp(c.prop.name, "GeForce GT 630") ||
!strcmp(c.prop.name, "GeForce GTS 240") ||
!strcmp(c.prop.name, "GeForce GTS 250") ||
!strcmp(c.prop.name, "GeForce GTX 260M") ||
!strcmp(c.prop.name, "GeForce GTX 280M") ||
!strcmp(c.prop.name, "GeForce GTX 660M") ||
!strcmp(c.prop.name, "NVS 300") ||
!strcmp(c.prop.name, "NVS 3100M") ||
!strcmp(c.prop.name, "NVS 4200M") ||
!strcmp(c.prop.name, "NVS 5100M") ||
!strcmp(c.prop.name, "Quadro 400") ||
!strcmp(c.prop.name, "Quadro FX 1600M") ||
!strcmp(c.prop.name, "Quadro FX 1700") ||
!strcmp(c.prop.name, "Quadro FX 1800") ||
!strcmp(c.prop.name, "Quadro FX 2700M") ||
!strcmp(c.prop.name, "Quadro FX 2800M") ||
!strcmp(c.prop.name, "Quadro FX 3700") ||
!strcmp(c.prop.name, "Quadro FX 380") ||
!strcmp(c.prop.name, "Quadro FX 570") ||
!strcmp(c.prop.name, "Quadro FX 570M") ||
!strcmp(c.prop.name, "Quadro FX 580") ||
!strcmp(c.prop.name, "Quadro FX 770M") ||
!strcmp(c.prop.name, "Quadro FX 880M") ||
!strcmp(c.prop.name, "Quadro NVS 160M") ||
!strcmp(c.prop.name, "Quadro NVS 290") ) {
    // device is on the exclusion list; do not send it GPU work
}

Now, back then we ran everything through code in the scheduler, but for this new beta we are using the more common BOINC method for plan classes, which is to define them in a configurable XML file: https://boinc.berkeley.edu/trac/wiki/AppPlanSpec
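Moving the exclusions out of compiled scheduler code naturally suggests a table-driven check in place of the long `!strcmp(...) || ...` chain: load the names from the config file into an array once, then scan it per device. A minimal sketch under that assumption (names and structure hypothetical, not WCG's actual code):

```c
#include <assert.h>
#include <string.h>

/* Excluded device names; in practice these would be loaded from the
 * configurable XML file at scheduler startup rather than hard-coded. */
static const char *excluded[] = {
    "ION",
    "GeForce 210",
    "GeForce GTX 660M",
    "Quadro NVS 290",
    /* ... remaining entries ... */
};

/* Return 1 if the device name appears in the exclusion table. */
static int is_excluded(const char *name)
{
    for (size_t i = 0; i < sizeof excluded / sizeof excluded[0]; i++)
        if (strcmp(name, excluded[i]) == 0)
            return 1;
    return 0;
}
```

The behavior is identical to the `||` chain above, but adding or removing a card becomes a data change instead of a code change.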

I'm hoping to find a trend in the cards that have errors that would allow me to exclude them easily with this XML file. But my fallback plan is to let those devices fail out and get limited to 1 task per day, as that feature did not exist back when HCC was running.

I do have a minimum of OpenCL 1.2 set, but as some have noted, their cards report 1.2 yet still fail. This leads me to believe that they weren't 100% compatible with version 1.2...

Also, thank you all for the information, I will be turning on the validator in a bit to see what kind of errors we catch.

As for the points, I have not paid much attention to them at the moment as getting the science done is a higher priority.

On the question of checkpointing and disk usage: the application writes the results to disk each time it finishes a ligand. This is where a checkpoint takes place, and it is the easiest point to restore from. The researchers are planning to send out more difficult ligands for GPU, which per my recommendation should sit around 5 minutes per ligand on average. This means that if you have an awesome card (you are very lucky and I'm jealous), checkpointing may still happen every 30 seconds or so. What is being written is a very small amount of data; this should not wear out an SSD.
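Per-ligand checkpointing of this kind is commonly implemented with a write-then-rename pattern, so a crash mid-write can never corrupt the previous checkpoint. A generic sketch of that pattern (not WCG's actual code; `rename` is atomic on POSIX filesystems):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write checkpoint data to a temp file, then atomically rename it over
 * the real checkpoint file. A crash mid-write leaves the old file intact. */
static int write_checkpoint(const char *path, const char *data)
{
    char tmp[1024];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "w");
    if (!f)
        return -1;
    if (fputs(data, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) == EOF)
        return -1;
    return rename(tmp, path);  /* atomic replace on POSIX */
}

/* Read the checkpoint back into buf (NUL-terminated); 0 on success. */
static int read_checkpoint(const char *path, char *buf, size_t n)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t got = fread(buf, 1, n - 1, f);
    buf[got] = '\0';
    fclose(f);
    return 0;
}
```

Since each checkpoint is only a few bytes of per-ligand state, the write amplification on an SSD is negligible even at 30-second intervals.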

Thanks,
-Uplinger
[Mar 2, 2021 3:06:19 PM]