World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control


Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Who has upgraded the client to version 7.0.40 or above specifically to use the new app_config.xml to limit the number of concurrent CEP2 tasks with the <max_concurrent> line?

Why am I asking? The contribution to this science is dismal compared to what it could be, barely scratching 18 years daily. This configuration tool [app_config.xml] is designed to help maximize resource use without micromanaging, while knowing that the threads that may run are in fact running. If set to run 1, then 1 is running every minute of the day the computer is on, without having to worry that more will start than specified, regardless of how many the device profile allows to be buffered on the client.

My W7-64 Octo HT is now set to 8 buffered [device profile Default], 2 concurrent, 1.5 day cache setting. In addition, the second science [HCC1] is limited to running 6 at most, in case an interruption of the CEP2 supply causes the CEP2 tasks not to be the oldest in the FIFO start order. Since HCC1 cannot use more than 6 slots, the client then looks for other work [CEP2 in this case] to fill the remaining free slots [8 minus 6]. The app_config.xml file content for this is:

<app_config>
   <app>
      <name>cep2</name>
      <max_concurrent>2</max_concurrent>
   </app>
   <app>
      <name>hcc1</name>
      <max_concurrent>6</max_concurrent>
   </app>
</app_config>

My Linux-64 Quad is now set to 6 buffered [device profile School], 2 at a time, 1.5 day cache setting. The app_config.xml file content for this is:

<app_config>
   <app>
      <name>cep2</name>
      <max_concurrent>2</max_concurrent>
   </app>
</app_config>

My W7-32 Duo is set to 3 buffered [device profile Work], 1 at a time, 1.25 day cache setting. The app_config.xml file content for this is:

<app_config>
   <app>
      <name>cep2</name>
      <max_concurrent>1</max_concurrent>
   </app>
</app_config>
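
The three files above share one shape, so they could be generated rather than typed by hand. The following is a minimal Python sketch, not part of the original post: the function name build_app_config is hypothetical, and only the <app>, <name> and <max_concurrent> elements shown above are assumed. Saving the result into the WCG project folder is left to the reader.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Return app_config.xml text for a mapping of app name -> max concurrent tasks.

    `limits` is a hypothetical input format, e.g. {"cep2": 2, "hcc1": 6}.
    """
    root = ET.Element("app_config")
    for name, max_concurrent in limits.items():
        if max_concurrent < 1:
            raise ValueError(f"max_concurrent for {name} must be at least 1")
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # ET.indent exists from Python 3.9; older versions emit a single line,
    # which is still well-formed XML.
    if hasattr(ET, "indent"):
        ET.indent(root, space="   ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The octo setup from the post: 2 CEP2 tasks, at most 6 HCC1 tasks.
    print(build_app_config({"cep2": 2, "hcc1": 6}))
```

Generating the file this way mainly guards against unbalanced tags when editing by hand; the limits themselves still have to be chosen per machine as described in the post.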

The amount of work buffered for CEP2 is about 25% higher than can be processed in the set buffer size [considering shortest run time], meaning the CEP2 tasks are almost always the oldest, and therefore almost always get first go whenever a running CEP2 task finishes.

All it needs is placement of the little app_config.xml file in the WCG project folder, which on Vista/W7/8 is typically C:\ProgramData\BOINC\projects\www.worldcommunitygrid.org, and an upgrade to a version such as 7.0.47 [tested stable since Jan. 30 on my W7-64 Octo]. If you are already on v7, it takes 2 minutes to run the installer. If you are on WCG's 6.10.58, it takes 4 minutes... follow the step sequence, half of whose steps are elective, as written up by ibsteve2u:
http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=411282

Regarding step 4), these are the 7.0.47 links:
Windows 64 bit installer: http://boinc.berkeley.edu/dl/boinc_7.0.47_windows_x86_64.exe
Windows 32 bit installer: http://boinc.berkeley.edu/dl/boinc_7.0.47_windows_intelx86.exe

(Versions 7.0.48/49/50 fix nothing critical... mainly cosmetics.)

Already using app_config.xml? Great. Want to use it to do more for CEP2 but worried? Ask here how best to tune app_config.xml for your scenario.

----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 18, 2013 10:15:27 AM]

[Feb 9, 2013 1:57:04 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

app_config.xml usage tip:

Sometimes the WU in your cache that is Ready to Start (RTS) and would be next to run is not a CEP2 WU when the number of running CEP2 WUs has dropped below the maximum you set in your <max_concurrent> line.
For example, this can happen when you are adjusting the setting in your WCG Device Profile for number of CEP2 WUs in your cache, or after a CEP2 WU supply outage.
If you have some CEP2 WUs that are further down in the cache, you can micromanage them so that they will all be run ahead of the non-CEP2 WUs that are RTS and ahead of them in the cache order, but your <max_concurrent> setting will still apply.

Normally, WUs that are Waiting to Run (WTR) will be run ahead of all WUs that are RTS (High Priority excepted).
However, the <max_concurrent> setting overrides this. If the max number of CEP2 WUs are already running, the next time BOINC needs to start a WU it will ignore any CEP2 WUs in the cache that are WTR and run a non-CEP2 WU that is RTS.
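
The ordering rule described above can be illustrated with a toy model in Python. This is only a sketch of the behaviour as described in this post, not BOINC's actual scheduler: the function pick_next and its data layout are hypothetical, and High Priority tasks are ignored.

```python
from collections import Counter


def pick_next(queue, running, limits):
    """Pick the index of the next task to start, or None if nothing may start.

    queue   -- list of (app, state) in cache order; state is "WTR" or "RTS"
    running -- list of app names of the tasks currently running
    limits  -- dict of app name -> max_concurrent (apps not listed are unlimited)
    """
    counts = Counter(running)

    def allowed(app):
        limit = limits.get(app)
        return limit is None or counts[app] < limit

    # Waiting to Run goes ahead of Ready to Start, but a task whose app has
    # already reached its max_concurrent limit is skipped in both passes.
    for wanted in ("WTR", "RTS"):
        for index, (app, state) in enumerate(queue):
            if state == wanted and allowed(app):
                return index
    return None


if __name__ == "__main__":
    queue = [("hcc1", "RTS"), ("cep2", "WTR")]
    # Two CEP2 tasks already running with a limit of 2: the HCC1 task starts.
    print(pick_next(queue, ["cep2", "cep2"], {"cep2": 2}))
    # Only one CEP2 task running: the waiting CEP2 task goes first.
    print(pick_next(queue, ["cep2"], {"cep2": 2}))
```

In this model, switching CEP2 tasks from RTS to WTR only changes which task is chosen while a CEP2 slot is free, which is the effect the procedure below relies on.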

You can switch all CEP2 WUs in your cache from RTS to WTR as follows.
Make sure that LAIM is set!
Check that there are no running WUs that are about to finish, and defer the next steps until they have done so.
Suspend all RTS and WTR WUs, except the CEP2 ones that are RTS.
Suspend one of the running WUs. If you are running <max_concurrent> CEP2 WUs, you have to suspend a CEP2 one to keep below <max_concurrent> - duh!
A CEP2 WU that was RTS will start.
If there are no more CEP2 WUs that are RTS, Resume the CEP2 WU that was running at the start.
After a very short time, suspend the CEP2 WU that you just started.
Repeat the last 2 steps until you have run all the CEP2 WUs that were RTS and you have resumed the CEP2 WU that was running at the start.
Resume all of the Suspended WUs.
Check that you've Resumed all of the Suspended WUs.
Check that you've Resumed all of the Suspended WUs. :D

The CEP2 WUs that were RTS will now be WTR.

The CEP2 WUs that you start need only run for less than 1 second but can be run for a few seconds. They should be suspended while their % Progress still shows zero, so that the suspended/WTR tasks will not occupy much RAM - only about 2.2MB each. When % Progress is more than zero, a Q-CHEM process has started and that grabs up to about an extra 90MB.
HTH

[Feb 11, 2013 2:48:09 AM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline

Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Rickjb wrote: The CEP2 WUs that you start need only run for less than 1 second but can be run for a few seconds. They should be suspended while their % Progress still shows zero, so that the suspended/WTR tasks will not occupy much RAM - only about 2.2MB each. When % Progress is more than zero, a Q-CHEM process has started and that grabs up to about an extra 90MB.

Using an extra 90 MB of memory isn't normally a problem, but the first thing CEP2 does on start is unzip all the files in the accompanying zip file. This unzipping will continue even if you pause CEP2 while it still shows zero progress.

If you start multiple CEP2 tasks, there's a good chance you'll basically kill disk performance, which increases the chance that all tasks will get the dreaded "no heartbeat for 30 seconds" error. Because there are often many hours between checkpoints, any previously running CEP2 task can then lose many hours of crunch time.

----------------------------------------

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

----------------------------------------
[Edit 1 times, last edit by Ingleside at Feb 11, 2013 9:52:56 AM]

[Feb 11, 2013 9:51:13 AM]

deltavee
Ace Cruncher
Texas Hill Country
Joined: Nov 17, 2004
Post Count: 4835
Status: Offline

Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Former Member wrote: Who has upgraded the client to version 7.0.40 or above specifically to use the new app_config.xml to limit the number of concurrent CEP2 tasks with the <max_concurrent> line?

I've been doing this on my quads since you first suggested it. It took a couple of days to hit upon the right combination of cache and concurrent CEP2 WUs, but it has been working great ever since. A 0.5 day cache and 3 max_concurrent CEP2s ensures that one and only one CEP2 is always running.

----------------------------------------

[Feb 11, 2013 10:58:42 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Yes, Ingleside/Rickjb et alia, doing this in quick succession can be fatal, and it sure as heck eats a good chunk of the RAM and VM for the full duration of the running CEP2 tasks, with the micromanaged [MMd] task(s) sitting in WtR state.

The outline in the OP was to buffer more per allowed CEP2 thread: if you have a cache setting of 1 day, allow 1 thread, and have an average run time of 8 hours, set 4 in the device profile. Then there's almost always a CEP2 task that's oldest in the queue, and hence, per the FIFO principle, it gets first go as soon as a CEP2 task finishes. With 2 threads, 1 day, and 8 hours, buffer 8 or 9. WCG pushes the CEP2 tasks first [when there's work in the feeder, which is almost always].
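
The arithmetic behind these numbers can be written out as a short Python sketch. The function name cep2_tasks_to_buffer is hypothetical, and the 25% headroom is an assumption borrowed from the opening post's "about 25% higher" remark (which was stated against the shortest run time, not the average); with it, the formula reproduces the 4 and 8 quoted above.

```python
import math


def cep2_tasks_to_buffer(threads, cache_days, avg_run_hours, headroom=0.25):
    """Estimate how many CEP2 tasks to allow in the device profile.

    threads       -- the <max_concurrent> value for cep2
    cache_days    -- the client's cache (buffer) setting in days
    avg_run_hours -- average CEP2 run time in hours
    headroom      -- extra fraction buffered so a CEP2 task is usually the
                     oldest in the queue (assumed 25%, see lead-in)
    """
    # Tasks the allowed threads can finish within one cache period.
    processed = threads * cache_days * 24 / avg_run_hours
    # Add headroom and round up to a whole task.
    return math.ceil(processed * (1 + headroom))


if __name__ == "__main__":
    print(cep2_tasks_to_buffer(1, 1, 8))  # 4, the 1-thread example
    print(cep2_tasks_to_buffer(2, 1, 8))  # 8, the 2-thread example
```

Treat the result as a starting point only; actual run times vary per machine and per task, so the setting may need a few days of adjustment.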

----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 11, 2013 11:06:56 AM]

[Feb 11, 2013 11:04:30 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

P.S. If anyone insists on pre-starting CEP2 jobs so they get first start for a free [CEP2] slot, watch in BOINC Manager, or better BOINCTasks, until the CPU time counter starts running, as Elapsed is only a wall-clock indicator that shows how much time the job was *allowed* to run. Depending on device speed, the unpacking and setup can take from 30 seconds to several minutes before the actual science starts computing. Also, do this one at a time. If you do this with multiple tasks, there's so much disk I/O competition that the setup duration grows dramatically, and then the famous heartbeat-failure crash noted above looms large!

Edit: Fortunately, the <max_concurrent> limit prevents anyone from going overboard with the micromanagement [MM]. :D

----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 11, 2013 11:25:19 AM]

[Feb 11, 2013 11:23:16 AM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

"No heartbeat" normally takes care of the extra memory usage from starting extra CEP2 tasks...

[Feb 11, 2013 4:31:16 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Can't say I'm seeing this when CEP2 tasks are started staggered... my octo handles 8 concurrent if done that way. But if they are launched concurrently... I might as well abort, as after an hour not a second of CPU time was recorded.

That said, this thread is really about how to use BOINC as it was designed, working to its maximum *without* anyone doing the hands-on stuff, so let's get back on topic.

[Feb 11, 2013 4:42:05 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

The point of my post is that it can take a while getting things to the hands-free state, so I use the procedure outlined to get <max_concurrent> happening immediately.

Have had nil problems with "no heartbeat" errors, on machines running WD SATA II Blue and WD SATA III Black drives. The Black especially just gulps down everything thrown at it.

app_config.xml is a big improvement in the ease of running CEP2, but the infrequent checkpoint issue remains.
I expect that there are many people whose machines run only part-time and they don't run CEP2 because of this.
I think my earlier suggestion of a "Suspend at next checkpoint" function in BOINC would help, though it would still be far from a solution due to the next checkpoint being unpredictably far ahead. It would not help much where the machine needs to be shut down to a deadline, eg when a user leaves work.

One of the problems with mid-subjob checkpoints cited by the CEP scientist(s)/WCG programmers was the volume of data to be saved. That might be acceptable if mid-subjob checkpoints were performed only in response to a "Checkpoint ASAP and Suspend" command from the user. But they said they didn't manage to get the code working to perform such checkpoints. Is it worth another look?

[Feb 11, 2013 11:34:41 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2: BOINC 7.0.40+ and app_config.xml concurrent jobs control

Ingleside wrote: "No heartbeat" is normally taking care of the extra memory-usage from starting extra CEP2-tasks...

Note: There may not be much point in reading any further than the period at the end of this sentence if you're not willing or are unable to reconfigure your system.

If "no heartbeat" is a frequent problem, it might be worth looking into (in order of easiest) putting in an additional drive to use for BOINC data and/or going to RAID.

I've run eight concurrent CEP2 tasks "a lot" - even 10 and 12 concurrent - without the "No Heartbeat" issue. E.g., I write this from a VMware virtual machine (running XP, but that is beside the point) that I'm accessing via remote desktop; the BOINC client on the Intel 980x Win7 host is set to run both the CPU and GPU "always" and - in addition to one or more virtual machines - the host simultaneously runs 9 CEP2 tasks and one "one core/one GPU" HCC task.

The reason I can do it is because I simply don't run single hard drives which must handle all of the I/O for both the operating system and BOINC jobs. I.e., I always run at least separate, mirrored system drives and data drives to ensure that reads, at least, can be serviced by multiple threads. Consequently, I've never had a problem with erroring out on "No Heartbeat". Not even when the scheduler shoots me 8 or a dozen "high priority" jobs at one time, bumping all running jobs regardless of their completion state off their cores and launching 8 or more new jobs from 0% simultaneously...

So my (unsolicited) advice is simple: Use RAID. That generally requires planning ahead, though, as in configuring the RAID set (usually via the system BIOS or, if an add-on card, via its firmware) before installing the operating system.

So if you've already installed the O/S and you don't want to "start over" - or you can't or don't want to run RAID - you'll still get less I/O contention (think four cars trying to get through the opposite sides of a 4-way intersection...somebody has to wait) if you use two drives so you can put the BOINC data directory on a drive other than the system drive. That you can do without reinstalling the O/S; just shut down BOINC, copy the current data directory to the new drive, uninstall BOINC then reinstall it using the installer's advanced options to point the BOINC data directory to the directory on your new drive.

(Edit: If you're running Windows, suspect that your issue is the number of things starting at once when you log in/reboot, and you're relatively comfortable with things computer then you might find the combination of this page on Microsoft/SysInternal's Autoruns and this page on using WinPatrol or a batch file to stagger program startups in Windows useful. Won't be useful for resumes from sleep/hibernation, of course.)

----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 12, 2013 3:24:45 AM]

[Feb 12, 2013 3:00:38 AM]
