Thread Status: Active
Total posts in this thread: 13
Posts: 13   Pages: 2   [ 1 2 | Next Page ]
This topic has been viewed 58611 times and has 12 replies
NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline
Type A WUs Quickly Failing [RESOLVED]

I got a resend job (dg04_a485_ps0000) which immediately errored out when it started to run.
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
INFO: No state to restore. Start from the beginning.
CODES> MISSING PARAMETERS
A few others have reported seeing the same problem in the "Its Raining..." and the "Updates for DDDT phase 2" threads within the past few hours.

No harm. It looks like my WU is on hold and is not being resent yet again. They will be resent after the WUs are fixed by WCG staff.

Update: dg04_a485_ps0000 went out 2 more times after it failed for me for a total of 7 attempts. The last resend was aborted by the user. Can anyone explain why some clients claimed credit even though no time was logged?
dg04_ a485_ ps0000_ 6--  640 User Aborted  6/17/12 13:52:20 6/17/12 16:29:25  0.00 199.2 / 0.0 
dg04_ a485_ ps0000_ 5-- 640 Error 6/17/12 13:43:24 6/17/12 13:52:16 0.00 199.2 / 0.0
dg04_ a485_ ps0000_ 4-- 640 Error 6/17/12 11:23:57 6/17/12 17:34:14 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 3-- 640 Error 6/17/12 08:29:22 6/17/12 11:23:45 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 2-- 640 Error 6/17/12 08:01:48 6/17/12 13:43:17 0.00 199.2 / 0.0
dg04_ a485_ ps0000_ 1-- 640 Error 6/17/12 05:13:31 6/17/12 08:01:42 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 0-- 640 Error 6/17/12 05:13:27 6/17/12 08:29:16 0.00 199.2 / 0.0
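
For anyone who wants to pick out the odd rows mechanically, here is a minimal Python sketch that parses the listing above and flags results that claimed credit with zero CPU time. The column layout (name, "--", version, status, sent time, return time, CPU hours, claimed / granted) is an assumption read off the pasted rows, and `parse_row` and `zero_time_claims` are hypothetical helper names, not anything provided by WCG or BOINC. It only identifies the rows; it does not explain why the credit was claimed.

```python
import re

# Rows copied from the status listing above, spacing as pasted.
ROWS = """\
dg04_ a485_ ps0000_ 6--  640 User Aborted  6/17/12 13:52:20 6/17/12 16:29:25  0.00 199.2 / 0.0
dg04_ a485_ ps0000_ 5-- 640 Error 6/17/12 13:43:24 6/17/12 13:52:16 0.00 199.2 / 0.0
dg04_ a485_ ps0000_ 4-- 640 Error 6/17/12 11:23:57 6/17/12 17:34:14 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 3-- 640 Error 6/17/12 08:29:22 6/17/12 11:23:45 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 2-- 640 Error 6/17/12 08:01:48 6/17/12 13:43:17 0.00 199.2 / 0.0
dg04_ a485_ ps0000_ 1-- 640 Error 6/17/12 05:13:31 6/17/12 08:01:42 0.00 0.0 / 0.0
dg04_ a485_ ps0000_ 0-- 640 Error 6/17/12 05:13:27 6/17/12 08:29:16 0.00 199.2 / 0.0
""".splitlines()

# Assumed layout: name, "--", version, status, two timestamps,
# CPU hours, then "claimed / granted" credit.
TIMESTAMP = r"\d+/\d+/\d+ \d+:\d+:\d+"
ROW_RE = re.compile(
    r"^(?P<name>.+?)--\s+(?P<version>\S+)\s+(?P<status>.+?)\s+"
    rf"(?P<sent>{TIMESTAMP})\s+(?P<returned>{TIMESTAMP})\s+"
    r"(?P<cpu_hours>\d+\.\d+)\s+(?P<claimed>\d+\.\d+)\s*/\s*(?P<granted>\d+\.\d+)$"
)


def parse_row(line):
    """Return a dict for one status row, or None if the line does not match."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"].replace(" ", ""),  # drop the spaces the page inserts
        "status": m["status"],
        "cpu_hours": float(m["cpu_hours"]),
        "claimed": float(m["claimed"]),
        "granted": float(m["granted"]),
    }


def zero_time_claims(lines):
    """Names of results that logged no CPU time but still claimed credit."""
    parsed = (parse_row(line) for line in lines)
    return [r["name"] for r in parsed
            if r and r["cpu_hours"] == 0.0 and r["claimed"] > 0.0]


flagged = zero_time_claims(ROWS)
print(flagged)
```

On the rows above this flags attempts 6, 5, 2 and 0, which are the ones showing 199.2 claimed against 0.00 hours.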
No word yet from the techs on solving this problem. Let's get these type-A WUs processed so we can get some type-B and type-C work out to the legions of hungry and drooling crunchers.

Resolution: Seippel explained in a post below that some naughty work units escaped from the basement of WCG headquarters. As all the postings have stated that they immediately failed, there appears to be no wasted CPU time. All of the baddies have been rounded up back into their pen.

Cheers

[edited post header to clarify type-A work units]
[edited to add update]
[edited to add resolution]
----------------------------------------
[Edit 4 times, last edit by NixChix at Jun 20, 2012 2:34:55 PM]
[Jun 17, 2012 5:55:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Jobs Quickly Failing

Rather ''off normal'' for A type to get out of the pen without a tech posting, and on Saturday at that. Did they escape by accident?

To admin: Can you please move the posts regarding problems for this dg04 set out of the Project Update thread, starting with this post: https://secure.worldcommunitygrid.org/forums/wcg/printpost_post,381667 , where they are out of place.

Thx.

--//--

edit: it really seems "pen" with one 'n'.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 18, 2012 6:28:24 AM]
[Jun 17, 2012 8:03:33 PM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: Jobs Quickly Failing

Howdy... huh... a couple of things:
1) I just received a DDDT2 WU !!! Imagine my happiness.. a genuine DDDT2 WU !!! (according to this it's a type-A)
2) After my initial elation, I started to think that something was wrong with this (the "4" at the end of the WU name looked suspicious)... so I checked the Work Unit Status and it didn't look encouraging:

dg04_ c045_ ps0000_ 4-- - In Progress 6/18/12 01:03:31 6/26/12 01:03:31 0.00 0.0 / 0.0 <- My WU
dg04_ c045_ ps0000_ 3-- - In Progress 6/17/12 21:38:00 6/25/12 21:38:00 0.00 0.0 / 0.0
dg04_ c045_ ps0000_ 2-- 640 Error 6/17/12 15:32:59 6/18/12 01:03:31 1.28 22.5 / 0.0
dg04_ c045_ ps0000_ 1-- 640 Error 6/17/12 05:14:43 6/17/12 21:37:57 0.77 19.5 / 0.0
dg04_ c045_ ps0000_ 0-- 640 Error 6/17/12 05:14:37 6/17/12 15:33:00 0.36 8.0 / 8.0

Since this WU has a history of erroring out, do I abort it ? (it hasn't run yet)... Or is this a fixed WU?

EDIT: The mentioned WU was "Aborted by Project"... so I guess it was determined that it was faulty and recalled.

Thanks,
CJSL
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


----------------------------------------
[Edit 1 times, last edit by cjslman at Jun 18, 2012 1:09:37 PM]
[Jun 18, 2012 2:17:49 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Jobs Quickly Failing

Errors provide a wealth of information; aborts provide nothing.

Will most likely error, but you may have the magic system.
[Jun 18, 2012 6:41:22 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Jobs Quickly Failing

Yesterday's stats were 320 results for an average time of 0.427 hours, or about 25.6 minutes. Would that be where the first checkpoint falls, on average?
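
The conversion is simple arithmetic; a small Python sketch in case anyone wants to redo it with other days' stats. The 320 results and the 0.427 hour average are the figures quoted above; `hours_to_minutes` is just an illustrative helper name.

```python
def hours_to_minutes(hours):
    """Convert a duration in hours to minutes."""
    return hours * 60.0


results = 320        # results returned yesterday, as quoted above
avg_hours = 0.427    # average run time per result, as quoted above

avg_minutes = hours_to_minutes(avg_hours)
total_hours = results * avg_hours

print(f"average per result: {avg_minutes:.1f} minutes")
print(f"total run time:     {total_hours:.2f} hours")
```

So the average is a little over 25 and a half minutes per result, and the 320 results together account for roughly 137 hours of run time.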

--//--
[Jun 18, 2012 12:41:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Jobs Quickly Failing

And through the 4 way mix on the octo received one:

6.40 dddt2a dg04_d161_ps0000_5 - (-) 0,00 0,000 11:02:55 07d,23:43:04 18-6-2012 14:35:53

with a dubious history to its name:

dg04_ d161_ ps0000_ 5-- - In Progress 6/18/12 12:35:51 6/26/12 12:35:51 0.00 0.0 / 0.0 < moi
dg04_ d161_ ps0000_ 4-- - In Progress 6/17/12 22:25:44 6/25/12 22:25:44 0.00 0.0 / 0.0
dg04_ d161_ ps0000_ 3-- 640 Error 6/17/12 19:32:17 6/17/12 22:25:24 0.25 4.4 / 4.4
dg04_ d161_ ps0000_ 2-- 640 Error 6/17/12 19:03:35 6/18/12 12:35:34 0.40 6.2 / 0.0
dg04_ d161_ ps0000_ 1-- 640 Error 6/17/12 05:18:41 6/17/12 18:39:10 0.44 5.7 / 5.7
dg04_ d161_ ps0000_ 0-- 640 Error 6/17/12 05:18:39 6/17/12 19:32:02 0.19 4.7 / 4.7

It's at the end of a 36 hour queue, so there's enough time for the high word to come out per that famous beer commercial: Wazzup (with this thing)? Server abort without a statement will be clear enough (BTW, server aborts log as "error", maybe since start of server 700... don't know).

--//--

edit: Contrary to my initial intent of waiting it out for an official word, I pushed the task ahead and the 'result' was predictable, at the first checkpoint:

6.40 dddt2a dg04_d161_ps0000_5 00:20:23 (00:20:03) 98,40 100,000 - 07d,19:26:28 18-6-2012 14:35:53 Computation error LAPSED-02 0.00

Result Name: dg04_ d161_ ps0000_ 5--
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
ENERGY CHANGE TOLERANCE EXCEEDED
Encountered error. Exiting.

</stderr_txt>
]]>

No, my machine did not do magic... it lost any rating it might already have had for DDDT2-A type :|
----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 18, 2012 5:14:12 PM]
[Jun 18, 2012 12:58:41 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 358
Status: Offline
Re: Jobs Quickly Failing

I've had these units error out so far ...

dg04_ c472_ ps0000_ 3-- Error 6/18/12 20:28:14 6/18/12 21:48:10 0.00 199.2 / 0.0
dg04_ c476_ ps0000_ 2-- Error 6/18/12 20:28:14 6/18/12 21:48:10 0.00 199.2 / 0.0
dg04_ d402_ ps0000_ 3-- Error 6/17/12 10:41:34 6/17/12 11:38:35 0.46 6.0 / 6.0
dg04_ d154_ ps0000_ 0-- Error 6/17/12 05:18:29 6/17/12 05:21:15 0.00 0.0 / 0.0
dg04_ c250_ ps0000_ 1-- Error 6/17/12 05:16:29 6/17/12 05:17:11 0.00 0.0 / 0.0
dg04_ c242_ ps0000_ 0-- Error 6/17/12 05:16:08 6/17/12 07:22:48 0.36 4.7 / 4.7
dg04_ c211_ ps0000_ 1-- Error 6/17/12 05:15:41 6/17/12 05:54:30 0.39 5.0 / 5.0
dg04_ c153_ ps0000_ 1-- Error 6/17/12 05:15:25 6/17/12 05:49:20 0.34 4.3 / 4.3
dg04_ b397_ ps0000_ 0-- Error 6/17/12 05:14:22 6/17/12 05:16:28 0.00 0.0 / 0.0

Isn't there alpha testing before new work is sent out for general consumption?

This project seems to have issues whenever new work is loaded into the grid.

This is getting a bit frustrating especially with work being so scarce for this project.

Knock, knock ... anybody home?
----------------------------------------
[Edit 3 times, last edit by AgrFan at Jun 18, 2012 11:07:16 PM]
[Jun 18, 2012 10:09:26 PM]
Dataman
Ace Cruncher
Joined: Nov 16, 2004
Post Count: 4865
Status: Offline
Re: Jobs Quickly Failing

This project seems to have issues whenever new work is loaded into the grid.

This is getting a bit frustrating especially with work being so scarce for this project.

Knock, knock ... anybody home?

Can we get an update from the techs on the status of this problem? I have been unable to get any of the new WUs so have not seen the errors, but the problem seems to continue without comment from techs or scientists. Thanks.
[Jun 19, 2012 1:15:26 PM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Status: Offline
Re: Jobs Quickly Failing

This problem was caused by an issue with one of our filesystems. In the process of resolving that issue, some work units which had failed previously for dg04 were inadvertently sent out. We apologize for the inconvenience. Also, these work units are unrelated to the work units which Dr. Usha Viswanathan mentioned in the "Updates for DDDT phase 2" thread. Those work units are dg05 Type C work units while the ones with errors in this thread are dg04 work units.

It's also worth mentioning that for new work for this project, we take a number of steps to ensure work sent to users maintains a low error rate. This process includes a random sampling of new work units on our alpha lab machines. If these complete successfully, a single batch from the new set of work is sent out on the World Community Grid. Finally, if the results from that batch have a low error rate after a day or two, then the full set of work is sent out. Although it's necessary, the flip side is that this delays work units while this testing completes (which has also been a complaint from some users).
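
As a rough illustration only, the three stages described above (alpha sample, single batch, full set) could be sketched in Python like this. Everything specific here is an assumption made for the sketch: the function names, the sample and batch sizes, and the 5% error threshold are hypothetical, and this is not WCG's actual release tooling.

```python
import random


def error_rate(results):
    """Fraction of failed results; `results` is a list of booleans (True = success)."""
    if not results:
        return 1.0  # no data is treated as failing the gate
    return results.count(False) / len(results)


def staged_release(work_units, run_alpha, run_batch,
                   sample_size=5, batch_size=20, max_error_rate=0.05, seed=0):
    """Gate a new set of work through an alpha sample and a single batch.

    run_alpha and run_batch are caller-supplied callables that take a work
    unit and return True on success. All sizes and the threshold are
    hypothetical defaults, not values stated in the thread.
    """
    rng = random.Random(seed)

    # Stage 1: random sample on alpha machines; every sampled unit must complete.
    sample = rng.sample(work_units, min(sample_size, len(work_units)))
    if not all(run_alpha(wu) for wu in sample):
        return "held: alpha sample failed"

    # Stage 2: one batch goes to the grid; its error rate must stay low.
    batch = work_units[:batch_size]
    if error_rate([run_batch(wu) for wu in batch]) > max_error_rate:
        return "held: batch error rate too high"

    # Stage 3: both gates passed, so the full set can go out.
    return "released: full set"


units = [f"wu_{i:03d}" for i in range(100)]  # made-up work unit names
print(staged_release(units, lambda wu: True, lambda wu: True))
print(staged_release(units, lambda wu: False, lambda wu: True))
```

The point of the two gates is the trade-off mentioned above: each stage catches bad work before more of it goes out, at the cost of delaying the full set while the earlier stage completes.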

Seippel
[Jun 19, 2012 8:06:04 PM]
Dataman
Ace Cruncher
Joined: Nov 16, 2004
Post Count: 4865
Status: Offline
Re: Jobs Quickly Failing

Thank you for the explanation. I understand now.
[Jun 19, 2012 11:57:28 PM]