Dear @RA2DREAMChallengeParticipants,
The challenge queues are running unpredictably at the moment due to an issue beyond our control - the UAB job scheduler is intermittently timing out for all users (including non-Challenge jobs). UAB Research Computing is looking into this issue on their end.
Please hold off submitting until I update this thread. Thanks!
Best,
Robert
Hmm, my submission still didn't go through, but I figured out that my previous version worked fine on the fast lane, which means something in my current version is the problem. I don't know why yet, but I will look into it further.
Thanks for the comments though! Thank you, I will try the fast lane. OK, this change has been made to the fast lane and will apply to future submissions (4-hour runtime limit). Well, I figured this out very quickly. Our main queue has a limit of 12 hours, but our fast lane has always had a 1-hour time limit. I had forgotten about this configuration until you ran into this issue! My guess is that your previous model was just under the time limit and that, with some small changes, you're now a bit over the 1-hour limit.
If you'd like, feel free to submit directly to the main queue, as that should work for you right now (provided your model produces a valid prediction file).
We'll bump up the fast lane limit to something more reasonable - 4 hours to start. That said, if a model can't complete on the fast lane dataset in 4 hours, I suspect it would also run into the 12-hour time limit on the main queue, so it's probably better to have it fail on the fast lane after 4 hours than on the main queue after waiting a full 12 hours. I will update this thread once we've changed this.
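For anyone who wants to estimate in advance whether a container fits under the 4-hour fast-lane or 12-hour main-queue limits, a minimal local timing sketch looks something like the following. `run_inference` and the image glob pattern are placeholders for your own code, not anything provided by the challenge:

```python
import time
from pathlib import Path

FAST_LANE_LIMIT_H = 4    # fast lane limit after the bump described above
MAIN_QUEUE_LIMIT_H = 12  # main queue partition limit

def run_inference(image_path: Path) -> None:
    """Placeholder for a model's per-image prediction step."""
    pass

def estimate_runtime(image_dir: str) -> float:
    """Time a full pass over a local image set and compare against the queue limits."""
    start = time.monotonic()
    for image_path in sorted(Path(image_dir).glob("*.jpg")):  # adjust the pattern to your data
        run_inference(image_path)
    hours = (time.monotonic() - start) / 3600
    print(f"Total runtime: {hours:.2f} h "
          f"(fast lane limit {FAST_LANE_LIMIT_H} h, main queue limit {MAIN_QUEUE_LIMIT_H} h)")
    return hours
```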
Hmm, OK. It's a little confusing because it looks like your jobs were killed after just one hour, which shouldn't be happening.
Let me look into this a little more, thanks for your patience! Thanks for the comments.
My container only has three pre-trained models (one for joint detection, the others for narrowing and erosion score predictions) that are run as each input image comes through. So did my two previous submissions; both took about an hour and a half to finish. The model I'm trying to submit takes about as long as they did, which means there's no way it needs more than 12 hours. The size of the container is almost the same (4.1 GB). @net13meet Hi there, thanks for the heads up. It does indeed look like you are running into the 12-hour job limit. Can you describe your model a bit more? Are you submitting a pretrained model that is taking 12 hours to run, or are you training as part of your submission?
For previous submissions that did work, approximately how many hours did it take to run your model?
Thanks,
Robert
@net13meet This seems to be unrelated to these issues - the queues have been working fine. Note that the images are now being run in a different environment:
"At the suggestion of UAB computing, we switched to a different compute partition that has a maximum run time of 12 hours. The upside to this change is that challenge submissions will get higher priority, be processed sooner, and be processed concurrently with other submissions if we use this partition. We have not yet observed any submitted models approaching a 12-hour time limit, but if you feel this is too restrictive or you run into this limit with your model, please let us know and we can accommodate."
So it seems you may be exceeding this limit, which would explain why the previous versions worked (in the old environment with a 48-hour limit) but are failing now. @net13meet Hmmm, my guess would be that you'll have to wait a while; it should be past 11 where they're at. Seems like other teams are having similar submission issues. So is my team.
I tried to submit a container that used to work perfectly fine and the job failed, even on the fast lane. I looked at the log file and it said the submission job was cancelled due to the time limit.
I thought it was supposed to work on the fast lane at least. Could someone help us out? Thanks.
Here's the full message I was given:
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:5c939e3a4d1097af8d3292ad3a41d3caa846f6333b91f2dd22b972bc2d19c5b5
Copying blob sha256:c63719cdbe7ae254b453dba06fb446f583b503f2a2c15becc83f8c5bc7a705e0
Copying blob sha256:19a861ea6baff71b05cd577478984c3e62cf0177bf74468d0aca551f5fcb891c
Copying blob sha256:651c9d2d6c4f37c56a221259e033e7e2353b698139c2ff950623ca28d64a9837
Copying blob sha256:d904eea365c17083de788b56bb95a307388e82edda6ba2e592908eb12fccd4ae
Copying blob sha256:68c95038bb0321f10de54830f56b2e898da9b64a736259600858dd091737617f
Copying blob sha256:ad86a8e1ffa14bbd80a51ecccffbfc5476aa1266b8e200eb936f4538749443e0
Copying blob sha256:d85d5817fe7db8744b578ece279afe48e6857c763f3958559f6bf2091f8c1f8d
Copying blob sha256:2e176242725342b2542dbcd651848f6950cd567c9910f97ae719214858e12429
Copying blob sha256:390ec2226bf7d6b7cef4c81ab8a1a4c5897625df2cfffbd572add90f176c43d1
Copying blob sha256:a93d7611cd97891c0aff58125048e9c2a0176ece33430bff6f770a01290dd6a9
Copying blob sha256:e3599227a00fd92e8291419571e44ed9c5d7a740c36e4849ddc469346c203033
Copying blob sha256:56c742ae0274abce169564699d087ae76418d701c879b3e3a75841db9bbc1af5
Copying blob sha256:7159314db713ea114980485242318f60639b642772cf2eac6d4cd9babf44da3c
Copying config sha256:48a7128745ba4e6cdc385d0edd9d44d79bae5a4bd0e91a493f7c932198183c8c
Writing manifest to image destination
Storing signatures
2020/03/15 19:13:25 info unpack layer: sha256:5c939e3a4d1097af8d3292ad3a41d3caa846f6333b91f2dd22b972bc2d19c5b5
2020/03/15 19:13:27 info unpack layer: sha256:c63719cdbe7ae254b453dba06fb446f583b503f2a2c15becc83f8c5bc7a705e0
2020/03/15 19:13:27 info unpack layer: sha256:19a861ea6baff71b05cd577478984c3e62cf0177bf74468d0aca551f5fcb891c
2020/03/15 19:13:27 info unpack layer: sha256:651c9d2d6c4f37c56a221259e033e7e2353b698139c2ff950623ca28d64a9837
2020/03/15 19:13:27 info unpack layer: sha256:d904eea365c17083de788b56bb95a307388e82edda6ba2e592908eb12fccd4ae
2020/03/15 19:13:27 info unpack layer: sha256:68c95038bb0321f10de54830f56b2e898da9b64a736259600858dd091737617f
2020/03/15 19:13:36 info unpack layer: sha256:ad86a8e1ffa14bbd80a51ecccffbfc5476aa1266b8e200eb936f4538749443e0
2020/03/15 19:13:36 info unpack layer: sha256:d85d5817fe7db8744b578ece279afe48e6857c763f3958559f6bf2091f8c1f8d
2020/03/15 19:13:36 info unpack layer: sha256:2e176242725342b2542dbcd651848f6950cd567c9910f97ae719214858e12429
2020/03/15 19:13:36 info unpack layer: sha256:390ec2226bf7d6b7cef4c81ab8a1a4c5897625df2cfffbd572add90f176c43d1
2020/03/15 19:13:36 info unpack layer: sha256:a93d7611cd97891c0aff58125048e9c2a0176ece33430bff6f770a01290dd6a9
2020/03/15 19:13:36 info unpack layer: sha256:e3599227a00fd92e8291419571e44ed9c5d7a740c36e4849ddc469346c203033
2020/03/15 19:13:59 info unpack layer: sha256:56c742ae0274abce169564699d087ae76418d701c879b3e3a75841db9bbc1af5
2020/03/15 19:14:25 info unpack layer: sha256:7159314db713ea114980485242318f60639b642772cf2eac6d4cd9babf44da3c
INFO: Creating SIF file...
slurmstepd: error: *** JOB 4163640 ON c0114 CANCELLED AT 2020-03-15T20:12:07 DUE TO TIME LIMIT ***
Thank you for fixing this, it is very helpful.
Dear @RA2DREAMChallengeParticipants,
Thank you for your patience! Thanks to @thomas.yu's tireless troubleshooting and with some help from UAB Research Computing, we believe that this issue has been resolved. Please let us know if you still encounter abnormally long submission runtimes or have issues with submission results not being returned.
Have a great weekend,
Robert
@dcentmakeover and @stadlerm thanks for the info!
We didn't trigger anything on our end, I'd guess it just got stuck somewhere between Synapse's and your email provider's server, but who knows. Let me know if you run into the issue again!
I just got the email now, roughly 40 minutes after we noticed our score - I'm not sure if you triggered it manually or if it was just a one-time hiccup with that submission.
Probably best to keep an eye on it, but for now it's all in order for us. @allawayr thanks, the score came through and so did the email. @stadlerm Thanks for the help! We (fingers crossed) might have resolved the issue causing submissions to not run through; hopefully we didn't introduce a new issue with the emails in the process...
@dcentmakeover - your score should have just come through on the leaderboard. Can you let me know if you did or did not get an email?
thanks!
@allawayr amazing, hopefully i get to see the score before i go to sleep Yes - I think you cancelled our submission from this morning (9701982), we then resubmitted (9701994), which failed with a different error, though nothing was apparent in the email or the log
I then submitted to the fast lane 9701995, and got a response saying that the container is valid within just 6 minutes (as before these issues started).
I then resubmitted to the challenge (9701999), but did not get an email
My partner then notified me (roughly 30 minutes after submitting) of our new scores on the leaderboard - the scores registered, but we didn't get an email
Hope this helps. Thanks @dcentmakeover - we've been watching your submission to see what happens. The logfiles are still being updated regularly for your recent submission, so it looks like it is still running fine at the moment! Also, I have resubmitted. Just to check that I understand, @stadlerm - you got emails from the fast lane with submission 9701995, but you got no emails from the main queue with submission 9701999? @stadlerm you mean you didn't get an email but your score updated on the leaderboard? So right now for us, the fast lane worked, but the challenge submission is being buggy. However, our job did work but then did not send an email - maybe that's useful for you to diagnose the issue.
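As an aside, anyone waiting on a missing notification email can also poll a submission's status directly with the Synapse Python client instead of relying on email. A rough sketch, using the submission ID from this exchange (substitute your own):

```python
import synapseclient

# Assumes credentials are already configured in ~/.synapseConfig
syn = synapseclient.Synapse()
syn.login()

# 9701999 is the main-queue submission discussed above; use your own submission ID
status = syn.getSubmissionStatus(9701999)
print(status.status)  # e.g. RECEIVED, EVALUATION_IN_PROGRESS, SCORED, or INVALID
```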
Thank you @allawayr yeah sure @dcentmakeover I know this might sound silly, but can you try resubmitting? We have been messing with the queue to try and fix it. Submissions seem to be running more smoothly now....fingers crossed that it works!
my container just failed
STDOUT:
STDERR:
INFO /home/thomas.yu@sagebionetworks.org/.conda/envs/cwl/bin/cwltool 2.0.20200303141624
INFO Resolved '/data/user/thomas.yu@sagebionetworks.org/tmp4298syw_/wes_workflow.cwl' to 'file:///data/user/thomas.yu%40sagebionetworks.org/tmp4298syw_/wes_workflow.cwl'
WARNING Workflow checker warning:
../../thomas.yu%40sagebionetworks.org/tmp4298syw_/wes_workflow.cwl:54:9: Source 'docker_repository'
of type ["null", "string"]
may be incompatible
../../thomas.yu%40sagebionetworks.org/tmp4298syw_/wes_workflow.cwl:93:9: with sink
'docker_repository' of
type "string"
../../thomas.yu%40sagebionetworks.org/tmp4298syw_/wes_workflow.cwl:55:9: Source 'docker_digest' of
type ["null", "string"]
may be incompatible
../../thomas.yu%40sagebionetworks.org/tmp4298syw_/wes_workflow.cwl:95:9: with sink
'docker_digest' of type
"string"[0m
[1;30mINFO[0m [workflow ] start
[1;30mINFO[0m [workflow ] starting step set_permissions
[1;30mINFO[0m [step set_permissions] start
[1;30mINFO[0m [job set_permissions] Output of job will be cached in /data/user/thomas.yu@sagebionetworks.org/cache_workflows/88f2a9751b5328e17154923d716da6c0
[1;30mINFO[0m Using local copy of Singularity image found in /data/user/thomas.yu@sagebionetworks.org/.singularity
[1;30mINFO[0m [job set_permissions] /data/user/thomas.yu@sagebionetworks.org/cache_workflows/88f2a9751b5328e17154923d716da6c0$ singularity \
--quiet \
exec \
--contain \
--pid \
--ipc \
--home \
'/data/user/thomas.yu@sagebionetworks.org/cache_workflows/88f2a9751b5328e17154923d716da6c0:/BybvGv' \
--bind \
'/data/user/thomas.yu@sagebionetworks.org/etafxk44:/tmp:rw' \
--bind \
'/data/user/thomas.yu@sagebionetworks.org/orchestrator/.synapseConfig:/var/lib/cwl/stgd477ed71-5b92-4272-be27-1676d7a39baa/.synapseConfig:ro' \
--pwd \
/BybvGv \
'/data/user/thomas.yu@sagebionetworks.org/.singularity/docker.synapse.org_syn18058986_challengeutils:v1.3.0.sif' \
challengeutils \
-c \
/var/lib/cwl/stgd477ed71-5b92-4272-be27-1676d7a39baa/.synapseConfig \
setentityacl \
syn21766156 \
3392644 \
download
any ideas?
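For context, the `set_permissions` step shown above is only granting a principal download access to a Synapse entity before scoring runs. A rough equivalent with the Synapse Python client (this is an illustration of what the challengeutils call does, not the challenge's actual workflow code) would be:

```python
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # assumes credentials in ~/.synapseConfig

# Entity and principal IDs taken from the command in the log above;
# "download" access is assumed to correspond to READ + DOWNLOAD permissions.
syn.setPermissions("syn21766156", principalId=3392644, accessType=["READ", "DOWNLOAD"])
```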
Even though I made a submission about 10 hours ago, I still haven't received the scores. It will be run, right? @stadlerm Will do! cc @thomas.yu @allawayr Can you try and restart our job from this morning? Thanks!
Dear @RA2DREAMChallengeParticipants,
Unfortunately, we were unable to get this resolved today. I apologize for the inconvenience. We will continue to try and diagnose this issue - it's been very tricky to pin down.
Thanks,
Robert
Dear @RA2DREAMChallengeParticipants,
I'm sending this notice to advise you that challenge submissions are running very slowly at the moment. We are trying to determine the root cause but please be aware that it is not an issue with individual submitted containers. We will update this thread when we have identified and fixed the issue! In the meantime, please feel free to submit containers, but be aware that it may take several hours to get results.
Best,
Robert
I totally understand how this would be frustrating.
We are still trying to figure out why these runs are only sometimes failing. Since your container passed validation, you can rest assured that it is not an issue with your container.
We will re-run your container and hopefully it will make its way through the leaderboard lane.
We are also experiencing some trouble.
I submitted our Docker container to the fast lane and it passed validation. Then it failed in the leaderboard lane without any meaningful logs. Now we have been waiting on the fast lane for more than 2 hours. It is really hard to try new ideas or even to debug with such long waiting times. Can you please take a look, @allawayr? Thank you
Current fast lane submission (pending): 9701867
Valid fast lane submission: 9701813
failed leaderboard submission: 9701846 Thanks for your help and feedback! OK, your container was finally run! Sorry for all of the issues. We think something between Synapse and UAB is causing a communication issue. I don't think we've resolved it yet, but it does look like with enough retries on our end we can eventually force the containers through... It turns out you probably got these error messages because we killed this job manually. OK, just figured I'd check! We are running all of these containers on the University of Alabama's supercomputer, so there's a possibility they made a change or software update that is causing some unexpected fault.
Thanks for the messages - we'll keep digging to see why this is happening.
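The manual fix described above (killing and re-running submissions until they go through) is essentially a retry loop around a flaky call. Purely as an illustration of that pattern, not the orchestrator's actual code, with `submit_step` as a placeholder for whatever call intermittently fails:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=2.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as err:  # in practice, catch only the specific transient error
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.1f} s")
            time.sleep(delay)

def submit_step():
    """Placeholder for the step that intermittently fails to reach Synapse."""
    return "ok"

result = retry_with_backoff(submit_step)
```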
BTW - a failed container will not count against your quota; only submissions whose scores post to the leaderboard count against it. @allawayr My submission just failed, after working fine in the fast lane.
The output seems to be unrelated to our container:
I got two emails
The message is: ib/cwl/stgd9d47638-5698-4e16-9de0-f6d7a600c0ca/predictions.csv \ -p \ syn21760112 \ -ui \ syn21638201 \ -e \ syn21515819
\ -r \ results.json \ -c \ /var/lib/cwl/stg31dd83b7-0870-4f0a-9277-81f72795eafe/.synapseConfig
The message is: ionetworks.org/.singularity/docker.synapse.org_syn18058986_challengeutils:v1.3.0.sif' \ challengeutils \ -c \ /var/lib/cwl/stg2161e010-9a5d-43c8-ab31-6e85a9f5eb80/.synapseConfig
\ setentityacl \ syn21760250 \ 3392644 \ download
Submission ID 9701798 - could you take a look and reschedule it? I hope this won't count against our upload quota - thank you!
@allawayr Sorry for the confusion - no, nothing changed about our method. We use pretrained models at the moment, so no training happens on Synapse. We only load the models and then run the predictions, which is fairly constant. On our machine this process takes at most 10-ish minutes. @stadlerm - Just to double-check, did you modify your method in a way that would make you expect a longer run time than your previous submissions?
No problem - it may seem to be working because we have been canceling and rerunning submissions on our end. Still not clear to us what the problem is, though... apologies for the inconvenience! @allawayr Thank you for your support and work - it seems to be working, just really, really slowly. This submission to the "fast" lane took 2 hours, and our challenge submission has now already been running for 1.5 hours as well. Hmm, we are looking into it. I am not currently convinced that the problem is in your container, but will certainly let you know if that appears to be the case.
Re the SC2/3 leaderboards - this is a UI bug; we are working with our platform team to figure out a workaround. @allawayr Can you have a look at the queue? And maybe let us know if there is something in our container that's causing the issues, so we can fix them.
Thank you. Yeah, for me the sorting fails sporadically - sometimes it works, and other times it won't work on any of the leaderboards, or only on some. Also, SC2 cannot be sorted. Thank you - one ran through, but I queued up a new one and it appears stuck again? It's been about half an hour now, and last it said it was valid. Thanks for reporting! These submissions were canceled and re-run. You should be getting results soon!
Looks like the workflows all ran into an error while talking to Synapse. Not clear to us why they failed at this step, but restarting the submissions seems to have fixed it. @allawayr thank you Hi both,
We'll look into this on our end. A quick glance suggests one or more of the submitted containers is hanging and gumming up the rest of the queue. Will get back to you asap!
It seems random - I resubmitted one that ran, but now it stopped again. @stadlerm did that work? Or... Hm, thanks - I will retry the fast lane. @stadlerm The fast lane seems to work - I got the response; it's the main submission line that seems off. Anyway, let's wait till we hear from allawayr. @dcentmakeover Seems like it - we also submitted to the fast lane, but nothing so far.
Unlikely to get a response anytime soon though, since the organizers are based in the US and it's still quite early there. @allawayr Hey, is the submission server down? My submissions are not being scored; I've been waiting for more than 5 hours with no result email. The leaderboard is now sorting by most *recent* submission for all three subchallenges.
I can reproduce the issue with the toggle for SC3, though SC2 seems to work fine for me. I'll file a ticket with our platform team to see if they have any guidance...seems like a UI bug, not sure of the cause.
>Lastly, do you think it would be possible to add submission IDs per team, so we can figure out which submission is what, across the split leaderboards? Thank you!
Can you clarify this? The submission IDs are currently listed as the first column in the leaderboard - e.g., you could find submission 9701507 in all three leaderboards. I think I am just misunderstanding your request.
@allawayr Thanks for the update - could you please check the sorting for the leaderboard? It seems to be mixed up and sorting in the wrong direction. I also have some issues toggling the sorts occasionally, particularly for SC2 and SC3
Lastly, do you think it would be possible to add submission IDs per team, so we can figure out which submission is what, across the split leaderboards? Thank you! Dear @RA2DREAMChallengeParticipants,
I'm writing to provide a couple of updates on the challenge infrastructure.
First, the technical issue at UAB is resolved and submissions should be processed smoothly.
We used this downtime to address a few issues that were reported or that we uncovered in the past 24 hours. Please read on as there are some important updates:
* The scoring emails have been corrected so that Subchallenge 2 (joint narrowing) and Subchallenge 3 (joint erosion) scores are no longer swapped.
* While working on the previous fix, we determined that we have been providing the SC2/3 scores based on the Overall_narrowing and Overall_erosion scores, rather than the individual joint-wise narrowing and erosion RMSEs. Our scoring code already calculates both, but there was a miscommunication and the joint-wise scores did not end up on the leaderboard. [As described in the challenge wiki summary for SC2 and 3](https://www.synapse.org/#!Synapse:syn20545111/wiki/597242), we were to provide both the overall and joint-wise scores for your information, **but the challenge Assessment uses the *joint-wise scores* only** (see the sketch after this list for the distinction). Please check the updated leaderboard to see these joint-wise scores for all of your previous submissions. My apologies for this mistake.
* At the suggestion of UAB computing, we switched to a different compute partition that has a **maximum run time of 12 hours**. The upside to this change is that challenge submissions will get higher priority, be processed sooner, and be processed concurrently with other submissions if we use this partition. We have not yet observed any submitted models approaching a 12-hour time limit, but if you feel this is too restrictive or you run into this limit with your model, please let us know and we can accommodate.
* The leaderboard page is now split into three leaderboards (one for each subchallenge) to allow sorting - previously, the presence of NAs in some SC2/3 submissions prevented the sort toggles from working.
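To make the scoring change above concrete, here is a small sketch of the difference between an overall RMSE and a joint-wise RMSE. The joint column names and the pooling across joints are illustrative only, not the challenge's exact scoring code:

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Toy predictions and gold standard; the joint column names are hypothetical
pred = pd.DataFrame({"Overall_narrowing": [3.0, 1.0],
                     "LH_wrist_narrowing": [1.0, 0.0],
                     "RH_wrist_narrowing": [2.0, 1.0]})
gold = pd.DataFrame({"Overall_narrowing": [2.0, 2.0],
                     "LH_wrist_narrowing": [1.0, 1.0],
                     "RH_wrist_narrowing": [3.0, 0.0]})

# What was shown previously: RMSE on the overall (summed) score only
overall_rmse = rmse(gold["Overall_narrowing"], pred["Overall_narrowing"])

# What the SC2/3 assessment uses: RMSE over the individual joint columns
joint_cols = [c for c in gold.columns if not c.startswith("Overall")]
jointwise_rmse = rmse(gold[joint_cols].to_numpy().ravel(),
                      pred[joint_cols].to_numpy().ravel())

print(f"overall RMSE = {overall_rmse:.2f}, joint-wise RMSE = {jointwise_rmse:.2f}")
```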
As always, let me know if you have any questions.
On behalf of all of the @RA2DREAMOrganizers, thanks for your patience and participation!
Best,
Robert
Hi all - in case you didn't see it on the Challenge wiki:
>
> ### The challenge queues are temporarily offline. We anticipate reopening them the morning (Pacific Time) of March 10.
>Thanks for your patience.