Dear @RA2DREAMChallengeParticipants,
The queue is being temporarily taken offline for maintenance. We will update you when the queue is back online.
Best,
Robert
EDIT: The submission queue is up and functioning, but wait times may be longer as the UAB cluster is seeing abnormally high utilization. We are working with UAB to determine if there is a reasonable way to give RA2 Challenge submissions higher priority in the cluster job queue. Please note that even if your submission has not run by a particular weekly deadline, it still counts against the quota for the week in which it was submitted.
Created by Robert Allaway (allawayr)
For context, we have 7 submissions either waiting to be submitted to the UAB server, or running/waiting for a run job to be allocated on the UAB server.
Hi Balint,
We only send 4 concurrent submissions to the scoring cluster at present; anything beyond that stays in "RECEIVED" status and will not show any logs until it gets handed off to the scoring server. The submissions that do show logs are among the 4 currently "running"; however, they are held up by the job scheduler on the UAB cluster because there is an abnormally large number of other (non-Challenge) jobs.
We're working to see if we can get a node reservation or higher priority for Challenge jobs, but until we do, I'd recommend submitting at night (Central US time) if you want quicker turnaround.
Cheers,
Robert
In the meantime, I tried again. For the first time, my fast lane submission failed pretty quickly due to a typo on my end.
Then I made two new fastlane submissions: 9703625 and 9703624. After 40-50 minutes, the Submission Dashboard still shows no links to the Log Folder, and I have not received any email alert that the submissions started.
Also, I have a leaderboard submission that I accidentally submitted when there was no fast lane: 9703616. This one also has that typo, so it will surely fail. It has the log folder, but I don't see the stderr file even after ~2 hours. Feel free to cancel that submission if it is just taking up resources. However, on my local machine the Docker container usually runs in less than 20-30 minutes, and it was also pretty quick on the fast lane when it failed.
Is this issue due to server usage?
Thank you,
Balint
Thank you for the quick response, I can see the fastlane now!
Regards,
Balint
@patbaa Apologies, this was a permission configuration issue on our end. I was wondering why we had no new fastlane submissions and only main queue submissions!
It's been corrected; please refresh and try again, and let me know if you continue to have issues. Thanks for letting me know about this!
Hi @allawayr,
Is the fastlane submission queue down now, or did I just mess something up? When I try to submit I see only the leaderboard lane.
Thank you,
Balint
Thanks @allawayr!
Hi there, I don't believe there is an issue with the queue. Looking at the log file I can see:
```
Submitted batch job 4663670
Submitted batch job 4663727
```
The first batch job builds your container as a Singularity image - so that step has completed.
The second job in the workflow actually runs the container. Both of these jobs wait in the slurm queue, so if the cluster is receiving a lot of jobs, it takes longer for processing to start. I will keep an eye on this, but I would anticipate that it will start running soon, particularly as activity on the cluster drops off when folks sign off for the evening.
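For illustration only, a minimal sketch of what such a two-step slurm workflow looks like (the script names below are placeholders, not the actual scoring pipeline):
```
# Hypothetical two-step slurm workflow (script names are placeholders).
# Step 1: build the submitted Docker container as a Singularity image.
build_id=$(sbatch --parsable build_image.sh)

# Step 2: run the built container, but only after the build job succeeds.
# Both jobs sit in the slurm queue until the scheduler assigns them a node.
sbatch --dependency=afterok:"$build_id" run_container.sh
```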
On a related note, I've reached out to our UAB collaborators to start the conversation about whether it's possible to increase the priority of these jobs in the slurm queue so that they start running faster. Waiting to hear back :)
Best,
Robert
It's been more than 4 hours and no sign of the job beginning to run.
Could it be that there is a failure in the queue?
Hi there, yeah, this is partially because of infrastructure updates (we now need to request two sequential jobs to run instead of just one; this was to resolve some Docker container build issues people were encountering) and likely also due to higher utilization of the cluster compute infrastructure at UAB (I would anticipate that everyone working from home and only being able to analyze data rather than generate new data might have something to do with this, but this is just total speculation on my part).
I'll contact our UAB collaborators to see if there is anything to be done on this front.
Thanks, it finally failed because of a missing file. I resubmitted, and now again I am waiting 2 hours for the job to run.
The processing used to begin a lot faster...
Hi @arielis, your submission has the status "EVALUATION_IN_PROGRESS" but has not reported any stdout logs. When this happens, it is typically because the job is waiting to be assigned to a compute node by the cluster job scheduler (slurm). It should be queued up to run, though it's out of our control when exactly it will start!
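For reference, a job in that state is simply pending in the slurm queue; nothing is written to stdout until it actually starts running. A rough, cluster-side-only illustration (not something participants can run themselves on UAB):
```
# Cluster-side only: a job still waiting for a compute node shows up as PENDING
# ("PD") in the slurm queue listing and produces no stdout until it starts.
squeue --user "$USER" --states=PENDING
```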
thanks,
Robert
I also received a notification that my submission ID 9703543 is in progress, but I see nothing happening in the logs...
Is my job waiting in the queue?
@lars.ericson should be running now; not clear what happened, but the session running our pipeline on UAB got terminated. Anyway, you should be all set now.
Not sure why @ikedim did not receive an email. One possibility is that emails from other users are turned off in his Synapse profile.
Hi Lars,
Looking into it. Your submission is sitting in RECEIVED state, so we do have it, but for some reason it is not running on UAB. Will see what is going on.
Cheers,
Robert
Is the queue back on now? My colleague @ikedim put in a submission. I got an email. He didn't. Nothing happened after that.
Hi there,
Yes, the leaderboard files have changed with the data update (we removed one patient as noted in the recent announcement).
When we run your container on the scoring server, we mount in the test data template file as `/test/template.csv` (see more info [here](https://www.synapse.org/#!Synapse:syn20545111/wiki/597249) under "input files").
The template will contain different patients for the leaderboard and final rounds, so instead of manually adding the template file to your container, I would suggest configuring your container to read the template file from `/test/template.csv` at run-time, so that you don't encounter this issue again during the final round (when the template will look different). Let me know if my explanation does not make sense!
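As a rough sketch of what that could look like (the `/output/predictions.csv` path and the `predict.py` script below are placeholders, not part of the challenge spec), a run.sh along these lines would pick up whichever template is mounted for the current round:
```
#!/bin/bash
# Hypothetical run.sh sketch: read the template that the scoring server mounts at
# /test/template.csv at run-time instead of baking a copy into the image.
# The /output path and predict.py are placeholders for illustration only.
set -euo pipefail

TEMPLATE=/test/template.csv
OUTPUT=/output/predictions.csv

# Produce exactly one prediction row per template row, so the row count always
# matches the patients in the current round.
python /app/predict.py --template "$TEMPLATE" --output "$OUTPUT"
```
Generating one output row per template row also avoids the "extra rows" validation error that comes up elsewhere in this thread.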
best,
Robert
Hi Robert,
I just solved the issue. It was because I did not put '/' at the beginning of the path.
Now I got a message saying that my submission is not valid because my prediction file has extra rows (it has 237 rows, excluding the header row).
Actually I did not have this problem with my first and second submissions. Is it because the test files have changed? If so, how many rows are supposed to be in the prediction file?
Yes, it worked fine on my local machine. It seems nothing is missing in my container. (Can you run this container locally successfully?)
Hi @net13meet - the queue is back up. The error you are receiving suggests you are missing a run.sh file in your container, or are pointing to the wrong path in your Docker ENTRYPOINT.
Is the queue back on now? I still keep failing to submit the container. (I've already read the recent email)
Here's the message given:
```
========================================
INFO: Could not find any nv binaries on this host!
/bin/bash: run.sh: No such file or directory
```
Please let me know if I'm missing something.
Thanks.
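As a quick local check for the `run.sh: No such file or directory` failure above (the image tag `my-model:latest` and the path `/run.sh` are placeholders - use whatever path your Dockerfile copies run.sh to and your ENTRYPOINT references):
```
# Placeholder image tag and path: confirm run.sh actually exists inside the image
# at the exact path the Docker ENTRYPOINT points to.
docker run --rm --entrypoint /bin/bash my-model:latest -c "ls -l /run.sh"
```
Referencing run.sh by an absolute path in the ENTRYPOINT (e.g. `/bin/bash /run.sh` rather than a bare `run.sh`) typically avoids the relative-path lookup that produces this error.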