Dear Organizers,
I've been trying to send in a submission to the Fast Lane, but it kept timing out after a few hours; then I sent it to the Challenge queue and it timed out after a longer period, but still did not run. Is this a problem for others as well? Until now my submissions went through without issues. Do you have any idea why this could be happening? My submission IDs are 9702602, 9702608, ...615, ...616, ...635; the last one is currently running, but its log file is stuck at the `Creating SIF file...` message.
Thanks in advance,
Alex
@qbeer and @net13meet, thanks again for your patience! After much wrangling, we think we've found a workaround for this issue. I can share more information if you are interested, but in brief: there is a RedHat bug that causes a stalled process during the image build step on multi-core machines, but only with some Docker containers. Our workaround now builds the containers on a one-core machine before handing the built container off to a node with many cores and GPU access.
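To give a concrete sense of the shape of that workaround (this is only an illustrative sketch, not our actual scripts; the image name, paths, and resource values are placeholders), the build and run are now split into two Slurm jobs roughly like this:

```bash
#!/bin/bash
# Illustrative sketch only -- not the real challenge scripts.
# Image name, paths, and resource values are placeholders.

IMAGE=docker://docker.synapse.org/synPROJECT/model:latest
SIF=/scratch/submission.sif

# Step 1: build the SIF on a single core, where the stall does not occur.
build_id=$(sbatch --parsable --cpus-per-task=1 \
    --wrap="singularity build $SIF $IMAGE")

# Step 2: once the build job succeeds, run the pre-built image on a GPU node.
sbatch --dependency=afterok:"$build_id" --cpus-per-task=8 --gres=gpu:1 \
    --wrap="singularity run --nv $SIF"
```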
Please submit your containers to the fast lane to double-check that this fix works; we already tested both of your containers in a private testing queue and they built successfully. Have a great weekend!
Thanks,
Robert

@net13meet and @qbeer, thanks for your patience! Would you be willing to email me (allawayr@synapse.org) your Dockerfiles so that we can compare them? There's no obvious reason to us why these containers are not building correctly as Singularity images, though the failure seems specific to Singularity 3.5.2. I'm wondering whether perhaps you are using the same base images/OSs or other dependencies that might be causing the build to fail.
Thanks!
Robert
OK, I just added your account.

Thanks! On the Docker Repository page you can click on Docker Repository Tools -> Docker Repository Sharing Settings and then add my account to the list.
I can't promise I'll be able to figure this out this week, but I will take a look.
Thanks,
Robert
Sorry to interrupt again, but my container still has the same problem when submitting to the fast lane.
========================
INFO:    Creating SIF file...
slurmstepd: error: *** JOB 4476479 ON c0112 CANCELLED AT 2020-04-08T00:36:27 DUE TO TIME LIMIT ***
========================
Would you please have my container checked as well? How do I share the container?

Oh - I forgot to answer your other question. We did clear caches earlier this week, but I'm not convinced that will have fixed your issue. You are welcome to resubmit to the fast lane to see if that has resolved the issue, though!

Thanks, Alex, for sharing your container. I hope to find some time to explore this further this week. @thomas.yu was able to get this container to run just fine on a separate Singularity install we have (not the challenge infrastructure), so it's still a mystery to me...

Hi,
it does not seem that way. I am not very familiar with Singularity, but it looks like it does not even build the container, so it is not a runtime error. I don't know whether there is some caching on the server itself, but would it be possible to clear the cache? I would then re-upload my Docker images and see whether the problem persists. As it stands we are stuck, and there is nothing we can do about it from this end. Thanks again,
Alex

Oh my... this is the problem I've been suffering from for over a month. Everything works fine on my local machine, but for some reason it doesn't go through as it's supposed to. FYI: my container size is around 7 GB.
My guess is that the submitted Docker container can't open GPU-related files once it's on the queue. That is probably why the submitted container goes over the time limit. You could check whether your GPU libraries are actually included in your submission.
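For example (just a rough sketch; the image name below is a placeholder), this is one way to check locally whether the GPU actually shows up from inside your container and which CUDA libraries are packaged with it:

```bash
# Placeholder image name; use your own submission image.
IMAGE=docker.synapse.org/synPROJECT/model:latest

# Is the GPU visible from inside the container?
# (Needs the NVIDIA container toolkit installed on the local host.)
docker run --rm --gpus all "$IMAGE" nvidia-smi

# Which CUDA libraries are actually baked into the image?
docker run --rm "$IMAGE" ldconfig -p | grep -i cuda
```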
Hi Robert,
sorry, I missed your reply; I am adding you to our container now. Thank you for the help,
Alex

Hi Alex,
On our end, the container looks like it's closer to 10 GB. I still don't think this should be a problem, but if you share your container with me (using the Docker Repository Sharing Settings), I can take a closer look and try to figure out what might be going on.
Hi Alex, yeah, 4.4 GB should not be too large.
Looking into it some more....
Hi Robert,
I can share it with you, but today I tried a lot of things. The Docker image size seems to be the issue: if I exclude some files (that are otherwise necessary), the image runs, just with errors about the missing files. Still, the image does not seem excessively large at 4.4 GB.
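For reference, this is roughly how I've been checking where the size comes from (the image name below is just a placeholder for ours):

```bash
# Placeholder image name.
IMAGE=docker.synapse.org/synPROJECT/model:latest

# Total image size as Docker reports it.
docker images "$IMAGE"

# Per-layer sizes, to see which build steps contribute the most.
docker history --no-trunc "$IMAGE"
```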
Alex

Hi Alex,
Hmm...this really does look like a failure to convert the docker container to a singularity image. I am not sure what would cause this issue.
If you give me access to your docker repository, I can investigate further to try and troubleshoot.
Thank you,
Robert
OK, I'll try that, but it will take some time since I need to restore that state. However, just to be clear, previously I got this kind of log:
```
...
INFO:    Creating SIF file...
INFO:    Build complete: /data/user/thomas.yu@sagebionetworks.org/.singularity/9702600.sif
INFO:    Could not find any nv binaries on this host!
--- and here came the log from my run ---
```
Currently:
```
...
2020/04/01 04:50:14 info unpack layer: sha256:b2...
INFO:    Creating SIF file...
slurmstepd: error: *** JOB 4458851 ON c0101 CANCELLED AT 2020-04-01T08:46:50 DUE TO TIME LIMIT ***
```
The latter log ends there, and the submission returns a timeout error after 3 hours on the Fast Lane. What do you think? Thanks,
Alex
Hmm...ok, one other thought. Can you re-submit your successful March 28th container to the Fast Lane queue? If it runs successfully this suggests it's something about the container that has changed. If it fails, then there is something unexpected going on at our end.
Hi Robert,
well, our last submission was on the 28th. Since then I modified some of the programs called at the end of our `run.sh` script, and the size of our Docker image got bigger. Locally we test with docker-compose, and it takes about 30 minutes at most to run, with a lot of additional output from local testing. Basically, the changes should not affect the running time significantly, and the run script should at least start and produce the same initial logs as before, since its beginning has not been modified since our last submission. Thanks for your help,
Alex

OK, it looks like you are running into the 4-hour and 12-hour time limits on the fast lane and full submission queues, respectively, which is why your jobs are failing. Have you tested this container locally? Are these the expected run times for your model?
Thanks,
Robert
I take that back - I think the build is working fine. It looks like it is failing early in the run, but we aren't getting a useful log file out...

Hi Alex,
We did not encounter this issue when testing on our end, and others do not seem to be having it. Looking at your log files, the failure appears to happen at the Singularity build step (i.e., Singularity takes the Docker container and builds it into a Singularity image file before running it).
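If you want to try reproducing that conversion step locally against your own image, something like this should be close (the image tag is a placeholder; `docker-daemon://` tells Singularity to build from an image in your local Docker daemon):

```bash
# Placeholder tag; substitute the tag of your local Docker image.
singularity build submission.sif docker-daemon://model:latest

# If the build completes, the resulting .sif is what actually gets executed for your run.
singularity run --nv submission.sif
```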
Can you let me know what types of changes you made between your last submission that **did** work and these submissions that have not worked? No need to go into details about your approach itself - I'm looking for higher-level technical details, like whether you switched to a different base Docker image or are using a different set of dependencies.
Thanks,
Robert
Any updates on this? I don't get any useful info from the logs, but it seems that my code doesn't even start to run. I assume this because the first script used to produce a log message, and that message no longer shows up. The only meaningful output is that the process times out after a few hours. It is also weird that when I submitted to the challenge, the log produced looked like the Fast Lane log (the `Copying blob SHA...` kind of text), whereas before it contained only the run log of my Docker container. Does anyone else currently face this issue? I see that there were some submissions, but not that many. Thanks for looking into this,
Alex

Thanks for the notice - I'll investigate and see what is going on.