Dear @RA2DREAMChallengeParticipants,

I am starting this thread as a resource listing some common reasons containers fail to run. Please "Follow" this thread if you'd like to receive updates from me and from others who would like to contribute, to help everyone troubleshoot. Hopefully this will be a useful resource for those encountering issues running their containers!

Best,
Robert

---------------

For context, when a submission produces a malformed prediction file, we provide a fairly specific error message to help guide you in fixing the container. However, if the container fails to complete running at all, the best we can do is provide logs to help you examine what went wrong. Here's a running list of examples we've encountered that can cause containers to fail; a minimal sketch of the expected input/output handling follows the list. Feel free to post a response with errors you've run into and I can add them to this list:

* The container does not read the image data from the correct directories. See the [instructions on this page](https://www.synapse.org/#!Synapse:syn20545111/wiki/597249) for more information on how to configure your container to read the input files.
* The container does not include all dependencies packaged in advance, including scripts, data (other than the image data and numerical training data), libraries, etc. Your containers do not run with a network connection and therefore cannot download anything interactively.
* Related to the above: the container is missing required libraries or modules. Please make sure these are installed in the base image or as part of your `docker build` process, that is, include their installation in your Dockerfile.
* Your container attempts to write to a directory that is not writable. (from @arielis)
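Here is that sketch: a Python entry point that reads from the mounted test directory and writes only to /output. The prediction file name, the use of template.csv to define the output rows and columns, and the assumption that the first column holds patient IDs are illustrative, not a specification:

```python
import os
import pandas as pd

TEST_DIR = "/test"      # test images and template.csv are mounted here at run time
OUTPUT_DIR = "/output"  # write the prediction file (and any scratch files) here

# The container runs with no network connection, so every package, script, and
# model file used here must already be baked into the image at build time.

# Read the template shipped with the input data to get the expected rows and
# columns, then fill every non-ID cell with a numeric value (0.0 is a placeholder
# standing in for real model output).
template = pd.read_csv(os.path.join(TEST_DIR, "template.csv"))
predictions = template.copy()
for col in predictions.columns[1:]:   # assumes the first column holds patient IDs
    predictions[col] = 0.0

predictions.to_csv(os.path.join(OUTPUT_DIR, "predictions.csv"), index=False)
```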

Created by Robert Allaway (@allawayr)
@dcentmakeover glad to help!
@dcentmakeover No worries - well done on your score
@stadlerm @allawayr Thank you so much for helping me with the submissions.
@allawayr, thanks for the organizers' rapid response! And @stadlerm, thanks for reporting this problem.
Great - thank you!
@stadlerm I just wanted to follow up with more details on this bug. It turns out your container did indeed run successfully, which is why you got a score email. However, there was a concurrency error in the workflow after the email step, caused by the server and Synapse communicating, which caused your submission to be marked invalid on Synapse and thus not posted to the leaderboard or counted against your quota.

I don't think this is an exploitable bug as described above (that is, you shouldn't get an email if your container stops running at any point, even after the prediction csv is written out), but rather an unpredictable error that could occasionally let a team get "lucky" and not have a submission count against their quota. Either way, we believe we've fixed the bug and have re-run your submission so that it shows up on the leaderboard (it counts against the quota from when it was originally submitted, not the current week). Please let us know if you run into this or a similar issue again!
If you are getting validation errors, this means that your container is producing a prediction.csv file, so you are getting closer! You must fill in every value for all rows and columns for the prediction.csv to be scored, and they must be numeric values.
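A quick local check along these lines can catch both problems before you submit (a sketch; the prediction file name and the assumption that the first column holds patient IDs are illustrative):

```python
import pandas as pd

pred = pd.read_csv("predictions.csv")

# Report any cells that are still empty.
missing = pred.isnull().sum()
print(missing[missing > 0])

# Report any prediction columns that are not numeric.
non_numeric = [c for c in pred.columns[1:]
               if not pd.api.types.is_numeric_dtype(pred[c])]
print("Non-numeric columns:", non_numeric)
```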
@allawayr Nothing in my error report, but I get this: `Predictions are not all numeric values.` `There are values missing in your prediction file.` Ideas?
Okay, got it, thanks a lot - let me check
See my response above :) Looks like we were typing at the same time.
@allawayr I think I am - can you please share the path where this template.csv is located?
Specifically, I think you should be referencing the template.csv in the input/test directory: `read_template = pd.read_csv('/test/template.csv')`
Hey there, we provide a template for the fast lane input directory that does accurately reflect the fast lane patient IDs. Is it possible that you are manually copying in the leaderboard template that we provide [here](syn21072036) during the docker build process and using that by accident?
@stadlerm oh, okay, let me modify accordingly, thanks for the quick response.
The fast submission lane has a limited number of images in /train and /test - it is possible that the template csv does not reflect this accurately and still contains all images. Maybe try reading the images from the directory: `os.listdir('/test')`
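Something like this (a sketch; how the patient IDs map to the filenames, e.g. 'UAB117-RH.jpg', is an assumption you'd adjust to the real naming scheme):

```python
import os

# Build the file list from what is actually mounted in /test, rather than from a
# template that may list more patients than the fast lane provides.
test_images = sorted(f for f in os.listdir('/test') if f.lower().endswith('.jpg'))

# Hypothetical ID parsing: 'UAB117-RH.jpg' -> 'UAB117'.
patient_ids = sorted({f.rsplit('-', 1)[0] for f in test_images})
```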
In my _stderr.txt this is my error output. Like I said, I am reading from `read_template = pd.read_csv('template.csv')`, but in the error report I get this: `FileNotFoundError: [Errno 2] No such file or directory: '/test/UAB117-RH.jpg'`. As I see it, the folder path `/test` is correct, right? Am I missing something?
@stadlerm Thanks for the additional insight! I agree - for some reason the scoring is happening before the container has totally finished running. Not sure why that would be happening, but we'll dive into it as soon as possible.
@dcentmakeover - @stadlerm has it correct. You can do whichever you find to be easiest.

Regarding your CUDA question: it's not clear to me that this will solve the issue, mainly because I have not spent time with pytorch and how to configure it. With tensorflow, you need to install a version that is specifically capable of using a GPU (e.g., using a docker image like https://hub.docker.com/r/rocker/tensorflow-gpu or https://hub.docker.com/layers/tensorflow/tensorflow/nightly-gpu-py3/images/sha256-137357c91caee6fa028d76a13bf1ad6c47b7f9a69a299ce70e6b343060f22808?context=explore). I am not sure if pytorch has similar requirements. However, if you define the CUDA PATHs like [this script](https://github.com/allaway/ra2-docker-demo/blob/master/run.sh) does, that will at least allow your container to access the scoring server's CUDA drivers. We provide these drivers at run time for all submissions, whether or not they require them.
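As a sanity check, a few lines like these at the start of your entry point (a sketch; which framework you import depends on your container) will show in the logs whether the GPU was actually visible at run time:

```python
# Print what the run can see; useful in the fast lane logs when debugging GPU setup.
try:
    import torch
    print("torch sees CUDA:", torch.cuda.is_available())
except ImportError:
    pass

try:
    import tensorflow as tf
    print("tensorflow sees GPUs:", tf.config.list_physical_devices("GPU"))
except ImportError:
    pass
```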
@allawayr No problem - the submission was 9701287.

What we do is: we write the output, and then we clean /output of everything that's not the log file or the output file (for our pipeline, we write some intermediate files). It seems that the scoring system kicks in as the output is written, but before our image is done.

Sometimes a .synapseCache directory is created, which we failed to delete, and that caused our docker image to fail - in the log it reads:

```
Traceback (most recent call last):
  File "/usr/local/bin/ra_joint_predictions/run_dream_predictions.py", line 48, in <module>
    _clean_output()
  File "/usr/local/bin/ra_joint_predictions/run_dream_predictions.py", line 35, in _clean_output
    os.remove('/output/' + file)
IsADirectoryError: [Errno 21] Is a directory: '/output/.synapseCache'
```

We now implement a check and leave any created directories alone - but I guess on your side the fix will need to be not running the scoring until the container is finished.

Let me know if you need any further details from us - thank you
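For anyone who hits the same thing, a cleanup along these lines (a sketch; the kept file names are illustrative) skips directories such as /output/.synapseCache instead of trying to remove them:

```python
import os

KEEP = {"predictions.csv", "log.txt"}   # whatever your pipeline needs to keep

for name in os.listdir("/output"):
    path = os.path.join("/output", name)
    # Leave directories (e.g. .synapseCache) and the files we want to keep alone.
    if name in KEEP or os.path.isdir(path):
        continue
    os.remove(path)
```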
@stadlerm We really appreciate the heads up - this is not supposed to be possible. We're looking into resolving it ASAP. In the meantime, we've shut off submissions to the main queue. Participants can still submit to the fast lane/validation queue to check their Docker configurations!
alright thanks
@dcentmakeover Shouldn't matter - you can do either. We currently read all images in /test once and create a list of filenames; we then use this list to load the images, do the prediction, and write the outcomes. We only use the template csv file to bring our output columns into the same order as in the template.
@stadlerm How are you reading the test files? Are you reading from template.csv, or from the test folder and building the csv as you go?
@AlexanderB Yes, we had similar issues - I also had trouble creating a directory in /output. We currently just write everything into /output, without much structure.

In my experience, the user that runs the image only has write privileges in /output, which is why it works locally but not in the fast lane/submission. I did not try very hard to see if I could create a directory, because I was getting fed up, but you might get it to work. If you do, you will have to create the dirs in /output at runtime; it does not work to create them when the Docker image is built.

Best of luck
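One way to handle scratch files under that constraint (a sketch; the directory name is arbitrary) is to create a working directory under /output at run time and point temp-file handling at it:

```python
import os
import tempfile

# /output is the only writable location on the scoring server, so create any
# working directories there at run time (creating them at image build time is
# not enough, per the discussion above).
scratch = "/output/tmp"
os.makedirs(scratch, exist_ok=True)

# Point the standard temp-file machinery (and child processes) at the writable dir.
tempfile.tempdir = scratch
os.environ["TMPDIR"] = scratch
```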
@stadlerm Does this mean we are also not able to write to /tmp? I am currently struggling to fix an error we get in the fast lane, which I can't reproduce locally and which started to show up after I included saving some temp files to /tmp. Should we just make a /tmp folder in /output then?
@arielis Yeah, we realized the same issue, which is why we wanted the organizers to know haha. Our new scores were (they don't show up in the leaderboard):

* sc1_weighted_sum_error: 0.7676
* sc2_total_weighted_sum_error: 1.0626
* sc3_total_weighted_sum_error: 0.6204

We basically just wanted to see how robust our scores were to some small changes, though we then realized that what we changed wasn't very sensible, hence the drop. Anyway, we have fixed the issue in our docker now, so it should be fine. However, note that this error does not occur in the fast lane, which means you won't know about the issue until you actually submit. Hopefully there is something the organizers can do about it.
@stadlerm, it does not seem to be an issue, it's a nice feature :) Seriously, you just revealed a way to privately test a docker without others knowing your performance, and maybe even avoid having the attempt counted. I hope the organizers will correct this!
Yes - you can only write to /output. That was a very painful night for us, trying to figure out why it worked locally but not on Synapse haha.

We also encountered an issue: if you successfully write the output file, but then some error happens (in our case we were doing some cleanup which had an error in it), you will get an email with your score, but shortly after you get another email that your submission failed, and your score won't show up in the leaderboard.
I built the docker, but when I run `/run.sh` locally I get:

```
    return torch.from_numpy(all_anchors.astype(np.float32)).cuda()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 196, in _lazy_init
    _check_driver()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 101, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
```

I am guessing that if I specify the CUDA PATHs in `/run.sh` this should be fine?
In the container, where is the `template.csv` located? Is it at `/template.csv`?
Another problem that I encountered: your container attempts to write to a directory that is not writable.

Common errors leading to container invalidation