Dear @RA2DREAMChallengeParticipants, Please be aware that at present, submissions may take several hours to run. The jobs that we submit are queued with all of the other UAB Cheaha jobs, and it can take a few hours to get to the front of the queue. This is due to abnormally high utilization of the UAB cluster that is running the challenge submissions, especially during the daytime (central US time). We're working with the folks at UAB to try and get higher priority or determine another solution to get Challenge jobs completed faster, but for now we are beholden to the queue. Thanks for your understanding! Best, Robert on behalf of the RA2 DREAM Challenge Organizers

Created by Robert Allaway (@allawayr)
Sorry for the confusion, I'm posting from my phone and markdown reformatted the post. 120 patients x 4 images on the leaderboard and 10 x 4 images on the fast lane.
@allawayr 1204 sets of 4 images or 1204 images (so 301 sets)? Each score corresponds to a set of 4.
Hi Lars, I'm not at a computer right now so cannot check the exact numbers but I think it is about 120*4 on the main queue and 10*4 on the fast lane queue.
Yes, we are fast-laning it. Maybe we need to add a printout on timing. How many image sets are in the test set?
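A minimal sketch of what such a timing printout could look like, assuming a hypothetical `score_image_set` function and the `/test` mount used by the challenge (one entry per patient):

```python
import os
import time

TEST_DIR = "/test"  # mounted test data, one entry per patient

def score_all(score_image_set):
    """Score every patient's image set and print how long each one takes."""
    start = time.time()
    patient_ids = sorted(os.listdir(TEST_DIR))
    for i, patient_id in enumerate(patient_ids, 1):
        t0 = time.time()
        score_image_set(patient_id)  # hypothetical: scores one set of 4 images
        print(f"[{i}/{len(patient_ids)}] {patient_id}: {time.time() - t0:.1f}s "
              f"(total {time.time() - start:.0f}s)", flush=True)
```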
Hi Lars, yes - my last post was in response to Nc717. The main queue has a time limit of 12h, the fast lane 4h, before a job times out. But it doesn't seem like yours should time out if the previous version of your model only took 15 minutes. I would suggest submitting both the old (working) and the new (timing-out) model to the fast lane. I assume the one of yours that is running right now is the new model with the timeout issue? Cheers, Robert
That's someone else. @allawayr, ours is timing out. Can you remind us how fast we need to be for how many images in the test set?

```
slurmstepd: error: *** JOB 4706301 ON c0101 CANCELLED AT 2020-05-02T05:16:51 DUE TO TIME LIMIT ***
```
Hi there,

Check out the bottom 6 lines of your std_err log file; there's an error message related to how you are reading in the test data:

```
Traceback (most recent call last):
  File "/usr/local/bin/src/scoring.py", line 236, in <module>
    image_dict = load_images_from_test_folder(test_path, filenames, 300)
  File "/usr/local/bin/src/scoring.py", line 61, in load_images_from_test_folder
    if file.split(".")[0].split("-")[1] == 'LF':
IndexError: list index out of range
```

Best, Robert
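For context, that `IndexError` is raised when a filename contains no '-', so `split("-")[1]` has nothing to index. A defensive sketch of the filename check, assuming image names of the form `<patient>-<joint>.<ext>` (e.g. `UAB001-LF.jpg`); the exact naming scheme and the helper below are illustrative, not the challenge's scoring code:

```python
import os

def find_left_foot_images(test_path):
    """Return paths of files whose name stem ends in '-LF', skipping any file
    that does not match the <patient>-<joint>.<ext> pattern instead of crashing."""
    matches = []
    for file in os.listdir(test_path):
        parts = file.split(".")[0].split("-")
        if len(parts) < 2:        # no '-' in the name: not an expected image file
            continue
        if parts[1] == "LF":
            matches.append(os.path.join(test_path, file))
    return matches
```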
Hi @allawayr @stadlerm,

Thank you for your help! I was able to load the models and the images from the test folder; however, I got a workflow failed error after submitting the docker repository on the fast lane. The logs for the error are below.

```
INFO: Could not find any nv binaries on this host!
2020-05-02 14:33:35.397029: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2020-05-02 14:33:35.399776: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-02 14:33:35.399823: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: c0102
2020-05-02 14:33:35.399838: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: c0102
2020-05-02 14:33:35.399909: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2020-05-02 14:33:35.399956: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2020-05-02 14:33:35.400162: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-05-02 14:33:35.408080: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399975000 Hz
2020-05-02 14:33:35.408664: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5555589f75a0 initialized for platform Host (this does not guarantee that XLA will be used). Devi...
```

I have set up my Dockerfile as below.

```
FROM continuumio/miniconda3@sha256:6c979670684d970f8ba934bf9b7bf42e77c30a22eb96af1f30a039b484719159

# Updating the image
RUN apt-get update -y
RUN apt-get install -y apt-transport-https
RUN apt-get install emacs -y

# Configure Python
ENV DEBIAN_FRONTEND=noninteractive
RUN conda install tensorflow-gpu
RUN pip install numpy
RUN pip install h5py
RUN pip install pandas
RUN pip install scikit-learn
RUN pip install scikit-image
RUN pip install ipywidgets
RUN pip install tqdm
RUN pip install opencv-python
RUN pip install keras
RUN pip install cudnnenv

# Required: Create /train /test and /output directories
COPY /run.sh /run.sh
RUN mkdir /train \
    && mkdir /test \
    && mkdir /output \
    && chmod 775 /run.sh

# Required: Python code
COPY /src /usr/local/bin/src
RUN chmod 755 -R /usr/local/bin/src
COPY /models /usr/local/bin/models
RUN chmod 755 -R /usr/local/bin/models

# This is for the virtualenv defined above, if not using a virtualenv, this is not necessary
RUN chmod 755 -R /root #to make virtualenv accessible to singularity user

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

Please help me fix this issue.

Thanks and Regards
Thank you @stadlerm for your suggestions to @Nc717. @Nc717 - I would also suggest some changes to your Dockerfile: you have 3 FROM statements, but Docker will only use the final one when building, so you can delete the first two (unless you are specifically following the multi-stage build process outlined here: https://docs.docker.com/develop/develop-images/multistage-build/). You also have several COPY and RUN statements repeated throughout; you can eliminate the duplicates. I'm also a little confused about why you install tensorflow and its dependencies, as the base container (`FROM tensorflow/tensorflow:nightly-gpu-py3`) already has them. In general, the simpler the Dockerfile, the easier it will be to debug. Finally, I think your run file is looking for `test` in whatever the working directory is because you are giving it a relative path, when in reality you need to give it an absolute path (`/test`). This is why you get the error `Error 2 [No file or directory test]`.
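To illustrate the relative-vs-absolute path point in Python (a sketch; the challenge mounts the test images at `/test`, so only the absolute form is safe regardless of the working directory):

```python
from pathlib import Path
import os

relative_path = Path("test")    # resolved against the current working directory;
                                # inside the container this is not guaranteed to contain 'test'
absolute_path = Path("/test")   # always points at the test data mounted by the challenge

patient_ids = os.listdir(absolute_path)
```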
Hi @lars.ericson,

> we got a new workflow failed message, but it doesn't look like it actually reran because the logs are time-stamped April 30.

The workflows did rerun - if you check the "modified on" timestamps, you can see when the log was most recently modified. The logs are reversioned periodically while the workflow runs, so you can go to "File Tools" -> "Version History" to see all of the individual versions, which are dated 5-02.

> The issue in the logs is "can't find any nv binaries on this host". Maybe you have hosts that have no GPUs? Anyway, it looks like the rerun just saw the old logs and immediately bailed. Can you try deleting the log files first, and also watch to make sure it goes to a host with GPUs?

This warning is a little misleading, unfortunately. It comes up because Singularity can't automatically find the binaries, since we are running on a cluster; we mount all of the necessary drivers in at runtime (which is why the configuration described in our example model is required to fully utilize the GPUs). The job is definitely running on a node with GPUs - all of the challenge containers run on a UAB pascalnode, so this is not assigned at random.

I would suggest trying this: submit the container that is not running on the main queue, as well as the last version that successfully returned scores to the leaderboard, to the fast lane queue. If you get "invalid" for both containers, that would suggest that something has unexpectedly changed on our end with the infrastructure. If it's only invalid for the new container (or only the new container times out), it's probably something different between the two containers that's causing the newer one to hang.
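If you want to confirm from inside your own container that a GPU is actually visible despite that warning, a small check at the start of the entrypoint script can help. This is a sketch, assuming a TensorFlow 2.x image (PyTorch submissions could use `torch.cuda.is_available()` instead):

```python
import tensorflow as tf

# Prints an empty list if TensorFlow cannot see a GPU; with the drivers
# mounted in at runtime as described above, at least one device should appear.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {gpus}", flush=True)
```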
Hi Neelambuj,

It will maybe depend on your run.sh - but here is what we do. I would move all the code into one directory and all the models you want to copy over into a different one, then copy these directories into the /bin directory of the image (see the excerpt from our Dockerfile below), instead of copying files one by one:

```
COPY run.sh /run.sh
RUN mkdir /train \
    && mkdir /test \
    && mkdir /output \
    && chmod 775 /run.sh

...

# Dir with python code
COPY /ra_joint_predictions /usr/local/bin/ra_joint_predictions
RUN chmod -R 775 /usr/local/bin/ra_joint_predictions

# Dir with resources aka models
COPY /resources /usr/local/bin/resources/
RUN chmod -R 775 /usr/local/bin/resources/
```

In run.sh we then simply set the env variables for CUDA and then call:

```
python /usr/local/bin/ra_joint_predictions/run_dream_predictions.py
```

What we found helped was to set a fixed path as our root, in Python:

```
# Change dir to the dir that you copied over that contains your code
os.chdir('/usr/local/bin/ra_joint_predictions/')
```

You can then use string paths relative to this dir to fetch everything:

```
# Access /train or /test:
os.listdir('/train')
os.listdir('/test')

# Access your trained models:
os.listdir('../resources')
```
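As an alternative to `os.chdir`, paths can also be anchored to the script's own location, so nothing depends on the working directory at all. This is a sketch of that variant, not part of the excerpt above:

```python
import os

# Directory containing this script, e.g. /usr/local/bin/ra_joint_predictions
CODE_DIR = os.path.dirname(os.path.abspath(__file__))

# Models copied to /usr/local/bin/resources in the Dockerfile excerpt above
RESOURCES_DIR = os.path.join(CODE_DIR, "..", "resources")

# The mounted challenge data always lives at absolute paths
TRAIN_DIR = "/train"
TEST_DIR = "/test"
```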
Hi @stadlerm @allawayr,

Thanks for the replies! I have been trying to submit my docker to the challenge after making the changes, but it gives me an error saying `Error 2 ["No directory test"]`. Please see below the Dockerfile I have prepared for building the container.

```
FROM python:latest
FROM continuumio/miniconda3@sha256:6c979670684d970f8ba934bf9b7bf42e77c30a22eb96af1f30a039b484719159
FROM tensorflow/tensorflow:nightly-gpu-py3

# Updating the image
RUN apt-get update -y
RUN apt-get install -y apt-transport-https
RUN apt-get install emacs -y
RUN apt install nvidia-cuda-toolkit -y

# Configure Python
#ENV DEBIAN_FRONTEND=noninteractive
#RUN conda install tensorflow-gpu
#RUN apt install python3-opencv -y
#RUN conda install -c conda-forge opencv
RUN pip install tensorflow-gpu
RUN pip install numpy
RUN pip install h5py
RUN pip install pandas
RUN pip install scikit-learn
RUN pip install scikit-image
RUN pip install ipywidgets
RUN pip install tqdm
RUN pip install opencv-python
RUN pip install keras
RUN pip install cudnnenv

# Required: Create /train /test and /output directories
RUN mkdir /train
RUN mkdir /test
RUN mkdir /output

# Required: trained models
COPY /foot_eroison_model_best.h5 /foot_eroison_model_best.h5
COPY /foot_narrowing_model_best.h5 /foot_narrowing_model_best.h5
COPY /hand_erosion_model_best.h5 /hand_erosion_model_best.h5
COPY hand_narrowing_model_best.h5 /hand_narrowing_model_best.h5

# Main program
COPY /run.sh /run.sh

# Required: code
COPY /ra_scoring.py /ra_scoring.py

# Make model and run files executable
RUN chmod 755 -R /test
RUN chmod 755 -R /train

# Required: trained models
COPY /foot_eroison_model_best.h5 /foot_eroison_model_best.h5
COPY /foot_narrowing_model_best.h5 /foot_narrowing_model_best.h5
COPY /hand_erosion_model_best.h5 /hand_erosion_model_best.h5
COPY hand_narrowing_model_best.h5 /hand_narrowing_model_best.h5

# Main program
COPY /run.sh /run.sh

# Required: code
COPY /ra_scoring.py /ra_scoring.py

# Make model and run files executable
RUN chmod 755 -R /test
RUN chmod 755 -R /train
RUN chmod 755 -R /output
RUN chmod 775 /run.sh
RUN chmod 755 /ra_scoring.py

# This is for the virtualenv defined above, if not using a virtualenv, this is not necessary
RUN chmod 755 -R /root #to make virtualenv accessible to singularity user

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

Also, in the ra_scoring.py file I declare the test path using the code below.

```
from pathlib import Path

test_path = Path('test')

# And later use this to load the patient ids present in the test folder:
patient_ids = os.listdir(test_path)
```

But when I submit my docker container, I get the error `Error 2 [No file or directory test]`. Isn't this the correct way to define the test path, or is the directory not being created by my container? Please help me fix this issue.

Thanks,
Neelambuj
@allawayr we got a new workflow failed message, but it doesn't look like it actually reran because the logs are time-stamped April 30. The issue in the logs is "can't find any nv binaries on this host". Maybe you have hosts that have no GPUs? Anyway, it looks like the rerun just saw the old logs and immediately bailed. Can you try deleting the log files first, and also watch to make sure it goes to a host with GPUs?
Hi Lars, We've been a little tied up today; this is now rerunning. Fingers crossed you are right and that a simple restart does the trick! best, Robert
Thank you @allawayr - how do we get this job restarted?
We can re-run this and see what happens. It's simply restarting the submission, so it won't count against the current quota.
@allawayr a queue is confused when it takes 12 hours to run a 15-minute job. Given the amount of queue backup and hiccups documented on this thread, it is reasonable to assume that the queue simply got wedged, and that the job should be restarted and count against the April 30 quota.
Can you clarify what you mean by this: "This is most likely due to queue confusion." ?
@allawayr nothing has changed in our process. Run times should be comparable to prior runs, so more like 15 minutes than 12 hours. This is most likely due to queue confusion. Can we get another April 30 run?
@lars.ericson - Looking at the time stamps on your std_out logfile, I think your model hit the 12-hour run time limit (this is not a time limit on the whole pipeline, only on the job that actually runs the container, so it isn't affected by time spent waiting in the queue), but I'm not entirely sure. We'll look into this more.
@Nc717, @stadlerm is likely correct that this is your issue - you need to build the container with permissions such that "we" - the non-root user - can execute your run files (that's the "5" in chmod 775). https://chmodcommand.com/chmod-775/
@Nc717 When you build your docker image, you need to make sure to chmod everything you copy in. Similar to how the demo uses

```
chmod 775 /run.sh
```

you need to put

```
COPY /my_folder_with_python_scripts /usr/local/bin/my_folder_with_python_scripts
RUN chmod -R 775 /usr/local/bin/my_folder_with_python_scripts
```
@allawayr, we got a Workflow Failed message on submission ID 9703695. There is no score on the leaderboard. The log is here: https://www.synapse.org/#!Synapse:syn22000855 It shows:

```
INFO: Could not find any nv binaries on this host!
/opt/conda/envs/fastai/lib/python3.7/site-packages/torch/serialization.py:593: SourceChangeWarning: source code of class 'torch.nn.modules.loss.MSELoss' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/opt/conda/envs/fastai/lib/python3.7/site-packages/torch/serialization.py:593: SourceChangeWarning: source code of class 'torch.nn.modules.container.Sequential' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/opt/conda/envs/fastai/lib/python3.7/site-packages/torch/serialization.py:593: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
```

Can you take a look?
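Those `SourceChangeWarning` lines are warnings rather than errors: they appear because the container loads whole pickled `nn.Module` objects whose class source has changed since saving, and they are not necessarily what failed the workflow. One common way to silence them is to save and load a `state_dict` instead of the full module; a sketch with a hypothetical model class:

```python
import torch
from my_model import MyNet  # hypothetical model definition

# Saving (done once, before building the container)
model = MyNet()
torch.save(model.state_dict(), "mynet_state.pth")

# Loading inside the container: rebuild the architecture, then load the weights
model = MyNet()
model.load_state_dict(torch.load("mynet_state.pth", map_location="cpu"))
model.eval()
```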
Hi @allawayr, I uploaded my docker image to the RA fast lane, but I get an error that the workflow has failed. In the logs, it says python cannot open my scoring.py file (`Error 13: [Permission denied]`); can you please help me solve this issue? Also, I do not see the RA main challenge in the dropdown to submit the docker repository to - is that an issue? Thanks, Neelambuj
Yes, it's the same container - I had the same thing happen with my previous submission: 9703401 (failed), then right after, the same container (9703406) passed. And yes, if this run passes, it should be our third submission this quota.
Interesting. It's not clear to me why that run failed. Is 9703700 the same container as 9703694? If so, let's see if 9703700 completes. If it does not, we can dig into why this is happening some more. Looking at the number of valid submissions you have from the round that just ended, I believe that 9703700 was your final submission in the quota - let me know if I have misunderstood something! Thanks, Robert
Okay, so that one is queued up - I submitted one earlier (9703694); this one just ran for a while, then stopped without any error in any of the files. I had a similar thing happen before, but after requeueing it worked.
Hi there, are you talking about submission 9703700? If so, that is waiting in line. We have 7 submissions in the queue at the moment: 4 being processed and 3 in a holding ("RECEIVED") state (9703700 is one of those three).
@allawayr I submitted an image, but it failed without giving any error message in any log - now it says we've reached the quota - can you have a look? thank you
It counts against the submission quota at the time of submission, _not_ when the results are returned. Cheers, Robert
@allawayr If we submit today, but it gets stuck in the queue until after the uploads reset, will the submission count for this week or for the week after?

4/29 Submission queues backed up