I'm debugging my Docker image and one issue I found is that my PyTorch build assumes CUDA 10.2 while my Docker host machine for testing is running CUDA 10.1. @allawayr do you think this will be an issue, and if so, should I tune my Docker image to 10.1 or 10.2? I'm asking because it seems to be a property of the host rather than of the image itself, i.e. the image can only use whatever drivers the host actually has installed.
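For reference, the mismatch is easy to see side by side (a rough sketch; the first command runs on the host, the second inside the container, and the output format varies by driver version):

```
# Host: the driver reports the highest CUDA version it supports (10.1 in this case).
nvidia-smi | head -n 4
# Container: the CUDA version the installed PyTorch wheel was built against (10.2 by default now).
python3 -c "import torch; print('PyTorch built for CUDA', torch.version.cuda)"
```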

Created by Lars Ericson
Well @allawayr, my MONAI challenge model failed to impress the scoring engine, so it's all up to @ikedim now; he has the next and last 2 shots. For the record, overtraining column by column to reproduce the train.csv spreadsheet does not produce optimal results. My bad! Isaac is doing something much fancier, though, so we've still got 2 shots to take the game. The idea I would have tried with infinite time is a 3D model of the hand capable of reproducing different levels of arthritis: find the parameters of the hand model that match the X-ray and read those out as the solution.
Glad to hear that it helped! I agree AWS is another instance of specialization, but we could ensure the instance was configured in such a way that it used a relatively standard plug-and-play setup with, say, CUDA 10.1 drivers (i.e. so that you, on your Ubuntu machine with 10.1 drivers, could easily run a container from this challenge using Docker). Because we are running this on a cluster, where the bulk of the configuration is out of our hands and is designed for general cluster use rather than the specific "run a GPU container" use-case, we have to deal with some unique configuration constraints that are not super typical. In other words, I think if we were running this on a simple Ubuntu instance (AWS, GCP, or just a Linux server), we'd be able to abstract this configuration away. Thanks for the offer. My sense is that we'd get the same response you got! Cheers, Robert
@allawayr that one worked, finally. Thanks for your help! Using AWS instances is just another instance of specialization. The goal would be to ask NVIDIA to compare and contrast the NVIDIA driver and Docker setup on the Singularity cluster as a host versus what I get on my Ubuntu 19.04 PC with NVIDIA 440 drivers and CUDA 10.1 as a host, and ask what could be abstracted away so that a Docker image with my original simple run.sh file would work equally well on both hosts, i.e. what is missing from the abstraction. NVIDIA is relatively new to the Docker world: first there was nvidia-docker, and then with Docker 19.03 the --gpus all flag, which apparently makes nvidia-docker obsolete (see the sketch below). So this would be just one more level of abstraction that they can think about improving so we don't have to. If you want, I can ping some of my NVIDIA LinkedIn contacts and see if they get back to me; I would probably get a higher-level response than calling the help desk. On the other hand, I've found that on matters like getting NVIDIA to support leading-edge projects on Windows, sometimes even with better contacts I got back a big *shrug*. I don't remember the name of the package I had that experience with... because I didn't use it. MONAI works fine on both Windows and Linux, so maybe they are getting better in that regard, or maybe Python and PyTorch are just getting better on their own at abstracting the NVIDIA sausage factory away from users.
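For context, the two invocation styles I mean (a hedged sketch; the image tag is only an example, and the --gpus flag requires the NVIDIA container toolkit on the host):

```
# Older wrapper, pre-Docker 19.03:
nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
# Docker 19.03+ built-in flag, which is what my scripts use:
docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
```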
Fingers crossed that works! Agreed, though it's not entirely the fault of Docker or NVIDIA. UAB graciously provided us with access to their cluster for this challenge, which (as would be the case with many other clusters) has a relatively complicated and general-purpose setup that requires us to use Singularity (for security reasons) to build and run the Docker containers as Singularity images, and then we need to point these containers to specific drivers to get everything running smoothly. If we were using AWS instances for this challenge we could have a much more specifically defined infrastructure setup, but, of course, that's expensive for several months of continuous uptime (and we'd rather be allocating money towards prizes than compute :) ). You bring up a good point, though - similar thoughts have crossed my mind in the past few weeks - we would want to take the top-performing containers after the challenge and reverse-engineer out these specific paths so that they are easier to run for those without access to the UAB Cheaha platform. This would not be very difficult.
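For the curious, the general shape of what happens on the cluster side is roughly the following (an illustrative sketch only; the bind paths and image tag are placeholders, not the actual Cheaha job configuration):

```
# Convert the submitted Docker image to a Singularity image file
singularity pull model.sif docker://docker.synapse.org/syn21478998/your_submission:tag
# Run it with --nv so the host NVIDIA driver libraries are mapped in,
# binding the challenge directories to the paths the container expects
singularity run --nv \
    --bind /data/test:/test:ro,/data/train:/train:ro,/data/output:/output \
    model.sif
```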
@allawayr I am rebuilding the Docker image now with the updated run.sh file and will submit to FastLane shortly. Note that these lines would not work on my Ubuntu host machine. This is a bit of a failure on Docker's part, as the promise of Docker is that the image should be largely independent of the host machine. When you join Docker to NVIDIA, that model breaks down because the Docker image has to be closely tailored to the setup of the host machine. It could be worth having a dialogue with NVIDIA about this particular example, except that if this works it's probably not worth the bother.
@allawayr above post updated with full details.
Hi Lars, Thanks for providing that. I think the missing item is defining the CUDA driver paths so that your container can use the GPU. Try this for your run.sh:

```
#!/bin/bash
######
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
PATH=${CUDA_HOME}/bin:${PATH}
export PATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
######
python3 /score.py
```

I'm working on putting together a working example for PyTorch, and will update this thread when I have it! Best, Robert
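As a quick sanity check (purely optional, and assuming the node your container lands on has those paths mounted), you can confirm the exports point at real libraries before score.py runs:

```
# Optional verification, not part of the required run.sh:
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
ls ${CUDA_HOME}/lib64 2>/dev/null | head || echo "CUDA_HOME libraries not found"
```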
Hi @allawayr, the run.sh just says:

```
#!/bin/bash
python3 /score.py
```

You can see in the error log that the code starts running fine. It only breaks when it gets to where it needs the GPU. I have given you download permission to the Docker image: https://www.synapse.org/#!Synapse:syn22096459 If you have time maybe you can pull the image to a 10.1 machine and give it a try. I interact with the image locally as follows:

```
export STAGE=/home/catskills/Desktop/ra2
sudo docker run -it --gpus all \
    -v $STAGE/test:/test:ro \
    -v $STAGE/train:/train:ro \
    -v $STAGE/output:/output \
    --entrypoint=/bin/bash \
    docker.synapse.org/syn21478998/monai_one_model_per_column_singleshot:003
```

You can set up a similar script, run it, and then inside the image shell, just type:

```
/run.sh
```

The $STAGE/output directory needs to be set up with one or more patient image sets. If it helps to debug, I am using the MONAI package from this GitHub: https://github.com/Project-MONAI/MONAI I have a copy of the ZIP file for the GitHub and I install it locally inside the Dockerfile. Above I omitted a few lines of the Dockerfile. The full Dockerfile is as follows:

```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
WORKDIR /

# Required: trained models
add submodels /submodels

# Configure Python
ENV DEBIAN_FRONTEND=noninteractive
RUN pip install pandas
RUN pip install scipy
RUN pip install scikit-learn
RUN pip install scikit-image

# Required: Create /train /test and /output directories
RUN mkdir /train
RUN mkdir /test
RUN mkdir /output

# Main program
COPY run.sh /run.sh

# Required: code
COPY MedNISTDataset.py /MedNISTDataset.py
COPY normalize_to_bw_256x256.py /normalize_to_bw_256x256.py
COPY ra2_inference.py /ra2_inference.py
COPY score.py /score.py
COPY MONAI-master.zip /MONAI-master.zip
RUN apt-get update
RUN apt-get install unzip
RUN unzip /MONAI-master.zip && cd /MONAI-master && python3 setup.py install

# Make model and runfiles executable
RUN chmod 775 /run.sh

# This is for the virtualenv defined above, if not using a virtualenv, this is not necessary
RUN chmod 755 /root #to make virtualenv accessible to singularity user

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```
Can you share your run.sh as well?
Hi @allawayr I'm still failing even with that tag. This Docker image ran fine on my computer under Docker with CUDA 10.1: https://www.synapse.org/#!Synapse:syn22096459 tag 003, a/k/a docker.synapse.org/syn21478998/monai_one_model_per_column_singleshot:003. Here is my Dockerfile:

```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
WORKDIR /

# Required: trained models
add submodels /submodels

# Configure Python
ENV DEBIAN_FRONTEND=noninteractive
RUN pip install pandas
RUN pip install scipy
RUN pip install scikit-learn
RUN pip install scikit-image

# Required: Create /train /test and /output directories
RUN mkdir /train
RUN mkdir /test
RUN mkdir /output

# Main program
COPY run.sh /run.sh

# Required: code
COPY score.py /score.py

# Make model and runfiles executable
RUN chmod 775 /run.sh

# This is for the virtualenv defined above, if not using a virtualenv, this is not necessary
RUN chmod 755 /root #to make virtualenv accessible to singularity user

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

Here is my run: https://www.synapse.org/#!Synapse:syn22098498 You can see the Blue Screen of Death here in 9704562_stderr.txt:

```
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 54, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
```

Could it be that FastLane is not provisioned with a GPU? Should I add a call to nvidia-smi in /run.sh? This error, though, seems pretty specific that there is no GPU available. Just to double check I will throw this in the main queue. This should count as one of Team Shirin's 3 shots. Isaac gets the next shot and then shot 3 is the best of those 2. Of course, if it fails on the main queue I really need help, because the config is vanilla and double-checked on an Ubuntu 19.04 host with CUDA 10.1 and a 2080ti GPU installed.
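Concretely, the diagnostic version of /run.sh I have in mind (just a sketch; it assumes nvidia-smi is on the PATH inside the container) would be:

```
#!/bin/bash
# Report GPU visibility before scoring, so the stderr log shows whether a driver is present.
nvidia-smi || echo "WARNING: no NVIDIA driver visible inside the container"
python3 -c "import torch; print('torch.cuda.is_available():', torch.cuda.is_available())"
python3 /score.py
```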
Thanks @allawayr that's perfect, your file has the magic base layer:

```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
```
Hi Lars, Check out the Dockerfile and run.sh I posted in this thread: https://www.synapse.org/#!Synapse:syn20545111/discussion/threadId=7027&replyId=22385 I wrote it specifically to test the GPU configuration, not to actually output a valid prediction file, so it needs some additional modification to have the requisite /input and /output directories and an actual run file. It should be relatively easy to plug those in as described in the demo repository: https://github.com/allaway/ra2-docker-demo Hope this is at least semi-helpful. Cheers, Robert
Hi @allawayr, my colleague Isaac and I are working on two different models for the final shot. Isaac's builds are debugged. I am trying out a PyTorch package. PyTorch now ships out of the box against CUDA 10.2, so I have to be careful to get a CUDA 10.1 PyTorch setup, otherwise it breaks. I'm sorting that out now; I'm in Dockerfile hell but hope to get out shortly. On the other hand, if you have a PyTorch Dockerfile template for Synapse that you can share, that would be great. Isaac is using FastAI, which is different.
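The route I'm trying is to pin the CUDA 10.1 builds explicitly instead of taking the default 10.2 wheels (the version pins below are my best guess at the matching cu101 pair, not something the challenge requires):

```
# Inside the Dockerfile this would be a RUN line; shown here as the bare command.
pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 \
    -f https://download.pytorch.org/whl/torch_stable.html
```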
We are using 10.1. I don't believe it will be an issue. My experience with debugging participants' PyTorch containers for this challenge is that if PyTorch cannot successfully register and utilize the GPU, it stops working entirely (as opposed to TensorFlow, which seems to fall back to the CPU when the GPU cannot be registered - I don't think the outcome is appreciably different, but it takes much, much longer to run). Since your containers seem to be working, I imagine PyTorch is using the GPU and working fine! Best, Robert
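If you want a PyTorch container to degrade the way TensorFlow does rather than crash, the usual pattern (a generic sketch, not a challenge requirement) is to select the device at runtime:

```
# Generic CPU-fallback pattern, run here as an inline Python snippet.
python3 - <<'EOF'
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)
# The model and input tensors would then be moved with .to(device).
EOF
```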

What CUDA version is Synapse running?