Submitting a locally working docker is invalid

500 Server Error for http+docker://localhost/v1.41/containers/deda77eb9ce9f22ad66fbbd0d25dabefdafd1b11c9542e53e0d20188909c4767/start: Internal Server Error ("OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: GPU-051b105e-9c6b-0b43-5fe3-ac02ab9aa5ae: unknown device: unknown")

Above is the error message I received after submitting. The problem seems to be device related, and I need some help. My Docker image runs fine locally.

Created by anonymous participant houqingfan
@vchung , Thank you for your help, this link is very useful.
@houqingfan , Last year, some participants experienced [similar issues](https://www.synapse.org/#!Synapse:syn25829067/discussion/threadId=8379) as you -- perhaps their approaches can help?
@houqingfan , I forwarded your extension request to the Challenge Organizers and they have agreed to grant you an extension until this Thursday, 08/04, 23:59 Eastern Time. Hopefully this helps!
@vchung , My Docker image works fine locally, but I don't know why I'm getting so many errors after uploading it to the server. I'm trying to get the uploaded image to work, but it is very likely that I will not finish before the deadline. Can I request an extension for my submission?
@vchung , I can submit now, thanks.
@houqingfan , Apologies, please try again.
@vchung , When I try to submit the Docker image again, the system rejects my submission and says Elongation has reached the submission quota. What can I do about that?
@houqingfan , Great, I have closed both submission IDs 9723440 and 9723444. You should be able to submit again. Let me know if not!
@vchung , Please pause them, and again, thank you very much for your help.
@houqingfan , Of course, best of luck! P.S. I checked the submissions board and it looks like the two you submitted this morning are still running. Would you like me to pause those? Otherwise, you will not be able to submit an updated image until they are done evaluating and come back as INVALID.
@vchung , Thank you for your help. The result is not what I expected. If you are sure you used `--gpus` in the command, then I will rebuild the Docker image with CUDA 11.4. Thank you very much for your help again.
@houqingfan , I started running your `pytorch_diff` image about an hour ago and it remained hanging with the following message:

```bash
...
...
(4, 137, 190, 140)
This worker has ended successfully, no errors to report
```

I checked the `output` folder and it remains empty. I'm not sure if this is expected?

```bash
CONTAINER ID   IMAGE                                            COMMAND                  CREATED          STATUS          PORTS     NAMES
0b6f8e24153b   docker.synapse.org/syn32070240/pytorch_diff:v1   "/bin/sh -c /houqing…"   58 minutes ago   Up 57 minutes             bold_booth
```

Let me know.
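For anyone hitting the same symptom (the container keeps running while the `output` folder stays empty), a small check like the one below may help when testing locally. This is only a sketch: `output_is_empty` is a hypothetical helper name, the 2-hour limit is an arbitrary example, and a directory that does not exist is treated the same as an empty one.

```shell
# output_is_empty DIR: succeeds (exit 0) when DIR contains no entries,
# including hidden files. A missing DIR is also reported as empty.
output_is_empty() {
  [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Example usage (not executed here; paths and the image are placeholders):
#   timeout 2h docker run --rm --network none --runtime=nvidia \
#     -v /path/to/input:/input:ro -v /path/to/output:/output:rw {docker image}
#   echo "exit status: $?"   # GNU timeout exits with 124 if the limit was hit
#   output_is_empty /path/to/output && echo "no predictions were written"
```

Bounding the run with `timeout` makes a hang show up as a clear exit status instead of a container that has to be stopped by hand.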
@vchung , Can you please use the following command to test my Docker image?

```
docker run --rm \
  --network none \
  --gpus all \
  -v /path/to/input:/input:ro \
  -v /path/to/output:/output:rw \
  docker.synapse.org/syn32070240/pytorch_nndiff_final:v1
```

I still think the problem lies in how the GPU is mounted. Please tell me the result. Thanks in advance!
@houqingfan , Thank you for sharing your debugging process. Perhaps there is an incompatibility issue between the version of PyTorch used and the machine's CUDA toolkit? If it helps, this is what is available to you:

```bash
$ nvidia-smi | head -n4
Tue Aug  2 17:34:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
```

Another option is to use an `nvidia/cuda` image as the base image of your Dockerfile, rather than `ubuntu`, e.g.

```dockerfile
FROM nvidia/cuda:11.4.1-base-ubuntu20.04
```

Hope this helps!
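One way to tell a GPU-visibility problem apart from a PyTorch/CUDA version mismatch is to fail fast at container start when no driver is reachable. The sketch below assumes it would be called from the image's entrypoint script; `check_gpu` is a hypothetical helper name, and whether `nvidia-smi` is present inside the container depends on the base image and on the NVIDIA runtime being used.

```shell
# check_gpu: print GPU name and driver version if nvidia-smi is reachable,
# otherwise explain the likely cause on stderr and return non-zero.
check_gpu() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the container probably has no GPU access" >&2
    return 1
  fi
  # One CSV line per visible GPU, e.g. "<name>, <driver version>"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# Example usage in an entrypoint script (not executed here):
#   check_gpu || exit 1
```

If this check passes but PyTorch still reports that CUDA is not available, the mismatch between the PyTorch build and the driver's supported CUDA version becomes the more likely suspect.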
@vchung , I tried running locally without the `--gpus` flag and got the following error:

```
/opt/conda/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
debug: mirroring True
mirror_axes (0, 1, 2)
/opt/conda/lib/python3.7/site-packages/torch/autocast_mode.py:141: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
```

This error is exactly the same as the one returned by submission ID 9723444. Can you please test my Docker image? The command I am running locally is:

```
docker run -it --rm --network none --gpus device=0 --name nnunet -v "/root/input/":"/input" -v "/root/output/":"/output" docker.synapse.org/syn32070240/pytorch_nndiff_final:v1
```
Checking the log again, I found this error message:

```
/opt/conda/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
```
Thanks, vchung. If you have time, please run submission 9723444 for me. The command I am running locally is:

```
docker run -it --rm --network none --gpus device=0 --name nnunet -v "/root/input/":"/input" -v "/root/output/":"/output" docker.synapse.org/syn32070240/pytorch_diff:v1
```

Everything works fine locally, but I get an error after submitting:

```
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
```

I've tried rebuilding the Docker image, and the problem is still not solved.
@houqingfan , Can you share what your Docker run command is? The one used by the evaluation workflow is something like this:

```bash
docker run --rm \
  --network none \
  --runtime="nvidia" \
  -v /path/to/input:/input:ro \
  -v /path/to/output:/output:rw \
  {docker image}
```

I tried manually running your Docker image from submission ID 9723416 with this run command, and received the same error you noted above.
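Since the local command uses `--gpus device=0` while the workflow uses `--runtime="nvidia"`, it may be worth reproducing the workflow's flags locally before resubmitting. The wrapper below is a rough sketch built only from the command quoted above; `run_like_evaluator` and the `DRY_RUN` variable are hypothetical names, and the real workflow may pass options that are not shown in this thread.

```shell
# run_like_evaluator IMAGE INPUT_DIR OUTPUT_DIR
# Runs IMAGE with the flags quoted for the evaluation workflow.
# With DRY_RUN=1 the command is printed instead of executed.
run_like_evaluator() {
  image="$1"; input_dir="$2"; output_dir="$3"
  set -- docker run --rm --network none --runtime=nvidia \
    -v "${input_dir}:/input:ro" \
    -v "${output_dir}:/output:rw" \
    "$image"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example usage (not executed here; paths and the image are placeholders):
#   run_like_evaluator {docker image} /path/to/input /path/to/output
```

Note that the input folder is mounted read-only and networking is disabled, so an image that writes into `/input` or downloads weights at start-up would also behave differently from a plain local run.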
Thanks for your reply. I didn't specify any GPU in my code, and I tried running it on different GPUs (3090, 2080 Ti, and 1070); the Docker image worked fine on each of them.
@houqingfan , Based on the error message, my assumption is that a specific GPU device is requested but is not available. In your source code or Dockerfile, do you specify a GPU device ID?
