Hi @allawayr ,
We have a submission (ID=9704074) that appears to be hanging without making progress - the 9704074_stdout.txt file was last updated at 10am this morning. Can you help check this submission? Thanks!
Best,
Hongyang
Glad to hear that! Thanks a lot, Robert! Our two submissions finished in time and were scored.

Hi @arcanum449,

We worked with UAB to get a 48h time limit for leaderboard submissions. Unfortunately, if they take longer than that we will not be able to run them, but as you say, you expect a runtime of around 18h, so that should be fine.
I restarted submissions 9704114 and 9704074, which hit the 12h time limit, so hopefully they will make it through this time. We are talking with the sysadmins at UAB to see what the best solution is. The GPU nodes we are using right now have a time limit of 12 hours, so we are already at the max, but there might be other solutions that allow us to run containers for longer.
If possible, I would prefer to avoid changing the submitted container itself, since the round is technically closed, but we can explore that option if need be.

I see. My previous models were faster and did not have this time issue. Based on the log, I expect two of our submissions (IDs = 9704074 and 9704114) to finish in around 18h, since they ran on CPU. Submission 9704074 is the one that was restarted yesterday.
Is it possible to temporarily extend the time limit to, e.g., 24h or 48h? Or is it possible to simply add the CUDA environment variables to the run.sh script without changing other parts of my Docker image? (I have another submission that uses similar models and finished in 5h using the GPU.)

Hi there,
That information is outdated. As per the [FAQ](https://www.synapse.org/#!Synapse:syn20545111/wiki/599308) the current limit is 12h. This change was made because most models are far under 12h runtime, and this allowed us to get higher priority in the slurm queues (meaning that challenge submissions spent less time waiting to be run).
My question is: how long did your previous models take, and for the most recent model that you are having issues with, do you expect the runtime to be substantially longer?

Hi @allawayr,
The job that was restarted yesterday (ID=9704074) was cancelled due to a time limit:
```
slurmstepd: error: *** JOB 4786268 ON c0108 CANCELLED AT 2020-05-15T04:27:48 DUE TO TIME LIMIT ***
```
I remember you mentioned the time limit is 48 hours:
"Each run has access to an NVIDIA Tesla P100 with 16 GB of GPU-based memory in addition to 8 CPUs with 64 GB RAM. The maximum run time as currently configured is 48 hours per container."
Is there something wrong with the time limit setup?

Glad to hear that fixed the issue!
Best,
Robert
Got it. Once I put the environment variables into the **run.sh** script instead of the **Dockerfile**, the CUDA-related errors are gone. Thank you for your help!

In my preliminary testing before the launch, I wasn't always able to get the GPU working when I defined the environment variables in the Dockerfile (i.e., during the docker build). It worked much more reliably when I set them at run time, i.e., in the container's run.sh script, as in this example: https://github.com/allaway/ra2-docker-demo
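For future reference, here is a minimal sketch of that run-time approach (this is not the exact demo script; the CUDA/cuDNN paths below are the cluster-specific ones from the Dockerfile quoted later in this thread, and the final command is just a placeholder for your own entry point):
```
#!/bin/bash
# Export the CUDA environment inside the running container (at run time) rather
# than with "RUN export ..." in the Dockerfile; each RUN line runs in its own
# shell during the build, so those exports are not persisted into the image.
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64

# Placeholder: launch your own model script here.
python /train_and_predict.py
```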
The GPU errors can be a little confusing. The "Could not find any nv binaries" warning is thrown by Singularity and doesn't mean the container can't run on the GPU, but some of the other errors you list here do seem to indicate that the GPU is not being used.
One helpful way to check the configuration when you are using TensorFlow is to list the devices that TensorFlow can see...
In R: `tf$python$client$device_lib$list_local_devices()`
In Python: `tf.config.list_physical_devices('GPU')`
It should report something like this (this is R output; the Python command above will look a bit different):
```
[[1]]
name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3614398062345751017
[[2]]
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11017888978683523545
physical_device_desc: "device: XLA_CPU device"
[[3]]
name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15956161332
locality {
bus_id: 2
numa_node: 1
links {
}
}
incarnation: 15274366169888670892
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:84:00.0, compute capability: 6.0"
[[4]]
name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 2569370332260097634
physical_device_desc: "device: XLA_GPU device"
```
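If you want to catch this automatically, one option (just a sketch; adjust it to whatever is in your image) is to run a quick check near the top of run.sh and stop early when TensorFlow cannot see a GPU:
```
# Fail fast if TensorFlow cannot see any GPU, so a misconfigured container
# doesn't silently run on CPU for many hours.
python -c "
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', gpus)
assert gpus, 'No GPU visible to TensorFlow - check the CUDA environment variables'
"
```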
It's a little late for this to be useful in the Leaderboard round, but I hope this information helps you in the final round! The test queues will still be open then, so you can work with them to dial in your configuration.

Thanks for restarting it, @allawayr!
I also checked the stderr log file of the hanging submission, and I'm not sure whether the GPU was used in my submission:
```
INFO:    Could not find any nv binaries on this host!
2020-05-14 00:39:53.164727: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-14 00:39:53.167602: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400155000 Hz
2020-05-14 00:39:53.168085: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55555877ea80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-14 00:39:53.168123: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-05-14 00:39:53.170026: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2020-05-14 00:39:53.170058: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-14 00:39:53.170101: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: c0097
2020-05-14 00:39:53.170117: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: c0097
2020-05-14 00:39:53.170172: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2020-05-14 00:39:53.170218: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
```
Any ideas about this error?
For your information, I've already added the following lines to my Dockerfile:
```
RUN export CUDA_HOME=/cm/local/apps/cuda/libs/current
RUN export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
RUN PATH=${CUDA_HOME}/bin:${PATH}
RUN export PATH
RUN export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
RUN export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
```
We are restarting the submission on our end; hopefully this resolves the issue.
Cheers,
Robert
Yes, it passed the fastlane (ID=9704073).

Thanks for the heads up. Not sure what's going on here, as we've had lots of other submissions run through successfully.
We'll take a look into this. It might just need to be restarted.
Did this container work on the test queue?