Hi,
We have had some problems that seem to be GPU-related and would appreciate some guidance.
I created a docker image based on tensorflow/tensorflow:0.10.0-gpu but when I call
pip show tensorflow
it just seems to hang for a while and then silently quits. I'm trying to run a short test now with the CPU version of tensorflow, but that has been in the queue for 11 hours so far.
From the [wiki](https://www.synapse.org/#!Synapse:syn4224222/wiki/401759), you're setting an environment variable for us ($GPUS) so that we can select "our" 2 GPUs from the 4 on the machine. However, the variable also includes /dev/nvidiactl and /dev/nvidia-uvm. Do I include them in the CUDA_AVAILABLE_DEVICES environment variable that I must construct, or ignore them? Is that why our container is hanging? This isn't so important for preprocessing but will obviously be vital for training.
Cheers
Bob
Created by Bob Kemp BobK
> when I call pip show tensorflow it just seems to hang for a while and then silently quit
Sorry, I don't have an answer. The obvious question is whether it works when you run the container locally. In your follow-up comment you seem to say it *does*. I will ask a colleague who is more familiar with tensorflow than I am.
> that has been in the queue for 11hrs so far.
We had a long backlog of submissions and added more servers today to process them. The backlog has been cleared, so you should now get a quicker response.
> the variable also includes /dev/nvidiactl and /dev/nvidia-uvm.
That was unintentional and I have made a note to remove those devices. I'm not sure about "CUDA_AVAILABLE_DEVICES". It looks like the parameter is actually "CUDA_**VISIBLE**_DEVICES", which is set to a comma-separated list of integer GPU indices (for example, 0 or 0,1). I found some guidance here:
http://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter/37901914
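In case it helps, here is a minimal sketch of how the value might be built from $GPUS. It assumes that $GPUS holds device paths such as /dev/nvidia2 separated by spaces or commas (I have not confirmed the exact format), and that the number in each path is the index CUDA expects inside the container. The function name is hypothetical. Control devices such as /dev/nvidiactl and /dev/nvidia-uvm carry no GPU index, so the sketch skips them.

```python
import os
import re


def visible_devices_from_gpus(gpus_value):
    """Build a CUDA_VISIBLE_DEVICES string from a list of device paths.

    Assumption: gpus_value looks like
    "/dev/nvidia2 /dev/nvidia3 /dev/nvidiactl /dev/nvidia-uvm".
    """
    indices = []
    # Accept either whitespace or commas between entries.
    for token in re.split(r"[\s,]+", gpus_value.strip()):
        # Only /dev/nvidia<N> names a GPU; nvidiactl and nvidia-uvm do not match.
        match = re.fullmatch(r"/dev/nvidia(\d+)", token)
        if match:
            indices.append(match.group(1))
    return ",".join(indices)


if __name__ == "__main__":
    # Set the variable before tensorflow is imported, so CUDA sees it at startup.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_from_gpus(
        os.environ.get("GPUS", "")
    )
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that if $GPUS is unset, this produces an empty string, which hides every GPU from CUDA, so it may be worth checking the result before starting training.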
> Is that why our container is hanging?
If the state of your submission is "Pending" or "RECEIVED", it has nothing to do with your code; the job is simply waiting for an available server.
Hope these answers help.
BTW, I still seemed to get a problem even when I set CUDA_AVAILABLE_DEVICES to the empty string, so maybe I've misunderstood, but I can't think of anything else that might be causing it.
Obviously it all works fine in the same container on my laptop.