Tensorflow example kept running out of memory
Created by Li Shen

I tried to run the tensorflow example code with the following configurations, but each run was terminated because it ran out of memory:
Google net with batch size=100
Google net with batch size=10
Alex net with batch size=10
The Alex net is the second-to-smallest of the four example neural nets, and a batch size of 10 is small. Looking through the log file, I found that /gpu:0 is almost full while /gpu:1 is basically empty. Could that be the cause of the "out of memory" error? Can the organizers suggest a good way to avoid this? Ideally it should be automatic. If we have to manually specify which GPU to use, is there any way to look at GPU usage before we submit jobs?
I have uploaded one log file to my Dropbox and the link is here: [example log file Alex net](https://www.dropbox.com/s/tst4cmyfdz22jdd/example%20log%20Alex%20net.txt?dl=0)
Please help me diagnose this problem. Thanks!
Hey guys,
The way TensorFlow works (at least with the way it's set up in my code) is that it will virtually reserve all of the vRAM on any GPU it has processes running on. It won't actually use all of that vRAM, but it will still "reserve" it in some sense. Thus, it can be a bit challenging to figure out exactly how much you can increase your batch size. The easiest way is to keep increasing your batch size exponentially until things stop working; that will give you a general sense of the upper limit of your memory consumption.
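If you would rather have TensorFlow allocate GPU memory on demand instead of reserving the whole card up front, one option is to pass a `ConfigProto` with `allow_growth` enabled. This is only a minimal sketch, and it assumes the example code constructs its own `tf.Session` with the TF 1.x-style API:

```python
import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory as needed instead of reserving the whole card up front.
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of the card TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

sess = tf.Session(config=config)
```

Note that `allow_growth` only changes how memory is reserved; if the model genuinely needs more than the card's 12GB at the chosen batch size, it will still fail.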
- Darvin

Li, I have to defer to @darvinyi, who has more insight into TensorFlow than I do.

Hi Bruce,
That's strange because one of my GPUs is always occupied with less than 400MB free out of 12GB. Do you know the reason?
Li > Do we share one single NVIDIA card or not?
Your job does **NOT** share an NVIDIA card. Each job has exclusive access to one K80 card (2 GPUs) during its execution, along with 200GB RAM and 24 CPU cores.

I find the same problem persists. It would help if the example Tensorflow model were able to identify a free GPU before using it. Do we share one single NVIDIA card or not? Based on my initial reading of the Wiki, I thought we each had our own GPU-based compute node, but that does not seem to be the case. If we all use the same GPU card, it will be very easy to run out of memory. It would be better to have some code that can automatically look for a GPU that still has available memory...
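One way to do this today, without waiting for changes to the example model, is to ask `nvidia-smi` which GPU has the most free memory and pin the process to it via `CUDA_VISIBLE_DEVICES`. The sketch below is illustrative only (the helper `pick_freest_gpu` is not part of the challenge code), and it must run before TensorFlow or Theano initializes CUDA:

```python
import os
import subprocess

def pick_freest_gpu():
    """Return the index (as a string) of the GPU with the most free memory."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"]).decode()
    # Each line looks like "0, 11439" (index, free MiB).
    gpus = [line.split(",") for line in out.strip().splitlines()]
    return max(gpus, key=lambda g: int(g[1]))[0].strip()

# Must be set before TensorFlow (or Theano) touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = pick_freest_gpu()
```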
I think it cannot be run because somehow all the jobs are on one single node. I have this guess because I previously had a Docker image that worked just fine, and now it has stopped working, saying out of memory, cannot find GPU, etc.
I am not even running any model. I am only reading a single image into Theano, and it says out of memory...
That Docker image ran absolutely fine on the second day after the open phase opened, and I didn't touch it for 20 days; now it is out of memory.

Hmmmmm,
I believe GoogLeNet should easily run with a batch size of 50, and AlexNet should run with a batch size of 100+. Are you keeping the matrix size at 224?
Either way, an updated version of the tensorflow example model has been pushed (to both the Synapse Docker repository and the linked .git). I tested this recently, and it runs GoogLeNet with a batch size of 50 just fine, guaranteed. One thing to note, however, is that the .tsv metadata file parser within the program has been set up for the data-split submission queue. I believe the training data submission queue might have slightly different .tsv files. If that isn't the case and the .tsv files are all formatted the same way, everything should work fine.
Finally, yes, currently I'm only using /gpu:0. For a quick default example that participants can look at to see how to interact with Docker, I thought putting in multiple-GPU access would be out of scope for "starter code" (see the sketch after this post for one way to use both GPUs).
Sincerely,
Darvin

I got the exact same error when I tried to run the tensorflow code given in the demo. I kept getting out-of-memory errors.
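Following up on Darvin's note that the starter code only uses /gpu:0: for anyone who wants to try both GPUs on the K80, here is a minimal, hedged TF 1.x-style sketch of placing ops on each device with `tf.device`. It is not part of the example model, and the ops shown are placeholders for illustration:

```python
import tensorflow as tf

# Put one chunk of work on each GPU of the K80.
with tf.device('/gpu:0'):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)

with tf.device('/gpu:1'):
    c = tf.random_normal([1024, 1024])
    d = tf.matmul(c, c)

# allow_soft_placement falls back to CPU if an op has no GPU kernel;
# log_device_placement prints where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run([b, d]))
```

In practice, splitting one model across the two GPUs usually means building one tower of the model per device and combining their gradients; this sketch only shows the device-placement mechanism.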