I kept running into out-of-memory problems when executing code on the GPUs. Here is part of the log I got from Synapse:

```
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:87:00.0
STDERR: Total memory: 11.25GiB
STDERR: Free memory: 378.90MiB
STDERR: W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6d9e1a0
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:88:00.0
STDERR: Total memory: 11.25GiB
STDERR: Free memory: 436.40MiB
```

As you can see, both GPUs had very little free memory left when they were detected. According to the organizers, each of us has a dedicated GPU card for training, so this should not happen. Is there a problem with the IT infrastructure?
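For anyone who wants their job to fail fast with a clear message instead of a cryptic allocator error later on, here is a minimal pre-flight check that could go at the top of a training script. It is only a sketch: it assumes `nvidia-smi` is available inside the container, and the 10 GiB threshold is an arbitrary illustration rather than a value from the organizers.

```python
import subprocess
import sys

def free_gpu_memory_mib():
    """Return the free memory (in MiB) that nvidia-smi reports for each GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.free",
        "--format=csv,noheader,nounits",
    ]).decode()
    return [int(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    free = free_gpu_memory_mib()
    print("Free GPU memory (MiB):", free)
    # Abort early if no card has enough room left, e.g. because a previous
    # job is still holding the memory.
    if max(free) < 10 * 1024:  # hypothetical 10 GiB threshold
        sys.exit("No GPU with enough free memory; the cards appear to be in use.")
```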

Created by Li Shen (thefaculty)
This certainly looks serious. To consolidate things, I will post updates to the other thread: https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=1122
There's a new thread on the GPU memory issue, [here](https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=1122) -- some issues remain even after the 10/21 updates.
Hello Bruce, I just got an out-of-memory error. Full details are in the logs of submission ID 7429836. I am not even using the GPU.

```
Initialisation of device 0 failed: initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY
```

This was just a very quick infrastructure test run. You can restart it if that helps. --r
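If the error comes from Theano's cnmem allocator trying to reserve a large block of GPU memory at startup (which the `CNMEM_STATUS_OUT_OF_MEMORY` message suggests), one possible workaround, assuming a Theano-based submission, is to shrink or disable that preallocation. This is only a sketch; the right flags depend on the Theano version in the image.

```python
import os

# Disable cnmem preallocation (lib.cnmem=0) so Theano allocates GPU memory
# on demand instead of grabbing a large block up front. A fractional value
# such as lib.cnmem=0.4 would preallocate roughly 40% of the card instead.
# The flags must be set before the first `import theano`.
os.environ["THEANO_FLAGS"] = "device=gpu0,floatX=float32,lib.cnmem=0"

import theano  # noqa: E402  (imported after setting the flags on purpose)

print("Theano is using device:", theano.config.device)
```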
All: The code update deployed today (10/21) is designed to ensure there is never an 'orphaned' model (i.e. a previously launched model still using the GPUs when a later model is started). If you encounter a case in which a GPU is 'preoccupied', it would have to be for another reason. Please post here and we will investigate.
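For context, the kind of cleanup step such an update would presumably perform might look roughly like the sketch below. This is purely illustrative, using the standard Docker CLI; the actual infrastructure code is not public, and the `name` filter is a hypothetical naming convention.

```python
import subprocess

def stop_orphaned_containers(name_filter="training"):
    """Stop and remove containers left over from a previous submission.

    `name_filter` is a hypothetical naming convention; the real system may
    identify its containers differently.
    """
    container_ids = subprocess.check_output(
        ["docker", "ps", "-q", "--filter", "name=" + name_filter]
    ).decode().split()
    for cid in container_ids:
        # `docker rm -f` stops the container and removes it, which releases
        # any GPU memory its processes were holding.
        subprocess.check_call(["docker", "rm", "-f", cid])

if __name__ == "__main__":
    stop_orphaned_containers()
```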
Hi Bruce, I have a similar issue with the GPUs: GPU 0 is always full. Is this because each node has two K80s and everyone running the template code defaults to GPU 0? In my case GPU 0 is occupied, and the code still runs but reports out of memory.

```
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:87:00.0
STDERR: Total memory: 12.00GiB
STDERR: Free memory: 485.29MiB
STDERR: W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6a07e50
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:88:00.0
STDERR: Total memory: 12.00GiB
STDERR: Free memory: 11.81GiB
```
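If device 1 is free while device 0 is occupied, one workaround, sketched here under the assumption of the TensorFlow 0.x/1.x API that the log above comes from, is to hide device 0 from the process and let TensorFlow grow its allocation instead of reserving the whole card up front:

```python
import os

# Expose only the second K80 to this process; inside the script it will
# then appear as /gpu:0. This must be set before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # noqa: E402

# allow_growth makes TensorFlow allocate GPU memory on demand rather than
# reserving nearly all free memory when the session is created.
config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("GPU session started")))
```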
Can I just ask: are you **absolutely sure** that the cluster, especially **a) job scheduling and b) shutdown of previous Docker containers**, is functioning properly? May I ask how many machines you have? How can they all be full for such a long time? How is that even possible? Everyone needs quick turnaround to run experiments, but right now I only receive an error message every three days. It is impossible to make progress or debug like this...
But that is too inefficient: every time I have to wait for days just to receive an error message. Is everyone else running successfully? Why did Li Shen's recent submission start within an hour while I have to wait for days?
> Hi Bruce, did you figure it out? This time I have already waited over a day, and nothing has happened.

Yes, I just looked at all the servers and submissions. Your submission is currently enqueued to run; all the servers are occupied running other submissions. Hope you find this insight helpful.
Hi Bruce, did you figure it out? This time I have already waited over a day, and nothing has happened.
I guess it is this one: 7323474 RECEIVED Yuanfang Guan 41180 03:02:07PM 41180 03:02:07PM. Is that the one from today? Then yes. This is a Docker image that I previously ran successfully, so I backed it up. In my last 10 submissions it has 1. never been successful again, **claiming out of memory / cannot find GPU** when loading the first image, and 2. **taken longer and longer to return the error**: first a couple of hours, and last time three days.
Yuan Fang: Certainly. Can you provide the Submission ID? It looks like you have submitted 35 times.
Can you please check mine? I have been waiting for hours and it never starts running anything (last time I waited three days). Is it also because of orphaned programs?
Li: Thank you for your patience and help. I'm confident we will iron out these small issues and have things running smoothly when we open up the competitive phase. Look forward to your participation!
Hi Bruce, Thank you for taking care of this so quickly! Here is the new output:

```
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:87:00.0
STDERR: Total memory: 11.25GiB
STDERR: Free memory: 11.13GiB
STDERR: W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x5a67d00
STDERR: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
STDERR: name: Tesla K80
STDERR: major: 3 minor: 7 memoryClockRate (GHz) 0.8235
STDERR: pciBusID 0000:88:00.0
STDERR: Total memory: 11.25GiB
STDERR: Free memory: 11.13GiB
```

Both GPUs are now available for use.
Dear Li: Thanks for bringing this to our attention. I believe I know the cause. I looked at your history of submissions; your latest one (ID=7320362) produced the log file you show in your post. I checked the machine where that submission ran and found two "orphaned" Docker containers running there. So our system started running each container, encountered an error, but did not successfully stop the container before moving on and starting the next one. That is obviously serious and we will address the problem. So far this is just a theory that I would like to verify, so I took the liberty of re-queuing your submission (ID=7320362). It should run again and **append** its log output to that of the last run, so when you download the log file you will see the problematic log followed by the new output. My question is: what is displayed on the "Free memory" line? Is it still the low value you saw before, or do you have the full GPU memory available? Thanks!
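One quick way to check this theory on the affected machine, assuming `nvidia-smi` is available there, is to list which processes currently hold GPU memory; any PID belonging to a stale container would show up in the output. This is only a diagnostic sketch, not part of the challenge infrastructure.

```python
import subprocess

# List every compute process currently holding GPU memory. On a dedicated
# node this should be empty just before a new submission starts.
out = subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]).decode()

if out.strip():
    print("Processes still using the GPUs:")
    print(out)
else:
    print("No compute processes on the GPUs; the cards should be free.")
```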
I think I have said this like a hundred times now... the cluster didn't kick in at all.

Why are both GPUs preoccupied at the start of training?