Hi @brucehoff,
Thanks for your efforts to get the system running reliably.
I read in an [earlier thread](https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=977) that you did some work on 21 Oct to prevent old containers from holding GPU resources, and I posted a brief note there earlier today. I'm starting a new thread because you believed that problem was solved, yet I have an analysis showing you still had an issue as late as 07:15 today, *apparently* related to GPU memory usage.
Here are a few cases of containers starting with little free GPU memory:
The first occurrence is from a log starting 288ca2bd07a1ca....txt:
A train.sh started a couple of seconds before Sun Oct 23 12:27:44 2016, but I don't have a container ID:
```
STDOUT: | 0 Tesla K80 Off | 0000:87:00.0 Off | 0 |
STDOUT: | N/A 53C P0 64W / 149W | 11007MiB / 11519MiB | 0% Default |
STDOUT: | 1 Tesla K80 Off | 0000:88:00.0 Off | 0 |
STDOUT: | N/A 45C P0 71W / 149W | 123MiB / 11519MiB | 0% Default |
```
The second is from a log starting 4ebfa096e6ca...txt:
A preprocess.sh started : Mon Oct 24 05:30:33 UTC 2016 on Docker Container ID : 288ca2bd07a1
```
STDOUT: | 0 Tesla K80 Off | 0000:87:00.0 Off | Off |
STDOUT: | N/A 48C P0 60W / 149W | 11736MiB / 12287MiB | 0% Default |
STDOUT: | 1 Tesla K80 Off | 0000:88:00.0 Off | Off |
STDOUT: | N/A 41C P0 72W / 149W | 123MiB / 12287MiB | 0% Default |
```
The third, around the same time, was clearly running on the same hardware, since the memory stats were similar:
A train.sh started : Mon Oct 24 05:31:45 UTC 2016 on Docker Container ID : 7208c54f31b6
```
STDOUT: | 0 Tesla K80 Off | 0000:87:00.0 Off | Off |
STDOUT: | N/A 48C P0 60W / 149W | 11736MiB / 12287MiB | 0% Default |
STDOUT: | 1 Tesla K80 Off | 0000:88:00.0 Off | Off |
STDOUT: | N/A 41C P0 72W / 149W | 123MiB / 12287MiB | 0% Default |
```
The fourth, 1 hr 45 min later, was apparently running on the same hardware yet again, since the stats are the same:
A preprocess.sh started : Mon Oct 24 07:15:15 UTC 2016 on Docker Container ID : b98389e1cc83
```
STDOUT: | 0 Tesla K80 Off | 0000:87:00.0 Off | Off |
STDOUT: | N/A 48C P0 60W / 149W | 11736MiB / 12287MiB | 0% Default |
STDOUT: | 1 Tesla K80 Off | 0000:88:00.0 Off | Off |
STDOUT: | N/A 41C P0 72W / 149W | 123MiB / 12287MiB | 0% Default |
```
I notice that the GPU temperatures of the last three runs were suspiciously similar. I would be surprised if the temperatures and power draws remained identical, especially over a two-hour span ending at 07:15. I *think* you might have a problem with the GPU driver stats being stuck at a certain value.
Possibly you *have* fixed the 'contending container' problem, but are still suffering from a 'wedged GPU stats' problem.
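For what it's worth, this is the kind of guard we could add at the very top of our own training entry point to catch the condition early. It is a minimal Python sketch, assuming nvidia-smi is on the PATH inside the container; the 1 GiB threshold is an arbitrary number of my own.
```
import subprocess
import sys

MIN_FREE_MIB = 1024  # arbitrary threshold; adjust to what the model really needs


def free_memory_mib():
    """Free memory (MiB) per GPU, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"]).decode()
    return [int(line) for line in out.strip().splitlines()]


if __name__ == "__main__":
    free = free_memory_mib()
    print("Free GPU memory (MiB):", free)
    if all(f < MIN_FREE_MIB for f in free):
        # Fail fast with a clear message instead of a cryptic allocator error later.
        sys.exit("No GPU has %d MiB free at container start" % MIN_FREE_MIB)
```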
Let me know if you need more from our logs.
--r
OK, what you mean is that the 1TB available is only available once.
In that case, in the design of the application we could do both (preprocessing and caching for further use) in the same session, without the need to separate these applications, and without processing images that, by a decision made at run time, are not going to be used.
In my opinion, the system is committed to an _only one way_ design direction (separate processing and post-processing). This could be smart, but sometimes you don't need to do the full preprocessing, or you want to take decisions during training.
We find that the _business rules_ about when a resource is available should be hidden, and only one filesystem hierarchy MUST be present whatever you want to do during the processing session.
@kikoalbiol: The system has no concept of state maintained between submissions. Each of your submissions must completely describe how to compute its result from scratch. I think it will be very difficult to use the system if you seek to "find data that was preprocessed in previous sessions". Rather, each of your submissions should say how to pre-process the data itself. If two submissions happen to preprocess data in the same way (i.e. they have the same preprocessing container image) then the system will, as a 'short cut', use the cached output of the previous preprocessing submission. But even that is not guaranteed. For instance, if the server containing your preprocessed data is removed for maintenance, we would simply run your next submission from scratch, rerunning the preprocessing step again.
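To make that concrete, here is a rough sketch of the idea in Python. It is purely illustrative, not our actual scheduler code; CACHE_ROOT and the docker invocation are made up for the example.
```
import hashlib
import os
import subprocess

CACHE_ROOT = "/cache"  # hypothetical location for cached preprocessing output


def preprocessed_dir_for(image_ref):
    """Return preprocessed data for this preprocessing image, reusing an
    earlier run's output when the image is identical. Illustrative only."""
    key = hashlib.sha256(image_ref.encode()).hexdigest()  # key on the image reference
    cache_dir = os.path.join(CACHE_ROOT, key)
    if os.path.isdir(cache_dir):
        return cache_dir  # cache hit: skip preprocessing
    os.makedirs(cache_dir)
    subprocess.check_call([  # cache miss: run preprocess.sh from scratch
        "docker", "run", "--rm",
        "-v", cache_dir + ":/preprocessedData",
        image_ref, "/preprocess.sh"])
    return cache_dir
```
The point is that the 'key' is the preprocessing image itself, and the cached directory can vanish at any time, in which case the miss branch simply runs again.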
I hope these comments clarify how the system works and help you participate in the challenge.

In my opinion the full view of the filesystem hierarchy **MUST** be the same between sessions.
Otherwise the logic becomes confusing, and extra work must be carried out in order to find data that was preprocessed in previous sessions.
Thanks, Bruce.
I'll keep my log files, and we'll see if there are any more signs of GPU bleed-over, as we make progress.
I'm looking at another problem at the moment: the ability to make use of /preprocessedData/ between preprocess.sh and train.sh runs, when the training image changes.
Do you have evidence that teams are succeeding in using this feature? I have some early suspicions that this is not working reliably, and I'm working on a report. It would help to know what testing you have done.
Debugging this would certainly be massively easier if the full image name (docker.synapse.org/syn7368xxx/preproc_stability_tests2@sha256:32bcad604e3e5db7e1.......) and the submission ID could be dropped into the container as part of the environment.
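In the meantime, the workaround I'm experimenting with (the file name and fields below are my own, hypothetical choices) is to have preprocess.sh drop a small manifest into /preprocessedData and have train.sh print it, so the logs at least show which preprocessing output the training container actually sees:
```
import json
import os
import time

MANIFEST = "/preprocessedData/manifest.json"  # hypothetical marker file


def write_manifest(label):
    """Called from preprocess.sh: record which preprocessing run made this data."""
    with open(MANIFEST, "w") as f:
        json.dump({"label": label,
                   "written_at": time.strftime("%Y-%m-%d %H:%M:%S",
                                               time.gmtime())}, f)


def report_manifest():
    """Called from train.sh: show which (if any) preprocessing output is mounted."""
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            print("Found preprocessed data:", json.load(f))
    else:
        print("No manifest in /preprocessedData; preprocessing output did not carry over")
```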
--r
@snaggle I hope you now find that the 'pre-allocated GPU memory' is gone. If for any reason you feel it's not resolved, do let us know. Thank you.

@brucehoff
Sorry for the delay in responding to your messages.
Thanks for taking our observations seriously, then finding your bug and fixing it.
Just to note, the majority of those submissions were not pre-processing/training runs; they were just the testing it takes to chase down infrastructure bugs like this and to work out how to avoid the issues. Given the solid evidence, we had already resorted to requesting the second GPU, to reduce the chance of hitting pre-allocated GPU memory.
--r
@yuanfang.guan: Ah, yes, we can make a small change so that pending jobs (those that have not yet begun) show up in the table.

They didn't appear in the Stop a Running Job page, so I thought they had disappeared.

@snaggle I'm happy to say that I'm 99% sure we have now found and fixed this bug: there was a simple error in the script assigning GPUs to submissions which caused two submissions to use the same pair of GPUs. After fixing the bug (just hours ago) I see new jobs being correctly assigned GPUs. Sorry for the inconvenience.
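To illustrate the class of mistake (a hypothetical sketch, not the actual assignment script), indexing the list of GPU pairs with the wrong stride is enough to hand two concurrent submissions the same board:
```
# Hypothetical illustration only -- not the real assignment script.
GPU_PAIRS = [(0, 1), (2, 3)]  # one K80 board = one pair of GPUs


def assign_buggy(submissions):
    """Dividing the index by the wrong constant maps two consecutive
    submissions onto the same board."""
    return {s: GPU_PAIRS[(i // 2) % len(GPU_PAIRS)]
            for i, s in enumerate(submissions)}


def assign_fixed(submissions):
    """Each concurrent submission gets its own board."""
    return {s: GPU_PAIRS[i % len(GPU_PAIRS)]
            for i, s in enumerate(submissions)}


print(assign_buggy(["subA", "subB"]))  # {'subA': (0, 1), 'subB': (0, 1)} -- collision
print(assign_fixed(["subA", "subB"]))  # {'subA': (0, 1), 'subB': (2, 3)}
```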
@yuanfang.guan I see that there are two jobs submitted by you which are queued to run, IDs 7497398 and 7497902. One was sent yesterday and the other sent today. They are simply waiting for the necessary server resources to run.
@kikoalbiol If you clarify your question a bit I would be happy to answer.

Do you mean 349799, the full path? The problem is I don't even have an ID. As I said, it disappeared completely; I tried twice.
All files are in my 2017_TF project that I shared with you: gpu1.tar.
The only line I changed is line 23 in the Dockerfile, where you will see gpu1.
Thanks.

@yuanfang.guan: Can you give the 7-digit submission ID you are inquiring about?

I found that if I allocate to gpu1 nothing is run; it doesn't even appear in the job queue page. Can you please check? But I hope gpu1 is just gone; that makes everything simpler.

@snaggle, thank you for this feedback. I see you and your colleagues have submitted around 50 jobs from 10/22 to 10/31. They have run on four different machines and on multiple Tesla K80 boards (GPU pairs) on each machine. Since you are experiencing the memory loss on multiple machines and boards, I don't think this is a hardware issue, unless our whole fleet is affected. Although we are careful not to run concurrent jobs sharing GPUs, the 'bleed across' hypothesis seems like a possibility, especially since it's always GPU0 which you find to be used. (Models that use just one GPU usually use GPU0.) The other possibility (from our perspective) is that you are unwittingly allocating GPU memory in your submission before running "nvidia-smi".
As a first step we will try checking the GPU memory on the machine you were on most frequently, once the current jobs complete, and see if the memory allocation remains. If so, the solution may be to explicitly free memory between running submitted jobs.
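Concretely, the check could look something like this (a sketch only, assuming we run nvidia-smi directly on the host between submissions; the 200 MiB 'idle' threshold is a guess):
```
import subprocess

IDLE_THRESHOLD_MIB = 200  # rough guess at what an idle K80 reports as used


def used_memory_mib():
    """Per-GPU used memory (MiB) on the host, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]).decode()
    return {int(i): int(m)
            for i, m in (line.split(",") for line in out.strip().splitlines())}


# Run after the previous submission's container exits and before launching the
# next one; anything well above the threshold means memory was not released.
for gpu, used in sorted(used_memory_mib().items()):
    status = "OK" if used < IDLE_THRESHOLD_MIB else "memory not released"
    print("GPU %d: %d MiB used -- %s" % (gpu, used, status))
```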
Thanks again for your feedback. We will follow up with our findings.

Thanks for looking at this, @brucehoff!
More evidence of a problem.
There is a clear recent pattern in the following runs, which cover an 18.5-hour span:
```
Started : Wed Oct 26 04:22:44 UTC 2016 on Docker Container ID : df1484589eb6
Started : Wed Oct 26 04:27:58 UTC 2016 on Docker Container ID : eca9063b78a0
Started : Wed Oct 26 04:53:36 UTC 2016 on Docker Container ID : 86ef52a0dce2
Started : Wed Oct 26 05:14:04 UTC 2016 on Docker Container ID : dfdcc6093354
Started : Wed Oct 26 15:51:53 UTC 2016 on Docker Container ID : e88124135192
Started : Wed Oct 26 15:57:05 UTC 2016 on Docker Container ID : 1f1625d939de
Started : Wed Oct 26 16:22:50 UTC 2016 on Docker Container ID : 8ed17c31d224
Started : Wed Oct 26 16:53:21 UTC 2016 on Docker Container ID : 9cd58d449141
Started : Wed Oct 26 18:19:55 UTC 2016 on Docker Container ID : dbef4564d407
Started : Wed Oct 26 18:35:16 UTC 2016 on Docker Container ID : 247d629d81e1
Started : Wed Oct 26 22:49:21 UTC 2016 on Docker Container ID : cf86de8a0c4a
Started : Wed Oct 26 22:59:41 UTC 2016 on Docker Container ID : c491b8fc97e9
```
where the GPU data looked like:
```
STDOUT: Wed Oct 26 22:59:41 2016
STDOUT: +------------------------------------------------------+
STDOUT: | NVIDIA-SMI 352.99 Driver Version: 352.99 |
STDOUT: |-------------------------------+----------------------+----------------------+
STDOUT: | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
STDOUT: | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
STDOUT: |===============================+======================+======================|
STDOUT: | 0 Tesla K80 Off | 0000:87:00.0 Off | Off |
STDOUT: | N/A 47C P0 61W / 149W | 11737MiB / 12287MiB | 0% Default |
STDOUT: +-------------------------------+----------------------+----------------------+
STDOUT: | 1 Tesla K80 Off | 0000:88:00.0 Off | Off |
STDOUT: | N/A 37C P0 74W / 149W | 124MiB / 12287MiB | 0% Default |
STDOUT: +-------------------------------+----------------------+----------------------+
```
The GPU0 lines for all those runs were:
```
| N/A 45C P0 61W / 149W | 11737MiB / 12287MiB | 0% Default |
| N/A 45C P0 60W / 149W | 11737MiB / 12287MiB | 0% Default |
| N/A 45C P0 61W / 149W | 11737MiB / 12287MiB | 0% Default |
| N/A 45C P0 65W / 149W | 11737MiB / 12287MiB | 27% Default |
| N/A 45C P0 65W / 149W | 11737MiB / 12287MiB | 24% Default |
| N/A 45C P0 63W / 149W | 11737MiB / 12287MiB | 27% Default |
| N/A 45C P0 65W / 149W | 11737MiB / 12287MiB | 27% Default |
| N/A 45C P0 65W / 149W | 11737MiB / 12287MiB | 27% Default |
| N/A 45C P0 60W / 149W | 11737MiB / 12287MiB | 0% Default |
| N/A 46C P0 60W / 149W | 11737MiB / 12287MiB | 24% Default |
| N/A 45C P0 61W / 149W | 11737MiB / 12287MiB | 0% Default |
| N/A 47C P0 61W / 149W | 11737MiB / 12287MiB | 0% Default |
```
Clearly, memory use is constant, but temperature and utilisation co-vary. This does not support my earlier theory that the GPU status reporting may have stopped responding.
Theories:
* hardware/firmware issue
* bleed across -- we are seeing the GPU that is running somebody else's job
Example: we always seem to be running on the same half of the same machine. Is there another job running alongside, all day long, consuming the other two GPUs? If so, then we are seeing its GPU stats. Look for an off-by-one error.
Diagnosing the alternatives (a rough sketch follows this list):
* stop the queue
* when no jobs are running, run nvidia-smi -- make notes of the result
* reboot & repeat. If you have the same problem, it's a hardware fault.
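If it helps, here is a tiny Python loop for the second step (assuming direct access to nvidia-smi on the idle machine; the interval and duration are arbitrary):
```
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw,memory.used,utilization.gpu"


def snapshot():
    """One nvidia-smi reading per GPU, as CSV lines."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"]).decode()
    return out.strip().splitlines()


# With the queue stopped and no jobs running, take a reading every minute for
# half an hour. Constant memory alongside varying temperature/power points at
# leftover allocations rather than wedged driver stats.
for _ in range(30):
    print(time.strftime("%H:%M:%S"), snapshot())
    time.sleep(60)
```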
Good luck with finding the issue.
This thread will track the 'GPU memory issues' from another thread: https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=977

I have similar problems:
```
STDERR: find: `/preprocessedData': No such file or directory
STDERR: ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
STDERR: initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1
```
That can be followed in this post: https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=1126
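One thing worth checking on the submission side (assuming the model really is using Theano's cnmem allocator, as that error suggests): cnmem tries to grab a fraction of GPU memory at initialisation, so if something else is already holding most of the card, lowering or disabling `lib.cnmem` may at least get past initialisation and give a clearer error. A rough sketch, to be adapted:
```
import os

# Must be set before theano is imported; 0.3 is an arbitrary fraction,
# and lib.cnmem=0 disables the preallocating allocator entirely.
os.environ["THEANO_FLAGS"] = "device=gpu,lib.cnmem=0.3"

import theano  # imported after setting the flags on purpose

print(theano.config.device)
```
Whether the platform lets a THEANO_FLAGS override through is another question, of course.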