Hello @thomas.yu,
When I use tensorflow/tensorflow:0.9.0-gpu as the base Docker image, I can start from a pretrained model. However, after switching to the Docker image tensorflow/tensorflow:0.11.0-gpu I can no longer do so, and I get the following error:
...
STDOUT: using the checkpoint ./model.ckpt-2600
STDERR: main(sys.argv)
STDERR: File "DREAM_DM.py", line 608, in main
STDERR: finetune(X_tr, X_te, opts)
STDERR: File "DREAM_DM.py", line 445, in finetune
STDERR: final_saver.restore(sess, ckpt.model_checkpoint_path)
STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1340, in restore
STDERR: if not file_io.get_matching_files(file_path):
STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 231, in get_matching_files
STDERR: compat.as_bytes(filename), status)]
STDERR: File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
STDERR: self.gen.next()
STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors.py", line 463, in raise_exception_on_not_ok_status
STDERR: pywrap_tensorflow.TF_GetCode(status))
STDERR: tensorflow.python.framework.errors.NotFoundError: ./sys/dev/block/253:738/subsystem/dm-812
When I run "ls -la" in my Docker image, I see that the checkpoint file and the snapshot were copied:
STDOUT: -rw-rw-r--. 1 root root 87 Dec 15 14:35 checkpoint
STDOUT: -rw-rw-r--. 1 root root 510005647 Dec 15 14:35 model.ckpt-2600
STDOUT: -rw-rw-r--. 1 root root 6968371 Dec 15 14:35 model.ckpt-2600.meta
I need TensorFlow 0.11.0 because several things have been fixed in the new version. Can you please help me with this?
Best
Created by Yaroslav Nikulin (Therapixel) ynikulin
I would like to raise this question again. TensorFlow 0.9.0 has some issues with its batch_norm layer (discussed a lot here: https://github.com/tensorflow/tensorflow/issues/1122).
I only know that I have a significant performance drop when I switch from is_training=True to is_training=False for testing with TensorFlow 0.9.0. Locally I have TensorFlow 0.11.0, and on the pilot data it works the other way around (as it should): is_training=False improves performance. However, a Docker image built from TensorFlow 0.11.0 cannot load a model and continue training from a snapshot, as I described above. My guess is that the container has trouble accessing the virtual file system (it cannot find a file with the bogus name ./sys/dev/block/253:738/subsystem/dm-812).
Right now I always use is_training=True, so the models most probably under-perform. Could you please help me with this?
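One thing that may be worth checking: the path in the error starts with ./sys, which suggests (an assumption, not confirmed anywhere in this thread) that the relative checkpoint path ./model.ckpt-2600 is being resolved against a working directory of /. A minimal diagnostic sketch follows. The helper name describe_checkpoint is hypothetical and uses only the Python standard library, not any TensorFlow API.

```python
import os


def describe_checkpoint(ckpt_path, cwd=None):
    """Collect basic facts about a checkpoint path (hypothetical helper).

    ckpt_path: the path as it would be handed to the saver,
               e.g. "./model.ckpt-2600".
    cwd:       directory to resolve relative paths against; defaults to
               the current working directory of the process.
    """
    cwd = os.getcwd() if cwd is None else cwd
    # Join with cwd and normalise away "./" segments.
    abs_path = os.path.normpath(os.path.join(cwd, ckpt_path))
    return {
        "cwd": cwd,
        "is_relative": not os.path.isabs(ckpt_path),
        "absolute_path": abs_path,
        "exists": os.path.exists(abs_path),
    }


if __name__ == "__main__":
    # Print the facts just before the restore call in the training script.
    info = describe_checkpoint("./model.ckpt-2600")
    for key in sorted(info):
        print("%s: %s" % (key, info[key]))
```

If this reports a working directory of / and a relative path, passing the absolute path to final_saver.restore, or keeping the checkpoint files in a dedicated subdirectory, would be something to try. Whether that actually avoids the NotFoundError is untested here.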
Yaroslav
I've been using 11 and it's working for me (but I haven't tried migrating a model from 9 to 11).
Dear @thomas.yu, @tschaffter,
I am sorry, but I have to raise this question again. I do need TensorFlow 0.11.0, but apparently it is not supported. I always get the same error: it seems that when a Docker container is built from tensorflow/tensorflow:0.11.0-gpu it cannot see the virtual filesystem, and I have no idea why:
STDERR: tensorflow.python.framework.errors.NotFoundError: ./sys/dev/block/253:738/subsystem/dm-812
Please help!
Yaroslav
Just to add: when I change back to TensorFlow 0.9.0, the model is loaded and everything is all right. An example of a submission with this error is ID 7874131.
Does anyone else use TensorFlow 0.11.0? Did you have any problems with it?