Issue training on main training lane when model finishes fine on express lane

Hi Thomas,
As per your suggestion, I am posting this issue on the discussion board.
We are having some difficulty during training, and we have come to believe that there may be a systematic issue.
Problem: Code that runs without any problems in the express lane fails in the main training lane. Every trial with identical code (implemented in TensorFlow) terminates in the main training lane at a different point (i.e., after a different number of iterations).
We have tried a dozen different approaches to understand this issue, and we now believe there may be a systematic resource or memory-management problem.
Hence, if it is not too much to ask, could the memory allocated to us be increased, so that we can check whether the training output changes?
We ask because, when we ran on our own system with your pilot data copied up to a scale similar to the MAMMO DREAM challenge data, the code completed without any issues. Our system has 128 GB of DRAM.
The error log is as follows:
```
355 STDERR: Traceback (most recent call last):
356 STDERR: File "run.py", line 103, in
357 STDERR: tf.app.run()
358 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
359 STDERR: sys.exit(main(sys.argv))
360 STDERR: File "run.py", line 100, in main
361 STDERR: run_helper.execute()
362 STDERR: File "/dm-resnet/run_helper.py", line 276, in execute
363 STDERR: _, step_loss_trn, step_accu_trn = sess.run([trn_op,trn_cost,trn_accuracy], feed_dict={lr: current_lr})
364 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
365 STDERR: run_metadata_ptr)
366 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
367 STDERR: feed_dict_string, options, run_metadata)
368 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
369 STDERR: target_list, options, run_metadata)
370 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
371 STDERR: raise type(e)(node_def, op, message)
372 STDERR: OutOfRangeError: RandomShuffleQueue '_1_tower_0/shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 1, current size 0)
373 STDERR: [[Node: tower_0/shuffle_batch = QueueDequeueMany[_class=["loc:@tower_0/shuffle_batch/random_shuffle_queue"], component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/shuffle_batch/random_shuffle_queue, tower_0/shuffle_batch/n/_4474)]]
374 STDERR: [[Node: tower_1/shuffle_batch_1/_3486 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1012_tower_1/shuffle_batch_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
375 STDERR: Caused by op u'tower_0/shuffle_batch', defined at:
376 STDERR: File "run.py", line 103, in
377 STDERR: tf.app.run()
378 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
379 STDERR: sys.exit(main(sys.argv))
380 STDERR: File "run.py", line 100, in main
381 STDERR: run_helper.execute()
382 STDERR: File "/dm-resnet/run_helper.py", line 178, in execute
383 STDERR: l2_trn_cost, trn_cost, trn_accuracy = tower_loss(trndata_normal_strs, trndata_cancer_strs, scope, for_training=True)
384 STDERR: File "/dm-resnet/run_helper.py", line 107, in tower_loss
385 STDERR: normal_inputs = batchiter.multimodal_preprocessed_inputs(normal_strs,for_training=for_training)
386 STDERR: File "batchiter.py", line 333, in multimodal_preprocessed_inputs
387 STDERR: shapes=shapes)
388 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 779, in shuffle_batch
389 STDERR: dequeued = queue.dequeue_many(batch_size, name=name)
390 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 434, in dequeue_many
391 STDERR: self._queue_ref, n=n, component_types=self._dtypes, name=name)
392 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 465, in _queue_dequeue_many
393 STDERR: timeout_ms=timeout_ms, name=name)
394 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
395 STDERR: op_def=op_def)
396 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
397 STDERR: original_op=self._default_original_op, op_def=op_def)
398 STDERR: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
399 STDERR: self._traceback = _extract_stack()
```
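For context, our input pipeline uses the standard queue-based tf.train.shuffle_batch pattern, roughly as in the sketch below. This is a simplified illustration, not our actual batchiter.py; the file name, batch size, and capacities are placeholders. The OutOfRangeError above is the generic symptom of that queue being closed while the training loop is still dequeuing, e.g. because the queue-runner threads died.

```python
# Minimal sketch of the queue-based input pipeline pattern that produces the
# error above when the RandomShuffleQueue closes early. Simplified, with
# placeholder file names and sizes -- not our actual batchiter.py.
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["train.tfrecord"])  # placeholder
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# shuffle_batch feeds a RandomShuffleQueue behind the scenes; if its
# queue-runner threads stop (reader error, resource exhaustion, ...),
# the queue is closed and the next dequeue raises OutOfRangeError:
# "... is closed and has insufficient elements".
batch = tf.train.shuffle_batch(
    [serialized], batch_size=1,
    capacity=1000, min_after_dequeue=100, num_threads=4)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(batch)
    except tf.errors.OutOfRangeError:
        print("input queue closed -- the same failure mode as in the log")
    finally:
        coord.request_stop()
        coord.join(threads)
```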
Reply from cdjk:
Hi,
> Hence, if it is not too much to ask, could the memory allocated to us be increased, so that we can check whether the training output changes?
Each team has access to 200 GB of RAM. You may want to check your code for a memory leak.
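If it helps, one quick way to check is to log the process's resident memory every few steps, along the lines of the sketch below. This assumes the psutil package is available in your container, which may not be the case; the stdlib resource module gives a similar readout.

```python
# Sketch: log resident memory per training step to see whether it grows
# without bound. Assumes psutil is installed (an assumption, not part of
# the challenge image); resource.getrusage is a stdlib alternative.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size in MB. A steady climb across steps usually points
    # to a leak, e.g. new ops being added to the graph inside the loop.
    rss_mb = _proc.memory_info().rss / float(1024 * 1024)
    print("step %d: RSS %.1f MB" % (step, rss_mb))
```

A common cause of unbounded growth in TensorFlow training loops is building new graph ops inside the loop; calling tf.get_default_graph().finalize() before the loop will raise an error if that is happening.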