Issue training on main training lane when model finishes fine on express lane

Hi Thomas,

As per your suggestion, I am posting the issue on the discussion board. We are having some difficulty during training, and we have come to believe that there might be a systematic issue.

Problem: code that runs without any problems in the express lane fails in the main training lane. Every trial with identical code (implemented with TensorFlow) in the main training lane terminates at a different point (i.e. after a different number of iterations). We have tried many different approaches to understand this, a dozen times over, and now believe there may be a systematic resource issue or memory-management problem. Hence, if it is not too much to ask, could the memory allocated to us be increased, so that we can check whether the training output differs? We ask this because, when we ran on our own system with your pilot data copied up to a scale similar to the MAMMO DREAM challenge data, the code completed without any issues. Our system has 128 GB of DRAM.

The error log is as follows:

```
Traceback (most recent call last):
  File "run.py", line 103, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "run.py", line 100, in main
    run_helper.execute()
  File "/dm-resnet/run_helper.py", line 276, in execute
    _, step_loss_trn, step_accu_trn = sess.run([trn_op,trn_cost,trn_accuracy], feed_dict={lr: current_lr})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
    raise type(e)(node_def, op, message)
OutOfRangeError: RandomShuffleQueue '_1_tower_0/shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 1, current size 0)
     [[Node: tower_0/shuffle_batch = QueueDequeueMany[_class=["loc:@tower_0/shuffle_batch/random_shuffle_queue"], component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](tower_0/shuffle_batch/random_shuffle_queue, tower_0/shuffle_batch/n/_4474)]]
     [[Node: tower_1/shuffle_batch_1/_3486 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1012_tower_1/shuffle_batch_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
Caused by op u'tower_0/shuffle_batch', defined at:
  File "run.py", line 103, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "run.py", line 100, in main
    run_helper.execute()
  File "/dm-resnet/run_helper.py", line 178, in execute
    l2_trn_cost, trn_cost, trn_accuracy = tower_loss(trndata_normal_strs, trndata_cancer_strs, scope, for_training=True)
  File "/dm-resnet/run_helper.py", line 107, in tower_loss
    normal_inputs = batchiter.multimodal_preprocessed_inputs(normal_strs,for_training=for_training)
  File "batchiter.py", line 333, in multimodal_preprocessed_inputs
    shapes=shapes)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 779, in shuffle_batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 434, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 465, in _queue_dequeue_many
    timeout_ms=timeout_ms, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
    self._traceback = _extract_stack()
```
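For reference, the input pipeline is built on `tf.train.shuffle_batch` (see the traceback). Below is a minimal sketch, not our actual batchiter.py (the TFRecord filenames, parsing, and batch size are placeholders), showing that an `OutOfRangeError` at dequeue usually means the queue was closed because an enqueue (queue-runner) thread failed; joining the threads through a `tf.train.Coordinator` re-raises the underlying exception, which may help pinpoint the root cause.

```python
# Minimal sketch of a shuffle_batch input queue -- filenames, parsing and
# batch size are placeholders, not the actual batchiter.py code.
import tensorflow as tf

filename_queue = tf.train.string_input_producer(
    ["train_0.tfrecord", "train_1.tfrecord"])      # placeholder file names
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# shuffle_batch creates the RandomShuffleQueue named in the error above,
# plus the queue runners (enqueue threads) that feed it.
batch = tf.train.shuffle_batch(
    [serialized_example], batch_size=32, capacity=1000, min_after_dequeue=100)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(batch)                        # training step goes here
    except tf.errors.OutOfRangeError:
        # Raised when the queue is closed, e.g. because an enqueue thread died.
        print("Input queue closed -- inspect the queue-runner exception")
    finally:
        coord.request_stop()
        coord.join(threads)                        # re-raises reader-thread errors
```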

Created by cdjk
Hi,

> Hence, if it is not too much to ask, could the memory allocated to us be increased, so that we can check whether the training output differs?

Each team has access to 200 GB of RAM. You may want to check your code for a memory leak.
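One quick way to check for a leak is to log the process's peak resident set size every few hundred steps and see whether it grows steadily. A minimal sketch, assuming a Linux host (the logging interval and the placement in the training loop are hypothetical):

```python
# Minimal sketch: ru_maxrss is the peak resident set size in kilobytes on
# Linux, so a steady increase across training steps suggests a leak.
import resource

def peak_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Hypothetical placement inside the training loop from the traceback:
#     _, step_loss_trn, step_accu_trn = sess.run(
#         [trn_op, trn_cost, trn_accuracy], feed_dict={lr: current_lr})
#     if step % 100 == 0:
#         print("step %d: peak RSS %.1f MB" % (step, peak_rss_mb()))
```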
