Hi,
I was trying to run two TensorFlow training processes, each on its own GPU. However, I received an Out of Memory error, which makes me suspect that both processes were running on a single GPU.
I launched the two Python processes from the train.sh shell script:
```
python train.py /gpu:0 & python train.py /gpu:1
```
In the Python file, the TensorFlow objects are initialized as:
```
with tf.device(self.device_mode):
    # ... python code ...
```
where self.device_mode is "/gpu:0" or "/gpu:1" for the two processes, respectively.
However, I received the following error in one of them:
```
STDOUT: at training thread 1, Exception in start_training()(, ResourceExhaustedError(), )
STDOUT: Traceback (most recent call last):
STDOUT: File "train.py", line 196, in start_training
STDOUT: pred_L, pred_R = model.train_one_batch(case)
STDOUT: File "/predictive_model.py", line 66, in train_one_batch
STDOUT: feed_dict=feed_dict)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
STDOUT: run_metadata_ptr)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
STDOUT: feed_dict_string, options, run_metadata)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
STDOUT: target_list, options, run_metadata)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
STDOUT: raise type(e)(node_def, op, message)
STDOUT: ResourceExhaustedError: OOM when allocating tensor with shape[2,1299,999,32]
STDOUT: [[Node: conv1_CC/conv1_CC/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_19, conv1_CC/weights/read)]]
STDOUT: [[Node: fully_connected/Softmax/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1767_fully_connected/Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
STDOUT:
STDOUT: Caused by op u'conv1_CC/conv1_CC/Conv2D', defined at:
STDOUT: File "train.py", line 292, in
STDOUT: validation_break_number=validation_size, batch_size=batch_size, parameters=param, tag=tag)
STDOUT: File "train.py", line 170, in start_training
STDOUT: model = predictive_model.PredictiveModel(parameters, my_logger)
STDOUT: File "/predictive_model.py", line 39, in __init__
STDOUT: convnet = model_class(self.logger, parameters, self.x)
STDOUT: File "/models.py", line 15, in thinner_convnet
STDOUT: layers.all_views_conv_layer(x, 'conv1', number_of_filters = 32, filter_size = [3, 3], stride = [2, 2], parameter_tying = parameters['parameter_tying']), 'conv')
STDOUT: File "/layers.py", line 14, in all_views_conv_layer
STDOUT: h_L_CC = tf.contrib.layers.convolution2d(inputs = input_L_CC, num_outputs = number_of_filters, kernel_size = filter_size, stride = stride, padding = 'VALID', scope = CC_scope)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
STDOUT: return func(*args, **current_args)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 438, in convolution2d
STDOUT: padding=padding)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
STDOUT: data_format=data_format, name=name)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
STDOUT: op_def=op_def)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
STDOUT: original_op=self._default_original_op, op_def=op_def)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
STDOUT: self._traceback = _extract_stack()
STDOUT: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,1299,999,32]
STDOUT: [[Node: conv1_CC/conv1_CC/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_19, conv1_CC/weights/read)]]
STDOUT: [[Node: fully_connected/Softmax/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1767_fully_connected/Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```
I think this happened because both processes were trying to allocate memory on the same GPU. Is it possible to run this kind of multi-process training on different GPUs, and if so, what is the correct way to implement it?
Thanks for the help!!!
Created by Yiqiu Shen (ashen)

@darvinyi Do you know how to run two TensorFlow processes on different GPUs? Thanks!

Darvin (darvinyi):
Hey guys,
So yeah, TensorFlow (to my knowledge) will reserve all of the available memory on all visible GPUs as soon as you start a session, even if you hard-code certain operations to run on certain GPUs. The best way I can think of to do what you want is to make certain GPUs invisible to each process. You can do this by setting "CUDA_VISIBLE_DEVICES=..." before your Python call. So, to do what you're doing, I'd use the line:
CUDA_VISIBLE_DEVICES=0 python train.py & CUDA_VISIBLE_DEVICES=1 python train.py
- Darvin

Yiqiu Shen (ashen):
Hi @darvinyi and @tschaffter,
Thanks so much for the suggestion! I tested it on the express lane and it seems to be working now.
Artie
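For anyone else who lands on this thread, here is a minimal sketch of the same idea applied from inside Python rather than the shell. The command-line handling below is hypothetical (the original train.py takes the device string as an argument instead); the key point is that CUDA_VISIBLE_DEVICES has to be set before TensorFlow initializes the GPU, i.e. before the first session is created.
```
# Illustrative sketch only: pin this process to one physical GPU so it
# cannot allocate memory on the other one.
import os
import sys

# Hypothetical usage: `python train.py 0` in one process,
# `python train.py 1` in the other.
gpu_id = sys.argv[1] if len(sys.argv) > 1 else "0"

# Must be set before TensorFlow creates a session / touches the GPU;
# setting it before the import is the safest.
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id

import tensorflow as tf

# With only one GPU visible, TensorFlow always names it "/gpu:0",
# regardless of which physical device it actually maps to.
with tf.device("/gpu:0"):
    # ... build the model here ...
    pass
```
One consequence worth noting: with this approach, the "/gpu:1" value passed through self.device_mode would have to be dropped or remapped, since each process only ever sees a single device called "/gpu:0".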