Hi,
I was trying to run two TensorFlow training processes, each on its own GPU. However, I received an Out of Memory error, which makes me suspect that both processes were running on a single GPU.
I launched the two Python processes from the train.sh shell script:
```
python train.py /gpu:0 & python train.py /gpu:1
```
In the Python file, the TensorFlow objects are initialized as:
```
with tf.device(self.device_mode):
    # ... python code ...
```
where self.device_mode is "/gpu:0" or "/gpu:1" for the two processes, respectively.
However, I received the following error in one of them:
```
STDOUT: at training thread 1, Exception in start_training()(, ResourceExhaustedError(), )
STDOUT: Traceback (most recent call last):
STDOUT: File "train.py", line 196, in start_training
STDOUT: pred_L, pred_R = model.train_one_batch(case)
STDOUT: File "/predictive_model.py", line 66, in train_one_batch
STDOUT: feed_dict=feed_dict)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
STDOUT: run_metadata_ptr)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
STDOUT: feed_dict_string, options, run_metadata)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
STDOUT: target_list, options, run_metadata)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
STDOUT: raise type(e)(node_def, op, message)
STDOUT: ResourceExhaustedError: OOM when allocating tensor with shape[2,1299,999,32]
STDOUT: [[Node: conv1_CC/conv1_CC/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_19, conv1_CC/weights/read)]]
STDOUT: [[Node: fully_connected/Softmax/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1767_fully_connected/Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
STDOUT:
STDOUT: Caused by op u'conv1_CC/conv1_CC/Conv2D', defined at:
STDOUT: File "train.py", line 292, in
STDOUT: validation_break_number=validation_size, batch_size=batch_size, parameters=param, tag=tag)
STDOUT: File "train.py", line 170, in start_training
STDOUT: model = predictive_model.PredictiveModel(parameters, my_logger)
STDOUT: File "/predictive_model.py", line 39, in __init__
STDOUT: convnet = model_class(self.logger, parameters, self.x)
STDOUT: File "/models.py", line 15, in thinner_convnet
STDOUT: layers.all_views_conv_layer(x, 'conv1', number_of_filters = 32, filter_size = [3, 3], stride = [2, 2], parameter_tying = parameters['parameter_tying']), 'conv')
STDOUT: File "/layers.py", line 14, in all_views_conv_layer
STDOUT: h_L_CC = tf.contrib.layers.convolution2d(inputs = input_L_CC, num_outputs = number_of_filters, kernel_size = filter_size, stride = stride, padding = 'VALID', scope = CC_scope)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
STDOUT: return func(*args, **current_args)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 438, in convolution2d
STDOUT: padding=padding)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
STDOUT: data_format=data_format, name=name)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
STDOUT: op_def=op_def)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
STDOUT: original_op=self._default_original_op, op_def=op_def)
STDOUT: File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
STDOUT: self._traceback = _extract_stack()
STDOUT: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,1299,999,32]
STDOUT: [[Node: conv1_CC/conv1_CC/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_19, conv1_CC/weights/read)]]
STDOUT: [[Node: fully_connected/Softmax/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1767_fully_connected/Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```
I think this happened because both processes were trying to allocate memory on the same GPU. Is it possible to run this kind of multi-process training on different GPUs, and if so, what is the correct way to implement it?
Thanks for the help!!!
Created by Yiqiu Shen (ashen)

@darvinyi Do you know how to run two TensorFlow processes on different GPUs? Thanks!

Darvin (darvinyi):
Hey guys,
So yeah, TensorFlow (to my knowledge) will reserve all of the available memory on all visible GPUs as soon as you start a session, even if you hard-code certain operations to run on certain GPUs. The best way I can think of to do what you want is to make certain GPUs invisible to each process. You can do this by setting "CUDA_VISIBLE_DEVICES=..." before your Python call. So, to do what you're doing, I'd use the line:
CUDA_VISIBLE_DEVICES=0 python train.py & CUDA_VISIBLE_DEVICES=1 python train.py
- Darvin

Yiqiu Shen (ashen):
Hi @darvinyi and @tschaffter,
Thanks so much for the suggestion! I tested it on the express lane and it seems to be working now.
Artie
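For anyone else who lands on this thread, here is a minimal sketch of the same idea applied from inside Python rather than the shell. The command-line handling below is hypothetical (the original train.py takes the device string as an argument instead); the key point is that CUDA_VISIBLE_DEVICES has to be set before TensorFlow initializes the GPU, i.e. before the first session is created.
```
# Illustrative sketch only: pin this process to one physical GPU so it
# cannot allocate memory on the other one.
import os
import sys

# Hypothetical usage: `python train.py 0` in one process,
# `python train.py 1` in the other.
gpu_id = sys.argv[1] if len(sys.argv) > 1 else "0"

# Must be set before TensorFlow creates a session / touches the GPU;
# setting it before the import is the safest.
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id

import tensorflow as tf

# With only one GPU visible, TensorFlow always names it "/gpu:0",
# regardless of which physical device it actually maps to.
with tf.device("/gpu:0"):
    # ... build the model here ...
    pass
```
One consequence worth noting: with this approach, the "/gpu:1" value passed through self.device_mode would have to be dropped or remapped, since each process only ever sees a single device called "/gpu:0".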