Hello,
I did some testing with the Express Lane and the new template Docker images released last week.
Submission 7514144 (Caffe preprocess + training template) to the Express Lane appears to have finished both steps.
However, resubmitting the same model as 7515200 to the full Model Training mechanism produced the following error:
STDOUT: Starting Caffe AlexNet training
STDERR: F1111 14:48:48.464579 6 caffe.cpp:93] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
STDERR: *** Check failure stack trace: ***
STDERR: @ 0x7f502f314e6d (unknown)
STDERR: @ 0x7f502f316ced (unknown)
STDERR: @ 0x7f502f314a5c (unknown)
STDERR: @ 0x7f502f31763e (unknown)
STDERR: @ 0x40a32e get_gpus()
STDERR: @ 0x40b3d3 train()
STDERR: @ 0x408e6c main
STDERR: @ 0x7f501e506b15 __libc_start_main
STDERR: @ 0x409775 (unknown)
STDERR: /train.sh: line 20: 6 Aborted (core dumped) caffe train --solver=$MODEL -gpu $GPUS
STDOUT: Done
Is there a difference in the CUDA runtime version between the two routes of submission?
Thank you in advance,
Jeff
Created by mobileroaming
> Is there a difference in the CUDA runtime version between the two routes of submission?

The Express Lane and Challenge GPU servers have new NVIDIA drivers that support CUDA 8.0. The examples that we released last week are based on `nvidia/cuda:8.0-cudnn5-devel-centos7` and have been tested on the Express Lane and Challenge machines. I've observed that the CUDA Docker images provided by NVIDIA don't work with the NVIDIA drivers installed on the Open Phase machines, which are going to be decommissioned tonight at 6 pm ET.

The driver version on the Express Lane machines is:
```
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
```
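
To check a machine's driver against what an image needs before submitting, the version line above can be parsed and compared with a minimum. This is only a rough sketch: the function names are made up for illustration, and the `(367, 48)` threshold is an assumption taken from the driver shown above (which is reported to work with the CUDA 8.0 examples), not an official requirement.

```python
import re

# Sample line copied from the Express Lane output above.
SAMPLE = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 "
    "Sat Sep 3 18:21:08 PDT 2016"
)

# Assumption: treat the Express Lane driver as the minimum known-good
# version for the CUDA 8.0 based images. Adjust as needed.
ASSUMED_MINIMUM = (367, 48)


def parse_driver_version(text):
    """Return the NVIDIA kernel module version as a tuple of ints, or None."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_supports(text, minimum):
    """True if the driver version found in text is at least `minimum`."""
    version = parse_driver_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    # On a real host, read /proc/driver/nvidia/version instead of SAMPLE.
    print(parse_driver_version(SAMPLE))
    print(driver_supports(SAMPLE, ASSUMED_MINIMUM))
```

Versions are compared as integer tuples rather than strings so that, for example, `367.100` would sort after `367.48`.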