Dear @brucehoff,
Can you help me check submission 8510126? It started at almost the same time as 8509793, about 10 hours ago. The code is the same; only a parameter in the last stage, which does the model merging, is slightly different (it only has an impact after the progress passes 99%). Now, after 10 hours of running, 8509793's progress is about 18.1% while 8510126's progress is 0.787%. I suspect that something is seriously wrong. One possibility is that the GPU is stuck in an invalid state because of problematic hardware or a buggy CUDA driver, so our program keeps retrying and gets stuck. In my previous experience on my local machine, the only way to reset the GPU's state is to reboot the whole machine. We tested 8510126 on the express lane before the official submission and everything was normal. Can you help us check it? We print out the nvidia-smi output periodically (search for NVIDIA-SMI in the log; also search for 'Creating TensorFlow device'). Once you identify the problem, can you restart submission 8510126 on a different host? Thanks in advance!
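(In case it helps with triage: the periodic GPU logging is nothing fancy, just a background thread that shells out to nvidia-smi and prints its output to stdout so it lands in the job log. The sketch below is illustrative rather than our exact code; the function name and the 5-minute interval are made up.)

```python
import subprocess
import sys
import threading
import time


def log_gpu_status(interval_sec=300):
    """Periodically dump `nvidia-smi` output to stdout so it ends up in the job log."""
    def loop():
        while True:
            try:
                out = subprocess.check_output(['nvidia-smi'])
                sys.stdout.write(out.decode('utf-8') + '\n')
            except Exception as exc:  # e.g. nvidia-smi missing or the GPU is wedged
                sys.stdout.write('nvidia-smi failed: %r\n' % (exc,))
            sys.stdout.flush()
            time.sleep(interval_sec)

    t = threading.Thread(target=loop, name='gpu-logger')
    t.daemon = True  # do not keep the process alive once the main work is done
    t.start()


# Called once at startup, before the long-running inference loop:
# log_gpu_status(interval_sec=300)
```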
Created by vacuum

Just a follow-up: 8510126 finished in 30 hours and 8509793 finished in 48 hours, and both got scored. Thanks!

Dear @brucehoff,
I don't think our submissions are problematic. One strong piece of evidence: submissions 8469265 and 8469272, which use the same code (only the xgboost models differ, and they only matter after the progress passes 99%), ran successfully for 2.5 days and 6.5 days respectively and got scored. As for our code, we are using vanilla TensorFlow 1.0.0 + Python. I don't see anything special we do that could trigger a machine crash. We do use 2 GPUs. We have tested our code extensively in our local environment (i7 + dual 1080/1070 + 32 GB memory), and we saw a GPU lock-up only once (and had to restart) during our whole development process.
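(For what it's worth, the GPU usage itself is nothing more exotic than plain device placement across the two cards; roughly like the TF 1.0-style sketch below, which is illustrative rather than our actual model code. Creating the session is what makes TensorFlow initialize the GPUs and print the 'Creating TensorFlow device' lines we grep for in the log.)

```python
import tensorflow as tf

# Illustrative only: pinning work to two GPUs with plain tf.device (not our actual model).
with tf.device('/gpu:0'):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)

with tf.device('/gpu:1'):
    c = tf.random_normal([1024, 1024])
    d = tf.matmul(c, c)

# allow_soft_placement lets ops fall back to the CPU if a GPU is unavailable;
# creating the Session is when TensorFlow initializes the GPUs and emits the
# 'Creating TensorFlow device' lines that show up in the job log.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run([b, d])
```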
I highly suspect that it is either a docker/nvidia-docker bug or a CUDA driver bug triggered by contention. That's why I have suggested several times that we run one submission per physical machine, one at a time, and reboot after each run. That would be more reliable, with no contention and more throughput for the whole system, basically a win-win for everyone. I work for a startup that is heavily involved in Docker-related development, and I know that Docker offers much less isolation than a VM, especially under heavy load. If our submission gets stuck because of the environment again, maybe you can try putting the submission on a dedicated machine and only giving it 1/4 of the 12-day quota. I am pretty confident that it will finish in 1.5 days for sc1 and 2 days for sc2. Thanks!
 
BTW, I noticed that you restarted both 8509793 and 8510126 at the same time after my second message, while 8510126 was still making progress at 14.96% (i.e. only 8509793 was stuck). Also, last night, after I sent my first email, 8509793 and 8510126 were restarted at the same time while 8509793 was still making progress at 18.1% (only 8510126 was stuck). I wonder whether this is expected (i.e. whether rerunning one submission requires restarting the other submissions in the same sub-challenge) or not.

@vacuum: Your inference submissions have been problematic. We have had 5 events in the last two days in which a server crashed and rebooted. In 4 of the 5 events one of your submissions was running. The relevant submissions from you are:
8509793
8510126
We will try to run your submissions, but if they keep causing the machines to crash they may not complete. Do you have any insight as to why your code might cause a server to crash?

Also, I think adding a 'scheduled on' field to the job tracking page, showing when a submission starts to run, would be very useful for estimating when it will finish. If the submission is not scheduled yet, it can just show 'n/a'. Thanks!

Dear @brucehoff,
Thanks for restarting submission 8510126!
Now submission 8509793 is stuck. It is not just stuck; the progress has actually reversed. It was 18.1% when I wrote my first post, but it is 3.941% now. The last updated time is shown as 03/23/2017 10:47:34 PM. I guess there is something wrong with that machine. Please restart submission 8509793. Thanks!

The server running 8510126 crashed some hours ago and the submissions running on it were restarted.
submission 8510126 is stuck, need help