Today my jobs 8112814 and 8112661 are completely stopped, yet they still consume time. I checked again and again locally, and there is no dead loop. Yet they failed to process a single image on the cluster after 7 hours. I had to terminate them and submit new ones. But the problem is that the server is dead, so no matter what code I submit, nothing happens.

Created by Yuanfang Guan (yuanfang.guan)
Yes, Thomas Yu. But the speed is ridiculously slower than when I process the pilot data locally. Locally, **each image takes 1 second, with an out-of-date gaming GPU I could still find lying around.** On your cluster, after 5 hours it has only processed 1000 images, that is 15-20 sec per image. There is **something very, very weird with your cluster.** It looks to me like all machines, or at least 5 machines, are trying to access the same volume of the trainingData.
Dear Yuanfang, It says that the log file was modified at 8:03 PM. Your log files seem to be updating: I downloaded version 8 of your log file and it had 1087 lines, and version 9 of your log file has 1329 lines. Best, Thomas
Can you please take a look at 8115015? Now I am really confused about whether there is a dead loop or the log just doesn't update, because I am processing each image in 1 sec locally, yet I never see any update in the log. Thanks.
Never mind, I think I did write a dead loop. It was just so hard to find.
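
For anyone else stuck on the same question (a dead loop versus a log that just isn't updating), here is a minimal sketch of one way to tell them apart: write one timed log line per image and flush it right away, so a stalled job shows up as a log that stops at a specific image. The names `run_with_progress_log` and `process` are hypothetical, not part of the challenge infrastructure, and this assumes the job is a Python loop over images.

```python
import sys
import time


def run_with_progress_log(items, process, log=sys.stdout):
    """Process items one by one, writing a flushed, timed log line after each.

    `process` is a hypothetical per-image function supplied by the caller.
    """
    items = list(items)
    durations = []
    for index, item in enumerate(items, start=1):
        start = time.perf_counter()
        process(item)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        # flush=True writes the line out immediately instead of leaving it
        # in a buffer, so the log reflects the last image actually finished.
        print(f"[{index}/{len(items)}] {item} done in {elapsed:.2f}s",
              file=log, flush=True)
    return durations
```

If the last line in the log stays at the same image for much longer than the usual per-image time, the job is likely stuck inside `process` for that image rather than just slow to report.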
