Today my jobs are completely stopped, yet they still consume time:
8112814 and 8112661. I checked again and again locally, and there is no dead loop. Yet they failed to process a single image on the cluster after 7 hours.
I had to terminate them and submit new ones. But the problem is that the server is dead, so no matter what code I submit, nothing happens.
Created by Yuanfang Guan
????
yuanfang.guan yes. thomas yu.
But the speed is ridiculously slower than when I process the pilot data locally. Locally, **each image takes 1 second, even with an out-of-date gaming GPU I could still find around.** On your cluster, after 5 hours it has only processed 1000 images, which is 15-20 sec per image.
There is **something very, very weird with your cluster.** It looks to me like all machines, or at least 5 machines, are trying to access the same volume of the trainingData.
Dear Yuanfang,
It says that the log file was modified at 8:03 PM. Your log files seem to be updating. I downloaded version 8 of your log file and there were 1087 lines, and version 9 of your log file has 1329 lines.
Best,
Thomas
Can you please take a look at 8115015?
Now I am really confused about whether there is a dead loop or the log just doesn't update, because I am processing each image in 1 sec, yet I never see any update in the log.
Thanks
Never mind, I think I did write a dead loop. It was just so hard to find.
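For anyone who hits the same ambiguity between a dead loop and a log that simply is not updating, here is a minimal sketch of one way to tell them apart: write a progress line per image and flush it immediately, so a silent log means the loop really is stuck. The names `process_all` and `process_one` are hypothetical and not from any submission in this thread, and how the cluster collects logs is an assumption; this only controls buffering on the Python side.

```python
import sys
import time


def process_all(images, process_one, log=sys.stdout, every=1):
    """Run process_one on each image and write a flushed progress line.

    process_one is a hypothetical per-image function supplied by the caller.
    """
    start = time.monotonic()
    done = 0
    for done, image in enumerate(images, start=1):
        t0 = time.monotonic()
        process_one(image)
        if done % every == 0:
            log.write(
                f"{done} images done, last took {time.monotonic() - t0:.2f}s, "
                f"total {time.monotonic() - start:.1f}s\n"
            )
            # Push the line out now instead of waiting for the buffer to fill.
            log.flush()
    return done
```

If the per-image time printed on the cluster is close to the local 1 second but the total keeps growing between lines, the time is probably going somewhere outside `process_one` (for example, reading the data), not into a dead loop.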
completely stopped: my jobs today