I'm getting stuck in validation mode for extended times again. Is the server overloaded again? My ID is syn7342934 and current submission is 8012578. Is there a way to monitor the server status/expected time in validation mode?
Created by Bill Lotter bill_lotter
Thanks Bruce! And I certainly understand that this is a unique and difficult challenge.
That does help and is good to know. For now I think I'm going to keep the current preprocessing, but I'll let you know if I decide to change it.
@bill_lotter: First, let me say that I understand the frustration of waiting in a queue for an indeterminate length of time.
> Is there any way to estimate how long this will be, i.e. will it be 5 hours or 5 days (which is how long I have waited before).
Models submitted by participants do not communicate their time remaining. The only estimate we can provide is based on the time quota of 336 hours per round: if the jobs of the current participants exceed their teams' quota, we will cancel them to allow yours to run. Neither job currently running on the machine where your preprocessed data is cached is approaching that cut-off.
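As a rough illustration of that quota-based bound, here is a minimal Python sketch. The function name and the per-team usage figures are hypothetical (no such figures are published in this thread), and it assumes the pessimistic case in which every job on the machine must end before yours starts.

```python
ROUND_QUOTA_HOURS = 336.0  # per-team, per-round quota stated above


def worst_case_wait_hours(team_hours_used, quota=ROUND_QUOTA_HOURS):
    """Upper bound on the wait, assuming each running job is canceled
    once its team's quota is exhausted (hypothetical helper)."""
    remaining = [max(quota - used, 0.0) for used in team_hours_used]
    # Pessimistic assumption: all jobs on the machine must end first.
    return max(remaining, default=0.0)


# Made-up usage figures for two teams sharing the machine.
print(worst_case_wait_hours([100.0, 300.0]))  # 236.0
```

This only gives a ceiling; a job may of course finish long before its team's quota runs out.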
> If it's going to be on the day scale, is it worth starting a new preprocessing, which will be done in less than a day and then I can train my model or is it better to wait?
The former is an option but, as explained above, we cannot tell how long a job will run. If it helps I can say that in Round 1 most jobs ran for less than 50 hours.
> Also, are you guys able to transfer previous preprocessing to other machines
That is technically possible - and we actually did a bit of this in Round 1 - but what we found is that it can take as long to transfer the preprocessed data (which may reach 10TB) between servers as to recompute it on a new server. We decided that recomputation is the best policy.
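To see why a transfer can rival recomputation, here is a back-of-the-envelope sketch in Python. The 1 Gbit/s sustained throughput is purely an assumption for illustration; the actual link speed between the challenge servers is not stated in this thread.

```python
def transfer_hours(size_tb, gbit_per_s):
    """Hours needed to move size_tb terabytes (10**12 bytes each)
    at a sustained rate of gbit_per_s gigabits per second."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (gbit_per_s * 1e9)
    return seconds / 3600


# 10 TB is the upper size mentioned above; 1 Gbit/s is an assumed rate.
print(round(transfer_hours(10, 1.0), 1))  # 22.2
```

Under that assumption the copy takes roughly 22 hours, which is on the same order as the 16 hours of preprocessing mentioned elsewhere in this thread.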
If you opt to recompute the preprocessed data you can request here on the discussion forum that your preprocessed data be removed. A free machine will then pick up your enqueued job and begin the preprocessing phase.
I hope these comments are helpful.
Apologies if I'm not using the right verbiage, but it is the case that I'm waiting on other people's jobs to finish and there isn't room for mine right now. Is there any way to estimate how long this will be, i.e. will it be 5 hours or 5 days (which is how long I have waited before). If it's going to be on the day scale, is it worth starting a new preprocessing, which will be done in less than a day and then I can train my model or is it better to wait? Also, are you guys able to transfer previous preprocessing to other machines or are you locked into a machine and then it's just luck of the draw whether other people are using the same node? How many saved preprocessing submissions/teams are assigned to a given machine?
@bill_lotter: Nothing is "stuck" or "overloaded". You have requested to reuse preprocessed data created in a previous submission. Your preprocessed data (which took 16 hours to compute) is on a machine currently running models from other users. When those jobs are done (or if they exceed their time quota and are canceled) yours will be processed in due course.
I have the same issue. Moreover, when I just want to make sure that the model is training properly using Express Lane, it spends the entire 20 minutes waiting for the graphics card. Example: submission 8014476. Basically, even though it was formally evaluated for 20 minutes, I have no clue whether it is actually all right; the last few lines of the log are just mapping TensorFlow devices:
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y Y
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: Y Y
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:87:00.0)
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:88:00.0)
Maybe the Express Lane time should be decreased to 15 or 10 minutes? I guess it would not change anything (no one uses it for actual training), but at least we would be able to check the code.