I submitted a preprocessing job (ID 8444525), and when it completed, I submitted a training job (ID 8456196). But it started from the very beginning, rerunning the preprocessing step instead of jumping to the training step. So I canceled it, and it is currently in "cancel requested" status. Can you help me check why this happens? How can I refer to that preprocessing?
Created by Rui Hou ruihou
> Could you please take a look and explain?
There are a lot of submissions from your team (see below) but it looks like the sequence of interest is:
8461676 <--finished successfully
8477429 <--encountered an error
8477586 <--the submission in question (why did preprocessing restart)
So let's look at the parameters of each submission:
8461676
"preprocessing":"docker.synapse.org/syn7415408/cpi89@sha256:22dd9537d2acb60574fc7963d56d54824b2babf0f41eb9b0ba80738d0a70af1d"
"training":"docker.synapse.org/syn7415408/cpi101@sha256:a562062531b3f347d3aa9b2bf2c8ae072b215f868209340605ede2f0cf9f89ff"
8477429
"preprocessing":"docker.synapse.org/syn7415408/cpi101@sha256:a562062531b3f347d3aa9b2bf2c8ae072b215f868209340605ede2f0cf9f89ff"
"training":"docker.synapse.org/syn7415408/cpi105@sha256:4bb75f1336c3a79443bd704ef3b265f2f26d941c6ff9db5ef36c5baa8dd61a5a"
8477586
"preprocessing":"docker.synapse.org/syn7415408/cpi89@sha256:22dd9537d2acb60574fc7963d56d54824b2babf0f41eb9b0ba80738d0a70af1d"
"training":"docker.synapse.org/syn7415408/cpi105@sha256:4bb75f1336c3a79443bd704ef3b265f2f26d941c6ff9db5ef36c5baa8dd61a5a"
Submission 8477429 used a different preprocessing image from 8461676. This triggered the cached preprocessing results to be discarded, to free up space for the new preprocessing output. It encountered an error, so now there is *no* preprocessing output. When 8477586 ran, it started preprocessing again.
The fundamental rule affecting how the system works is that we only provide a single (10TB !!) preprocessing "slot" per team. If you change your preprocessing image we will dutifully discard the previous results and reuse the slot to hold the new results.
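To make the rule concrete, here is a minimal sketch of how a single slot per team would behave. It is only an illustration, not the actual challenge infrastructure: the `PreprocessingSlot` class and its `run` method are hypothetical, the short labels `cpi89` and `cpi101` stand in for the full image references listed above, and it assumes that a failed submission leaves the slot empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreprocessingSlot:
    """Hypothetical model of the single preprocessing slot a team gets."""
    image: Optional[str] = None   # image whose output currently occupies the slot
    has_output: bool = False      # whether usable preprocessing output is cached

    def run(self, preprocessing_image: str, succeeds: bool = True) -> bool:
        """Return True if preprocessing had to run, False if the cache was reused."""
        if self.has_output and self.image == preprocessing_image:
            return False  # same image, output present: skip straight to training
        # Different image (or empty slot): discard old results and reuse the slot.
        self.image = preprocessing_image
        # Assumption: a submission that errors out leaves no usable output.
        self.has_output = succeeds
        return True


slot = PreprocessingSlot()
history = [
    slot.run("cpi89"),                   # like 8461676: fills the slot
    slot.run("cpi101", succeeds=False),  # like 8477429: evicts cpi89, then fails
    slot.run("cpi89"),                   # like 8477586: nothing cached, reruns
    slot.run("cpi89"),                   # a later submission with the same image
]
print(history)
```

Under these assumptions, only the last call reuses cached output; every change of preprocessing image in between forces a rerun.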
I hope this helps.
${leaderboard?path=%2Fevaluation%2Fsubmission%2Fquery%3Fquery%3Dselect%2B%2A%2Bfrom%2Bevaluation%5F7213944%2B%2Bwhere%2BSUBMITTER%253D%253D%25223319922%2522&paging=true&queryTableResults=true&showIfLoggedInOnly=false&pageSize=100&showRowNumber=false&jsonResultsKeyName=rows&columnConfig0=none%2CSubmission ID%2CobjectId%3B%2CDESC&columnConfig1=none%2C%2CSUBMISSION%5FFOLDER%3B%2CNONE&columnConfig2=none%2C%2CTRAINING%5FSUBMISSION%5FPARAMETERS%3B%2CNONE&columnConfig3=none%2C%2Cstatus%3B%2CNONE&columnConfig4=synapseid%2C%2CMODEL%5FSTATE%5FENTITY%5FID%3B%2CNONE&columnConfig5=none%2C%2CSTATUS%5FDESCRIPTION%3B%2CNONE&columnConfig6=none%2C%2CWORKER%5FID%3B%2CNONE&columnConfig7=epochdate%2C%2CcreatedOn%3B%2CNONE&columnConfig8=epochdate%2C%2CTRAINING%5FSTARTED%3B%2CNONE&columnConfig9=none%2C%2CTIME%5FREMAINING%5FDISPLAY%3B%2CNONE&columnConfig10=epochdate%2C%2CmodifiedOn%3B%2CNONE&columnConfig11=none%2C%2CcancelRequested%3B%2CNONE&columnConfig12=none%2C%2CuserId%3B%2CNONE&columnConfig13=none%2C%2CSUBMITTER%3B%2CNONE&columnConfig14=synapseid%2C%2CentityId%3B%2CNONE&columnConfig15=none%2C%2Cname%3B%2CNONE}
Bruce
The repeated preprocessing and training finished. I then launched another training run (submission ID is 8477586)
using the newly re-preprocessed data. But the system started preprocessing yet again. Could you please take a look
and explain?
As you'll see, I canceled the submission, for obvious reasons - that would be the third time preprocessing the same set.
Thank you
> But all of a sudden, today the training did not reuse the preprocessing step, but launched it from scratch
It's intentional. Your job was migrated from one server to another due to load/backlog from other submissions.
I have the same problem: the training submission ID 8461676 is using a previously run preprocessing step.
However, the system launched preprocessing again. Could you take a look?
The preprocessing step uses the exact same repository and sha (cpi89/22dd9537d2acb60574fc7963d56d54824b2babf0f41eb9b0ba80738d0a70af1d)
that already ran. I have actually run several training submissions with this exact preprocessing step, and they
worked as expected - i.e., they skipped preprocessing and proceeded to training. But all of a sudden, today
the training did not reuse the preprocessing step, but launched it from scratch.
Thanks
Yes, the preprocessing submission file (syn8444524):
preprocessing=docker.synapse.org/syn8119917/preprocess-patch-1per-997calc-parallel@sha256:e8c7928c3cf4b06bbdce26da08bc6388db1df8b818f88756a7357082142ba884
The training submission file (syn8451468):
preprocessing=docker.synapse.org/syn8119917/preprocess-patch-1per-997calc-parallel@sha256:e8c7928c3cf4b06bbdce26da08bc6388db1df8b818f88756a7357082142ba884
training=docker.synapse.org/syn8119917/train-patch-based-negpro2@sha256:fe615e9f1b0f635f8757e3ab88f04caa3f64e2a59ea96b3ab3dbe562a86f97a1
> I submitted several jobs afterwards hoping to do training with the preprocessed data.
Do the later submissions invoke the same preprocessing step (same docker repository and same sha256 digest) as the original?
Hi Bruce, @brucehoff, can you please help me check what the problem is with my preprocessing (ID: 8444525)? I submitted several jobs afterwards hoping to do training with the preprocessed data. But all of them started from preprocessing instead of jumping directly to training. The returned log file of the preprocessing doesn't seem to show any problem. Thank you.
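For anyone wanting to check the "same repository and same sha256 digest" question before submitting, a rough sketch follows. It assumes the submission files are plain `key=value` lines, as in the two files quoted in this thread; the helper names `parse_submission` and `same_preprocessing` are made up for illustration and are not part of any Synapse tooling.

```python
def parse_submission(text: str) -> dict:
    """Parse key=value lines from a submission file into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def same_preprocessing(a: dict, b: dict) -> bool:
    """True only if both files name the identical preprocessing image reference."""
    ref = a.get("preprocessing")
    return ref is not None and ref == b.get("preprocessing")


PREPROCESS_REF = (
    "docker.synapse.org/syn8119917/preprocess-patch-1per-997calc-parallel"
    "@sha256:e8c7928c3cf4b06bbdce26da08bc6388db1df8b818f88756a7357082142ba884"
)
TRAIN_REF = (
    "docker.synapse.org/syn8119917/train-patch-based-negpro2"
    "@sha256:fe615e9f1b0f635f8757e3ab88f04caa3f64e2a59ea96b3ab3dbe562a86f97a1"
)

preprocessing_file = parse_submission(f"preprocessing={PREPROCESS_REF}\n")
training_file = parse_submission(
    f"preprocessing={PREPROCESS_REF}\ntraining={TRAIN_REF}\n"
)
print(same_preprocessing(preprocessing_file, training_file))
```

The comparison is on the whole reference string, so a different repository name or a different digest both count as a different preprocessing step.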
Why my existing preprocessing was reruned?