Dear CAGI 6 organizers,
We are currently participating in the CAGI 6 challenge "Annotate all missense" and are scoring the variants with our tool MISTIC (https://doi.org/10.1371/journal.pone.0236962). We have two questions to resolve before we can finish and submit our results:
(1) Non-missense variants in the dbNSFP4_nsSNV dataset
We have already preprocessed the ~80 million variants provided in the input file dbNSFP4_nsSNV.zip. However, we have an issue with ~10 million variants that do not appear to be missense variants and hence seem to be out of scope for this challenge. For instance, the first variant in the file appears to be a stop-lost variant (https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=AY5d2llL91LZFFZL-7661862):
#CHROM POS ID REF ALT QUAL FILTER INFO
10 47057 . C A . . INFO=10-47057-C-A-10-92997;CSQ=A|stop_lost|HIGH|TUBB8|ENSG00000261456|Transcript|ENST00000568584|protein_coding|4/4||ENST00000568584.6:c.1335G>T|ENSP00000456206.2:p.Ter445TyrextTer19
How should we handle these < 10 million non-missense variants, given that "No empty cells are allowed in the submission."? Can we skip these lines in our output file? If not, what values should we put in the columns for these variants?
We have already asked the data provider, Dr Xiaoming Liu, and he recommended that we consult you on this issue.
(2) SD: standard deviation of the prediction in column 5 indicating confidence
How should we calculate the SD? Should it be based on the scores across the whole set of variants, or only on those within each corresponding chromosome?
Best regards,
Kirsley and Thomas