Hello. I would like to ask explicitly whether the DNA validation data will be in exactly the same format as the training data, i.e. whether there will be three VCF files for each sample (one annotated with MuTect2, plus the Strelka indel and SNV files). Also, will the structure of the VCF files (annotation style, INFO, FORMAT, and other columns) be the same? Thanks, Best regards.

Created by Huseyin Demirci huseyind
Dear Jincheng, Please note the instructions [here](https://www.synapse.org/#!Synapse:syn6187098/wiki/449445). The section "Script input data" should tell you where all the files are. There is no directory structure in the test-data directory; all the files are in one directory. Best, Tom
A follow-up to Huseyin's question:

1. Does the validation folder have the same file structure as the one below? Because the metadata file stores only the filename of each VCF file, without path information, I want to confirm that I can hardcode this relative path for the validation files.

```
/Clinical Data
/Genomic Data
/MMRF IA9 CelgeneProcessed
/MuTect2 SnpSift Annotated vcfs
/Strelka SnpSift Annotated vcfs
/snps
/indels
```

2. For the DFCI dataset, if Strelka is not available, can we use only the VCF files in `MuTect2 SnpSift Annotated vcfs`?

Thank you. Jincheng
For Challenge 1 (whole-exome data), will we have all three types of VCFs?
That is a good question. The structure will **NOT** be the same for the DFCI dataset. Those VCFs are from RNA-seq data, so Strelka cannot be run on them: there is no paired healthy tissue with which to filter out the germline variants (we'll mention this in today's webinar). They will have MuTect only.
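For anyone adapting their script to the points raised in this thread (files sitting in a single directory, and Strelka output missing for some datasets), here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `resolve_vcfs` and the labels `mutect2`, `strelka_snvs`, and `strelka_indels` are hypothetical and not part of the challenge infrastructure, and the real metadata column names may differ.

```python
from pathlib import Path
from typing import Dict, Optional


def resolve_vcfs(data_dir: str, filenames: Dict[str, Optional[str]]) -> Dict[str, Optional[Path]]:
    """Map each caller label to a file in a flat directory, or None if unavailable.

    `filenames` maps a hypothetical label (e.g. "mutect2") to the VCF filename
    taken from the metadata file; a missing or empty value means no file is listed.
    """
    base = Path(data_dir)
    resolved: Dict[str, Optional[Path]] = {}
    for label, name in filenames.items():
        if not name:
            # Nothing listed in the metadata for this caller (e.g. no Strelka output).
            resolved[label] = None
            continue
        # Keep only the final path component, so no subdirectory is assumed.
        candidate = base / Path(name).name
        resolved[label] = candidate if candidate.is_file() else None
    return resolved
```

Downstream code can then check each entry for `None` and fall back to the MuTect-derived features alone when the Strelka files are absent, instead of relying on a hardcoded relative path.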

DNA validation data set structure