Data structure

Hello! I downloaded all the files from the competition using your script. I was wondering if you could check and clarify a few things, and I apologize in advance if I made mistakes.

A total of 2703 files were downloaded, of which 2676 are in vcf.gz format. These form complete triplets of calls (Mutect2, Strelka SNVs, and Strelka INDELs), 892 of each. This all seems complete, but there are several hundred files which do not seem to be mentioned in globalClinTraining.csv. Some examples:

1. The MMRF_1153 sample is supposedly NA for WES data, but has three files.
2. The MMRF_2621 sample is not mentioned in either CSV (neither the global one nor mmrf.clinical.csv), but there are 6 files in total (two groups of three, marked T1 and T3).
3. Some samples, like MMRF_1157, are mentioned multiple times in the columns RNASeq_transLevelExpFileSamplId and RNASeq_geneLevelExpFileSamplId; namely, they are written as MMRF_1157_4_BM;MMRF_1157_2_BM;MMRF_1157_1_BM;MMRF_1157_3_BM. There is only one file mentioned in the WES fields, but there are 4 triplets of VCFs with corresponding names.

Is this OK, and how do we interpret these 4 samples? Can they be merged and tested together, or how else should we deal with them? Any comment would be welcome! If more clarification is needed, I am at your disposal! Thanks!
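For anyone who wants to repeat this kind of check, here is a minimal sketch of how the downloaded files could be audited against the clinical file. It assumes that each VCF file name contains the patient ID in the form `MMRF_<digits>` and that a complete sample is a triplet of three files; the function names and the file-name suffixes in the usage note are hypothetical and not taken from the download script.

```python
import re
from collections import Counter

# Assumption: the patient ID (e.g. "MMRF_1157") appears somewhere in each file name.
PATIENT_RE = re.compile(r"MMRF_\d+")


def vcf_counts_per_patient(filenames):
    """Count .vcf.gz files per patient ID; other files are ignored."""
    counts = Counter()
    for name in filenames:
        if not name.endswith(".vcf.gz"):
            continue
        match = PATIENT_RE.search(name)
        if match:
            counts[match.group(0)] += 1
    return counts


def audit(filenames, clinical_patients, files_per_sample=3):
    """Flag patients whose files look inconsistent with the clinical file.

    files_per_sample=3 reflects the Mutect2 / Strelka SNV / Strelka INDEL
    triplet described in the post.
    """
    counts = vcf_counts_per_patient(filenames)
    known = set(clinical_patients)
    return {
        # Patients with VCFs but no row in the clinical file.
        "not_in_clinical": sorted(p for p in counts if p not in known),
        # Patients with more than one triplet (several aliquots or time points).
        "multiple_samples": sorted(p for p, n in counts.items() if n > files_per_sample),
        # Patients whose file count is not a whole number of triplets.
        "incomplete": sorted(p for p, n in counts.items() if n % files_per_sample != 0),
    }
```

Calling `audit(os.listdir(download_dir), patient_ids)` with the patient IDs read from the clinical CSV would list the cases to inspect by hand; it does not decide which aliquot is the newly diagnosed one.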

Created by Ognjen Milicevic ognjen011
Hi, Thank you for the clarification. Will check the latest clinical annotations. Kind regards, Suhas
I have corrected the clinical files so that they contain the label for just the newly diagnosed MMRF samples and no later samples. You'll still receive multiple VCFs in the download and have multiple columns in the RNA-seq expression data for some patients. You are free to use or ignore those additional samples, but note that you will not be able to use additional samples for prediction in the validation round, since you will not have access to them. Kind regards, Mike
Yes. Please keep in mind that data outside the sc2_training_clinannotations.csv may be useful in pre-training work like filtering or prioritization.
Thank you for the reply. So the initial recommendation is to prioritize the CSV data over the files present in the download when encountering inconsistencies?
Hi, I'll try to look into these in the following days. I'll address what I can right now. What is most likely happening is that some MMRF samples come from relapsed patients, whereas our clinical file aims to include only newly diagnosed samples. This is complicated by the fact that some MMRF samples have multiple aliquots run on different assays. Some even have multiple aliquots from time-series samples. Unfortunately, the "_1_" is not indicative of sample order. I'll look into this; we may adjust the pull-down script, and we'll notify folks of any change. You are welcome to use any data from MMRF, but you might want to be careful about using data that is not from newly diagnosed samples. Kind regards, Mike
Hi, I have a similar question related to the RNA-Seq (gene and transcript) data. In Clinical Data > sc2_training_clinannotations.csv, the number of unique RNASeq_(gene/trans)LevelExpFileSamplId values is 671. But in Expression Data > RNA-Seq Data, the files contain 734 samples, so merging sc2_training_clinannotations with the RNA-Seq files drops 63 of the RNA-Seq samples. Could you please let us know how this should be dealt with? Kind regards, Suhas
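To see exactly which expression samples drop out of such a merge, one option is to list the sample columns that have no matching ID in the clinical file. The sketch below is only an illustration under stated assumptions: the expression file is taken to have one identifier column followed by one column per sample, the clinical field may hold several IDs separated by ";" (as in the original post), and "NA" marks a missing value. The function names are hypothetical.

```python
import csv
import io


def clinical_sample_ids(clinical_csv_text, column):
    """Collect every sample ID listed in `column`, splitting ';'-joined values."""
    ids = set()
    for row in csv.DictReader(io.StringIO(clinical_csv_text)):
        value = (row.get(column) or "").strip()
        if not value or value == "NA":  # assumption: "NA" means no sample
            continue
        ids.update(part.strip() for part in value.split(";") if part.strip())
    return ids


def unmatched_expression_samples(expression_header, clinical_ids, id_columns=1):
    """Return the sample columns that the clinical file does not mention.

    Assumption: the first `id_columns` entries of the header are gene or
    transcript identifiers, and the rest are sample IDs.
    """
    samples = expression_header[id_columns:]
    return [s for s in samples if s not in clinical_ids]
```

With the real files, `clinical_csv_text` would be the contents of sc2_training_clinannotations.csv and `expression_header` the first row of an RNA-Seq expression file; the returned list should then correspond to the samples that the merge drops.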
