Hi, I just downloaded the genomic data and wonder whether we can extract the paired germline information for the centers that conducted paired tumor-normal sequencing, for example MSK.
What do the columns **Reference_Allele, Match_Norm_Seq_Allele1, Match_Norm_Seq_Allele2, Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2** mean? For the centers that conducted tumor-only sequencing, what does Reference_Allele mean? Is it based on the database used for potential germline filtering? In other words, should all patients from the same center, processed with the same filtering pipeline, have the same values for a given variant?
It would be very helpful if there were a codebook for all the columns.
Thanks a lot!
Created by Xinan Wang (xinanwang.insitro)
@xinanwang.insitro,
Thanks for the questions! If the question is "can I rely on this column to indicate germline data in GENIE" the answer is always "no", no matter the column.
The project is built from the ground up to remove germline variants so that it has a viable governance structure (that is, to protect patient privacy and release an open dataset). Our sites (e.g. UCSF) sometimes do work to make this happen before the data gets to us, and we do some additional work on the Sage end, so tracing those steps accurately and imagining whether a germline variant hypothetically would have made it through is generally not possible or advisable. I would guess this is why some sites that use tumor-normal sequencing (PROV, YALE, MSK) decline to provide alleles, barcodes, or both. Regardless, it's an optional column, so sites are justified in excluding this data for any reason in our project.
The case of CHOP is an interesting one. I do not know what data they are filling in here without talking to them. As you correctly pointed out, it seems unlikely this is accurate germline data for patients. We greatly appreciate you pointing this out!
We would encourage you to look at other datasets that are set up to collect germline data (e.g. TCGA) if that's where your research interest lies.
Dear @xinanwang.insitro,
It is important to note that any detected germline variants are filtered out in project GENIE. You can read more about this in the Germline Filter section of the data guide.
However, you bring up some good points about the usage of Match_Norm_Seq_Allele1, Match_Norm_Seq_Allele2, and Matched_Norm_Sample_Barcode fields. We will discuss internally and circle back to you as soon as we can. In the meantime, do not hesitate to reach out if you have any more questions.
Best,
Chelsea
Hi Chelsea,
Thank you for sharing the links; they have been very helpful! I have a follow-up question based on the MAF format document you provided. It states, "Set values to be blank in the following columns that may contain information about germline genotypes..." However, I noticed that there are still values present in the columns Match_Norm_Seq_Allele1, Match_Norm_Seq_Allele2, and Matched_Norm_Sample_Barcode for the centers UHN, VICC, UCSF, and CHOP. Additionally, according to the assay_information.txt, CHOP has conducted tumor-only sequencing, and MSK has performed tumor-normal sequencing. Interestingly, for MSK, the columns Match_Norm_Seq_Allele1, Match_Norm_Seq_Allele2, and Matched_Norm_Sample_Barcode are left blank.
Given these observations and potential discrepancies, I am wondering if it is appropriate to use the columns Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 as indicators of germline genotype for the centers UHN, VICC, and UCSF?
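For anyone who wants to reproduce the per-center observation above, here is a minimal sketch of how the non-blank values could be tallied. It assumes the mutation file is a tab-delimited MAF with a `Center` column (that column name, the function name `count_filled_normal_fields`, and the tiny example table are all hypothetical, not taken from the GENIE release). It only counts filled cells; it says nothing about whether those values are real germline genotypes.

```python
import csv
import io
from collections import Counter

# Columns discussed in this thread.
NORMAL_COLUMNS = (
    "Match_Norm_Seq_Allele1",
    "Match_Norm_Seq_Allele2",
    "Matched_Norm_Sample_Barcode",
)


def count_filled_normal_fields(handle, center_column="Center"):
    """Return {center: Counter({column: number of non-blank rows})}.

    `center_column` is an assumed column name; adjust it to the real file.
    """
    # Skip '#' comment or version lines that some MAF files begin with.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    filled = {}
    for record in reader:
        center = record.get(center_column) or "UNKNOWN"
        counts = filled.setdefault(center, Counter())
        for column in NORMAL_COLUMNS:
            value = (record.get(column) or "").strip()
            if value:  # only truly empty cells count as blank here
                counts[column] += 1
    return filled


# Made-up rows for illustration only; not real GENIE data.
EXAMPLE_MAF = (
    "#version 2.4\n"
    "Center\tReference_Allele\tMatch_Norm_Seq_Allele1\t"
    "Match_Norm_Seq_Allele2\tMatched_Norm_Sample_Barcode\n"
    "MSK\tC\t\t\t\n"
    "UHN\tG\tG\tG\tNORMAL\n"
    "UHN\tA\tA\tA\t\n"
)

summary = count_filled_normal_fields(io.StringIO(EXAMPLE_MAF))
for center, counts in sorted(summary.items()):
    print(center, dict(counts))
```

To run it on a real file, the `io.StringIO(...)` argument would be replaced by an open file handle. If a release uses placeholders such as `NA` instead of empty cells, the blank check would need to be extended accordingly.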
Dear @xinanwang.insitro,
Thank you for your interest in the GENIE dataset.
Please refer to the mutation data file format section [here](https://docs.cbioportal.org/file-formats/#cbioportal-mutation-data-file-format) within the cBioPortal docs to learn more about the columns. Additional information about the MAF format can be read [here](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/).
In Project GENIE, Reference_Allele refers to the allele at the variant's position in the GRCh37 (hg19) reference genome (Genome Reference Consortium Human Build 37). More information about what happens to the mutation data input files is in the data guide [here](https://www.synapse.org/Synapse:syn21683345), under the Genomic Profiling at Each Center section and the Data Harmonization & QC Process section.
Let me know if this helps.
Best,
Chelsea