Hi there,

I am looking for a way to get the number of samples profiled for each CNA gene, as shown in the OncoPrint on cBioPortal. I have tried to obtain the profiled genes of each panel from `./gene_panels/data_gene_panel_XXXX.txt` and the alteration types from `assay_information.txt`. However, I found discrepancies between the alteration types listed in `assay_information.txt` and those displayed in cBioPortal.

For instance, the CNA profile for the gene PTPN6 on cBioPortal indicates:

* Number of samples profiled for copy number alterations: 273
* Gene panels involved: VICC-01-T6B, VICC-01-D2, WAKE-CLINICAL-R2D2

Yet, according to `assay_information.txt`, the alteration type for the panel WAKE-CLINICAL-R2D2 is listed as "snv;small_indels" rather than "gene_level_cna".

Could you please provide guidance on how to reconcile these differences? Thank you for your attention to this matter. I look forward to your response.

Best regards, Wan
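
A minimal sketch of the cross-check described above, assuming the gene lists and the alteration types have already been parsed into dictionaries. The function names, the dictionary layout, and `PANEL-X` are hypothetical and only serve as illustration; the only value taken from this post is the "snv;small_indels" entry for WAKE-CLINICAL-R2D2.

```python
def supports(alteration_types: str, wanted: str) -> bool:
    """True if a semicolon-separated alteration_types string lists `wanted`."""
    return wanted in {t.strip() for t in alteration_types.split(";")}


def panels_profiling(gene, panel_genes, panel_alteration_types, wanted="gene_level_cna"):
    """Panels that include `gene` and declare the `wanted` alteration type."""
    return sorted(
        panel
        for panel, genes in panel_genes.items()
        if gene in genes and supports(panel_alteration_types.get(panel, ""), wanted)
    )


# Illustrative inputs. PANEL-X and its contents are made up; the
# WAKE-CLINICAL-R2D2 alteration types are the ones quoted in this post.
panel_genes = {
    "PANEL-X": {"PTPN6", "TP53"},
    "WAKE-CLINICAL-R2D2": {"PTPN6"},
}
panel_alteration_types = {
    "PANEL-X": "snv;small_indels;gene_level_cna",
    "WAKE-CLINICAL-R2D2": "snv;small_indels",
}

cna_panels = panels_profiling("PTPN6", panel_genes, panel_alteration_types)
print(cna_panels)
```

Under these inputs WAKE-CLINICAL-R2D2 is excluded from the CNA panels for PTPN6, which is exactly the mismatch with the cBioPortal display that this post asks about.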

Created by Wan Shi Wanda
Hi @Wanda, The discrepancies you've outlined are due to the fact that some samples sequenced on those panels may not have CNA or SV values. The `data_sv.txt` and `data_CNA.txt` files contain the samples with identified SVs/CNAs, not all samples profiled for SV/CNA. This is the problem we are facing: rather than us generating the `data_gene_matrix.txt` file, the sites would need to tell us which samples were profiled for SV/CNA and on which panel (it could be different from the mutation panel). It is different for mutations: all samples were profiled for mutations, so a sample without variants means there were no (technically valid) mutations.
Hi @thomas.yu, As the documentation on [file-formats](https://docs.cbioportal.org/file-formats/#gene-panel-data) states, "The MAF and structural variant formats are unable to include samples which are sequenced but contain no called mutations." That is, the `data_sv.txt` file only contains samples with identified SVs and does not account for all samples that have been profiled for SVs. Would an SV column created with the same method as the CNA column be accurate?
Hi @thomas.yu, Looking forward to the new features that will be included in the next release. I have found a potential inconsistency in the data presented on cBioPortal for the GENIE v15.1-public dataset. The "Genomic Profile Sample Counts" chart indicates the following numbers:

* Structural Variants: **158,401** samples
* Copy-number alterations: **136,752** samples

However, the "Case lists" chart (and the cases_XXXX.txt files) show:

* samples with Structural Variants: **158,392**
* samples with CNA: **160,994**

Based on the information provided, I would expect these figures to be consistent. Is there a specific reason for the discrepancy?
Hi @Wanda, We plan to add the SV column in release 17. The `data_gene_matrix.txt` file is not completely accurate either (except for the mutation column).

* The CNA column is created by mapping the samples that exist in the CNA file to a SEQ_ASSAY_ID, and then those SEQ_ASSAY_IDs are filled in via the CNA column.
* We will eventually create an SV column using the same method.

That said, there are some issues with this: currently we only collect mutation panel information, so we're assuming that the same panel is used across mutation, CNA, and SV. This is also under discussion but most likely won't be resolved until release 18.
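
The mapping described above could be sketched roughly as follows. This is only one reading of the description, not the actual GENIE pipeline code: the function name, the dictionary inputs, and the `"NA"` placeholder for samples absent from the CNA file are all assumptions made for illustration.

```python
def derive_cna_column(sample_to_assay, cna_samples, missing="NA"):
    """Build a hypothetical `cna` column: each sample present in the CNA file
    gets its SEQ_ASSAY_ID, every other sample gets the `missing` placeholder."""
    return {
        sample: (assay if sample in cna_samples else missing)
        for sample, assay in sample_to_assay.items()
    }


# Made-up sample and panel identifiers.
sample_to_assay = {"S1": "PANEL-A", "S2": "PANEL-A", "S3": "PANEL-B"}
cna_samples = {"S1", "S3"}  # samples that appear in the CNA file

cna_column = derive_cna_column(sample_to_assay, cna_samples)
for sample, panel in sorted(cna_column.items()):
    print(sample, panel)
```

Note that the sketch inherits the limitation mentioned in this reply: it reuses the mutation panel (SEQ_ASSAY_ID) for CNA, so it cannot tell whether CNA was actually profiled on a different panel.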
Hi @thomas.yu, Thanks a lot for your reply. I am still seeking clarification regarding the number of samples profiled for Structural Variants (SV) analysis of the XXX gene. I understand that this number should serve as the denominator when calculating the alteration frequency for the XXX gene. The file "data_gene_matrix.txt" lists columns such as SAMPLE_ID, mutations, cna. This assumes the type of genomic data extracted for the sample based on the associated SEQ_ASSAY_ID. However, I am unsure where to find similar information for the Structural Variants (SV) data. Could you please guide me on how to locate this information or provide the necessary details?
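
To make the denominator question concrete, here is a small sketch of an alteration frequency that counts only profiled samples. It assumes a per-sample panel assignment for the data type in question is available, which for SV does not exist yet according to this thread; all names and values below are hypothetical.

```python
def alteration_frequency(gene, altered_samples, sample_to_panel, panel_genes):
    """Fraction of samples profiled for `gene` that carry an alteration.

    Returns None when no sample was profiled for the gene.
    """
    # Denominator: samples whose assigned panel covers the gene.
    profiled = {
        sample
        for sample, panel in sample_to_panel.items()
        if gene in panel_genes.get(panel, set())
    }
    if not profiled:
        return None
    # Numerator: altered samples, restricted to the profiled ones.
    altered = set(altered_samples) & profiled
    return len(altered) / len(profiled)


# Made-up inputs; "NA" marks a sample with no panel for this data type.
panel_genes = {"PANEL-A": {"GENE1", "GENE2"}, "PANEL-B": {"GENE2"}}
sample_to_panel = {"S1": "PANEL-A", "S2": "PANEL-A", "S3": "PANEL-B", "S4": "NA"}

freq_gene1 = alteration_frequency("GENE1", {"S1"}, sample_to_panel, panel_genes)
print(freq_gene1)
```

Here GENE1 is covered only by PANEL-A, so two samples are profiled and one is altered; S3 and S4 stay out of the denominator.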
Hi @Wanda, Thanks for bringing these up; they will help us keep improving our documentation.

- `data_gene_panel_XXXX.txt` files: These are cBioPortal-specific files: https://docs.cbioportal.org/import-gene-panels/. We create them from the `genomic_information.txt` file, using the entries with `includeInPanel=True` for each assay, so that they show up nicely in cBioPortal.
- `assay_information.txt` file: This file contains information submitted by each contributing center, so discrepancies need to be thoroughly reviewed by us and the center.
- Gene panel profiling: You can assume that all genes that appear in the gene panel files are profiled for mutations. As for SV and CNA, these are actually optional files, so sites do not need to upload them. There is currently no easy way to determine which genes are specifically profiled for SVs or CNA.

Please let us know if you have further questions.
Hi @thomas.yu, Thanks for your reply. I have a few follow-up questions regarding the data and the gene panels:

**File Generation Clarification:** Could you please confirm whether the data_gene_panel_XXXX.txt files were generated from the assay_information.txt file, or whether the assay_information.txt file was derived from the data_gene_panel_XXXX.txt files?

**Gene Panel Profiling:** If a gene panel's alteration type is listed as "snv;small_indels;gene_level_cna;structural_variants", does this imply that all genes within the panel are profiled for mutations, CNA and SV? If not, how can I determine which genes are specifically profiled for SVs, CNA or mutations?

**Details on Panel Processing:** I have read the "data_guide.pdf". Could you provide additional details or point me to resources that explain how gene panels were handled?

Best regards, Wan
Hi @Wanda, Thanks for pointing out this data discrepancy. We will have to take a look internally to understand the source of the issue. My initial hypothesis, without any detailed investigation with the sites, is that the assay information file is missing information and/or incorrect for some sites. At this time, we don't have an exact timeline for when we will be able to fully explore this, but we hope to get to it before the end of the year.

Inconsistencies in panel alteration types between the assay_information.txt downloaded from Synapse and GENIE cBioPortal data (v15.1)