Hey there!
I noticed that in the new version released (v15), the copy number file includes values of 1.5 and -1.5. How should we interpret these values? Should we count 1.5 as amplification or gain and -1.5 as loss or deletion?
These kind of values didn't exist in older versions.
Thanks!
-Maayan
Created by maayanbaron Fantastic - we look forward to hearing what you find in your research! Thanks again for your interest. Thanks for clarifying this!
I'm going to focus my analysis on that group of 57,384 people, as you pointed out. And I'll exclude -1.5 from the DEL calculations.
Looking forward to the continuous updates of GENIE.
Thanks, I missed the hover over but I'm able to reproduce 57,384 samples in cBioPortal.
Here is code to get the same number and gene panels from R:
```
# Specify your own GENIE.dir
library(magrittr); library(data.table); library(dplyr);
library(tidyr); library(stringr); library(janitor)
gi <- readr::read_tsv(
paste0(GENIE.dir, "genomic_information.txt")
)
gene_mat <- readr::read_tsv(
paste0(GENIE.dir,"data_gene_matrix.txt"),
)
gi_mtap_status <- gi %>%
group_by(SEQ_ASSAY_ID) %>%
summarize(
mtap_covered = any(Hugo_Symbol %in% "MTAP"), .groups = "drop"
) %>%
arrange(desc(mtap_covered))
mtap_cna_using_gene_mat <- gene_mat %>%
select(SAMPLE_ID, cna) %>%
# string search is maybe not ideal, but I think it works for now:
filter(!is.na(cna)) %>%
left_join(
., gi_mtap_status, by = c(cna = "SEQ_ASSAY_ID")
) %>%
rename(mtap_covered_cna_tested = mtap_covered) #%>%
mtap_cna_using_gene_mat %>% tabyl(mtap_covered_cna_tested)
mtap_cna_using_gene_mat %>% filter(mtap_covered_cna_tested) %>% count(cna, sort = T)
```
So 57,384 is the number of samples that had a panel listed for CNA in the gene matrix, and had an MTAP value in the genomic_information.txt file. As we discussed previously, we don't actually know if all the genomic regions in genomic_information.txt are tested for CNAs (it's a file originally implemented for mutations), but I think this is probably the best way to obtain the number tested. From there you can get the number with different alteration levels.
The method I listed previously, which uses assay_information.txt rather than gene_matrix.txt to determine which panels cover CNAs, gives 50,743 tested samples. This is due to a discrepancy in two panels (DFCI-ONCOPANEL-3 and DUKE-F1-DX1). I will notify my team of the problem, but I think your solution to use the gene_matrix.txt file makes a lot of sense.
The number cohort of samples with 114,530 members makes no sense, I agree. You will see that this is the same as the 57,384 cohort above, but with lots of 0 values added. Because the CNA data spec (https://docs.cbioportal.org/file-formats/#discrete-copy-number-data) does not really define what an NA value is, I don't think it's technically incorrect, but it would be more intuitive to me if the number of non-NA samples for MTAP was 57,384 (currently its 57k, with ~50k extra zeroes added). The problem with this is as I said, we don't technically know that. I would limit your analysis to the group of 57,384, which makes the most sense to me and seems to match cBioPortal perfectly.
Hi @@alexander.paynter ,
Thank you for your response.
Previously, our discussion was centered around the 15.1 public release. With the new release, the number of samples profiled for copy number alterations in MTAP has increased from 49,649 to 57,384.
57,384 Samples Profiled for MTAP CNA:
* To view this data, navigate to the `CNA genes` chart.
* Enter `MTAP` in the search box.
* Upon hovering over the `frequency 6.7%`, you will see:
> of samples profiled for copy number alterations in this gene: 57,384
> Gene panels: CRUK-TS, MSK-IMPACT505, DFCI-ONCOPANEL-3, WAKE-CLINICAL-DX1, VICC-01-DX1, DUKE-F1-DX1
148,035 Total Samples Profiled for CNA:
* The `CNA genes`chart title itself indicates a total of 148,035 profiled samples.
* This number is also reflected in the `Genomic Profile Sample Counts` chart.
I interpret 148,035 as the total number of samples profiled for CNA across all genes, while 57,384 specifically refers to samples profiled for CNA in MTAP.
I concur with your approach to establishing a proper denominator for CNA testing. However, I prefer to derive the panels covering CNAs or mutations from `data_gene_matrix.txt` and those covering structural variants (SVs) from `assay_information.txt`. This is because some panels associated with cfDNA samples were excluded from `data_gene_matrix.txt`, as previously discussed [link1](https://www.synapse.org/Synapse:syn7222066/discussion/threadId=11268&replyId=32580) and [link2](https://www.synapse.org/Synapse:syn7222066/discussion/threadId=11267&replyId=32584). This method aligns with the values reported by cBioPortal.
Nevertheless, I am puzzled by the discrepancy where the number of samples with MTAP CNAs (114,530, in `data_CNA_16_public.txt`) exceeds 57,384. Logically, the number of samples with MTAP CNAs should be less than or equal to the total number of samples profiled for CNAs in MTAP, as it represents only those samples that have been profiled and have yielded results.
I would appreciate any insights you might have on this matter.
```
# get the panels that cover CNAs or mutations from data_gene_matrix.txt
data_gene_matrix <- fread(paste0(GENIE.dir,"data_gene_matrix.txt"))
panels_with_cna_mtap_2 <- data_gene_matrix %>%
filter(cna != "") %>%
pull(cna) %>% unique() %>%
intersect(., gi_mtap_status[gi_mtap_status$mtap_covered == TRUE,]$SEQ_ASSAY_ID) #aligns with cBioPortal: Gene panels: CRUK-TS, MSK-IMPACT505, DFCI-ONCOPANEL-3, WAKE-CLINICAL-DX1, VICC-01-DX1, DUKE-F1-DX1
data_gene_matrix_mtap_cna <- data_gene_matrix %>%
filter(cna %in% panels_with_cna_mtap_2)
nrow(data_gene_matrix_mtap_cna) <- 57,384 #aligns with cBioPortal
# get samples with MTAP CNAs value
cna_long <- cna %>%
pivot_longer(
cols = -Hugo_Symbol,
names_to = "sample_id",
values_to = 'cna_val'
)
cna_long_mtap <- cna_long %>%
filter(Hugo_Symbol %in% "MTAP")
table(cna_long_mtap$cna_val)
-2 -1.5 -1 0 1 2
3754 94 5296 104172 1162 52
# 114530 in total
```
Best
Wan
@Wanda ,
Thanks for pointing out this possible discrepancy between Synapse and cBioPortal. I'm not an expert in genomics or cBioPortal, but I will do my best to help you investigate this and get you to the right people if we can't figure it out.
I don't see anything wrong with your R analysis and I agree with your deduction about -1.5 and -2 obviously being counted as DEL in cBio. So what remains is figuring out what you're seeing in cBioPortal and why it doesn't match. I am unable to find 49,649 samples stated anywhere as the number of samples profile for MTAP CNAs, instead I see 148,035 with the 16.0 release for the number of profiled CNA samples under "CNA Genes." Can you share a screenshot, link or explain how you got there?
I can re-create your R results in cBioPortal. Using the 16.0 public release, if you go to "Charts", and click the "Gene Specific" tab, then select "Copy-number alterations" as the first drop down, then type "MTAP" in the box that says "Enter gene symbols". Hit add chart. The new chart should show about 110k samples with non-missing values as you showed above. About 3000 deletions and 1000 gains (because for some reason this chart counts values of 1 and 2 as "GAIN").
Even though I can't reproduce 49k in cBio, I would also like to suggest a guess about what 49k could be. I think this is the number of samples which covers the MTAP gene AND has some stated ability to detect a copy number alteration. The coverage can be obtained from genomic_information.txt or the gene panel files (cBio porbably does the latter, but it's derived from the former). CNA ability of the panel is covered in assay_information.txt. **At the bottom of this post is a script** where I'm getting roughly 49k samples doing just that. Again, this is a guess and I'm not expert in what cBioPortal is doing.
Scientifically our genomic_information.txt file was really designed for mutations, and to be frank documenting whether someone was tested for a CNA on a specific gene is a work in progress. However, if you are willing to make the assumption that every region in genomic_information.txt represents a "tested" region for a CNA, this is probably the best way to do it. I am not aware of a better way to come up with a good denominator for CNA testing.
```
library(magrittr)
library(data.table)
library(dplyr)
library(tidyr)
library(stringr)
library(janitor)
cna <- fread(
paste0(GENIE.dir,"data_CNA_16_public.txt")
)
samp <- readr::read_tsv(
paste0(GENIE.dir,"data_clinical_sample.txt"),
skip = 4
)
gi <- readr::read_tsv(
paste0(GENIE.dir, "genomic_information.txt")
)
ai <- readr::read_tsv(
paste0(GENIE.dir,"assay_information.txt"),
skip = 0
)
# Find the genes with at least one MTAP region covered:
gi_mtap_status <- gi %>%
group_by(SEQ_ASSAY_ID) %>%
summarize(
mtap_covered = any(Hugo_Symbol %in% "MTAP"), .groups = "drop"
) %>%
arrange(desc(mtap_covered))
# Find the panels that both cover CNAs and have some MTAP coverage:
panels_with_cna_mtap <- ai %>%
# string search is maybe not ideal, but I think it works for now:
filter(str_detect(alteration_types, 'cna')) %>%
left_join(
., gi_mtap_status, by = "SEQ_ASSAY_ID"
) %>%
filter(mtap_covered %in% T) %>%
pull(SEQ_ASSAY_ID)
# Here my analysis of the CNA data differs from Wan's.
# There is nothing wrong with the way she did it - I'm just used to
# long tidy datasets so I did this originally to confirm her results.
cna_long <- cna %>%
pivot_longer(
cols = -Hugo_Symbol,
names_to = "sample_id",
values_to = 'cna_val'
)
cna_long_mtap <- cna_long %>%
filter(Hugo_Symbol %in% "MTAP")
cna_long_mtap %>%
left_join(
.,
select(samp, sample_id = SAMPLE_ID, SEQ_ASSAY_ID),
by = 'sample_id'
) %>%
mutate(
panel_has_cna_mtap_cov = SEQ_ASSAY_ID %in% panels_with_cna_mtap
) %>%
tabyl(cna_val, panel_has_cna_mtap_cov)
# The sum of the "TRUE" column is about 49k (slightly more, new release)
# Another note: We could check our list with those she stated had CNA DEL values in her post:
panel_list_with_mtap_del <- c(
"MSK-IMPACT505",
"DFCI-ONCOPANEL-3.1",
"CRUK-TS",
"DFCI-ONCOPANEL-3",
"WAKE-CLINICAL-DX1",
"VICC-01-DX1",
"DUKE-F1-DX1"
)
ai %>%
filter(SEQ_ASSAY_ID %in% panel_list_with_mtap_del) %>%
left_join(
., gi_mtap_status, by = "SEQ_ASSAY_ID"
) %>%
filter(mtap_covered %in% T) %>%
select(SEQ_ASSAY_ID, alteration_types)
# Pretty good match, not perfect. Note that DUKE says it doesn't cover CNAs, and some are missing.
```
Dear @chelsea.nayan,
Upon examining the "CNA genes" chart, the following information was presented:
* MTAP DEL: 3360 samples. Total # of samples profiled for copy number alterations in this gene: 49,649; Gene panels: MSK-IMPACT505, DFCI-ONCOPANEL-3.1, CRUK-TS, DFCI-ONCOPANEL-3, WAKE-CLINICAL-DX1, VICC-01-DX1, DUKE-F1-DX1
* MTAP AMP: 49 samples
However, when analyzing the data_cna.txt file using the R script, I obtained different results:
```
MTAP.CN <- fread(paste0(GENIE.dir,"data_cna.txt")) %>%
filter(Hugo_Symbol == "MTAP") %>%
tibble::column_to_rownames(var = "Hugo_Symbol") %>%
t() %>% as.data.frame() %>%
filter(!is.na(MTAP)) #107883 samples with MTAP Value #49649 web profiled samples
table(MTAP.CN$MTAP)
```
> -2 -1.5 -1 0 1 2
> 3287 73 4971 98409 1094 49
```
length(MTAP.CN$MTAP)
```
>107883
The script indicates that there are 107,883 samples with an MTAP CNA value, which is significantly higher than the 49,649 samples reported in the chart. Specifically, the number 3360 (reported as MTAP DEL) seems to be a sum of -2 (3287 samples) and -1.5 (73 samples), suggesting that the -1.5 category is being interpreted as DEL.
Could you please assist me in understanding the reason for this discrepancy? Your expertise would be invaluable in clarifying whether this is an issue with the dataset, the chart representation, or my analysis method.
Thank you for your time and assistance.
Best
Wan Thank you for the clarification! super helpful! Dear @maayanbaron,
Thank you for your interest in the GENIE dataset.
Values in a copy number file represent the degree of copy number variation in the DNA sequence at specific loci.
* 1.5 should be interpreted as a gain: It typically indicates an increase in copy number but not strong enough to be classified as an amplification.
* -1.5 should be interpreted as a loss: It typically signifies a decrease in copy number but not a complete loss (deletion).
More information on the copy number file format can be found [here](https://docs.cbioportal.org/file-formats/#discrete-copy-number-data).
Best,
Chelsea
Drop files to upload
GENIE CN v15 has ambiguous values page is loading…