Hello!
I understand that gene counts computed with the TCGA mRNA-Seq pipeline are further quantile normalized. Is this correct?
If so, given that the various predictors might rely on different normalization approaches (e.g. TMM, VST), would it be possible to have access to the non-normalized counts?
Thank you!
Created by Francesca Finotello FrancescaF

Hi @Fede ,
That is correct! You will only have access to the Docker logs on the synthetic data.

Dear @Michael.Mason ,
so the log file is always generated by running the code on the synthetic data and not on the validation?
Thanks!
Federica

Dear @FrancescaF and @Eimanahmed ,
I suspect that the odd issues you are seeing are due to the way the synthetic data was generated. We had some fairly strict obfuscation requirements, which meant that the synthetic data was generated for each gene in each file independently. This means that gene A in a counts file will have no relationship to gene A in a TPM file **and** that the values across all genes for one sample will likely **not** sum to an expected result. This is true of the synthetic data only. The real data was processed by BMS with standard approaches and should match expectations.
I hope this helps,
Regards,
Mike

Dear @Michael.Mason ,
thank you very much for your answer.
Unfortunately, we see strange behavior of our model on both the synthetic and the validation data.
By adding some "data sanity checks" to our code, we found out that, in some cases, the total sum of TPM and counts per sample exceeds by several orders of magnitude what one would expect from RNA-seq data.
This is especially evident for the total TPM, which, by definition, should not exceed 1 million.
As we are using matched counts and TPM data, and leveraging the full cohort for data normalization, this issue (if confirmed) is likely to have a negative impact on our results.
I hope we are not making any mistakes in the data loading step, but it would be great if you could check the validation data in this respect.
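For reference, a check of this kind can be sketched roughly as below. This is only a minimal illustration, not the code we actually run: the function name `flag_tpm_totals`, the 1% tolerance, and the toy values are all assumptions made up for the example, and it presumes the TPM values of each sample are already loaded as a plain list of numbers.

```python
import math

def flag_tpm_totals(tpm_by_sample, expected_total=1_000_000.0, rel_tol=0.01):
    """Return {sample: total} for samples whose TPM total exceeds expected_total.

    tpm_by_sample maps a sample name to its per-gene TPM values.
    rel_tol is a hypothetical tolerance that allows for rounding in the files.
    """
    flagged = {}
    for sample, values in tpm_by_sample.items():
        total = math.fsum(values)  # accurate floating-point summation
        if total > expected_total * (1.0 + rel_tol):
            flagged[sample] = total
    return flagged

# Toy values invented for illustration only (not Challenge data).
toy = {
    "sample_ok": [250_000.0, 250_000.0, 500_000.0],
    "sample_bad": [4.0e8, 6.0e8],
}
flagged = flag_tpm_totals(toy)
print(flagged)
```

With the toy input, only `sample_bad` is reported, since its total is three orders of magnitude above 1 million.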
Many thanks in advance for your help,
Francesca
Dear @Michael.Mason
Does this mean that all other available gene expression counts are already normalized? For example, is "GRCh37ERCC_refseq105_genes_count" already normalized in the validation dataset?
Thank you,

Dear @Francesca ,
Thanks for participating in the Challenge. I have checked with our collaborators and the non-normalized counts are:
* GRCh37ERCC_ensembl75_genes_counts.csv
* GRCh37ERCC_ensembl75_isoforms_counts.csv
Hope that helps!
Mike