Thank you for providing validation data; that is extremely helpful!
I have a concern re: format of validation RNAseq expression data (which may lead to a quantification question).
It seems as though the count data for synthetic samples using either RefSeq or ENSEMBL transcriptomes are in doubles, not integers.
Here's a bit of "GRCh37ERCC_refseq105_genes_count.csv":

| |p227|p359|p533|p149|
|---|---|---|---|---|
|PHF20L1|4055|6892|3911|2743|
|OGDHL|56|34|5|170|
|ZCCHC18|41.75|88.78|73.7|42.72|
|MIR1236|0|0|0|0|
The same is true for transcript-level quantification in "GRCh37ERCC_refseq105_isoforms_count.csv":

| |p227|p359|p533|p149|
|---|---|---|---|---|
|NR_049841.1|0|0|0|0|
|NM_001039584.1|256.59|463.02|1005.26|291.86|
|NM_001350039.2|0|0|0|0|
|NM_001366269.1|5.5|18.49|22.27|8.89|
It would seem that these are not count data, given that these numbers are not integers. It also appears that these samples, either as a result of the synthetic data generation process, or intentionally to reflect the samples that these models will be tested on, are already normalized.
It would be helpful if you could share the RNAseq quantification and normalization process that your samples from CheckMate 026 will undergo so that we can avoid re-normalizing.
If the use of doubles instead of integers in the "count" synthetic data is simply an artifact of the process that generates synthetic samples for validation, it would be great to know that we can safely ignore that! It may cause issues during the synthetic validation phase, though.
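For anyone who wants to reproduce the check, here is a minimal sketch of how the fractional rows can be flagged. The rows are copied from the gene-level excerpt in this post; the comma-separated layout, the `gene` header label, and the helper name `non_integer_rows` are assumptions for illustration, not the actual file layout or anyone's pipeline.

```python
import csv
import io

# Rows copied from the gene-level excerpt in this post. The comma-separated
# layout and the "gene" label for the first column are assumptions.
SAMPLE = """gene,p227,p359,p533,p149
PHF20L1,4055,6892,3911,2743
OGDHL,56,34,5,170
ZCCHC18,41.75,88.78,73.7,42.72
MIR1236,0,0,0,0
"""


def non_integer_rows(handle):
    """Return {row_name: values} for rows with at least one fractional value."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row of sample IDs
    flagged = {}
    for row in reader:
        name = row[0]
        values = [float(v) for v in row[1:]]
        if any(not v.is_integer() for v in values):
            flagged[name] = values
    return flagged


flagged = non_integer_rows(io.StringIO(SAMPLE))
print(flagged)  # {'ZCCHC18': [41.75, 88.78, 73.7, 42.72]}
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper.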
Created by Garrett Graham (GarrettGraham)

Hi @GarrettGraham,
Our BMS colleagues have confirmed that the decimals are due to using RSEM.
Cheers,
Mike

Hi @Michael.Mason,
Thanks for checking on your RSEM output! To me, that would indicate that these are not direct counts and may have been corrected for sequencing bias, length, or GC content. I know that Salmon quantification outputs similar fractional read counts, and it can correct for positional and GC bias depending on the options selected at runtime. The specific commands used to generate these files, and those summary stats from MiXCR, would be very helpful!
Thanks for liaising with BMS for us!
Garrett

Hi @GarrettGraham,
I checked, and the real sequencing counts *do* have decimal values, more or less like those in your tables above. I believe this is due to RSEM's probabilistic read assignment, but I will check with BMS folks to see if they can provide more details.
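As a rough illustration of why probabilistic assignment yields non-integer "counts", here is a toy expectation-maximization sketch. It is not RSEM's actual model: it assumes equal effective transcript lengths and no error or bias terms, and the transcript names, read sets, and the function `expected_counts` are made up for the example.

```python
def expected_counts(reads, transcripts, n_iter=100):
    """Toy EM that splits each read across its compatible transcripts.

    reads: list of tuples naming the transcripts each read aligns to.
    Assumes equal effective lengths and no error model (unlike RSEM).
    """
    # Start from uniform relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict.fromkeys(transcripts, 0.0)
    for _ in range(n_iter):
        counts = dict.fromkeys(transcripts, 0.0)
        # E-step: assign each read fractionally, in proportion to the
        # current abundance of every transcript it is compatible with.
        for hits in reads:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = {t: counts[t] / len(reads) for t in transcripts}
    return counts


# Hypothetical data: 5 reads unique to A, 2 unique to B, 3 compatible with both.
reads = [("A",)] * 5 + [("B",)] * 2 + [("A", "B")] * 3
counts = expected_counts(reads, ["A", "B"])
print({t: round(c, 2) for t, c in counts.items()})  # {'A': 7.14, 'B': 2.86}
```

The ten reads are still all accounted for, but the three ambiguous reads are shared between the two transcripts, so neither total is a whole number.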
Cheers,
Mike

Dear @Michael.Mason,
That's good to know!
Having the webinar Q&A in one place is also awesome. Thanks for your time on this!
@GarrettGraham

Dear @GarrettGraham,
Thanks for bringing this up. I suspect that this is the result of the synthetic data generation, which selected parametric distributions to pull from for each gene based on fit and then sampled from that distribution. I will double check that the true data is indeed counts and post in this thread.
The sequence processing pipeline is available in the [webinar Q&A](syn18404605/wiki/607473), which was just posted. Check the question starting with *Will the bioinformatics pipeline be made public*. I'll be messaging all participants about the Q&A later.
Kind Regards,
Mike