RNA-seq data normalization (TPM, RPKM, raw counts, ...)

Hello, Concerning the RNA-seq data used in the challenge, will it always be given with the same normalization (e.g. TPM), or will it sometimes be FPKM or the raw counts directly? Currently, the input.csv file that you provide only indicates the platform and scale but not the normalization, which is of course a very important parameter. Thank you. Cheers, Julien

Created by Julien Racle jracle
https://www.synapse.org/#!Synapse:syn15589870/discussion/threadId=5699&replyId=19753 Thank you so much, Dr. Martin and Dr. Zurkin. Actually, I generated random predicted proportions and submitted them to the Fast line (test line) and the R1 line. I get the valid email from the Fast line, but the R1 line reports a failure. I have not received the scoring email. I suggest you generate some results to test whether these failures are caused by your model or by the docker system. I am still working on it. Good luck! Best, Wennan
Cool! Any news of the round 1 leaderboard? Thanks for all your work Andrew!! Best, Martin
@martinguerrero89 - Many thanks. @chang91 - Sometimes workflow works (you get a scoring email) but a failure log is sent as well. I was told to ignore such cases as this is a bug in the docker system.
The range of the last data set is huge, as I can see from the quantile information I print about the data. Also, can you submit the docker successfully? I get the fail email even when I submit a result using a very simple random proportion. Best, Wennan
Hi! I'm not a challenge organizer, but I think I might be able to help. Illumina HumanHT-12 V4.0 is a microarray platform, so, from what I've seen, there is only one Illumina HiSeq2000 dataset in phase 1. I guess in phase 2 we will be able to test the models on the second Illumina RNA-seq dataset mentioned. Hope it helps! Best, Martin
Regarding "For the first two leaderboard rounds there will only be two RNA-seq data sets. One will be log2 TMM and the other will be CPM." - my logs show that the two RNA-seq datasets are: - Illumina HiSeq 2000, Log2, TMM - Illumina HumanHT-12 V4.0, Linear, average (this is derived from input.csv). This is not in line with what you say about CPM, which is why I asked what "average" is.
Hi @zurkin , 1. "Average" is only used for microarray data. It refers to average normalization with GenomeStudio Version 4, which is described here: https://www.illumina.com/documents/products/technotes/technote_beadstudio_normalization.pdf 2. For the first two leaderboard rounds there will only be two RNA-seq data sets. One will be log2 TMM and the other will be CPM. 3. TPMs are not provided for RNA-seq data in the leaderboard phase (just TMM and CPM as described above). For the validation phase, normalization will be TPM, with the corresponding TPM data in the files native.expr.file, hugo.expr.file, and ensg.expr.file. We will also add a second normalization2 = "est_counts" and corresponding data in native.expr.file2, hugo.expr.file2, ensg.expr.file2. Brian
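For readers who want to handle the scale and normalization annotations discussed above, here is a minimal Python sketch (not the organizers' code). It shows how a "Log2" scale could be undone and how CPM is computed from counts. The function names are hypothetical, the scale strings are taken from the values quoted in this thread ("Log2", "Linear"), and the sketch assumes the log2 data carry no pseudocount offset, which may not hold for a given data set. TMM scaling factors are not computed here.

```python
def to_linear(values, scale):
    """Return expression values on a linear scale.

    `scale` is assumed to be 'Log2' or 'Linear', as quoted in this thread.
    Assumption: log2 values were computed without a pseudocount offset.
    """
    s = scale.strip().lower()
    if s == "log2":
        return [2.0 ** v for v in values]
    if s == "linear":
        return list(values)
    raise ValueError(f"unknown scale: {scale!r}")


def counts_to_cpm(counts):
    """Convert the counts of one sample to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has zero total counts")
    return [c * 1e6 / total for c in counts]
```

For example, counts_to_cpm([1, 1, 2]) rescales the sample so that its values sum to one million.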
Hi, This is still unclear to me. Regarding the "normalization" values: - Would you please specify what "average" means in the context of RNA-seq data? Is it TPM values? - Would you please specify all options allowed for "normalization" in the context of RNA-seq data (and their meanings) for the leaderboard phase? - "For RNA-seq data in the validation phase we will provide kallisto estimated counts (est_counts), in addition to TPMs." - What files contain the TPM values?
Hi @brian.white , Great, thanks for the updated input file. Best, Julien
Hi @djo, Yes, this list is exhaustive (now that I've added TPM and a separate entry for estimated counts)--please see my changes noted above. Data that are not (or lightly) normalized will be annotated as TPM, in the estimated counts entries, or in the fastq entries. Does that answer your question? Thanks, Brian
Hi @jracle , Yes, we will provide both kallisto-derived TPMs and estimated counts for the validation data. I have updated the Wiki here to reflect the change: https://www.synapse.org/#!Synapse:syn15589870/wiki/592699 For RNA-seq data in the validation phase we will provide kallisto estimated counts (est_counts), in addition to TPMs. The relevant estimated counts files, if available, will be provided in native.expr.est.counts.file, hugo.expr.est.counts.file, ensg.expr.est.counts.file, symbol.compression.est.counts.function, and ensg.compression.est.counts.function. Thanks, Brian
@brian.white You specified some example values for the new normalization column in https://www.synapse.org/#!Synapse:syn15589870/wiki/592699. Is the list exhaustive? How will you annotate data that is not normalized?
Hi @brian.white , Thank you for the reply. You already resolved my concerns. I only mentioned the admixtures since the in silico "simulation" of cell mixtures may not come with a total number of mapped reads. Kind regards, Dominik
Hi @brian.white , Thank you for the information and the updated input.csv file. Sure, for the leaderboard phase the data will come from various technologies and there might be many issues in normalization, as you cannot reprocess them in a uniform manner. But concerning the validation data, would it be possible for you to also provide TPM values? Kallisto outputs both the estimated counts and TPM, and it is better to use the same type of values as was used in training. Of course, one might perform an approximate counts-to-TPM conversion afterwards, but it's better to directly use the values returned for the dataset at hand. Thank you! Best regards, Julien
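The approximate counts-to-TPM conversion mentioned above can be sketched as follows. This is only an illustration in Python, not kallisto's implementation: the function name is hypothetical, and it assumes that a per-transcript effective length is available (kallisto reports one alongside est_counts and tpm in its abundance table). Without dataset-specific effective lengths the result is only an approximation, which is the concern raised in the post.

```python
def counts_to_tpm(est_counts, eff_lengths):
    """Approximate TPM from estimated counts and effective lengths.

    Both arguments are parallel lists covering the transcripts of ONE sample.
    Hypothetical helper: the lengths used here are an assumption and may
    differ from those of the original quantification run.
    """
    if len(est_counts) != len(eff_lengths):
        raise ValueError("counts and lengths must have the same length")
    # Reads per base of effective transcript length.
    rates = [c / l for c, l in zip(est_counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has zero total expression")
    # Rescale so that the values of the sample sum to one million.
    return [r * 1e6 / total for r in rates]
```

Two transcripts with counts 10 and 20 and effective lengths 1000 and 2000 have the same read density, so they receive the same TPM.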
Hi @djo , As mentioned above, we will provide estimated counts for the validation data. So you will be able to sum these to estimate total mapped reads. Unfortunately, much of the leaderboard data is microarray, so we will not be able to provide such information generally. I don't understand your comment about weighted averages for admixtures. Would you please ask it a different way? Regards, Brian
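As a small illustration of summing estimated counts to approximate the total number of mapped reads, here is a hedged Python sketch. The nested-dictionary layout (sample name to gene name to est_counts) and the function name are assumptions made for the example, not the format of the challenge files.

```python
def total_mapped_reads(est_counts_by_sample):
    """Estimate total mapped reads per sample by summing estimated counts.

    `est_counts_by_sample` is a hypothetical layout:
    {sample_name: {gene_name: est_count, ...}, ...}
    """
    return {
        sample: sum(gene_counts.values())
        for sample, gene_counts in est_counts_by_sample.items()
    }
```

The resulting per-sample totals could then serve as library sizes, for instance when reasoning about measurement uncertainty as asked earlier in the thread.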
Hi @jracle , I have updated the Wiki to provide a normalization column in the input.csv file. Unfortunately, the leaderboard data are normalized using a variety of different normalization approaches. These are published data and we often do not have access to the raw data to perform the normalization. The validation data will be RNA-seq, for which we will provide Kallisto-derived "estimated counts." Best, Brian
I would welcome information on that matter as well. Additionally, if only normalized RNA-seq expression data are available, will there also be a specification of the total number of mapped reads? Such an insight would be beneficial for inferring measurement uncertainty. In the case of admixtures, a weighted average as an approximation would also be very useful.
