Hi, how exactly was the RNA-Seq data preprocessed? In particular, was a pseudocount used when computing the log2(cpm) values, and if so, what value was it set to? Thanks!

Created by Kristina Thedinga kristina.t
Thank you for the very detailed answer!!
Hi Joshua,

RPKM. More details are available in the *RNA-Sequencing and Data Processing* section of [Tyner, et al 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6280667/), including:

> The data was collated from featureCounts matrices and all genes with no counts across the samples were excluded. Genes with duplicate gene symbols and those where the counts were < 10 for 90% or more of the samples were additionally removed prior to normalization similar to the approach suggested for weighted gene correlation network analysis (WGCNA) [62]. Samples for which their median expression was less than 2 standard deviations below the average were removed from the dataset (N=10). Normalization was performed using the conditional quantile normalization procedure [63], which produced GC-content corrected log2 reads per kilobase per million mapped reads (RPKM) values. This procedure produces both offsets to be used in conjunction with edgeR as well as a matrix of log2 normalized RPKM values for clustering.

Best,
Jacob
Hello Jacoberts, I have one further question: did raw_counts_matrix contain RNA sequencing data normalized as RPKM, FPKM or TPM? Thanks in advance!
Yes, that helps. Thank you!
Hi Kristina,

The log2(cpm) values are generated using the cpm() function of edgeR as follows:

```
library(edgeR)

# prior.count is the pseudocount added before taking the log
log2.cpm.matrix <- cpm(raw_counts_matrix, prior.count=2, log=TRUE) # default is log2
```

Hope that helps!
Jacob
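For anyone wondering what `prior.count=2` does to the numbers: as far as I understand the edgeR documentation, the prior count is not added as a flat 2. It is scaled in proportion to each sample's library size, and twice the scaled value is added to the library size. Below is a rough base-R sketch of that calculation on a made-up toy matrix (the counts and variable names are hypothetical, not from this dataset). It is meant as an illustration only, so please check against `cpm()` itself before relying on it.

```r
# Hypothetical toy counts: 3 genes x 2 samples (NOT from the real dataset)
counts <- matrix(c(0, 10, 90,
                   5, 45, 150),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

prior_count <- 2
lib_size <- colSums(counts)  # total counts per sample

# Assumption (per my reading of the edgeR docs): the prior is scaled so that
# it is proportional to each sample's library size
prior_scaled <- prior_count * lib_size / mean(lib_size)

# Assumption: twice the scaled prior is added to the library size
adj_lib_size <- lib_size + 2 * prior_scaled

# Transpose so the per-sample vectors line up with rows, then transpose back
log2_cpm <- log2(t((t(counts) + prior_scaled) / adj_lib_size) * 1e6)

print(round(log2_cpm, 3))
```

The practical upshot is that a zero count still gets a finite log2(cpm) value, and the size of the offset depends on the sample's sequencing depth rather than being identical for every sample.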
