Thank you for creating this wonderful resource. I have a quick question concerning the samples you included. Initially I was under the impression that you mapped all RNA-seq data present in the SRA database (with more than 150 million sequenced base pairs), but I now notice that a fair number of sequencing runs were not included (about a third). Here are a few examples:
Study one examples:
SRR998545
SRR998530
Study two example:
SRR949118
I do notice that other samples from the same studies are present (e.g. SRR998542, SRR949117). Were these runs removed because of some quality-control metric, or perhaps for another reason?
Thanks! Looks exactly like what I needed.
Good to hear you're better again ;) I just merged the Kallisto mapping stats data :) : https://www.synapse.org/#!Synapse:syn16805205
Sorry it took a while, I caught a cold yesterday... I added you on Google Hangouts, but I think you need to accept before I can talk to you (my email is sipkovandam@gmail.com)
Ready when you are ;) Great, speak to you soon. I can do 9 AM Wednesday (PST) :)
Looking forward to hearing from you. I added your Google Hangouts ID. Maybe today or tomorrow at 9 AM your time? Alternatively, a bit later is also OK, but preferably before 2 PM (your time).
Q: If you wish, I could show some figures, maybe over Skype or something. If you would be interested in that, feel free to let me know and suggest a date/time.
A: I am down to schedule a time to chat. :) My Google Hangouts ID is: btsui@eng.ucsd.edu
I am processing the new file you have uploaded:
2018-08-29-12_43
Homo_sapiens.gene_symbol.est_counts.npy.gz
We have previously processed a number of these files ourselves and notice that the total number of reads we find mapping to genes is in most cases higher. I guess this is because you only report a subset of 34,738 genes, whereas we have approximately 55,000 (GRCh38 version 83). Is there a reason you excluded the other genes? I have not inspected this yet, but is there any subgroup of genes you excluded for some reason? Or did you exclude all genes for which there were no variants annotated in the dbGaP database?
Also, is there any chance you could upload a file with a column containing the number of reads processed and a column with the total number of reads mapped, per sample? We would like to use this as a covariate to remove from the data.
If you wish, I could show some figures, maybe over Skype or something. If you would be interested in that, feel free to let me know and suggest a date/time. It might be a bit easier to communicate. I am from Europe though, so please account for that if you suggest anything :)
Additionally, there are approximately 16,000 samples that have 0 reads mapping. We find the same in our own results (for those samples we have in our much smaller subset), but maybe it is worthwhile removing them from the matrix, as they add nothing ;)
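For what it's worth, a minimal sketch of how those empty columns could be dropped, assuming the matrix is loaded with genes as rows and samples as columns (the orientation and variable names are just for illustration, not taken from the resource):

```python
import numpy as np

def drop_zero_read_samples(est_counts, srr_ids):
    """Drop columns (samples) whose total estimated counts are zero.

    est_counts: genes x samples array of estimated counts.
    srr_ids: matching list of run accessions.
    Both names are illustrative, not the resource's actual variable names.
    """
    keep = est_counts.sum(axis=0) > 0
    return est_counts[:, keep], np.asarray(srr_ids)[keep]
```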
No worries, I know how these things can go ;). Thank you for updating it, I will download it and use it. Happy to hear you used Kallisto. Happy to see you also uploaded the estimated count files (instead of TPM), as that allows me to calculate the % of reads mapping. We would like to remove the signal of this technical covariate from the data (using linear regression), as we found there is a correlation/systematic bias between the reads mapping per gene and the % of reads mapping per sample.
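Concretely, the kind of correction I mean is an ordinary per-gene least-squares fit against the mapping rate, roughly along these lines (a sketch only; the variable names and the log transform are illustrative, not our actual pipeline):

```python
import numpy as np

def regress_out_pct_mapped(log_expr, pct_mapped):
    """Remove the linear effect of per-sample mapping rate from each gene.

    log_expr: genes x samples matrix of (log-transformed) expression values.
    pct_mapped: per-sample fraction of reads mapped, in the same sample order.
    Returns the residual matrix.
    """
    X = np.column_stack([np.ones_like(pct_mapped), pct_mapped])  # intercept + covariate
    beta, *_ = np.linalg.lstsq(X, log_expr.T, rcond=None)        # one fit per gene
    return log_expr - (X @ beta).T                               # residuals
```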
Also happy to hear you zipped the files; it makes everything faster :). It is also fortunate that your servers are very fast (I downloaded at 400 Mbit/s, which is the cap of the internet connection we have here).
I think your response times are great ;). Very happy you are responding to all my questions.
All in all, you make me a very happy person :D Hey @Sipko, sorry it took so long for the RNA-seq merging. I just merged the data. I got caught up with some other work; the following link contains the new expression matrix.
https://www.synapse.org/#!Synapse:syn11416994
I actually switched to Kallisto, a close variant of Sailfish, for mapping. The reason was that Sailfish kept throwing library errors in the latest run. Not sure why.
I also fixed the gene mapping problems. I used BioMart for the previous iterations, which probably screwed up the transcript-to-gene-symbol mapping, but now I take the mapping directly from the transcript annotation, which I checked has 0 missing genes.
I also gzipped the npy file to decrease its size roughly tenfold and optimize download speed.
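If it helps, a minimal sketch of how the gzipped npy matrix could be loaded without unpacking it to disk first (the file name is the one mentioned above; adjust the path as needed):

```python
import gzip
import io
import numpy as np

# Decompress into memory, then hand the buffer to numpy.
with gzip.open("Homo_sapiens.gene_symbol.est_counts.npy.gz", "rb") as f:
    buf = io.BytesIO(f.read())

# allow_pickle=True may be needed if the array stores Python objects.
est_counts = np.load(buf)
print(est_counts.shape)
```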
If you can run the test that you did last time on the new data, it would be much appreciated : )
I will have more people working on the project in the future; hopefully that will improve our response time. Good to hear :)
I also have another question. Would it by any chance be possible to share the percentage of reads mapping per run? In earlier research, we noticed that there is a systematic effect of that on the data, which we would like to remove (using linear regression). The number of reads mapping and the total number of reads are reported by Kallisto/Salmon/Sleuth in one of the output files. If it is too much hassle it is not a problem, I just thought I can always ask :D I am glad you like it. :) There are still 10,000 remaining for RNA-seq quantification... Hopefully, they will be done soon.
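For reference, this is roughly where those numbers live in a Kallisto output directory, assuming the standard run_info.json that kallisto quant writes (field names like n_processed and n_pseudoaligned should be double-checked against the Kallisto version actually used):

```python
import json
from pathlib import Path

def mapping_stats(kallisto_out_dir):
    """Return (reads processed, reads pseudoaligned, mapping rate) for one run."""
    info = json.loads((Path(kallisto_out_dir) / "run_info.json").read_text())
    n_total = info["n_processed"]
    n_mapped = info["n_pseudoaligned"]
    rate = n_mapped / n_total if n_total else float("nan")
    return n_total, n_mapped, rate
```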
Thank you for your clear answer. I think it is very clever to only include those reads that overlap regions with SNPs from dbSNP. I assume that for the count files all reads were included (e.g. hgGRC38__allSRAmatrix.realign.v9.base.TPM.gene.symbol.npy)? I did notice there were only approximately 27,000 genes and wondered what happened to the others (I think there are around 60,000 annotated genes in GRCh38).
I think it makes sense to pick the SNPs from dbSNP; I was just wondering how it would scale to do more :). I am a bit surprised that the sorting step is the rate-limiting one; I would have figured somebody would have come up with a clever solution to this problem by now. I do admire your attempt to write your code in C; I have considered doing this for some of my own projects, but never dared to really venture there ;)
I am currently doing some quality control (PCA to check for outliers, checking technical confounders, etc.) on the counts-per-gene matrix, and it looks pretty consistent with what we found before with a much smaller set (that we created ourselves), so we are quite excited :) Hi @Sipko,
Thanks so much for trying again. I would love to put you in the acknowledgment section of the GitHub page.
Q: is this because it is too computationally intensive to call all known variants?
**A: Short answer:** Correct, I was trying to minimize compute/storage requirements.
**Long answer:**
My understanding of your question is: why am I selecting 300k variants as opposed to all known variants?
Unfortunately, "The truth is rarely pure and never simple. " (Oscar Wilde), but in summary, I was trying to minimize the compute and storage requirements, while maximizing the amount of biological information captured, and satisfying data completeness.
**1. Minimizing the compute and storage requirements with 300k variants:**
The key thing for the paper was creating a reference index that targets the existing variants, so I used Bowtie as a filter to keep only the reads that hit the targeted variants, minimizing the number of reads that require downstream compute in the pipeline.
This optimization turned out to be a huge deal for minimizing the turnaround time for processing each sequencing run, as I noticed that sorting was the slowest step.
All the processes in the pipeline are O(n) with respect to the number of reads n, except sorting, which is O(n log n). The log(n) multiplicative term adds up pretty quickly when you think about it; for example, log2(10^6) ≈ 19.9, i.e. roughly a 20x factor. The n log(n) cost of sorting also means it cannot be fully Unix-piped: it has to hold intermediate data either in RAM or on disk, which translates into either resource strain or slower completion time.
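To make that factor concrete, a quick back-of-the-envelope check (the read count here is purely hypothetical):

```python
import math

n_reads = 1_000_000  # hypothetical number of filtered reads in one run

linear_work = n_reads                      # one O(n) pass, e.g. counting
sort_work = n_reads * math.log2(n_reads)   # comparison sort, up to constants

print(f"log2(n) = {math.log2(n_reads):.1f}")                      # ~19.9
print(f"sorting / linear pass ≈ {sort_work / linear_work:.1f}x")  # same factor
```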
I also looked into the option of writing C code for this end-to-end pipeline with full in-memory processing, but it turned out to be a much bigger endeavor than I expected, so I gave up on this option.
**2. Maximizing the amount of biological information captured with 300k variants:**
This is the place where I foresee taking the most bullets. I chose dbSNP for multiple reasons: 1) good maintenance: it is maintained by NIH/NCBI, so it will be the last to disappear suddenly; 2) clarity of information: I know precisely where each variant comes from using the bitfield in the dbSNP VCF info (http://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf).
dbSNP seems to be the most comprehensive variant annotation I have seen so far. Correct me if I am wrong, but my understanding is that the variants in gnomAD are mostly derived from ExAC, and those are also in the dbSNP VCF annotation.
**3. Data completeness:** I provide the allelic read counts for all sites with >1 read among the 300k variants, regardless of whether I see a variant or not. The common approach is to filter the data based on p-values to keep it small, but the hope is that by providing the raw counts, users can slice and dice with their favorite cutoff on their favorite post-hoc statistic (p-values, FDR) without me doing it for them first.
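As an example of the kind of post-hoc slicing meant here, a user-side filter might look roughly like this (the table layout and column names are purely hypothetical, just to illustrate the idea):

```python
import pandas as pd

# Hypothetical allelic-count table: one row per (run, variant) with reference
# and alternate read counts; the real resource may organize this differently.
df = pd.read_csv("allelic_read_counts.tsv", sep="\t")

coverage = df["ref_count"] + df["alt_count"]
alt_fraction = df["alt_count"] / coverage

# The cutoffs below are the user's choice, not something baked into the data.
filtered = df[(coverage >= 10) & (alt_fraction >= 0.2)]
```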
Does this help?
Looks like a lot more samples! Great! I have already started with the set you currently have online, as a test set. I will use the full set after you have uploaded it.
Also, I have a question about the SNPs you called. I noticed you limited yourselves to approximately 300,000 variants; is this because it is too computationally intensive to call all known variants (e.g. those annotated in gnomAD)? Does the run time scale linearly with the number of variants included? I created this notebook (https://github.com/brianyiktaktsui/Skymap/blob/master/Pipelines/RNAseq/checkProgress.ipynb), where the plot keeps track of the data growth. What do you think?
I still haven't merged the data yet; some of it is still running. I could probably merge it on Wednesday. If you need it sooner, please let me know :). Thanks so much for pointing this out; you won't see it in the next merge. Thank you! Also, I think I noticed a couple of columns that only have NaN values:
SRR4841403
SRR5444655
SRR3236088
SRR3214897
SRR1538694
ERR1633523
ERR1632742
SRR3340829
I would argue it is better to remove all samples that only contain NaN values from the matrix; it saves the user some trouble :) Thanks for the suggestion, I will try to do that :) Thank you for looking into this. I am looking forward to the latest merge and will wait for it before we start our analyses. As a suggestion, maybe it would be nice to have a changelog that keeps track of changes between different versions (e.g. how many new samples were added), so people know when it is worth rerunning a previous analysis on a newer dataset version.
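For the NaN-only columns, the clean-up could be as simple as the following sketch, assuming the matrix is loaded as a pandas DataFrame with runs as columns (the orientation is an assumption):

```python
import pandas as pd

def drop_all_nan_runs(expr: pd.DataFrame) -> pd.DataFrame:
    """Drop columns (runs) in which every value is NaN."""
    return expr.dropna(axis=1, how="all")
```

Doing this once at merge time would spare every downstream user the same check.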
Best, Sipko Hi Sipko,
Thanks so much for trying again. I just looked into the problem and found the solution; most likely you will see the data in a week.
Back to your question on why there are missing data.
For the specific runs that you mentioned, SRR998545 and SRR998530 had missing availability meta-info in the SRA metadata, so they were not processed in the original pipeline.
I have modified the filtering criteria so that the pipeline will pull the data whenever the visibility field is not equal to "protected", thus maximizing data coverage even when the availability meta-info is missing (try downloading and processing this one).
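Roughly, the new criterion amounts to something like this over the SRA run metadata table (the file and column names here are assumptions for illustration, not the pipeline's actual code):

```python
import pandas as pd

# Hypothetical dump of SRA run metadata; the real pipeline may read it differently.
meta = pd.read_csv("sra_run_metadata.tsv", sep="\t")

# Keep every run whose visibility is not explicitly "protected",
# even if other availability fields are missing (NaN).
runs_to_process = meta.loc[meta["visibility"].fillna("") != "protected", "Run"]
```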
Problem 2:
The current RNA-seq merge is pretty old. SRR949118 will be in the latest merge (in about a week); it has already been processed.
The long-term solution:
I will improve the bookkeeping of the resource by exposing the meta-table info and keeping track of all the error log files.
Cheers,
Brian Hi Sipko,
Thanks so much for pointing this out. Let me take a look; I will try to give you a reply within the day.
Cheers,
Brian