Hello, We would like to obtain the unmapped-read fastq files for the Mayo RNA sequencing study-path aging (syn5550404), so that we can reconstruct the complete fastq files and redo the mapping with another mapping algorithm. Unfortunately, you do not provide the unmapped reads after running SNAP, so I would like to ask you to provide us with the initial fastq files so that we can run them through different mapping algorithms. Importantly, we do not need fastq files reconstructed from the old bam files, but the initial raw fastq files that came off the sequencer. As is standard in the field, these are the reads normally stored in the NCBI SRA repository. Please let me know how we could access this data and whether a data transfer agreement is needed.

Created by travis haight haight
Hello Cory, My PI would like to know if you have contact information for someone who could help us sort out this unmapped reads issue. Thanks, Travis
Cory, I downloaded a single bam file (18237_TCX.snapr.sorted.bam) as you suggested and ran samtools idxstats on it. The output is below and shows zero unmapped reads (the fourth column). As I said before, page 3 of the SNAP guide states that "if there are too many hits on a seed SNAP will simply ignore it." My understanding is that many reads (especially those in repetitive elements) are therefore omitted. In this case we do not need the entire initial fastq, but we do need the fastq of unmapped reads, as your colleagues on this portal have provided (see MSBB Bam files, https://www.synapse.org/#!Synapse:syn7416949). Please let me know how we could proceed. Once again, we are grateful to Mayo for this great published resource.
samtools idxstats 18237_TCX.snapr.sorted.bam
1    248956422   7012935    0
2    242193529   2725317    0
3    198295559   2192451    0
4    190214555   1509206    0
5    181538259   1771630    0
6    170805979   1832988    0
7    159345973   1926661    0
8    145138636   1488224    0
9    138394717   1654465    0
10   133797422   1576367    0
11   135086622   2770422    0
12   133275309   2327270    0
13   114364328   832900     0
14   107043718   1483661    0
15   101991189   1236874    0
16   90338345    1805136    0
17   83257441    2786858    0
18   80373285    976629     0
19   58617616    2581977    0
20   64444167    1169703    0
21   46709983    14049129   0
22   50818468    1031772    0
MT   16569       36686383   0
X    156040895   1254402    0
Y    57227415    24624      0
*    0           0          2460758
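A note on reading the idxstats output in the post above: each row is reference name, reference length, mapped read count, and unmapped read count, and the final row named `*` reports reads that are unmapped and have no reference position at all. The sketch below is only an illustration of how those columns could be tallied; the function name `summarize_idxstats` is hypothetical, and the sample text is a three-row excerpt of the posted output, not the whole file.

```python
def summarize_idxstats(text: str) -> dict:
    """Tally samtools idxstats output (name, length, mapped, unmapped per row)."""
    mapped = 0            # reads mapped to a named reference
    unmapped_placed = 0   # unmapped reads listed under a named reference
    unplaced = 0          # unmapped reads on the '*' row (no reference position)
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # idxstats is tab-delimited; split() also accepts space-separated copies
        name, _length, n_mapped, n_unmapped = line.split()
        if name == "*":
            unplaced += int(n_unmapped)
        else:
            mapped += int(n_mapped)
            unmapped_placed += int(n_unmapped)
    return {
        "mapped": mapped,
        "unmapped_placed": unmapped_placed,
        "unmapped_unplaced": unplaced,
    }


# Excerpt of the output posted above (first row, MT row, and the '*' row).
excerpt = """\
1 248956422 7012935 0
MT 16569 36686383 0
* 0 0 2460758
"""

print(summarize_idxstats(excerpt))
```

On this reading, the zeros in the fourth column of the chromosome rows and the count on the `*` row are separate figures, so both are worth checking before concluding that a bam file holds no unmapped reads.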
Travis, I'm not sure I can be any more clear: the unmapped reads are in the bam file. If you don't believe me, I suggest you download a single bam file and use the idxstats command in samtools. You will see exactly how many unmapped reads are in the bam file. Ignoring a read for the purposes of mapping is not the same as deleting it. It simply is not mapped and does not have any of the associated mapping metrics.
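To illustrate the distinction above between a read being ignored by the mapper and a read being deleted: in the SAM/BAM format an unmapped read is still a record, marked by bit 0x4 of its FLAG field. The sketch below checks that bit on two made-up SAM lines; the read names, sequences, and the helper name `is_unmapped` are hypothetical and not taken from the Mayo files.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_line: str) -> bool:
    """Return True if the record's FLAG (second tab-separated field) has 0x4 set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & UNMAPPED)


# Two made-up records for illustration only.
mapped_record = "read1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
unmapped_record = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"

for record in (mapped_record, unmapped_record):
    print(record.split("\t")[0], "unmapped:", is_unmapped(record))
```

The unmapped record still carries its name, sequence, and qualities; only the alignment fields are placeholders.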
Hi Cory, thank you very much for your prompt answer. Of course nobody expects you to upload the 6 TB of MayoRNAseq fastq files if it is not necessary. However, page 3 of the SNAP guide states that "if there are too many hits on a seed SNAP will simply ignore it." My understanding is that many reads (especially those in repetitive elements) are therefore omitted. In this case we do not need the entire initial fastq, but we do need the fastq of unmapped reads, as your colleagues on this portal have provided (see MSBB Bam files, https://www.synapse.org/#!Synapse:syn7416949). Of course, if you do not have the unmapped fastq files, they would need to be made available some other way, so that we can check reproducibility with other mapping algorithms as well. Please let me know how we could proceed. Once again, we are grateful to Mayo for this great published resource.
Hi Travis. I helped write SNAPR. It doesn't leave any reads out. They're all there. We have no plans to upload the 6 terabytes of fastq files.
Hello Cory, Thank you for your response. For other projects within the consortium the unmapped reads have been provided, but in this case they have not. Does the mapping algorithm used here preserve the unmapped reads within the bam files? After reading the SNAPR details this is not clear to me, and my understanding is that the mapper may filter out reads whose seeds have multiple alignments. In that case, having the original fastq file, or at least a fastq file of the unmapped reads as in other consortium projects, would be optimal. Thanks, Travis
Travis, My recollection is that the decision for this consortium was to post only the bam files, which include all mapped and unmapped reads (feel free to confirm that with samtools, which can count all mapped and unmapped reads). There are several programs that will convert the bam files back to fastq files. I understand that might not be ideal, but it's what was decided for this consortium due to space constraints.
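The check and the conversion mentioned above could be scripted roughly as follows. This is a sketch only, assuming samtools is installed and on the PATH; the helper names are hypothetical. `samtools view -c -f 4` counts records with the unmapped flag set, and `samtools fastq` writes reads in fastq format to standard output. For paired-end data in a coordinate-sorted bam, grouping reads by name first (for example with `samtools collate`) is generally needed to keep mates together; that step is left out here.

```python
import subprocess


def count_unmapped_cmd(bam_path: str) -> list:
    """Command that counts records whose FLAG has the unmapped bit (0x4) set."""
    return ["samtools", "view", "-c", "-f", "4", bam_path]


def unmapped_fastq_cmd(bam_path: str) -> list:
    """Command that writes only the unmapped reads as fastq to stdout."""
    return ["samtools", "fastq", "-f", "4", bam_path]


def all_reads_fastq_cmd(bam_path: str) -> list:
    """Command that writes every primary read (mapped or not) as fastq to stdout."""
    return ["samtools", "fastq", bam_path]


def run_to_file(cmd: list, out_path: str) -> None:
    """Run a command and capture its standard output in out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Print the commands for the file discussed in this thread without running them.
bam = "18237_TCX.snapr.sorted.bam"
for cmd in (count_unmapped_cmd(bam), unmapped_fastq_cmd(bam), all_reads_fastq_cmd(bam)):
    print(" ".join(cmd))
```

Building the command lists separately from running them makes it easy to inspect exactly what would be executed before touching a large bam file.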
Hi Travis, I'm following up with the data provider about this.
