My group is interested in using isoform-level expression/alternative splicing data for deconvolution. Would it be possible to make the aligned BAM-level information available from the validation data? Moreover, there are many non-coding RNAs that may not be annotated, but might be useful for deconvolution given their known tissue-specificity.

Created by Angela Brooks brooks
Hi @brooks and @Tebalde0, I'm sorry for the very long delay in responding while we have been discussing this. We can tentatively provide the raw FASTQs for the one (!) leaderboard dataset that will be RNA-seq and for all of the validation data. I have updated the Wiki here: https://www.synapse.org/#!Synapse:syn15589870/wiki/592699 to include columns (fastq1.files, fastq1.files, and fastq.samples) that will point you to the FASTQ files when available. If either of you (or others) plan on participating and accessing these data, please let me know. I would like to understand how you will access them efficiently and what type of compute and memory resources you will require. Is it possible for us to provide an intermediate format that would be more efficient to process, e.g., transcript-based summaries or splice junctions? Regards, Brian
Hi Brian, Our DREAM challenge team would also like to get access to the FASTQ files. Our strategy, and therefore our participation, requires the raw data or BAM files to capture specific sequences. Thanks, Antonin
Hi Angela and all, We had not intended to do this, but I can appreciate the motivation. Are others interested in this level of data? Angela (and others), would isoform-level summaries suffice? Though I said in the webinar that we do not plan to enforce an execution time limit (yet), I'm afraid this goes well beyond the level of processing we had anticipated. That is, won't it take a long time to process all of these BAMs, e.g., if a dataset includes 100 admixtures? In this case, is there a standard splicing pipeline that Sage could run on the data? This would need to be something we could do efficiently (in terms of effort). Finally, any method leveraging isoforms / splicing / non-coding RNAs is going to be at a disadvantage. Much of the training data we have linked to is microarray. The leaderboard data also has a mix of microarray and RNA-seq. We are looking into whether we can easily get BAMs / FASTQs for the leaderboard data, but this will be more of a challenge than for the validation data. That said, the leaderboard rounds are mostly for your benefit to see how you are performing and to revise your method. You would get an NA for the microarray data, which would lead to an overall rank / score of NA. But we could provide performance on a per-dataset basis so that you could see results on RNA-seq data -- assuming, that is, that we can efficiently include raw RNA-seq data in the leaderboard. I'm sorry to expose you to my rambling thoughts. I think it's a great idea and I'd love to facilitate it. Convince me it's going to be doable for us on a somewhat tight timeline. Unfortunately, I will be out for a week. If one of my colleagues does not respond in my absence, I will once I return. Thanks, Brian
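To make the scoring point above concrete, here is a minimal sketch of how an NA on one dataset could propagate to the overall score while per-dataset results stay visible. The dataset names, the example score, and the use of a plain mean as the overall score are all assumptions for illustration; the thread does not say how the challenge actually aggregates scores.

```python
import math

def overall_score(per_dataset):
    """Mean over datasets (assumed aggregation); any NA makes the result NA."""
    values = list(per_dataset.values())
    if any(math.isnan(v) for v in values):
        return math.nan
    return sum(values) / len(values)

def scored_datasets(per_dataset):
    """Keep only the datasets on which the method produced a score."""
    return {name: s for name, s in per_dataset.items() if not math.isnan(s)}

# Hypothetical leaderboard round: a FASTQ-based method cannot score microarray data.
scores = {"microarray_A": math.nan, "rnaseq_B": 0.5}

overall = overall_score(scores)        # NA, because one dataset is NA
per_dataset = scored_datasets(scores)  # only the RNA-seq dataset remains
```

Under these assumptions, a splicing-based method would still see its RNA-seq result in `per_dataset` even though `overall` is NA.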
