Inconsistency between truth and data in sim31 isoform dataset

Dear organizers, I'm looking at isoform quantification, particularly at sim31 from round2 as a case for further exploration. I have looked only at this dataset so far, but I'm finding some results that are difficult to explain, and wondering whether there might have been an error in the data simulation process. Here is one of the more straightforward examples of what I am seeing. Consider gene ENSG00000158042, or MRPL17. The contest gtf file says that there are four isoforms: ENST00000288937 (MRPL17-001), ENST00000529958 (MRPL17-002), ENST00000532676 (MRPL17-003), and ENST00000532203 (MRPL17-004). (I'll use the MRPL17-xxx names to make my case clearer.) The exons of these isoforms, as taken from the gtf file, are (all on chromosome 11, minus strand): MRPL17-001: 6704354-6704632, 6703978-6704046, 6702013-6703633 MRPL17-002: 6703978-6704527, 6703182-6703633 MRPL17-003: 6703407-6703631, 6702382-6702459 MRPL17-004: 6704354-6704547, 6703380-6703633 These all seem reasonable, and I have re-aligned these transcripts to the GRCh37 genome to verify these exon coordinates. Furthermore, no other gene in the gtf overlaps this overall gene region (chromosome 11, 6702013 to 6704632), on either the plus or minus strand. Now, here is the problem: the truth file claims the expressions are MRPL17-001 (0.000000), MRPL17-002 (0.000000), MRPL17-003 (0.000000), and MRPL17-004 (0.580000). However, I contend that the data implies that MRPL17-001 has a non-zero expression. Note that MRPL17-001 has some unique regions compared to all other transcripts. In particular, its final exon extends further 3' than any other transcript, in particular, the region from 6702013 to 6703182 is unique to this transcript. Therefore, any read that aligns to this part of chromosome 11 can be explained only by expression of MRPL17-001, and therefore would mean that MRPL17-001 cannot have a zero expression. It is possible that a read aligning to this region could align elsewhere in the genome, but in my analysis below, this turns out not to be the case. In my alignment of the reads to the genome, I find over 100 such reads that imply MRPL17-001 has non-zero expression. If the contest organizers want the full set of such reads, they can send a private message to me. But here is a clear example: read 23364697, which aligns to -11:6702424-6702325 exactly (with no mismatches) for the first end, and whose second end aligns to +11:6702051-6702150 exactly (with no mismatches). The second end lies completely in the unique region for MRPL17-001, while the first end lies partially in this unique region. But neither end can be explained by another transcript, and especially not MRPL17-004, which is claimed to be the only expressing transcript in this gene. A direct pairwise alignment between the ends of this read and each of the four transcripts confirms that MRPL17-001 can be the only source. Interestingly, MRPL17-001 also has begins with a unique region; its exon 1 has 6704632 to 6704548 and that distinguishes it from all other transcripts. But I do not find any reads that map to this unique region. This is also very strange and difficult to explain, to have such a discrepancy in apparent expression between the 5' and 3' ends of MRPL17-001 transcript. (Organizers, I have a graph that shows this discrepancy very clearly.) Therefore, I believe that the data simulator has a bug. Perhaps you can investigate further, and tell me how read 23364697 was generated. If it is supposedly being generated by MRPL17-004, then something is clearly wrong. If it is being generated somewhere else in the genome, then perhaps I have misaligned these reads, and you can ignore this post. However, I have also aligned the reads to a transcriptome, and the same problem occurs there: these reads align only to MRPL17-001 and no other transcript. Here is my guess as to the problem. The simulator seems to be generating paired-end reads based on the starting coordinate of a genomic coordinate within a given transcript. In this case, the simulator knows that MRPL17-004 is an expressed transcript, so it generates a paired-end read starting at 11:6702424. However, the simulator apparently does not check the ending coordinate, to make sure that the paired-end read lies completely within the boundaries of that transcript. For this read, the read extends past the 3' end of MRPL17-004. Therefore, any analysis program, including my own, has to infer that MRPL17-001 is being expressed. This explanation, about using the starting genomic coordinate, would also explain why I see this problem only at the 3' end of MRPL17-001 and not at its 5' end. Of course, I am guessing that your simulator is trying to generate reads on genomic coordinates, and not from transcripts, which would be harder for me to explain. Anyway, I'm happy to be proven wrong here, but something just does not make sense to me. Thanks for looking further into it.

Created by genomehacker
Dear @genomehacker : Thanks for this careful observation. Since we are simulating both canonical isoforms and gene fusions, there may be cases where a canonical isoform, such as MRPL17-001, is not expressed per se, but is part of a gene fusion that is expressed. Note that the truth file for sim31 fusions includes the following fusion: ENST00000584895-ENST00000288937. The expression of canonical isoforms are reported in the isoforms truth file, but expression of the fusion products are not. If you have other examples that are not answered by this explanation, please post for us to take a look.
I'm looking briefly at the RSEM simulator code. It does seem to work on cDNA transcripts, so it's hard to see how it could make an error. The code is a bit complex for me to debug right now, but the PairedEndModel.h file says that the program generates an identifier for each read that contains the rid (read id), dir (direction), sid (sequence id), pos (read position), and insertL (insert length). There must be a step afterwards to remove everything but the read id. The question I have is what the sequence id is for read id 23364697. Is there a way to run the simulator again for sim31, with the same random number generator settings, but keep the sequence id in the output? That would at least answer the question of whether there is a problem.
The simulator code can be found at https://github.com/Sage-Bionetworks/rnaseqSim There are instructions to document the steps we are taking to build the training data. The reads are generated by the RSEM simulator, based on the provided expression values, in the 'fastq_create' step ( https://github.com/Sage-Bionetworks/rnaseqSim/tree/master/fastq_create )
Thanks for this report, we are looking into it.
The problems reported earlier in this forum with the round 2 dataset files were minor formatting errors. The anomaly I am reporting calls into question the validity of the data simulation process. If I'm correct, then all of the datasets, including the round 3 datasets, have serious problems. I suppose I could find a similar example in the round 3 series, and now that I know what to look for, I could probably find it very easily, but I don't think the data simulator changed between round 2 and round 3.
I'm not sure your problem has been resolved, but the current dataset appears to be sim41 through sim45 (others also found problems with the sim3* set of files, see elsewhere on this forum).

Your web browser must have JavaScript enabled in order for this application to display correctly.
If you are an automated web crawler from a search engine, follow this AJAX application crawl link

Drop files to upload

Inconsistency between truth and data in sim31 isoform dataset page is loading…