I believe there is something seriously wrong with the fusion dataset. Either that, or I'm missing something that should be relatively simple. My understanding is that the bedpe file contains the genomic position of the donor and acceptor splice sites for each simulated gene fusion. Now a biologically plausible gene fusion should join a donor splice site from one gene to an acceptor splice site from a distant gene. This is because the gene fusion occurs in the genomic DNA sequence, most likely between two introns, and then splicing of the merged intron joins the two exons together in the RNA data, resulting in an exon-exon boundary. In fact, a candidate gene fusion that does not occur between two biologically plausible splice sites is likely to be a library artifact and not a true gene fusion. Many genomic alignment programs look for this exon-exon boundary to identify true gene fusions.
So, I extracted all splice sites from the ensembl GTF file, and then looked up each entry in the bedpe file against these "known" splice sites. I found that only 15 of the 50 cases have both the donor and acceptor splice sites match a known splice site in the ensembl GTF file. In most of the cases, the problem is that the donor site is okay, but acceptor splice site does not match. I even manually checked some of these faulty acceptor sites, and they don't even occur next to a canonical AG dinucleotide, which makes them extremely unlikely to occur in nature.
For this reason, I believe that it is not worth pursuing the fusion contest any further until this anomaly can either be explained or corrected. There is no point having a contest when the dataset appears to have major problems, so there is no acceptable gold standard. I am attaching my analysis of the sim1_filtered.bedpe file below. The word "match" means the given splice site matches the ensembl GTF file, whereas "no" means it does not. In some cases, I detected minor errors in the ensembl GTF file itself, usually when two different gene IDs are used for the same set of transcripts, but in one case, where the gene ID differs from that in the bedpe file.
16 85945263 85945264 2 172379747 172379748 ENSG00000140968-ENSG00000071967 0 + + match no
3 46305945 46305946 4 184605103 184605104 ENSG00000183625-ENSG00000168538 0 + + match no
3 165547305 165547306 4 141294691 141294692 ENSG00000114200-ENSG00000153130 0 - + match no
20 60736635 60736636 5 172261417 172261418 ENSG00000184402-ENSG00000113719 0 + + match no
1 63153898 63153899 8 144657262 144657263 ENSG00000116641-ENSG00000147813 0 - - match match
12 105380244 105380245 12 124817009 124817010 ENSG00000151131-ENSG00000196498 0 + - match match
1 36432675 36432676 12 57944055 57944056 ENSG00000126070-ENSG00000155980 0 + + match no
5 45695771 45695772 15 77150219 77150220 ENSG00000164588-ENSG00000140386 0 - - match match
X 105197164 105197165 19 53875535 53875536 ENSG00000123572-ENSG00000203326 0 + + match no
2 110584423 110584424 1 89329796 89329797 ENSG00000015568-ENSG00000137947 0 + - match match
19 53879608 53879609 10 134143988 134143989 ENSG00000203326-ENSG00000165752 0 + - no match
11 114168879 114168880 6 84417645 84417646 ENSG00000166741-ENSG00000065609 0 + - match no
19 16738683 16738684 19 10197104 10197105 ENSG00000105085-ENSG00000130813 0 - + match(also ENSG00000151979) no
1 42776721 42776722 11 380292 380293 ENSG00000198815-ENSG00000182272 0 - + match match
11 61205319 61205320 6 133055521 133055522 ENSG00000256591-ENSG00000093134 0 + - match(ENSG00000167985) match
3 167812887 167812888 15 91499965 91499966 ENSG00000173905-ENSG00000166965 0 - + match no
11 57563200 57563201 12 26208277 26208278 ENSG00000198561-ENSG00000123094 0 + + match no
15 74374767 74374768 11 118997655 118997656 ENSG00000159289-ENSG00000172273 0 - + match no
21 47532737 47532738 6 33169023 33169024 ENSG00000142173-ENSG00000112473 0 + + match no
4 122723075 122723076 12 124086696 124086697 ENSG00000123737-ENSG00000111364 0 + + match no
8 145138639 145138640 8 43035719 43035720 ENSG00000197858-ENSG00000165102 0 + + no no
18 72176162 72176163 8 86253863 86253864 ENSG00000133313-ENSG00000133742 0 + - match no
21 37429682 37429683 11 1244354 1244355 ENSG00000185917-ENSG00000117983 0 - + match no
9 131344834 131344835 20 48770173 48770174 ENSG00000197694-ENSG00000240849 0 + - match no
4 672952 672953 14 53417285 53417286 ENSG00000215375-ENSG00000073712 0 + - no no
7 105603809 105603810 X 153631530 153631531 ENSG00000128536-ENSG00000013563 0 + - match match
6 64029819 64029820 11 9838570 9838571 ENSG00000146166-ENSG00000133812 0 - - match match
18 57115215 57115216 4 56724490 56724491 ENSG00000183287-ENSG00000090989 0 - + match no
17 71238489 71238490 7 624204 624205 ENSG00000141219-ENSG00000188191 0 + - match match
5 131329829 131329830 3 58415964 58415965 ENSG00000164398-ENSG00000168291 0 - - match match
20 37175072 37175073 12 7043165 7043166 ENSG00000170471-ENSG00000111676 0 + + match no
16 89970612 89970613 14 95085543 95085544 ENSG00000141002-ENSG00000196136 0 + + match match
19 53058880 53058881 7 134890809 134890810 ENSG00000198482-ENSG00000105875 0 + - no match
11 9735185 9735186 16 8829597 8829598 ENSG00000133789-ENSG00000183044 0 + + match no
16 20824623 20824624 11 70507868 70507869 ENSG00000005189-ENSG00000162105 0 + - match no
2 158655939 158655940 12 108910750 108910751 ENSG00000115170-ENSG00000198855 0 - + match no
15 37390168 37390169 1 156873719 156873720 ENSG00000134138-ENSG00000187800 0 - + match no
16 27895807 27895808 1 59004965 59004966 ENSG00000169181-ENSG00000162600 0 - - match no
8 144654591 144654592 16 71807264 71807265 ENSG00000204839-ENSG00000166747 0 - - match match
21 38311132 38311133 19 51159318 51159319 ENSG00000159267-ENSG00000235034 0 - + match no
X 147022180 147022181 1 205633783 205633784 ENSG00000102081-ENSG00000158715 0 + - match no
11 3845593 3845594 1 46597623 46597624 ENSG00000148985-ENSG00000117461 0 + - match no
16 77468503 77468504 5 1036895 1036896 ENSG00000140873-ENSG00000145506 0 - + match no
16 304679 304680 15 67813587 67813588 ENSG00000167930-ENSG00000189227 0 + + match(also ENSG00000269881) no
8 42623495 42623496 2 24046438 24046439 ENSG00000147434-ENSG00000119778 0 - - match match
6 38670747 38670748 7 122131473 122131474 ENSG00000124767-ENSG00000081803 0 - - match match
10 71703933 71703934 17 28270646 28270647 ENSG00000197467-ENSG00000176927 0 + + match no
11 211348 211349 16 89813095 89813096 ENSG00000177963-ENSG00000187741 0 + - match match
8 31498244 31498245 20 1099417 1099418 ENSG00000157168-ENSG00000125818 0 + + match no
2 220327107 220327108 16 30773867 30773868 ENSG00000072195-ENSG00000103549 0 + + match no
Created by genomehacker Hi Kristen Dang, and everyone:
This discussion thread is very helpful.
>"Moreover, next week we will release another training dataset and its fusion junctions will be more heavily biased towards exon boundaries and canonical splicing sites."
We tried to find this new training dataset on synapse website but we can't find it. Could you give us the link to the new dataset?
Thank you!
Dear all,
Thank you for your attention to this issue. After much discussion of the issues raised here, we have decided on the following approach for round 1:
>Due to the way in which the training data for round 1 was constructed, the simulated fusions do not all exist at exon boundaries and do not necessarily respect canonical splicing rules (thanks to participant genomehacker for bringing these issues to light). To address these issues, the evaluator will output performance statistics for different subsets of simulated fusions defined by whether or not the junctions are (1) at exon boundaries, and (2) at canonical splicing sites. Moreover, next week we will release another training dataset and its fusion junctions will be more heavily biased towards exon boundaries and canonical splicing sites.
>
>Because we want to make sure participants have enough time to take advantage of these improvements the round 1 deadline has been moved to September 9, 2016 and deadlines for subsequent rounds will be adjusted.
We appreciate your continued participation in the challenge. Please continue to keep us apprised of any concerns or questions. Any updates on the training/truth dataset and how this will affect scoring? With one day out and no communication it's difficult to see how the challenge is still following the proposed timelines. Hi Kristin,
Any news on the fusion break point coordinate issue ? The deadline for the 1st round is approaching. As the coordinate issue is pretty important and would seriously affect tools' performance, I am wondering how would you handle it ? Thanks.
Hi Kristen,
Just so we are clear: I just don't see how the round 1 datasets can be salvaged for the purpose of evaluating gene fusion pipelines.
Regarding the first issue in your post: FUSIM (or whatever method you use) should not be selecting the first exon in the acceptor gene, nor should it be picking the last exon in the donor gene. Picking the first exon in the acceptor gene is tantamount to a fusion between a donor splice site and a start codon, which doesn't make any biological sense. Picking the last exon in the donor gene is tantamount to a fusion between a stop codon and an acceptor splice site, which also doesn't make any biological sense. (The endpoints appear to be selected from the CDS entries in the GTF file, which is why the start and stop codons are appearing at the fusion junctions.) If FUSIM is picking the last donor exon or the first acceptor exon to splice, then that is simply a bug in FUSIM.
This is not just a matter of non-canonical dinucleotides occurring at the fusion junctions. The fusion junctions in the datasets are just not plausible in real life, so no one would ever design a pipeline to look for them. In fact, it makes no sense to reward a pipeline to look for events that are essentially false positives.
As for the second issue, if the coordinates of the fusion junctions are off, then there will be insertions or deletions at the fusion junctions. Those are also unlikely to occur in real life, and I would guess that very few pipelines are designed to look for such artifacts.
Hi Kristen,
Thanks for looking into this issue and for your explanation. You also need to make sure that the exon that FUSIM selects for the donor gene should not be the last exon of the gene. That appears to be the problem with the fusion having donor gene ENSG00000203326, with donor junction at chromosome 19, 53879608-53879609. That is actually the stop codon for the gene, according to the GTF file, is at 53879607-53879609.
I find it rather surprising, though, that the FUSIM software would have picked the first exon in the acceptor with such a high frequency. That does not seem like a random selection of exons across each acceptor gene, and might suggest a possible bug in FUSIM.
Note that the splice sites don't necessarily require canonical GT-AG dinucleotides. As you undoubtedly known, a small fraction of real splice sites do have other dinucleotides, especially GC-AG and AT-AC. But the ones chosen for the challenge should either match annotated splice sites, or be highly plausible otherwise.
I hope you are able to resolve the coordinate issue as well. I noticed that the bedpe files for the dryrun samples all had the same coordinate for the two splice site endpoints, whereas the bedpe files for the sim samples all had the two coordinates separated by 1 bp. The format should be consistent across the different datasets.
Dear genomehacker,
Thanks for posting this issue. There are two separate processes that are contributing to your observations.
First, we did run FUSIM with the option to maintain exon boundaries and to auto-correct the orientation. However, when keeping exon boundaries, FUSIM often uses the first exon of the acceptor gene as the acceptor in the fusion, which does not necessarily have the expected acceptor dinucleotide. This explains the "match" -- "no match" discrepancy you found with ENSEMBL. All cases where the acceptor transcript starts with an exon > 1 are matches with the expected dinucleotide. So the round 1 dataset does attempt to model a dataset where the fusions are at the exon boundaries, but not all of them will have the canonical dinucleotide splicing. We are planning to increase the biological realism with upcoming rounds, and will include splicing using canonical dinucleotides.
In checking into your post, I did find a separate issue, where it looks like about half of the junctions are +1 of the actual exon boundary. This appears to come from the GTF --> refflat converter that comes with FUSIM. This unfortunately means that exon-exon boundaries are not maintained as we intended.
Please expect another post soon regarding how we will handle this second issue.
I have just read the paper on FUSIM. It says that "breakpoints are created by randomly selecting $n$ number of consecutive exons from the start or end of each gene." But, then the program has two options: "By default, genes are fused together by splitting the joined exons in random positions (split exons). The keep-exon boundary option will fuse genes exclusively on exon boundaries."
It is nice that the program authors implemented both behaviors, and at least recognized the need to maintain exon boundaries. However, their default behavior does not simulate how gene fusions are created in real life in genomic DNA. Split exons are extremely rare in real data, and very few, if any, are relevant to cancer genomics.
On the other hand, the keep-exon boundary option would simulate real data more closely, and this is how the challenge dataset needs to be created, if the goal is to evaluate bioinformatics approaches to real-life data.
I'm also worried by another default behavior in FUSIM. The paper says that "FUSIM can be set to auto-correct the orientation of the resulting fusion transcript if genes are located on different strands. This is done by reverse complementing the selected exons to match the orientation of the first gene in the fusion." If I understand this correctly, then of course, the fusion transcript needs to be corrected so all exons have the same orientation. There is no way that a biological process (which needs to create 5' to 3' chemical bonds) could create a gene fusion transcript otherwise.
We also have our own internal software for generating artificial gene fusion datasets. The best way to generate real data is to simulate a gene fusion in the genomic DNA, and then translate this into RNA transcripts and RNA-Seq reads. You could then have a few split exons (although again, they are not typically relevant for cancer genomics), but most fusions would be at exon-exon boundaries.
I would strongly urge the challenge organizers to run FUSIM with the keep-exon boundary option and make sure that fusion exons are in a consistent 5'-to-3' orientation. Round 1 appears to be a loss at this point, but perhaps the dataset can be generated correctly for round 2.
I'm not familiar with FUSIM, but I have this:
https://github.com/FusionSimulatorToolkit/FusionSimulatorToolkit/wiki
which is a way to simulate fusions - and in this case restricts the fusions to exon-exon junctions between different transcripts of different genes. Of course, using this might give me an unfair advantage since I use this extensively for my own benchmarking. I agree that this is a very serious problem that should be resolved by the DREAM challenge organizers. I hope the challenge organizers will address this ASAP given that the first deadline is rapidly approaching.