Dear organizers,
Looking at the methods used to generate the benchmarking datasets, it seems that the fusion challenge focuses on chromosomal aberrations that fuse genes in proper orientation (5'-3') and can hence yield chimeric proteins. However, many chromosomal aberrations lead to aberrant transcripts which are not translated but can still be relevant to tumor biology. For example, translocations which truncate tumor suppressor genes inactivate these genes. Another example is translocations disrupting the 3' UTR of PD-L1, which have been shown to deregulate PD-L1 expression. Such chromosomal aberrations also manifest in the RNA: the former example yields transcripts with intergenic breakpoints; the latter yields transcripts with breakpoints in the UTR. Some fusion detection tools also report these events, because they are relevant to cancer. If only "classical" fusions are considered true positives, these tools would be at a disadvantage, because the additional events inflate the result set and thus hurt precision. For the simulated and spike-in datasets this is not an issue, since it is clear that no such events are expected; tools that usually report such events should simply disable this feature for these datasets. But the challenge overview says that in the end the tools will also be tested on real tumor data. How will these types of events be scored there?
Other examples of aberrant transcripts which are not fusions in the classical sense but are nevertheless interesting in the context of cancer include:
- chromosomal aberrations which fuse genes in nonsensical orientations (5'-5' or 3'-3'), such that the resulting transcript consists of the sense strand of one gene and the antisense strand of the other gene
- small duplications inside a gene, yielding duplicate exons
- small inversions inside a gene, leading to transcripts involving both the sense and antisense strand of a gene
- small deletions inside a gene, leading to exon skipping
On the other hand, there are non-canonically spliced transcripts that look like they arise from chromosomal aberrations, but that are perfectly normal and are frequently observed in healthy tissue. Examples of such transcripts are:
- cis-spliced transcripts between neighboring genes (a.k.a. read-through transcripts)
- trans-spliced transcripts between genes on opposite strands
- circular RNAs which look like duplications in NGS data
- exon shuffling which leads to non-canonical ordering of exons and also looks like duplications in NGS data
All of these can be observed in healthy tissue samples and are therefore unlikely to be of importance for cancer development. For this reason, some fusion detection tools remove such events from their results, even though a PCR-validation would probably turn out positive. Would this be rated as a false negative by the scoring system? This question is theoretically relevant to the simulated datasets as well, because FUSIM could, by chance, mimic a fusion between neighboring genes.
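To make the read-through case concrete, here is a minimal sketch of the kind of filter a tool might apply before reporting. The function name, the tuple layout, and the 100 kb `max_gap` threshold are assumptions chosen for illustration; they are not taken from any particular fusion finder.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def is_read_through_candidate(five_prime: Gene, three_prime: Gene,
                              max_gap: int = 100_000) -> bool:
    """Return True if the 3' partner lies directly downstream of the
    5' partner on the same strand, within max_gap bases (hypothetical
    threshold), so the event could be explained by cis-splicing."""
    if five_prime.chrom != three_prime.chrom:
        return False
    if five_prime.strand != three_prime.strand:
        return False
    if five_prime.strand == "+":
        # Transcription runs towards increasing coordinates.
        gap = three_prime.start - five_prime.end
    else:
        # Transcription runs towards decreasing coordinates.
        gap = five_prime.start - three_prime.end
    return 0 < gap <= max_gap
```

A simulated fusion between neighboring genes would be flagged by such a filter, which is exactly the situation where a tool could be scored with a false negative.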
Many thanks in advance for your clarification,
Sebastian
Created by uhrigs
Hello,
there are multiple issues one needs to be aware of when simulating fusion genes. For example, randomly picking two genes and merging them to generate a fake fusion gene has a lot of gotchas, such as:
- make sure that the two randomly picked genes do not overlap on the same strand in any of the widely available gene annotation databases (for example, the genes IFITM1 and IFITM2 overlap in Ensembl database v87 but not in GENCODE and other gene annotation databases; this means that if one generates a fake fusion from IFITM1 and IFITM2, a fusion finder which uses the Ensembl gene annotation will not pick it up because these genes overlap there, whilst other fusion finders will find it because the genes do not overlap in their gene annotation database);
- make sure that the genes which form the fake fusion do not have a high degree of sequence similarity (this actually happens quite a lot, especially between genes and their pseudogenes; for example, the gene DUX4 has 42 pseudogenes);
- fusion junctions in real samples occur mid-intron, mid-exon, and quite rarely at exon-intron junctions (see scientific articles published about real fusions); simulating fake fusions using only exon-intron junctions is therefore not very close to reality;
- real fusions, for example IGH fusions, can even have random inserts of 13-20 bp (and sometimes even more) at the fusion junction (see scientific articles about IGH fusions/translocations); to be close to reality, one should therefore also generate some IGH fusions with random inserts at the fusion junction;
- all gene annotations contain mistakes, and therefore the assumption that the GTF file represents a healthy person is often wrong (for example, the genes involved in the very well-known fusion genes FIP1L1-PDGFRA and GOPC-ROS1 are annotated wrongly [they are shown to overlap each other] in many gene annotation databases, e.g. GENCODE, with the effect that most fusion finders will miss them);
- some classes of real fusion genes are challenging to find because of their pseudogenes (see fusions which involve the DUX4 gene in these articles: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099439 and
http://onlinelibrary.wiley.com/doi/10.1002/gcc.22454/full ); therefore one should also simulate fake fusions of this type (e.g. CIC-DUX4, IGH-DUX4, NPM1-ALK, etc.).
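The first gotcha in the list above (overlap on the same strand in any annotation) could be checked roughly as follows. This is only a sketch under assumptions: the `Gene` tuple, the function names, and the idea of passing annotations as nested dictionaries are hypothetical, and the gene names and coordinates in the usage example are made up, not real IFITM1/IFITM2 coordinates.

```python
from typing import Dict, NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_same_strand(a: Gene, b: Gene) -> bool:
    """True if both genes share chromosome and strand and their
    coordinate intervals intersect."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def pair_is_safe(name_a: str, name_b: str,
                 annotations: Dict[str, Dict[str, Gene]]) -> bool:
    """Reject a candidate pair if the genes overlap on the same strand
    in at least one of the supplied annotations. Genes missing from an
    annotation are skipped for that annotation."""
    for genes in annotations.values():
        a, b = genes.get(name_a), genes.get(name_b)
        if a is not None and b is not None and overlap_same_strand(a, b):
            return False
    return True


# Toy data: the pair overlaps in one annotation but not in the other.
annotations = {
    "annotation_A": {"GENE_X": Gene("chr11", 100, 500, "+"),
                     "GENE_Y": Gene("chr11", 400, 900, "+")},
    "annotation_B": {"GENE_X": Gene("chr11", 100, 350, "+"),
                     "GENE_Y": Gene("chr11", 400, 900, "+")},
}
print(pair_is_safe("GENE_X", "GENE_Y", annotations))
```

Checking against every annotation, rather than only the one used by the simulator, is the point: a pair that is clean in one database may still be filtered by tools built on another.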
Another issue: when one simulates reads, one should control (or at least make publicly available) how many read pairs support the fusion and how many reads map onto the fusion junction. If this is missing, RSEM might not generate (and it does not guarantee in any way) reads which map to the fusion junction for some low-coverage fusions, and some fusion finders will be penalized just because they have a different threshold than the simulator. See here:
https://github.com/Sage-Bionetworks/rnaseqSim/issues/8 Basically, simulating reads from a fusion does not mean that there are indeed reads which support that fusion, and therefore some kind of extra checking is needed.
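The extra checking could look something like the sketch below, which counts supporting reads for one simulated fusion. It assumes that the simulated read pairs are available as coordinates on the fusion transcript and that the junction is given as the last base of the 5' partner; the function name, the input layout, and the `min_anchor` default are hypothetical and not part of RSEM or rnaseqSim.

```python
from typing import Iterable, Tuple

Read = Tuple[int, int]  # (start, end), 1-based inclusive, on the fusion transcript


def junction_support(pairs: Iterable[Tuple[Read, Read]], junction: int,
                     min_anchor: int = 10) -> Tuple[int, int]:
    """Count evidence for a fusion whose 5' partner ends at `junction`.

    Returns (split_reads, spanning_pairs):
      split_reads    - mates covering at least min_anchor bases on both
                       sides of the junction
      spanning_pairs - pairs with one mate entirely on the 5' side and
                       the other entirely on the 3' side
    """
    split_reads = 0
    spanning_pairs = 0
    for pair in pairs:
        for start, end in pair:
            if start <= junction - min_anchor + 1 and end >= junction + min_anchor:
                split_reads += 1
        has_left = any(end <= junction for _, end in pair)
        has_right = any(start > junction for start, _ in pair)
        if has_left and has_right:
            spanning_pairs += 1
    return split_reads, spanning_pairs
```

Publishing these two counts per simulated fusion would let each tool's authors see whether a missed fusion was below their own evidence threshold or had no junction support at all.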
Regarding this: "All of these can be observed in healthy tissue samples and are therefore unlikely to be of importance for cancer development. For this reason, some fusion detection tools remove such events from their results, even though a PCR-validation would probably turn out positive." My comment is that such cases are plentiful, and they even get validated by PCR because the current gene annotation has many errors. This is why a new release of the gene annotation appears every month, in order to remove this kind of error. So I would say that those are just errors in gene annotations.
The most challenging issue with fusion finding is specificity. It is very, very easy to predict the existence of fusion genes on very shaky or non-existent evidence because there is almost no penalty for it. See: https://github.com/pmelsted/pizzly/issues/2
Also, a negative data set should be used. One could create a simulated data set of reads from the transcriptome (e.g. using just some GTF file and hg38 to simulate reads from its transcriptome) which contains no fusion genes at all and use it to evaluate the performance of all fusion finders. If a fusion finder finds a fusion gene in such a dataset, then it is for sure a FP.
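As a rough illustration of how such a negative data set would enter the scoring, here is a minimal sketch that compares predicted gene pairs with a truth set. The function name and the representation of a fusion as a (5' gene, 3' gene) tuple are assumptions; the challenge's actual scoring code may differ.

```python
from typing import Dict, Iterable, Optional, Tuple

Fusion = Tuple[str, str]  # (5' gene, 3' gene)


def score_predictions(predicted: Iterable[Fusion],
                      truth: Iterable[Fusion]) -> Dict[str, Optional[float]]:
    """Count TP/FP/FN and derive precision and recall.

    With an empty truth set (negative data set) every prediction is a
    false positive, and recall is undefined (returned as None).
    """
    predicted_set, truth_set = set(predicted), set(truth)
    tp = len(predicted_set & truth_set)
    fp = len(predicted_set - truth_set)
    fn = len(truth_set - predicted_set)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Negative data set: the truth set is empty, so both calls count as FPs.
result = score_predictions([("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")], [])
print(result["fp"], result["precision"])
```

On a negative data set the raw FP count is the informative number, since precision collapses to zero as soon as a tool reports anything.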
Dear Sebastian,
Thanks for these suggestions. Indeed, we have not modeled the full spectrum of potential fusion events in the current data, but these would be good to add for future benchmarking purposes. I opened issues for many of your suggestions in the [repo for the simulation code](https://github.com/Sage-Bionetworks/rnaseqSim/issues). Please create additional issues and feel free to submit a pull request for any of them!
We'll answer your question about scoring in real data in a separate post.
Thanks for your contributions.