Dear organizers,
I noticed that the fusionToolEvaluator does not always recognize true positives as one would expect. If I give it the same file for both **--input** and **--truth**, then to my understanding precision and sensitivity should both be 100%, but this is not always the case. I ran fusionToolEvaluator version 0.1.3 on the truth set of simulation 43, and here is what I get:
Command:
```
python SMC-RNA-Challenge-master/script/evaluation.py evaluateFusionDet --gtf Homo_sapiens.GRCh37.75.refFlat --input sim43_filtered.bedpe --truth sim43_filtered.bedpe
```
Output:
```
#Evaluation_Name Num_Res_Trans Num_Truth_Trans Sensitivity_Trans Precision_Trans F1_Trans Num_Res_Gene Num_Truth_Gene Sensitivity_Gene Precision_Gene F1_Gene
No Subsetting 17 17 1 1 1 17 17 0.882353 0.882353 0.882353
```
As you can see, **Sensitivity_Gene** and **Precision_Gene** are not 100%.
The following lines are considered mismatches:
```
16 89254692 89254693 2 241069444 241069445 ENST00000289746-ENST00000307266 0 + - .
1 46049583 46049584 1 160319894 160319895 ENST00000437901-ENST00000368063 0 + + .
```
Do you have an explanation for this? Do I understand correctly that the **..._Gene** columns match by the genes overlapping the breakpoints, whereas the **..._Trans** columns match by the coordinates of the breakpoints?
Best regards,
Sebastian
Thanks for bringing this to our attention @uhrigs. I have opened an [issue](https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/issues/41) to add validation of the truth bedpe as well. We also recently added a [script](https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/commit/28b1cd42a69c50f6a85fbbf0d079c94aaa385555) to fix bedpe files that fail validation because of invalid chromosome names, strand designations, or positions (e.g. start1 + 1 > end1). This script can be used on the input and/or truth files. Please feel free to open issues on the [SMC-RNA-Challenge](https://github.com/Sage-Bionetworks/SMC-RNA-Challenge) repo if you find anything else!

Dear @creason,
There seems to be another issue with the fusionToolEvaluator and/or the truth files:
The truth files supplied with the training datasets do not comply with the expected format specified in the wiki. Namely, the last two columns encode the forward and reverse strands as **1** and **-1**, instead of as **+** and **-**. This causes fusionToolEvaluator to fail to recognize matches between the **--input** and **--truth** files.
For example, if you run fusionToolEvaluator with one of the truth files from the latest training datasets as both **--input** and **--truth**, it fails with the following error:
```
python SMC-RNA-Challenge-master/script/evaluation.py evaluateFusionDet --gtf Homo_sapiens.GRCh37.75.refFlat --input sim51_filtered.bedpe --truth sim51_filtered.bedpe
Strand should only contain +/-/.
```
This is fine and exactly what fusionToolEvaluator should do: it complains because the **--input** file uses the wrong encoding for the strands.
If you correct the encoding in the **--input** file (by replacing **-1** with **-** and **1** with **+**), but not in the **--truth** file, then fusionToolEvaluator finishes without errors, but the columns **Sensitivity_Trans** and **Precision_Trans** do not equal 1, even though they should:
```
#Evaluation_Name Num_Res_Trans Num_Truth_Trans Sensitivity_Trans Precision_Trans F1_Trans Num_Res_Gene Num_Truth_Gene Sensitivity_Gene Precision_Gene F1_Gene
No Subsetting 9 9 0.111111 0.111111 0.111111 9 9 1 1 1
```
One has to correct the encoding in the **--truth** file, too, for these two columns to equal 1. Since the truth files for the training data all use the incorrect encoding, I suspect that the incorrect encoding is also used for the evaluation of submissions, which may produce bogus results.
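For reference, the conversion I used is equivalent to this minimal Python sketch (my own stand-in, not the official fix script; it assumes a standard BEDPE layout with strand1 and strand2 in columns 9 and 10):
```
#!/usr/bin/env python
# fix_strands.py -- rewrite the strand columns of a BEDPE file from 1/-1
# to +/-. A minimal stand-in for the conversion described above, not the
# official fix script; assumes strand1/strand2 are columns 9 and 10 (1-based).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    for i in (8, 9):  # 0-based indices of strand1 and strand2
        if fields[i] == "1":
            fields[i] = "+"
        elif fields[i] == "-1":
            fields[i] = "-"
    print("\t".join(fields))
```
Running `python fix_strands.py < sim51_filtered.bedpe > sim51_fixed.bedpe` produces a file that passes the strand check.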
In order to ensure that submissions are scored correctly, I suggest that
- fusionToolEvaluator applies the same sanity checks to the **--truth** file as it does to the **--input** file, i.e. rejects the values **1** and **-1** in the strand columns (see the sketch after this list).
- the truth files are edited to use the proper encoding for the strands.
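For illustration, here is a minimal sketch of the first suggestion (my own stand-in, not the actual validator; again assuming strand1/strand2 in BEDPE columns 9 and 10):
```
#!/usr/bin/env python
# check_strands.py -- apply the same strand sanity check to a truth file
# that fusionToolEvaluator already applies to the input file.
import sys

for n, line in enumerate(open(sys.argv[1]), 1):
    fields = line.rstrip("\n").split("\t")
    for strand in fields[8:10]:
        if strand not in ("+", "-", "."):
            sys.exit("%s, line %d: Strand should only contain +/-/." % (sys.argv[1], n))
```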
Best regards,
Sebastian

Dear @ndaniel,
Your question above seems to be getting at comparison of findings with those in healthy samples? If so, perhaps you could open a new thread to discuss this. Additionally, we are happy to consider contributions to the [simulation code base](https://github.com/Sage-Bionetworks/rnaseqSim) that add features of importance to the community.
Hi @ndaniel,
fusionToolEvaluator considers anything to be true that is specified in the truth file. It is merely a tool to check whether two files are identical. So the question should not be "Does fusionToolEvaluator consider read-through fusions as true events?" but rather "How should we design our test cases?". I raised this exact question in another thread: https://www.synapse.org/#!Synapse:syn2813589/discussion/threadId=1781
Regards,
Sebastian

Does fusionAnnotator follow the "fusion gene" definition from the article below (where, for example, fusion genes found in healthy samples are not considered fusion genes but rather errors in gene annotations, and two neighbouring genes are likewise not considered fusion genes)?
Also, does fusionAnnotator check whether the two genes are already known to overlap on the same strand (and therefore do not form a fusion gene)?
http://www.mdpi.com/1422-0067/18/4/714
**It Is Imperative to Establish a Pellucid Definition of Chimeric RNA and to Clear Up a Lot of Confusion in the Relevant Research**
Abstract
There have been tens of thousands of RNAs deposited in different databases that contain sequences of two genes and are coined chimeric RNAs, or chimeras. However, "chimeric RNA" has never been lucidly defined, partly because "gene" itself is still ill-defined and because the means of production for many RNAs is unclear. Since the number of putative chimeras is soaring, it is imperative to establish a pellucid definition for it, in order to differentiate chimeras from regular RNAs. Otherwise, not only will chimeric RNA studies be misled but also characterization of fusion genes and unannotated genes will be hindered. We propose that only those RNAs that are formed by joining two RNA transcripts together without a fusion gene as a genomic basis should be regarded as authentic chimeras, whereas those RNAs transcribed as, and cis-spliced from, single transcripts should not be deemed as chimeras. Many RNAs containing sequences of two neighboring genes may be transcribed via a readthrough mechanism, and thus are actually RNAs of unannotated genes or RNA variants of known genes, but not chimeras. In today's chimeric RNA research, there are still several key flaws, technical constraints and understudied tasks, which are also described in this perspective essay.

I second this.
Also, can you please update the dreamchallenge/smcrna-functions docker image with this new feature?
I'm asking because the install directions at https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/tree/master/FusionDetection/Evaluator are incorrect (FusionEvaluator_0_1_0 doesn't exist), so it's easier to use the pre-built image.
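For instance, something along these lines should work once the image is updated (the mount point and the in-container path to evaluation.py are guesses on my part; adjust them to the image's actual layout):
```
docker pull dreamchallenge/smcrna-functions
docker run --rm -v $PWD:/data dreamchallenge/smcrna-functions \
    python /opt/SMC-RNA-Challenge/script/evaluation.py evaluateFusionDet \
        --gtf /data/Homo_sapiens.GRCh37.75.refFlat \
        --input /data/sim51_fixed.bedpe --truth /data/sim51_fixed.bedpe
```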
I wonder whether it might help if we could check our own results on past rounds. For example, in round 2, the leaderboard posts the results on sim37 through sim40, but we don't have those datasets, just the training datasets sim31 through sim36 for which we don't have official results. If we had the final evaluation datasets, we could see if there were any issues with the evaluation, and also understand how our methods succeeded or failed. Since round 2 is over, it wouldn't matter any more if we had access to those evaluation datasets. Or perhaps you could consider making them available only to teams that have submitted a method for round 2.
Hi @uhrigs, thank you for notifying us of this issue. The data simulation code generates fusions based on all ENSEMBL transcripts, including transcripts truncated in the coding sequence. However, the fusionToolEvaluator filters out these truncated transcripts when reporting the Gene statistics, which leads to the discrepancy you noted. We have fixed this issue by providing a new parameter (-a/--all-transcript) in the tool that allows the use of all transcripts when calculating the gene statistics (https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/pull/39). The leaderboards for Round 1 and Round 2 were also generated using the Gene statistics, so we will be updating them soon.
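For example, re-running the self-comparison from the top of this thread with the new flag (assuming evaluateFusionDet accepts it alongside --gtf/--input/--truth) should now report 1 for the Gene columns as well:
```
python SMC-RNA-Challenge-master/script/evaluation.py evaluateFusionDet --gtf Homo_sapiens.GRCh37.75.refFlat --input sim43_filtered.bedpe --truth sim43_filtered.bedpe --all-transcript
```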