The following shows a couple of examples where the truth fusion breakpoint appears to be incorrectly assigned to the adjacent upstream exon boundary (exon no. +1). I also see 3 cases where the STAR aligner could not find any reads in the vicinity of the breakpoints. Whether this is an aligner-specific problem or a simulation problem is yet to be determined. Only one fusion was found to be correct.
All examples are from Sim36 (6 true fusions).
#### CASE 1 -- Incorrect
TRUTH: 6 30876949 30876950 1 22834493 22834494 (chr6 at exon 2 boundary)
FOUND: 6 30876180 30876181 1 22834493 22834494 (chr6 at exon 1 boundary)
#### CASE 2 -- Incorrect
TRUTH: X 16774843 16774844 8 87470149 87470150 (chrX at exon 7 boundary)
FOUND: X 16773217 16773218 8 87470149 87470150 (chrX at exon 6 boundary)
#### CASE 3 -- Correct
TRUTH: 9 135145001 135145002 17 78348257 78348258
FOUND: 9 135145001 135145002 17 78348257 78348258
#### No alignments supporting breakpoints
19 46238704 46238705 X 153420048 153420049
22 24621212 24621213 9 108127763 108127764
2 99804719 99804720 22 39267550 39267551
Can anyone else corroborate these results?
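For anyone who wants to corroborate, here is a minimal sketch of one way to check for alignments near these breakpoints, assuming a coordinate-sorted and indexed STAR BAM (the file name below is a placeholder, and the contig names must match your BAM's naming):

```python
# Minimal sketch: count alignments within a window of each reported breakpoint.
# The BAM file name is a placeholder; adjust contig names ("19" vs "chr19") to
# match the BAM header.
import pysam

breakpoints = [
    ("19", 46238704), ("X", 153420048),
    ("22", 24621212), ("9", 108127763),
    ("2", 99804719), ("22", 39267550),
]

WINDOW = 500  # bp on either side of the breakpoint

with pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb") as bam:
    for chrom, pos in breakpoints:
        n = bam.count(chrom, max(0, pos - WINDOW), pos + WINDOW)
        print(f"{chrom}:{pos}\t{n} alignments within +/-{WINDOW} bp")
```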
Point taken @genomehacker, and I'll clarify my statement. As long as the data simulator is producing some form of detectable evidence (crossing reads or otherwise), they definitely should remain in the challenge. In fact, if the simulator is using different types of read features (direct, indirect, no matter the difficulty), we can use these to benefit the field. Likewise, if it failed to produce the read support that was intended (for whatever reason), then these should be removed so that we can identify deficiencies and spur the necessary development. It's all about accounting and knowing that "true" positives can, through some methodology, be discovered. Just don't give us false despair!
Unfortunately, we won't know which fusions can or cannot be detected until we see everyone's results posted. Although I asked initially about the possibility of fusions with zero crossing reads, I can understand that there might be fusion detection methods that could infer them without depending on crossing reads. Or methods might differ in the amount or type of crossing-read evidence they need. So I would argue against removing fusions based on preconceived notions of their detectability. Maybe we will all be surprised by some method that is able to find them, or perhaps they will motivate someone to develop such a method.
If no one finds such fusions, then perhaps we can declare them "undetectable" and revise the statistics accordingly. But I would err on the side of making the contest more challenging, rather than removing potential challenges prematurely.
There are some interesting talking points here, and I think the focus really comes down to what the challenge is trying to test. Just as with any given assay, there are two different considerations: 1) how well the wet-lab assay performs in terms of coverage, quality, repeatability, etc., and 2) how well the software pipeline performs at calling your favorite variant class on that data. These are separate questions and are meant to be handled disjointly to elucidate shortcomings and where improvements can be made.
In a typical scenario, regions without coverage are scored negatively for the wet-lab method but are then censored when assessing the analytical approach. The two are still considered together as the gross output of an assay, however. So what are we trying to test in this challenge? The software, or the software in aggregate with the simulation? If we continue with the latter, there is no way of uncoupling false negatives that have coverage (calling-algorithm defects) from regions that simply lack coverage (simulation defects). While all methods will be penalized evenly, I still see it as necessary to understand why the fusions are not being called, mainly because it shows where the field as a whole stands and where efforts need to be made. I would vote strongly in favor of removing fusions that have zero coverage from the truth bedpe files for the reasons stated.
Moreover, I would ask what the purpose of training data is, if not to iteratively test and improve your approach. As it stands, there is no way of understanding where we fail or how to improve.
That's reasonable and explains to me what the bedpe file means. Essentially, fusions with "zero" effective coverage might not be detectable by any method (or might require some inference, as you mention).
I think the coverage issue does lead to the larger philosophical question of what we are trying to measure: the best caller, or how well callers deal with the data? Everyone will still be affected by the same lack of direct evidence. And maybe it is possible to infer the junction without direct evidence, but if nobody does get it, I don't believe we should grade on a curve.
I feel like if an event did 'happen' but there isn't any direct evidence to support it, it still did happen and the callers should be judged by that fact. In nature there won't be any guarantees that we'll get supporting reads for fusions, so we should try to understand what our false-negative call rate is. Understanding how 'badly' we do at these kinds of tasks is one of the more informative lessons that come out of these kinds of challenges.
Thanks. That makes a lot of sense, and takes care of the problem.
I do have another question about the bedpe files: Are all the fusions listed guaranteed to have at least one read in the FASTQ data that crosses the fusion junction? Or is it possible that a fusion has no supporting reads? This might happen because the read generation process is presumably random, and at low levels of coverage a fusion might not get "covered" by a read.
If a fusion has no reads "covering" it, then one could argue that that fusion should be removed from the bedpe file. Of course, there might be some dispute about what "covered" means, and how much overhang is needed to adequately define that. One might also argue that a paired-end read provides sufficient evidence for a fusion, if its two ends lie on both sides of the fusion junction, even though neither end directly crosses the fusion junction.
In other words, is there a post-processing step for the bedpe files that goes back to check if there was at least one supporting read generated that crosses each junction (according to some criterion)?
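To sketch what such a check could look like (a hypothetical illustration of the criterion, not anything from the actual simulator; the read length, overhang threshold, and data structure are all assumptions), the simulator could record the junction position within each fused transcript and the start of each read generated from it, then flag fusions with no read that spans the junction by some minimum overhang:

```python
# Hypothetical illustration of a "crossing read" check at simulation time.
# Assumes single-end reads of fixed length and a known junction position
# within the fused transcript; none of these names come from the simulator.
MIN_OVERHANG = 10   # bp required on each side of the junction
READ_LEN = 100

def junction_is_covered(junction_pos, read_starts,
                        read_len=READ_LEN, min_overhang=MIN_OVERHANG):
    """True if any read spans junction_pos with min_overhang bases on each side."""
    return any(
        start <= junction_pos - min_overhang
        and start + read_len >= junction_pos + min_overhang
        for start in read_starts
    )

# Example: a junction at transcript position 350 and three simulated read starts.
print(junction_is_covered(350, [100, 310, 600]))  # True: the read at 310 spans 310-410
print(junction_is_covered(350, [100, 345, 600]))  # False: 345 leaves only 5 bp of overhang on the 5' side
```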
Thanks for this report.
The discrepancy is due to an off-by-one issue in a list slice in the simulation. This affects the coordinates written out to the bedpe files, but not the rest of the data. We will update the bedpe files for the sim3* datasets very shortly. The FASTQ data is unchanged.
We will also re-score round 1 for any affected datasets. We've been looking into this issue, and don't yet have a response.
Part of our strategy for dealing with these issues is to release the source code of the simulation pipeline, so that participants can inspect it and suggest improvements. We'll post an update as soon as that code is publicly available.
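In the meantime, here is a purely illustrative toy of the kind of off-by-one slice error involved (this is not the simulator's actual code; the chr6 coordinates are taken from CASE 1 above, and the third exon boundary is made up):

```python
# Toy illustration only -- not the simulator's code. An off-by-one in a list
# slice over exon boundaries makes the bedpe report the next exon's boundary
# even though the reads were simulated from the intended junction.
exon_ends = [30876180, 30876949, 30877500]   # ends of exons 1, 2, 3 (exon 3 is made up)

n_kept = 1                                   # the fusion keeps exon 1 of the 5' gene
# The fused transcript (and hence the reads) uses the correct slice:
junction_in_reads = exon_ends[:n_kept][-1]        # 30876180 -> exon 1 boundary (what callers find)
# ...but the bedpe coordinate comes from a slice that is off by one:
junction_in_bedpe = exon_ends[:n_kept + 1][-1]    # 30876949 -> exon 2 boundary (what the truth file says)

print(junction_in_reads, junction_in_bedpe)
```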
I concur with what Jeff is reporting. Essentially, the pattern of bugs in the bedpe files depends on the genomic strand of the fusion ends, as follows:
For the 5' fusion gene, if the genomic strand is plus, then the reported coordinate is wrong. If the genomic strand is minus, then the reported coordinate is correct.
For the 3' fusion gene, if the genomic strand is minus, then the reported coordinate is wrong. If the genomic strand is plus, then the reported coordinate is correct.
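For concreteness, a tiny helper that simply encodes this observed pattern (not a fix, and the function name is my own):

```python
# Encodes the observed pattern: given the genomic strands of the 5' and 3'
# fusion partners, report which bedpe coordinates are expected to be wrong.
def suspect_coordinates(strand_5p: str, strand_3p: str) -> dict:
    return {
        "5p_coordinate_wrong": strand_5p == "+",   # 5' gene on plus strand -> wrong
        "3p_coordinate_wrong": strand_3p == "-",   # 3' gene on minus strand -> wrong
    }

# Example: a plus-strand 5' gene fused to a plus-strand 3' gene -- only the
# 5' coordinate would be off under this pattern.
print(suspect_coordinates("+", "+"))
# {'5p_coordinate_wrong': True, '3p_coordinate_wrong': False}
```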
### Dear Organizers,
I have just finished a thorough review of the other datasets and am finding the same problem described above throughout the data. Namely, the truth bedpe files incorrectly state that many of the fusion breakpoints are at an adjacent exon boundary. These are not isolated cases; they affect the majority of the data. This was verified by running the data through multiple callers, all of which are at odds with the truth.
Here is just one example from Sim31:
TRUTH: 2 74719873 74719874 3 108639431 108639432 (chr2-exon12, chr3-exon2)
FOUND: 2 74719571 74719572 3 108635060 108635061 (chr2-exon11, chr3-exon3)
Again, this data is pivotal to understanding the submission/scoring process, and these errors undermine the integrity of the challenge at large. I am happy to show MANY more cases where the truth bedpe is in conflict with the simulated data should that be necessary.
**Please let us know that you acknowledge this as an issue and how the challenge can proceed in light of these observations.**