Dear challenge organizers,
as you may know, using bigwig files instead of the aligned BAM files not only makes the processing code somewhat more time-consuming, but also discards a great deal of information about how the reads were mapped and limits the algorithms participants might apply to mitigate NGS-based biases.
Will the challenge be based solely on BigWig files, or is there an opportunity to make the aligned BAM files available as well?
Thank you very much!
Eduardo
Created by Eduardo Gusmao (eduardogadegusmao)

We are not planning to provide access to the FASTQs. You will need to use the bigwigs as provided.

Rather than just use the reference genome, I would like to use the various experiments to make specific reference genomes for each data source to improve training. Is it possible to at least get access to the fastqs that made the bigwigs?

Hi Anshul,
yes. To be honest, I think this makes perfect sense (especially given a fixed-length challenge and the fact that some participants might be more savvy with this, which would distract the challenge from imputation and turn it into an NGS-bias challenge...).
I only meant that applying some algorithms directly to the reads "could" be interesting. But as you said, post-challenge, this could be an interesting aspect to explore...
Thanks!
Ed

Excellent question. We have also discussed these kinds of issues quite a bit and arrived at the current challenge design as a reasonable compromise, i.e. we apply a specific processing to the datasets and provide specific types of processed signal files as bigwigs. There are several reasons for these choices.
1. We need some fixed ground truth to be able to score the performance of imputation methods. The data can be processed in different ways, and the imputation methods would then operate on these different types of processing. We wanted to avoid mixing these two aspects: we want to evaluate imputation methods while keeping data processing fixed. This choice also eliminates the need for participants to be intimately familiar enough with these data modalities to know exactly how to process them. Certainly, post-challenge we would like to collaborate with the top-performing teams and use their methods to also evaluate the other side of things, i.e. how different types of processing affect the results of imputation. In order to keep the challenge tractable, we decided to settle on a pre-defined processing of the datasets and signal tracks.
2. From the standpoint of ENCODE, we would like imputed tracks to fit well with the pipelines and signal track generation approaches we currently use. That way, we can easily integrate the imputed tracks into the compendium. Hence, we kept the processing as close to the current ENCODE pipelines as possible. There are always continuous improvements to data processing, and we consider pipeline revisions every few years.
3. Now to the question of BAMs vs. processed bigwigs. Providing BAMs as the primary input has a few issues. The BAMs can, and often need to, be processed differently for the different assay types to get coverage tracks that capture the relevant events (e.g. for histone ChIP-seq, because reads are mirrored around the modified sites, reads need to be extended in the 5' to 3' direction; ATAC-seq and DNase-seq, on the other hand, need to be treated differently; see the sketch below). So there are just too many subjective choices that users would need to make to generate coverage tracks from the BAMs. And this once again gets to the issue of evaluating imputation methods decoupled from processing differences.
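To make that assay dependence concrete, here is a minimal sketch of two different BAM-to-coverage conventions. This is not the ENCODE pipeline; the file handling, fragment length, and Tn5 offsets are illustrative assumptions only, and it presumes pysam with an indexed BAM.

```python
# Minimal sketch (NOT the challenge pipeline): the same BAM yields different
# coverage tracks depending on assay-specific choices. Fragment length and
# Tn5 offsets below are illustrative assumptions.
import numpy as np
import pysam

def chip_coverage(bam_path, chrom, chrom_len, frag_len=200):
    """Histone ChIP-seq style: extend each read from its 5' end in the
    5'->3' direction to an assumed fragment length, then pile up."""
    cov = np.zeros(chrom_len, dtype=np.int32)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if read.is_unmapped:
                continue
            if read.is_reverse:
                start = max(0, read.reference_end - frag_len)
                end = read.reference_end
            else:
                start = read.reference_start
                end = min(chrom_len, read.reference_start + frag_len)
            cov[start:end] += 1
    return cov

def cut_site_coverage(bam_path, chrom, chrom_len, tn5_shift=False):
    """DNase-/ATAC-seq style: count only 5' cut sites (optionally with the
    common +4/-5 Tn5 offset) instead of extending reads."""
    cov = np.zeros(chrom_len, dtype=np.int32)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if read.is_unmapped:
                continue
            if read.is_reverse:
                pos = read.reference_end - 1 - (5 if tn5_shift else 0)
            else:
                pos = read.reference_start + (4 if tn5_shift else 0)
            if 0 <= pos < chrom_len:
                cov[pos] += 1
    return cov
```

Every one of these choices (fragment length, shifts, filtering) changes the resulting track, which is exactly the subjectivity we wanted to take off participants' plates.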
Presenting this problem as a limited-duration challenge introduces many constraints, and we made decisions that we think are reasonable. The datasets have all been uniformly processed using approaches and pipelines that are reasonably standard in the community. We haven't performed any specific bias correction or explicit batch correction (which is actually very difficult to do for this diversity of datasets, labs, and timelines of experiments) for any of the tracks, so we fully understand that the imputations will have the same biases as those present in the provided training tracks. In some sense, we care about how well an imputation method can replicate the output of an experiment (with all its biases/confounders). The main goal of the challenge is to assess imputation methods independent of bias and data processing considerations. So I think we are going to stick to the uniformly processed bigwigs for the purposes of this challenge.
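On the practical side, consuming the provided bigwigs directly is straightforward. A minimal sketch, assuming pyBigWig is available; the file name and 25-bp bin size are illustrative assumptions, not a statement of the official track format:

```python
# Minimal sketch: read a provided signal track and average it into
# fixed-width bins as input for an imputation model. File name and bin
# size are hypothetical.
import numpy as np
import pyBigWig

def binned_signal(bigwig_path, chrom, bin_size=25):
    bw = pyBigWig.open(bigwig_path)
    chrom_len = bw.chroms()[chrom]
    n_bins = chrom_len // bin_size
    vals = bw.stats(chrom, 0, n_bins * bin_size, type="mean", nBins=n_bins)
    bw.close()
    # stats() returns None for bins with no data; treat those as zero signal
    return np.array([v if v is not None else 0.0 for v in vals],
                    dtype=np.float32)

# Example usage (hypothetical file name):
# signal = binned_signal("example_track.bigwig", "chr21")
```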
Happy to discuss more.