I've noticed a particularly poor overlap between the training set peaks and DNase domains in several cases. I'd typically expect real TF binding sites to be near DNase I hypersensitive regions, but in some cases only 30% or fewer of the training set narrow peaks are near DNase-enriched regions. This issue is worse on average for some cell types (e.g., SK-N-SH, IMR-90) and for some factors (e.g., MAFK).
To take one striking example, there is a large disparity between the two sets of GATA3 ChIP-seq peaks (SK-N-SH and A549) in terms of how frequently they overlap with DNase domains. Roughly speaking (of course with a particular set of thresholds):
~97% of A549 GATA3 ChIP-seq peaks overlap with A549 DNase domains
~34% of SK-N-SH GATA3 ChIP-seq peaks overlap with SK-N-SH DNase domains
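For concreteness, overlap percentages like the two above can be computed with a minimal sketch along these lines (pure Python; the function name, the `slop` padding, and the toy coordinates are my own illustrations, and a real analysis would read BED files and more likely use bedtools intersect):

```python
def fraction_overlapping(chip_peaks, dnase_regions, slop=0):
    """Fraction of ChIP-seq peaks that overlap any DNase region (+/- slop bp).

    Peaks and regions are (chrom, start, end) tuples with half-open
    coordinates, as in BED files.
    """
    by_chrom = {}
    for chrom, start, end in dnase_regions:
        # Pad each DNase region by `slop` bp on both sides, so "near"
        # rather than strictly overlapping also counts.
        by_chrom.setdefault(chrom, []).append((start - slop, end + slop))
    hits = 0
    for chrom, start, end in chip_peaks:
        # Two intervals overlap iff each starts before the other ends.
        if any(s < end and e > start for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(chip_peaks) if chip_peaks else 0.0

# Toy example with made-up coordinates:
chip = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
dnase = [("chr1", 150, 400)]
print(round(fraction_overlapping(chip, dnase), 2))  # 1 of 3 peaks -> 0.33
```

The exact percentages of course depend on the peak thresholds and on how much padding counts as "near".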
Does anyone have an explanation for this type of poor overlap, other than some ChIP-seq peak sets containing many false positives?
Was there any vetting done to see if this issue (if it is an issue) affects any of the final round training/test datasets?
Created by Shaun Mahony
Checked the overlap percentages on the final round test datasets and they all look good, except ATF2 in HepG2. We will drop that sample from scoring even if folks submit.
Also performed a full overlap analysis of TF conservative peaks against matching DNase relaxed peaks for all the DREAM datasets. In general, for each TF the overlap percentages are largely consistent across cell types, with a few exceptions. They are not always very high, but they are consistent; it depends on the characteristics of the factor. For example, CTCF and CEBPB have universally moderate overlap percentages.
After the Sept 30th deadline, we'll do some clean up for the next round.
Specifically, we'll remove all the SK-N-SH datasets (which have a mismatch in conditions between ChIP and DNase, although it's the same cell line).
And we will remove ATF2 altogether, since all the new hidden datasets for that factor show low overlaps compared to the previously released data. This is also why we are ignoring it in the final round scoring.
Shaun - Thanks for pointing this out, and apologies to everyone for these issues. Some of these newer hidden datasets were very recently generated, and while their QC stats looked great, it seems there may be some issues with cell type identity. Sample identity problems are pretty hard to catch, but we should have done this additional DNase overlap sanity check. But it's good we are getting a chance to fix it for the next round.
-Anshul.
Thanks a lot, Anshul -
I had a feeling that several of the factors would be explained biologically. The jumps in percentages between cell types were what was really worrying me.
Best,
Shaun
CEBPB also universally has a moderate overlap with DNase. This, I would say, is also a property of the TF, so all of those datasets are fine.
REST, as a repressor, also typically has moderate-to-low overlap with DNase, so these are fine too.
SPI1 has pioneer activity, so again it is expected to have moderate overlap. These are fine too.
SK-N-SH definitely seems to have some issue. I think it may be untreated vs. treated cells for ChIP-seq vs. DNase. Will look into it some more. I think we will just throw these out for the next round, after the Sept. 30 deadline.
So that leaves us with these other 5:
H1-hESC.SRF 69.10%
HepG2.FOXA2 68.60%
HepG2.ATF7 27.70%
HepG2.JUND 47.10%
MCF-7.ATF2 26.30%
I'll look into these some more.
-Anshul.
Thanks Shaun. We are checking this systematically with our peak sets as well.
The MAFK datasets make sense; that one is a biological phenomenon. We are looking into the rest.
Since this phase of the challenge closes in a week, we aren't going to change anything at this point, but we will check the overlap stats for the final round test data.
When the next phase begins (running from Sept 30 to Jan 7), we'll remove the few datasets that are genuine outliers (e.g., all the SK-N-SH data) so everyone is using a cleaner training set.
Thanks for bringing it to our attention.
By the way, I do realize that it's particularly bad form to bring up an issue like this in the last week of the challenge... sorry! Just catching up now on EST morning time!
Thanks for looking into this so deeply, Anshul, and good to hear that there are no sample swaps.
Just a note on my methods: the numbers above come from overlapping the provided conservative ChIP-seq narrowPeak regions with DNase-seq domains called by a custom domain-finder. However, I also sanity-checked by overlapping the conservative ChIP-seq peaks with the provided relaxed DNase narrow peaks; slightly different percentages, but the same message for GATA3. Using the same (first) methodology to get simple percentage overlaps with cell-appropriate DNase domains, here are some of the other low ones (I can send you the full table):
HepG2.MAFK|16.3%
MCF-7.ATF2|26.3%
HepG2.ATF7|27.7%
SK-N-SH.JUND|28.1%
GM12878.MAFK|28.6%
SK-N-SH.GATA3|33.5%
H1-hESC.MAFK|39.0%
IMR-90.MAFK|39.5%
SK-N-SH.TCF12|40.1%
SK-N-SH.TEAD4|40.4%
A549.CEBPB|43.8%
H1-hESC.CEBPB|46.1%
HepG2.JUND|47.1%
HepG2.CEBPB|48.6%
H1-hESC.REST|49.3%
HeLa-S3.REST|50.6%
K562.CEBPB|55.8%
IMR-90.CEBPB|57.3%
SK-N-SH.EP300|59.4%
SK-N-SH.MAX|66.2%
GM12878.SPI1|67.4%
HepG2.FOXA2|68.6%
IMR-90.CTCF|68.9%
H1-hESC.SRF|69.1%
HeLa-S3.MAFK|69.9%
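A table like the one above could be produced with a sketch along these lines (pure Python; the helper names, the placeholder dataset label "CELL.TF", and the toy records are my own, not the actual domain-finder used for the numbers above):

```python
def parse_peaks(text):
    """Parse chrom/start/end from tab-separated narrowPeak (BED6+4) lines."""
    peaks = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        peaks.append((fields[0], int(fields[1]), int(fields[2])))
    return peaks

def overlap_pct(chip_peaks, dnase_peaks):
    """Percent of ChIP peaks overlapping at least one DNase peak."""
    dnase_by_chrom = {}
    for chrom, s, e in dnase_peaks:
        dnase_by_chrom.setdefault(chrom, []).append((s, e))
    hits = 0
    for chrom, s, e in chip_peaks:
        # Half-open intervals overlap iff each starts before the other ends.
        if any(ds < e and de > s for ds, de in dnase_by_chrom.get(chrom, [])):
            hits += 1
    return 100.0 * hits / len(chip_peaks) if chip_peaks else 0.0

# Toy example with made-up records; a real run would read the provided
# conservative ChIP-seq and relaxed DNase narrowPeak files per dataset.
chip = parse_peaks("chr1\t100\t200\nchr1\t900\t950")
dnase = parse_peaks("chr1\t150\t300")
table = {"CELL.TF": overlap_pct(chip, dnase)}
for name, pct in sorted(table.items(), key=lambda kv: kv[1]):
    print(f"{name}|{pct:.1f}%")  # prints CELL.TF|50.0%
```

Sorting ascending by percentage surfaces the suspicious datasets at the top, as in the list above.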
ENCODE DNase data in SK-N-SH from the Stam lab (which we provide for the challenge) and from the Crawford lab (they use a different single-cut protocol) also match up very nicely, so no problem there.
-Anshul.
I checked JUND in SK-N-SH as well. The DREAM datasets we provided match those at the ENCODE portal, so no sample swap error there either, at least on our part.
-Anshul.
We'll do a quick pass on all the TF-DNase overlap statistics and check whether the final round datasets show any strange behavior. It is certainly the case that some TFs do not overlap strongly with DNase sites, but if this is specific to a cell type then that seems quite fishy. Will get back to you tomorrow.
-Anshul.
Checked the GATA3 SK-N-SH ChIP-seq as well; this matches the portal data perfectly. Also, if you check the GATA3 ChIP-seq peaks against the ChIP-seq signal tracks, they look good. So at least this case is not a sample swap at our end or a peak-calling problem. I'll check a bit more with the ENCODE production groups to see if something may have gone wrong at submission time.
-Anshul.
OK, so I checked the DNase tracks in SK-N-SH that we provided against the ones at the ENCODE portal, and they match. So there was at least no sample swap for DNase on our end during processing for the challenge. Checking the GATA3 ChIP-seq now.
-Anshul.
There are several cases like this. Thanks for raising this! I did notice some strange patterns in performance metrics but had not yet looked deeply into the cause; I guess this explains it. Other examples could be, for instance, JUND in SK-N-SH and ATF2 in MCF-7.
Btw, which DNase peaks are you using: the conservative set or the relaxed set? The TF peaks are called conservatively using IDR, so there is little chance of large numbers of false positives. A huge difference in overlap would most likely be due to a cell type mismatch between the TF and DNase. The exception is MAFK: MAFF/MAFK factors do have poor overlap with DNase regions; we've observed this all the way since ENCODE2.
But the GATA3 case sounds fishy. Will look into it. Smells like a potential sample swap.
Can you list any other cases that show such behavior?
Thanks,
Anshul.