In the interest of open science I'm releasing my code for the challenge. It is intended to be fully self-contained (save for dependencies on the synapseclient, pysam and pyDNase Python packages), including programmatic download of the challenge data, pre-processing, model fitting, prediction and submission. Performance is highly competitive for some TFs (e.g. MAX https://www.synapse.org/#!Synapse:syn6131484/wiki/402503) and less so for others (e.g. REST https://www.synapse.org/#!Synapse:syn6131484/wiki/402505).

The code is here: https://github.com/davidaknowles/tf_net

The core is a fairly standard convolutional neural net on genomic sequence, implemented in Theano, with the following features added:

* normalized per-base DNase I cuts for the + and - strands are concatenated onto the one-hot encoding of the sequence, giving a [sequence context] x 6 input matrix
* gene expression PCs are included as features to allow the model to interpolate between different cell types
* a three-class ordinal likelihood is used for the Unbound/Ambiguous/Bound labels
* the forward and reverse-complement strands are analyzed simultaneously
* the negative set is down-sampled to speed up training (and this is accounted for by weighting the likelihood)

Please feel free to give it a go and let me know if you have any problems!
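As a rough illustration of the first and fourth points, here is a minimal NumPy sketch of how such an [L x 6] input could be built and reverse-complemented. The function names, the normalization scheme, and the channel ordering are my own assumptions for illustration, not code from the tf_net repository:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """L x 4 one-hot encoding; unknown bases (e.g. N) become all-zero rows."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4))
    for i, b in enumerate(seq.upper()):
        if b in idx:
            mat[i, idx[b]] = 1.0
    return mat

def encode_region(seq, cuts_plus, cuts_minus):
    """Concatenate one-hot sequence with strand-specific DNase cut counts
    to give an L x 6 input matrix. The normalization here is a crude
    placeholder; the actual scheme in the repository may differ."""
    dnase = np.stack([cuts_plus, cuts_minus], axis=1).astype(float)
    dnase /= dnase.sum() + 1.0
    return np.concatenate([one_hot(seq), dnase], axis=1)

def reverse_complement(x):
    """Reverse-complement an L x 6 input: flip positions, swap the A<->T
    and C<->G channels, and swap the +/- strand DNase channels."""
    return x[::-1, [3, 2, 1, 0, 5, 4]]
```

The reverse-complement transform is its own inverse, so the same network can be applied to both orientations of a region and the outputs combined.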

Created by David Knowles davidaknowles
Yup!
Hi Daniel, I think you have the right idea. In particular, there is no sense of linking specific peaks to genes. The 8 PCs are used as extra features for the model. Note that these PC features do not vary across regions, only across cell types. Is that clear? If not I can write it out formally. Best, David. P.S. Your numbers are looking very good, will be interested to hear what you're doing eventually!
Hi David, Just wanted to confirm how you used the GE principal component features. From what I can tell, you performed PCA for all ~20k transcript quantities across the different cell lines, and extracted the top 8 principal components. You then used these 8 principal components as non-structural extra meta features for your model. Am I correct? Thanks, Daniel
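The procedure described above (PCA on transcript quantities across cell types, top 8 PCs as extra per-cell-type features) could be sketched as follows. The shapes, variable names, and random stand-in expression matrix are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for an expression matrix: rows = cell types, columns = ~20k
# transcript quantities (e.g. TPM). Real data would come from the challenge.
rng = np.random.default_rng(0)
n_cell_types, n_transcripts = 12, 20000
expr = rng.normal(size=(n_cell_types, n_transcripts))

# Top 8 principal components across cell types.
pca = PCA(n_components=8)
cell_type_pcs = pca.fit_transform(expr)  # shape: (n_cell_types, 8)

# Every training example from a given cell type gets that cell type's PCs
# as 8 extra features; they are constant across genomic regions.
cell_type_index = 3
region_features = cell_type_pcs[cell_type_index]  # shape: (8,)
```

Because the features vary only across cell types, they effectively let the network interpolate between the training cell types when predicting on a held-out one.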
Jim - no problem, hope it's of some use/interest to others! Ivan - that's correct. I know there are ways (fixing some filters, initialization, using a prior) to incorporate known TF motifs, but I didn't get around to doing that - I may do so before the final submission, but this is very much a side project! I have considered using DNase similarity/PCs: I don't have a good sense of whether those would work better than GE (no reason not to use both, I suppose, as you suggest). From a practical point of view DNase similarity would actually make the most sense, since you might not have GE for some new cell type, so it would be nice to only require DNase.
Dear David, Thank you for sharing. If I understood correctly, you do not use any premade representation of TF-binding sequence motifs and learn them along with the whole model. I love the idea of interpolating the training cell types based on gene expression similarity. Have you considered to incorporate an overall similarity of DNase profiles in a similar fashion? Best, Ivan
David, this is much appreciated in the spirit of open science. Thanks for your efforts. Kind Regards, Jim
