I've got a convolutional neural network running in TensorFlow that tries to learn filters for positive-binding regions from DNA sequence, DNA shape, and local chromatin state. The program runs, but given the large number of possible "training windows," the potential training dataset is huge. Pulling and parsing a single randomized minibatch of 1000 examples (DNA sequence via pysam, chromatin state via bigWig, DNA shape via bigWig, labels via tabix) takes several minutes. At this speed, I don't believe I'll be able to train and evaluate all the final submission cell types, even with multiple machines running.

I was wondering how other folks are dealing with the size of the dataset. I've considered pre-processing all input data into a serialized binary format (which results in a massive file), or training only on a subset of the genome (e.g. at transcription start sites), which certainly introduces bias.

My research training is actually primarily in genetics, and I've been branching out into machine learning. I just wanted to get a feel for how ambitious this approach sounds compared to possibly less computational approaches out there.

Created by Lawrence Du
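(For readers hitting the same bottleneck: one way to realize the "serialized binary format" idea above is a one-time pass that writes every window into an on-disk NumPy memmap, so minibatch sampling becomes array indexing instead of repeated pysam/bigWig/tabix calls. This is a minimal sketch under assumed sizes; the window count, channel layout, and the parsing step are placeholders, not details from the thread.)

```python
import numpy as np

# Hypothetical sizes: 1000 bp windows, 4 one-hot DNA channels plus
# 2 extra tracks (chromatin signal, DNA shape); adjust to your data.
N_WINDOWS, WIN_LEN, N_CHANNELS = 10_000, 1000, 6

def preprocess_to_memmap(path, n_windows=N_WINDOWS):
    """One-time pass: parse each window (pysam/pyBigWig/tabix in the
    real pipeline) and write it into an on-disk float32 array."""
    arr = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(n_windows, WIN_LEN, N_CHANNELS))
    for i in range(n_windows):
        # placeholder for the real parsing step; random data stands in
        arr[i] = np.random.rand(WIN_LEN, N_CHANNELS).astype(np.float32)
    arr.flush()
    return arr.shape

def sample_minibatch(path, batch_size=1000, n_windows=N_WINDOWS):
    """Random minibatch by row index -- the OS pages in only the rows
    actually read, so this stays fast even when the file is far
    larger than RAM."""
    arr = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n_windows, WIN_LEN, N_CHANNELS))
    idx = np.random.choice(n_windows, size=batch_size, replace=False)
    return arr[idx]  # fancy indexing copies just the sampled rows
```

The file is large (n_windows × window_len × channels × 4 bytes), but random reads of 1000 rows take milliseconds rather than minutes, and the parsing cost is paid once up front.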
Hi Lawrence,

Preprocessing the data into memmapped / on-disk arrays is a reasonable approach. Training on the entire genome with uniform random sampling of examples is unlikely to work given how imbalanced TF-binding tasks are with respect to the whole genome: the model will not get to sample enough positive examples this way to learn much, if anything. So training on a subset of the genome is not a bad idea, provided that subset is chosen carefully.

You could do this iteratively. For example, if you train on TSSs and find that after a few epochs you consistently underperform at enhancer regions, you could sample more from the regions where the model is underperforming. The key is to sample carefully: not too imbalanced, but with enough variety to minimize bias.

Good luck!
-Johnny
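(The "not too imbalanced, but enough variety" advice above can be sketched as a stratified minibatch sampler that fixes the fraction of bound windows per batch. `balanced_batch_indices` and the `pos_frac` value are illustrative assumptions, not something prescribed in the thread.)

```python
import numpy as np

def balanced_batch_indices(labels, batch_size=1000, pos_frac=0.25,
                           rng=None):
    """Sample minibatch indices with a controlled positive fraction.

    TF-binding labels are extremely imbalanced genome-wide, so uniform
    sampling yields almost no positives; here positives are oversampled
    to `pos_frac` of each batch (a tunable, illustrative knob).
    """
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(int(batch_size * pos_frac), len(pos))
    n_neg = batch_size - n_pos
    idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                          rng.choice(neg, n_neg, replace=False)])
    rng.shuffle(idx)  # avoid a positives-first ordering within the batch
    return idx
```

The same indices can drive the memmap lookup (or whatever storage you use), and re-weighting which regions feed the negative pool is one way to implement the iterative "sample more where the model underperforms" loop.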

Pulling/parsing training examples efficiently. Anyone using convolutional neural networks?