Hi,

To build the training dataset, I need to extract sequences from the hg19.genome.fa file. I tried getting every training sequence (1000 bp total: the 200 bp prediction region plus 400 bp of context on either side) with the following bedtools command:

bedtools getfasta -fi hg19.genome.fa -bed train_regions.blacklistfiltered.bed -fo train_regions_1000_sequences.fa

However, unsurprisingly, the output FASTA ends up being extremely large (~30 GB). I was planning to convert the FASTA file to HDF5, but I can't really work with a file of this size. How are others processing their data into a format that can be read into an ML model for training? I realize I'll have to do one-hot encoding, etc., but what needs to be done before getting to that step?

One option is reading in each sequence immediately before it's used (and then perhaps only sampling a small fraction of the unbound training regions). Does this work better?
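For the read-just-before-use option, I'm imagining something like the minimal sketch below (assuming pyfaidx and numpy are available; the one_hot and fetch_one_hot helper names and the example coordinates are just illustrative):

import numpy as np
from pyfaidx import Fasta

BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seq):
    # Encode a DNA string as a (len(seq), 4) array; N and other characters stay all-zero.
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        col = BASE_TO_INDEX.get(base)
        if col is not None:
            arr[i, col] = 1.0
    return arr

genome = Fasta('hg19.genome.fa')  # indexed random access, no 30 GB intermediate FASTA

def fetch_one_hot(chrom, start, end, context=400):
    # Pull the 200 bp region plus 400 bp of flanking context (1000 bp total) and encode it.
    seq = genome[chrom][start - context:end + context].seq
    return one_hot(seq)

# Example (made-up coordinates): x = fetch_one_hot('chr1', 1000000, 1000200)  -> (1000, 4) array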

Created by Rajiv Movva (rmovva)
I do a small amount of pre-computing (storing the one-hot encoding of the entire filtered genome, with no overlaps) and then generate the batches on the CPU while training the model on the GPU.
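A rough sketch of this kind of setup (not the actual code referred to above) could look like the following, assuming h5py, numpy, and pyfaidx; the function, dataset, and file names are placeholders:

import h5py
import numpy as np
from pyfaidx import Fasta

def onehot_chromosome(seq_str):
    # Vectorized one-hot encoding of a whole chromosome; non-ACGT bases stay all-zero.
    codes = np.frombuffer(seq_str.upper().encode('ascii'), dtype=np.uint8)
    out = np.zeros((len(codes), 4), dtype=np.uint8)
    for col, base in enumerate(b'ACGT'):  # iterating over bytes yields integer codes
        out[:, col] = (codes == base)
    return out

def precompute(fasta_path='hg19.genome.fa', out_path='hg19_onehot.h5'):
    # One-time pass: store one (chrom_length, 4) uint8 dataset per chromosome.
    genome = Fasta(fasta_path)
    with h5py.File(out_path, 'w') as f:
        for chrom in genome.keys():
            f.create_dataset(chrom, data=onehot_chromosome(str(genome[chrom])),
                             compression='gzip', chunks=True)

def batch_generator(h5_path, regions, batch_size=64, context=400):
    # regions: list of (chrom, start, end, label) tuples parsed from the filtered BED file.
    # Yields (batch_size, 1000, 4) inputs plus labels indefinitely, sampling regions at random.
    # Assumes regions are far enough from chromosome ends that the 400 bp flanks fit.
    with h5py.File(h5_path, 'r') as f:
        while True:
            picks = np.random.randint(0, len(regions), size=batch_size)
            xs, ys = [], []
            for i in picks:
                chrom, start, end, label = regions[i]
                xs.append(f[chrom][start - context:end + context])
                ys.append(label)
            yield np.stack(xs).astype(np.float32), np.asarray(ys)

Stored as uint8 this is about 4 bytes per base uncompressed (roughly 12 GB for hg19), but it gzip-compresses well since three of every four values are zero, and slicing the HDF5 file only ever loads one batch at a time.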
Yeah, I'm planning on using a convnet -- so should I extract the sequences one batch at a time? Your code would be a very helpful resource, thanks.
If you have some kind of online learning algorithm, or a batch learning algorithm like a neural network, you can create the batches while you train. I'm going to post my code on GitHub soon, so you'll be able to see how it could be done.
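To illustrate the create-batches-while-you-train idea (again, not the code mentioned above), a generator like the one sketched earlier can be fed directly to a small Keras convnet; this assumes TensorFlow/Keras, and the layer sizes are arbitrary placeholders:

from tensorflow.keras import layers, models

# A small 1D convnet over (1000, 4) one-hot inputs; filter counts and widths are arbitrary.
model = models.Sequential([
    layers.Conv1D(64, 19, activation='relu', input_shape=(1000, 4)),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, 11, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# With a generator, only one batch of sequences is ever held in memory, so the 30 GB
# FASTA never needs to exist. `regions` would come from parsing
# train_regions.blacklistfiltered.bed (chrom, start, end, label per line).
# model.fit(batch_generator('hg19_onehot.h5', regions),
#           steps_per_epoch=1000, epochs=10)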
