Dear organizers, looking at the training data, the sequence lengths vary significantly. The 3' flanking regions look truncated in some cases, but sometimes the insert between the constant regions itself seems to have a different length (quite often ±1 bp). Can you clarify whether the random inserts really were of variable size (80 bp ± X bp), or do the observed differences arise from sequencing errors? BTW, there are also poly-N regions in the training data. These peculiarities were probably already discussed during the webinar; I apologize if I missed the respective details.

Created by Ivan Kulakovskiy ivan.kulakovskiy
Great, thank you for a very detailed clarification.
Hi @ivan.kulakovskiy, we did not discuss this, so thank you for bringing it up! There is some heterogeneity in sequence lengths. Historically, we have seen that inserts of 79 bp are common, <=78 bp less so, and >=81 bp are also rare. These likely arise from errors in synthesis, where deletions due to base misincorporation are common, particularly with long oligos. Some of the really long sequences are probably incorrect (e.g. 4197 have a total size, including constant flanks, of >120 bp); these result from incorrect reconstruction of the complete sequence from the two sequencing reads, which overlap in the middle and must be aligned. The Ns represent cases where the sequencer was not confident enough to call a base at that position. Since we have no reference to compare to, we don't actually know the bases at those positions, so we left them as Ns. The flanking regions should all be identical, with the 17 bp TGCATTTTTTTCACATC upstream and the 13 bp GGTTACGGCTGTT downstream of the random 80 bp region.
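For anyone who wants to check the length distribution themselves, here is a minimal Python sketch (not the organizers' code). It assumes the sequences are already loaded as plain strings; the function names `insert_length` and `summarize` and the 120 bp cutoff constant are my own choices for illustration, with the cutoff taken from the ">120 bp" remark above. The two flank strings are the ones quoted in the reply.

```python
from collections import Counter

UPSTREAM = "TGCATTTTTTTCACATC"  # constant 5' flank quoted in the reply
DOWNSTREAM = "GGTTACGGCTGTT"    # constant 3' flank quoted in the reply
MAX_TOTAL_LEN = 120             # assumption: longer totals are likely misassembled


def insert_length(seq):
    """Return the insert length, or None if a flank is missing or truncated."""
    seq = seq.strip().upper()
    # Guard against the two flanks overlapping in a very short sequence.
    if len(seq) < len(UPSTREAM) + len(DOWNSTREAM):
        return None
    if not (seq.startswith(UPSTREAM) and seq.endswith(DOWNSTREAM)):
        return None
    return len(seq) - len(UPSTREAM) - len(DOWNSTREAM)


def summarize(seqs):
    """Count insert lengths and flag sequences with the issues discussed above."""
    lengths = Counter()
    flagged = {"bad_flank": 0, "has_N": 0, "too_long": 0}
    for seq in seqs:
        seq = seq.strip().upper()
        if "N" in seq:
            flagged["has_N"] += 1      # uncalled base(s) left as N
        if len(seq) > MAX_TOTAL_LEN:
            flagged["too_long"] += 1   # probably a bad merge of the two reads
        n = insert_length(seq)
        if n is None:
            flagged["bad_flank"] += 1  # e.g. truncated 3' flank
        else:
            lengths[n] += 1
    return lengths, flagged
```

Calling `summarize` on the list of training sequences gives a `Counter` of insert lengths plus counts of the flagged cases, which can then be used to decide whether to filter, pad, or trim before training.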

Length of the random insert in the training data