Simulated datasets are valuable because the ground truth is known, but if the simulated sets do not at least approximate real-world error profiles, there are major issues in extrapolating simulated results as indicative of the results to expect on real-world datasets, where the ground truth is unknown or, at best, inferred from other evidence. Is there any detailed information available by which the error profile used to generate the simulated read sets can be independently verified for its suitability in approximating real-world reads produced by a full sequencing pipeline? For instance: the proportion of simulated reads with retained adaptors, and, if PCR was part of the pipeline, the contribution of PCR artefacts such as chimeric crossovers. Simply describing simulated datasets as having an "Illumina error profile" tells the read-set analyst nothing about how well the reads approximate real-world error profiles, since errors arising during library preparation, prior to the final (Illumina) base calling, are dominant.
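
As a concrete example of the kind of quantification that would help, here is a minimal sketch of how the adapter-retention rate of a simulated read set could be estimated. The file name and the adapter sequence (a common Illumina TruSeq adapter prefix) are placeholders, not anything taken from the simulator in question:

```python
# Minimal sketch: estimate the proportion of simulated reads that retain
# adapter sequence, one of the error-profile metrics asked about above.
# Both ADAPTER_PREFIX and FASTQ_PATH are assumed placeholders for illustration.

import gzip

ADAPTER_PREFIX = "AGATCGGAAGAGC"          # assumed: common Illumina TruSeq adapter prefix
FASTQ_PATH = "simulated_reads.fastq.gz"   # hypothetical simulated read set

def adapter_retention_rate(path: str, adapter: str = ADAPTER_PREFIX) -> float:
    """Return the fraction of reads whose sequence contains the adapter prefix."""
    total = retained = 0
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                # FASTQ: the sequence is the 2nd line of each 4-line record
                total += 1
                if adapter in line:
                    retained += 1
    return retained / total if total else 0.0

if __name__ == "__main__":
    rate = adapter_retention_rate(FASTQ_PATH)
    print(f"reads with retained adapter: {rate:.2%}")
```

Reporting this kind of figure alongside the simulated data (together with, say, an estimated chimera rate) would let an analyst judge how closely the simulation matches a full library-preparation-plus-sequencing pipeline.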

Created by Stuart Stephen (stuartjs)

Training (simulated) readsets should be quantified for their error profiles