The Cybertory DNA Sequencing Simulator generates virtual DNA sequencing electropherograms based on the user’s choice of primer, template, and experimental parameters. Results are returned in Standard Chromatogram Format (SCF), which can be read by third-party trace viewers or sequence management programs, such as the Staden package.
Users choose a template and enter a primer sequence, and the sequencing simulator searches the template for potential primer binding sites. Each site is evaluated for priming efficiency using a nearest-neighbor thermodynamic model. The program produces an SCF file containing the superimposed products from all priming sites, weighted by priming efficiency. As a result, if the user has chosen a primer sequence that is not sufficiently unique, or has used insufficiently stringent hybridization conditions, the resulting sequence will be of low quality, or perhaps unusable.
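As a rough illustration of how a binding free energy could translate into a site weight, the sketch below assumes a simple two-state binding model with primer in large excess over template. The function name, the free energy values, the primer concentration, and the temperature are all hypothetical; they are not taken from the simulator, and the free energies are assumed to have been evaluated already at the stated temperature.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def fraction_bound(delta_g_kcal, primer_conc_molar, temp_celsius):
    """Fraction of a site occupied by primer (two-state model, primer in excess)."""
    temp_k = temp_celsius + 273.15
    # Association constant from the binding free energy: K = exp(-dG / RT).
    k_assoc = math.exp(-delta_g_kcal / (R_KCAL * temp_k))
    x = k_assoc * primer_conc_molar
    return x / (1.0 + x)

# Hypothetical free energies (kcal/mol) for a perfect and a mismatched site.
sites = {"perfect": -12.0, "mismatched": -7.0}
occupancy = {name: fraction_bound(dg, 1e-7, 55.0) for name, dg in sites.items()}
```

Under these assumed numbers the perfect site is mostly occupied and the mismatched site is nearly empty, which is the behavior that lets a weighted superposition stay readable when the primer is specific.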
The sequencing simulator produces simulated results in a real-world data format, SCF, so they can be analyzed with real-world tools such as the Staden package. Traditionally, students have learned to use this software to assemble static sets of sequence trace files. A sequencing simulator closes the loop: finishing a sequence and resolving ambiguities is an iterative process, and one often needs to perform additional reactions, frequently with custom primers, to obtain the information needed to join contigs or to clarify uncertain sequence regions. This is not possible with static data sets.
The trace generator creates the chromatograms from the template sequence using its own model of the relationships between local sequence, primer position, peak height, and peak mobility. It then uses a "base caller" to essentially reverse the process, labeling the peaks to produce a sequence. By using a third-party base caller, with a slightly different model of the relationships between traces and sequences, we end up with occasional differences between the original template sequence and the simulated experimental results. This is a model of experimental uncertainty, and is a major component of our sequencing exercises.
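To make the idea of labeling peaks concrete, here is a toy peak-picking base caller. It is only a sketch of the general idea, not the method used by the simulator's base caller: it assumes a noise-free trace stored as one intensity list per channel, and the function names, the threshold, and the toy trace are hypothetical.

```python
import math

def call_bases(trace, threshold=0.1):
    """Label each local maximum of the strongest channel with that channel's base."""
    n_points = len(next(iter(trace.values())))
    calls = []
    for t in range(1, n_points - 1):
        # Pick the channel with the highest intensity at this time point.
        base = max(trace, key=lambda b: trace[b][t])
        signal = trace[base]
        # Call a base where that channel peaks above the threshold.
        if signal[t] >= threshold and signal[t - 1] < signal[t] >= signal[t + 1]:
            calls.append((t, base))
    return calls

def gaussian(center, width, n_points):
    """A unit-height Gaussian peak sampled at integer time points."""
    return [math.exp(-0.5 * ((t - center) / width) ** 2) for t in range(n_points)]

# Toy four-channel trace with one clean peak per base.
n = 50
toy_trace = {
    "A": gaussian(10, 2.0, n),
    "C": gaussian(20, 2.0, n),
    "G": gaussian(30, 2.0, n),
    "T": gaussian(40, 2.0, n),
}
called = "".join(base for _, base in call_bases(toy_trace))
```

A caller this simple would miscall overlapping or noisy peaks, which hints at why two different models of the trace-to-sequence relationship can disagree.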
Approximate primer/template matches are found using the “Shift-OR” algorithm [Wu and Manber 1991], alignments are performed, and nearest neighbor thermodynamic calculations determine the fraction of each candidate site occupied by primer under the specified conditions. Priming coefficients are calculated as a function of binding and the 3' end of the aligned primer. Each priming site marks the beginning of a component sequence, weighted by its priming coefficient. Bases are represented as Gaussian peaks described by height, position, and width. The intensity of each channel at each time point is a weighted sum of component peaks.
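The bit-parallel search can be sketched as follows. This is a minimal illustration and not the simulator's code: it uses the Shift-And formulation of the same idea (set bits mean "matched") and allows substitutions only, whereas the full Wu and Manber scheme also handles insertions and deletions. The function name and the test sequences are hypothetical.

```python
def approx_find(pattern, text, max_mismatches):
    """Return (start, mismatches) for each window of text that matches pattern
    with at most max_mismatches substitutions, using bit-parallel state vectors."""
    m = len(pattern)
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # state[d] bit i: pattern[0..i] matches the text ending here with <= d mismatches.
    state = [0] * (max_mismatches + 1)
    hits = []
    for j, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & mask
        for d in range(1, max_mismatches + 1):
            cur_old = state[d]
            # Either extend with a matching character, or spend one mismatch
            # on top of a prefix that matched with d - 1 mismatches.
            state[d] = ((((cur_old << 1) | 1) & mask) | ((prev_old << 1) | 1)) & full
            prev_old = cur_old
        for d in range(max_mismatches + 1):
            if state[d] & accept:
                hits.append((j - m + 1, d))  # report the smallest mismatch count
                break
    return hits

hits = approx_find("ACGT", "TTACGTTTAGGTTT", 1)
```

Each candidate returned this way would then go on to alignment and thermodynamic scoring; the search itself only narrows the template down to plausible binding sites.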
Peak heights [Takahashi 2002] and spacing are predicted based on distance from the primer and on the local sequence. Peak width increases gradually with distance from the primer, and random noise is added to the trace intensity points. We have updated and enhanced the public domain base caller "autoseq" [Hart 1992] to add called bases to the SCF output.
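The trace model described above, a weighted sum of Gaussian peaks that broaden with distance from the primer plus random noise, can be sketched as follows. The uniform peak spacing, unit peak heights, linear width growth, and all parameter values are simplifying assumptions for illustration; the simulator predicts heights and spacing from local sequence, which this sketch does not attempt.

```python
import math
import random

CHANNELS = "ACGT"

def synthesize_trace(components, n_points, spacing=10.0, base_width=2.0,
                     width_growth=0.01, noise_sd=0.0, seed=0):
    """components: list of (sequence, weight) pairs, one per priming site.
    Returns a dict mapping each base to a list of channel intensities."""
    rng = random.Random(seed)
    trace = {base: [0.0] * n_points for base in CHANNELS}
    for sequence, weight in components:
        for index, base in enumerate(sequence):
            center = (index + 1) * spacing
            # Assumed linear broadening with distance from the primer.
            width = base_width + width_growth * center
            lo = max(0, int(center - 4 * width))
            hi = min(n_points, int(center + 4 * width) + 1)
            for t in range(lo, hi):
                trace[base][t] += weight * math.exp(-0.5 * ((t - center) / width) ** 2)
    if noise_sd > 0:
        for base in CHANNELS:
            trace[base] = [v + rng.gauss(0.0, noise_sd) for v in trace[base]]
    return trace

# A strong primary product overlaid with a weaker product from a second site.
trace = synthesize_trace([("ACGT", 1.0), ("GGCA", 0.25)], n_points=60)
```

With a second component of comparable weight, the two sets of peaks would overlap at every position, which is how a poorly chosen primer degrades the simulated read.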
This diagram illustrates the data flow in a sequencing exercise:
Data flow diagram
Sequence assembly is normally taught with “canned” data, where students are given a set of sequence chromatogram data to manage and interpret. With the simulator we can close the loop, so students can do experiments, including primer walking with custom sequencing primers of their own design, to address particular problem areas in assembly or sequence quality. This enables more active problem-solving exercises.
We have used this simulator with the Staden package for undergraduate exercises in shotgun sequencing and HIV genotyping by trace subtraction.
This program is available as open source from www.cybertory.org. Funded by NIH grant #R44RR1364502A2 to Attotron Biosensor Corporation.