PCR Primer Design

1. Summary

Given the sequence of a bacterial gene, you will learn to design a pair of PCR primers to amplify a particular target region. You will test the effects of reaction conditions on reaction yield and specificity for various primers. Finally, you will test your primers on different bacterial strains, for which you do not have the genomic sequence, to see if they will amplify related genes.

Technologies: This exercise uses the Cybertory PCR simulator with the E. coli genome.

Time required: approximately 2 to 4 hours.

2. Learning Objectives

After completing this exercise, you should be able to:

  1. Design PCR primers to amplify given target regions from a DNA template.

  2. Identify the direction in which a given primer would be extended by polymerase on a given template.

  3. Explain how primer length and base composition affect annealing temperature.

  4. Cite three reaction parameters affecting the specificity of a primer.

  5. Explain why the products of a PCR generally run as a single band on a gel.

3. Background

In many PCR applications you use standard, pre-synthesized primers. For example, sex-typing of human DNA can be done using primers for the amelogenin locus on the X and Y chromosomes. The sequences of these primers were chosen so that they match both versions of the locus.

In research, however, there are many cases where you will have to order custom-made primers. Primers are just short single-stranded DNA molecules, so the design of a primer is simply a question of deciding what its sequence will be.

Here we will practice primer design by attempting to amplify specific target regions from a bacterial genome. You will be given a copy of the sequence of the E. coli genome, together with a set of annotations describing the sites of various loci.

Information is stored in DNA as a sequence of nucleotides, much as information is stored in a computer file as a sequence of binary digits. Sequence information is essentially one-dimensional, since it can be represented by stringing characters together in a line.

Simple 'arrow notation' for representing DNA molecules

The figure above shows the sequence of both strands of a double-stranded DNA molecule, and illustrates how we can refer to portions of the molecule with simple lines having half-arrowheads.

As you should already know, the two strands in DNA are held together largely through hydrogen bonding between base pairs. Normal base pairing matches A with T, and C with G. The bases in these pairs are said to be complementary to one another.

Further, each strand has a directionality. Individual nucleotides are not really like round beads on a string, because each nucleotide has two distinct sites where it connects to other nucleotides. These are the 5' (five prime) and 3' (three prime) carbons, located in the deoxyribose (sugar) part of the nucleotide.

DNA is produced in living organisms by enzymes called DNA polymerases. Most of these work by adding nucleoside triphosphates on to the 3' end of an existing DNA strand. This means that polymerases construct DNA strands in the 5' to 3' direction. We use our half-arrowhead to indicate the direction in which a strand would grow, if it were to be extended by polymerase.
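
The pairing and direction rules above can be sketched in a few lines of Python. This is only an illustration: the function name reverse_complement and the example sequence are assumptions made here, not part of the exercise or the simulator. Given one strand written 5' to 3', it returns the partner strand, also written 5' to 3'.

```python
# Minimal sketch: complementary strand of a DNA sequence, assuming the
# input is written 5'->3' and contains only the letters A, C, G and T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    # Complement every base, then reverse so the result also reads 5'->3'.
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    top = "GATTACA"  # made-up example strand
    print("top    5'-" + top + "-3'")
    print("bottom 5'-" + reverse_complement(top) + "-3'")
```

The reversal matters because the two strands of a duplex run in opposite directions: the base paired with the 5' end of one strand sits at the 3' end of the other.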

Now we will use this notation to show the PCR process. PCR is done by repeated rounds of amplification. Each round has three stages. First, we raise the temperature to melt the strands apart. Then we lower the temperature to let the primers anneal (the primers anneal much more quickly than the two template strands can come back together, because the primers are in much higher concentration). Finally, we incubate to allow the polymerase to extend the primers. This figure illustrates the first round of amplification:

First round of PCR amplification

Note that the primers match up with portions of the template; this means that their sequences match the template at those places. The target region that we want to amplify from the template is marked off with vertical dotted red lines. The right and left primers are each oriented in toward the region being amplified, and thus toward each other. The right primer will have the same sequence as a region of the top strand, meaning that it will prime on the bottom strand. The left primer has the same sequence as a section of the bottom strand, so it will bind to and prime on the top strand.

The extension of each primer is shown by the dotted lines. The result, at the end of the first round, is that we have four strands: the two we started with, plus the two that we just made by extending the left and right primers. The strands made by extending primers each start at a defined position on the template, which is where the primer bound. Note that, although we are trying to amplify the region between the vertical dotted lines, the polymerase does not know where we want it to stop. It keeps going, past our other vertical dotted red line, to the end of the template strand (or as far as it can go, generally not more than a few kilobases).

Second round of PCR amplification

In the second round, we will go from four strands to eight, but the eight product strands are of several different types. We have the original template strands, which are “long” on both ends, in that they are longer than our target region. We have primer extension products that were made using these original long strands as templates. These start at a primer, which places their 5' ends at exactly one edge of our target region, but they go on past the other primer binding site. This category of product was produced in the first round as well, but in the second round we also see for the first time a third category of product: these go exactly from one end of our target region to the other. These strands are “short on both ends”, meaning both ends are at the desired positions. They are made from extending a primer using as a template one of the earlier strands which was itself extended from a primer. These strands have one end defined by the position of the primer that makes the strand itself, and the other end defined by the position of the primer that made its template.

Types of strands produced by PCR

Now that we can categorize the types of strands, we can ask how many of each type we can expect at each round of a PCR amplification. This table shows the total numbers of each type of strand after various numbers of amplification cycles:

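| Cycle | Long on both ends | Long on one end, short on the other | Short on both ends | Total strands | Total amplification |
|-------|-------------------|-------------------------------------|--------------------|---------------|---------------------|
| 0     | 2                 | 0                                   | 0                  | 2             | 1                   |
| 1     | 2                 | 2                                   | 0                  | 4             | 2                   |
| 2     | 2                 | 4                                   | 2                  | 8             | 4                   |
| 3     | 2                 | 6                                   | 8                  | 16            | 8                   |
| 4     | 2                 | 8                                   | 22                 | 32            | 16                  |
| 5     | 2                 | 10                                  | 52                 | 64            | 32                  |
| 6     | 2                 | 12                                  | 114                | 128           | 64                  |
| 7     | 2                 | 14                                  | 240                | 256           | 128                 |
| 8     | 2                 | 16                                  | 494                | 512           | 256                 |
| 9     | 2                 | 18                                  | 1004               | 1024          | 512                 |
| 10    | 2                 | 20                                  | 2026               | 2048          | 1024                |
| 20    | 2                 | 40                                  | 2097110            | 2097152       | 1048576             |
| 30    | 2                 | 60                                  | 2147483586         | 2147483648    | 1073741824          |
| 40    | 2                 | 80                                  | 2199023255470      | 2199023255552 | 1099511627776       |
| n     | 2                 | 2n                                  | 2^(n+1) - 2n - 2   | 2^(n+1)       | 2^n                 |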

Note that the total number of strands doubles with each cycle. This is called “exponential amplification” because it doubles each round, so the amount of amplification is two to the power of the number of rounds. But consider what types of strands we get with all of this amplification. The number of strands that are “long on both ends”, like the original template strands, never changes. It is always the same number we started with, because we can't make these strands (unless we use a primer that binds at the very end of the template, which we are not doing in this example). Now look at the third column, which shows the number of strands that are “short” on one end and “long” on the other. These strands are made by extending primers bound to the original template. Since, in this example, there are only two original template strands, we only get two more strands in this category with each cycle. It is the fourth column, showing the number of strands short on both ends, that really gives us the impressive numbers. These strands are produced by extending primers on strands that are themselves primer extension products. The 3' ends of these strands are short because they run off the ends of the templates, and the ends of the templates are short because they start with one of the primers.
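
The bookkeeping behind these counts can be reproduced with a short Python sketch. It assumes ideal conditions (every strand is copied exactly once per cycle, which real reactions only approximate), and the function name strand_counts is made up for this illustration. Under that assumption, the number of strands short on both ends works out to 2^(n+1) - 2n - 2 after n cycles, which is the total number of strands minus the other two categories.

```python
# Sketch of PCR strand bookkeeping, assuming every strand present is
# copied exactly once in every cycle (perfect efficiency).
def strand_counts(cycles, templates=2):
    """Return (long_both, long_one_end, short_both) after `cycles` rounds."""
    long_both, long_one, short_both = templates, 0, 0
    for _ in range(cycles):
        # Copying an original template gives a strand that starts at a
        # primer but runs on past the other primer site.
        new_long_one = long_both
        # Copying any primer-extension product gives a strand that starts
        # at one primer and stops where its template ends, at the other.
        new_short = long_one + short_both
        long_one += new_long_one
        short_both += new_short
    return long_both, long_one, short_both


if __name__ == "__main__":
    for n in (0, 1, 2, 3, 4, 5, 10, 20, 30, 40):
        counts = strand_counts(n)
        print(n, *counts, sum(counts))
```

Only the originals ever give rise to new strands that are long on one end, so that category grows linearly, while everything else feeds the category that is short on both ends.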

The take-home point is that, after a couple dozen rounds of amplification, almost all of the products extend exactly from the 5' end of one primer to the 5' end of the other. This means that PCR gives us both large amounts of product (because it produces more or less exponential amplification), and these products are of a defined and predictable size (reaching from one primer to the other).

4. Primer length

The diagrams and discussion above address primer position and orientation, but they don't tell us how long the primer needs to be. We consider three factors that affect length requirements: hybridization stability, template complexity, and the possibility of mismatches.

1. Melting temperature

The melting temperature (Tm) of a DNA duplex is the temperature at which half of the molecules are double stranded and half are single stranded. It is a commonly used measure of the stability of a DNA duplex, as more stable duplexes have higher melting temperatures.

Reaction conditions, especially DNA concentration and salt concentration (particularly the concentration of divalent cations like magnesium) affect DNA hybridization, so melting is usually measured using standard conditions.

In general, longer strands have higher melting temperatures, as do sequences with higher G and C content. A simple formula called “Itakura's empirical rule” has been developed to make a quick and dirty estimate of the melting temperature of an oligonucleotide:

Tm = 2 (A+T) + 4 (G+C)

where Tm is the melting temperature in degrees Celsius. In other words, the melting temperature gets two degrees for each A:T pair in the duplex, and four degrees for each G:C pair. This number, often called the 'Wallace temperature', is frequently eerily accurate for ranges between about 45 and 70 degrees.
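
As a quick illustration, the rule can be written as a small Python function. The name wallace_tm and the example sequences are assumptions for this sketch; the arithmetic is just the formula given above, and it inherits the same limits (short oligonucleotides, rough estimate only).

```python
# Sketch of the 2-and-4 rule: Tm = 2 (A+T) + 4 (G+C), in degrees Celsius.
def wallace_tm(primer):
    """Estimate the melting temperature of a short oligonucleotide."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Made-up 16-base examples with low, medium and high GC content.
    for seq in ("ATATATATATATATAT", "ACGTACGTACGTACGT", "GCGCGCGCGCGCGCGC"):
        print(seq, wallace_tm(seq))
```

Primers of the same length can differ widely in estimated Tm depending on their base composition, which is why both length and GC content matter when choosing an annealing temperature.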

Other approximations are used for longer DNA sequences. Generally, more accurate estimations can be made using nearest neighbor thermodynamic models, which take into account the stacking energy of neighboring base pairs.

Of course, hybridization is not really the same as priming. For example, mismatches between primer and template near the 3' end of a primer can hinder or prevent primer extension, even if the primer hybridizes stably. In other words, binding is necessary but not always sufficient for priming. Also, PCR buffer generally has far lower salt and DNA concentrations than the standard hybridization buffers. Nevertheless, Tm is a very important consideration in primer design. Generally, we want both primers in a PCR to have similar melting temperatures, because this will let us find an annealing temperature at which both have near-optimal specificity. Even if our approximations don't give accurate absolute numbers (because, for example, we are not using the standard salt concentration), they are generally all we need if they help us design primers with relatively similar annealing temperatures.
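
One possible way to get two primers with similar estimated melting temperatures is to fix where each primer starts and then lengthen it one base at a time until the estimate reaches a shared target. The Python sketch below illustrates that idea only; it is an assumption made here, not a procedure prescribed by the exercise or used by the simulator. The names wallace_tm and extend_to_tm, the length limits, and the template are all hypothetical, and the estimate is the simple 2-and-4 rule.

```python
# Sketch: grow a primer until its estimated Tm reaches a target value.
def wallace_tm(primer):
    """Estimate Tm in degrees Celsius as 2 (A+T) + 4 (G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def extend_to_tm(template, start, target_tm, min_len=12, max_len=30):
    """Return the shortest primer starting at `start` whose estimated Tm is
    at least `target_tm`, or None if none exists within the length limits."""
    for length in range(min_len, max_len + 1):
        primer = template[start:start + length]
        if len(primer) < length:
            break  # ran off the end of the template
        if wallace_tm(primer) >= target_tm:
            return primer
    return None


if __name__ == "__main__":
    template = "ATGCATGCATGCATGCATGCATGC"  # made-up sequence
    primer = extend_to_tm(template, 0, 46)
    print(primer, wallace_tm(primer))
```

Applying the same target to both primers of a pair keeps their estimates within a few degrees of each other, at the cost of letting the two primers differ in length.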

2. Template complexity

A bacterial genome is typically a few million base pairs (on the order of 1e6 bp in scientific notation; E. coli is about 4.6e6 bp), while mammalian genomes are generally several billion (say 3e9 bp). This makes it much easier to make an oligonucleotide that matches exactly one position in a bacterial genome than in a mammalian genome.

For example, say we have an oligo 12 bases long. What are the odds that it will match a randomly chosen 12 bp sequence? Assuming that each of the four bases is equally likely, each base has a one in four chance of matching, so the probability of matching all 12 bases is one-fourth to the twelfth power (0.25^12), or about 5.96e-8. On average, we would expect such a random match to happen about once every 1.7e7 base pairs. We would be unlikely to find a perfect match to an arbitrary 12 bp sequence in a bacterial genome, but quite likely to find several in a mammalian genome.

For mammalian genomes, oligos need to be about 16 bases or longer to make random matches unlikely. This estimate is pretty rough for several reasons. First of all, genomes are not really random. They include a lot of duplicated genes and other repetitive sequence. If an oligonucleotide resembles one of these repeated sequences, our simple statistical argument doesn't really apply. Similarly, if a genome has an unusual base composition (such as high GC content), the numbers will be different. Finally, the specificity of a PCR product depends on both primers, so even if there are a few unintentional binding sites, they would need to have appropriate spacing and orientation to form unintentional PCR products.
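
The arithmetic in this section can be checked with a short Python sketch. It uses the same simplifying assumptions as the text (random sequence, all four bases equally likely) and, as a further simplification made here, counts positions on one strand only. The function names and the rounded genome sizes are illustrative choices.

```python
# Sketch of the random-match estimate for an oligo of a given length.
def expected_random_matches(oligo_len, genome_bp):
    """Expected number of perfect matches to one oligo in a random genome."""
    return genome_bp * 0.25 ** oligo_len


def min_unique_length(genome_bp):
    """Shortest oligo length for which fewer than one random match is expected."""
    length = 1
    while 4 ** length <= genome_bp:
        length += 1
    return length


if __name__ == "__main__":
    for name, size in (("bacterial, ~1e6 bp", 1_000_000),
                       ("mammalian, ~3e9 bp", 3_000_000_000)):
        print(name)
        print("  expected 12-mer matches:", expected_random_matches(12, size))
        print("  expected 16-mer matches:", expected_random_matches(16, size))
        print("  shortest 'unique' length:", min_unique_length(size))
```

Because each added base divides the expected number of random matches by four, a few extra bases are enough to go from the bacterial scale to the mammalian scale.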

3. Mismatches

So far, we have only considered primers that match the template exactly. But there are times when we need to use primers that do not necessarily match at every base of the template. For example, we may have to use sequences from known organisms to try to guess at primers that will work in a related organism. Our guess might be close, but not perfect. The good news is that primers can often work at sites where they have only near matches, if we use low annealing temperatures (or low stringency conditions). The bad news is that we would expect many more sites that have, say, fifteen out of sixteen matches to the primer than would have all sixteen bases match. This means we are much more likely to amplify unintended products. We can often compensate for this by making the primers a bit longer.
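
To put rough numbers on "many more sites", the Python sketch below counts near matches under the same random-sequence assumption used for template complexity (all four bases equally likely, one strand only). The function names are made up for this illustration, and the model ignores where in the primer the mismatch falls, which matters a great deal for priming in practice.

```python
# Sketch: how much more common near matches are than perfect matches.
from math import comb


def near_match_ratio(primer_len, mismatches):
    """How many times more sites with exactly `mismatches` mismatches are
    expected than perfect-match sites, in random sequence."""
    # Choose which positions differ, then one of 3 other bases at each.
    return comb(primer_len, mismatches) * 3 ** mismatches


def expected_sites(primer_len, mismatches, genome_bp):
    """Expected number of sites with exactly `mismatches` mismatches."""
    return genome_bp * near_match_ratio(primer_len, mismatches) / 4 ** primer_len


if __name__ == "__main__":
    genome = 3_000_000_000  # rough mammalian genome size
    for length in (16, 20):
        for k in (0, 1, 2):
            print(length, k, near_match_ratio(length, k),
                  expected_sites(length, k, genome))
```

In this model a 16-base primer has 48 times as many expected one-mismatch sites as perfect sites, and adding a few bases to the primer brings the expected number of near-match sites back down.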

5. Protocol

In this exercise, you will design your own primer sequences to amplify selected target regions from genomic templates, and see if you obtain products of the predicted length. The simulator is loaded with a variety of genomes to use as templates.

We suggest that you use small genomes, such as bacteria, when you are first learning to design PCR primers. The first time a given primer sequence is used in the system, the simulator needs to search the primer sequence against the entire target genome, and small genomes can be searched much more quickly than large ones. If the same primer is used again, the database remembers where the binding sites for that primer were, so it can skip the search and return results more quickly. If you have done some of the other exercises, such as paternity testing or forensic identification, you have probably noticed that your results are returned in just a few seconds. This is because those exercises use primers that have already been run against the human genome. If you are the first one to run a given primer sequence against the human genome, it will take much longer to get results. If a large class of students all tries PCR with custom primers on large genomes (like the human genome) at the same time, it can put a considerable load on the simulation server, and results will be very slow.

To perform experiments with the PCR simulator, you first need to identify a target region of your desired template. The simulator uses complete genomes for templates, so you should be able to use essentially any gene from the available organisms. For your convenience, we have made a page containing the sequence of the polA gene from E. coli, which encodes the enzyme DNA polymerase I. The page shows various segments of interest in the gene, and reviews some basic concepts of gene structure.

  1. Choose a target segment to amplify from polA. Here are some suggestions (your instructor may assign targets to various lab teams):

  2. Design primers to amplify your target segment using the Cybertory PCR simulator on simulated bacteria. For a first attempt, make each primer 16 bases long. Do you get bands of the expected size? Do you get any additional bands?

  3. Try various annealing temperatures. Do they make a difference?

  4. Can you identify any other reaction conditions that affect the reaction?

  5. Try making your primers shorter (maybe 12 bases each). Now how do various reaction conditions (especially annealing temperature) affect the results?

  6. Try your primers on other bacteria as well.
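
Before typing primers into the simulator, it can help to work out on paper (or in a few lines of code) what the primers and the expected product size should be. The Python sketch below is one way to do that and is not part of the simulator. It assumes the template strand you are given is written 5' to 3', uses 0-based coordinates with an exclusive end, and uses the made-up names design_primers, forward and reverse for the two primers. The template is a made-up toy sequence, not polA.

```python
# Sketch: pick a primer pair flanking a target region on a known strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_primers(strand, start, end, length=16):
    """Return (forward, reverse, product_size) for the region strand[start:end].

    `strand` is written 5'->3'; `start` is 0-based and `end` is exclusive.
    """
    if not 0 <= start < end <= len(strand):
        raise ValueError("target region lies outside the template")
    if end - start < length:
        raise ValueError("target region is shorter than one primer")
    # The forward primer copies the given strand at the start of the target.
    forward = strand[start:start + length].upper()
    # The reverse primer is the reverse complement of the target's last bases,
    # so its 3' end points back toward the forward primer.
    reverse = reverse_complement(strand[end - length:end])
    return forward, reverse, end - start


if __name__ == "__main__":
    toy = ("ATGACCGTTA" "GCTTAGGCAT" "CCGATAAGCT"
           "TGGCACTGAA" "CGTTCAGGTC" "ATTGCCGATG")  # made-up 60-base template
    fwd, rev, size = design_primers(toy, 5, 55)
    print("forward 5'-" + fwd + "-3'")
    print("reverse 5'-" + rev + "-3'")
    print("expected product:", size, "bp")
```

The predicted size is simply the distance from the 5' end of one primer to the 5' end of the other, which is the band you would look for on the simulated gel.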

1. Suggested further steps

Look up the polA gene in the E. coli reference genome from GenBank. This is a large file containing the entire genomic sequence of E. coli, together with annotations describing the locations of the known genes in this organism. Can you identify any of the regions from the polA page in the genome? Look through the annotations and identify other genes of interest. If you have recently taken biochemistry, the genes for enzymes of the tricarboxylic acid cycle may be familiar, for example. Design PCR primers to amplify one of these genes. What is the predicted size of the product? Run these primers in the PCR simulator. Be sure you use appropriate sizes for your molecular weight marker. Do you find a product of the expected size? Are there any unexpected products ('artifacts')? Can you adjust reaction conditions to make the reaction more specific? If not, does it help to modify your primer sequences?

Starting with a pair of primers that give a single product band of the expected size, see what happens if you make changes in one of the primer sequences. These changes should cause mismatches between the primer and the intended binding site. Does changing a single base at the 5' end of a primer have as much effect as changing the base at the 3' end of the primer? Can you adjust reaction conditions to make the mismatched primers work better?

The default primer sequences in the PCR simulator with bacterial templates are 'universal' ribosomal RNA primers, which work on (almost) any species of bacteria. These primers are designed to match highly conserved regions of the rRNA sequences. Can you find a paper describing how pan-species primers like these are designed?

Can you find PCR primer sequences in scientific papers that work in the PCR simulator? Can you find any that do not work in the simulator?

6. Review Questions

The first few questions are based on this figure, which we used earlier to illustrate the arrow notation for DNA strands:

Figure for the first four review questions

  1. What is the sequence of primer 'a'?

  2. What is the sequence of primer 'c'?

  3. Which primer could be used together with primer 'c' to amplify a portion of the DNA sequence?

  4. How many base pairs long would the product be from a PCR amplification using primers 'a' and 'b'?

  5. Design a pair of primers to amplify the entire length of the following 45 base pair sequence. Make each primer 14 bases long. Write the sequences of the primers in 5' to 3' order.

    5'-GATGCCCGTTGGATAAATTGGGCGTCTAGAATCGGTCACACTTAG-3'

  6. Amplification of very long PCR products often requires use of a polymerase with proofreading activity. Can you explain why this might be?



©2009, Robert M. Horton, PhD