This page shows the sequence of the DNA polymerase I gene ("polA") from Escherichia coli strain K-12, substrain MG1655. Various sections of the gene and the corresponding protein sequence are indicated.
|DnaA binding site||-132 to -124||(N/A)||The replication initiation protein, DnaA, acts as a transcription factor for polA.|
|'CAAT box' (-35 box) and 'Pribnow box' (-10 box)||-57 to -52, and -36 to -31||(N/A)||These are parts of the RNA polymerase binding site.|
|→ transcription start site||-19||(N/A)||This is the beginning of the region that is copied into mRNA by RNA polymerase.|
|ribosome binding site||-10 to -6||(N/A)||This is the Shine-Dalgarno sequence, which is important for ribosomes to bind to the mRNA and begin translation.|
|cleaved region||1 to 969||1 to 323||This amino-terminal region is removed by proteolytic cleavage to leave the Klenow fragment.|
|5' to 3' exonuclease||19 to 756||7 to 252||This is the enzymatic domain responsible for removing DNA from the template ahead of the enzyme. This is important in vivo for removing the RNA primers from the lagging strand in DNA replication, and in vitro for nick translation.|
|Klenow fragment||970 to 2784||324 to 928||Produced from the DNA polymerase I holoenzyme by proteolytic cleavage, which removes the 'cleaved' region from the amino terminus. This fragment contains both polymerase and proofreading activities, but lacks the 5' to 3' exonuclease.|
|3' to 5' exonuclease||1033 to 1617||345 to 539||This is the "proofreading" domain. If the enzyme makes a mistake, this domain makes it possible to back up, remove the erroneous base, and try again. This leads to much higher fidelity of replication.|
|polymerase||1639 to 2778||547 to 926||This domain is responsible for the actual polymerization of the new DNA strand.|
-300 ATCCTTAAGGAGAAAAATAATTCATATCTATCCACATTAGAAAAAATCCCATTATCTCAA -241 -240 TTATTAGGGATGGATTTATTTTTAACTGCATGAAAAACAAAGACAAACATCATGCTGTAA -181 -180 AAAGCATGATAATAAATTAAAAGCGATGTAAATAATTTATGCACAAAGTTATCCACATGA -121 -120 CGATTTGCGAGCGATCCAGAAGATCTACAAAAGATTTTCACGAAAAGCGGTGAAAAACTC -61 → -60 ATGTTTTCATCCTGTCTGTGGCATCCTTTACCCATAATCTGATAAACAGGCACGGACATT -1 1 ATGGTTCAGATCCCCCAAAATCCACTTATCCTTGTAGATGGTTCATCTTATCTTTATCGC 60 1 MetValGlnIleProGlnAsnProLeuIleLeuValAspGlySerSerTyrLeuTyrArg 20 61 GCATATCACGCGTTTCCCCCGCTGACTAACAGCGCAGGCGAGCCGACCGGTGCGATGTAT 120 21 AlaTyrHisAlaPheProProLeuThrAsnSerAlaGlyGluProThrGlyAlaMetTyr 40 121 GGTGTCCTCAACATGCTGCGCAGTCTGATCATGCAATATAAACCGACGCATGCAGCGGTG 180 41 GlyValLeuAsnMetLeuArgSerLeuIleMetGlnTyrLysProThrHisAlaAlaVal 60 181 GTCTTTGACGCCAAGGGAAAAACCTTTCGTGATGAACTGTTTGAACATTACAAATCACAT 240 61 ValPheAspAlaLysGlyLysThrPheArgAspGluLeuPheGluHisTyrLysSerHis 80 241 CGCCCGCCAATGCCGGACGATCTGCGTGCACAAATCGAACCCTTGCACGCGATGGTTAAA 300 81 ArgProProMetProAspAspLeuArgAlaGlnIleGluProLeuHisAlaMetValLys 100 301 GCGATGGGACTGCCGCTGCTGGCGGTTTCTGGCGTAGAAGCGGACGACGTTATCGGTACT 360 101 AlaMetGlyLeuProLeuLeuAlaValSerGlyValGluAlaAspAspValIleGlyThr 120 361 CTGGCGCGCGAAGCCGAAAAAGCCGGGCGTCCGGTGCTGATCAGCACTGGCGATAAAGAT 420 121 LeuAlaArgGluAlaGluLysAlaGlyArgProValLeuIleSerThrGlyAspLysAsp 140 421 ATGGCGCAGCTGGTGACGCCAAATATTACGCTTATCAATACCATGACGAATACCATCCTC 480 141 MetAlaGlnLeuValThrProAsnIleThrLeuIleAsnThrMetThrAsnThrIleLeu 160 481 GGACCGGAAGAGGTGGTGAATAAGTACGGCGTGCCGCCAGAACTGATCATCGATTTCCTG 540 161 GlyProGluGluValValAsnLysTyrGlyValProProGluLeuIleIleAspPheLeu 180 541 GCGCTGATGGGTGACTCCTCTGATAACATTCCTGGCGTACCGGGCGTCGGTGAAAAAACC 600 181 AlaLeuMetGlyAspSerSerAspAsnIleProGlyValProGlyValGlyGluLysThr 200 601 GCGCAGGCATTGCTGCAAGGTCTTGGCGGACTGGATACGCTGTATGCCGAGCCAGAAAAA 660 201 AlaGlnAlaLeuLeuGlnGlyLeuGlyGlyLeuAspThrLeuTyrAlaGluProGluLys 220 661 ATTGCTGGGTTGAGCTTCCGTGGCGCGAAAACAATGGCAGCGAAGCTCGAGCAAAACAAA 720 221 IleAlaGlyLeuSerPheArgGlyAlaLysThrMetAlaAlaLysLeuGluGlnAsnLys 240 721 
GAAGTTGCTTATCTCTCATACCAGCTGGCGACGATTAAAACCGACGTTGAACTGGAGCTG 780 241 GluValAlaTyrLeuSerTyrGlnLeuAlaThrIleLysThrAspValGluLeuGluLeu 260 781 ACCTGTGAACAACTGGAAGTGCAGCAACCGGCAGCGGAAGAGTTGTTGGGGCTGTTCAAA 840 261 ThrCysGluGlnLeuGluValGlnGlnProAlaAlaGluGluLeuLeuGlyLeuPheLys 280 841 AAGTATGAGTTCAAACGCTGGACTGCTGATGTCGAAGCGGGCAAATGGTTACAGGCCAAA 900 281 LysTyrGluPheLysArgTrpThrAlaAspValGluAlaGlyLysTrpLeuGlnAlaLys 300 901 GGGGCAAAACCAGCCGCGAAGCCACAGGAAACCAGTGTTGCAGACGAAGCACCAGAAGTG 960 301 GlyAlaLysProAlaAlaLysProGlnGluThrSerValAlaAspGluAlaProGluVal 320 961 ACGGCAACGGTGATTTCTTATGACAACTACGTCACCATCCTTGATGAAGAAACACTGAAA 1020 321 ThrAlaThrValIleSerTyrAspAsnTyrValThrIleLeuAspGluGluThrLeuLys 340 1021 GCGTGGATTGCGAAGCTGGAAAAAGCGCCGGTATTTGCATTTGATACCGAAACCGACAGC 1080 341 AlaTrpIleAlaLysLeuGluLysAlaProValPheAlaPheAspThrGluThrAspSer 360 1081 CTTGATAACATCTCTGCTAACCTGGTCGGGCTTTCTTTTGCTATCGAGCCAGGCGTAGCG 1140 361 LeuAspAsnIleSerAlaAsnLeuValGlyLeuSerPheAlaIleGluProGlyValAla 380 1141 GCATATATTCCGGTTGCTCATGATTATCTTGATGCGCCCGATCAAATCTCTCGCGAGCGT 1200 381 AlaTyrIleProValAlaHisAspTyrLeuAspAlaProAspGlnIleSerArgGluArg 400 1201 GCACTCGAGTTGCTAAAACCGCTGCTGGAAGATGAAAAGGCGCTGAAGGTCGGGCAAAAC 1260 401 AlaLeuGluLeuLeuLysProLeuLeuGluAspGluLysAlaLeuLysValGlyGlnAsn 420 1261 CTGAAATACGATCGCGGTATTCTGGCGAACTACGGCATTGAACTGCGTGGGATTGCGTTT 1320 421 LeuLysTyrAspArgGlyIleLeuAlaAsnTyrGlyIleGluLeuArgGlyIleAlaPhe 440 1321 GATACCATGCTGGAGTCCTACATTCTCAATAGCGTTGCCGGGCGTCACGATATGGACAGC 1380 441 AspThrMetLeuGluSerTyrIleLeuAsnSerValAlaGlyArgHisAspMetAspSer 460 1381 CTCGCGGAACGTTGGTTGAAGCACAAAACCATCACTTTTGAAGAGATTGCTGGTAAAGGC 1440 461 LeuAlaGluArgTrpLeuLysHisLysThrIleThrPheGluGluIleAlaGlyLysGly 480 1441 AAAAATCAACTGACCTTTAACCAGATTGCCCTCGAAGAAGCCGGACGTTACGCCGCCGAA 1500 481 LysAsnGlnLeuThrPheAsnGlnIleAlaLeuGluGluAlaGlyArgTyrAlaAlaGlu 500 1501 GATGCAGATGTCACCTTGCAGTTGCATCTGAAAATGTGGCCGGATCTGCAAAAACACAAA 1560 501 AspAlaAspValThrLeuGlnLeuHisLeuLysMetTrpProAspLeuGlnLysHisLys 520 1561 
GGGCCGTTGAACGTCTTCGAGAATATCGAAATGCCGCTGGTGCCGGTGCTTTCACGCATT 1620 521 GlyProLeuAsnValPheGluAsnIleGluMetProLeuValProValLeuSerArgIle 540 1621 GAACGTAACGGTGTGAAGATCGATCCGAAAGTGCTGCACAATCATTCTGAAGAGCTCACC 1680 541 GluArgAsnGlyValLysIleAspProLysValLeuHisAsnHisSerGluGluLeuThr 560 1681 CTTCGTCTGGCTGAGCTGGAAAAGAAAGCGCATGAAATTGCAGGTGAGGAATTTAACCTT 1740 561 LeuArgLeuAlaGluLeuGluLysLysAlaHisGluIleAlaGlyGluGluPheAsnLeu 580 1741 TCTTCCACCAAGCAGTTACAAACCATTCTCTTTGAAAAACAGGGCATTAAACCGCTGAAG 1800 581 SerSerThrLysGlnLeuGlnThrIleLeuPheGluLysGlnGlyIleLysProLeuLys 600 1801 AAAACGCCGGGTGGCGCGCCGTCAACGTCGGAAGAGGTACTGGAAGAACTGGCGCTGGAC 1860 601 LysThrProGlyGlyAlaProSerThrSerGluGluValLeuGluGluLeuAlaLeuAsp 620 1861 TATCCGTTGCCAAAAGTGATTCTGGAGTATCGTGGTCTGGCGAAGCTGAAATCGACCTAC 1920 621 TyrProLeuProLysValIleLeuGluTyrArgGlyLeuAlaLysLeuLysSerThrTyr 640 1921 ACCGACAAGCTGCCGCTGATGATCAACCCGAAAACCGGGCGTGTGCATACCTCTTATCAC 1980 641 ThrAspLysLeuProLeuMetIleAsnProLysThrGlyArgValHisThrSerTyrHis 660 1981 CAGGCAGTAACTGCAACGGGACGTTTATCGTCAACCGATCCTAACCTGCAAAACATTCCG 2040 661 GlnAlaValThrAlaThrGlyArgLeuSerSerThrAspProAsnLeuGlnAsnIlePro 680 2041 GTGCGTAACGAAGAAGGTCGTCGTATCCGCCAGGCGTTTATTGCGCCAGAGGATTATGTG 2100 681 ValArgAsnGluGluGlyArgArgIleArgGlnAlaPheIleAlaProGluAspTyrVal 700 2101 ATTGTCTCAGCGGACTACTCGCAGATTGAACTGCGCATTATGGCGCATCTTTCGCGTGAC 2160 701 IleValSerAlaAspTyrSerGlnIleGluLeuArgIleMetAlaHisLeuSerArgAsp 720 2161 AAAGGCTTGCTGACCGCATTCGCGGAAGGAAAAGATATCCACCGGGCAACGGCGGCAGAA 2220 721 LysGlyLeuLeuThrAlaPheAlaGluGlyLysAspIleHisArgAlaThrAlaAlaGlu 740 2221 GTGTTTGGTTTGCCACTGGAAACCGTCACCAGCGAGCAACGCCGTAGCGCGAAAGCGATC 2280 741 ValPheGlyLeuProLeuGluThrValThrSerGluGlnArgArgSerAlaLysAlaIle 760 2281 AACTTTGGTCTGATTTATGGCATGAGTGCTTTCGGTCTGGCGCGGCAATTGAACATTCCA 2340 761 AsnPheGlyLeuIleTyrGlyMetSerAlaPheGlyLeuAlaArgGlnLeuAsnIlePro 780 2341 CGTAAAGAAGCGCAGAAGTACATGGACCTTTACTTCGAACGCTACCCTGGCGTGCTGGAG 2400 781 ArgLysGluAlaGlnLysTyrMetAspLeuTyrPheGluArgTyrProGlyValLeuGlu 800 2401 
TATATGGAACGCACCCGTGCTCAGGCGAAAGAGCAGGGCTACGTTGAAACGCTGGACGGA 2460 801 TyrMetGluArgThrArgAlaGlnAlaLysGluGlnGlyTyrValGluThrLeuAspGly 820 2461 CGCCGTCTGTATCTGCCGGATATCAAATCCAGCAATGGTGCTCGTCGTGCAGCGGCTGAA 2520 821 ArgArgLeuTyrLeuProAspIleLysSerSerAsnGlyAlaArgArgAlaAlaAlaGlu 840 2521 CGTGCAGCCATTAACGCGCCAATGCAGGGAACCGCCGCCGACATTATCAAACGGGCGATG 2580 841 ArgAlaAlaIleAsnAlaProMetGlnGlyThrAlaAlaAspIleIleLysArgAlaMet 860 2581 ATTGCCGTTGATGCGTGGTTACAGGCTGAGCAACCGCGTGTACGTATGATCATGCAGGTA 2640 861 IleAlaValAspAlaTrpLeuGlnAlaGluGlnProArgValArgMetIleMetGlnVal 880 2641 CACGATGAACTGGTATTTGAAGTTCATAAAGATGATGTTGATGCCGTCGCGAAGCAGATT 2700 881 HisAspGluLeuValPheGluValHisLysAspAspValAspAlaValAlaLysGlnIle 900 2701 CATCAACTGATGGAAAACTGTACCCGTCTGGATGTGCCGTTGCTGGTGGAAGTGGGGAGT 2760 901 HisGlnLeuMetGluAsnCysThrArgLeuAspValProLeuLeuValGluValGlySer 920 2761 GGCGAAAACTGGGATCAGGCGCACTAAGATTCGCCTGAACATGCCTTTTTTCGTAAGTAA 2820 921 GlyGluAsnTrpAspGlnAlaHisEnd 928
DNA sequence is shown in black on a white background, and the corresponding protein is shown using various text and background colors. Each is numbered according to where protein translation starts.
The area before the protein coding region contains the promoter, which affects how much mRNA will be transcribed from this gene by RNA polymerase under various circumstances. Several interesting parts of the promoter sequence are marked.
The binding site for RNA polymerase contains two segments that are relatively conserved among genes. These are called the CAAT box and the Pribnow box, or the -35 and -10 boxes, based on their normal locations relative to the transcription start site. (Remember that our numbering starts from the translation start site. You may notice, though, that the -35 box is not quite at -35 bases in this gene.)
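Converting between the two numbering schemes can be sketched in Python. This is only an illustration: the function names are invented for this example, the coordinates are taken at face value from the table at the top of the page (transcription start site at -19), and it assumes that neither scheme has a position zero, so the transcription start site itself becomes +1.

```python
def to_offset(pos):
    """Map a no-zero position (..., -2, -1, 1, 2, ...) onto a continuous integer line."""
    if pos == 0:
        raise ValueError("there is no position 0 in this numbering scheme")
    return pos if pos < 0 else pos - 1


def from_offset(offset):
    """Inverse of to_offset: map a continuous offset back to a no-zero position."""
    return offset if offset < 0 else offset + 1


def renumber(pos, new_origin):
    """Express pos relative to new_origin, which itself becomes position +1."""
    return from_offset(to_offset(pos) - to_offset(new_origin))


# Transcription start site, numbered from the translation start (value from the table).
TSS = -19

# Promoter boxes, numbered from the translation start (values from the table).
boxes = {"-35 box": (-57, -52), "-10 box": (-36, -31)}

for name, (start, end) in boxes.items():
    print(name, renumber(start, TSS), "to", renumber(end, TSS))
```

With the table's coordinates, the -35 box comes out at -38 to -33 and the Pribnow box at -17 to -12 relative to the transcription start site. The same function gives +20 for the 'A' of the start codon.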
The polA promoter also contains a binding site (possibly two) for a protein called DnaA, an initiator protein for DNA replication in E. coli. The main function of DnaA is to help unwind DNA at the oriC origin of replication during DNA replication. Here it also acts as a transcription factor, to increase the amount of DNA polymerase I produced when the cell is replicating.
The start site of transcription is shown with an arrow. This is where RNA polymerase begins to copy the gene into mRNA. A short region at the beginning of the mRNA does not code for protein; this is called the 5' untranslated region (5' UTR). One of the most important features of the 5' UTR is the ribosome binding site, which controls where translation starts in the mRNA sequence. It is complementary to the UCCU core sequence at the 3' end of the 16S rRNA in the 30S ribosomal subunit.
The first amino acid is numbered 1. This amino acid is special, since it starts a new chain, rather than being added to an existing chain. In prokaryotes, the initial amino acid is a modified methionine (N-formyl methionine, or fMet). The tRNA for fMet usually matches an 'ATG' codon, just like normal methionine; this is the case here. In some cases, bacterial genes use different codons for the fMet (like 'GTG'). The first base of this first codon (here an 'A') is numbered 1 in the DNA sequence. Nucleotides before the coding region have negative numbers. Note that there is no zero in this counting scheme.
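Because base 1 is the first base of codon 1, amino acid numbers and nucleotide numbers are related by simple arithmetic. The following Python sketch (the helper names are made up for illustration) shows the relationship and checks it against several rows of the table at the top of the page.

```python
def codon_to_nt(codon):
    """Return the (first, last) nucleotide positions of a codon; codon 1 covers bases 1 to 3."""
    if codon < 1:
        raise ValueError("codon numbers start at 1")
    return 3 * codon - 2, 3 * codon


def nt_to_codon(nt):
    """Return the number of the codon that contains a coding-region nucleotide."""
    if nt < 1:
        raise ValueError("coding-region nucleotides start at 1")
    return (nt - 1) // 3 + 1


def protein_range_to_dna(first_aa, last_aa):
    """Convert an amino acid range into the nucleotide range that encodes it."""
    return codon_to_nt(first_aa)[0], codon_to_nt(last_aa)[1]


# Rows copied from the table: name -> (DNA range, protein range)
regions = {
    "cleaved region": ((1, 969), (1, 323)),
    "5' to 3' exonuclease": ((19, 756), (7, 252)),
    "3' to 5' exonuclease": ((1033, 1617), (345, 539)),
    "polymerase": ((1639, 2778), (547, 926)),
}

for name, (dna, protein) in regions.items():
    print(name, protein_range_to_dna(*protein) == dna)
```

Each of these rows prints True, so the DNA and protein coordinates in those rows agree with each other.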
This protein actually has three distinct enzymatic activities. It is, of course, a DNA polymerase, that is, an enzyme that catalyzes polymerization of dNTPs into DNA. More specifically, this is a DNA-dependent DNA polymerase, since it constructs DNA from a DNA template. In addition, it has two different exonuclease activities. A nuclease is an enzyme that degrades nucleic acids. An exonuclease eats away from one end of a DNA strand (as opposed to an endonuclease, which attacks DNA strands in the middle). Since DNA has two distinct ends (a 5' end and a 3' end), it may not be surprising that some exonucleases eat DNA strands from one end, while others eat it from the other end. DNA polymerase I has both kinds of exonuclease activity, and each is carried out by a separate part of the protein.
A 5' to 3' exonuclease eats DNA strands starting at the 5' end. When DNA polymerase is performing primer extension along a DNA template, the 5' to 3' exonuclease serves to remove any DNA strands it might run into that are bound to the template ahead of the strand it is making.
A 3' to 5' exonuclease activity allows "proofreading"; it can chop off the end of a DNA strand that has just been made (that is, the 3' end). Now, if it always chopped off the bases that had just been added, we would have a lot of trouble ever making progress on our new strand. But we do make progress, because the 3' to 5' exonuclease activity is very slow compared to polymerization. It only really makes much difference when the polymerase accidentally incorporates an incorrect base. This causes a mismatch at the end of the strand being extended, and slows down the polymerase enough that the proofreading activity has a chance to chop off the 3' end. The net result is that errors get corrected. In fact, DNA polymerases without proofreading activity tend to make a lot more mistakes than those that have proofreading. This means they introduce mutations into the product strands at a much higher frequency.
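The difference between the two exonuclease directions can be pictured with a toy Python model, in which a strand is just a string written 5' to 3'. This is a deliberately simplified sketch: the function names are invented, it ignores reaction rates entirely, and the proofreading step only checks the base at the 3' end against a known correct strand.

```python
def exo_5_to_3(strand, n):
    """Remove n bases from the 5' end of a strand written 5' to 3'."""
    return strand[n:]


def exo_3_to_5(strand, n):
    """Remove n bases from the 3' end of a strand written 5' to 3'."""
    return strand[:max(0, len(strand) - n)]


def proofread(new_strand, correct_strand):
    """Trim the 3' end of new_strand until its last base matches correct_strand.

    Assumes new_strand is no longer than correct_strand.
    """
    while new_strand and new_strand[-1] != correct_strand[len(new_strand) - 1]:
        new_strand = exo_3_to_5(new_strand, 1)
    return new_strand


correct = "ATGGTTCAG"        # what an error-free polymerase would have made
with_error = "ATGGTA"        # the last base added (A) should have been T

print(exo_5_to_3("ATGGTT", 2))          # eats from the 5' end
print(exo_3_to_5("ATGGTT", 2))          # eats from the 3' end
print(proofread(with_error, correct))   # the mismatched 3' base is removed
```

After proofreading, the strand ends at the last correctly paired base, and polymerization can continue from there.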
Historically, this protein was purified and well characterized before its gene was cloned or sequenced. It was discovered that partial digestion with specific proteolytic enzymes cut it neatly into two parts. The larger of these parts, the Klenow fragment, retained the polymerase and proofreading activities, and proved very useful for a variety of biochemistry applications, including the earliest experiments in PCR. Since that time other enzymes, such as Taq polymerase, have taken over most of the applications of E. coli DNA polymerase I and the Klenow fragment. It is interesting to note that Taq polymerase, the DNA polymerase I from Thermus aquaticus, has a different pair of activities: it retains the polymerase and the 5' to 3' exonuclease, but lacks the 3' to 5' proofreading exonuclease.
Now that we have the complete sequence of this gene, as well as many similar genes from other organisms, we can identify "conserved domains" in the sequence. There are three such domains, each with an associated enzymatic activity. The enzyme can be described as a fusion of a 5' to 3' DNA polymerase, a 3' to 5' exonuclease, and a 5' to 3' exonuclease. The various regions of the enzyme are given in the entry for polA in the Entrez Gene database.
The termination codon (here 'TAA') is marked with the word 'End' in red on the amino acid sequence. 'End' is not really an amino acid; it simply marks where translation stops.
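As a check on the translation, the two ends of the coding region can be translated with a short Python sketch. The code below builds the standard genetic code from a compact 64-letter string and uses one-letter amino acid codes rather than the three-letter codes shown on this page; '*' plays the role of 'End'. The function and variable names are made up for this example, and the two DNA strings are copied from the first and last rows of the coding sequence above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codon by codon, stopping after the first stop codon ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


# Bases 1 to 60 and bases 2761 to 2820, as listed in the sequence above.
first_60 = "ATGGTTCAGATCCCCCAAAATCCACTTATCCTTGTAGATGGTTCATCTTATCTTTATCGC"
last_60 = "GGCGAAAACTGGGATCAGGCGCACTAAGATTCGCCTGAACATGCCTTTTTTCGTAAGTAA"

print(translate(first_60))  # amino acids 1 to 20
print(translate(last_60))   # amino acids 921 to 928, then the stop
```

The first call gives MVQIPQNPLILVDGSSYLYR, matching MetValGlnIlePro... above, and the second gives GENWDQAH followed by '*', so translation stops at the TAA codon right after histidine 928.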