Almost immediately after the structure of DNA was elucidated by Watson and Crick, the mechanism by which genetic information was maintained within a cell and used to create proteins became apparent. This mechanism has become known as the "Central Dogma of Molecular Biology". The Central Dogma has three main parts:
1. Genetic information is preserved and transmitted to new cells and offspring by a duplication process called replication. Replication takes place before mitosis, the normal cell division reviewed above, so that each daughter cell receives a complete copy of the DNA.
2. Genetic information stored in the nucleus is made available to the rest of the cell by the creation of numerous temporary copies known as messenger RNA (mRNA) through a process known as transcription. mRNA is similar to DNA in that it consists of a long, specific sequence of nucleotides. It differs in that it is single-stranded, contains the sugar ribose rather than deoxyribose in its backbone, and uses the base uracil in place of thymine. Transcription is a major step in gene expression, the "turning on" of genes in appropriate cells at appropriate times.
3. In the cytoplasm, ribosomes construct specific proteins by interpreting the sequence of bases in mRNA. This process is known as translation. The genetic code which allows ribosomes to assemble the correct amino acids in the correct order is the subject of the following section.
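The transcription step in the list above can be illustrated with a minimal Python sketch. The function names `transcribe` and `transcribe_template` are hypothetical, chosen only for this example, and the sketch assumes the DNA is given as a string of the letters A, C, G, and T written 5' to 3'. It models only the base substitutions, not the biochemistry.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA that matches a DNA coding (sense) strand.

    The mRNA has the same sequence as the coding strand,
    with uracil (U) in place of thymine (T).
    """
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand: str) -> str:
    """Return the mRNA built from a template strand written 5'->3'.

    Each base is paired with its RNA complement, and the result is
    reversed so that the mRNA is also written 5'->3'.
    """
    pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(template_strand.upper()))


print(transcribe("ATGGCC"))           # AUGGCC
print(transcribe_template("GGCCAT"))  # AUGGCC
```

Both calls describe the same stretch of double-stranded DNA, once from the coding strand and once from its complementary template strand, so they give the same mRNA.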
Since proteins are the structural core of the cell and since proteins (in the form of enzymes) control nearly all of the cell's metabolism, the ability to specify protein structure makes DNA the primary determinant of the structure and function of cells. The Central Dogma is a major organizing principle in molecular biology and the organization of DNA in cells and genes cannot be fully understood except in its context.
Table 1 Universal codon table. One-letter amino acid abbreviations follow names
One of the most important discoveries in biology was the means by which a DNA sequence specifies the sequence of amino acids in a protein. Experiments showed that consecutive groups of three nucleotides, known as codons, determine the particular amino acids that occur sequentially in a polypeptide. The relationship between particular codons and particular amino acids was found to be the same for nearly all living organisms, and this relationship has become known as the genetic code (Table 1). In the table, all codons within a section of the table code for the listed amino acid (e.g. GGU, GGC, GGA, and GGG all code for glycine). Also note that the codons in this table are specified for mRNA (the intermediary between DNA and protein production). To determine the codon as it appears on the sense strand of the original DNA, simply substitute T (thymine) for U (uracil).
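As a rough illustration of how the genetic code is applied, the sketch below builds the standard codon table in Python and uses it to translate a short sequence. The names `CODON_TABLE` and `translate` are hypothetical, one-letter amino acid abbreviations are used, and `*` marks a stop codon. The function assumes the input is already in the correct reading frame and accepts either mRNA (U) or sense-strand DNA (T).

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order (UUU, UUC, UUA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    mrna = sequence.upper().replace("T", "U")  # DNA codons differ only by T/U
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print([CODON_TABLE[c] for c in ("GGU", "GGC", "GGA", "GGG")])  # ['G', 'G', 'G', 'G']
print(translate("AUGGGCUAA"))  # MG
```

The first line of output reproduces the glycine example from the text, and the second shows a start codon (methionine, M), one glycine codon, and a stop codon.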
Fig. 15 Detailed diagram of the organization of the beginning of a gene (strand direction descriptors follow the NCBI convention). Note that "positive" and "negative" strand designations are relative to the p arm of the chromosome, while the "sense" strand designation is relative to the orientation of a particular gene.
A particular strand of DNA could be divided into codons three different ways depending on the starting position from which the nucleotide triples are demarcated. Since genes can be coded for on either of the complementary strands, a double-stranded piece of DNA can thus have a total of six different frames of reference for demarcating codons. Each of these six is called a reading frame.
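The six reading frames described above can be enumerated with a short Python sketch. The helper names `reverse_complement` and `six_frames`, and the frame labels `+1` to `-3`, are assumptions made for this example only. The input is assumed to be one strand written 5' to 3' using A, C, G, and T.

```python
def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict:
    """Split a double-stranded sequence into codons in all six frames.

    Three frames come from the given strand (offsets 0, 1, 2) and
    three from its reverse complement. Incomplete trailing codons
    are dropped.
    """
    frames = {}
    for label, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGAAACCC").items():
    print(name, codons)
```

For the nine-base example, frame `+1` is `ATG AAA CCC`, frame `+2` is `TGA AAC`, and frame `-1` is `GGG TTT CAT`, which shows how the same DNA yields entirely different codons depending on the strand and starting position.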
Nearly all DNA sequences coding for proteins begin with the same codon, ATG, which codes for the amino acid methionine. This codon is also known as the start codon. In Fig. 15, the ATG sequence on the negative strand demarcates the beginning of a coding sequence. (Since the gene in the example is coded on the negative strand, for that gene the negative strand is the sense strand.) The reading frame which contains that methionine is therefore the one correct reading frame (out of the possible six), which codes for the protein. The other five reading frames are essentially random gibberish. A methionine codon will appear at the beginning of nearly every coding sequence, since ATG is the codon normally used to signal the start of protein coding, but not all methionine codons in a sequence are start codons. In addition to its role in the initiation of translation, methionine is also an ordinary amino acid like the other 19, so it can occur anywhere within a protein.
Because of their random nature, the other five possible reading frames derived from a DNA sequence will often contain stop codons that are generated by chance. If those reading frames actually coded for proteins, the stop codons would indicate to the ribosomes that they should stop adding amino acids to the polypeptide. Since it would not make sense biologically for translation to stop after so short a time, the presence of many stop codons can serve as an indication that a particular reading frame does not legitimately code for a protein sequence. In a DNA sequence composed of a random series of nucleotides, stop codons should occur by chance on average about every 21 codons (4^3 = 64 possible combinations of three nucleotides divided by the three possible stop codons). Therefore it is unlikely that any reading frame would continue for a distance of much longer than 21 codons without being interrupted by a stop codon unless it actually coded for a protein. A segment of DNA that contains a reading frame in which a long sequence is not interrupted by any stop codon is called an open reading frame (ORF). Therefore, the first step in searching a new genome sequence for unknown protein-coding genes is usually to identify ORFs. Note: In order for a segment of DNA to be considered an ORF, the stretch of DNA lacking a stop codon must be much longer than what would be likely to occur by chance in a random sequence of nucleotides (that is, many more than 21 codons, which corresponds to hundreds of nucleotides).
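The reasoning above can be sketched in Python. This is a simplified illustration, not a production gene finder: the function names and the `min_codons` threshold of 100 are assumptions chosen for the example, an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and only the three frames of the given strand are scanned (the reverse complement would need to be scanned separately to cover all six).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_codons_between_stops() -> float:
    """Average spacing of stop codons in random DNA: 4^3 codons / 3 stops."""
    return 4 ** 3 / len(STOP_CODONS)


def find_orfs(strand: str, min_codons: int = 100) -> list:
    """Return (start, end) positions of ORFs in the three frames of one strand.

    'start' is the index of the A in ATG and 'end' is one past the last
    base of the stop codon. Only ORFs with at least min_codons codons
    before the stop codon are reported.
    """
    strand = strand.upper()
    orfs = []
    for offset in range(3):
        start = None  # index of the current candidate start codon, if any
        for i in range(offset, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after this stop codon
    return orfs


print(round(expected_codons_between_stops(), 1))  # 21.3

toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))   # [(2, 23)]
print(find_orfs(toy, min_codons=10))  # []
```

The toy sequence contains a six-codon stretch before its stop codon, so it is reported only when the threshold is lowered. With a realistic threshold well above 21 codons, such a short stretch is correctly dismissed as something that could easily arise by chance.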
You should be aware that there are genes that do not code for proteins. Most notably, these are genes that code for non-messenger RNAs (e.g. transfer RNA, or tRNA, and ribosomal RNA, or rRNA). In the case of RNA-coding genes, the RNA transcript is the final structural product and is not subsequently translated. Thus tRNA and rRNA genes do not contain codons, nor does the term reading frame have any relevance to them.
The gene structure described in the previous section is relatively simple. However, eukaryotic genes have a more complicated structure that includes additional components beyond the coding sequence. Transcribed but untranslated mRNA preceding the start codon is necessary for the attachment of the ribosomes that translate the mRNA. Untranslated mRNA is also present after the stop codon. In addition to these flanking regions, untranslated sequences are present within the portion of the gene that contains the coding sequence. These regions, which in some cases may be quite long, are known as introns, and the coding segments are called exons (Fig. 16). Before translation, introns are spliced out of the sequence. The joined exons are then used by the ribosomes in the translation process. The precise purpose of introns is not fully understood, although they allow variant forms of a protein to be produced from a single gene through a process called "alternative splicing". However, it is known that their sequence generally evolves at a much faster rate than that of exons because most mutations in introns have no effect on the amino acid sequence of the protein coded for by the gene.
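The splicing step can be pictured with a small Python sketch. The function name `splice` and the toy sequence are hypothetical, and exon positions are assumed to be known in advance as (start, end) index pairs, where end is one past the last base. Real splicing depends on sequence signals at the intron boundaries, which this sketch does not model.

```python
def splice(pre_mrna: str, exons: list) -> str:
    """Join the exon segments of a transcript, discarding the introns.

    exons is a list of (start, end) index pairs into pre_mrna,
    with end exclusive. Exons are joined in their original order.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, an 11-base intron, then exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUCAG", "GGAUAA"
pre_mrna = exon1 + intron + exon2

print(splice(pre_mrna, [(0, 6), (17, 23)]))  # AUGGCCGGAUAA
```

Passing a different subset of exon coordinates to the same function gives a different mature mRNA from the same transcript, which is the basic idea behind alternative splicing.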
Fig. 16 Production of messenger RNA through transcription and RNA splicing