Knowledge of the genetic basis of health and disease has increased dramatically in the last 20 years. The first draft of the sequence of the human genome was published in 2001, and the Human Genome Project was completed in 2003.1,2 From this project has come a detailed map of genes and genetic markers and a clearer understanding of how genes function. The intent of this chapter is to provide a basic understanding of gene structure and function and a glossary of terms that should be helpful in incorporating genetic screening and testing into a practice of obstetrics and gynecology. For those interested in a more comprehensive resource on basic molecular genetics, we suggest either Gelehrter et al or Nussbaum et al.3,4
The human nucleus contains approximately 3 billion base pairs of DNA, which include an estimated 25,000 to 30,000 genes. The DNA is tightly wrapped around proteins called histones to form what are termed nucleosomes. Nucleosomes are organized into solenoid structures and looped around a nonhistone protein scaffold to form chromatin, which makes up chromosomes. Each chromosome is composed of densely packed nontranscribed DNA located near the centromere (heterochromatin), and less densely packed and transcribed DNA (euchromatin). Chapter 2 provides a more detailed discussion of the structure of chromosomes.
About three-quarters of the genome is unique, single-copy DNA, and the remaining one-quarter is made up of various forms of repetitive DNA. Less than 10% of the genome encodes genes. Initially, it was thought that the repetitive DNA, and much of the single-copy DNA, had no function. However, recent studies suggest that this “noncoding” DNA may be where the various “switches” are located that control gene function. Although more than 99% of DNA is identical in all humans, the small variations, known as polymorphisms, have been key to understanding the genetic basis of many diseases thought to have a genetic component, such as heart disease, diabetes, and other adult-onset disorders.
In addition to the nuclear genome, each cell contains a mitochondrial genome, approximately 16,500 nucleotides in length. The mitochondrial genome contains 37 genes, which encode 13 essential mitochondrial proteins, 22 transfer RNAs (tRNAs), and 2 ribosomal RNAs (rRNAs). The mitochondrial genome does not encode all of the proteins that make up mitochondria; the remaining proteins are encoded by the nuclear genome. Each mitochondrion usually contains multiple copies of mitochondrial DNA, and each healthy cell contains several hundred mitochondria. If all the mitochondria in a given cell contain the same DNA sequence, this is called homoplasmy, while populations of mitochondria with differing DNA sequences give rise to heteroplasmy.
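The distinction between homoplasmy and heteroplasmy can be illustrated with a minimal sketch. The function name `classify_mtdna` and the idea of representing each mitochondrial DNA copy as a short label are assumptions made for illustration only; they do not correspond to any laboratory method described in this chapter.

```python
from collections import Counter


def classify_mtdna(copies):
    """Classify a cell's mtDNA population (hypothetical helper).

    `copies` holds one sequence (or variant label) per mtDNA copy.
    Returns the classification and the fraction of copies per variant.
    """
    if not copies:
        raise ValueError("need at least one mtDNA copy")
    counts = Counter(copies)
    total = len(copies)
    fractions = {variant: n / total for variant, n in counts.items()}
    # One distinct sequence means homoplasmy; more than one means heteroplasmy.
    label = "homoplasmy" if len(counts) == 1 else "heteroplasmy"
    return label, fractions


# Four identical copies, then a population in which one copy of four differs.
print(classify_mtdna(["A", "A", "A", "A"]))
print(classify_mtdna(["A", "A", "A", "G"]))
```

The per-variant fractions returned for a heteroplasmic cell give the proportion of mtDNA copies that carry each sequence.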
Each gene is a unique series of four purine (adenine, guanine) and pyrimidine (thymine, cytosine) bases. The nucleotides that make up these genes are composed of a base, a phosphate, and a sugar moiety that polymerize into long polynucleotide chains. In the human genome, these polynucleotide chains form the double helix, and range in size from approximately 50 million base pairs (chromosome 21) to 250 million base pairs (chromosome 1, the largest chromosome). Individual genes themselves vary in size from as little as 1000 base pairs up to 2 million base pairs (the dystrophin gene on the X chromosome).
Genes are composed of one or more exons, which are the DNA sequences that are transcribed into messenger RNA (mRNA) that will be translated into a polypeptide at the ribosomal level. In addition to the exons, the gene has introns, or intervening sequences, that are transcribed into RNA, but that RNA is not part of the mature mRNA found in the cytoplasm. The regions upstream and downstream of the coding sequence are called the 5′ untranslated region and the 3′ untranslated region, respectively. These adjacent nucleotide sequences provide the molecular signals for “starting” and “stopping” the synthesis of mRNA. At the 5′ end of the gene is located the promoter region, which has the sequence necessary for the initiation of transcription. Within this region are several other DNA elements that are conserved among many different genes and play key roles in gene regulation. Within the 3′ untranslated end of the gene lies a region of DNA that contains a signal for the addition of a sequence of 100 to 200 adenosine bases (the poly A tail) to the end of the mature mRNA. Within both the 5′ and 3′ untranslated regions are many other regulating elements (enhancers, silencers, locus control regions) that are essential for gene expression and may be sites of mutation that cause genetic diseases by interfering with gene expression. Figure 1-1 illustrates a typical human gene and its associated products.
FIGURE 1-1.
General elements of a typical human gene and its associated products. A six-exon gene is shown with upstream regulatory regions, such as promoters (TATA box, CCAAT box) in grey. Exons are composed of regions that do not encode protein (untranslated regions [UTR], depicted in yellow), and coding regions are shown in blue. The red star depicts a gene mutation that changes the nucleotide sequence from CAC to CGC, which translates into amino acid change from histidine (H) to arginine (R). This is an example of a missense mutation. The splice donor site includes a conserved dinucleotide sequence, GT, at the 5′ end of the intron, while the splice acceptor site at the 3′ end of the intron contains a conserved dinucleotide AG (highlighted in orange). Mutations within conserved splice donor and acceptor sites will cause abnormal splicing, and result in abnormal protein products.
Initiation of the transcription of a gene is under the influence of transcription factors (specific proteins that function to “turn on” genes) that interact with promoters and other regulating elements. Transcription begins in a transcriptional “start site” on chromosomal DNA upstream from the coding DNA. Transcription continues through both exons and introns and past the coding sequences. Synthesis of the mRNA for coding proteins is done by RNA polymerase II, and proceeds from the 5′ to the 3′ end of the RNA. This means the DNA strand of the gene being transcribed is being read in the 3′ to 5′ direction.
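The directionality described above can be made concrete with a short sketch. It assumes the template DNA strand is written 3′ to 5′, so that the mRNA comes out 5′ to 3′ in the same left-to-right order; the names `transcribe` and `COMPLEMENT_RNA` are illustrative, and the six-base sequence is invented for the example.

```python
# Base-pairing rules for copying a DNA template into RNA (U replaces T).
COMPLEMENT_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') for a DNA template strand written 3'->5'."""
    return "".join(COMPLEMENT_RNA[base] for base in template_3_to_5.upper())


template = "TACGTG"          # template strand, written 3'->5'
mrna = transcribe(template)  # synthesized 5'->3'
print(mrna)                  # AUGCAC
```

Because the polymerase reads the template 3′ to 5′, the resulting mRNA has the same sequence as the opposite (coding) DNA strand, with uracil in place of thymine.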
One of the important promoter sequences is known as the “TATA box.” It is a region, conserved in many genes, that is rich in adenine and thymine bases and lies 25 to 30 base pairs upstream of the transcription start site. It appears to be key in determining the position of the start of transcription. A second conserved region (CCAAT) is called the “CAT box.” It is a few dozen base pairs farther upstream than the TATA box and is a key element in the expression of genes that are tissue specific. In the so-called “housekeeping genes,” which are constitutively expressed in most tissues, these elements may be lacking in the promoter region. Rather, these housekeeping genes have promoter regions, rich in cytosines and guanines, that are referred to as CpG islands. These CG-rich sequences are thought to serve as binding sites for specific transcription factors.
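As a rough illustration of how a TATA box might be located relative to a transcription start site, the sketch below scans the upstream sequence for a TATA-like motif and reports how far upstream each match begins. The pattern `TATA[AT]A[AT]` is a commonly quoted consensus and is an assumption here, as are the function name `find_tata_boxes` and the synthetic promoter sequence; real promoters vary considerably.

```python
import re

# Assumed consensus for a TATA-like motif (not taken from this chapter).
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(sequence, tss_index):
    """Return distances (bp upstream of the start site) of TATA-like motifs.

    `tss_index` is the 0-based position of the transcription start site.
    Each distance is measured to the first base of the motif.
    """
    upstream = sequence.upper()[:tss_index]
    return [tss_index - m.start() for m in TATA_PATTERN.finditer(upstream)]


# Synthetic promoter: 10 bp filler, a TATA motif, 23 bp spacer, then the start site.
promoter = "GC" * 5 + "TATAAAA" + "GC" * 11 + "G" + "ACGT"
print(find_tata_boxes(promoter, tss_index=40))  # [30]
```

In this invented sequence the motif begins 30 base pairs upstream of the start site, within the 25 to 30 base pair range described above.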
Once the mRNA has been transcribed, a chemically modified guanine nucleotide (called a cap) is added to the 5′ end to prevent the mRNA from being degraded. Cleavage at a specific point on the 3′ end, downstream of the coding area, occurs, and a poly A tail is added to the 3′ end of the mRNA. This posttranscriptional modification takes place in the nucleus, as does RNA splicing, which removes RNA transcribed from introns. The mature mRNA must have only the transcripts of the exons to be a functional mRNA.
The splicing reactions are guided by specific DNA sequences at both the 5′ (splice donor sites) and 3′ ends (splice acceptor sites) of introns. The 5′ sequence, located immediately adjacent to the splice site, appears invariant among all genes. In similar fashion, there are key elements at the 3′ end. For example, in the β-globin splice reaction, the 3′ sequence consists of approximately 12 nucleotides, of which two are AG nucleotides located immediately 5′ to the intron/exon boundary, and appear essential for normal splicing.
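The role of the conserved boundary dinucleotides can be sketched in code. The example below works on the RNA transcript, so the donor dinucleotide appears as GU (GT in the DNA); the function name `splice`, the coordinate convention, and the short sequences are all assumptions for illustration, and the check is far simpler than the sequence recognition performed by the cell.

```python
def splice(pre_mrna, introns):
    """Remove introns from a pre-mRNA string (illustrative only).

    `introns` is a list of (start, end) pairs, 0-based and end-exclusive.
    Each intron must begin with GU (donor) and end with AG (acceptor).
    """
    pre_mrna = pre_mrna.upper()
    exons = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} lacks GU...AG boundaries")
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)


# Exon 1 (AUGCAC) + intron (GUAAGUCCUAG) + exon 2 (UGGUAA)
pre_mrna = "AUGCAC" + "GUAAGUCCUAG" + "UGGUAA"
print(splice(pre_mrna, [(6, 17)]))  # AUGCACUGGUAA
```

If the donor GU is altered (for example, to GA), the function raises an error, a crude stand-in for the abnormal splicing that results from mutations at these conserved sites.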
The fully processed mRNA is then transported to the cytoplasm, where translation takes place. Because genes may contain more than one promoter or have alternative splice sites, the same gene may encode many different protein products. The concept of “one gene, one protein” is no longer valid.
In the cytoplasm, mRNA is translated into proteins by the action of tRNAs, each of which is specific for a particular amino acid. These tRNA molecules transfer the correct amino acid to its position on the mRNA template, resulting in a polypeptide chain. The key to translation is a code that identifies a specific amino acid. This code is a combination of three adjacent bases along the mRNA, termed a codon. With four bases to create the three-base codon, there are 64 possible triplet combinations, known as the genetic code. Because there are only 20 amino acids, and 64 possible codons, most amino acids are coded by more than one codon. Only methionine and tryptophan are each specified by a single unique codon. Three of the codons are called “stop” codons because they designate termination of translation of the mRNA at that point. Of note, translation is always initiated at a codon specifying methionine (AUG), which is termed the initiation codon. This establishes the reading frame whereby each subsequent codon is read to determine the amino acid sequence. Although methionine is always the first encoded amino acid, it is usually removed before protein synthesis is completed.
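The arithmetic of the genetic code (4 × 4 × 4 = 64 codons for 20 amino acids) and the idea of a reading frame can be checked with a brief sketch. The codon table is the standard genetic code written in one-letter amino acid symbols, with `*` marking stop codons; the names `GENETIC_CODE` and `translate` and the short example transcripts are assumptions for illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, ordered UUU, UUC, UUA, UUG, UCU, ... GGG ("*" = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    mrna = mrna.upper()
    start = mrna.find("AUG")  # the initiation codon sets the reading frame
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = GENETIC_CODE[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


codons_per_residue = Counter(GENETIC_CODE.values())
print(len(GENETIC_CODE))                                          # 64
print(codons_per_residue["*"])                                    # 3
print(sorted(r for r, n in codons_per_residue.items() if n == 1)) # ['M', 'W']
print(translate("GGAUGCACUGGUAA"))                                # MHW
print(translate("GGAUGCGCUGGUAA"))                                # MRW
```

The last two lines reproduce the missense mutation shown in Figure 1-1: changing the codon CAC to CGC replaces histidine (H) with arginine (R) while leaving the rest of the reading frame intact.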