This chapter will provide a basic introduction to the human genome and some of the tools used to analyse it. Genomics and molecular biology have developed rapidly during the last few decades, and this chapter will highlight some of these advances, in particular with respect to the impact on our knowledge of the structure and function of the genome. The basic science described in this chapter is fundamental to the understanding of the field of clinical genetics, which is described in the following chapter.
Chromosomes
Inheritance is determined by genes, carried on chromosomes in the nuclei of all cells. Each somatic cell contains 46 chromosomes, which exist as 23 pairs, one member of each pair having been inherited from each parent. Twenty-two pairs are homologous and are called autosomes. The 23rd pair comprises the sex chromosomes: X and Y in the male, and two X chromosomes in the female.
Each cell in the body (except the gametes) contains 22 pairs of autosomes plus the sex chromosomes for a total of 46, known as the diploid number (symbol 2N). The autosomes are numbered approximately in order of size, with the largest first; the X chromosome is comparable in size to chromosome 7, and the Y chromosome is among the smallest. This means that each cell (except gametes) has two copies of each piece of genetic information. In females, where there are two X chromosomes, one copy is inactive, i.e. genes on that chromosome are not transcribed (see later).
Each individual inherits one chromosome of each pair from the mother and one from the father following fertilisation of the haploid egg (containing one of each autosome and one X chromosome) by the haploid sperm (containing one of each autosome and either an X or a Y chromosome). The sex of the individual is therefore dependent on the sex chromosome in the sperm: an X will lead to a female (with the X chromosome from the egg) and a Y chromosome will lead to a male (with an X from the egg).
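The chromosome arithmetic and the rule for sex determination described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `zygote_sex` and `diploid_count` are hypothetical, and the sketch assumes a normal egg and sperm with no chromosomal abnormality.

```python
AUTOSOMES_PER_GAMETE = 22  # a haploid gamete carries one of each autosome


def zygote_sex(egg_sex_chromosome: str, sperm_sex_chromosome: str) -> str:
    """Return the sex that results from the sex chromosomes of the two gametes."""
    if egg_sex_chromosome != "X":
        raise ValueError("a normal egg carries an X chromosome")
    if sperm_sex_chromosome not in ("X", "Y"):
        raise ValueError("a sperm carries either an X or a Y chromosome")
    # The egg always contributes an X, so the sperm decides the outcome.
    return "female" if sperm_sex_chromosome == "X" else "male"


def diploid_count() -> int:
    """Each parent contributes 22 autosomes plus one sex chromosome."""
    haploid = AUTOSOMES_PER_GAMETE + 1
    return 2 * haploid


if __name__ == "__main__":
    for sperm in ("X", "Y"):
        print("X" + sperm, "->", zygote_sex("X", sperm))
    print("diploid number:", diploid_count())
```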
Chromosomes are classified by their shape. During metaphase in cell division, chromosomes are condensed and have a distinct recognisable ‘H’ shape, with two chromatids joined at an area of constriction called the centromere. For ‘metacentric’ chromosomes the centromere is close to the middle of the chromosome, and for ‘acrocentric’ chromosomes it is near to the end of the chromosome. The area or ‘arm’ of the chromosome above the centromere is known as the ‘p arm’, and the area below is the ‘q arm’. For acrocentric chromosomes, the p arm is very small, consisting of tiny structures called ‘satellites’. Within the two arms, regions are numbered from the centromere outwards to give a specific ‘address’ for each chromosome region (Fig. 1.1). The ends of the chromosomes are called telomeres. Chromosomes only take on the characteristic ‘H’ shape during metaphase, when the DNA has been replicated ahead of division (hence the two chromatids).
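As an illustration of how such an ‘address’ is read, the following Python sketch splits a band label into its parts. It assumes the conventional written form of a cytogenetic band (chromosome, arm, region digit, band digit, optional sub-band after a full stop, for example `7q31.2`); that example label and the names `BandAddress` and `parse_band` are illustrative choices, not taken from the text above.

```python
import re
from typing import NamedTuple, Optional


class BandAddress(NamedTuple):
    chromosome: str          # "1" to "22", "X" or "Y"
    arm: str                 # "p" (short arm) or "q" (long arm)
    region: int              # numbered outwards from the centromere
    band: int                # band within the region
    sub_band: Optional[str]  # digits after the full stop, if present


# Assumed label format: chromosome, arm, region digit, band digit, optional ".sub-band".
_BAND_PATTERN = re.compile(r"^([1-9]|1[0-9]|2[0-2]|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_band(address: str) -> BandAddress:
    """Split a band label such as '7q31.2' into its components."""
    match = _BAND_PATTERN.match(address)
    if match is None:
        raise ValueError(f"not a recognised band address: {address!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return BandAddress(chromosome, arm, int(region), int(band), sub_band)


if __name__ == "__main__":
    print(parse_band("7q31.2"))
    print(parse_band("Xp11"))
```

The region and band numbers increase with distance from the centromere, so within one arm a lower region number lies closer to the centromere.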
Chromosomes are recognised by their banding patterns following staining with various compounds in the cytogenetic laboratory. The most commonly used stain is the Giemsa stain (G-banding), which gives a characteristic black and white banding pattern for each chromosome.
In the cell, the chromosomes are folded many hundreds of times around histone proteins and are usually only visible under a microscope during mitosis and meiosis. Each strand of DNA has a backbone of deoxyribose sugars, the 3-position (3′) of each deoxyribose being linked to the 5-position (5′) of the next by a phosphodiester bond. At the 1-position (1′) each deoxyribose is linked to one of four nucleic acid bases: the purines (adenine or guanine) or the pyrimidines (thymine or cytosine). Each DNA molecule is made up of two such strands, running in opposite (antiparallel) directions, in a double helix with the bases on the inside. This is the famous double helix structure that was first proposed by James Watson and Francis Crick in 1953, based upon the X-ray diffraction work of Rosalind Franklin and colleagues. The bases pair by hydrogen bonding, adenine (A) with thymine (T), and cytosine (C) with guanine (G). DNA is replicated by separation of the two strands and synthesis by DNA polymerases of new complementary strands. DNA polymerases always add new bases at the 3′ end of the growing strand. The reverse transcriptase produced by retroviruses is a notable exception in that it copies an RNA template rather than a DNA template, but it too extends the new strand at its 3′ end. RNA has a structure similar to that of DNA but is single stranded. The backbone contains ribose rather than deoxyribose, and uracil (U) is used in place of thymine (Fig. 1.2).
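The base-pairing rules lend themselves to a short worked example. The Python sketch below derives the partner strand of a DNA sequence (A pairs with T, C pairs with G) and shows the T-to-U substitution that distinguishes RNA. The function names are hypothetical, and the sketch assumes sequences are written 5′ to 3′ using only the letters A, C, G and T.

```python
# Watson-Crick pairing: A with T, C with G.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'.

    The two strands run in opposite directions, so the complement is
    reversed to keep the 5' to 3' writing convention.
    """
    return strand.upper().translate(_DNA_COMPLEMENT)[::-1]


def as_rna(dna: str) -> str:
    """Rewrite a DNA sequence with uracil (U) in place of thymine (T)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    strand = "ATGC"
    print(reverse_complement(strand))
    print(as_rna(strand))
```

Applying `reverse_complement` twice returns the original sequence, which is a convenient sanity check on the pairing table.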
Gene Structure and Function
DNA is organised into discrete functional units known as genes. Genes contain the information for the assembly of every protein in an organism via the translation of the DNA code into a chain of amino acids to form proteins. The DNA that encodes a single amino acid consists of three bases, or letters, known as a codon. With four letters and three positions in each ‘word’, there are 4 × 4 × 4 = 64 possible codons, but in fact only 20 amino acids are coded for (Table 1.1). Most amino acids are therefore specified by more than one codon, and the third base of a codon is often not crucial to determining the amino acid, a phenomenon known as wobble.
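1st Position | 2nd Position: T | 2nd Position: C | 2nd Position: A | 2nd Position: G | 3rd Position
---|---|---|---|---|---
T | Phe | Ser | Tyr | Cys | T
T | Phe | Ser | Tyr | Cys | C
T | Leu | Ser | Stop | Stop | A
T | Leu | Ser | Stop | Trp | G
C | Leu | Pro | His | Arg | T
C | Leu | Pro | His | Arg | C
C | Leu | Pro | Gln | Arg | A
C | Leu | Pro | Gln | Arg | G
A | Ile | Thr | Asn | Ser | T
A | Ile | Thr | Asn | Ser | C
A | Ile | Thr | Lys | Arg | A
A | Met | Thr | Lys | Arg | G
G | Val | Ala | Asp | Gly | T
G | Val | Ala | Asp | Gly | C
G | Val | Ala | Glu | Gly | A
G | Val | Ala | Glu | Gly | G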
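The counting argument behind the genetic code can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, `*` for a stop codon, bases in the order T, C, A, G), then counts the codons, the amino acids, and the two-base prefixes for which the third base makes no difference. The variable and function names are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def third_base_is_irrelevant(prefix: str) -> bool:
    """True if all four codons sharing the first two bases encode the same residue."""
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


if __name__ == "__main__":
    amino_acids = set(CODON_TABLE.values()) - {"*"}
    prefixes = ["".join(p) for p in product(BASES, repeat=2)]
    fourfold = [p for p in prefixes if third_base_is_irrelevant(p)]
    print(len(CODON_TABLE), "codons encode", len(amino_acids), "amino acids")
    print("prefixes where the third base does not matter:", fourfold)
```

For half of the sixteen two-base prefixes the third base is fully redundant, which is the redundancy that the text describes as wobble.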
A diagram of a typical gene structure is shown in Fig. 1.3. Each gene gives rise to a messenger RNA (mRNA), which can be interpreted by the cellular machinery to make the protein that the gene encodes.
Genes are split into exons, which contain the coding information, and introns, which lie between the coding regions and may contain regulatory sequences that control when and where a gene is expressed. Promoters (which control basal and inducible activity) are usually upstream of the gene, whereas enhancers (which usually regulate inducible activity only) can be found throughout the genomic sequence of a gene. The two-base-pair sequences at the boundaries of introns and exons (the splice donor and acceptor sites), identical in more than 99% of genes, are known as the splice junctions (see Fig. 1.3); they signal the cellular splicing machinery to cut and paste exonic sequences together at these points. The first residue of each protein is almost always methionine, encoded by the codon ATG.
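The cut-and-paste role of the splice junctions can be illustrated with a small Python sketch. It assumes the commonly cited consensus in which an intron begins with GT (donor site) and ends with AG (acceptor site) on the sense strand; those dinucleotides, the toy sequence, the coordinates and the function name `splice` are assumptions made for illustration and are not given in the text above.

```python
from typing import List, Tuple


def splice(gene: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon slices of a sense-strand sequence, checking each intron.

    Exons are (start, end) positions, 0-based and end-exclusive, listed
    in order. Each intron is the stretch between two consecutive exons.
    """
    gene = gene.upper()
    for (_, intron_start), (intron_end, _) in zip(exons, exons[1:]):
        intron = gene[intron_start:intron_end]
        # Assumed consensus: GT at the donor site, AG at the acceptor site.
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {intron!r} lacks GT...AG splice sites")
    return "".join(gene[start:end] for start, end in exons)


if __name__ == "__main__":
    # Toy gene: exon 1 (ATGGCC), one intron, exon 2 (TTTTAA).
    toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "TTTTAA"
    coding = splice(toy_gene, [(0, 6), (18, 24)])
    print(coding)
    print("starts with ATG:", coding.startswith("ATG"))
```

Real splice site recognition depends on more sequence context than two bases at each end, so the check here is deliberately simplistic.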
Recent estimates based on the genome sequence put the number of genes at fewer than 23,000, a considerable reduction from earlier estimates. This means that the vast majority of human DNA does not contain coding sequence (i.e. exons) but rather consists of intronic sequence, structural motifs and regulatory regions such as promoters and enhancers. This is distinct from lower organisms (e.g. bacteria), where more than 95% of the DNA is coding sequence. Exactly why so much noncoding DNA is present remains somewhat enigmatic, but it is believed to be linked to the complex layers of gene regulation through interacting regulatory regions. The other key implication of this finding is that the huge complexity of humans compared with other organisms with similar numbers of genes must arise from more subtle regulation of gene expression, rather than from greater numbers of different genes.
The Central Dogma of Molecular Biology
The central dogma of molecular biology concerns the information flow pathway in cells and can be simply summarised as: ‘DNA makes RNA makes protein, which in turn can facilitate the two prior steps’. These steps are now explained in more detail.
Transcription
‘Transcription’ is the process by which the information encoded in DNA is transferred into a strand of mRNA. During transcription the RNA polymerase, which constructs the mRNA, reads from the DNA strand complementary to the RNA molecule. This is known as the antisense strand, while the opposite strand, which has the same base sequence as the RNA molecule (with thymine (T) in place of uracil (U), as mentioned previously), is the sense strand. Gene sequences are conventionally written as the sequence of the sense strand of DNA, although it is in fact the antisense strand which is read (Fig. 1.4).
The vast majority of genes have a 5′ untranslated region (UTR) containing response elements to which proteins that influence transcription may bind. The 5′ regions of genes are frequently characterised by elements such as the TATA and CAAT boxes (see Fig. 1.3) and are often richer in GC pairs than elsewhere in the genome. This is frequently the case around the 5′ ends of ‘housekeeping’ genes that are constitutively expressed in the majority of tissues. There then follows the transcribed sequence. As described earlier, the expressed coding parts of the gene are the exons and the intervening sequences are the introns; the coding portion is often interrupted by one or more introns, although numerous examples of single exon genes exist.
Initially, the RNA transcript contains both introns and exons and is known as heterogeneous nuclear RNA (hnRNA). The introns are precisely spliced out (as marked by the splice boundary sequences) and a protective cap is added to the 5′ end before the now mature mRNA exits the nucleus. Hence cytoplasmic mRNA consists only of the coding region flanked by UTRs at the two ends. A polyadenine (poly A) tail is added to most mRNA molecules at their 3′ end, directed by the polyadenylation signal found downstream of the stop codon. This tail, found on the great majority of expressed mRNAs, serves to protect the RNA from degradation prior to translation by the ribosome (see later).
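The relationship between the sense strand, the antisense strand and the mRNA can be made concrete with a short Python sketch. It assumes the sense strand is written 5′ to 3′ and the antisense (template) strand is written 3′ to 5′ so that the two line up base by base. The function names and the tail length of 20 are hypothetical choices for illustration; tails on real mRNAs are considerably longer.

```python
# DNA-DNA pairing, used to derive the antisense strand from the sense strand.
_DNA_PAIR = str.maketrans("ACGT", "TGCA")
# DNA-RNA pairing, used by the polymerase: A in the template pairs with U in the RNA.
_RNA_PAIR = str.maketrans("ACGT", "UGCA")


def antisense_strand(sense_5_to_3: str) -> str:
    """Return the antisense (template) strand, written 3' to 5'."""
    return sense_5_to_3.upper().translate(_DNA_PAIR)


def transcribe(antisense_3_to_5: str) -> str:
    """Return the mRNA (5' to 3') built by pairing against the template."""
    return antisense_3_to_5.upper().translate(_RNA_PAIR)


def add_poly_a_tail(mrna: str, length: int = 20) -> str:
    """Append a poly A tail to the 3' end (length is an illustrative value)."""
    return mrna + "A" * length


if __name__ == "__main__":
    sense = "ATGGCCTAA"
    template = antisense_strand(sense)
    mrna = transcribe(template)
    print("sense:    ", sense)
    print("antisense:", template)
    print("mRNA:     ", mrna)
    print("with tail:", add_poly_a_tail(mrna))
```

The mRNA produced from the antisense strand matches the sense strand apart from U replacing T, which is why gene sequences are quoted as the sense strand.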