Postgraduate medical education has changed enormously in the last 10 years, presenting huge logistical challenges for local, regional and national organisations. Assessment is changing in line with major revisions of postgraduate curricula: older methods are giving way to newer evidence-based methods supported by ongoing research into good practice. This review examines the purpose and practical considerations of written assessment, the pros and cons of different assessment methods, and how good practice can be evaluated and quality assured. Good-quality assessment comes at a cost in terms of time and money, and organisations need to invest in their assessment strategies to ensure the highest possible standards.
Introduction
Clinical medicine has long been taught and assessed using traditional methods with little or no evidence to back up their use. The last 10–15 years have seen a revolution in all aspects of medical education, including assessment. Most of this has been driven at the undergraduate level in forward-looking medical schools, but the world of postgraduate education is changing fast and “this is the way we have always done it” is being replaced by evidence-based practice. Written assessment is embedded in all postgraduate curricula. Traditional ‘True’/‘False’ multiple-choice questions (MCQs) and essays are slowly being phased out because of their many intrinsic failings, and newer assessment methods, particularly Single Best Answer (SBA) questions, are beginning to predominate. This review examines the principles and practice of written assessment and the pros and cons of the different question formats. Objective structured clinical examinations (OSCEs) and workplace-based assessments (WBAs) are discussed elsewhere in this publication.
The purpose of assessment
There are many reasons for assessing junior doctors. With respect to written assessments, it is to ensure they have adequate knowledge for their stage of development but, more importantly, that they are able to ‘apply’ that knowledge in a range of clinical situations. Written assessments also assure the general public that their doctors possess the required knowledge to make them safe practitioners. Assessments also provide feedback to candidates on their level of understanding, their areas of strength and their areas of weakness and feedback to trainers on their effectiveness as trainers and on the effectiveness of their training programmes. Like all assessments, they motivate candidates to study and to learn. Consequently, it is important to perfect the written assessment so that it leads to the right sort of learning – learning in context, learning based on deep understanding and not simply surface learning based on memorisation of trivial facts. Doctors who cannot apply their knowledge are unlikely to be good at their jobs.
Learning outcomes and assessment
If asked “What should you assess?”, there might be a variety of responses: “what you need to do the job”, “what we have always assessed”, “what is easy to assess” and “what the teachers feel is important”. While the first answer is a reasonable response, it is not specific enough. A fuller answer might be: “the common and important learning outcomes expressly outlined in the curriculum, designed to ensure trainees learn what they need to do the job.” All assessment starts with the learning outcomes for the training programme and throughout the design and writing of the assessment, they should underpin the process. This will ensure that only those things are assessed that are deemed ‘assessment-worthy’ and both the assessors and assessed know what subject areas are going to be tested. It will also avoid the eternal problem of the ‘hidden curriculum’ where assessors individually decide what ‘they’ feel is important, which actually forms no part of the course.
Blueprinting the assessment
The aims of the assessment will broadly be to ensure adequate and appropriate coverage of multiple aspects of the course, each element tested proportionate to its importance in clinical practice. This implies that common and core topics should be assessed more than rare ones. The way to ensure appropriate sampling is to blueprint the assessment in line with the stated learning outcomes of the training programme. Blueprints come in many different forms but they all have an easy visual format to check that curricular coverage is adequate in both subject areas and domains of interest (see Fig. 1 for an example).
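As a purely illustrative sketch (the topics and question numbers below are hypothetical; the worked example referred to in the text is Fig. 1), a blueprint might cross-tabulate curriculum topics against the domains to be tested, with each cell recording the number of questions allocated:

Curriculum topic | Basic science | Diagnosis | Management | Total |
---|---|---|---|---|
Antenatal care | 2 | 3 | 3 | 8 |
Contraception | 1 | 2 | 2 | 5 |
Early pregnancy problems | 1 | 2 | 2 | 5 |

A grid of this kind makes gaps and over-sampling immediately visible, and the row totals can be checked against the relative importance of each topic in clinical practice.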
Assessment criteria and marking schemes
These two concepts are sometimes confused or used interchangeably. Assessment criteria specify exactly what content is to be assessed, and the marking scheme allocates the relative importance or weighting for each criterion in the overall final mark.
Consider the following assessment task: “Produce a scientific poster presentation of a clinical audit of your choice. It will be marked out of 100.” Now compare the two assessment criteria and marking schemes below:
Assessment criteria 1 | Marking scheme 1 | Assessment criteria 2 | Marking scheme 2 |
---|---|---|---|
Title appropriate to topic | 10 marks | Overall quality of the poster content and visual presentation | 100 marks |
Introduction and aims | 10 marks | | |
Methods | 15 marks | | |
Results | 15 marks | | |
Discussion | 15 marks | | |
References (Vancouver) | 10 marks | | |
Visual impression | 25 marks | | |
Which of the two is most likely to result in better poster presentations? Which of the two is likely to be marked more consistently and less open to inter-observer variation? Assessment criteria and marking schemes provide guidance to trainees and consistency to marking.
Validity
A valid assessment is one which measures what it is intended to measure. For example, a written test of a clinical skill is not likely to be valid as written tests do not per se measure clinical ability, whereas an OSCE or WBA is much more fit for this purpose.
There are different types of validity.
‘Content validity’ is whether the assessment truly tests a domain or domains of interest. For example, if contraceptive knowledge is being thoroughly tested in a written examination, then one question on the progesterone-only pill simply tests knowledge of the progesterone-only pill. It does not test knowledge on any other contraceptive method. It is therefore necessary to include questions on several other areas of contraception. Blueprinting as described above helps to ensure adequate coverage of the curriculum and also that the assessment is aligned with the learning outcomes.
‘Criterion validity’ refers to whether your assessment predicts future performance. Does a good performance in the Membership Examination of the Royal College of Obstetricians and Gynaecologists (MRCOG) predict good clinical performance thereafter? A well-designed assessment ‘should’ predict future performance. It therefore needs to test knowledge that is required and used later in a doctor’s career.
‘Face validity’ is the degree to which the assessment is respected or valued by trainees and assessors. ‘True’/‘false’ multiple choice questions testing recall of trivial facts have low face validity; SBA questions, written in a clinical context and testing important and common areas of practice, have a much higher face validity.
‘Consequential validity’ (educational impact) relates to the extent to which the assessment process affects how candidates prepare for it. In an ideal world, an assessment process should encourage regular, continuous and deep learning as opposed to “cramming” or surface learning simply to “pass an examination”. The reality is that assessment has a major impact on student behaviour and learning. This can be advantageous, however. If you want trainees to learn a particular topic in a particular way, then assess it in such a way that the trainees ‘have’ to learn it that way. If you want to assess critical thinking, create an assessment that specifically tests just that. Trainees will soon get the message.
Reliability
Reliability is a measure of the reproducibility of any assessment: how likely is the result to be the same if the assessment were run a second time? It can be measured numerically and is most often calculated using Cronbach’s alpha, which ranges from 0 to 1, where 0 is totally unreliable and 1 is 100% reliable. Reliability inevitably increases as sampling time increases, because longer periods of assessment are less prone to individual variations in performance, sampling errors and examiner bias. We all have areas where we perform better or about which we have more knowledge; therefore, depending upon which questions are drawn from the whole question bank (any assessment is only a sample of the whole), performance could vary widely because of this sampling issue. Short answer questions (SAQs) and essays are more prone to this variability in performance because sampling is considerably reduced, especially in essays. In addition, there is the phenomenon of examiner bias – the so-called ‘hawk–dove’ effect – where some examiners consistently mark more harshly than others (and some more leniently). This effect is exacerbated, and becomes more noticeable, when the number of questions is small. The more questions and the more markers involved, the more reliable the assessment becomes.
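For readers who wish to see the calculation, the conventional formula for Cronbach’s alpha for a paper of $k$ questions is:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{T}^{2}}\right),$$

where $\sigma_{i}^{2}$ is the variance of candidates’ scores on question $i$ and $\sigma_{T}^{2}$ is the variance of their total scores. Alpha rises when performance on individual questions is consistent with total performance, which is why adding further questions that sample the same construct pushes the value towards 1.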
Testing time
There is always a logistical and resource issue with testing time, so how much time is enough to make a written assessment reliable? Reliability increases with assessment time, but the size of the gain depends upon the question format. As a general rule, testing beyond 3 h is unlikely to be worthwhile for multiple-choice methods, as there is only a small increase in reliability for considerable extra use of resources.
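This trade-off can be illustrated with the Spearman–Brown prophecy formula, a standard psychometric result quoted here for illustration rather than taken from any particular examination. If a paper of a given length has reliability $\rho$, lengthening it by a factor of $n$ predicts a reliability of

$$\rho_{n} = \frac{n\rho}{1 + (n-1)\rho}.$$

For a hypothetical 90-minute multiple-choice paper with $\rho = 0.80$, doubling the paper to 3 h predicts $\rho_{2} \approx 0.89$, but doubling it again to 6 h raises the prediction only to about $0.94$ – the diminishing return that makes testing much beyond 3 h hard to justify.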
Types of written questions
There are many types of written assessment. They have been designed to test different things in different ways and the choice will be dependent upon what you wish to test (e.g., basic scientific knowledge vs. knowledge of medical ethics) and the resources available (most multiple-choice formats can be electronically marked vs. manual marking for SAQs).
True/False MCQs
These questions require the candidate to mark each option following the stem as either true or false. An example of a true/false question is:
Example 1:
The following is/are branches of the internal iliac artery:
- (a)
Inferior epigastric artery
- (b)
Middle rectal artery
- (c)
Ovarian artery
- (d)
Uterine artery
- (e)
Vaginal artery
Answers (b), (d) and (e) are unequivocally true whereas (a) and (c) are unequivocally false. You will note that this question tests factual scientific knowledge as opposed to clinical application of knowledge. Many answers to questions in the basic sciences may indeed be absolutely true or false. Compare this question to:
Example 2:
The following is/are routine tests at the booking visit in antenatal care:
- (a)
Hepatitis B
- (b)
HIV
- (c)
Random glucose
- (d)
Rubella
- (e)
Toxoplasmosis
Immediate problems start to arise: (a), (b) and (d) are true as long as you are thinking of the UK (it is not specified), and (c) is highly dependent on where you work even within the UK: it is therefore neither completely true nor completely false. The option (e) is true in other European countries but not the UK, which again leads to ambiguity. The candidate has to ‘guess’ what the examiner has in mind because the stem is not clear. If you are assessing clinical application of knowledge, then very few things in clinical medicine are absolutely true or false, and this format leaves little or no room for uncertainty. This is the principal reason why this longstanding format has largely fallen out of favour in medical education.
SBA (best of five) MCQs
The SBA format does not require a ‘True’/‘False’ answer to every option; instead, the candidate selects a single answer from a list of five options relating to a stem. The other ‘incorrect’ options are called distractors. An example would be:
Example 3:
An 18-year-old attends gynaecological outpatients with repeated lower abdominal pain. The pattern is predictable on day 14 of a 28-day cycle. She has non-painful periods, pain at no other time and has never been sexually active. She has no urinary or bowel symptoms and has no significant medical or surgical history.
What is the single most likely diagnosis?
- (a)
Adenomyosis
- (b)
Endometriosis
- (c)
Ovulatory pain
- (d)
Pelvic adhesions
- (e)
Pelvic congestion
The option (c) is clearly the most likely diagnosis by a long way. Though (b) is possible, it is very unlikely given the history, and the candidate has been asked to select the most likely. In this format, all options can be ‘correct’ as long as one of them is clearly the most likely; all the distractors need to be at least plausible alternatives to the single best answer. The SBA format allows for the uncertainty of clinical medicine and is an excellent way of testing clinical application of knowledge as opposed to factual recall of isolated facts.
Question flaws
Question flaws generally relate either to clues for the test-wise candidate or to irrelevant difficulty.
Examples of question flaws relating to the test-wise candidate are:
- (1)
Grammatical clues and word repeats: the stem points to the correct answer in the way it is constructed or rules out incongruous distractors;
- (2)
Use of absolute terms in the options (things described as always or never are almost always wrong!);
- (3)
A very specific or long answer that stands out compared with the other options; and
- (4)
Convergence strategy, where the correct answer has most in common with the other options and therefore becomes more likely.
Examples of question flaws relating to irrelevant difficulty are:
- (1)
The use of vague terms in the options (things that ‘may’ or ‘could’ happen are much more likely to be true);
- (2)
The use of words such as ‘often’, ‘usually’, ‘sometimes’, ‘rarely’, ‘occasionally’ and ‘seldom’, which mean very different things to different people;
- (3)
The options are unnecessarily long, complex or ask two questions;
- (4)
The stem is overly long and/or complex.
Below are some examples of flawed questions:
Example 4:
A woman has delivered her first baby 6 h before. She has an eclamptic seizure. The best option for abolishing her fit would be administration of:
- (a)
Arterial line insertion
- (b)
Fluid restriction
- (c)
Hydralazine
- (d)
Labetalol
- (e)
Magnesium sulphate
Clearly (a) and (b), though they may be desirable aspects of pre-eclampsia (PET)/eclampsia management, are incongruous with the stem and can be discounted immediately. There is now a 33% chance of a correct answer even with minimal knowledge.
Example 5:
When consenting a woman for a diagnostic laparoscopy, the most appropriate complication(s) that need(s) to be discussed by the surgeon is/are:
- (a)
Deep venous thrombosis
- (b)
Haemorrhage, bowel, bladder and vascular injury, intra-peritoneal infection and the need for laparotomy in the event of complications
- (c)
Nausea and vomiting post procedure
- (d)
Post catheterisation UTI
- (e)
Wound dehiscence
The option (b) is so specific and so much longer than all the others that it automatically selects itself as the correct answer.
Example 6:
Following prolonged debulking surgery for ovarian cancer, a 78-year-old woman develops a paralytic ileus. The most likely biochemical abnormality that will exacerbate the problem postoperatively is:
- (a)
Hypocalcaemia
- (b)
Hyperkalaemia
- (c)
Hypokalaemia
- (d)
Hyponatraemia
- (e)
Hypermagnesaemia
The presence of two kalaemias immediately stands out as likely to contain the answer, and the presence of three hypo-options against only two hyper-options then makes the choice between the two kalaemias much easier.