Evaluating the utility of workplace-based assessment tools for speciality training




Workplace assessment has been incorporated into speciality training in the UK following changes in training and work patterns within the National Health Service (NHS). Various types of assessment tools have been adopted to assess the clinical competence of trainees. In obstetrics and gynaecology, these include the mini-clinical evaluation exercise (mini-CEX), the Objective Structured Assessment of Technical Skills (OSATS) and case-based discussions (CbDs). This review provides a theoretical background to workplace assessment and an educational framework that may be adopted to evaluate the effectiveness of these tools. It summarises the current evidence for their utility with regard to reliability, validity, acceptability, educational impact and cost.


There is an increasing emphasis on more formalised and effective assessment of postgraduate medical training in the United Kingdom (UK). This trend reflects developments in speciality training, changes in health-care delivery within the National Health Service (NHS) and shifts in the socio-political status of the medical profession.


The European Working Time Directive (EWTD) has altered junior doctor working patterns in the UK, with a reduction in working hours and fewer training years. Consequently, total speciality training hours have fallen from 30 000 to 6000. The traditional clinical apprenticeship, with its reliance on time-related experiential training and subjective, observational assessment of clinical skills, is no longer feasible. There are fewer opportunities for trainers to observe trainees during clinical encounters, thereby limiting the assessment of skills and the provision of feedback.


The delivery of services within the NHS has changed significantly in the last decade, with greater emphasis on achieving targets, including stringent waiting times and an increased turnover of clinical activity. This has made training and assessing trainees in high-pressure clinical areas, such as outpatient clinics and operating theatres, challenging. High-profile cases, including the Shipman inquiry, have put the medical profession under scrutiny not only from the public and the media, but also from politicians. Consequently, a doctor needs not only to demonstrate competence, but also to provide documentary evidence of its attainment and maintenance.


In response to these challenges, the Postgraduate Medical Education and Training Board (PMETB) advised Royal colleges across specialities to incorporate rigorous assessment strategies within their speciality training programmes. The assessment should reflect the overall objectives of the training and provide documentary evidence of the achievement of stage-specific competence.


This article provides an overview of the evidence for the utility of workplace-based assessment in medicine. It specifically focusses on the assessment methods employed by the Royal College of Obstetricians and Gynaecologists (RCOG) to assess its speciality trainees. The review addresses the utility of tools designed principally to assess clinical and technical skills. Methods designed specifically to evaluate professionalism, such as the team observation forms, are not discussed in this article.


Workplace-based assessment


Workplace-based assessment allows the collection of evidence during normal work activities in order to decide whether a required standard has been achieved. It involves observing and assessing work in practice and providing feedback on the work in progress. In the context of medicine, workplace assessment provides a method for evaluating doctors' performance in naturalistic settings, that is, in clinical situations. Judgements can be made not only on clinical and procedural competence, but also on other aspects of practice, including professionalism and decision making.


In the UK, all postgraduate trainees in medicine undergo an annual Record of In-Training Assessment (RITA) or Annual Review of Competence Progression (ARCP) to ensure that they are competent to continue with training and, at the end of the designated years of training, capable of practising independently. Previous assessment methods were applied locally and varied in quality, which made the RITA process rather subjective, unstructured and informal. At the end of the hospital placement, the trainee was ‘signed off’ by the clinical supervisor as being competent in a specific task. The whole process was arbitrary, and no objective evidence was collected regarding a trainee’s competence. The RITA also varied between specialities, with no obvious quality assurance or governance mechanism to ensure that the process, or the evidence on which assessment was based, could be substantiated. The use of workplace assessment allows monitoring of trainees’ progress, and is a means of generating evidence of satisfactory and/or unsatisfactory professional performance as trainees mature from novice to expert.


Workplace-based assessment represents the top two levels of Miller’s pyramid (Fig. 1), a commonly adopted structure for assessing clinical competence. The lowest level of the pyramid tests knowledge (knows); this is progressively followed by competence (knows how), performance (shows how) and actual demonstration of action (does). Workplace-based methods of assessment target the upper two levels of the pyramid, whereas other methods of assessment, such as multiple choice questions or objective structured clinical examinations (OSCEs), target the lower levels. This testing of performance at the place of work may allow a more realistic evaluation of a doctor’s performance, taking into account variations due to factors such as case mix and complexity.




Fig. 1


Miller’s pyramid.




Frameworks to evaluate assessment methods


There are various approaches to the evaluation of the quality of assessment methods that measure competence. Traditionally, this has involved estimation of psychometric properties of assessment methods including their reliability and validity.


The reliability of an assessment method describes the reproducibility of its results. This is a measure of how consistently the assessment would produce the same result if the test was taken by a candidate on different occasions, or if the same candidate was assessed by different assessors. In the context of workplace assessment, reliability is measured using generalisability theory. The theory allows analysis of variance components to quantify the generalisability of scores generated across different examination conditions, such as task type and assessor numbers. The ratio of the variance attributable to the object of measurement (e.g., the trainee) to the total variance (the sum of this variance and the error variance) yields an estimate of reliability: the generalisability (G) coefficient, which ranges from 0 to 1, where 0 is the lowest and 1 the highest value.
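

As a worked illustration of this calculation, the short Python sketch below computes a G coefficient from hypothetical variance components for trainee, assessor and residual error; the figures and the number of assessments are assumptions chosen purely for demonstration and are not taken from the review.

```python
# Minimal sketch of a G coefficient calculation.
# The variance components are hypothetical; in practice they would be estimated
# from an analysis of variance of assessment scores (e.g., trainees x assessors).
var_trainee = 0.60    # variance attributable to the object of measurement (the trainee)
var_assessor = 0.15   # variance attributable to assessor stringency/leniency
var_residual = 0.45   # residual error variance, including trainee-assessor interaction

n_assessments = 4     # number of independent assessments averaged per trainee

# Treating assessor and residual variance as error, averaging over several
# assessments reduces the error variance contributing to each trainee's mean score.
error_variance = (var_assessor + var_residual) / n_assessments

# G coefficient: trainee variance as a proportion of total (trainee + error) variance.
g_coefficient = var_trainee / (var_trainee + error_variance)

print(f"G coefficient with {n_assessments} assessments: {g_coefficient:.2f}")
```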


The reliability of workplace assessment tools is usually reported using reliability coefficients that capture three aspects of a measure:


Internal consistency – this is commonly measured using Cronbach’s alpha coefficient, with a score between 0 and 1, where 1 is the highest. A high Cronbach’s alpha coefficient implies a high correlation between different items within a measure if these items are measuring the same construct, for example, communication skills (a worked example follows this list).


Test–retest reliability – this measures the consistency of an assessment tool when administered on different occasions to the same sample.


Inter-rater reliability – this refers to the degree of agreement between two or more assessors assessing the same encounter. This is usually measured using various methods depending upon the type of data. For example, Cohen’s kappa is used for categorical data and Pearson’s correlation coefficient is used for interval data.
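

To make the coefficients above concrete, the Python sketch below computes Cronbach’s alpha for a small set of invented item scores and Cohen’s kappa for two assessors’ categorical judgements of the same encounters; the data are hypothetical and chosen only to illustrate the calculations.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is available

# Hypothetical scores: 5 trainees rated on 4 items intended to measure the same construct.
scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
], dtype=float)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_score_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)
print(f"Cronbach's alpha: {alpha:.2f}")

# Inter-rater agreement: two assessors' categorical judgements on the same six encounters.
assessor_a = ["competent", "competent", "not yet", "competent", "not yet", "competent"]
assessor_b = ["competent", "not yet", "not yet", "competent", "not yet", "competent"]
kappa = cohen_kappa_score(assessor_a, assessor_b)
print(f"Cohen's kappa: {kappa:.2f}")
```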


The validity of an assessment tool provides an indication as to whether it is, in fact, measuring what it is supposed to measure. There are four main types of validity:


Content validity – this ensures that items within a measure represent the content domain that is being tested.


Concurrent validity – this shows the degree to which the measure correlates with similar validated measures testing the same domain.


Construct validity – this shows the degree to which a measure accords with a theoretical hypothesis. For example, the hypothesis may be that a senior trainee will perform better in an assessment than a junior trainee. If the assessment method discriminates between senior and junior trainees in terms of performance scores, then it may be said to have strong construct validity (a simple worked illustration follows this list).


Predictive validity – this shows the extent to which performance during an assessment can predict similar performance or behaviour in the future.
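

As a simple illustration of the construct-validity hypothesis mentioned above (that senior trainees should outscore junior trainees), the Python sketch below compares two hypothetical groups of overall assessment scores using a Mann-Whitney U test; the scores and the choice of test are assumptions for demonstration only and are not drawn from the review.

```python
from scipy.stats import mannwhitneyu  # assumes SciPy is available

# Hypothetical overall assessment scores for junior and senior trainees.
junior_scores = [4, 5, 3, 5, 4, 6, 4]
senior_scores = [6, 7, 6, 8, 7, 6, 7]

# One-sided test of the construct-validity hypothesis that seniors score higher.
statistic, p_value = mannwhitneyu(senior_scores, junior_scores, alternative="greater")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")

# A small p-value would be consistent with the tool discriminating between
# training grades, i.e., evidence supporting construct validity.
```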


The applicability of a purely psychometric model to the evaluation of competency assessment tools is limited in several respects. Workplace-assessment methods may be assumed to have content validity; however, their predictive and concurrent validity are more difficult to prove. Predictive validity implies that the performance of trainees during workplace assessment will translate into competent performance in their subsequent professional careers. As the introduction of workplace-based assessment is relatively recent, these data are not yet available. Concurrent validity of these assessment methods assumes that assessor evaluation of performance is valid (based on personal expertise and clear setting of measurement objectives) and that there is no subjectivity in the assessment of competence; both of these assumptions are difficult to evaluate. Assessment of clinical competence often includes a subjective evaluation of the whole performance, whereas workplace-based assessment methods generally split the skill being observed into smaller components. There is little evidence of how this reductionist approach fits the assessment of complex clinical skills, that is, does the sum of the individual components add up to competence? There is a view that checklists are reliable for assessing technical skills, whereas more complex skills, such as interactions with patients, may be more reliably assessed using global ratings. To address this reductionist approach, workplace-based assessment methods often include both checklists and global ratings. Finally, the validity of assessment tools is often challenging to prove because there is no ‘gold standard’ measure of clinical performance against which to determine whether assessment methods measure what they are meant to measure.


Establishing the reliability of workplace assessment tools is equally contentious. This is mainly because individual patient encounters vary in complexity and context, and no two encounters can, therefore, be considered as equivalent. In addition, changes in working patterns, particularly shift working, do not allow the same assessor to sample a trainee’s progression over a period of time.


To address these limitations, theoretical frameworks have emerged that evaluate aspects of assessment methods beyond reliability and validity. The framework suggested by Baartman et al. was designed mainly to evaluate competency assessment programmes rather than specific assessment tools. For the purpose of evaluating the RCOG workplace-assessment tools, we have used the framework suggested by van der Vleuten, as it allows evaluation of individual assessment tools. According to this model, the utility of an assessment method is influenced by several variables: reliability, validity, educational impact, acceptability and cost. Each of these should be considered when evaluating the use of workplace-based assessment tools. Although reliability and validity are always discussed when reviewing methods of assessment, an assessment’s educational impact, practicality of use, acceptability and cost are of equal importance and also affect its utility in practice.




Assessment tools used in workplace assessment in obstetrics and gynaecology


Mini-CEX


A mini-CEX encounter consists of a single member of the faculty observing a doctor while they conduct a focussed history and physical examination in a clinical setting. The assessor completes a form which tests six clinical domains: history taking; physical examination; humanistic qualities/professionalism; clinical judgement; counselling/communication skills; and organisation and efficiency. At the end of the encounter, the assessor marks the performance on overall clinical competence/care. There is also provision for grading the case complexity as low, average or high.


During a typical encounter, the trainee doctor engages in a specific clinical activity – for example, history taking and examination – and then summarises the findings to provide a provisional diagnosis and/or treatment plan to the assessor. The faculty member observing the encounter scores the trainee on their performance and provides immediate feedback. The encounters are meant to be short and, on average, take 15–25 min. The time taken for feedback has been reported as varying between 5 and 17 min. Trainees are expected to be assessed several times by multiple assessors. In addition, there is space on the mini-CEX form to record details of the clinical encounter, thereby providing documented evidence of the assessment.


The RCOG has developed individual mini-CEX forms (one for obstetrics and one for gynaecology) to be used in speciality-specific clinical settings – for example, the labour ward or outpatient clinics (Figs. 2 and 3). The structure and content of the forms are very similar to the original mini-CEX forms.




Fig. 2


Mini-CEX obstetrics.



Fig. 3


Mini-CEX gynaecology.


Objective structured assessment of technical skills (OSATS)


In medicine, particularly in the surgical specialities, the training and assessment of technical and procedural skills are essential. Assessment forms such as Direct Observation of Procedural Skills (DOPS) have been used by various Royal colleges to assess procedural skills, from basic procedures such as venepuncture to more complex procedures such as cardiac surgery. The Objective Structured Assessment of Technical Skills (OSATS) is similar to the DOPS and consists of two components: the first is a checklist of specific competencies required to perform a particular procedure; the second is a generic-skills form that measures more generic competencies, such as tissue or instrument handling and communication with the team.


The assessment focusses on the domains of technical skills and pre- and post-procedure counselling, rather than other aspects of a clinical encounter – for example, history taking or physical examination. The procedure is usually supervised by an experienced doctor – for example, a consultant. Trainees are assessed on various procedures and by several different assessors. At the end of the procedure, the OSATS form is completed with a focus on providing feedback on the performance; the encounter ends with the assessor either signing the trainee off as competent to perform the procedure independently, or as working towards competence. The time taken to complete an OSATS form is usually 5–10 min; this does not include the time required to observe the procedure, which, of course, will vary with the type of skill being assessed and the experience of the trainee.


The RCOG has separate OSATS forms for basic (e.g., caesarean sections, perineal repair and hysteroscopy) and more complex (e.g., operative laparoscopy) procedures. An example of an OSATS form is given in Fig. 4.




Fig. 4


OSATS.


Case-based discussion (CbD)


The case-based discussion (CbD) is a variation of a tool called ‘chart-stimulated recall’ (CSR), which was developed for use by the American Board of Emergency Medicine. The CbD is designed to assess the domains of medical record keeping, clinical assessment, decision making and professionalism in a particular area of clinical practice. Trainees discuss a case for which they have been responsible with their supervising consultant or other senior colleagues. Trainees are expected to select cases of varying complexity from a range of clinical settings, including critical incidents. It is estimated that each CbD encounter will take approximately 15 min and, although the discussion usually takes place in a quiet area (e.g., an office or clinic room), CbDs may be used in more innovative settings – for example, in a departmental clinical meeting. In the real world, trainees’ expertise is variable; the successful completion of a fixed number of forms may be insufficient evidence of competence or of adequate clinical experience. Although there are suggestions that a minimum of six CbDs should be conducted during the early training stages, the PMETB guidance remains that the number of CbDs required per trainee-year should be judged on an individual basis by the assigned educational supervisors and annual review panels. Ideally, these forms would be used during each training encounter; this would help trainers to assess trainees across a wide range of clinical situations, thereby forming a more comprehensive picture of the trainee’s strengths and weaknesses.


The RCOG has separate CbD forms for obstetrics and gynaecology (Figs. 5 and 6). This allows the forms to be used in speciality-specific clinical settings (e.g., the labour ward or antenatal clinic for obstetrics; the outpatient clinic or acute admissions for gynaecology). In addition, the clinical problems discussed differ (e.g., antenatal care and maternal medicine for obstetrics; benign gynaecology and pelvic floor management for gynaecology).

