Evaluating the utility of workplace-based assessment tools for speciality training




Workplace assessment has been incorporated into speciality training in the UK following changes in training and work patterns within the National Health Service (NHS). Various types of assessment tools have been adopted to assess the clinical competence of trainees. In obstetrics and gynaecology, these include the mini-clinical evaluation exercise (mini-CEX), the Objective Structured Assessment of Technical Skills (OSATS) and case-based discussions (CbDs). This review provides a theoretical background to workplace assessment and an educational framework that may be adopted to evaluate the effectiveness of these tools. It summarises the current evidence for their utility with regard to reliability, validity, acceptability, educational impact and cost.


There is an increasing emphasis on more formalised and effective assessment of postgraduate medical training in the United Kingdom (UK). This trend reflects developments in speciality training, changes in health-care delivery within the National Health Service (NHS) and shifts in the socio-political status of the medical profession.


The European Working Time Directive (EWTD) has altered junior doctor working patterns in the UK, with a reduction in working hours and fewer training years. Consequently, total speciality training hours have fallen from 30 000 to 6000. The traditional clinical apprenticeship, with its reliance on time-related experiential training and subjective, observational assessment of clinical skills, is no longer feasible. There are fewer opportunities for trainers to observe trainees during clinical encounters, thereby limiting the assessment of skills and the provision of feedback.


The delivery of services within the NHS has changed significantly in the last decade, with greater emphasis on achieving targets, including stringent waiting times and an increased turnover of clinical activity. This has made training and assessing trainees in high-pressure clinical areas, such as outpatient clinics and operating theatres, challenging. High-profile cases, including the Shipman inquiry, have put the medical profession under scrutiny not only from the public and the media, but also from politicians. Consequently, a doctor needs not only to demonstrate competence, but also to provide documentary evidence of its attainment and maintenance.


In response to these challenges, the Postgraduate Medical Education and Training Board (PMETB) advised Royal colleges across specialities to incorporate rigorous assessment strategies within their speciality training programmes. The assessment should reflect the overall objectives of the training and provide documentary evidence of the achievement of stage-specific competence.


This article provides an overview of the evidence for the utility of workplace-based assessment in medicine. It specifically focusses on the assessment methods employed by the Royal College of Obstetricians and Gynaecologists (RCOG) to assess its speciality trainees. The review addresses the utility of tools designed principally to assess clinical and technical skills. Methods designed specifically to evaluate professionalism, such as the team observation forms, are not discussed in this article.


Workplace-based assessment


Workplace-based assessment allows the collection of evidence during normal work activities in order to decide whether a required standard has been achieved. It involves observing and assessing work in practice and providing feedback on the work in progress. In the context of medicine, workplace assessment provides a method for evaluating doctors' performance in naturalistic settings, that is, in clinical situations. Judgements can be made not only on clinical and procedural competence, but also on other aspects of practice, including professionalism and decision making.


In the UK, all postgraduate trainees in medicine undergo an annual Record of In-Training Assessment (RITA) or Annual Review of Competence Progression (ARCP) to ensure that they are competent to continue with training and, at the end of the designated years of training, capable of practising independently. Previous assessment methods were applied locally and varied in quality, which made the RITA process rather subjective, unstructured and informal. At the end of the hospital placement, the trainee was ‘signed off’ by the clinical supervisor as being competent in a specific task. The whole process was arbitrary, and no objective evidence was collected regarding a trainee’s competence. The RITA also varied between specialities, with no obvious quality assurance or governance mechanism to ensure that the process, or the evidence on which assessment was based, could be substantiated. The use of workplace assessment allows monitoring of trainees’ progress, and is a means of generating evidence of satisfactory and/or unsatisfactory professional performance as trainees mature from novice to expert.


Workplace-based assessment represents the top two levels of Miller’s pyramid (Fig. 1), a commonly adopted structure for assessing clinical competence. The lowest level of the pyramid tests knowledge (knows); this is progressively followed by competence (knows how), performance (shows how) and actual demonstration of action (does). Workplace-based methods of assessment target the upper two levels of the pyramid, whereas other methods of assessment, such as multiple choice questions or objective structured clinical examinations (OSCEs), target the lower levels. This testing of performance at the place of work may allow a more realistic evaluation of a doctor’s performance, taking into account variations due to factors such as case mix and complexity.




Fig. 1


Miller’s pyramid.




Frameworks to evaluate assessment methods


There are various approaches to the evaluation of the quality of assessment methods that measure competence. Traditionally, this has involved estimation of psychometric properties of assessment methods including their reliability and validity.


The reliability of an assessment method describes the reproducibility of its results. This is a measure of how consistently the assessment would produce the same result if the test was taken by a candidate on different occasions, or if the same candidate was assessed by different assessors. In the context of workplace assessment, reliability is measured using generalisability theory. The theory allows analysis of variance components to quantify the generalisability of scores generated across different examination conditions, such as task type and assessor numbers. The ratio of the variance attributable to the object of measurement (e.g., the trainee) to the total variance (the sum of this variance and the error variance) yields an estimate of reliability: the generalisability (G) coefficient, which ranges from 0 to 1, where 0 is the lowest and 1 the highest value.
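

As a worked illustration of this calculation, the short Python sketch below computes a G coefficient from hypothetical variance components for trainee, assessor and residual error; the figures and the number of assessments are assumptions chosen purely for demonstration and are not taken from the review.

```python
# Minimal sketch of a G coefficient calculation.
# The variance components are hypothetical; in practice they would be estimated
# from an analysis of variance of assessment scores (e.g., trainees x assessors).
var_trainee = 0.60    # variance attributable to the object of measurement (the trainee)
var_assessor = 0.15   # variance attributable to assessor stringency/leniency
var_residual = 0.45   # residual error variance, including trainee-assessor interaction

n_assessments = 4     # number of independent assessments averaged per trainee

# Treating assessor and residual variance as error, averaging over several
# assessments reduces the error variance contributing to each trainee's mean score.
error_variance = (var_assessor + var_residual) / n_assessments

# G coefficient: trainee variance as a proportion of total (trainee + error) variance.
g_coefficient = var_trainee / (var_trainee + error_variance)

print(f"G coefficient with {n_assessments} assessments: {g_coefficient:.2f}")
```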


The reliability of workplace assessment tools is usually reported using reliability coefficients that capture three aspects of a measure:


Internal consistency – this is commonly measured using Cronbach’s alpha coefficient, with a score between 0 and 1, where 1 is the highest. A high Cronbach’s alpha coefficient implies a high correlation between different items within a measure if these items are measuring the same construct, for example, communication skills (a worked example follows this list).


Test–retest reliability – this measures the consistency of an assessment tool when administered on different occasions to the same sample.


Inter-rater reliability – this refers to the degree of agreement between two or more assessors assessing the same encounter. This is usually measured using various methods depending upon the type of data. For example, Cohen’s kappa is used for categorical data and Pearson’s correlation coefficient is used for interval data.
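

To make the coefficients above concrete, the Python sketch below computes Cronbach’s alpha for a small set of invented item scores and Cohen’s kappa for two assessors’ categorical judgements of the same encounters; the data are hypothetical and chosen only to illustrate the calculations.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is available

# Hypothetical scores: 5 trainees rated on 4 items intended to measure the same construct.
scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
], dtype=float)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_score_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)
print(f"Cronbach's alpha: {alpha:.2f}")

# Inter-rater agreement: two assessors' categorical judgements on the same six encounters.
assessor_a = ["competent", "competent", "not yet", "competent", "not yet", "competent"]
assessor_b = ["competent", "not yet", "not yet", "competent", "not yet", "competent"]
kappa = cohen_kappa_score(assessor_a, assessor_b)
print(f"Cohen's kappa: {kappa:.2f}")
```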


The validity of an assessment tool provides an indication as to whether it is, in fact, measuring what it is supposed to measure. There are four main types of validity:


Content validity – this ensures that items within a measure represent the content domain that is being tested.


Concurrent validity – this shows the degree to which the measure correlates with similar validated measures testing the same domain.


Construct validity – this shows the degree to which a measure accords with a theoretical hypothesis. For example, the hypothesis may be that a senior trainee will perform better in an assessment than a junior trainee. If the assessment method discriminates between senior and junior trainees in terms of performance scores, then it may be said to have strong construct validity (a simple worked illustration follows this list).


Predictive validity – this shows the extent to which performance during an assessment can predict similar performance or behaviour in the future.
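

As a simple illustration of the construct-validity hypothesis mentioned above (that senior trainees should outscore junior trainees), the Python sketch below compares two hypothetical groups of overall assessment scores using a Mann-Whitney U test; the scores and the choice of test are assumptions for demonstration only and are not drawn from the review.

```python
from scipy.stats import mannwhitneyu  # assumes SciPy is available

# Hypothetical overall assessment scores for junior and senior trainees.
junior_scores = [4, 5, 3, 5, 4, 6, 4]
senior_scores = [6, 7, 6, 8, 7, 6, 7]

# One-sided test of the construct-validity hypothesis that seniors score higher.
statistic, p_value = mannwhitneyu(senior_scores, junior_scores, alternative="greater")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")

# A small p-value would be consistent with the tool discriminating between
# training grades, i.e., evidence supporting construct validity.
```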


The applicability of a purely psychometric model to the evaluation of competency assessment tools is limited in several respects. Workplace-assessment methods may be assumed to have content validity; however, their predictive and concurrent validity are more difficult to prove. Predictive validity implies that the performance of trainees during workplace assessment will translate into competent performance in their subsequent professional careers. As the introduction of workplace-based assessment is relatively recent, these data are not yet available. Concurrent validity of these assessment methods assumes that assessor evaluation of performance is valid (based on personal expertise and clear setting of measurement objectives) and that there is no subjectivity in the assessment of competence; both of these assumptions are difficult to evaluate. Assessment of clinical competence often includes a subjective evaluation of the whole performance, whereas workplace-based assessment methods generally split the skill being observed into smaller components. There is little evidence of how this reductionist approach fits the assessment of complex clinical skills, that is, does the sum of the individual components add up to competence? There is a view that checklists are reliable for assessing technical skills, whereas more complex skills, such as interactions with patients, may be more reliably assessed using global ratings. To address this reductionist approach, workplace-based assessment methods often include both checklists and global ratings. Finally, the validity of assessment tools is often challenging to prove because there is no ‘gold standard’ measure of clinical performance against which to determine whether assessment methods measure what they are meant to measure.


Establishing the reliability of workplace assessment tools is equally contentious. This is mainly because individual patient encounters vary in complexity and context, and no two encounters can, therefore, be considered as equivalent. In addition, changes in working patterns, particularly shift working, do not allow the same assessor to sample a trainee’s progression over a period of time.


To address these limitations, theoretical frameworks have emerged that evaluate aspects of assessment methods beyond reliability and validity. The framework suggested by Baartman et al. was designed mainly to evaluate competency assessment programmes rather than specific assessment tools. For the purpose of evaluating the RCOG workplace-assessment tools, we have used the framework suggested by van der Vleuten, as it allows evaluation of individual assessment tools. According to this model, the utility of an assessment method is influenced by several variables: reliability, validity, educational impact, acceptability and cost. Each of these should be considered when evaluating the use of workplace-based assessment tools. Although reliability and validity are always discussed when reviewing methods of assessment, an assessment’s educational impact, practicality of use, acceptability and cost are of equal importance and also affect its utility in practice.




Assessment tools used in workplace assessment in obstetrics and gynaecology


Mini-CEX


A mini-CEX encounter consists of a single member of the faculty observing a doctor while they conduct a focussed history and physical examination in a clinical setting. The assessor completes a form which tests six clinical domains: history taking; physical examination; humanistic qualities/professionalism; clinical judgement; counselling/communication skills; and organisation and efficiency. At the end of the encounter, the assessor marks the performance on overall clinical competence/care. There is also provision for grading the case complexity as low, average or high.


During a typical encounter, the trainee doctor engages in a specific clinical activity – for example, history taking and examination – and then summarises the findings to provide a provisional diagnosis and/or treatment plan to the assessor. The faculty member observing the encounter scores the trainee on their performance and provides immediate feedback. The encounters are meant to be short and, on average, take 15–25 min. The time taken for feedback has been reported as varying between 5 and 17 min. Trainees are expected to be assessed several times by multiple assessors. In addition, there is space on the mini-CEX form to record details of the clinical encounter, thereby providing documented evidence of the assessment.


The RCOG has developed individual mini-CEX forms (one for obstetrics and one for gynaecology) to be used in speciality-specific clinical settings – for example, the labour ward or outpatient clinics (Figs. 2 and 3). The structure and content of the forms are very similar to the original mini-CEX forms.




Fig. 2


Mini-CEX obstetrics.



Fig. 3


Mini-CEX gynaecology.


Objective structured assessment of technical skills (OSATS)


In medicine, particularly in the surgical specialities, the training and assessment of technical and procedural skills are essential. Assessment forms such as Direct Observation of Procedural Skills (DOPS) have been used by various Royal colleges to assess procedural skills, from basic procedures such as venepuncture to more complex procedures such as cardiac surgery. The Objective Structured Assessment of Technical Skills (OSATS) is similar to the DOPS and consists of two components: the first is a checklist of specific competencies required to perform a particular procedure; the second is a generic-skills form that measures more generic competencies, such as tissue or instrument handling and communication with the team.


The assessment focusses on the domains of technical skills and pre- and post-procedure counselling, rather than other aspects of a clinical encounter – for example, history taking or physical examination. The procedure is usually supervised by an experienced doctor – for example, a consultant. Trainees are assessed on various procedures and by several different assessors. At the end of the procedure, the OSATS form is completed with a focus on providing feedback on the performance; the encounter ends with the assessor either signing the trainee off as competent to perform the procedure independently, or as working towards competence. The time taken to complete an OSATS form is usually 5–10 min; this does not include the time required to observe the procedure, which, of course, will vary with the type of skill being assessed and the experience of the trainee.


The RCOG has separate OSATS forms for basic (e.g., caesarean sections, perineal repair and hysteroscopy) and more complex (e.g., operative laparoscopy) procedures. An example of an OSATS form is given in Fig. 4.




Fig. 4


OSATS.


Case-based discussion (CbD)


The case-based discussion (CbD) is a variation of a tool called ‘chart-stimulated recall’ (CSR), which was developed for use by the American Board of Emergency Medicine. The CbD is designed to assess the domains of medical record keeping, clinical assessment, decision making and professionalism in a particular area of clinical practice. Trainees discuss a case for which they have been responsible with their supervising consultant or other senior colleagues. Trainees are expected to select cases of varying complexity from a range of clinical settings, including critical incidents. It is estimated that each CbD encounter will take approximately 15 min and, although the discussion usually takes place in a quiet area (e.g., an office or clinic room), CbDs may be used in more innovative settings – for example, in a departmental clinical meeting. In the real world, trainees’ expertise is variable; the successful completion of a fixed number of forms may be insufficient evidence of competence or of adequate clinical experience. Although there are suggestions that a minimum of six CbDs should be conducted during the early training stages, the PMETB guidance remains that the number of CbDs required per trainee-year should be judged on an individual basis by the assigned educational supervisors and annual review panels. Ideally, these forms would be used during each training encounter; this would help trainers to assess trainees across a wide range of clinical situations, thereby forming a more comprehensive picture of the trainee’s strengths and weaknesses.


The RCOG has separate CbD forms for obstetrics and gynaecology (Figs. 5 and 6). This allows the forms to be used in speciality-specific clinical settings (e.g., the labour ward or antenatal clinic for obstetrics; the outpatient clinic or acute admissions for gynaecology). In addition, the clinical problems discussed differ (e.g., antenatal care and maternal medicine for obstetrics; benign gynaecology and pelvic floor management for gynaecology).

