The assessment of professional competence: building blocks for theory development




This article presents lessons learnt from experiences with the assessment of professional competence. Based on Miller’s pyramid, a distinction is made between established assessment technology for assessing ‘knows’, ‘knows how’ and ‘shows how’ and more recent developments in the assessment of (clinical) performance at the ‘does’ level. Some general lessons are derived from research on and experiences with the established assessment technology. Here, many paradoxes are revealed and empirical outcomes are often counterintuitive. Instruments for assessing the ‘does’ level are classified and described, and additional general lessons for this area of performance assessment are derived. These lessons can also be read as general principles of assessment (programmes) and may provide theoretical building blocks to underpin appropriate and state-of-the-art assessment practices.


Assessment of professional competence is one area in medical education where significant progress has been made. Many developments and innovations have inspired good research, which has taught valuable lessons and prompted steps leading to further innovations. In this article, the most salient general lessons are presented, which differs from our previous reviews of the assessment of competence: previously, we examined the research around specific instruments to arrive at general conclusions, whereas in this article that order is deliberately reversed to provide a different perspective.


We use Miller’s pyramid as a convenient framework to organise this review of assessment (Fig. 1).




Fig. 1


Miller’s pyramid and types of assessment used for assessing the layers.


When the literature on the assessment of medical competence is surveyed from a historical perspective, it is striking that, over the past decades, research appears to have been steadily ‘climbing’ Miller’s pyramid. Current developments are concentrated at the top, the ‘does’ level, whereas assessment at the lower layers, directed at (factual) knowledge, application of knowledge and demonstration of skills, has a longer history and might even be qualified as ‘established’ assessment technology. Assessment at the top (‘does’) level is predominantly assessment in the workplace. This article first discusses general lessons from research on assessment at the bottom three layers and then concentrates on the top layer. The lessons are summarised in the ‘Practice points’.


The first three layers: ‘Knows’, ‘Knows how’ and ‘Shows how’


Competence is specific, not generic


This is one of the best-documented empirical findings in the assessment literature. In medical education, it was first described in research on so-called patient management problems (PMPs). PMPs are elaborate, written patient simulations in which candidates’ pathways and choices in resolving a problem are scored and taken as indications of competence in clinical reasoning. A quite disconcerting and counterintuitive finding was that candidates’ performance on one case was a poor predictor of performance on any other given case, even within the same domain. This phenomenon was later demonstrated in basically all assessment methods, regardless of what was being measured, and it was termed the ‘content specificity’ problem of (clinical) competence. A wealth of research on the Objective Structured Clinical Examination (OSCE) exposed content specificity as the dominant source of unreliability. All other sources of error (e.g., assessors, patients) either had limited effects or could be controlled. The phenomenon of content specificity is not unique to medical expertise; it is also found elsewhere, often under the name of task variability.

How surprising and counterintuitive this finding was (and sometimes still is) becomes easier to understand when it is realised that much of the thinking about competencies and skills was based on notions from research on personality traits. Personality traits are unobservable, ‘inferred’, stable traits, distinct from other traits and characterised by monotonic linear growth. A typical example of a trait is intelligence: it cannot be observed directly, so it has to be inferred from behaviour; it is independent of other personality traits, etc. The trait approach was a logical extension of psychometric theory, which had its origins in personality research. However, empirical research in education contradicted the tenets of the personality trait approach, revealing that the expected stability across content/tasks/items was very low at best. Moreover, when sufficient cases or subjects were sampled to overcome the content specificity problem, scores tended to correlate across different methods of assessment, thereby shattering the notion of the independence of measurements (as will be seen later, this led to another insight).

Content specificity resonated with findings from cognitive psychology, where, much earlier, transfer had been identified as a fundamental problem in learning. This sparked a great deal of research in cognitive psychology, providing insights into how learners reason through problems, how eminently important knowledge is therein, how information is chunked, automated and personalised as a result of personal experience, and how people become experts through deliberate and sustained practice. Viewed from the perspective of cognitive psychology, the phenomenon of content specificity thus becomes understandable as a quite logical natural phenomenon.


The consequences of content specificity for assessment are far-reaching and dramatic. It would be naïve to rely on small samples across content: large samples are required to make reliable and generalisable inferences about a candidate’s competence. In other words, short tests can never be generalisable. Depending on the efficiency of the methods used to sample across content (a multiple-choice test samples more efficiently than a ‘long case’ oral examination such as is used in the British tradition), estimations show that at least 3–4 h of testing time are required to obtain minimally reliable scores. In short, one measure is no measure, and single-point assessments are not to be trusted. The wisest strategy is to combine information across content, across time and across different assessment sources.
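As a minimal psychometric sketch of why broad sampling is needed (standard classical test theory, not a result from the studies cited here, and with purely illustrative numbers), the Spearman–Brown prophecy formula relates the reliability of a test lengthened by a factor k to the reliability of a single unit of testing:

\[
\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}
\]

For instance, if one hour of case-based testing has a reliability of roughly \(\rho_1 = 0.45\), then four hours of testing (\(k = 4\)) would be expected to reach \(\rho_4 = (4 \times 0.45)/(1 + 3 \times 0.45) \approx 0.77\), which is consistent with the estimate that several hours of testing time are needed before scores become minimally reliable.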


Objectivity does not equal reliability


This insight is closely related to the previous one, and it is central to our thinking about assessment. Another discovery emerged from the growing number of publications on the reliability of assessment methods: reliability does not co-vary with the objectivity of methods. So-called subjective tests can be reliable and objective tests can be unreliable, all depending on the sampling within the method. It became clear that content specificity was not the only reason to sample widely across content. When another factor, such as the subjective judgement of assessors, influenced measurement, it was usually found that sampling across that factor also improved the reliability of the scores. To illustrate this, even the notoriously subjective, old-fashioned oral examination can be made reliable by wide sampling across content and examiners.

The concept of the OSCE arose to combat the subjectivity of the then-existing clinical assessment procedures. The solution was sought in objectivity and in standardisation, hence the ‘O’ and ‘S’ in the acronym. However, as research accumulated, the OSCE turned out to be as (un)reliable as any other method, all depending on the sampling within the OSCE. Apparently, reliability depended less on objectivity and standardisation than on the sampling of stations and assessors. Further research around the OSCE revealed yet another piece of the puzzle: a strong correlation between global rating scales and checklist ratings. Admittedly, global ratings were associated with a slight decrease in inter-rater reliability, but this was offset by a larger gain in inter-station reliability. Apparently, compared with the more analytical checklist scores, global, holistic judgements tended to pick up elements of candidates’ performance that were more generalisable across stations. In addition, global rating scales proved to be more valid: they were better able to discriminate between levels of expertise. This was a clear and intriguing first indication that human expert judgement could add (perhaps even incrementally) meaningful ‘signal’ to measurements instead of only ‘noise’.
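To make the sampling argument concrete, a sketch in terms of generalisability theory may help (this is the framework commonly used in OSCE reliability research; the variance components named here are hypothetical and for illustration only). For a design with \(n_s\) stations and \(n_r\) raters nested within each station, the dependability of candidates’ scores can be written as

\[
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \dfrac{\sigma^2_{ps}}{n_s} + \dfrac{\sigma^2_{pr:s}}{n_s\, n_r}}
\]

where \(\sigma^2_{p}\) is the variance between candidates, \(\sigma^2_{ps}\) the person-by-station (content specificity) variance and \(\sigma^2_{pr:s}\) the rater-related error variance. Because the person-by-station component typically dominates, adding stations (increasing \(n_s\)) raises the coefficient far more than adding raters per station or further objectifying the checklist, which is the formal counterpart of the lesson that reliability is a matter of sampling rather than of standardisation.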


The notion that objectivity is not synonymous with reliability has far-reaching practical consequences. Most importantly, it justifies reliance on (expert) human judgement. Obviously, this is primarily relevant to those assessment situations where we cannot do without it, but, later in this article, we will argue that reliance on human judgement is at least as relevant to many of the modern competency definitions that are being developed around the world. It is reassuring to know that, provided our sampling is adequate, we have no reason to ban subjective and holistic judgements from our assessment repertoire. In our view, this justifies the return of assessment to the clinical environment, which it had abandoned when the OSCE was introduced. Only this time, the move is scientifically underpinned by assessment theory.


What is being measured is determined more by the format of the stimulus than by the format of the response


Any assessment method is characterised by its stimulus and response formats. The stimulus format is the task presented to the candidate, and the response format determines how the answer is captured. A stimulus format may be a written task eliciting a fact, a written patient scenario prompting a diagnostic choice or a standardised scenario portrayed by a simulated patient (SP), who is interviewed and diagnosed in an OSCE. Responses can be captured by short multiple-choice questions (MCQs) or long-menu answers, a write-in, an essay, an oral response, direct observation reported on a checklist, etc. Although different response formats can be used with one method, assessment methods are typically characterised by their response formats (i.e., MCQs, essays, orals, and so on). What empirical research revealed, surprisingly, was that validity – what is being measured – was determined not so much by the response format as by the stimulus format. This was first demonstrated in the clinical reasoning literature in repeated reports of strong correlations between the results of complex paper-based patient scenarios and those of multiple-choice questions. Like case specificity, this finding seemed highly counterintuitive at first sight. In fact, among test developers, it remains a widely accepted notion that essays tap into understanding and multiple-choice questions into factual knowledge. Although there are certain trade-offs (as we pointed out in relation to checklists and rating scales), there is no denying that it is the stimulus format, not the response format, that dictates what is being measured. Studies in cognitive psychology, for example, have shown that the thought processes elicited by a case format differ from those triggered by a factual recall stimulus. Moreover, there is evidence that written assessment formats predict OSCE performance to a large extent.


The insight that the stimulus format is paramount in determining validity has, first of all, a practical implication: we should worry much more about designing appropriate stimulus formats than about appropriate response formats. An additional, related insight concerns the stimulus format itself: authenticity is essential, provided the stimulus is pitched at the appropriate level of complexity. Extremely elaborate and costly PMPs, for example, did not add much compared with relatively simple, short patient scenarios eliciting a key feature of a problem. Thus, short scenarios turned out to be not only relatively easy to develop but also quite efficient (good for wide sampling). It is no coincidence that written certifying examinations in the US and Canada have completely moved from measuring ‘Knows’ to measuring ‘Knows how’, using short scenario-based stimulus formats. Pitching formats at the appropriate level of authenticity is relevant for OSCEs too. The classic OSCE consists of short stations assessing clinical skills in a fragmented fashion (e.g., station 1: abdominal examination; station 2: communication). This is very different from the reality of clinical practice, which the OSCE was designed to approximate in the first place. Although fragmented skills assessment may be defensible at early stages of training (though one might question that too), at more advanced stages integrated skills assessment is obviously a more appropriate stimulus format, since it provides a closer approximation of the real clinical encounter. The importance of pitching the stimulus at a suitable level of complexity is supported by cognitive load theory, which posits that, when a learning task is too complex, short-term memory quickly becomes overloaded and learning is hampered as a result. This probably applies equally to assessment tasks. Authenticity therefore needs to be carefully dosed and fitted to the purpose of the assessment. However, the core lesson is that, across all assessment methods, it is the stimulus format, not the response format, on which we should focus.


A second implication of the significance of the stimulus format is more theoretical, although it has practical implications as well. When we aggregate information across assessments, we should use meaningful entities, probably largely determined by or related to the content of the stimulus format. This signifies a departure from the single method-to-trait match (i.e., written tests measure knowledge, PMPs measure clinical reasoning and OSCEs measure clinical skills), which is in line with the trait approach and still characteristic of many assessment practices: it is easy to aggregate within one method. This tenet becomes questionable if we accept that the stimulus is the crucial element. Is the method (the response format) really the most meaningful guide for aggregation? For example, does it make sense to add the score on a history-taking station to the score on the next station on resuscitation? Clearly, these stations measure very different skills. Why should similarity of method warrant aggregation? We see no legitimate reason. Perhaps the current competency movement can provide a more meaningful framework. Nonetheless, in our view, the prominence of the stimulus implies that we should aggregate information across sources that are meaningfully similar and make sense together. It also implies that similar information is not by definition information derived from identical assessment methods. We will address the practical pay-off of this insight when we discuss assessment programmes.


Validity can be ‘built-in’


The general notion here is that assessment is not easy to develop and is only as good as the time and energy put into it. Good assessment crucially depends on quality assurance measures around both test development and test administration. Quality appraisal of tests during the developmental stage is imperative, and peer review is an essential ingredient of efforts to improve the quality of test materials significantly. Unfortunately, it is not uncommon for test materials in medical schools to go unreviewed both before and after test administration. Not surprisingly, the quality of test materials within schools is often poor. The same holds for test administration. For example, it is important to train SPs and assessors for an OSCE, because it makes a difference in terms of preventing noise in the measurement. Ebel, one of the early theorists on educational achievement testing, highlighted the difference between assessment in education and trait measurement in psychology. He argued that, while the latter is concerned with unobservable latent variables, assessments in education have direct meaning, can be discussed and evaluated, and can be directly optimised. Ebel also argued that validity can be a ‘built-in’ feature of an assessment method. We take the view that all assessment at the three bottom layers of Miller’s pyramid can be controlled and optimised: materials can be scrutinised, stakeholders prepared, administration procedures standardised, psychometric procedures put in place, etc. The extent to which this is actually done will ultimately determine the validity of the inferences supported by the assessment. Later, we will discuss how built-in validity is different at the top end of the pyramid.


The logical practical implication is to invest as much time and effort in test construction and administration processes as resources will allow. Another implication is that we should consider sharing resources. Good assessment material is costly, so why not share it across schools and institutions? Not sharing is probably one of the biggest wastes of capital in education. Within our own context, five medical schools in the Netherlands have joined forces to develop and concurrently administer a comprehensive written test (Progress Test). Laudable international initiatives to share test material across institutions are the IDEAL Consortium (http://www.hkwebmed.org/idealweb/homeindex.html, accessed 4 November 2009) and the UK UMAP initiative (http://www.umap.org.uk/, accessed 4 November 2009).


Assessment drives learning


By now, it has almost become a cliché in assessment that assessment drives learning. The idea that assessment affects learning, for better or for worse, is also termed ‘consequential validity’. It has been criticised by some who argue that it negates intrinsic motivation. Without any doubt, learners are also intrinsically motivated and not all learning is geared to assessment, but at the same time, academic success is defined by summative assessment, and learners will try to optimise their chances of success, much as researchers allow impact factors to drive their publication behaviour. If certain preparation strategies (reproductive learning, for instance) are expected to maximise assessment success, one cannot blame learners for engaging in these strategies. Nevertheless, the relationship remains poorly understood (what happens, to whom and why?) and we will revisit this issue in our suggestions for further research. For the time being, we note that many issues around assessment (format, regulations, scheduling, etc.) can have a profound impact on learners.


The immediate implication is that we should monitor assessment and evaluate its effect on learners. Assessment has been known to achieve the opposite effect to that intended. For example, when we introduced OSCEs within our school, students immediately started memorising checklists, and their performance in the OSCE was trivialised. This reinforces the point we made about quality control, and extends it beyond test administration. A second, potential consequence is that we might use assessment strategically to achieve desired effects. If assessment drives learning in a certain (known) way, we might actually use this to promote positive learning effects.


No single method can do it all


No single method can be the magic bullet for assessment. Single-point assessments have limitations and any form of assessment will be confined to one level of Miller’s pyramid. This realisation has inspired us to advocate ‘Programmes of Assessment’. Each single assessment is a biopsy, and a series of biopsies will provide a more complete, more accurate picture.


Thinking in terms of programmes of assessment has far-reaching consequences, particularly in relation to the governance of assessment programmes. We see an analogy here with a curriculum and how it is governed. A modern curriculum is planned, prepared, implemented, co-ordinated, evaluated and improved. We believe the same processes should be in place for an assessment programme. Such a programme needs to be planned and purposefully arranged to stimulate students to reflect at one point, to write at another, to present on certain occasions, to demonstrate behavioural performance at other arranged points, etc. Committees should be appointed to oversee test development, support should be arranged for test administration, evaluations should be carried out, and necessary improvements should be implemented. In a programme of assessment, any method can have utility, depending on its fitness for purpose. In our earlier reviews, we argued in favour of mindful utility compromises, allowing, for example, inclusion of a less reliable assessment method to make use of its beneficial effect on learning. We propose that decisions about learners should never be based on a few assessment sources but rely on many. Information is preferably aggregated across the programme, and, as we argued earlier, across meaningful entities. This hinges on the presence of an overarching structure to organise the assessment programme.


Armed with the lessons and insights on assessment, which we have discussed so far, we are now ready to tackle the top end of Miller’s pyramid. Pivotal in this move are the determination to strive towards assessment in authentic situations and the broad sampling perspective to counterbalance the unstandardised and subjective nature of judgements in this type of assessment.


Assessing ‘Does’


Any assessment method at the ‘does’ level is characterised one way or another by reliance on information from knowledgeable people to judge performance. Obviously, this includes the assessee too. For now, we will park self-assessment to return to it later. Essentially, all assessment in natural settings relies on knowledgeable others or on ‘expert’ judgements. Sometimes reliance is indirect, as when assessment primarily relies on artefacts (e.g., prescription records, chart review, procedures done), but, ultimately, artefacts will have to be judged by one or more suitable assessors. The term ‘expert’ should be interpreted broadly to include peers, superiors, co-workers, teachers, supervisors, and anyone knowledgeable about the work or educational performance of the assessee. The assessment consists of gathering these judgements in some quantitative or qualitative form. As with OSCEs, the dominant response format is some form of observation structure (rating scale, free text boxes) on which a judgement is based. Unlike the OSCE, however, the stimulus format is the authentic context, which is essentially unstandardised and relatively unstructured. The response format is usually more or less generic and is not tailored to a specific assessment context. Predominantly, judgements take the form of global ratings of multiple competencies, often followed by oral feedback and discussion. In addition to scoring performance on rating scales, assessors are often invited to write narrative comments about the strengths and weaknesses of a student’s performance.


The authentic context can be ‘school-based’. An example is the assessment of professional behaviour in tutorial groups in a problem-based learning environment. The authentic context can also be ‘work-based’, that is, medical practice at all levels of training (undergraduate, postgraduate and continuous professional development).*


* We note a different use of the term work-based assessment in North America and Europe. In North America, it is associated with work after completion of training; in Europe, it refers to all (learning) contexts that take place in a workplace, which may include undergraduate clinical rotations and postgraduate residency training programmes. We use the term here in the latter sense.

Particularly in the work-based arena, we have witnessed a recent explosion of assessment technologies. At the same time, we see a proliferation of competencies that are to be assessed. Increasingly, integral competency frameworks are proposed for modern assessment programmes, including the well-known general competencies from the US Accreditation Council for Graduate Medical Education and the Canadian ‘CanMEDS’ competencies. What they have in common is their emphasis on competencies that are not unique to the medical domain but have equal relevance to other professional domains. Examples are the CanMEDS competencies ‘Collaborator’ and ‘Communicator’, which have wide applicability. Although these competencies are generic to some extent, we immediately acknowledge that, for assessment purposes, they are just as context-specific as any other skill or competency. It is interesting that these frameworks should heavily emphasise the more generic competencies, and they probably do so for all the right reasons. Typically, when things go wrong in clinicians’ performance, it is these competencies that are at stake. Research shows that success in the labour market is more strongly determined by generic skills than by domain-specific skills, and recent research in the medical domain shows that problematic professional performance in clinical practice is associated with detectable flaws in professional behaviour during undergraduate medical training. It is therefore imperative that generic skills are assessed. Unfortunately, these competencies are as difficult to define as their assessment is indispensable. A case in point is professionalism, a competency that has given rise to a plethora of definitions. Detailed definitions and operationalisations can be incorporated in a checklist, but the spectre of trivialisation looms large. At the same time, all of us have an intuitive notion of what these competencies entail, particularly when we see them manifested in concrete behaviour. We would argue that, to evaluate domain-independent competencies, we have no choice but to rely on assessment at the top of the pyramid, using some form of expert judgement. It follows that expert judgement is the key to effective assessment at the ‘does’ level.


Clinical professionals in a (postgraduate) teaching role traditionally gauge the professional maturity of trainees by their ability to bear clinical responsibility and to safely perform clinical tasks without direct supervision. It has been advocated that a summative assessment programme at the ‘does’ level should result in statements of awarded responsibility (STARs). These STARs, representing competence to practise safely and independently, would be near the top end of the Miller pyramid, but below its highest level: a physician’s track record in clinical practice. This is where the ultimate goal of competence, good patient care, comes into play.


All modern methods of assessment at the ‘does’ level allow for or apply frequent sampling across educational or clinical contexts and across assessors. The need to deal with content specificity means that sampling across a range of contexts remains invariably important. At the same time, the subjectivity of expert judgements needs to be counterbalanced by additional sampling across experts/assessors. The aggregated information must theoretically suffice to overcome the subjectivity of individual assessments. At this point, we will bypass instruments that do not allow for wide sampling.


First, we will discuss the organisation of assessment procedures at the ‘does’ level and then derive some general notions based on the current state of affairs in the literature. Assessment at the top of the pyramid is still very much a work in progress. Systematic reviews of these assessment practices invariably lead to the conclusion that hard scientific evidence is scarce and that further research is needed. Nevertheless, we believe that some generalisations are possible.


We will make a distinction between two types of assessment instruments. The first involves judgement of performance based directly on observation or on the assessor’s exposure to the learner’s performance. The second consists of aggregation instruments that compile information obtained from multiple sources over time. These two types will be discussed separately.


Direct performance measures


Within direct performance measures, we make a further distinction between two classes of assessment methods, characterised by the length of the period over which the assessment takes place. In ‘Individual Encounter’ methods, performance assessment is confined to a single concrete situation, such as one (part of a) patient encounter. Instruments found here include the Mini-Clinical Evaluation Exercise (Mini-CEX), Direct Observation of Procedural Skills (DOPS), the Professionalism Mini-Evaluation Exercise (P-MEX) and video observation of clinical encounters. In a concrete, time-bound, usually short (hence the ‘mini’ epithet), authentic encounter, performance is appraised by an assessor using a generic rating form, often reflecting multiple competencies such as those in the competency frameworks discussed earlier. Observation is generally followed by discussion or feedback between assessor and assessee. For individual trainees, this assessment procedure is repeated across a number of encounters and assessors.


The second class of methods comprises longer-term methods, in which performance is assessed over a longer period of time, ranging from several weeks to months or even years. Instead of judging individual encounters, assessors here rely on their exposure to the learner’s work over an extended period of time. Examples of these methods include peer assessment and multisource feedback. Multisource, or 360°, feedback (MSF) is an extension of peer feedback: it often includes a self-assessment and assessments from a range of others who are in a position to give a relevant judgement of one or more aspects of the candidate’s performance. These may include peers, supervisors, other health-care workers, patients, etc. The evaluation format usually involves a questionnaire with rating scales, which, again, evaluate multiple competencies; in many cases, additional narrative information is provided as well. Concrete procedures around MSF may vary. In some implementations, the learner selects the assessors; in others, the learner has no say in this. Sometimes the assessors remain anonymous and sometimes their identity is disclosed to the learner. Sometimes the feedback from MSF is mediated, for instance by a discussion with a supervisor or facilitator. This class of performance-appraisal methods can also be seen to include classic in-training evaluations by a supervisor, programme director or teacher. Unlike the other performance-appraisal methods, in-training evaluation is based on a single assessor. This does not mean that it is less useful; it only means that it should be treated as a single-assessor judgement. Naturally, it can be part of a larger assessment programme (remember that any method can have utility depending on its function within a programme). It should also be noted that, with sufficient sampling across assessors, there is no reason why these global performance evaluations cannot be reliable.


Aggregation methods


The second type of instrument comprises aggregation methods, which sample performance across a longer period of time or even continuously. Two much-used instruments are the logbook and the portfolio. Portfolios have become particularly popular as an aggregation instrument. Just like ‘OSCE’, the term portfolio is an umbrella term covering many manifestations, purposes of use and surrounding procedures. Van Tartwijk and Driessen classify portfolios in terms of the functions they can serve: monitoring and planning, coaching and reflection, and assessment. In fact, one might classify a logbook as a particular kind of portfolio with an exclusive focus on monitoring and planning. Portfolios can be used for a short time span and for a very limited set of competencies, even for a single competency. They can play a minor part in a larger assessment programme, they can be the main method to aggregate and evaluate all assessments at the ‘does’ level, or they can be the single method of assessment across the entire curriculum. Obviously, it is hard to generalise across all these manifestations to provide general conclusions about validity and reliability. However, recent reviews have made such attempts, resulting in clear recommendations, and we will partly use these to infer our general notions. For specific details on portfolios, we refer to the reviews. For our thinking here, it is important to be aware that portfolios tend to work best if functions are combined, in other words, when the portfolio is used for planning, coaching ‘and’ assessment. Portfolios also tend to work best if they perform a central (rather than peripheral) function in guiding learning, in coaching and in monitoring longitudinal competency development.


So, what general notions can we infer from the work published so far on these performance assessment methods?


A feasible sample is required to achieve reliable inferences


Recent reviews of direct observations in individual encounters summarise a number of studies, some based on large samples, which examine how many observations are needed for adequate reliability. Similar findings have been published for peer evaluations and multisource feedback instruments, where assessment ranges across a longer period of time. Despite variation between studies, we conclude that reliable inferences can be made with very feasible samples. The magical number seems to lie somewhere between 8 and 10, irrespective of the type of instrument and of what is being measured (except when patient ratings are used; then many more are needed). This is a very clear confirmation that reliability is a matter of sampling, not of standardisation or structuring of assessment. The reliabilities actually appear to be somewhat better than those of standardised assessments. One may speculate that this could be an indication that global performance appraisals pick up more generalisable competencies. Further research will be needed to answer this question, but it is an interesting thought that global expert judgement might bring more unique information to assessment, information that is not, or only to a lesser extent, captured by more analytical methods.
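To connect this figure with the sampling argument made earlier (an illustrative calculation only; the single-encounter reliability assumed below is a hypothetical value, not one reported in these reviews), the Spearman–Brown formula can be rearranged to give the number of observations k needed to reach a target reliability \(\rho_k\):

\[
k = \frac{\rho_k\,(1 - \rho_1)}{\rho_1\,(1 - \rho_k)}
\]

If a single observed encounter has a reliability of about \(\rho_1 = 0.25\), then reaching \(\rho_k = 0.75\) requires \(k = (0.75 \times 0.75)/(0.25 \times 0.25) = 9\) encounters, in line with the 8–10 observations reported above.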


Bias is an inherent characteristic of expert judgement


Adequate reliability does not preclude bias in global judgements. Indeed, global judgements are prone to bias, probably much more so than more structured, analytical methods. With direct observation methods, inflation of scores has been noted. In multisource feedback, the selection or the background of assessors can introduce worrisome biases. Another potentially important source of bias is the assessment context. Assessors’ propensity to use only the positive part of the scale is heavily influenced by their desire not to compromise the relationship with the learner or to avoid the extra work (and trouble) consequent to negative evaluations. We need more research on biases in global judgements and on why and how they operate, but, for the time being, we must be aware of their presence and take appropriate precautions wherever possible. For example, in those situations where the learner and the assessor have a relationship (very instrumental in any good learning), we would suggest that measures be taken to protect the assessor. In direct observation methods, such measures could entail removing the summative aspect of the assessment from the individual encounter. The assessor’s task is not to judge whether the learner is a good doctor, but to judge what happens in a specific encounter, to feed this back in a way that helps the learner to improve performance and, finally, to document this in an appropriate way for later meaningful review by the learner and by others. This is not to imply that the information cannot be used summatively somewhere, somehow, later in the process; the point is to remove the pass/fail decision from the individual encounter. A high-stakes decision should be based on multiple sources of assessment within or across methods, and robustness lies in the aggregation of all that rich information. Wherever possible, we would encourage relieving the assessor of potentially compromising, multiple roles. In making high-stakes decisions based on aggregated information, protection could be provided by installing procedures that surpass the ‘power’ of the individual assessor. We will revisit this issue later.


Another important bias stems from self-assessment. The literature is crystal clear: we are very poor self-assessors, equally likely to underestimate as to overestimate ourselves. From a sampling perspective, this is not surprising. Self-assessment is inherently confined to a single assessment. In fact, the validity of a single self-assessment may not be so bad when it is compared with other single assessments. Nevertheless, sample size in self-assessment cannot be increased. The implication is that self-assessment can never stand on its own and should always be triangulated with other information. A continuous process of combining self-evaluations with information from others – such as in multisource feedback or in the reflection part of a portfolio – will hopefully pay off in the long run, and stimulate lifelong learning skills. However, even in continuous professional development, it is suggested that self-assessment should always be complemented by other assessments, an approach sometimes referred to as ‘directed self-assessment’.


Validity resides more in the users of the instruments than in the instruments that are used


We feel particularly strongly about this issue, because it is central and unique to assessment at the ‘does’ level and has profound practical implications. It complements our view on the earlier ‘built-in validity’ issue. At the lower layers of Miller’s pyramid, we can control much around test development and test administration; we can ‘sharpen’ the instrument as much as we like. At the ‘does’ level, however, assessment can only be as good as the job done by the assessors using the instrument. For example, the utility of an assessment will depend not so much on the operationalisation of the rating scale used in the direct observation as on the way the assessor and the learner deal with the information that emerges from the encounter. Conscientiousness is essential to the process of assessment and determines its value. Increased control of the noisy real world by standardising, structuring and objectifying is not the answer; on the contrary, it will only harm and trivialise the assessment. To improve, we must ‘sharpen’ the people rather than the instruments. Therefore, the quality of the implementation will be the key to success. Published research so far seems to indicate that we can do a much better job here: assessors are only rarely trained for their task and, if they are, training tends to be a brief, one-off event. Receiving and giving feedback requires skills that need to be trained, honed and kept up to date. From personal experience with assessor training, we know that the skills required are very similar to those needed for the doctor–patient encounter. Nevertheless, like communication skills, they are not part of every teacher’s make-up: they can and must be fostered.


Formative and summative functions are typically combined


In the preceding section, we already noted that, in assessment at the ‘does’ level, the summative functions are typically linked with the formative functions. Indeed, we would argue that, without formative value, the summative function would be ineffective, leading to trivialisation of the assessment. As soon as the learner sees no learning value in an assessment, it becomes trivial. If the purpose is narrowed to completing eight summative Mini-CEXs, learners will start to play the game and make their own strategic choices regarding moments of observation and selection of assessors. If the assessors join in the game, they will provide judgements without adequate information and return to their routines. If the main objective of the reflections in the portfolio is to please the assessment committee, the portfolio will lose all significance to the learner. We have seen similar things happen with logbooks. We argue that, whenever assessment becomes a goal in itself, it is trivialised and will ultimately be abandoned. Assessment has utility insofar as it succeeds in driving learning, is integrated into a routine and ultimately comes to be regarded as indispensable to the learning practice. For assessment to be effective, certain conditions need to be met. We know that feedback is often ignored and fails to reach the intended recipient, that positive feedback has more impact than negative feedback (which is not to imply that negative feedback has no value), that feedback directed at the person should be avoided and that task-oriented feedback is to be preferred. We know the rules of feedback and we know that a positive learning climate is essential. The literature suggests that successful feedback is conditional on social interaction, such as coaching, mentoring, discussing portfolios and mediation around multisource feedback, and this principle may well extend to all assessment at the ‘does’ level. It stipulates that assessment should be fully integrated into the learning process, firmly embedded within the training programme and serving a direct function in driving learning and personal development. The principle that assessment drives learning is strongly reinforced by the evidence around assessment, but we would argue that, at the top of the pyramid, it is the sine qua non of effective assessment.


Qualitative, narrative information carries a lot of weight


If feedback is central to assessment and if social interaction mediates effective feedback, numerical and quantitative information has obvious limitations, while narrative, qualitative information has benefits. This is also reported in empirical studies: narrative, descriptive and linguistic information is often much richer and more appreciated by learners. Inescapably, narrative and qualitative information is something the assessment field will have to get used to. The assessment literature is strongly associated with quantification, scoring, averaging, etc. – what Hodges calls the ‘psychometric discourse’. It is quite clear that a rating of 2 out of 5 on counselling skills in a patient encounter should raise some concern with the learner, but a mere numerical rating fails to disclose what the learner actually did and what she should do to improve. To enrich the assessment, we have an excellent tool: language. We would argue that effective formative assessment is predicated on qualitatively rich information. We should encourage instrument developers to ensure that all their instruments have built-in facilities to elicit qualitative information (e.g., space for narrative comments) and we should stimulate assessors to routinely provide and document such information. This argument is even more relevant if we wish to assess difficult-to-define, domain-independent competencies, such as professionalism. These competencies, in particular, have much to gain from enriched narrative information.


Summative decisions can be rigorous with non-psychometric qualitative research procedures


Looking beyond the psychometric discourse is also imperative if we wish to strengthen decisions based on information that is aggregated across assessment sources. Within the conventional psychometric discourse, we typically quantify: we calculate and average scores and grades, and determine the reliability and validity of decisions. However, as soon as information of different kinds is aggregated across all kinds of sources, psychometric evaluation is bound to fall short. We argue that aggregation in a programme of assessment (either at the ‘does’ level or across the full pyramid) depends on expert judgement. There are few situations in which purely quantitative strategies suffice and no further judgement strategies are required; as soon as one source of information is qualitative, quantitative strategies will be found wanting. In trying to force quantification, just as with any individual method, we inevitably incur the risk of trivialisation.


In our efforts to proceed beyond the psychometric discourse, we find inspiration in methodologies from qualitative research. As in quantitative research, rigour is built into qualitative research, but the terminology and procedures are different. Rigour depends on ‘trustworthiness’ strategies, replacing the conventional notions of internal validity by credibility, external validity by transferability, reliability by dependability and objectivity by confirmability. For each of these notions, methodological strategies are proposed that bring rigour to the research: prolonged engagement, triangulation, peer examination, member checking, structural coherence, time sampling, stepwise replication, audit and thick description. With some creativity, we can apply these strategies to assessment to achieve rigour of decision making. In Table 1, we list some examples of assessment strategies that mirror these trustworthiness strategies and criteria.

