Validity refers to an evidence-based claim about the trustworthiness of decisions made from context-specific performance data. Validity requirements for competency-based assessments in obstetrics and gynecology have not been defined in the literature. We explain why validity is intrinsic to any discussion of competency assessment and provide a model that obstetrics and gynecology programs can use to determine the essential validity evidence for various forms of assessment. Although validity evidence is essential, the requisite level and precision of that evidence depend on the implications of the decisions made from assessment results; not all assessments require the same degree of validity evidence. We build the discussion around specific assessment examples targeting progressive levels of expertise along the training continuum.
The mandate to assure clinical competence in obstetrics and gynecology will require assessment evidence demonstrating that trainees have met defined performance standards. That evidence may need to become more complex and rigorous as knowledge and skills increase and as the implications of competence become more pronounced. Validity and reliability are critical when judging the value of evidence derived from performance assessment, especially in simulation-based contexts where transfer of knowledge and skills to applied clinical settings is expected.
Historic treatments of validity divided it into distinct types, such as face validity, content validity, construct validity, and predictive validity, the last referring to the extent to which an assessment score predicts performance on some criterion measure (typically applied performance). Over the past decade, there has been a significant transformation in how psychometricians and measurement experts conceptualize validity, taking into account the contextual and situational factors that affect performance, as well as the paradoxical need for both greater flexibility and greater rigor in measuring human performance. The uses of these historic definitions have changed as a result. In this article, we focus on the conceptual nature of validity rather than its nomenclature, in large part because of the broad-scale rethinking taking place in the assessment and evaluation communities. By addressing the fundamental aspects of performance measurement, coupled with an understanding of how resulting scores are used for both low- and high-stakes assessments, we provide a framework that will withstand fluctuations in nomenclature without becoming methodologically obsolete. This framework offers a foundation, readily grasped by non-psychometricians, for determining the type and degree of validity evidence required for assessment in obstetrics and gynecology training programs.
Fundamentally, validity refers to decisions made from the interpretation of scores derived from assessment methods. It is a function of what is being measured (the construct), how it is being measured (the measurement tool and context), and how the resulting scores (data) will be used to make decisions. Consider a residency program director who wishes to assess the laparoscopic surgery knowledge of obstetrics and gynecology interns by having them perform a laparoscopic dissection of an ovarian endometrioma using a simulator with built-in scoring developed for minimally invasive surgery fellows. Although the assessment may have excellent reliability (consistency of measurement) and address aspects of laparoscopic surgical skill, drawing conclusions from these scores (eg, pass/fail/remediate) would lead to an invalid decision about the interns' laparoscopic surgery knowledge, because the task assesses only a subset of that knowledge using psychomotor performance measures. Nor would it be defensible to base decisions about the interns' ability to perform a simpler laparoscopic procedure (eg, laparoscopic tubal ligation) on these scores, because the assessed task is too difficult for their level of training. However, decisions made from these same assessment scores might very well be valid (accurate) with respect to the laparoscopic surgical skills of minimally invasive surgery fellows, because the task is appropriate for their skill level. Therefore, validity is neither a characteristic of the measurement tool nor of the scores derived from it. Validity is a characteristic of the decisions made from those scores within the specific construct (what is being measured). The construct, then, is the key issue when considering validity.
Constructs and assessment
The intention of assessment is to measure an underlying construct in a quantifiable way. The challenge is that although a construct may be described, it may not be directly measurable. Examples of broad constructs include family planning, surgical procedures, and teamwork. Some constructs may overlap with others. For example, suturing skill may be considered a construct in its own right (eg, intracorporeal, extracorporeal, laparoscopic, robot-assisted), or suturing may be considered a component of another construct (eg, cesarean delivery, fourth-degree laceration repair). Medical constructs may be relatively complex, and consequently defining them can be quite challenging.
The components that make up a construct provide the basis for assessment. Some components lend themselves well to direct, quantifiable measurement because they reflect overt performance (eg, the elapsed time between the decision to perform an emergency cesarean delivery and the incision). Other components are less amenable to direct measurement because they reflect covert performance (eg, clinical reasoning). Constructs that include covert components are extremely difficult to assess because they require interpreting overt performance to infer what is happening covertly (eg, a program director might infer a trainee's clinical reasoning from overt responses to examination questions or from performance during patient care). An overt performance used to infer a covert component of a construct is referred to as an indicator behavior. If indicator behaviors are used to assess a construct, there must be a high correlation between the overt performance and the covert component (eg, scores for clinical reasoning on an examination and during patient care should be highly correlated because ostensibly they measure the same construct).
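To make the correlation requirement concrete, the brief sketch below (plain Python, with entirely hypothetical scores and sample size) computes the Pearson correlation between examination-based and observation-based clinical-reasoning scores for the same trainees; a coefficient near 1.0 would support using the examination as an indicator behavior for the covert construct, whereas a weak correlation would undermine that use.

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Hypothetical clinical-reasoning scores for eight trainees,
# measured two ways: a written examination and observed patient care.
exam_scores = [62, 71, 75, 80, 83, 88, 90, 95]
patient_care_ratings = [3.1, 3.4, 3.6, 3.9, 4.0, 4.3, 4.4, 4.7]

# If both overt measures track the same covert construct,
# the correlation should be high (close to 1.0).
r = correlation(exam_scores, patient_care_ratings)
print(f"Pearson r = {r:.2f}")
```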
Validity is a measure of how well the assessment reflects the construct. The construct must therefore be clearly defined for an aligned assessment to provide convincing evidence of competence. Ideally, an assessment maps the entire construct with 100% confidence that all components are accurately captured and reflected in the assessment data and that all irrelevant information is excluded. Figure 1 illustrates how the multiple components comprising the construct for placing an intrauterine device (IUD) are captured by the assessment. However, even if the construct is well defined, aligning an assessment to map it adequately requires care.
Validity may be limited if the assessment underrepresents the construct. Consider the IUD placement construct and an assessment composed only of securely inserting the IUD into a simulated uterus. The construct is underrepresented because other components are not included in the assessment (Figure 2), and the validity of decisions based on the results will be limited because the assessment does not adequately map the defined construct. For instance, an intern may successfully insert the IUD in the simulated uterus, and that skill may transfer to an actual patient, but the intern has not demonstrated the ability to manage potential complications such as a perforated uterus (among other components). An assessment that underrepresents a construct will lead to a misrepresentation of competence.
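As a rough illustration of construct mapping (the component list is hypothetical and not drawn from the article's figures), the following sketch compares the components of a defined construct against those an assessment actually samples and reports any gap; unassessed components signal underrepresentation, and sampled components outside the construct signal overrepresentation.

```python
# Hypothetical components of the IUD placement construct.
construct_components = {
    "counseling and informed consent",
    "safe insertion technique",
    "management of complications (eg, perforation)",
    "follow-up care",
}

# Components actually sampled by a narrow, simulator-only assessment.
assessed_components = {"safe insertion technique"}

missing = construct_components - assessed_components
extraneous = assessed_components - construct_components

if missing:
    print("Construct underrepresented; unassessed components:")
    for component in sorted(missing):
        print(f"  - {component}")
if extraneous:
    print("Construct overrepresented; extraneous components:", sorted(extraneous))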
Overrepresentation of the construct can occur when an assessment captures not only all components of the construct but also extraneous, unessential information. This extraneous information likewise limits the validity of decisions based on the assessment results, because aggregate scores may imply either unmerited competence (when they include high performance on extraneous components) or unmerited incompetence (when they include low performance on extraneous components). Using the IUD placement example (Figure 3), an intern may perform well on the safe insertion of an IUD, management of complications, and follow-up care components, but very poorly on knowledge of contraceptive failure rates and of the history and current costs of IUDs, such that her or his total score falls below a set competency standard. Decisions about the intern's competence on the construct would then have limited validity because the construct was overrepresented. Likewise, the intern could perform poorly on the safe insertion, management of complications, and follow-up care components, but very well on knowledge of contraceptive failure rates and of the history and current costs of IUDs, such that her or his total score exceeds the set competency standard. Decisions made from these results would also have limited validity.
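The arithmetic behind this limitation can be shown with a deliberately simple sketch (all component scores, the equal weighting, and the cut score are hypothetical): averaging extraneous components into the total can flip the pass/fail decision even though performance on the construct-relevant components is unchanged.

```python
# Hypothetical scores (0-100) for an intern on an IUD placement assessment.
relevant = {                      # components that belong to the construct
    "safe insertion": 90,
    "management of complications": 85,
    "follow-up care": 88,
}
extraneous = {                    # components outside the defined construct
    "contraceptive failure rates": 40,
    "history and cost of IUDs": 35,
}
CUT_SCORE = 75                    # hypothetical competency standard

def mean(scores):
    return sum(scores.values()) / len(scores)

construct_only = mean(relevant)                      # 87.7 -> pass
overrepresented = mean({**relevant, **extraneous})   # 67.6 -> fail

print(f"Construct-relevant mean: {construct_only:.1f} "
      f"({'pass' if construct_only >= CUT_SCORE else 'fail'})")
print(f"Overrepresented mean:    {overrepresented:.1f} "
      f"({'pass' if overrepresented >= CUT_SCORE else 'fail'})")
```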