The ICS defined symptom as “any morbid phenomenon or departure from normal in structure, function or sensation, possibly indicative of a disease or health problem. Symptoms are either volunteered by, or elicited from the individual, or may be described by the individual’s partner or caregiver.”36 Traditionally, clinicians obtain the patient’s history to understand the patient’s symptoms in relation to his or her health condition. However, traditional history taking usually fails to assess the patient’s perception of the condition and its impact on daily activities, and it is at risk of clinician bias when the severity of these symptoms is interpreted. Urogynecologic symptoms, as perceived by patients, do not always provide a definitive diagnosis. Through a standardized method of data collection, patient-reported outcomes (PRO) provide clinicians with a more objective clinical review of patients’ experiences of their symptoms.
Why Use Questionnaires?
PRO, a term introduced by the U.S. Food and Drug Administration (FDA), is any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else.37,38 In the United Kingdom, it is sometimes known as patient-reported outcome measures (PROM). In clinical trials, a PRO instrument or PRO questionnaire can be used to measure the impact of an intervention on one or more aspects of a patient’s health status (PRO concepts), ranging from purely symptomatic (e.g., vaginal bulge) to more complex concepts (e.g., ability to carry out activities of daily living), to extremely complex concepts such as quality of life, which is widely understood to be a multidomain concept with physical, psychological, and social components. Data generated by a validated PRO instrument can provide evidence of a treatment benefit or risk from the patient perspective, thereby informing the relative effectiveness and quality of treatment. The use of PRO helps provide a framework to agree on treatments and their goals as well as to inform decisions about treatment options and assess treatment outcomes.
The growing prominence of PRO reflects a shift in focus from clinical outcomes, often related solely to survival and complications, to outcomes that include the patient’s perspective. PRO tools can bridge the disconnect that sometimes occurs between what the observer deems important and what the patient considers important with regard to symptom management and the balance between relief and quality of life. The importance of PRO is evident in the wide recognition they have received from major health care providers and organizations, such as the FDA.
Psychometric Properties of Questionnaires
A PRO questionnaire needs to be psychometrically robust: it should measure the concepts it claims to measure, do so with a consistent measuring process, and be able to detect a change in health status when change has occurred. An appropriately selected PRO tool is applicable both to the clinical problem of interest and to the population being studied. It should ideally be acceptable and feasible, that is, not too lengthy and easy to administer, which is usually confirmed by pilot testing. Most PRO tools are designed to be self-administered in pen-and-paper or Web-based electronic format,39 although telephone interviews40 are sometimes used. An additional aspect worth considering before deciding on which questionnaire to use is the recall period (the period of time patients are asked to consider in responding to a PRO item), over which various factors can affect the patient’s memory. Shorter recall periods may underestimate symptom burden, especially if symptoms have diurnal or day-to-day fluctuation, placing undue burden on patients. Longer recall periods are at risk for either over- or underestimating the health state. Further, questions from a PRO instrument should not be used in isolation, modified, or changed in order or content, because such changes may alter the instrument’s psychometric properties and invalidate its score.41
Validated PRO instruments must demonstrate robust psychometric properties, including reliability, validity, and responsiveness.42 Reliability refers to the ability of a measure to produce similar results when assessments are repeated. Reliability is critical to ensure that change detected by the measure is due to the treatment or intervention and not due to measurement error. It reflects the measure’s ability to provide reproducible results, free from random errors of measurement. One measure of reliability is the questionnaire’s internal consistency, which indicates how well individual items within the same domain correlate. Cronbach’s alpha assesses internal consistency: higher alphas indicate greater correlation, and a Cronbach’s alpha greater than 0.7 generally indicates good internal consistency.42 Test-retest reliability (also called reproducibility or repeatability) indicates how well results can be reproduced with repeated testing. It demonstrates stability of scores over time when no change is expected in the concept of interest. The Spearman correlation coefficient and the intraclass correlation coefficient are used to demonstrate reproducibility; a value of at least 0.7 for either coefficient indicates good test-retest reliability.42 Inter-rater reliability indicates how well scores correlate when a measure is administered by different interviewers or when multiple observers rate the same phenomenon. Demonstration of inter-rater reliability is not necessary for self-administered questionnaires but is required for instruments based on observer ratings or using multiple interviewers. A correlation of at least 0.8 between raters indicates good inter-rater reliability.
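As an illustration of the internal consistency threshold described above, the following is a minimal sketch of a Cronbach’s alpha calculation. It assumes the conventional formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), where k is the number of items in the domain; the formula is not stated in this chapter, and the function name and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one domain.

    responses: list of respondents, each a list of item scores
    (same number of items for every respondent).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Sample variance of each respondent's total (summed) score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 respondents answering a 3-item domain.
example_responses = [
    [3, 4, 3],
    [1, 2, 1],
    [4, 4, 5],
    [2, 2, 2],
    [5, 4, 4],
]
alpha = cronbach_alpha(example_responses)
meets_threshold = alpha > 0.7  # threshold quoted in the text
```

With real data, missing responses would need to be handled before the calculation; this sketch simply rejects incomplete rows.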
Validity is the ability of an instrument to measure what it was intended to measure.42 A measure should be validated for the specific condition or outcome for which it will be used. An instrument designed to assess stress incontinence would not be valid for overactive bladder (OAB) unless it were specifically validated in patients with OAB symptoms. Content validity, convergent validity, discriminant validity, and criterion validity are required to validate a questionnaire.
Content validity is a qualitative assessment of whether the questionnaire captures the range of the content it is intended to measure. For example, does a measure of symptom severity capture all the symptoms that patients with a particular condition have, and if so, is the measure capturing the items in a manner meaningful to patients in a language patients can understand? To establish content validity, patients review the measure and provide feedback as to whether the questions are clear, unambiguous, and comprehensive.
Construct validity is made up of convergent and discriminant validity. Construct validity is the appropriateness of inferences made on the basis of observations or measurements (for example, test scores), specifically whether a test measures the intended construct. It examines whether the measure behaves as theory says a measure of that construct should behave. It is evidence that relationships among items, domains, and concepts conform to an a priori hypothesis concerning logical relationships that should exist with measures of related concepts or scores produced in similar or diverse patient groups.
Convergent validity is a quantitative assessment of whether the questionnaire measures the theoretical construct it was intended to measure. It refers to the degree to which two measures of constructs that theoretically should be related are in fact related. Convergent validity indicates whether a questionnaire has stronger relationships with similar concepts or variables. Stronger relationships should be seen with the most closely related constructs and weaker relationships with less-related constructs.
Discriminant validity indicates whether the questionnaire can differentiate between known patient groups (e.g., those with mild, moderate, or severe disease). Generally, measures that are highly discriminative are also highly responsive. It tests whether concepts or measurements that are supposed to be unrelated are in fact unrelated.
Criterion validity reflects the correlation between the new questionnaire and an accepted reference, or gold standard. If the gold standard measure is not available, criterion validity cannot be established. Concurrent and predictive validity are two types of criterion-related validity. Concurrent validity applies to validation studies in which two measures are administered simultaneously or approximately at the same time,43 whereas in predictive validity,44 one measure occurs earlier and is meant to predict a later measure. When criterion validity can be established with an existing measure, the correlation should be 0.40 to 0.70; correlations approaching 1.0 indicate that the new questionnaire may be too similar to the gold standard measure and therefore redundant.
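The 0.40 to 0.70 range above can be checked with a short calculation. The sketch below assumes a Pearson correlation between scores on the new questionnaire and the reference measure; the chapter does not specify which correlation coefficient is used, and the function names and wording of the returned labels are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def interpret_criterion_r(r, low=0.40, high=0.70):
    """Compare a correlation with the 0.40 to 0.70 range quoted in the text."""
    if r < low:
        return "below the suggested range"
    if r <= high:
        return "within the suggested range"
    return "above the suggested range; possible redundancy as r approaches 1.0"


# Hypothetical paired scores: new questionnaire vs. reference measure.
new_scores = [1, 2, 3, 4, 5]
reference_scores = [2, 1, 4, 3, 5]
r = pearson_r(new_scores, reference_scores)
verdict = interpret_criterion_r(r)
```

A rank-based coefficient could be substituted for ordinal scores; the interpretation step would be unchanged.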
Responsiveness is the ability of an instrument to detect change over time in the construct to be measured. An aspect of responsiveness is determining not only whether the measure detects (statistically significant) change but whether the change is meaningful to the patient. The minimally important difference (MID) is the smallest change in a PRO questionnaire score that would be considered meaningful or important to a patient.45 The MID for a given PRO measure may vary with the population and context (e.g., conservative or surgical intervention), so the setting in which the MID was established should be considered.46 Determining the MID is an iterative process that involves two methodologies: the anchor-based approach and the distribution-based approach. The anchor-based approach uses an external indicator, or anchor, to classify individuals into groups according to degree and direction of change. Through an appropriate anchor, individuals are classified as having experienced no change, small change (positive or negative), or large change (positive or negative). The MID is estimated as the mean difference in PRO score among patients in the small change groups. The most commonly used anchor is the patient-reported global rating of change. The distribution-based approach estimates the MID from the statistical distribution of the data, using analyses such as effect size, one-half standard deviation, and standard error of measurement. It is at best an indirect method of estimating the MID and is typically used when the anchor-based approach is not possible. An important disadvantage of the distribution-based method is that it does not allow direct calculation of the MID, but a standardized mean difference of about 0.5 (i.e., a half standard deviation) is likely to be at least the MID,47 which corresponds to the widely accepted criterion of a medium effect size.48 Ideally, MIDs are established using both anchor-based (with multiple clinical and patient-based anchors) and distribution-based methods. Nevertheless, it has been recommended that the anchor-based approach produce the primary evidence for the MID and that the distribution-based approach provide secondary or supportive evidence for that MID.46
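The two approaches to estimating the MID can be sketched as follows. This is an illustration under stated assumptions, not a validated method: the anchor group label, the function names, and all scores are hypothetical, and the standard error of measurement is computed with the conventional formula SD × √(1 − reliability), which this chapter does not spell out.

```python
from math import sqrt
from statistics import mean, stdev


def anchor_based_mid(score_changes, anchor_groups, small_label="small improvement"):
    """Mean change in PRO score among patients whose anchor
    (e.g., a global rating of change) indicates a small change."""
    small = [
        change
        for change, group in zip(score_changes, anchor_groups)
        if group == small_label
    ]
    if not small:
        raise ValueError("no patients in the small-change group")
    return mean(small)


def distribution_based_mid(baseline_scores, reliability):
    """Distribution-based estimates: one-half standard deviation and
    standard error of measurement (assumed formula: SD * sqrt(1 - reliability))."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    sd = stdev(baseline_scores)
    return {"half_sd": 0.5 * sd, "sem": sd * sqrt(1 - reliability)}


# Hypothetical data: change in PRO score and each patient's anchor rating.
changes = [0, 1, 10, 12, 25, 8]
ratings = [
    "no change",
    "no change",
    "small improvement",
    "small improvement",
    "large improvement",
    "small improvement",
]
primary_estimate = anchor_based_mid(changes, ratings)

# Hypothetical baseline scores and test-retest reliability coefficient.
supporting_estimates = distribution_based_mid([40, 50, 60, 70, 80], reliability=0.84)
```

In keeping with the recommendation above, the anchor-based value is treated here as the primary estimate and the distribution-based values as supporting evidence.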
PRO questionnaires are often used in a number of different populations and settings, but these instruments and their psychometric properties may not necessarily be transferable. Linguistic and cultural adaptation of a questionnaire can occur during the development phase before validation, or it can be done after validation in its original language. Affirmation of a PRO instrument’s linguistic and cultural validity is important for its use in multinational clinical trials and when data are pooled in a meta-analysis. Linguistic and cultural adaptation of a PRO instrument generally involves two forward translations, followed by quality control procedures such as backward translation into the original language and adjudication of all translated versions, with discussion by an expert panel to ensure clarity of the translated questionnaire, and then testing of the translated instrument in monolingual or bilingual patients to ensure it measures the same concepts as the original instrument.49
Screening or Detection
Screeners, or screening questionnaires, that may be used to detect patients who might have POP or PFD before a clinical examination have their origin in 1989, when the World Health Organization (WHO) conducted a meeting to develop specific questions about chronic obstetric morbidities.68 The final seven questions could identify 80% to 90% of moderate to severe vaginal prolapse:
Do you feel anything coming out of your vagina?
Do you have pain or difficulty in urinating?
Is it uncomfortable down below?
Do you have a feeling of heaviness?
Do you feel any swelling down below when you urinate or move your bowels?
Do you need to manipulate it to urinate or defecate?
Do you have any difficulty with intercourse?
A single screening question,69 “Do you have a bulge or something falling out that you can see or feel in your vaginal area?” has 96% sensitivity and 79% specificity for prolapse beyond the level of the hymen. The Epidemiology of Prolapse and Incontinence Questionnaire70 screens well for pelvic floor disorders, including prolapse, stress incontinence, OAB, and anal incontinence. Its positive and negative predictive values are 76% and 97%, respectively, for prolapse; 88% and 87% for stress incontinence; 77% and 90% for OAB; and 61% and 91% for anal incontinence. Nevertheless, these screening tools should not be misinterpreted as diagnostic tools, even when their cutoff scores are met.
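Sensitivity and specificity describe the screening question itself, whereas predictive values also depend on how common the condition is in the population screened. The sketch below applies Bayes’ rule to the 96% sensitivity and 79% specificity quoted above; the 10% prevalence is a hypothetical value chosen only for illustration, and the function name is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a screening test applied at a given prevalence.

    All three arguments are proportions between 0 and 1.
    """
    for value in (sensitivity, specificity, prevalence):
        if not 0 <= value <= 1:
            raise ValueError("arguments must be proportions between 0 and 1")

    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)

    if true_pos + false_pos == 0 or true_neg + false_neg == 0:
        raise ValueError("predictive values are undefined for these inputs")

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity from the text; prevalence is hypothetical.
ppv, npv = predictive_values(sensitivity=0.96, specificity=0.79, prevalence=0.10)
```

At a low assumed prevalence, a positive answer leaves considerable uncertainty while a negative answer is strongly reassuring, which is consistent with treating these tools as screeners rather than diagnostic tests.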