Standardization of Terminology, Validated Questionnaires, and Outcome Assessments

Joseph Kim-sang Lee

Bernard T. Haylen


Standardized terminology, as used in communications by scientifically focused societies such as the International Continence Society (ICS) and the International Urogynecological Association (IUGA), is as basic and important as a dictionary is to the wider society.1 Terminology documents, carefully prepared by experts in a long, consensus-based process, provide a valuable reference for society members and others wishing to publish in the different relevant journals. More importantly, everyone is “speaking the same language” both in their clinical practices and when presenting their research endeavors at scientific meetings and in any written form. Since February 2019, all the relevant terms (1,455 as of February 2021) from both the earlier-mentioned societies are instantaneously available, with their reference documents in a digital format, the ICS Glossary2—

History of Standardization

International Continence Society (1972 to 2008)

In 2020, the ICS completed 50 years of Standardization Committee history.2 Standardization was a priority of the early ICS, with the first Standardization Committee formed in 1972 under the chairmanship of Tage Hald with committee members Patrick Bates, Hansjorg Melchior, Art Sterling, David Rowan, Derek Griffiths, and Eric Glen (Fig. 11.1). Between 1972 and 1980, there were three reports on the terminology for lower urinary tract (LUT) dysfunction (1974 [Annual General Meeting (AGM), Mainz], 1976, 1980). A fourth report was produced in 1983, adding Torsten Sundin, David Thomas, Michael Torrens, Richard Turner-Warwick, and Norman Zinner to the authors of the 1980 report.2

The original committee (1980 report) retired in 1983, with Jens Thorup Andersen (1983 to 1991) taking over as chair. He was followed by Anders Mattiasson (1991 to 1998), Derek Griffiths (1998 to 2000), and Philip van Kerrebroeck (2000 to 2007).

The 1998 LUT report3 collated the six LUT/pelvic floor reports published up to that time. Other documents leading up to the 2002 LUT report include (1) technical aspects of urodynamic equipment,4 (2) LUT rehabilitation techniques,5 (3) pelvic organ prolapse (POP) and pelvic floor dysfunction (PFD),6 (4) LUT function: pressure-flow studies,7 (5) standardization of outcome studies: general principles,8 (6) outcome measures in adult women with symptoms of LUT dysfunction,9 (7) standardization of definitions of LUT dysfunction in children,10 (8) neurogenic LUT dysfunction,11 (9) standardization of urodynamic ambulatory monitoring,12 (10) treatment of males with symptoms of LUT dysfunction,13 (11) nocturia,14 and four ICS outcome reports.

The 2002 ICS terminology for LUT dysfunction report15 incorporated male and female terminology and became the most referenced terminology document in ICS and LUT dysfunction history (by 2020, more than 11,000 citations in two journals). Standardization documents between 2002 and 2010 include (1) good urodynamic practices16 and (2) pelvic floor muscle function and dysfunction.17 Figure 11.2 shows the past and present ICS standardization chairs.

International Urogynecological Association (1999 to 2008)

Although the early history of the IUGA Terminology and Standardization (T & S) Committee is not so well documented, it is commonly accepted that the original chair was Professor Ulf Ulmsten from Sweden. The committee was set up under the guidance of Dr. Harold Drutz from Canada, president and chairman of IUGA committees in 1999 to 2000. Early members included Peter Sand (United States), Bob Freeman (United Kingdom), and Eckhard Petri (Germany), all of whom were to become IUGA presidents. This perhaps indicates the importance of this committee within IUGA.

As noted earlier, around 2000 to 2002, the ICS was developing a report on “The Standardization of Terminology of Lower Urinary Tract Function,” encompassing terminology for men, women, and children in the one document.15 Ulf Ulmsten, as T & S chair and a coauthor of the ICS document,15 would have been updating the IUGA T & S committee on its development, publication, and implementation. Bob Freeman took over the chair from 2003 to 2005, at which time there would have been no particular incentive for IUGA to seek alternative terminology documents. Steven Swift (United States) served as IUGA T & S chair from 2006 to 2008. At that time, the Pelvic Organ Prolapse Quantification (POP-Q) system from 1996 was viewed as somewhat complex.6 An emphasis of the chair was to abridge the questionnaire and create a simplified POP-Q.18

International Continence Society-International Urogynecological Association (2007 to 2017)

In 2007, an excellent working relationship developed between ICS Standardization Steering Committee Chair (2007 to 2010) Dirk de Ridder and the IUGA T & S Committee Chair (2008 to 2014) Bernard Haylen, which led to the publication of the IUGA-ICS terminology for PFD,19 the most cited IUGA terminology publication (and the second most cited ICS publication, with more than 3,800 citations), in January 2010. This female-specific text20 notably added three “most common” diagnoses (bladder oversensitivity, recurrent urinary tract infections, and voiding dysfunction) to the three existing female diagnoses in the 2002 report (urodynamic stress incontinence, detrusor overactivity, POP). The terminology for PFD,19 the initial product of that IUGA-ICS collaboration, was published simultaneously in the International Urogynecology Journal and Neurourology and Urodynamics. Its value in consolidating the definitions for symptoms, signs, investigations, imaging, and the six most common diagnoses has seen it become the core female PFD terminology document. It is the document most commonly cited in Neurourology and Urodynamics and International Urogynecology Journal publications to confirm compliance with IUGA-ICS terminology. It was the forerunner to further joint IUGA-ICS publications and a template for other ICS publications.

The IUGA-ICS standardization and terminology relationship was maintained until 2017 with seven other joint documents initiated: (1) complications for prostheses and grafts (Fig. 11.3),21 (2) reporting outcomes of surgical procedures for POP,22 (3) complications of native tissue surgery,23 (4) POP,24 (5) conservative and nonpharmacologic management of female PFD,25 (6) female anorectal dysfunction,26 and (7) sexual health in women with PFD.27

The document on complications related to the insertion of prostheses and grafts in pelvic floor surgeries,21 published in January 2011, was a timely acknowledgment of the widespread clinical issues arising from the greatly expanded use of synthetic products in prolapse surgeries, in particular from around 2005. A category, time, and site (CTS) classification, as noted in Figure 11.3, was created. This is the third most cited IUGA document (626 citations), with the POP document24 (774 citations) the second most cited. Its native tissue equivalent23 was published in April 2012, along with the important document on definitions, standardized outcome measures, and how to present results of prolapse surgeries.22 The latter should be used in all intervention studies of prolapse surgeries.

There were two non-IUGA-ICS documents initiated between 2010 and 2016, when Marcus Drake (2010 to 2016) was Standardization Steering Committee chair: (1) update on good urodynamic practice28 (2016) and (2) chronic pelvic pain29 (2017). Figure 11.4 shows the past and present IUGA T & S chairs.

The ICS-IUGA partnership now has a collection of eight key terminology documents. These have allowed tremendous interactions between members of the working groups and between the two societies. All chairs will attest to the different “labors of love” in producing these documents. Both societies are in a strong position going forward in regard to terminology and standardization. Ongoing principles are that (1) documents be of the highest quality, contemporary, interesting, and a valuable contribution to the academic wealth of the respective organizations; and (2) definitions be accurate, concise, and—unless there is good cause.

International Continence Society (2017 to 2020)

Under ICS Standardization Chair Bernard Haylen (2016 to 2020), the emphasis was to update male-specific terminology while initiating further female-specific and other projects. Adult neurogenic LUT dysfunction30 and nocturnal LUT function31 were brought to completion. The first core male terminology for LUT and pelvic floor symptoms32 and dysfunction was published in February 2019 with 390 definitions, of which 211 (54%) were new. In February 2019, following the completion of the male terminology paper, the ICS Glossary was launched (Haylen: compiler and Glossary Editor). In 2020, published documents added a further 321 new definitions to the ICS Glossary33 (1,455 total). These include those for (1) single-use absorbent pads,34 (2) female pelvic floor fistula,35 and (3) male LUT surgery.

How Has the Terminology for Female Lower Urinary Tract Dysfunction Progressed?

The 2002 ICS Terminology Report15 was the first and last to combine female and male terminology in a single document. It has certainly been heavily cited (more than 11,000 citations). The missing diagnoses (voiding dysfunction, recurrent urinary tract infections, and bladder oversensitivity)20 were a problem; there had been a natural male bias to terminology because a majority of authors were urologists. Overall, male diagnoses are oriented toward sensory and voiding dysfunctions including detrusor overactivity and bladder outlet obstruction. Leading female diagnoses are urodynamic stress incontinence, POP, and voiding dysfunction. Only with a female-only core terminology report19 could eight related areas be properly addressed. These include the terminology for prostheses and grafts21 and native tissue surgeries,23 surgical outcomes,22 POP,24 conservative and nonpharmacologic management,25 anorectal dysfunction,26 sexual health,27 and pelvic floor fistulas.35 In turn, it has provided the model for revising male terminology,32 the benefits of which will be the update of the core female terminology document19 and the exploration of specialty male areas, beginning with LUT surgery. The availability of the ICS Glossary means that there is no excuse for using incorrect terminology. This is particularly relevant in rating verbal or written research presentations.

Where Is Standardization Heading?

Standardization, overall, is going from strength to strength. Both ICS and IUGA have, over time, provided an example to other societies and other specialties of the importance of defining as many of the areas relevant to the society as possible. This has been possible only because of a culture in which standardization is regarded as one of the most important aspects of the society. It relies on the dedication of the chairs and members of the ICS/IUGA Standardization Steering Committees and the members of working groups over the last 50 years. Each chair and the members of each committee have provided a legacy for those who follow. Although excellent cooperation has occurred in developing reports, strong debate has not been omitted along the way. “The standardization reports are living documents and modification and change is always possible in the future,” as reported in the 40-year ICS anniversary report. They have enhanced the stature of the ICS and IUGA as well as the citation index of their two journals, Neurourology and Urodynamics and the International Urogynecology Journal.


The ICS defined symptom as “any morbid phenomenon or departure from normal in structure, function or sensation, possibly indicative of a disease or health problem. Symptoms are either volunteered by, or elicited from the individual, or may be described by the individual’s partner or caregiver.”36 Traditionally, clinicians obtain the patient’s history to understand the patient’s symptoms in relation to his or her health condition. However, traditional history taking usually fails to assess the perception and impact that the patient’s condition has on his or her daily activities and is at risk of clinician bias when the severity of these symptoms is interpreted. Urogynecologic symptoms, as perceived by patients, do not always provide a definitive diagnosis. Through a standardized method of data collection, patient-reported outcomes (PRO) provide clinicians with a more objective rather than subjective clinical review of patients’ experiences of their symptoms.

Why Use Questionnaires?

PRO, a term introduced by the U.S. Food and Drug Administration (FDA), is any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else.37,38 In the United Kingdom, it is sometimes known as patient-reported outcome measures (PROM). In clinical trials, a PRO instrument or PRO questionnaire can be used to measure the impact of an intervention on one or more aspects of a patient’s health status (PRO concepts), ranging from purely symptomatic (e.g., vaginal bulge) to more complex concepts (e.g., ability to carry out activities of daily living), to extremely complex concepts such as quality of life, which is widely understood to be a multidomain concept with physical, psychological, and social components. Data generated by a validated PRO instrument can provide evidence of a treatment benefit or risk from the patient perspective, thereby informing the relative effectiveness and quality of treatment. The use of PRO helps provide a framework to agree on treatment and its goals as well as to inform decisions about treatment options and assess treatment outcomes.

The growing prominence of PRO reflects a shift in focus from clinical outcomes often related solely to survival and complications to outcomes that include the patient’s perspective. PRO tools can bridge the disconnect that sometimes occurs between what the observer deems important and what the patient considers important with regard to symptom management and the balance between relief and quality of life. The importance of PRO is evident in the wide recognition it has received from major health care providers and organizations, such as the FDA.

Psychometric Properties of Questionnaires

A PRO questionnaire needs to be psychometrically robust: it must measure the concepts it claims to measure, use a consistent measuring process, and be able to detect change in health status when change has occurred. An appropriately selected PRO tool is applicable to the particular clinical problem of interest as well as to the appropriate population. It should ideally be acceptable and feasible, being not too lengthy and easy to administer, which is usually confirmed by pilot testing. Most PRO tools are designed to be self-administered in pen-and-paper or Web-based electronic format,39 although telephone interviews40 are sometimes used. An additional aspect worth considering before deciding which questionnaire to use is the recall period (the period of time patients are asked to consider when responding to a PRO item), over which various factors can affect patients’ memory. Shorter recall periods may underestimate symptom burden, especially if symptoms have diurnal or day-to-day fluctuation, placing undue burden on patients. Longer recall periods are at risk for either over- or underestimating the health state. Further, individual questions from a PRO instrument should not be used in isolation, modified, or reordered, because doing so may alter the instrument’s psychometric properties and invalidate its score.41

Validated PRO instruments must demonstrate robust psychometric properties, which include reliability, validity, and responsiveness.42 Reliability refers to the ability of a measure to produce similar results when assessments are repeated. Reliability is critical to ensure that change detected by the measure is due to the treatment or intervention and not due to measurement error. It reflects the measure’s ability to provide reproducible results, free from random errors of measurement. One measure of reliability is the questionnaire’s internal consistency, which indicates how well individual items within the same domain correlate. Cronbach’s alpha assesses internal consistency, with higher alphas indicating greater correlation and an alpha greater than 0.7 generally indicating good internal consistency.42 Test-retest reliability (also called reproducibility or repeatability) indicates how well results can be reproduced with repeated testing. It demonstrates stability of scores over time when no change is expected in the concept of interest. The Spearman correlation coefficient and the intraclass correlation coefficient are used to demonstrate reproducibility, with a value of at least 0.7 for either coefficient indicating good test-retest reliability.42 Inter-rater reliability indicates how well scores correlate when a measure is administered by different interviewers or when multiple observers rate the same phenomenon. Demonstration of inter-rater reliability is not necessary for self-administered questionnaires but is required for instruments based on observer ratings or using multiple interviewers. A correlation of at least 0.8 between raters indicates good inter-rater reliability.
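As an illustration of the internal consistency calculation described above, the following is a minimal sketch of Cronbach's alpha using the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the response data are hypothetical and are not taken from any questionnaire discussed in this chapter.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one questionnaire domain.

    item_scores: list of respondents, each a list of k item scores.
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each individual item across respondents
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of each respondent's summed (total) score
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents answering a 4-item domain (scores 1 to 5)
responses = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 2, 2, 3],
    [5, 4, 5, 5],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
print("meets the > 0.7 threshold:", alpha > 0.7)
```

In practice, alpha would be computed per domain on a much larger validation sample; the five respondents here only make the arithmetic easy to follow.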

Validity is the ability of an instrument to measure what it was intended to measure.42 A measure should be validated for the specific condition or outcome for which it will be used. An instrument designed to assess stress incontinence would not be valid for overactive bladder (OAB) unless it were specifically validated in patients with OAB symptoms. Content validity, convergent validity, discriminant validity, and criterion validity are required to validate a questionnaire.

Content validity is a qualitative assessment of whether the questionnaire captures the range of the content it is intended to measure. For example, does a measure of symptom severity capture all the symptoms that patients with a particular condition have, and if so, is the measure capturing the items in a manner meaningful to patients in a language patients can understand? To obtain content validity, patients review the measure and provide feedback as to whether the questions are clear, unambiguous, and comprehensive.

Construct validity is made up of convergent and discriminant validity. Construct validity is the appropriateness of inferences made on the basis of observations or measurements, for example, test scores, specifically whether a test measures the intended construct. It examines whether the intended measures behave as the theory says a measure of that construct should behave. It is evidence that relationships among items, domains, and concepts conform to an a priori hypothesis concerning logical relationships that should exist with measures of related concepts or scores produced in similar or diverse patient groups. Convergent validity is a quantitative assessment of whether the questionnaire measures the theoretical construct it was intended to measure. It refers to the degree to which two measures of constructs that theoretically should be related are in fact related. Convergent validity indicates whether a questionnaire has stronger relationships with similar concepts or variables. Stronger relationships should be seen with the most closely related constructs and weaker relationships seen with less-related constructs. Discriminant validity indicates whether the questionnaire can differentiate between known patient groups (e.g., those with mild, moderate, or severe disease). It also tests whether concepts or measurements that are supposed to be unrelated are in fact unrelated. Generally, measures that are highly discriminative are also highly responsive.

Criterion validity reflects the correlation between the new questionnaire and an accepted reference, or gold standard. If the gold standard measure is not available, criterion validity cannot be established. Concurrent and predictive validity are two types of criterion-related validity. Concurrent validity applies to validation studies in which two measures are administered simultaneously or approximately at the same time,43 whereas in predictive validity,44 one measure occurs earlier and is meant to predict a later measure. When criterion validity can be established with an existing measure, the correlation should be 0.40 to 0.70; correlations approaching 1.0 indicate that the new questionnaire may be too similar to the gold standard measure and therefore redundant.
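The criterion validity check described above can be sketched as a correlation between scores on the new questionnaire and scores on the reference measure, compared against the 0.40 to 0.70 range. The text does not specify which correlation coefficient is used, so the Pearson coefficient here is an assumption; the function names and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def within_expected_range(r, low=0.40, high=0.70):
    """True if the correlation falls in the range quoted in the text."""
    return low <= r <= high


# Hypothetical scores from 5 patients on the new and the reference measure
new_scores = [10, 20, 30, 40, 50]
reference_scores = [30, 10, 25, 20, 40]

r = pearson_r(new_scores, reference_scores)
print(f"r = {r:.2f}, within 0.40 to 0.70: {within_expected_range(r)}")
```

A correlation well above the upper bound would not be an error in the calculation; as noted in the text, it would suggest that the new questionnaire adds little beyond the existing measure.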

Responsiveness is the ability of an instrument to detect change over time in the construct to be measured. An aspect of responsiveness is determining not only whether the measure detects (statistically significant) change but whether the change is meaningful to the patient. The minimally important difference (MID) is the smallest change in a PRO questionnaire score that would be considered meaningful or important to a patient.45 MIDs for a given PRO measure may vary across populations and contexts (e.g., conservative or surgical intervention), so the specific context in which the MID was established should be considered.46 Determining the MID is an iterative process that involves two methodologies: the anchor-based approach and the distribution-based approach.

The anchor-based approach involves using an external indicator, or anchor, to classify individuals into groups according to degree and direction of change. Through an appropriate anchor, individuals are classified as having experienced no change, small change (positive or negative), or large change (positive or negative). The MID is estimated as the mean difference in PRO score that is derived from patients in the small change groups. The most commonly used anchor is patient-reported global rating of change.

The distribution-based approach estimates the MID from the statistical distribution of the data, using analyses such as effect size, one-half standard deviation, and standard error of measurement. It is at best an indirect method of estimating MID and is typically used when the anchor-based approach is not possible. An important disadvantage of the distribution-based method is that it does not allow direct calculation of MID, but a standardized mean difference of about 0.5 (i.e., a half standard deviation) is likely to be at least the MID,47 which corresponds to the widely accepted criterion of a medium effect size.48 Ideally, MIDs are established using both anchor-based (with multiple clinical and patient-based anchors) and distribution-based methods. Nevertheless, it has been recommended that the anchor-based approach be used to produce primary evidence for the MID and the distribution-based approach be used to provide secondary or supportive evidence for that MID.46
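The two approaches to MID estimation described above can be sketched as follows. The anchor-based estimate is the mean change score in the small-change group; the distribution-based estimates shown are one-half of the baseline standard deviation and the standard error of measurement, computed with the standard formula SEM = SD × sqrt(1 − reliability). All function names, anchor labels, scores, and the reliability value are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev


def anchor_based_mid(change_scores, anchor_ratings, small_label):
    """Mean change in PRO score among patients whose anchor rating
    (e.g., a global rating of change) falls in the small-change group."""
    small = [c for c, a in zip(change_scores, anchor_ratings) if a == small_label]
    if not small:
        raise ValueError("no patients in the small-change group")
    return mean(small)


def half_sd(baseline_scores):
    """Distribution-based estimate: one-half of the baseline SD."""
    return 0.5 * stdev(baseline_scores)


def standard_error_of_measurement(baseline_scores, reliability):
    """Distribution-based estimate: SEM = SD * sqrt(1 - reliability)."""
    return stdev(baseline_scores) * sqrt(1 - reliability)


# Hypothetical data for 8 patients (lower score = fewer symptoms)
baseline = [40, 50, 60, 55, 45, 70, 30, 50]
change = [-12, -2, -25, -10, 0, -30, -14, 1]
anchor = [
    "small improvement", "no change", "large improvement",
    "small improvement", "no change", "large improvement",
    "small improvement", "no change",
]

mid_anchor = anchor_based_mid(change, anchor, "small improvement")
mid_half_sd = half_sd(baseline)
# 0.84 is an assumed test-retest reliability, not a value from the text
mid_sem = standard_error_of_measurement(baseline, reliability=0.84)

print(f"anchor-based MID: {mid_anchor:.1f}")
print(f"half-SD estimate: {mid_half_sd:.1f}")
print(f"SEM estimate:     {mid_sem:.1f}")
```

Consistent with the recommendation in the text, the anchor-based value would be treated as the primary estimate, with the two distribution-based values serving as supporting evidence.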

PRO questionnaires are often used in a number of different populations and settings, but these instruments and their psychometric properties may not necessarily be transferable. Linguistic and cultural adaptation of a questionnaire can occur during the development phase before validation, or it can be done after validation in its original language. Affirmation of a PRO instrument’s linguistic and cultural validity is important for its use in multinational clinical trials, as well as when data are pooled in a meta-analysis. Linguistic and cultural adaptation of a PRO instrument generally involves two forward translations, followed by quality control procedures such as backward translation into the original language, adjudication of all translated versions with discussion by an expert panel to ensure clarity of the translated questionnaire, and testing of the translated instrument in monolingual or bilingual patients to ensure it measures the same concepts as the original instrument.49

Questionnaire Types

PRO instruments are broadly divided into generic and condition-specific questionnaires. Generic questionnaires are multidimensional and are designed to apply to a broad range of populations because they tend to assess physical, social, and emotional dimensions of life. Because they do not focus on specific effects of the evaluated therapeutic approach, they lack sensitivity to measure clinically important changes in patients, but they do enable assessment of health gains beyond dimensions captured by condition-specific instruments. Generic questionnaires are applicable to the widest range of patients and can facilitate comparisons between disease and nondisease states as well as comparisons across different patient populations. Condition-specific questionnaires ask questions that are sensitive to changes in health status that are related to a given disease, disability, or surgery. By design, they are able to detect small changes in health or functional status and are therefore more precise at evaluating the efficacy of treatment specific to the target condition. In addition, questionnaires can be divided into five further categories, namely screening questionnaires; symptom questionnaires that measure the presence, intensity, discomfort, and impact of specific symptoms; quality-of-life questionnaires; sexual function questionnaires; and measures of the patient’s satisfaction, expectations, goal achievement, or work productivity. Clinicians often use a range of complementary questionnaires to fully capture different aspects of the patient’s experiences of PFD. Some of the more commonly used questionnaires are listed in Table 11.1, together with their respective MIDs.

Screening or Detection

Screeners, or screening questionnaires, which may be used to detect patients who might have POP or PFD before a clinical examination, have their origin in 1989, when the World Health Organization (WHO) conducted a meeting to develop specific questions about chronic obstetric morbidities.68 The final seven questions could identify 80% to 90% of moderate to severe vaginal prolapse.

  • Do you feel anything coming out of your vagina?

  • Do you have pain or difficulty in urinating?

  • Is it uncomfortable down below?

  • Do you have a feeling of heaviness?

  • Do you feel any swelling down below when you urinate or move your bowels?

  • Do you need to manipulate it to urinate or defecate?

  • Do you have any difficulty with intercourse?

A single screening question,69 “Do you have a bulge or something falling out that you can see or feel in your vaginal area?” has 96% sensitivity and 79% specificity for prolapse beyond the level of the hymen. The Epidemiology of Prolapse and Incontinence Questionnaire70 screens well for pelvic floor disorders, including prolapse, stress incontinence, OAB, and anal incontinence. Its positive and negative predictive values are 76% and 97%, respectively, for prolapse; 88% and 87% for stress incontinence; 77% and 90% for OAB; and 61% and 91% for anal incontinence. Nevertheless, these screening tools should not be misinterpreted as diagnostic tools even when their cutoff scores are met.
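One way to see why a positive screen is not a diagnosis is to convert sensitivity and specificity into predictive values, which depend on how common the condition is in the population screened. The sketch below applies Bayes' rule to the 96% sensitivity and 79% specificity quoted above; the 10% prevalence is an assumption made only for illustration and is not a figure from this chapter, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a screening test applied to a population
    with the given prevalence, using Bayes' rule on expected proportions."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity as quoted in the text; prevalence is assumed
ppv, npv = predictive_values(sensitivity=0.96, specificity=0.79, prevalence=0.10)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Under this assumed prevalence, most positive screens would be false positives while a negative screen would be highly reassuring, which is the usual profile of a tool suited to ruling a condition out before clinical examination rather than confirming it.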

Symptom Questionnaires

Symptom questionnaires generally seek to assess the presence, severity, and bother of a particular pelvic floor symptom or group of pelvic floor symptoms. Commonly used symptom questionnaires for specific LUT symptoms, those covering multiple domains including prolapse, and those for bowel symptoms are described in the following text.

Lower urinary tract patient-reported outcomes

A bladder diary is a record of patient-completed information regarding urinary and voiding habits. It is known as a frequency volume chart when it provides information on frequency of micturition and void volumes only. A bladder diary generally includes the information from the frequency volume chart as well as additional details, recorded prospectively, including urgency and incontinence episodes and the type and amount of fluid intake. Some diaries incorporate severity of urgency symptoms, the nature of events at the time of leakage (rushing to the toilet or physical exertion), and pad usage. A further unstandardized aspect of the bladder diary is its duration. In general, reproducibility improves as the duration of self-reporting increases, although patient compliance tends to decrease with longer diary duration. Reliability of a 24-hour diary is generally poor. In patients with OAB, a 7-day diary has better reproducibility than a 3-day diary, whereas a 3-day diary has similar reproducibility to a 7-day diary in patients with stress incontinence. The electronic diary has some advantages over the paper diary, including patient prompts or reminders and automated calculation of some parameters. Accurately completed bladder diaries contain quantifiable urinary symptoms, and they offer important PRO that are commonly reported in OAB drug trials.

May 1, 2023 | Posted by in GYNECOLOGY | Comments Off on Standardization of Terminology, Validated Questionnaires, and Outcome Assessments
